Create ScMags Object

Creating the ScMags object is very simple.

The data matrix and labels are mandatory; gene names are optional.

* In the data matrix, rows must correspond to cells and columns to genes. The matrix must be in one of three formats: numpy.ndarray, scipy.sparse.csr_matrix, or scipy.sparse.csc_matrix.
* Labels and gene names should be in numpy.ndarray format.
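The input requirements above can be checked before constructing the object. The following is a minimal sketch, not part of the scmags API: the helper name `check_scmags_inputs` and the toy arrays are hypothetical, and the checks simply mirror the rules stated in this section.

```python
import numpy as np
from scipy import sparse


def check_scmags_inputs(data, labels, gene_ann=None):
    """Hypothetical pre-flight check; NOT a function provided by scmags."""
    allowed = (np.ndarray, sparse.csr_matrix, sparse.csc_matrix)
    if not isinstance(data, allowed):
        raise TypeError("data must be ndarray, csr_matrix, or csc_matrix")
    if data.ndim != 2:
        raise ValueError("data must be two-dimensional (cells x genes)")
    n_cells, n_genes = data.shape
    # Labels: one entry per cell (row), one-dimensional.
    if not isinstance(labels, np.ndarray) or labels.shape != (n_cells,):
        raise ValueError("labels must be a 1-D ndarray with one entry per cell")
    # Gene names are optional: one entry per gene (column), one-dimensional.
    if gene_ann is not None:
        if not isinstance(gene_ann, np.ndarray) or gene_ann.shape != (n_genes,):
            raise ValueError("gene_ann must be a 1-D ndarray with one entry per gene")
    return n_cells, n_genes


# Made-up toy inputs: 4 cells, 3 genes.
toy_data = np.zeros((4, 3))
toy_labels = np.array(["a", "a", "b", "b"])
toy_genes = np.array(["g1", "g2", "g3"])
shape = check_scmags_inputs(toy_data, toy_labels, toy_genes)
```

Running a check like this first tends to give clearer error messages than a shape mismatch surfacing later in the pipeline.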

An example for the pollen dataset

The data matrix, labels, and gene names are stored in a folder named Pollen, which can be placed in any location.

[1]:

%%bash
ls Pollen

Pollen_Data.csv
Pollen_Data_markers_res_ann.csv
Pollen_Data_markers_res_ind.csv
Pollen_Gene_Ann.csv
Pollen_Labels.csv

Let’s read the data and convert it to numpy.ndarray format

[2]:

import pandas as pd

# The data CSV is transposed after loading so that rows are cells and columns are genes.
pollen_data = pd.read_csv('Pollen/Pollen_Data.csv', sep=',', header=0, index_col=0).to_numpy().T
pollen_labels = pd.read_csv('Pollen/Pollen_Labels.csv', sep=',', header=0, index_col=0).to_numpy()
gene_names = pd.read_csv('Pollen/Pollen_Gene_Ann.csv', sep=',', header=0, index_col=0).to_numpy()

# Flatten labels and gene names into one-dimensional arrays.
pollen_labels = pollen_labels.reshape(pollen_data.shape[0])
gene_names = gene_names.reshape(pollen_data.shape[1])
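To make the transpose and reshape steps concrete, here is a small self-contained sketch on made-up data. It assumes the data CSV stores genes in rows and cells in columns, which the `.T` in the cell above suggests but the source does not state outright. All `toy_*` names and values are hypothetical.

```python
import io

import pandas as pd

# Toy stand-in for a genes x cells CSV (3 genes, 2 cells); values are made up.
toy_csv = io.StringIO(",cell_1,cell_2\ngene_a,1,0\ngene_b,0,5\ngene_c,2,3\n")
toy_frame = pd.read_csv(toy_csv, sep=",", header=0, index_col=0)

# Transpose so that rows are cells and columns are genes.
toy_matrix = toy_frame.to_numpy().T

# A single-column label file loads as a 2-D array of shape (n_cells, 1).
toy_labels_csv = io.StringIO(",label\ncell_1,A\ncell_2,B\n")
toy_labels_2d = pd.read_csv(toy_labels_csv, sep=",", header=0, index_col=0).to_numpy()

# Reshape flattens it to the one-dimensional form, one label per cell.
toy_labels = toy_labels_2d.reshape(toy_matrix.shape[0])
```

Here `toy_matrix` has shape `(2, 3)`, `toy_labels_2d` has shape `(2, 1)`, and `toy_labels` has shape `(2,)`.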

The sizes of the data matrix, labels, and gene names must match: there must be one label per row (cell) and one gene name per column (gene).
In addition, labels and gene names must be one-dimensional arrays.

[3]:

print(pollen_data.shape)
print(type(pollen_data))

(301, 23730)
<class 'numpy.ndarray'>

[4]:

print(pollen_labels.shape)
print(type(pollen_labels))

(301,)
<class 'numpy.ndarray'>

[5]:

print(gene_names.shape)
print(type(gene_names))

(23730,)
<class 'numpy.ndarray'>

Now let’s create the ScMags object

[6]:

import scmags as sm
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels, gene_ann=gene_names)

Then the desired operations can be performed.

[7]:

pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()

-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster

[7]:

	Marker_1	Marker_2	Marker_3
C_Hi_2338	KRT86	KRT83	S100A9
C_Hi_2339	ELK2AP	CD27	HLA-DQA2
C_Hi_BJ	COL6A3	DCN	GREM1
C_Hi_GW16	NNAT	CDO1	SETBP1
C_Hi_GW21	GRIA2	PLXNA4	MAPT
C_Hi_HL60	MPO	PRTN3	PRG2
C_Hi_K562	HBG1	GAGE4	HBG2
C_Hi_Kera	FGFBP1	KRT6C	PTHLH
C_Hi_NPC	COL2A1	LRP2	COL9A1
C_Hi_iPS	CRABP1	LECT1	LOC100505817

If gene names are not given, they are generated internally from the column indices of the data matrix.

[8]:

pollen = sm.ScMags(data=pollen_data, labels=pollen_labels)
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()

-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster

[8]:

	Marker_1	Marker_2	Marker_3
C_Hi_2338	Gene_9590	Gene_9587	Gene_18197
C_Hi_2339	Gene_5452	Gene_3074	Gene_8163
C_Hi_BJ	Gene_3870	Gene_4560	Gene_7639
C_Hi_GW16	Gene_14484	Gene_3258	Gene_18602
C_Hi_GW21	Gene_7646	Gene_16255	Gene_11768
C_Hi_HL60	Gene_13612	Gene_16874	Gene_16683
C_Hi_K562	Gene_7897	Gene_7008	Gene_7898
C_Hi_Kera	Gene_6529	Gene_9572	Gene_17013
C_Hi_NPC	Gene_3855	Gene_11381	Gene_3878
C_Hi_iPS	Gene_4022	Gene_9819	Gene_10525

The numeric part of each generated name is the index of that gene in the data matrix. Passing ind_return=True returns these indices directly.

[9]:

pollen.get_markers(ind_return=True)

[9]:

	Marker_1	Marker_2	Marker_3
C_Hi_2338	9590	9587	18197
C_Hi_2339	5452	3074	8163
C_Hi_BJ	3870	4560	7639
C_Hi_GW16	14484	3258	18602
C_Hi_GW21	7646	16255	11768
C_Hi_HL60	13612	16874	16683
C_Hi_K562	7897	7008	7898
C_Hi_Kera	6529	9572	17013
C_Hi_NPC	3855	11381	3878
C_Hi_iPS	4022	9819	10525
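Because the generated names encode positions, a name such as Gene_9590 can be mapped back to a column of the data matrix. The sketch below is illustrative only: `gene_name_to_index` is a hypothetical helper, not part of scmags, and it assumes the indices are zero-based column positions, which the source does not state explicitly.

```python
import numpy as np


def gene_name_to_index(name, prefix="Gene_"):
    """Hypothetical helper: convert 'Gene_9590' to the integer 9590."""
    if not name.startswith(prefix):
        raise ValueError(f"expected a name starting with {prefix!r}, got {name!r}")
    return int(name[len(prefix):])


# Made-up toy matrix: 3 cells x 4 genes.
toy_data = np.arange(12).reshape(3, 4)

idx = gene_name_to_index("Gene_2")
# Expression of that gene across all cells (assuming zero-based columns).
column = toy_data[:, idx]
```

With the toy matrix above, `idx` is 2 and `column` holds the third column, `[2, 6, 10]`.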

The data matrix can also be a sparse matrix instead of a numpy.ndarray. For example:

[10]:

from scipy import sparse
pollen_data = sparse.csr_matrix(pollen_data)
print(pollen_data.shape)
print(type(pollen_data))
pollen_data

(301, 23730)
<class 'scipy.sparse.csr.csr_matrix'>

[10]:

<301x23730 sparse matrix of type '<class 'numpy.float64'>'
        with 2347117 stored elements in Compressed Sparse Row format>
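In the output above, 2347117 of the 301 × 23730 = 7142730 entries are stored, so roughly a third of the matrix is non-zero. The sketch below shows the same conversion and density calculation on a made-up toy matrix, and also converts to CSC, the other accepted sparse format. The variable names are illustrative.

```python
import numpy as np
from scipy import sparse

# Made-up dense matrix: 2 cells x 3 genes, mostly zeros.
dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0]])

csr = sparse.csr_matrix(dense)  # Compressed Sparse Row
csc = csr.tocsc()               # Compressed Sparse Column

# Fraction of entries that are explicitly stored (non-zero here).
density = csr.nnz / (csr.shape[0] * csr.shape[1])

# Converting back to dense recovers the original values.
roundtrip = csc.toarray()
```

Both formats hold the same values; they differ only in storage layout, so either can be used wherever a sparse data matrix is accepted.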

[11]:

pollen = sm.ScMags(data=pollen_data, labels=pollen_labels, gene_ann=gene_names)
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()

-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster

[11]:

	Marker_1	Marker_2	Marker_3
C_Hi_2338	KRT86	KRT83	S100A9
C_Hi_2339	ELK2AP	CD27	HLA-DQA2
C_Hi_BJ	COL6A3	DCN	GREM1
C_Hi_GW16	NNAT	CDO1	SETBP1
C_Hi_GW21	GRIA2	PLXNA4	MAPT
C_Hi_HL60	MPO	PRTN3	PRG2
C_Hi_K562	HBG1	GAGE4	HBG2
C_Hi_Kera	FGFBP1	KRT6C	PTHLH
C_Hi_NPC	COL2A1	LRP2	COL9A1
C_Hi_iPS	CRABP1	LECT1	LOC100505817