Create ScMags Object

Creating the ScMags object is simple. The data matrix and labels are mandatory; gene names are optional.
  • In the data matrix, rows must correspond to cells and columns to genes, and the matrix must be in one of three formats: numpy.ndarray, scipy.sparse.csr_matrix, or scipy.sparse.csc_matrix.
  • Labels and gene names should be in numpy.ndarray format.
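
The input requirements listed above can be checked before the object is built. The following is a minimal sketch, not part of the scmags package: the helper name `check_scmags_inputs` and its error messages are hypothetical, and it only encodes the rules stated in this section (accepted matrix types, one-dimensional labels and gene names, matching sizes).

```python
import numpy as np
from scipy import sparse


def check_scmags_inputs(data, labels, gene_ann=None):
    """Hypothetical pre-flight check; not part of the scmags API."""
    # The data matrix must be dense, CSR, or CSC.
    allowed = (np.ndarray, sparse.csr_matrix, sparse.csc_matrix)
    if not isinstance(data, allowed):
        raise TypeError("data must be numpy.ndarray, csr_matrix, or csc_matrix")
    if len(data.shape) != 2:
        raise ValueError("data must be two-dimensional (cells x genes)")
    n_cells, n_genes = data.shape

    # Labels: one entry per cell (row), stored in a 1-D numpy array.
    if not isinstance(labels, np.ndarray) or labels.ndim != 1:
        raise ValueError("labels must be a one-dimensional numpy.ndarray")
    if labels.shape[0] != n_cells:
        raise ValueError("number of labels must equal the number of rows")

    # Gene names are optional: one entry per gene (column) when given.
    if gene_ann is not None:
        if not isinstance(gene_ann, np.ndarray) or gene_ann.ndim != 1:
            raise ValueError("gene_ann must be a one-dimensional numpy.ndarray")
        if gene_ann.shape[0] != n_genes:
            raise ValueError("number of gene names must equal the number of columns")
    return n_cells, n_genes


# Toy inputs: 4 cells, 3 genes.
toy_data = np.zeros((4, 3))
toy_labels = np.array(["a", "a", "b", "b"])
toy_genes = np.array(["g1", "g2", "g3"])
print(check_scmags_inputs(toy_data, toy_labels, toy_genes))  # (4, 3)
```

Running such a check first gives a clear error message for a shape or type mismatch, instead of a failure further along in the analysis.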

An example with the Pollen dataset
The data matrix, labels, and gene names are stored in a folder named Pollen, which can be in any location.
[1]:
%%bash
ls Pollen
Pollen_Data.csv
Pollen_Data_markers_res_ann.csv
Pollen_Data_markers_res_ind.csv
Pollen_Gene_Ann.csv
Pollen_Labels.csv

Let’s read the data and convert it to numpy.ndarray format.

[2]:
import pandas as pd

# Transpose the data matrix so that rows are cells and columns are genes
pollen_data = pd.read_csv('Pollen/Pollen_Data.csv', sep=',', header=0, index_col=0).to_numpy().T
pollen_labels = pd.read_csv('Pollen/Pollen_Labels.csv', sep=',', header=0, index_col=0).to_numpy()
gene_names = pd.read_csv('Pollen/Pollen_Gene_Ann.csv', sep=',', header=0, index_col=0).to_numpy()

# Flatten labels and gene names to one-dimensional arrays
pollen_labels = pollen_labels.reshape(pollen_data.shape[0])
gene_names = gene_names.reshape(pollen_data.shape[1])
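  • The number of labels must match the number of rows (cells) in the data matrix, and the number of gene names must match the number of columns (genes).

  • In addition, labels and gene names must be one-dimensional arrays.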
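
The one-dimensional requirement matters because a single-column CSV read with pandas and converted with `to_numpy()` comes back as a two-dimensional array of shape (n, 1). The sketch below illustrates this with a small in-memory CSV; the column names and values are made up for illustration and do not come from the Pollen files.

```python
import io

import pandas as pd

# Toy stand-in for a labels file: an index column plus one value column.
csv_text = "cell,label\nc1,A\nc2,A\nc3,B\n"

labels_2d = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, index_col=0).to_numpy()
print(labels_2d.shape)  # (3, 1)

# Flatten to the one-dimensional layout described above.
labels_1d = labels_2d.reshape(-1)
print(labels_1d.shape)  # (3,)
print(labels_1d.ndim)   # 1
```

Using `reshape(-1)` lets NumPy infer the length, whereas passing the expected length explicitly, as in the cell above, also fails early if the file does not contain the expected number of entries.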

[3]:
print(pollen_data.shape)
print(type(pollen_data))
(301, 23730)
<class 'numpy.ndarray'>
[4]:
print(pollen_labels.shape)
print(type(pollen_labels))
(301,)
<class 'numpy.ndarray'>
[5]:
print(gene_names.shape)
print(type(gene_names))
(23730,)
<class 'numpy.ndarray'>

Now let’s create the ScMags object.

[6]:
import scmags as sm
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels, gene_ann=gene_names)

Then the desired operations can be performed.

[7]:
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster
[7]:
           Marker_1  Marker_2  Marker_3
C_Hi_2338  KRT86     KRT83     S100A9
C_Hi_2339  ELK2AP    CD27      HLA-DQA2
C_Hi_BJ    COL6A3    DCN       GREM1
C_Hi_GW16  NNAT      CDO1      SETBP1
C_Hi_GW21  GRIA2     PLXNA4    MAPT
C_Hi_HL60  MPO       PRTN3     PRG2
C_Hi_K562  HBG1      GAGE4     HBG2
C_Hi_Kera  FGFBP1    KRT6C     PTHLH
C_Hi_NPC   COL2A1    LRP2      COL9A1
C_Hi_iPS   CRABP1    LECT1     LOC100505817
  • If gene names are not given, they are generated internally from the column indices.

[8]:
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels)
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster
[8]:
           Marker_1    Marker_2    Marker_3
C_Hi_2338  Gene_9590   Gene_9587   Gene_18197
C_Hi_2339  Gene_5452   Gene_3074   Gene_8163
C_Hi_BJ    Gene_3870   Gene_4560   Gene_7639
C_Hi_GW16  Gene_14484  Gene_3258   Gene_18602
C_Hi_GW21  Gene_7646   Gene_16255  Gene_11768
C_Hi_HL60  Gene_13612  Gene_16874  Gene_16683
C_Hi_K562  Gene_7897   Gene_7008   Gene_7898
C_Hi_Kera  Gene_6529   Gene_9572   Gene_17013
C_Hi_NPC   Gene_3855   Gene_11381  Gene_3878
C_Hi_iPS   Gene_4022   Gene_9819   Gene_10525
  • The number in each generated name is the index of that gene in the data matrix. Passing ind_return=True returns the indices directly.

[9]:
pollen.get_markers(ind_return=True)
[9]:
           Marker_1  Marker_2  Marker_3
C_Hi_2338  9590      9587      18197
C_Hi_2339  5452      3074      8163
C_Hi_BJ    3870      4560      7639
C_Hi_GW16  14484     3258      18602
C_Hi_GW21  7646      16255     11768
C_Hi_HL60  13612     16874     16683
C_Hi_K562  7897      7008      7898
C_Hi_Kera  6529      9572      17013
C_Hi_NPC   3855      11381     3878
C_Hi_iPS   4022      9819      10525
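
The relationship between generated names and indices can be sketched as follows. This is an illustration only: the naming scheme is inferred from the "Gene_<index>" names in the output above, zero-based indexing is an assumption, and `name_to_index` is a hypothetical helper rather than a scmags function.

```python
import numpy as np

n_genes = 5  # toy value; the Pollen matrix above has 23730 columns

# Assumed scheme: one name per column, built from the column index.
default_names = np.array([f"Gene_{i}" for i in range(n_genes)])


def name_to_index(name):
    """Hypothetical helper: recover the column index from a generated name."""
    return int(name.split("_", 1)[1])


print(name_to_index("Gene_9590"))  # 9590

# Indices can also be used to look up real names in a gene name array.
toy_gene_names = np.array(["geneA", "geneB", "geneC", "geneD", "geneE"])
marker_idx = np.array([3, 0])
print(list(toy_gene_names[marker_idx]))  # ['geneD', 'geneA']
```

Working with indices is convenient when the marker columns need to be sliced out of the data matrix directly.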

The data matrix can also be a sparse matrix instead of a numpy.ndarray. For example:

[10]:
from scipy import sparse
pollen_data = sparse.csr_matrix(pollen_data)
print(pollen_data.shape)
print(type(pollen_data))
pollen_data
(301, 23730)
<class 'scipy.sparse.csr.csr_matrix'>
[10]:
<301x23730 sparse matrix of type '<class 'numpy.float64'>'
        with 2347117 stored elements in Compressed Sparse Row format>
[11]:
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels, gene_ann=gene_names)
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster
[11]:
           Marker_1  Marker_2  Marker_3
C_Hi_2338  KRT86     KRT83     S100A9
C_Hi_2339  ELK2AP    CD27      HLA-DQA2
C_Hi_BJ    COL6A3    DCN       GREM1
C_Hi_GW16  NNAT      CDO1      SETBP1
C_Hi_GW21  GRIA2     PLXNA4    MAPT
C_Hi_HL60  MPO       PRTN3     PRG2
C_Hi_K562  HBG1      GAGE4     HBG2
C_Hi_Kera  FGFBP1    KRT6C     PTHLH
C_Hi_NPC   COL2A1    LRP2      COL9A1
C_Hi_iPS   CRABP1    LECT1     LOC100505817
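
The sparse and dense inputs above yield the same markers because both formats hold the same values. As a small sketch on toy data (not the Pollen matrix), the conversion to either accepted sparse format can be checked for a lossless round trip, and the fraction of stored entries shows how much a sparse layout can save.

```python
import numpy as np
from scipy import sparse

# Toy expression matrix: 4 cells (rows) x 3 genes (columns), mostly zeros.
dense = np.array([[0.0, 2.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 3.0],
                  [0.0, 0.0, 0.0]])

csr = sparse.csr_matrix(dense)  # Compressed Sparse Row
csc = sparse.csc_matrix(dense)  # Compressed Sparse Column

print(csr.shape)  # (4, 3)
print(csr.nnz)    # 3

# Fraction of entries that are explicitly stored.
density = csr.nnz / (csr.shape[0] * csr.shape[1])
print(density)    # 0.25

# Both formats convert back to the original dense values.
print(np.array_equal(csr.toarray(), dense))  # True
print(np.array_equal(csc.toarray(), dense))  # True
```

By the same calculation, the 2347117 stored elements reported for the 301 x 23730 Pollen matrix correspond to about a third of all entries.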