Create ScMags Object
Creating the ScMags object is very simple.
A data matrix and labels are mandatory; gene names are optional.
* In the data matrix, rows must correspond to cells and columns to genes. The matrix must be in one of three formats: numpy.ndarray, scipy.sparse.csr_matrix, or scipy.sparse.csc_matrix.
* Labels and gene names should be in numpy.ndarray format.
An example for the pollen dataset
The data matrix, labels, and gene names are stored in a folder named Pollen.
[1]:
%%bash
ls Pollen
Pollen_Data.csv
Pollen_Data_markers_res_ann.csv
Pollen_Data_markers_res_ind.csv
Pollen_Gene_Ann.csv
Pollen_Labels.csv
Let’s read the data and convert it to numpy.ndarray format
[2]:
import pandas as pd
# Transpose after loading so that rows are cells and columns are genes.
pollen_data = pd.read_csv('Pollen/Pollen_Data.csv', sep=',', header=0, index_col=0).to_numpy().T
pollen_labels = pd.read_csv('Pollen/Pollen_Labels.csv', sep=',', header=0, index_col=0).to_numpy()
gene_names = pd.read_csv('Pollen/Pollen_Gene_Ann.csv', sep=',', header=0, index_col=0).to_numpy()
# Flatten labels and gene names into one-dimensional arrays.
pollen_labels = pollen_labels.reshape(pollen_data.shape[0])
gene_names = gene_names.reshape(pollen_data.shape[1])
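The sizes of the data matrix, labels, and gene names must match: the number of labels must equal the number of rows (cells), and the number of gene names must equal the number of columns (genes).
In addition, labels and gene names must be one-dimensional arrays.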
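The requirements above can be checked before constructing the object. The following is a minimal sketch, not part of scmags: the helper name `check_inputs` and its error messages are hypothetical, and it only encodes the rules stated in this section.

```python
import numpy as np
from scipy import sparse


def check_inputs(data, labels, gene_names=None):
    """Hypothetical helper: raise ValueError if the inputs break the rules above."""
    # The data matrix must be a dense array or a CSR/CSC sparse matrix.
    if not isinstance(data, (np.ndarray, sparse.csr_matrix, sparse.csc_matrix)):
        raise ValueError("data must be numpy.ndarray, csr_matrix, or csc_matrix")
    if data.ndim != 2:
        raise ValueError("data must be two-dimensional (cells x genes)")
    n_cells, n_genes = data.shape

    # Labels: one-dimensional, one entry per cell (row).
    if not isinstance(labels, np.ndarray) or labels.ndim != 1:
        raise ValueError("labels must be a one-dimensional numpy.ndarray")
    if labels.shape[0] != n_cells:
        raise ValueError("number of labels must equal the number of rows")

    # Gene names are optional: one-dimensional, one entry per gene (column).
    if gene_names is not None:
        if not isinstance(gene_names, np.ndarray) or gene_names.ndim != 1:
            raise ValueError("gene names must be a one-dimensional numpy.ndarray")
        if gene_names.shape[0] != n_genes:
            raise ValueError("number of gene names must equal the number of columns")
    return n_cells, n_genes


# Toy inputs: 3 cells, 5 genes.
toy_data = np.zeros((3, 5))
toy_labels = np.array(["a", "b", "a"])
toy_genes = np.array(["g1", "g2", "g3", "g4", "g5"])
print(check_inputs(toy_data, toy_labels, toy_genes))
```

With the Pollen arrays loaded above, the same call would be `check_inputs(pollen_data, pollen_labels, gene_names)`.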
[3]:
print(pollen_data.shape)
print(type(pollen_data))
(301, 23730)
<class 'numpy.ndarray'>
[4]:
print(pollen_labels.shape)
print(type(pollen_labels))
(301,)
<class 'numpy.ndarray'>
[5]:
print(gene_names.shape)
print(type(gene_names))
(23730,)
<class 'numpy.ndarray'>
Now let’s create the ScMags object.
[6]:
import scmags as sm
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels, gene_ann=gene_names)
Then the desired operations can be performed.
[7]:
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting markers for each cluster
[7]:
| | Marker_1 | Marker_2 | Marker_3 |
|---|---|---|---|
| C_Hi_2338 | KRT86 | KRT83 | S100A9 |
| C_Hi_2339 | ELK2AP | CD27 | HLA-DQA2 |
| C_Hi_BJ | COL6A3 | DCN | GREM1 |
| C_Hi_GW16 | NNAT | CDO1 | SETBP1 |
| C_Hi_GW21 | GRIA2 | PLXNA4 | MAPT |
| C_Hi_HL60 | MPO | PRTN3 | PRG2 |
| C_Hi_K562 | HBG1 | GAGE4 | HBG2 |
| C_Hi_Kera | FGFBP1 | KRT6C | PTHLH |
| C_Hi_NPC | COL2A1 | LRP2 | COL9A1 |
| C_Hi_iPS | CRABP1 | LECT1 | LOC100505817 |
If gene names are not given, they are generated internally from the column indices.
[8]:
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels)
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting markers for each cluster
[8]:
| | Marker_1 | Marker_2 | Marker_3 |
|---|---|---|---|
| C_Hi_2338 | Gene_9590 | Gene_9587 | Gene_18197 |
| C_Hi_2339 | Gene_5452 | Gene_3074 | Gene_8163 |
| C_Hi_BJ | Gene_3870 | Gene_4560 | Gene_7639 |
| C_Hi_GW16 | Gene_14484 | Gene_3258 | Gene_18602 |
| C_Hi_GW21 | Gene_7646 | Gene_16255 | Gene_11768 |
| C_Hi_HL60 | Gene_13612 | Gene_16874 | Gene_16683 |
| C_Hi_K562 | Gene_7897 | Gene_7008 | Gene_7898 |
| C_Hi_Kera | Gene_6529 | Gene_9572 | Gene_17013 |
| C_Hi_NPC | Gene_3855 | Gene_11381 | Gene_3878 |
| C_Hi_iPS | Gene_4022 | Gene_9819 | Gene_10525 |
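The `Gene_<index>` pattern in the table can be reproduced with a short sketch. This is an illustration only: the function `default_gene_names` is hypothetical, and zero-based numbering is an assumption, since the output above does not show how scmags builds the names internally.

```python
import numpy as np


def default_gene_names(n_genes):
    """Hypothetical stand-in for the names used when gene_ann is omitted."""
    # Name each gene after its column position (assumed zero-based).
    return np.array([f"Gene_{i}" for i in range(n_genes)])


names = default_gene_names(5)
print(names)
# ['Gene_0' 'Gene_1' 'Gene_2' 'Gene_3' 'Gene_4']
```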
These generated names encode the indices of the genes in the data matrix. The indices themselves can be returned with ind_return=True.
[9]:
pollen.get_markers(ind_return=True)
[9]:
| | Marker_1 | Marker_2 | Marker_3 |
|---|---|---|---|
| C_Hi_2338 | 9590 | 9587 | 18197 |
| C_Hi_2339 | 5452 | 3074 | 8163 |
| C_Hi_BJ | 3870 | 4560 | 7639 |
| C_Hi_GW16 | 14484 | 3258 | 18602 |
| C_Hi_GW21 | 7646 | 16255 | 11768 |
| C_Hi_HL60 | 13612 | 16874 | 16683 |
| C_Hi_K562 | 7897 | 7008 | 7898 |
| C_Hi_Kera | 6529 | 9572 | 17013 |
| C_Hi_NPC | 3855 | 11381 | 3878 |
| C_Hi_iPS | 4022 | 9819 | 10525 |
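Returned indices can be mapped back to gene names with NumPy integer indexing. The sketch below uses toy data and assumes the indices are zero-based column positions in the data matrix. The arrays `toy_gene_names` and `toy_marker_idx` are made up for illustration. If the real result is a pandas DataFrame, as the table above suggests, converting it with `.to_numpy()` first would likely be needed.

```python
import numpy as np

# Toy gene-name array: position i holds the name of column i.
toy_gene_names = np.array(["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"])

# Toy marker indices: one row per cluster, one column per marker rank.
toy_marker_idx = np.array([[3, 0],
                           [4, 1]])

# Integer-array indexing keeps the (clusters x markers) layout.
toy_marker_names = toy_gene_names[toy_marker_idx]
print(toy_marker_names)
# [['GeneD' 'GeneA']
#  ['GeneE' 'GeneB']]
```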
The data matrix can also be a sparse matrix instead of a numpy.ndarray.
For example:
[10]:
from scipy import sparse
pollen_data = sparse.csr_matrix(pollen_data)
print(pollen_data.shape)
print(type(pollen_data))
pollen_data
(301, 23730)
<class 'scipy.sparse.csr.csr_matrix'>
[10]:
<301x23730 sparse matrix of type '<class 'numpy.float64'>'
with 2347117 stored elements in Compressed Sparse Row format>
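The output above reports 2347117 stored elements in a 301 x 23730 matrix, which is about 33% of all entries. The sketch below shows, on a toy matrix, how that fraction can be computed and how a CSR matrix converts to CSC, the other accepted sparse format. The toy values are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Toy dense matrix: 3 cells x 3 genes, mostly zeros.
toy_dense = np.array([[0.0, 2.0, 0.0],
                      [0.0, 0.0, 0.0],
                      [1.0, 0.0, 3.0]])

toy_csr = sparse.csr_matrix(toy_dense)  # row-compressed storage
toy_csc = toy_csr.tocsc()               # column-compressed storage

# Fraction of entries that are explicitly stored (non-zero here).
density = toy_csr.nnz / (toy_csr.shape[0] * toy_csr.shape[1])
print(toy_csr.nnz)        # 3
print(f"{density:.3f}")   # 0.333

# Both sparse formats hold the same values as the dense matrix.
print(np.array_equal(toy_csc.toarray(), toy_dense))  # True
```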
[11]:
pollen = sm.ScMags(data=pollen_data, labels=pollen_labels, gene_ann=gene_names)
pollen.filter_genes()
pollen.sel_clust_marker()
pollen.get_markers()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting markers for each cluster
[11]:
| | Marker_1 | Marker_2 | Marker_3 |
|---|---|---|---|
| C_Hi_2338 | KRT86 | KRT83 | S100A9 |
| C_Hi_2339 | ELK2AP | CD27 | HLA-DQA2 |
| C_Hi_BJ | COL6A3 | DCN | GREM1 |
| C_Hi_GW16 | NNAT | CDO1 | SETBP1 |
| C_Hi_GW21 | GRIA2 | PLXNA4 | MAPT |
| C_Hi_HL60 | MPO | PRTN3 | PRG2 |
| C_Hi_K562 | HBG1 | GAGE4 | HBG2 |
| C_Hi_Kera | FGFBP1 | KRT6C | PTHLH |
| C_Hi_NPC | COL2A1 | LRP2 | COL9A1 |
| C_Hi_iPS | CRABP1 | LECT1 | LOC100505817 |