Select Markers
This tutorial shows how to select markers with the scmags package, using the baron_h1 dataset that ships with it.
First, import the package.
[1]:
import scmags as mg
Next, load the dataset.
[2]:
baron_h1 = mg.datasets.baron_h1()
Filter Genes
First, redundant genes need to be filtered out for computational efficiency.
[3]:
baron_h1.filter_genes()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
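You can retrieve the candidate genes that remain for each cluster after filtering. The first two clusters are shown here.
[4]: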
rem_genes = baron_h1.get_filter_genes()
dict(list(rem_genes.items())[0:2])
[4]:
{'acinar': ['ALB',
'ALDOB',
'CEL',
'CELA2A',
'CUZD1',
'GP2',
'KLK1',
'PDIA2',
'PNLIPRP1',
'PNLIPRP2'],
'activated_stellate': ['ADAMTS12',
'COL6A3',
'CRLF1',
'FBN1',
'FMOD',
'GGT5',
'LAMC3',
'SFRP2',
'THBS2',
'VCAN']}
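The result can be inspected with ordinary Python, since it is keyed by cluster name. A minimal sketch, using a hand-copied literal of the two entries shown above in place of a live call; the names `candidates`, `counts` and `clusters_containing` are illustrative and not part of scmags:

```python
# Hand-copied from the output above; in a live session this would be
# the object returned by baron_h1.get_filter_genes().
candidates = {
    "acinar": ["ALB", "ALDOB", "CEL", "CELA2A", "CUZD1",
               "GP2", "KLK1", "PDIA2", "PNLIPRP1", "PNLIPRP2"],
    "activated_stellate": ["ADAMTS12", "COL6A3", "CRLF1", "FBN1", "FMOD",
                           "GGT5", "LAMC3", "SFRP2", "THBS2", "VCAN"],
}


def clusters_containing(gene, candidates):
    """Return the clusters whose candidate list includes `gene` (hypothetical helper)."""
    return [cluster for cluster, genes in candidates.items() if gene in genes]


# Number of candidate genes kept for each cluster.
counts = {cluster: len(genes) for cluster, genes in candidates.items()}
print(counts)
print(clusters_containing("SFRP2", candidates))
```

The same pattern applies to the full dictionary with all 14 clusters.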
To get the corresponding indices in the data matrix instead of gene names, pass ind_return=True.
[5]:
baron_h1.get_filter_genes(ind_return=True)
[5]:
{'acinar': array([ 550, 572, 3055, 3057, 4062, 6872, 8968, 12577, 13082,
13083]),
'activated_stellate': array([ 276, 3637, 3836, 5878, 6179, 6624, 9231, 15293, 17233,
18923]),
'alpha': array([ 2390, 6534, 7966, 8442, 8713, 10734, 11417, 12416, 14361,
15835]),
'beta': array([ 68, 322, 1974, 4675, 5342, 6383, 7240, 11265, 13322,
15686]),
'delta': array([ 1449, 4516, 6407, 6543, 9345, 10318, 12504, 14210, 16021,
16794]),
'ductal': array([ 3383, 4203, 9048, 9077, 10397, 13242, 13639, 15205, 16828,
17311]),
'endothelial': array([ 357, 3402, 4968, 5360, 6159, 8756, 12396, 14586, 16167,
19048]),
'epsilon': array([ 6027, 6299, 6358, 6637, 9723, 11320, 12939, 15274, 16388,
19024]),
'gamma': array([ 932, 6357, 9325, 10138, 11153, 12375, 13822, 16601, 18319,
18672]),
'macrophage': array([ 1971, 1973, 3888, 4127, 7307, 7541, 8490, 12935, 15095,
15487]),
'mast': array([ 63, 252, 2020, 2816, 3734, 6502, 8896, 14929, 17869,
17870]),
'quiescent_stellate': array([ 315, 372, 3930, 4987, 6668, 8266, 8633, 11057, 14362,
19198]),
'schwann': array([ 1728, 3859, 4579, 6356, 6911, 7961, 8585, 15146, 16160,
16509]),
't_cell': array([ 2729, 2842, 2860, 6503, 8385, 12389, 16339, 17887, 18058,
19329])}
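If you need to move between gene names and indices, the two outputs can be paired up. The sketch below assumes that both calls list the genes of a cluster in the same order; that is an assumption, although the marker tables later in this tutorial pair CEL with 3055 and PNLIPRP1 with 13082, which is consistent with it. The variable names are illustrative.

```python
# Hand-copied from the two outputs above for the 'acinar' cluster.
acinar_genes = ["ALB", "ALDOB", "CEL", "CELA2A", "CUZD1",
                "GP2", "KLK1", "PDIA2", "PNLIPRP1", "PNLIPRP2"]
acinar_idx = [550, 572, 3055, 3057, 4062, 6872, 8968, 12577, 13082, 13083]

# Assumption: names and indices are aligned position by position.
gene_to_idx = dict(zip(acinar_genes, acinar_idx))
idx_to_gene = {idx: gene for gene, idx in gene_to_idx.items()}

print(gene_to_idx["CEL"])
print(idx_to_gene[13082])
```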
If you have not set an intra-cluster expression rate threshold, you can also view the automatically determined thresholds.
[6]:
baron_h1.get_filt_cluster_thresholds
[6]:
{'acinar': 0.6545454561710358,
'activated_stellate': 0.7058823555707932,
'alpha': 0.6525423675775528,
'beta': 0.6416284441947937,
'delta': 0.6355140209197998,
'ductal': 0.6791666597127914,
'endothelial': 0.6192307695746422,
'epsilon': 0.6538461595773697,
'gamma': 0.6285714358091354,
'macrophage': 0.6785714328289032,
'mast': 0.625,
'quiescent_stellate': 0.614130437374115,
'schwann': 0.7000000029802322,
't_cell': 0.75}
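Before choosing a threshold by hand, it can help to summarise the automatic ones. A minimal sketch in plain Python, using a hand-copied literal of the output above; the variable names are illustrative:

```python
# Hand-copied from the output above.
thresholds = {
    "acinar": 0.6545454561710358,
    "activated_stellate": 0.7058823555707932,
    "alpha": 0.6525423675775528,
    "beta": 0.6416284441947937,
    "delta": 0.6355140209197998,
    "ductal": 0.6791666597127914,
    "endothelial": 0.6192307695746422,
    "epsilon": 0.6538461595773697,
    "gamma": 0.6285714358091354,
    "macrophage": 0.6785714328289032,
    "mast": 0.625,
    "quiescent_stellate": 0.614130437374115,
    "schwann": 0.7000000029802322,
    "t_cell": 0.75,
}

# Clusters with the lowest and highest automatic threshold.
lowest = min(thresholds, key=thresholds.get)
highest = max(thresholds, key=thresholds.get)
mean_thr = sum(thresholds.values()) / len(thresholds)

# Clusters whose automatic threshold is already at least 0.7.
strict = sorted(c for c, t in thresholds.items() if t >= 0.7)

print(lowest, highest, round(mean_thr, 3))
print(strict)
```

Only three clusters reach 0.7 on their own, so a fixed threshold of 0.7 is stricter than the automatic value for most clusters.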
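To use a fixed threshold instead, pass it with the in_cls_thres parameter. The same value is then applied to every cluster.
[7]: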
baron_h1.filter_genes(in_cls_thres=0.7)
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
[8]:
baron_h1.get_filt_cluster_thresholds
[8]:
{'acinar': 0.7,
'activated_stellate': 0.7,
'alpha': 0.7,
'beta': 0.7,
'delta': 0.7,
'ductal': 0.7,
'endothelial': 0.7,
'epsilon': 0.7,
'gamma': 0.7,
'macrophage': 0.7,
'mast': 0.7,
'quiescent_stellate': 0.7,
'schwann': 0.7,
't_cell': 0.7}
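The outputs above contain 10 candidate genes per cluster. You can change this number with the nof_sel parameter.
[9]: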
baron_h1.filter_genes(nof_sel=20)
baron_h1.get_filter_genes(ind_return=True)
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
[9]:
{'acinar': array([ 550, 572, 799, 3055, 3057, 3733, 4020, 4062, 6872,
7103, 7104, 8968, 10655, 10732, 12577, 13081, 13082, 13083,
13620, 16742]),
'activated_stellate': array([ 276, 405, 1990, 3632, 3633, 3637, 3836, 5878, 6179,
6182, 6624, 9231, 9504, 9691, 10387, 15293, 17233, 17850,
18923, 19025]),
'alpha': array([ 2390, 3235, 6534, 6730, 7966, 8442, 8713, 10734, 11417,
12291, 12339, 12416, 12508, 12637, 14361, 15275, 15835, 15991,
17366, 18944]),
'beta': array([ 68, 322, 1974, 2995, 3533, 4675, 4993, 5342, 6383,
6420, 7240, 10627, 11265, 12321, 13322, 13959, 14098, 14999,
15686, 18304]),
'delta': array([ 959, 1449, 1557, 2353, 4516, 6407, 6543, 9345, 10068,
10318, 11279, 12285, 12504, 12506, 14210, 15098, 16021, 16729,
16794, 17839]),
'ductal': array([ 458, 809, 3383, 3401, 4203, 4835, 5129, 6855, 8818,
9048, 9077, 9840, 10397, 13242, 13639, 15205, 15912, 16828,
17129, 17311]),
'endothelial': array([ 357, 2879, 3402, 4968, 5356, 5360, 6159, 7892, 8756,
12361, 12396, 12627, 13041, 13115, 14586, 14892, 16167, 16564,
17277, 19048]),
'epsilon': array([ 190, 878, 2383, 2979, 6027, 6212, 6299, 6358, 6637,
9723, 11148, 11320, 11473, 12939, 13229, 13895, 15211, 15274,
16388, 19024]),
'gamma': array([ 932, 2159, 3236, 6357, 7869, 9325, 10138, 11153, 11391,
12296, 12375, 12507, 12630, 13386, 13822, 14998, 16601, 17665,
18319, 18672]),
'macrophage': array([ 482, 1971, 1973, 2014, 3888, 4127, 7307, 7535, 7536,
7541, 7545, 7547, 8490, 11002, 12935, 15095, 15160, 15487,
15841, 16307]),
'mast': array([ 63, 252, 290, 1026, 2020, 2816, 3734, 6502, 7308,
8896, 9696, 13773, 14118, 14396, 14929, 15163, 15544, 15773,
17869, 17870]),
'quiescent_stellate': array([ 274, 287, 315, 372, 1713, 1823, 3930, 4987, 6668,
7906, 8247, 8266, 8469, 8633, 8922, 11057, 11455, 14362,
17186, 19198]),
'schwann': array([ 1728, 3859, 4579, 4988, 5199, 5551, 6356, 6911, 7961,
8585, 10122, 10469, 11171, 11492, 12598, 13240, 14221, 15146,
16160, 16509]),
't_cell': array([ 2729, 2840, 2842, 2854, 2860, 5280, 5692, 6503, 8385,
9730, 11167, 12389, 13946, 14003, 14086, 16339, 17887, 18058,
19329, 19436])}
Select Markers
After the filtering process, you can select the markers.
[10]:
baron_h1.sel_clust_marker()
-> Selecting markers for each cluster
You can view the selected markers as follows.
[11]:
baron_h1.get_markers()
[11]:
| | Marker_1 | Marker_2 | Marker_3 |
---|---|---|---|
C_acinar | PNLIPRP1 | CEL | CPA2 |
C_activated_stellate | CRLF1 | SFRP2 | COL6A3 |
C_alpha | IRX2 | GC | NPNT |
C_beta | ADCYAP1 | HADH | G6PC2 |
C_delta | LEPR | MIR7.3HG | PCP4 |
C_ductal | MMP7 | KRT19 | TACSTD2 |
C_endothelial | PLVAP | PECAM1 | CD93 |
C_epsilon | GHRL | FRZB | NNMT |
C_gamma | PPY | AQP3 | STMN2 |
C_macrophage | PLA2G7 | C1QC | ITGB2 |
C_mast | TPSB2 | TPSAB1 | CPA3 |
C_quiescent_stellate | RGS5 | NDUFA4L2 | EDNRA |
C_schwann | SOX10 | SEMA3C | GPM6B |
C_t_cell | CD3D | ZAP70 | TRAC |
Or you can see the corresponding indexes in the data matrix.
[12]:
baron_h1.get_markers(ind_return=True)
[12]:
| | Marker_1 | Marker_2 | Marker_3 |
---|---|---|---|
C_acinar | 13082 | 3055 | 3733 |
C_activated_stellate | 3836 | 15293 | 3637 |
C_alpha | 8442 | 6534 | 11417 |
C_beta | 322 | 7240 | 6383 |
C_delta | 9345 | 10318 | 12504 |
C_ductal | 10397 | 9048 | 16828 |
C_endothelial | 13041 | 12627 | 2879 |
C_epsilon | 6637 | 6299 | 11320 |
C_gamma | 13386 | 932 | 16601 |
C_macrophage | 12935 | 1973 | 8490 |
C_mast | 17870 | 17869 | 3734 |
C_quiescent_stellate | 14362 | 11057 | 4987 |
C_schwann | 16160 | 15146 | 6911 |
C_t_cell | 2842 | 19329 | 17887 |
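Since markers are chosen after filtering, each index in this table should also appear among the candidate indices from the most recent filter_genes(nof_sel=20) call. The sketch below checks this for two clusters with hand-copied values; the variable names are illustrative, and the check is a sanity test on the printed outputs rather than a scmags feature.

```python
# Candidate indices for two clusters, hand-copied from the
# filter_genes(nof_sel=20) output earlier in this tutorial.
candidate_idx = {
    "acinar": [550, 572, 799, 3055, 3057, 3733, 4020, 4062, 6872, 7103,
               7104, 8968, 10655, 10732, 12577, 13081, 13082, 13083,
               13620, 16742],
    "activated_stellate": [276, 405, 1990, 3632, 3633, 3637, 3836, 5878,
                           6179, 6182, 6624, 9231, 9504, 9691, 10387,
                           15293, 17233, 17850, 18923, 19025],
}

# Selected marker indices from the table above (the "C_" prefix is dropped).
marker_idx = {
    "acinar": [13082, 3055, 3733],
    "activated_stellate": [3836, 15293, 3637],
}

# For each cluster, list any marker index that is not in its candidate pool.
missing = {
    cluster: sorted(set(markers) - set(candidate_idx[cluster]))
    for cluster, markers in marker_idx.items()
}
print(missing)
```

An empty list for a cluster means all of its markers come from its candidate pool.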
You can also extract the expression values of the selected markers from the data matrix.
[13]:
mark_data = baron_h1.get_marker_data()
mark_data
[13]:
{'acinar': array([[4.9068906 , 3.5849625 , 6.89481776],
[2.80735492, 3.169925 , 6.82017896],
[4.39231742, 4.70043972, 5.4918531 ],
...,
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ]]),
'activated_stellate': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
'alpha': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
'beta': array([[0., 2., 0.],
[0., 1., 0.],
[0., 2., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
'delta': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 1., 1.],
[0., 0., 0.]]),
'ductal': array([[0. , 0. , 0. ],
[0. , 2.80735492, 4.24792751],
[0. , 0. , 2.5849625 ],
...,
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ]]),
'endothelial': array([[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
...,
[3.169925, 2. , 3. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ]]),
'epsilon': array([[0., 0., 0.],
[0., 0., 0.],
[1., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
'gamma': array([[2.32192809, 0. , 0. ],
[1. , 1.5849625 , 0. ],
[1.5849625 , 0. , 0. ],
...,
[0. , 0. , 0. ],
[0. , 0. , 1. ],
[2. , 0. , 0. ]]),
'macrophage': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
'mast': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
'quiescent_stellate': array([[0. , 0. , 0. ],
[0. , 0. , 0. ],
[0. , 0. , 0. ],
...,
[0. , 0. , 0. ],
[0. , 0. , 0. ],
[2.32192809, 1.5849625 , 1. ]]),
'schwann': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]]),
't_cell': array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])}
By default, 3 markers are selected for each cluster. You can access the marker data for each cluster with the dictionary keys.
[14]:
schwann = mark_data['schwann']
schwann
[14]:
array([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.],
...,
[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
[15]:
schwann.shape
[15]:
(1937, 3)
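The shape suggests one row per cell and one column per selected marker (an assumption based on the three markers per cluster shown above). Under that reading, a quick summary is the fraction of cells in which each marker is detected. The sketch below uses a small made-up matrix and a hypothetical helper, `detection_rate`, written with plain Python so it runs without dependencies:

```python
def detection_rate(rows):
    """Fraction of rows (cells) with a non-zero value, per column (marker)."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return [
        sum(1 for row in rows if row[j] > 0) / n_rows
        for j in range(n_cols)
    ]


# Toy stand-in for a (cells x markers) array; the values are made up.
toy = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0],
    [3.0, 0.0, 0.0],
]

rates = detection_rate(toy)
print(rates)
```

The helper only indexes and iterates, so it should also accept a 2-D NumPy array such as `schwann`. With NumPy, `(schwann > 0).mean(axis=0)` computes the same fractions directly.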
You can also perform marker selection with dynamic programming.
[16]:
baron_h1.sel_clust_marker(dyn_prog=True)
-> Selecting markers for each cluster
-> |⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛| 100% Number of Clusters With Selected Markers : 14
If you want, you can increase the number of markers to be selected.
Note
If you are going to increase the number of markers, make sure that the number of genes remaining after filtering is more than the number of markers to be selected.
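The condition in the note can be checked before running the selection. A minimal sketch with a hypothetical helper, `clusters_short_of_candidates`, applied to a made-up dictionary shaped like the one returned by get_filter_genes():

```python
def clusters_short_of_candidates(candidates, nof_markers):
    """Return clusters whose candidate count does not exceed nof_markers."""
    return sorted(
        cluster
        for cluster, genes in candidates.items()
        if len(genes) <= nof_markers
    )


# Toy candidate pools with made-up cluster and gene names.
toy_candidates = {
    "cluster_a": ["g1", "g2", "g3", "g4"],
    "cluster_b": ["g5", "g6"],
}

print(clusters_short_of_candidates(toy_candidates, nof_markers=3))
```

If the returned list is not empty, either raise nof_sel when filtering or request fewer markers.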
[17]:
baron_h1.filter_genes(nof_sel=20)
baron_h1.sel_clust_marker(nof_markers=10)
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting markers for each cluster
[18]:
baron_h1.get_markers(ind_return=True)
[18]:
| | Marker_1 | Marker_2 | Marker_3 | Marker_4 | Marker_5 | Marker_6 | Marker_7 | Marker_8 | Marker_9 | Marker_10 |
---|---|---|---|---|---|---|---|---|---|---|
C_acinar | 13082 | 3055 | 3733 | 8968 | 16742 | 4020 | 3057 | 12577 | 13083 | 13081 |
C_activated_stellate | 3836 | 15293 | 3637 | 18923 | 17233 | 6179 | 5878 | 3632 | 9231 | 405 |
C_alpha | 8442 | 6534 | 11417 | 10734 | 2390 | 8713 | 14361 | 17366 | 15991 | 12291 |
C_beta | 322 | 7240 | 6383 | 5342 | 11265 | 68 | 13322 | 1974 | 15686 | 4675 |
C_delta | 9345 | 10318 | 12504 | 1449 | 14210 | 6407 | 4516 | 6543 | 959 | 15098 |
C_ductal | 10397 | 9048 | 16828 | 3383 | 13242 | 9077 | 13639 | 17311 | 4203 | 809 |
C_endothelial | 13041 | 12627 | 2879 | 12396 | 6159 | 8756 | 16167 | 13115 | 17277 | 4968 |
C_epsilon | 6637 | 6299 | 11320 | 16388 | 19024 | 9723 | 6358 | 12939 | 6027 | 190 |
C_gamma | 13386 | 932 | 16601 | 7869 | 10138 | 18672 | 12375 | 18319 | 11391 | 2159 |
C_macrophage | 12935 | 1973 | 8490 | 15095 | 1971 | 3888 | 7541 | 4127 | 15487 | 7307 |
C_mast | 17870 | 17869 | 3734 | 2816 | 14929 | 14396 | 2020 | 6502 | 8896 | 9696 |
C_quiescent_stellate | 14362 | 11057 | 4987 | 8266 | 6668 | 315 | 19198 | 372 | 3930 | 1823 |
C_schwann | 16160 | 15146 | 6911 | 12598 | 3859 | 1728 | 16509 | 6356 | 4579 | 7961 |
C_t_cell | 2842 | 19329 | 17887 | 2729 | 18058 | 6503 | 8385 | 12389 | 2840 | 2854 |