Select Markers

  • This tutorial shows how to select markers with the scmags package.

Let’s perform these operations with the baron_h1 data set in the package.

For this, we first import the package.

[1]:
import scmags as mg

Then we can start the operations by loading the dataset.

[2]:
baron_h1 = mg.datasets.baron_h1()

Filter Genes

First, redundant genes need to be filtered out for computational efficiency.

[3]:
baron_h1.filter_genes()
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
You can view the genes that remain after filtering as follows.
The function returns a dictionary in which each key corresponds to a cluster; the example shows only the first two entries.
[4]:
rem_genes = baron_h1.get_filter_genes()
dict(list(rem_genes.items())[0:2])
[4]:
{'acinar': ['ALB',
  'ALDOB',
  'CEL',
  'CELA2A',
  'CUZD1',
  'GP2',
  'KLK1',
  'PDIA2',
  'PNLIPRP1',
  'PNLIPRP2'],
 'activated_stellate': ['ADAMTS12',
  'COL6A3',
  'CRLF1',
  'FBN1',
  'FMOD',
  'GGT5',
  'LAMC3',
  'SFRP2',
  'THBS2',
  'VCAN']}
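
The preview in cell [4] builds a full list of items before slicing it. A minimal sketch of the same idea with itertools.islice, which stops after the first n entries, is shown here. The rem_genes dictionary is a hypothetical, truncated stand-in for the real return value, and preview is an illustrative helper, not part of scmags.

```python
from itertools import islice

# Hypothetical stand-in for the dictionary returned by get_filter_genes();
# gene names are taken from the tutorial output and truncated for brevity.
rem_genes = {
    "acinar": ["ALB", "ALDOB", "CEL"],
    "activated_stellate": ["ADAMTS12", "COL6A3", "CRLF1"],
    "alpha": ["IRX2", "GC", "NPNT"],
}


def preview(mapping, n=2):
    """Return the first n key/value pairs of a dict in insertion order."""
    return dict(islice(mapping.items(), n))


first_two = preview(rem_genes)
print(first_two)
```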

To get the corresponding indices in the data matrix instead of the gene names, set ind_return=True.

[5]:
baron_h1.get_filter_genes(ind_return=True)
[5]:
{'acinar': array([  550,   572,  3055,  3057,  4062,  6872,  8968, 12577, 13082,
        13083]),
 'activated_stellate': array([  276,  3637,  3836,  5878,  6179,  6624,  9231, 15293, 17233,
        18923]),
 'alpha': array([ 2390,  6534,  7966,  8442,  8713, 10734, 11417, 12416, 14361,
        15835]),
 'beta': array([   68,   322,  1974,  4675,  5342,  6383,  7240, 11265, 13322,
        15686]),
 'delta': array([ 1449,  4516,  6407,  6543,  9345, 10318, 12504, 14210, 16021,
        16794]),
 'ductal': array([ 3383,  4203,  9048,  9077, 10397, 13242, 13639, 15205, 16828,
        17311]),
 'endothelial': array([  357,  3402,  4968,  5360,  6159,  8756, 12396, 14586, 16167,
        19048]),
 'epsilon': array([ 6027,  6299,  6358,  6637,  9723, 11320, 12939, 15274, 16388,
        19024]),
 'gamma': array([  932,  6357,  9325, 10138, 11153, 12375, 13822, 16601, 18319,
        18672]),
 'macrophage': array([ 1971,  1973,  3888,  4127,  7307,  7541,  8490, 12935, 15095,
        15487]),
 'mast': array([   63,   252,  2020,  2816,  3734,  6502,  8896, 14929, 17869,
        17870]),
 'quiescent_stellate': array([  315,   372,  3930,  4987,  6668,  8266,  8633, 11057, 14362,
        19198]),
 'schwann': array([ 1728,  3859,  4579,  6356,  6911,  7961,  8585, 15146, 16160,
        16509]),
 't_cell': array([ 2729,  2842,  2860,  6503,  8385, 12389, 16339, 17887, 18058,
        19329])}
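
Index arrays like these can be used directly with NumPy to look up gene names or slice expression columns. The sketch below uses hypothetical toy data (gene_names, data, rem_idx) and assumes that rows are cells and columns are genes; how scmags stores its matrix internally is not shown in this tutorial.

```python
import numpy as np

# Hypothetical toy data: 4 cells x 6 genes (assumed layout: rows = cells, columns = genes).
gene_names = np.array(["G0", "G1", "G2", "G3", "G4", "G5"])
data = np.arange(24, dtype=float).reshape(4, 6)

# Same structure as the ind_return=True output: cluster -> array of gene indices.
rem_idx = {"cluster_a": np.array([0, 3]), "cluster_b": np.array([2, 5])}

# Map the indices back to gene names.
names_by_cluster = {k: gene_names[v].tolist() for k, v in rem_idx.items()}

# Slice the matching columns out of the expression matrix.
sub_by_cluster = {k: data[:, v] for k, v in rem_idx.items()}

print(names_by_cluster)
print(sub_by_cluster["cluster_b"].shape)
```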

If you have not set an intra-cluster expression rate threshold, you can also view the automatically determined thresholds.

[6]:
baron_h1.get_filt_cluster_thresholds
[6]:
{'acinar': 0.6545454561710358,
 'activated_stellate': 0.7058823555707932,
 'alpha': 0.6525423675775528,
 'beta': 0.6416284441947937,
 'delta': 0.6355140209197998,
 'ductal': 0.6791666597127914,
 'endothelial': 0.6192307695746422,
 'epsilon': 0.6538461595773697,
 'gamma': 0.6285714358091354,
 'macrophage': 0.6785714328289032,
 'mast': 0.625,
 'quiescent_stellate': 0.614130437374115,
 'schwann': 0.7000000029802322,
 't_cell': 0.75}
If you don’t want the threshold to be determined automatically, you can set it yourself with the in_cls_thres parameter.
This value should be between 0 and 1.
[7]:
baron_h1.filter_genes(in_cls_thres=0.7)
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
The threshold is now 0.7 (70%) for every cluster, as the next cell confirms.
In this case, genes expressed in fewer than 70% of the cells in a cluster are filtered out for that cluster.
[8]:
baron_h1.get_filt_cluster_thresholds
[8]:
{'acinar': 0.7,
 'activated_stellate': 0.7,
 'alpha': 0.7,
 'beta': 0.7,
 'delta': 0.7,
 'ductal': 0.7,
 'endothelial': 0.7,
 'epsilon': 0.7,
 'gamma': 0.7,
 'macrophage': 0.7,
 'mast': 0.7,
 'quiescent_stellate': 0.7,
 'schwann': 0.7,
 't_cell': 0.7}
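
The in-cluster expression rate rule described above can be sketched in NumPy. This is only an illustration of the 70% criterion on hypothetical toy data, assuming a gene counts as expressed when its value is greater than zero. It does not reproduce the other filtering steps scmags performs (such as selecting cluster-specific candidates), and genes_passing_threshold is an illustrative helper, not a scmags function.

```python
import numpy as np


def genes_passing_threshold(data, labels, cluster, in_cls_thres=0.7):
    """Return indices of genes expressed (value > 0) in at least
    `in_cls_thres` of the cells that belong to `cluster`."""
    in_cluster = data[labels == cluster]        # keep only the cells of this cluster
    expr_rate = (in_cluster > 0).mean(axis=0)   # fraction of expressing cells per gene
    return np.flatnonzero(expr_rate >= in_cls_thres)


# Hypothetical toy matrix: 5 cells x 4 genes, with one cluster label per cell.
data = np.array([
    [1, 0, 2, 0],
    [3, 0, 1, 0],
    [2, 1, 0, 0],
    [0, 4, 0, 5],
    [0, 2, 0, 1],
], dtype=float)
labels = np.array(["a", "a", "a", "b", "b"])

kept_a = genes_passing_threshold(data, labels, "a")
kept_b = genes_passing_threshold(data, labels, "b")
print(kept_a, kept_b)
```

In cluster "a", gene 2 is expressed in two of three cells (about 67%), so it is dropped at a threshold of 0.7 but would be kept at 0.6.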
You can also set the number of genes that remain after filtering with the nof_sel parameter.
This may be necessary when you want to select more marker genes, because marker selection is carried out on the genes that remain after filtering.
As the example shows, when the parameter is set to 20, 20 genes remain for each cluster after filtering.
[9]:
baron_h1.filter_genes(nof_sel=20)
baron_h1.get_filter_genes(ind_return=True)
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
[9]:
{'acinar': array([  550,   572,   799,  3055,  3057,  3733,  4020,  4062,  6872,
         7103,  7104,  8968, 10655, 10732, 12577, 13081, 13082, 13083,
        13620, 16742]),
 'activated_stellate': array([  276,   405,  1990,  3632,  3633,  3637,  3836,  5878,  6179,
         6182,  6624,  9231,  9504,  9691, 10387, 15293, 17233, 17850,
        18923, 19025]),
 'alpha': array([ 2390,  3235,  6534,  6730,  7966,  8442,  8713, 10734, 11417,
        12291, 12339, 12416, 12508, 12637, 14361, 15275, 15835, 15991,
        17366, 18944]),
 'beta': array([   68,   322,  1974,  2995,  3533,  4675,  4993,  5342,  6383,
         6420,  7240, 10627, 11265, 12321, 13322, 13959, 14098, 14999,
        15686, 18304]),
 'delta': array([  959,  1449,  1557,  2353,  4516,  6407,  6543,  9345, 10068,
        10318, 11279, 12285, 12504, 12506, 14210, 15098, 16021, 16729,
        16794, 17839]),
 'ductal': array([  458,   809,  3383,  3401,  4203,  4835,  5129,  6855,  8818,
         9048,  9077,  9840, 10397, 13242, 13639, 15205, 15912, 16828,
        17129, 17311]),
 'endothelial': array([  357,  2879,  3402,  4968,  5356,  5360,  6159,  7892,  8756,
        12361, 12396, 12627, 13041, 13115, 14586, 14892, 16167, 16564,
        17277, 19048]),
 'epsilon': array([  190,   878,  2383,  2979,  6027,  6212,  6299,  6358,  6637,
         9723, 11148, 11320, 11473, 12939, 13229, 13895, 15211, 15274,
        16388, 19024]),
 'gamma': array([  932,  2159,  3236,  6357,  7869,  9325, 10138, 11153, 11391,
        12296, 12375, 12507, 12630, 13386, 13822, 14998, 16601, 17665,
        18319, 18672]),
 'macrophage': array([  482,  1971,  1973,  2014,  3888,  4127,  7307,  7535,  7536,
         7541,  7545,  7547,  8490, 11002, 12935, 15095, 15160, 15487,
        15841, 16307]),
 'mast': array([   63,   252,   290,  1026,  2020,  2816,  3734,  6502,  7308,
         8896,  9696, 13773, 14118, 14396, 14929, 15163, 15544, 15773,
        17869, 17870]),
 'quiescent_stellate': array([  274,   287,   315,   372,  1713,  1823,  3930,  4987,  6668,
         7906,  8247,  8266,  8469,  8633,  8922, 11057, 11455, 14362,
        17186, 19198]),
 'schwann': array([ 1728,  3859,  4579,  4988,  5199,  5551,  6356,  6911,  7961,
         8585, 10122, 10469, 11171, 11492, 12598, 13240, 14221, 15146,
        16160, 16509]),
 't_cell': array([ 2729,  2840,  2842,  2854,  2860,  5280,  5692,  6503,  8385,
         9730, 11167, 12389, 13946, 14003, 14086, 16339, 17887, 18058,
        19329, 19436])}

Select Markers

After the filtering process, you can select the markers.

[10]:
baron_h1.sel_clust_marker()
-> Selecting  markers for each cluster

You can view the selected markers as follows.

[11]:
baron_h1.get_markers()
[11]:
                      Marker_1  Marker_2  Marker_3
C_acinar              PNLIPRP1  CEL       CPA2
C_activated_stellate  CRLF1     SFRP2     COL6A3
C_alpha               IRX2      GC        NPNT
C_beta                ADCYAP1   HADH      G6PC2
C_delta               LEPR      MIR7.3HG  PCP4
C_ductal              MMP7      KRT19     TACSTD2
C_endothelial         PLVAP     PECAM1    CD93
C_epsilon             GHRL      FRZB      NNMT
C_gamma               PPY       AQP3      STMN2
C_macrophage          PLA2G7    C1QC      ITGB2
C_mast                TPSB2     TPSAB1    CPA3
C_quiescent_stellate  RGS5      NDUFA4L2  EDNRA
C_schwann             SOX10     SEMA3C    GPM6B
C_t_cell              CD3D      ZAP70     TRAC

Alternatively, you can view the corresponding indices in the data matrix.

[12]:
baron_h1.get_markers(ind_return = True)
[12]:
                      Marker_1  Marker_2  Marker_3
C_acinar              13082     3055      3733
C_activated_stellate  3836      15293     3637
C_alpha               8442      6534      11417
C_beta                322       7240      6383
C_delta               9345      10318     12504
C_ductal              10397     9048      16828
C_endothelial         13041     12627     2879
C_epsilon             6637      6299      11320
C_gamma               13386     932       16601
C_macrophage          12935     1973      8490
C_mast                17870     17869     3734
C_quiescent_stellate  14362     11057     4987
C_schwann             16160     15146     6911
C_t_cell              2842      19329     17887

You can also extract the expression values of the selected markers from the data matrix. The result is a dictionary with one array per cluster.

[13]:
mark_data = baron_h1.get_marker_data()
mark_data
[13]:
{'acinar': array([[4.9068906 , 3.5849625 , 6.89481776],
        [2.80735492, 3.169925  , 6.82017896],
        [4.39231742, 4.70043972, 5.4918531 ],
        ...,
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]),
 'activated_stellate': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 'alpha': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 'beta': array([[0., 2., 0.],
        [0., 1., 0.],
        [0., 2., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 'delta': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 1., 1.],
        [0., 0., 0.]]),
 'ductal': array([[0.        , 0.        , 0.        ],
        [0.        , 2.80735492, 4.24792751],
        [0.        , 0.        , 2.5849625 ],
        ...,
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ]]),
 'endothelial': array([[0.      , 0.      , 0.      ],
        [0.      , 0.      , 0.      ],
        [0.      , 0.      , 0.      ],
        ...,
        [3.169925, 2.      , 3.      ],
        [0.      , 0.      , 0.      ],
        [0.      , 0.      , 0.      ]]),
 'epsilon': array([[0., 0., 0.],
        [0., 0., 0.],
        [1., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 'gamma': array([[2.32192809, 0.        , 0.        ],
        [1.        , 1.5849625 , 0.        ],
        [1.5849625 , 0.        , 0.        ],
        ...,
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 1.        ],
        [2.        , 0.        , 0.        ]]),
 'macrophage': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 'mast': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 'quiescent_stellate': array([[0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        ...,
        [0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        ],
        [2.32192809, 1.5849625 , 1.        ]]),
 'schwann': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]),
 't_cell': array([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        ...,
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])}

Three markers are selected by default for each cluster, which is why each array has three columns. You can access the marker data for each cluster with the dictionary keys.

[14]:
schwann = mark_data['schwann']
schwann
[14]:
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       ...,
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
[15]:
schwann.shape
[15]:
(1937, 3)
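
The array has one row per cell and one column per marker, so most rows are zero for a marker that is specific to a small cluster. The following sketch shows one way to compare marker expression inside and outside a cluster with a boolean mask. The marker_expr and labels arrays are hypothetical toy data; how to obtain per-cell cluster labels from scmags is not covered in this tutorial.

```python
import numpy as np

# Hypothetical toy stand-ins: 6 cells x 3 markers, plus one cluster label per cell.
marker_expr = np.array([
    [0.0, 0.0, 0.0],
    [2.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
])
labels = np.array(["alpha", "schwann", "beta", "schwann", "alpha", "beta"])

# Boolean mask selecting the rows (cells) that belong to the cluster of interest.
in_cluster = labels == "schwann"

# Fraction of cells with non-zero expression, inside vs. outside the cluster.
rate_in = (marker_expr[in_cluster] > 0).mean(axis=0)
rate_out = (marker_expr[~in_cluster] > 0).mean(axis=0)

print(rate_in, rate_out)
```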

You can also perform marker selection with dynamic programming.

[16]:
baron_h1.sel_clust_marker(dyn_prog=True)
-> Selecting  markers for each cluster
-> |⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛| 100% Number of Clusters With Selected Markers : 14

If you want, you can increase the number of markers to be selected.

Note

If you increase the number of markers, make sure that the number of genes remaining after filtering for each cluster is greater than the number of markers to be selected.
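
The condition in the note can be checked before running marker selection. A minimal sketch, assuming a dictionary shaped like the return value of get_filter_genes(): clusters_with_too_few_genes is a hypothetical helper and the filtered dictionary is toy data.

```python
def clusters_with_too_few_genes(filtered, nof_markers):
    """Return the clusters whose filtered gene list is not strictly
    longer than the requested number of markers."""
    return [name for name, genes in filtered.items() if len(genes) <= nof_markers]


# Hypothetical toy data: cluster -> filtered genes (placeholder integers stand in for genes).
filtered = {"acinar": list(range(20)), "mast": list(range(10))}

short = clusters_with_too_few_genes(filtered, nof_markers=10)
if short:
    print("Increase nof_sel for:", short)
```

An empty list means every cluster has enough candidate genes for the requested number of markers.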

[17]:
baron_h1.filter_genes(nof_sel=20)
baron_h1.sel_clust_marker(nof_markers=10)
-> Eliminating low expression genes
-> Selecting cluster-specific candidate marker genes
-> Selecting  markers for each cluster
[18]:
baron_h1.get_markers(ind_return = True)
[18]:
                      Marker_1  Marker_2  Marker_3  Marker_4  Marker_5  Marker_6  Marker_7  Marker_8  Marker_9  Marker_10
C_acinar              13082     3055      3733      8968      16742     4020      3057      12577     13083     13081
C_activated_stellate  3836      15293     3637      18923     17233     6179      5878      3632      9231      405
C_alpha               8442      6534      11417     10734     2390      8713      14361     17366     15991     12291
C_beta                322       7240      6383      5342      11265     68        13322     1974      15686     4675
C_delta               9345      10318     12504     1449      14210     6407      4516      6543      959       15098
C_ductal              10397     9048      16828     3383      13242     9077      13639     17311     4203      809
C_endothelial         13041     12627     2879      12396     6159      8756      16167     13115     17277     4968
C_epsilon             6637      6299      11320     16388     19024     9723      6358      12939     6027      190
C_gamma               13386     932       16601     7869      10138     18672     12375     18319     11391     2159
C_macrophage          12935     1973      8490      15095     1971      3888      7541      4127      15487     7307
C_mast                17870     17869     3734      2816      14929     14396     2020      6502      8896      9696
C_quiescent_stellate  14362     11057     4987      8266      6668      315       19198     372       3930      1823
C_schwann             16160     15146     6911      12598     3859      1728      16509     6356      4579      7961
C_t_cell              2842      19329     17887     2729      18058     6503      8385      12389     2840      2854