Training a simple Compocyte classifier with PBMC data.¶
In this tutorial, you will learn how to train a Compocyte classifier on any data. For simplicity, we use the published 10x PBMC dataset. To make every step explicit, we also walk through labelling these cells and fitting a hierarchy to the labels, so that Compocyte has everything it needs to work.
Preprocessing and analysis¶
[1]:
import scanpy as sc
import numpy as np
# Load the 10x PBMC dataset
adata = sc.datasets.pbmc3k()
[2]:
adata
[2]:
AnnData object with n_obs × n_vars = 2700 × 32738
var: 'gene_ids'
We save the count data to .raw, split off a third of the cells (900 of 2,700) for testing, then normalize and log-transform the remaining data and subset it to highly variable genes. Restricting the data to highly variable genes improves the signal-to-noise ratio for clustering.
[ ]:
import os
# Preprocess the data
adata.raw = adata.copy()
# Generate holdout data for testing.
rng = np.random.default_rng(seed=0)
test_adata = adata[rng.choice(adata.obs_names, size=900, replace=False), :].copy()
os.makedirs("./exclude", exist_ok=True)
test_adata.write("./exclude/test_adata.h5ad")
adata = adata[~adata.obs_names.isin(test_adata.obs_names)].copy()
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
# Log-transform the data
sc.pp.log1p(adata)
# Select the top 2000 highly variable genes for better signal-to-noise ratio upon clustering
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable]
print('X after preprocessing: ', adata.X[:10, :10])
X after preprocessing: <Compressed Sparse Row sparse matrix of dtype 'float32'
with 7 stored elements and shape (10, 10)>
Coords Values
(0, 2) 1.111715316772461
(0, 4) 1.111715316772461
(1, 2) 1.4292607307434082
(2, 2) 1.5663871765136719
(4, 8) 1.7219784259796143
(7, 2) 1.6449213027954102
(8, 2) 1.4576793909072876
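To make the two transformation steps above concrete, here is a minimal sketch in plain Python of what total-count normalization followed by a log(1 + x) transform does to a single cell. It is an illustration only, not scanpy's implementation; the function name `normalize_and_log` and the toy counts are hypothetical.

```python
import math

def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts so they sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # A cell without any counts stays all-zero.
        return [0.0 for _ in counts]
    return [math.log1p(c * target_sum / total) for c in counts]

# Hypothetical cell with 2,000 total counts spread over four genes.
cell = [0, 2, 18, 1980]
transformed = normalize_and_log(cell)
print([round(v, 3) for v in transformed])
```

Zero counts remain zero after both steps, which is why the preprocessed matrix shown above is still sparse.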
[4]:
# Cluster cells using Leiden clustering after principal component analysis
sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
/usr/local/lib/python3.14/site-packages/scanpy/preprocessing/_pca/__init__.py:359: ImplicitModificationWarning: Setting element `.obsm['X_pca']` of view, initializing view as actual.
adata.obsm[key_obsm] = x_pca
/usr/local/lib/python3.14/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
[5]:
sc.tl.leiden(adata, resolution=0.8)
/tmp/ipykernel_17908/3412306151.py:1: FutureWarning: The `igraph` implementation of leiden clustering is *orders of magnitude faster*. Set the flavor argument to (and install if needed) 'igraph' to use it.
In the future, the default backend for leiden will be igraph instead of leidenalg. To achieve the future defaults please pass: `flavor='igraph'` and `n_iterations=2`. `directed` must also be `False` to work with igraph’s implementation.
sc.tl.leiden(adata, resolution=0.8)
Labelling cell types¶
[6]:
sc.tl.umap(adata)
[7]:
sc.pl.umap(
    adata,
    color=["leiden"],
    size=30,
)
[8]:
sc.pl.dotplot(adata, var_names=['CD3D', 'CD4', 'CD8A', 'KLRB1', 'NCAM1', 'FCGR3A', 'CD19', 'CD38', 'CD14', 'VCAN', 'FCER1A', 'CLEC4C', 'HBB', 'ITGB3'], groupby='leiden')
The two plots above give a deliberately simple overview of the information we will use to label the cells in the dataset: the UMAP shows how the clusters relate to each other after dimensionality reduction, and the dotplot shows marker gene expression aggregated per cluster. Spatial relationships on a UMAP should be interpreted with care during cell-type labelling, but for this short tutorial, which makes no claim to completeness, these two plots suffice.
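To make explicit what a dotplot aggregates per cluster, the sketch below computes, for a single gene, the fraction of cells in each cluster with non-zero expression and the mean expression over all cells of the cluster. This is a plain-Python illustration under those assumptions, not scanpy's code; the function name `dotplot_stats` and the toy values are hypothetical.

```python
from collections import defaultdict

def dotplot_stats(expression, clusters):
    """Return {cluster: (fraction_expressing, mean_expression)} for one gene."""
    grouped = defaultdict(list)
    for value, cluster in zip(expression, clusters):
        grouped[cluster].append(value)
    stats = {}
    for cluster, values in grouped.items():
        fraction = sum(v > 0 for v in values) / len(values)
        mean = sum(values) / len(values)
        stats[cluster] = (fraction, mean)
    return stats

# Hypothetical log-normalized values of one marker gene in six cells.
expression = [0.0, 1.5, 2.5, 0.0, 0.0, 1.0]
clusters = ['0', '0', '0', '1', '1', '1']
stats = dotplot_stats(expression, clusters)
for cluster, (fraction, mean) in sorted(stats.items()):
    print(cluster, round(fraction, 2), round(mean, 2))
```

A marker that is both frequently and strongly expressed in one cluster, and rare elsewhere, is the kind of pattern used for the label assignment in this tutorial.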
[9]:
# Map Leiden clusters to cell type labels based on the dotplot and known marker genes for each cell type
adata.obs['label'] = adata.obs['leiden'].map(
    {
        '0': 'CD4 T cells',
        '1': 'Classical monocytes',
        '2': 'CD8 T cells',
        '3': 'B cells',  # probably includes both antigen-naive and antigen-experienced B cells
        '4': 'Non-classical monocytes',
        '5': 'ILCs',  # in the sense of both NK cells and other ILCs
        '6': 'mixed'
    }
)
[10]:
adata.obs['label'].isna().any()
[10]:
np.False_
[11]:
# Remove the mixed cluster, since it is not a well-defined cell type and would likely introduce noise into the classifier.
adata = adata[adata.obs.label != 'mixed'].copy()
Building a hierarchy¶
This is where it gets interesting. We have assigned, by cluster, one label per cell. This is the input that cell type classifiers usually receive. To harness the potential of Compocyte’s structure, we need to explicitly define a hierarchy that contains all of the above labels. This helps the classifier weigh relationships between different cell type labels and defines branching points in the classification process that can be exchanged or extended down the road.
There is one important assumption to keep in mind when building a hierarchy for Compocyte: for every label at the bottom of the hierarchy (a leaf node), all labels above it on the same branch must also be true. For example, a dendritic cell is also a myeloid cell and also a blood cell. A CD8 T cell is also a T cell and also a lymphoid cell.
Violating this assumption, for example by building your hierarchy from a more developmental point of view, will lead to performance issues. The point of the hierarchy is to group transcriptomically similar cells into shared classification branches.
A simple hierarchy of this kind would be:
[12]:
from Compocyte.core.tools import make_graph_from_edges
import networkx as nx
hierarchy = {
    'Blood': {
        'Lymphoid': {
            'T cells': {'CD4 T cells': {}, 'CD8 T cells': {}},
            'B cells': {},
            'ILCs': {}
        },
        'Myeloid': {
            'Classical monocytes': {}, 'Non-classical monocytes': {},
        },
    }
}
graph = nx.DiGraph()
make_graph_from_edges(hierarchy, graph)
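As a sanity check on a hierarchy like this one, it can help to list the root-to-node path of every node and to confirm that every label used in the data actually occurs in the hierarchy. The following is a minimal sketch in plain Python; `root_paths` is a hypothetical helper written for this illustration and is not part of Compocyte.

```python
def root_paths(tree, prefix=()):
    """Map every node of a nested-dict hierarchy to its path from the root."""
    paths = {}
    for node, children in tree.items():
        path = prefix + (node,)
        paths[node] = path
        paths.update(root_paths(children, path))
    return paths

hierarchy = {
    'Blood': {
        'Lymphoid': {
            'T cells': {'CD4 T cells': {}, 'CD8 T cells': {}},
            'B cells': {},
            'ILCs': {},
        },
        'Myeloid': {
            'Classical monocytes': {}, 'Non-classical monocytes': {},
        },
    }
}

paths = root_paths(hierarchy)
# Parent nodes are the branching points; leaf nodes have no children.
parents = {path[-2] for path in paths.values() if len(path) > 1}
leaves = sorted(node for node in paths if node not in parents)

# Labels assigned to the clusters earlier in this tutorial.
labels = {'CD4 T cells', 'Classical monocytes', 'CD8 T cells',
          'B cells', 'Non-classical monocytes', 'ILCs'}
missing = labels - set(paths)
print(paths['CD8 T cells'])
print(leaves)
print(missing)
```

An empty `missing` set means every label can be placed on the hierarchy, and the path of each leaf spells out the chain of labels that must all hold for that cell type.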
[13]:
from networkx.drawing.nx_agraph import graphviz_layout
# Plot the graph structure that we will give to the classifier for training.
pos = graphviz_layout(
    graph, prog="dot",
    root='Blood',
    args='-Gsplines=curved -Gnodesep=8 -Goverlap=scalexy -Gbeautify=false'
)
nx.draw(
    graph, pos,
    with_labels=True,
    node_color="#9ecae1",
    node_size=1200,
    edge_color="#888",
    width=1.5,
    font_size=10,
    font_weight="bold",
)
Training the classifier¶
We have now defined cell type labels and a hierarchy that fits these labels and tells Compocyte how we want the classifier structure to be set up. However, one small task remains before we can begin training.
To provide training labels for the local classifiers at every branching point, we need to infer the intermediate labels of each cell in the hierarchy. This is done with the infer_levels function, which takes the hierarchy, the name of the column in adata.obs that contains the cell type labels, the name of the root node in the hierarchy, and the adata object itself. It adds the level annotations as separate obs columns to the provided AnnData object and returns the column names it used, which we store as obs_names. These can then be passed to the classifier so it knows where to look.
[14]:
from Compocyte.core.tools import infer_levels
# Save level labels to adata.obs and receive the list of obs columns where they have been saved.
_, obs_names = infer_levels(
    hierarchy=hierarchy,
    labels='label',
    root_node='Blood',
    adata=adata
)
[15]:
obs_names
[15]:
['Level_0', 'Level_1', 'Level_2', 'Level_3']
[16]:
adata.obs
[16]:
| index | leiden | label | Level_0 | Level_1 | Level_2 | Level_3 |
|---|---|---|---|---|---|---|
| AAACATTGAGCTAC-1 | 3 | B cells | Blood | Lymphoid | B cells | |
| AAACATTGATCAGC-1 | 0 | CD4 T cells | Blood | Lymphoid | T cells | CD4 T cells |
| AAACCGTGCTTCCG-1 | 1 | Classical monocytes | Blood | Myeloid | Classical monocytes | |
| AAACGCACTGGTAC-1 | 0 | CD4 T cells | Blood | Lymphoid | T cells | CD4 T cells |
| AAACGCTGACCAGT-1 | 2 | CD8 T cells | Blood | Lymphoid | T cells | CD8 T cells |
| ... | ... | ... | ... | ... | ... | ... |
| TTTCGAACACCTGA-1 | 3 | B cells | Blood | Lymphoid | B cells | |
| TTTCGAACTCTCAT-1 | 1 | Classical monocytes | Blood | Myeloid | Classical monocytes | |
| TTTCTACTGAGGCA-1 | 3 | B cells | Blood | Lymphoid | B cells | |
| TTTCTACTTCCTCG-1 | 3 | B cells | Blood | Lymphoid | B cells | |
| TTTGCATGAGAGGC-1 | 3 | B cells | Blood | Lymphoid | B cells | |
1792 rows × 6 columns
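Conceptually, the level columns in this table can be derived by looking up each label's path from the root and padding shorter paths up to the depth of the hierarchy. The sketch below illustrates that idea in plain Python. It is not Compocyte's implementation of infer_levels; the helper names are hypothetical, and padding with empty strings is an assumption based on the blank cells in the table.

```python
def find_path(tree, target, prefix=()):
    """Return the root-to-target path in a nested-dict hierarchy, or None."""
    for node, children in tree.items():
        path = prefix + (node,)
        if node == target:
            return path
        found = find_path(children, target, path)
        if found is not None:
            return found
    return None

def depth(tree):
    """Number of levels in a nested-dict hierarchy."""
    if not tree:
        return 0
    return 1 + max(depth(children) for children in tree.values())

def level_annotations(tree, label):
    """Return {'Level_0': ..., 'Level_1': ...} for one cell type label."""
    path = find_path(tree, label)
    if path is None:
        raise KeyError(f'{label!r} is not a node of the hierarchy')
    # Assumption: levels below the label are left as empty strings.
    padded = path + ('',) * (depth(tree) - len(path))
    return {f'Level_{i}': node for i, node in enumerate(padded)}

hierarchy = {
    'Blood': {
        'Lymphoid': {
            'T cells': {'CD4 T cells': {}, 'CD8 T cells': {}},
            'B cells': {},
            'ILCs': {},
        },
        'Myeloid': {
            'Classical monocytes': {}, 'Non-classical monocytes': {},
        },
    }
}

b_cell_levels = level_annotations(hierarchy, 'B cells')
cd4_levels = level_annotations(hierarchy, 'CD4 T cells')
print(b_cell_levels)
print(cd4_levels)
```

Labels that sit higher in the hierarchy, such as B cells, simply end one level earlier than the deepest branch.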
[17]:
from Compocyte.core.hierarchical_classifier import HierarchicalClassifier
from Compocyte.core.models.log_reg import LogisticRegression
# Train the hierarchical classifier. This will train a separate classifier for each parent node in the hierarchy.
classifier = HierarchicalClassifier(
    save_path="./exclude/pbmc_classifier",
    adata=adata,
    root_node='Blood',
    dict_of_cell_relations=hierarchy,
    obs_names=obs_names)
# For training speed, set the classifier type for all nodes to logistic regression.
# In practice, one would likely want to experiment with different classifier types for different nodes in the hierarchy.
# The default is a 2-layer FCNN with 64 nodes each.
for node in classifier.graph.nodes:
    classifier.set_classifier_type(node, LogisticRegression)
classifier.train_all_child_nodes()
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:110: UserWarning: Features [ 40 197 229 426 440 484 554 882 951 978 1084 1195 1341 1414
1446 1467 1639 1712 1864 1870 1942] are constant.
warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: invalid value encountered in divide
f = msb / msw
Training at Blood.
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:110: UserWarning: Features [ 24 40 67 127 151 168 186 197 207 216 229 233 243 277
283 313 321 339 412 426 440 455 473 484 516 549 554 560
568 661 681 707 726 794 851 882 901 902 937 940 951 961
969 978 990 1015 1019 1024 1054 1084 1123 1131 1150 1184 1195 1196
1206 1207 1219 1259 1289 1321 1341 1349 1397 1411 1414 1431 1446 1467
1491 1504 1524 1535 1607 1639 1652 1674 1677 1697 1699 1712 1713 1755
1765 1832 1864 1870 1873 1882 1899 1906 1915 1942] are constant.
warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: invalid value encountered in divide
f = msb / msw
Training at Lymphoid.
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:110: UserWarning: Features [ 11 12 22 24 40 47 52 67 81 90 107 115 127 151
168 186 189 197 207 216 226 229 231 233 237 239 243 267
277 280 283 304 313 315 317 318 321 325 328 339 347 354
362 364 398 399 412 423 426 437 440 455 473 484 494 515
516 529 532 549 554 560 562 568 573 587 591 629 661 681
690 707 719 722 726 744 746 749 767 781 788 794 805 807
843 851 870 882 885 889 896 901 902 924 937 940 949 951
961 969 974 978 982 990 999 1014 1015 1018 1019 1022 1024 1050
1053 1054 1059 1060 1076 1084 1100 1123 1131 1132 1150 1151 1152 1153
1154 1182 1184 1194 1195 1196 1198 1199 1206 1207 1214 1219 1223 1230
1240 1259 1273 1278 1280 1289 1301 1321 1341 1343 1344 1349 1375 1385
1386 1397 1410 1411 1414 1426 1431 1436 1446 1450 1461 1467 1474 1475
1491 1504 1520 1524 1535 1541 1542 1551 1561 1562 1570 1586 1589 1607
1639 1652 1669 1674 1677 1695 1697 1699 1712 1713 1734 1755 1765 1783
1799 1832 1837 1842 1843 1864 1870 1873 1876 1882 1899 1906 1912 1915
1918 1926 1933 1940 1942 1957 1978] are constant.
warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: invalid value encountered in divide
f = msb / msw
Training at T cells.
Training at Myeloid.
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:110: UserWarning: Features [ 1 6 11 12 18 26 30 40 41 48 52 57 61 77
79 84 90 92 96 100 101 103 104 107 108 115 117 131
136 166 170 176 178 179 185 188 189 191 197 208 209 219
226 229 231 237 239 253 254 258 266 267 275 276 278 285
286 289 306 315 317 318 325 328 337 354 364 381 398 399
403 406 407 411 422 423 424 426 429 438 440 441 447 459
467 471 472 480 484 494 508 510 530 541 546 551 554 557
565 573 584 587 591 598 611 619 620 629 631 671 672 692
711 719 722 732 742 744 749 756 758 761 764 767 781 784
788 805 807 808 812 814 838 843 846 847 848 849 850 852
861 864 870 880 882 885 887 894 895 896 909 912 941 945
949 951 954 964 974 978 979 982 991 1002 1014 1018 1022 1029
1032 1056 1057 1059 1060 1063 1069 1074 1076 1084 1093 1100 1103 1108
1109 1115 1151 1153 1160 1168 1179 1186 1187 1190 1191 1194 1195 1204
1215 1221 1224 1228 1230 1234 1242 1243 1247 1248 1250 1273 1279 1280
1284 1287 1300 1301 1313 1331 1341 1344 1359 1360 1365 1372 1375 1376
1377 1385 1386 1388 1390 1392 1398 1405 1409 1410 1413 1414 1418 1421
1426 1430 1436 1439 1446 1452 1453 1455 1458 1461 1466 1467 1474 1475
1489 1501 1518 1520 1526 1533 1534 1542 1543 1547 1551 1556 1562 1567
1568 1570 1578 1579 1586 1587 1618 1623 1632 1637 1639 1642 1660 1669
1671 1672 1684 1689 1695 1701 1702 1703 1708 1712 1715 1720 1724 1734
1739 1740 1754 1759 1768 1775 1777 1783 1788 1798 1800 1801 1808 1813
1823 1825 1837 1849 1851 1861 1864 1866 1870 1871 1876 1883 1894 1897
1904 1908 1911 1912 1918 1924 1926 1933 1935 1938 1942 1955 1957 1971
1978 1982] are constant.
warnings.warn("Features %s are constant." % constant_features_idx, UserWarning)
/usr/local/lib/python3.14/site-packages/sklearn/feature_selection/_univariate_selection.py:111: RuntimeWarning: invalid value encountered in divide
f = msb / msw
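The idea of one local classifier per parent node can be illustrated without any machine learning: each parent node only needs the cells whose path passes through it, and its training target is the next node on that path. The sketch below shows this grouping in plain Python. Whether Compocyte subsets its training data in exactly this way is an assumption; `local_training_sets` and the cell identifiers are hypothetical.

```python
from collections import defaultdict

def local_training_sets(cell_paths):
    """Group cells by parent node; the target at each node is the next node on the path."""
    training = defaultdict(list)
    for cell_id, path in cell_paths.items():
        for parent, child in zip(path, path[1:]):
            training[parent].append((cell_id, child))
    return dict(training)

# Hypothetical cells with their root-to-label paths.
cell_paths = {
    'cell_a': ('Blood', 'Lymphoid', 'B cells'),
    'cell_b': ('Blood', 'Lymphoid', 'T cells', 'CD4 T cells'),
    'cell_c': ('Blood', 'Myeloid', 'Classical monocytes'),
}

training = local_training_sets(cell_paths)
for parent in sorted(training):
    print(parent, training[parent])
```

In this toy example the node Blood sees all three cells, while T cells only sees the one cell that was routed through the lymphoid and T cell branches.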
[18]:
# Make sure the classifier has been trained by predicting on the held-out test data.
# This adds one column of predicted labels per hierarchy level to classifier.adata.obs.
classifier.load_adata(test_adata)
classifier.predict_all_child_nodes('Blood')
Predicting at Blood.
Predicting at Lymphoid.
Predicting at T cells.
Predicting at Myeloid.
[19]:
classifier.adata.obs
[19]:
| index | Level_1_pred | Level_2_pred | Level_3_pred |
|---|---|---|---|
| GCTCAAGAACCATG-1 | Myeloid | Non-classical monocytes | |
| TATTTCCTGGTGTT-1 | Lymphoid | T cells | CD4 T cells |
| TATAAGTGTGGTGT-1 | Myeloid | Non-classical monocytes | |
| AGCACTGATGCTTT-1 | Myeloid | Classical monocytes | |
| GAAACAGACATTCT-1 | Lymphoid | T cells | CD4 T cells |
| ... | ... | ... | ... |
| CTTAGACTAAACGA-1 | Lymphoid | T cells | CD4 T cells |
| CTAGAGACTTTGGG-1 | Myeloid | Non-classical monocytes | |
| TCATCAACTGTTCT-1 | Myeloid | Classical monocytes | |
| TAAGATTGTTGCTT-1 | Lymphoid | T cells | CD4 T cells |
| CTTGAACTACGCAT-1 | Lymphoid | T cells | CD4 T cells |
900 rows × 3 columns
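Since the predictions are spread over one column per level, a single final label per cell can be obtained by taking the deepest level that holds a prediction. The following is a minimal sketch in plain Python that treats each row as a dictionary. The helper `deepest_prediction` is hypothetical, and it assumes that blank cells are empty strings or None; if they are stored as NaN, they would need to be filtered differently.

```python
def deepest_prediction(row, level_columns):
    """Return the last non-empty prediction along the given level columns."""
    final = None
    for column in level_columns:
        value = row.get(column)
        if value:  # skips None and empty strings
            final = value
    return final

level_columns = ['Level_1_pred', 'Level_2_pred', 'Level_3_pred']

# Two rows mirroring the layout of the prediction table above.
monocyte_row = {
    'Level_1_pred': 'Myeloid',
    'Level_2_pred': 'Non-classical monocytes',
    'Level_3_pred': '',
}
t_cell_row = {
    'Level_1_pred': 'Lymphoid',
    'Level_2_pred': 'T cells',
    'Level_3_pred': 'CD4 T cells',
}

print(deepest_prediction(monocyte_row, level_columns))
print(deepest_prediction(t_cell_row, level_columns))
```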
[20]:
classifier.save()
Congratulations. You have trained your own Compocyte classifier. Check out the other tutorials to see how it can be used!