Data Analysis Methods
To build our Human Immune Health Atlas dataset, we utilized data and analysis environments within the Human Immune System Explorer (HISE) framework to trace data processing, analytical code, and analysis environments from the original, raw FASTQ data to our final, assembled reference atlas. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.
This analysis consisted of the following major steps:
- Consistent pipeline processing of hashed 10x scRNA-seq data
- Sample selection from HISE data storage
- Labeling of cells with external reference datasets and doublet scoring
- Separation of major cell classes and subclustering to identify marker genes
- Expert annotation of the cell type identity of each cluster
- Assembly of annotations and the final reference dataset
- Generation of cell labeling models
These steps and some additional analyses are available at:
aifimmunology/aifi-healthy-pbmc-reference
Pipeline Processing
After sequencing, Gene Expression and Hash Tag Oligonucleotide libraries from pooled samples in our pipeline batches were demultiplexed and assembled as individual files for each biological sample. See our Multiplexed scRNA-seq pipeline documentation for additional details.
Data selection from HISE storage
View Notebook: 00-R_Sample_Selection.ipynb
To assemble the input data for our reference dataset, we selected samples from all subjects in our BR1 (Sound Life Young Adult, age 25-35) and BR2 (Sound Life Older Adult, age 55-65) cohorts. We then filtered to remove a batch of samples with previously identified quality problems (B004), as well as subjects with non-healthy or abnormal disease states recorded at the time of any of their sample collection visits. We selected the baseline samples (labeled "Flu Year 1 Day 0" in the sample.visitName field). In cases where aliquots from the same blood draw were processed through our pipeline multiple times, we selected the most recently generated data (highest batch number) available at the time of file selection (maximum batch id was B145).
In addition to our adult samples, we manually selected data from 16 samples from the UP1 cohort (University of Pennsylvania Healthy Pediatric cohort, age 11-13) that were previously utilized in Thomson, et al., 2023 based on their unique sample identifiers (sample.sampleKitGuid).
Unique file identifiers (file.id) and sample metadata were saved for data retrieval and analysis steps. In total, 108 samples were selected for use in our reference (Pediatric, n = 16; Young Adult, n = 47; Older Adult, n = 45), consisting of 2,093,078 cells before additional QC filtering.
View Notebook: 01-Python_retrieve_cmv_bmi.ipynb
In addition to processed data and sample metadata identification, we retrieved clinical lab results and clinical subject data to assemble CMV infection status for all subjects, as well as BMI for all subjects.
Labeling and doublet detection
View Notebook: 02-Python_Label_Predictions_Celltypist.ipynb
To guide cell type identification by our domain experts, we next labeled cells from each sample using CellTypist (v1.6.1, Domínguez Conde, et al., 2022) using the following reference models: Immune_All_High (32 cell types), Immune_All_Low (98 cell types), and Healthy_COVID19_PBMC (51 cell types) by following the approach described in the CellTypist reference documentation at https://celltypist.org .
View Notebook: 03-R_Label_Predictions_Seurat.ipynb
We also labeled our cells using Seurat (v5.0.1) using the PBMC reference dataset described in Hao and Hao, et al., 2021, which was downloaded from the Zenodo repository (DOI: 10.5281/zenodo.7779016 ). For Seurat labeling, we utilized the SCTransform function to transform data to match the reference dataset, then used FindTransferAnchors and MapQuery to assign cell types based on the 3-level cell type labels provided in the reference dataset.
View Notebook: 04-Python_Doublet_detection.ipynb
While our cell hashing pipeline identifies doublets that occur between multiple samples, we are blind to doublets that occur with the same cell droplet. To assist in identifying these remaining doublets, we performed doublet detection using the scrublet package (v0.2.3, Wolock, et al., 2019) as implemented in the scanpy.external module of scanpy (v1.9.6, Wolf, et al., 2018).
View Notebook: 05-Python_combine_data.ipynb
After cell type labeling and doublet detection were performed on each sample, we assembled data, labels, scrublet calls, sample metadata, CMV status, and BMI data across all samples to facilitate downstream analysis.
QC filtering, clustering, and class subsetting
View Notebook: 06-Python_qc_and_clustering.ipynb
After assembly of all cells across our 108 samples, we filtered our data against a set of QC criteria designed to remove possible doublets (based on scrublet and high gene detection, > 5,000 genes), low-quality cells (based on low gene detection, < 200 genes), and dying or dead cells (based on mitochondrial UMI content, > 10%). Mitochondrial content was assessed using scanpy by identifying mitochondrial genes (prefixed with "MT-"), and using the scanpy function calculate_qc_metrics, as demonstrated in the scanpy tutorials (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). QC criteria were applied to the dataset in series, as shown in the table below. In total, 140,950 cells were removed (6.73%), and 1,952,128 cells passed QC filtering (93.27%).
QC Criteria | Cells Removed | % Removed | Cells Remaining | % Remaining |
---|---|---|---|---|
Pre-QC filtering | 2,093,079 (100%) | 100.00% | ||
Scrublet doublet call | 27,907 | 1.33% | 2,065,171 | 98.67% |
Mitochondrial UMIs > 10% | 109,940 | 5.25% | 1,955,231 | 93.41% |
N Genes Detected < 200 | 1,306 | 0.06% | 1,953,925 | 93.35% |
N Genes Detected > 2,500 | 1,797 | 0.09% | 1,952,128 | 93.27% |
Total Removed | 140,950 | 6.37% | 1,952,128 | 93.27% |
After QC filtering, all remaining cells were clustered using a scanpy workflow that consisted of normalization using scanpy.pp.normalize_total with the parameter target_sum = 1e4; log transformation using scanpy.pp.log1p; highly variable gene detection using scanpy.pp.highly_variable_genes; scaling of highly variable gene data using sc.pp.scale; Dimensionality reduction using scanpy.tl.pca with the parameter svd_solver = 'arpack'; Cohort integration using the Harmony method (harmonypy v0.0.9; Korsunsky, et al., 2019) implemented in scanpy.external.harmony_integrate with the parameter key = 'cohort.cohortGuid'; Nearest neighbor calculation with scanpy.pp.neighbors and parameters n_neighbors = 50, use_rep = 'X_pca_harmony', and n_pcs = 30; UMAP embedding with scanpy.tl.umap; and Leiden clustering (leidenalg v0.10.1; Traag, et al., 2018) using scanpy.tl.leiden. Unless specified, default parameters provided by scanpy v1.9.6 were used.
After clustering, clusters were assigned to major cell classes based on marker gene detection thresholds. For each cluster, the fraction of cells with detected gene expression for each marker was computed, and clusters meeting a set of cell class-specific criteria were selected for downstream annotation by our domain experts. These thresholds are shown in the table below.
Population | Thresholds | N cells |
---|---|---|
B cells 07a-Python_B_cells_without_Igs.ipynb | (MS4A1 > 0.5) OR (TNFRSF13B > 0.8) | 177,994 |
Myeloid cells 08-Python_Myeloid_cells.ipynb | (FCN1 > 0.5) OR (IL3A > 0.5) | 397,356 |
NK cells 09-Python_NK_cells.ipynb | (NCAM1 > 0.2) | 160,848 |
T cells 11-Python_T_cells.ipynb | (CD3D > 0.5) AND NOT ( (HBB > 0.8) OR (FCN1 > 0.5) OR (PPBP > 0.6) ) | 1,191,327 |
Other cells 10-Python_Others.ipynb | ( (HBB > 0.8) OR (PPBP > 0.3) OR (PRSS57 > 0.2) ) AND NOT (FCN1 > 0.5) | 24,603 |
After selection of cells in each cell class, each population of cells was separately reprocessed using the scanpy workflow shown above, with the addition of Leiden clustering with multiple resolution parameters for each class: B cells, resolution = 1, 1.5, and 2; Myeloid cells, resolution = 1, 1.5, 2, 2.5, and 3; NK cells, resolution = 1, 1.5, and 2; T cells, resolution = 1, 1.5, and 2; and Other cells, resolution = 1, 1.5, and 2. In the case of B cells, this iterative clustering was carried out after excluding Immunoglobulin-related genes from the list of highly variable genes by removing gene symbols that started with "IGH", "IGL", or "IGK" so that the strong, binary switching of these genes did not drive clustering.
Iterative clustering of specific populations
To assist in identification of specific populations within large cell classes, we performed selective iterative subset identification and re-clustering.
Subpopulation | Parent (resolution) | Criteria | N cells |
---|---|---|---|
Dendritic cells 08a-Python_Myeloid_cells_DCs.ipynb | Myeloid cells (2) | (IRF8 > 0.6) OR(FCER1A > 0.1) | 34,641 |
CD56dim NK cells 09a-Python_NK_CD56dim.ipynb | NK cells (1.5) | NOT (GZMK > 0.4) AND NOT (IL32 > 0.6) AND NOT (IL7R > 0.8) AND NOT (MKI67 > 0.6) AND NOT (ISG15 > 0.6) AND NOT (HBB > 0.2) AND NOT (PPBP > 0.2) | 105,676 |
CD4 Naive T cells 11a-Python_T_cells_cd4-naive.ipynb | T cells (1.5) | (FHIT > 0.3) | 379,659 |
MAIT cells 11b-Python_T_cells_cd8-mait.ipynb | T cells (1.5) | (SLC4A10 > 0.3) | 50,823 |
CD8 CM T cells 11c-Python_T_cells_cd8-cm.ipynb | T cells (1.5) | (CD8A > 0.5) AND NOT (GZMK > 0.7) AND NOT (GZMB > 0.5) AND NOT (CCR7 > 0.7) | 43,289 |
CD8 EM T cells 11d-Python_T_cells_cd8-em.ipynb | T cells (1.5) | (CD8A > 0.5) AND (GZMB > 0.5) | 117,904 |
Treg cells 11e-Python_T_cells_treg.ipynb | T cells (1.5) | (IKZF2 > 0.3) | 39,087 |
CD8 Naive cells 11f-Python_T_cells_cd8-naive.ipynb | T cells (1.5) | (CD8A > 0.5) AND (CCR7 > 0.6) | 121,643 |
Proliferating T cells 11g-Python_T_cells_proliferating.ipynb | T cells (1.5) | (MKI67 > 0.5) | 4,330 |
ISG-high T cells 11i-Python_T_cells_isg-high.ipynb | T cells (1.5) | (IFI44L > 0.5) | 14,140 |
Other T cells 11j-Python_T_cells_other.ipynb | T cells (1.5) | Not in any other T subset | 376,762 |
After subsetting, each subpopulation was iteratively processed using the same scanpy workflow described above. Resolutions used for Leiden clustering these subpopulations were as follows: Dendritic cells, resolution = 1 and 2; CD56dim NK cells, resolution = 1.5; CD4 Naive, resolution = 1.5; MAIT cells, resolution = 3; CD8 CM T cells, resolution = 1.5; CD8 EM T cells, resolution = 3; Treg cells, resolution = 1.5; CD8 Naive cells, resolution = 1.5; Proliferating T cells, resolution = 1.5; ISG-high T cells, resolution = 2; Other T cells, resolution = 1.5.
View Notebook: 11h-Python_T_cells_gd.ipynb
To subset gdT cells, we incorporated cells expressing the TRDC gene from the set of all T cells, as well as those from iteratively clustered CD8 CM, CD8 EM, and MAIT cells that did not separate in global T cell clustering.
Subpopulation | Parent (resolution) | Criteria | N cells |
---|---|---|---|
gdT cells | T cells (1.5) | (TRDC > 0.5) | 33,088 |
gdT cells | CD8 CM T cells (1.5) | (TRDC > 0.5) | 5,721 |
gdT cells | CD8 EM T cells (3) | (TRDC > 0.5) | 12,565 |
gdT cells | MAIT (3) | (TRDC > 0.5) | 2,739 |
As with other subtypes, gdT cells were reprocessed and clustered with a Leiden resolution of 1.5 to identify specific cell types.
Marker gene computation
To assist with cell labeling, marker genes were identified for each cluster at each level of iteration and at each clustering resolution using the scanpy function scanpy.tl.rank_genes_groups with the parameter method = 'wilcoxon'. Results were obtained using scanpy.get.rank_genes_groups_df, and were utilized by our annotation team.
Expert annotation of cell types
Following high-level and iterative clustering, teams of domain experts within the Allen Institute for Immunology examined marker gene expression for clustered datasets, and assigned cell type identities to each cluster. As part of this cell type identification exercise, any remaining doublets or low-quality clusters were also identified for later removal during dataset assembly.
Cell type nomenclature and multiple levels of cell type resolution were harmonized across our domain experts. We identified 9 low-resolution cell classes (AIFI_L1), 29 mid-level cell classes (AIFI_L2), and 71 high-resolution cell classes (AIFI_L3).
Assembly of atlas dataset
After annotation, cluster labels were transferred to individual cell barcodes to assemble a final set of labels across all cells in our dataset. We then generated a "clean" version of the reference datsaet by removing all cells labeled as doublets or contaminated by high levels of Erythrocyte-specific gene expression based on their cell type annotations.
View Notebook: 22-Python_assemble_final_labels.ipynb
View Notebook: 23-Python_generate_clean_version.ipynb
After assignment of labels to individual cell barcodes for each major cell class, we finally assembled labels across all cell classes and generated the full dataset with all labels and without any cells labeled as doublets or low quality.
Training cell type labeling models
View Notebook: 30-Python_celltypist_L1_model.ipynb
View Notebook: 31-Python_celltypist_L2_model.ipynb
View Notebook: 32-Python_celltypist_L3_model.ipynb
In order to utilize our cell type atlas to label other PBMC datasets, we used CellTypist (v1.6.2; Domínguez Conde, et al., 2022) to generate a cell type labeling models. CellTypist utilizes logistic regression as part of its model generation process with a One-vs-Rest (OvR) approach to multiple classes. We slightly modified CellTypist v1.6.2 to also allow the multinomial method provided by the LogisticRegression() function in the scikit.learn Python package, available at aifimmunology/multicelltypist . We used this modified version of CellTypist to train models for each level of our cell type labels using these two approaches. We utilized OvR regression for AIFI_L1 and AIFI_L2 labels, and Multinomial regression for AIFI_L3.
References
Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197
Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184: 3573–3587.e29.
doi:10.1016/j.cell.2021.04.048
Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16: 1289–1296.
doi:10.1038/s41592-019-0619-0
Thomson Z, He Z, Swanson E, Henderson K, Phalen C, Zaim SR, et al. Trimodal single-cell profiling reveals a novel pediatric CD8αα+ T cell subset and broad age-related molecular reprogramming across the T cell compartment. Nat Immunol. 2023;24: 1947–1959.
doi:10.1038/s41590-023-01641-8
Traag V, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. arXiv [cs.SI]. 2018. Available: http://arxiv.org/abs/1810.08473
doi:10.48550/arXiv.1810.08473
Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0
Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281–291.e9.
doi:10.1016/j.cels.2018.11.005