Data analysis methods

Data Analysis Methods

To build our Human Immune Health Atlas dataset, we utilized data and analysis environments within the Human Immune System Explorer (HISE) framework to trace data processing, analytical code, and analysis environments from the original, raw FASTQ data to our final, assembled reference atlas. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.

This analysis consisted of the following major steps:

  • Consistent pipeline processing of hashed 10x scRNA-seq data
  • Sample selection from HISE data storage
  • Labeling of cells with external reference datasets and doublet scoring
  • Separation of major cell classes and subclustering to identify marker genes
  • Expert annotation of the cell type identity of each cluster
  • Assembly of annotations and the final reference dataset
  • Generation of cell labeling models

These steps and some additional analyses are available at: 
aifimmunology/aifi-healthy-pbmc-reference

Pipeline Processing

After sequencing, Gene Expression and Hash Tag Oligonucleotide libraries from pooled samples in our pipeline batches were demultiplexed and assembled as individual files for each biological sample. See our Multiplexed scRNA-seq pipeline documentation for additional details.

Data selection from HISE storage

View Notebook: 00-R_Sample_Selection.ipynb

To assemble the input data for our reference dataset, we selected samples from all subjects in our BR1 (Sound Life Young Adult, age 25-35) and BR2 (Sound Life Older Adult, age 55-65) cohorts. We then filtered to remove a batch of samples with previously identified quality problems (B004), as well as subjects with non-healthy or abnormal disease states recorded at the time of any of their sample collection visits. We selected the baseline samples (labeled "Flu Year 1 Day 0" in the sample.visitName field). In cases where aliquots from the same blood draw were processed through our pipeline multiple times, we selected the most recently generated data (highest batch number) available at the time of file selection (maximum batch id was B145).

In addition to our adult samples, we manually selected data from 16 samples from the UP1 cohort (University of Pennsylvania Healthy Pediatric cohort, age 11-13) that were previously utilized in Thomson, et al., 2023 based on their unique sample identifiers (sample.sampleKitGuid).

Unique file identifiers (file.id) and sample metadata were saved for data retrieval and analysis steps. In total, 108 samples were selected for use in our reference (Pediatric, n = 16; Young Adult, n = 47; Older Adult, n = 45), consisting of 2,093,078 cells before additional QC filtering.

View Notebook: 01-Python_retrieve_cmv_bmi.ipynb

In addition to processed data and sample metadata identification, we retrieved clinical lab results and clinical subject data to assemble CMV infection status for all subjects, as well as BMI for all subjects.

Labeling and doublet detection

View Notebook: 02-Python_Label_Predictions_Celltypist.ipynb

To guide cell type identification by our domain experts, we next labeled cells from each sample using CellTypist (v1.6.1, Domínguez Conde, et al., 2022) using the following reference models: Immune_All_High (32 cell types), Immune_All_Low (98 cell types), and Healthy_COVID19_PBMC (51 cell types) by following the approach described in the CellTypist reference documentation at https://celltypist.org .

View Notebook: 03-R_Label_Predictions_Seurat.ipynb

We also labeled our cells using Seurat (v5.0.1) using the PBMC reference dataset described in Hao and Hao, et al., 2021, which was downloaded from the Zenodo repository (DOI: 10.5281/zenodo.7779016 ). For Seurat labeling, we utilized the SCTransform function to transform data to match the reference dataset, then used FindTransferAnchors and MapQuery to assign cell types based on the 3-level cell type labels provided in the reference dataset.

View Notebook: 04-Python_Doublet_detection.ipynb

While our cell hashing pipeline identifies doublets that occur between multiple samples, we are blind to doublets that occur with the same cell droplet. To assist in identifying these remaining doublets, we performed doublet detection using the scrublet package (v0.2.3, Wolock, et al., 2019) as implemented in the scanpy.external module of scanpy (v1.9.6, Wolf, et al., 2018).

View Notebook: 05-Python_combine_data.ipynb

After cell type labeling and doublet detection were performed on each sample, we assembled data, labels, scrublet calls, sample metadata, CMV status, and BMI data across all samples to facilitate downstream analysis.

QC filtering, clustering, and class subsetting

View Notebook: 06-Python_qc_and_clustering.ipynb

After assembly of all cells across our 108 samples, we filtered our data against a set of QC criteria designed to remove possible doublets (based on scrublet and high gene detection, > 5,000 genes), low-quality cells (based on low gene detection, < 200 genes), and dying or dead cells (based on mitochondrial UMI content, > 10%). Mitochondrial content was assessed using scanpy by identifying mitochondrial genes (prefixed with "MT-"), and using the scanpy function calculate_qc_metrics, as demonstrated in the scanpy tutorials (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). QC criteria were applied to the dataset in series, as shown in the table below. In total, 140,950 cells were removed (6.73%), and 1,952,128 cells passed QC filtering (93.27%).

QC CriteriaCells Removed% RemovedCells Remaining% Remaining
Pre-QC filtering2,093,079 (100%)100.00%
Scrublet doublet call27,9071.33%2,065,17198.67%
Mitochondrial UMIs > 10%109,9405.25%1,955,23193.41%
N Genes Detected < 2001,3060.06%1,953,92593.35%
N Genes Detected > 2,5001,7970.09%1,952,12893.27%
Total Removed140,9506.37%1,952,12893.27%

After QC filtering, all remaining cells were clustered using a scanpy workflow that consisted of normalization using scanpy.pp.normalize_total with the parameter target_sum = 1e4; log transformation using scanpy.pp.log1p; highly variable gene detection using scanpy.pp.highly_variable_genes; scaling of highly variable gene data using sc.pp.scale; Dimensionality reduction using scanpy.tl.pca with the parameter svd_solver = 'arpack'; Cohort integration using the Harmony method (harmonypy v0.0.9; Korsunsky, et al., 2019) implemented in scanpy.external.harmony_integrate with the parameter key = 'cohort.cohortGuid'; Nearest neighbor calculation with scanpy.pp.neighbors and parameters n_neighbors = 50, use_rep = 'X_pca_harmony', and n_pcs = 30; UMAP embedding with scanpy.tl.umap; and Leiden clustering (leidenalg v0.10.1; Traag, et al., 2018) using scanpy.tl.leiden. Unless specified, default parameters provided by scanpy v1.9.6 were used.

After clustering, clusters were assigned to major cell classes based on marker gene detection thresholds. For each cluster, the fraction of cells with detected gene expression for each marker was computed, and clusters meeting a set of cell class-specific criteria were selected for downstream annotation by our domain experts. These thresholds are shown in the table below.

PopulationThresholdsN cells
B cells
07a-Python_B_cells_without_Igs.ipynb
(MS4A1 > 0.5)
OR (TNFRSF13B > 0.8)
177,994
Myeloid cells
08-Python_Myeloid_cells.ipynb
(FCN1 > 0.5)
OR (IL3A > 0.5)
397,356
NK cells
09-Python_NK_cells.ipynb
(NCAM1 > 0.2)160,848
T cells
11-Python_T_cells.ipynb
(CD3D > 0.5)
AND NOT (
    (HBB > 0.8)
    OR (FCN1 > 0.5)
    OR (PPBP > 0.6)
)
1,191,327
Other cells
10-Python_Others.ipynb
(
   (HBB > 0.8)
   OR (PPBP > 0.3)
   OR (PRSS57 > 0.2)
)
AND NOT (FCN1 > 0.5)
24,603

After selection of cells in each cell class, each population of cells was separately reprocessed using the scanpy workflow shown above, with the addition of Leiden clustering with multiple resolution parameters for each class: B cells, resolution = 1, 1.5, and 2; Myeloid cells, resolution = 1, 1.5, 2, 2.5, and 3; NK cells, resolution = 1, 1.5, and 2; T cells, resolution = 1, 1.5, and 2; and Other cells, resolution = 1, 1.5, and 2. In the case of B cells, this iterative clustering was carried out after excluding Immunoglobulin-related genes from the list of highly variable genes by removing gene symbols that started with "IGH", "IGL", or "IGK" so that the strong, binary switching of these genes did not drive clustering.

Iterative clustering of specific populations

To assist in identification of specific populations within large cell classes, we performed selective iterative subset identification and re-clustering. 

SubpopulationParent (resolution)CriteriaN cells
Dendritic cells
08a-Python_Myeloid_cells_DCs.ipynb
Myeloid cells (2)(IRF8 > 0.6)
OR(FCER1A > 0.1)
34,641
CD56dim NK cells
09a-Python_NK_CD56dim.ipynb
NK cells (1.5)NOT (GZMK > 0.4)
AND NOT (IL32 > 0.6)
AND NOT (IL7R > 0.8)
AND NOT (MKI67 > 0.6)
AND NOT (ISG15 > 0.6)
AND NOT (HBB > 0.2)
AND NOT (PPBP > 0.2)
105,676
CD4 Naive T cells
11a-Python_T_cells_cd4-naive.ipynb
T cells (1.5)(FHIT > 0.3)379,659
MAIT cells
11b-Python_T_cells_cd8-mait.ipynb
T cells (1.5)(SLC4A10 > 0.3)50,823
CD8 CM T cells
11c-Python_T_cells_cd8-cm.ipynb
T cells (1.5)(CD8A > 0.5)
AND NOT (GZMK > 0.7)
AND NOT (GZMB > 0.5)
AND NOT (CCR7 > 0.7)
43,289
CD8 EM T cells
11d-Python_T_cells_cd8-em.ipynb
T cells (1.5)(CD8A > 0.5)
AND (GZMB > 0.5)
117,904
Treg cells
11e-Python_T_cells_treg.ipynb
T cells (1.5)(IKZF2 > 0.3)39,087
CD8 Naive cells
11f-Python_T_cells_cd8-naive.ipynb
T cells (1.5)(CD8A > 0.5)
AND (CCR7 > 0.6)
121,643
Proliferating T cells
11g-Python_T_cells_proliferating.ipynb
T cells (1.5)(MKI67 > 0.5)4,330
ISG-high T cells
11i-Python_T_cells_isg-high.ipynb
T cells (1.5)(IFI44L > 0.5)14,140
Other T cells
11j-Python_T_cells_other.ipynb
T cells (1.5)Not in any other T subset376,762

After subsetting, each subpopulation was iteratively processed using the same scanpy workflow described above. Resolutions used for Leiden clustering these subpopulations were as follows: Dendritic cells, resolution = 1 and 2; CD56dim NK cells, resolution = 1.5; CD4 Naive, resolution = 1.5; MAIT cells, resolution = 3; CD8 CM T cells, resolution = 1.5; CD8 EM T cells, resolution = 3; Treg cells, resolution = 1.5; CD8 Naive cells, resolution = 1.5; Proliferating T cells, resolution = 1.5; ISG-high T cells, resolution = 2; Other T cells, resolution = 1.5.

View Notebook: 11h-Python_T_cells_gd.ipynb

To subset gdT cells, we incorporated cells expressing the TRDC gene from the set of all T cells, as well as those from iteratively clustered CD8 CM, CD8 EM, and MAIT cells that did not separate in global T cell clustering.

SubpopulationParent (resolution)CriteriaN cells
gdT cellsT cells (1.5)(TRDC > 0.5)33,088
gdT cellsCD8 CM T cells (1.5)(TRDC > 0.5)5,721
gdT cellsCD8 EM T cells (3)(TRDC > 0.5)12,565
gdT cellsMAIT (3)(TRDC > 0.5)2,739

As with other subtypes, gdT cells were reprocessed and clustered with a Leiden resolution of 1.5 to identify specific cell types.

Marker gene computation

To assist with cell labeling, marker genes were identified for each cluster at each level of iteration and at each clustering resolution using the scanpy function scanpy.tl.rank_genes_groups with the parameter method = 'wilcoxon'. Results were obtained using scanpy.get.rank_genes_groups_df, and were utilized by our annotation team.

Expert annotation of cell types

Following high-level and iterative clustering, teams of domain experts within the Allen Institute for Immunology examined marker gene expression for clustered datasets, and assigned cell type identities to each cluster. As part of this cell type identification exercise, any remaining doublets or low-quality clusters were also identified for later removal during dataset assembly.

Cell type nomenclature and multiple levels of cell type resolution were harmonized across our domain experts. We identified 9 low-resolution cell classes (AIFI_L1), 29 mid-level cell classes (AIFI_L2), and 71 high-resolution cell classes (AIFI_L3).

Assembly of atlas dataset

After annotation, cluster labels were transferred to individual cell barcodes to assemble a final set of labels across all cells in our dataset. We then generated a "clean" version of the reference datsaet by removing all cells labeled as doublets or contaminated by high levels of Erythrocyte-specific gene expression based on their cell type annotations.

View Notebook: 22-Python_assemble_final_labels.ipynb
View Notebook: 23-Python_generate_clean_version.ipynb

After assignment of labels to individual cell barcodes for each major cell class, we finally assembled labels across all cell classes and generated the full dataset with all labels and without any cells labeled as doublets or low quality.

Training cell type labeling models

View Notebook: 30-Python_celltypist_L1_model.ipynb
View Notebook: 31-Python_celltypist_L2_model.ipynb
View Notebook: 32-Python_celltypist_L3_model.ipynb

In order to utilize our cell type atlas to label other PBMC datasets, we used CellTypist (v1.6.2; Domínguez Conde, et al., 2022) to generate a cell type labeling models. CellTypist utilizes logistic regression as part of its model generation process with a One-vs-Rest (OvR) approach to multiple classes. We slightly modified CellTypist v1.6.2 to also allow the multinomial method provided by the LogisticRegression() function in the scikit.learn Python package, available at aifimmunology/multicelltypist . We used this modified version of CellTypist to train models for each level of our cell type labels using these two approaches. We utilized OvR regression for AIFI_L1 and AIFI_L2 labels, and Multinomial regression for AIFI_L3.

References

Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197

Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021;184: 3573–3587.e29.
doi:10.1016/j.cell.2021.04.048

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16: 1289–1296.
doi:10.1038/s41592-019-0619-0

Thomson Z, He Z, Swanson E, Henderson K, Phalen C, Zaim SR, et al. Trimodal single-cell profiling reveals a novel pediatric CD8αα+ T cell subset and broad age-related molecular reprogramming across the T cell compartment. Nat Immunol. 2023;24: 1947–1959.
doi:10.1038/s41590-023-01641-8

Traag V, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. arXiv [cs.SI]. 2018. Available: http://arxiv.org/abs/1810.08473
doi:10.48550/arXiv.1810.08473

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0

Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281–291.e9.
doi:10.1016/j.cels.2018.11.005