To build our longitudinally sampled PBMC scRNA-seq dataset from 95 healthy subjects, we utilized data and analysis environments within the Human Immune System Explorer (HISE) system to trace data processing, analytical code, and analysis environments from raw FASTQ data to a labeled, high-quality final dataset. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.
This analysis consisted of the following major steps:
- Consistent pipeline processing of hashed 10x scRNA-seq data
- Sample selection from HISE data storage
- Labeling of cells using our CellTypist models and doublet scoring
- Separation of cell classes (AIFI_L2 labels) and subclustering to identify cross-class doublets and low-quality clusters
- Clustering and label refinement for high-resolution population subset labels (AIFI_L3 labels)
- Assembly of final datasets
- Differential expression tests
These steps and some additional analyses are available at:
aifimmunology/sound-life-scrna-analysis
Pipeline processing
After sequencing, Gene Expression and Hash Tag Oligonucleotide libraries from pooled samples in our pipeline batches were demultiplexed and assembled as individual files for each biological sample. See our Multiplexed scRNA-seq pipeline documentation for additional details.
Data selection from HISE storage
View Notebook: 00-R_Sample_Selection.ipynb
To assemble the input data for our dataset, we selected all available samples from all subjects in our SoundLife Young Adult (age 25-35) and SoundLife Older Adult (age 55-65) cohorts. We then filtered to remove a batch of samples with previously identified quality problems (B004), as well as subjects with non-healthy or abnormal disease states recorded at the time of any of their sample collection visits. In cases where aliquots from the same blood draw were processed through our pipeline multiple times, we selected the most recently generated data (highest batch number).
In total, 868 samples were selected for use in our dataset (Young Adult, n = 49 subjects; n = 418 samples; Older Adult, n = 47 subjects, n = 450 samples), consisting of 15,781,886 cells before additional QC filtering.
View Notebook: 01-Python_retrieve_cmv_bmi.ipynb
In addition to processed data and sample metadata identification, we retrieved clinical lab results and clinical subject data to assemble CMV infection status for all subjects, as well as BMI for all adult subjects (Sound Life Cohorts).
Labeling and doublet detection
To add cell subset labels, we utilized CellTypist (v1.6.1, Dominguez Conde, et al.) and the CellTypist models we generated using our Immune Health PBMC Atlas dataset at 3 levels of subset label resolution (AIFI_L1, 9 broad cell classes; AIFI_L2, 29 intermediate cell classes, and AIFI_L3, 71 high-resolution cell classes) by following the approach described in the CellTypist reference documentation at https://celltypist.org.
Subset labels were assigned in two stages. See the Github repository for additional details.
L1 and L2 labels:
View Notebook: 02a-Python_label_predictions_celltypist_L1-L2_BR1_Female.ipynb
View Notebook: 02b-Python_label_predictions_celltypist_L1-L2_BR1_Male.ipynb
View Notebook: 02c-Python_label_predictions_celltypist_L1-L2_BR2_Female.ipynb
View Notebook: 02d-Python_label_predictions_celltypist_L1-L2_BR2_Male.ipynb
L3 labels:
View Notebook: 08a-Python_label_predictions_celltypist_L3_BR1_Female.ipynb
View Notebook: 08b-Python_label_predictions_celltypist_L3_BR1_Male.ipynb
View Notebook: 08c-Python_label_predictions_celltypist_L3_BR2_Female.ipynb
View Notebook: 08d-Python_label_predictions_celltypist_L3_BR2_Male.ipynb
View Notebook: 03-Python_doublet_detection.ipynb
While our cell hashing pipeline identifies doublets that occur between multiple samples, we are blind to doublets that occur with the same cell droplet. To assist in identifying these remaining doublets, we performed doublet detection using the scrublet package (v0.2.3, Wolock, et al.) as implemented in the scanpy.external module of scanpy (v1.9.6, Alexander Wolf, et al.).
View Notebook: 04-Python_assembly_of_subsets.ipynb
After subset labeling and doublet detection were performed on each sample, we assembled data, labels, scrublet calls, sample metadata, CMV status, and BMI data across 8 sets of samples defined by the cohort, biological sex, and CMV status of subjects to facilitate downstream analyses.
QC filtering
View Notebook: 05-Python_subset_qc_filtering.ipynb
After assembly of all cells across our samples, we filtered our data against a set of QC criteria designed to remove possible doublets (based on scrublet and high gene detection, > 5,000 genes), low-quality cells (based on low gene detection, < 200 genes), and dying or dead cells (based on mitochondrial UMI content, > 10%). Mitochondrial content was assessed using scanpy by identifying mitochondrial genes (prefixed with "MT-"), and using the scanpy function calculate_qc_metrics, as demonstrated in the scanpy tutorials (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). QC criteria were applied to the dataset in series, as shown in the table below. In total, 951,864 cells were removed (6.03%), and 14,830,022 cells passed QC filtering (93.97%).
QC Criteria | Cells Removed | Cells Remaining |
---|---|---|
Pre-QC filtering | NA | 15,781,886 (100%) |
Scrublet doublet call | 156,381 (0.99%) | 15,625,505 (99.00%) |
Mitochondrial UMIs > 10% | 770,582 (4.88%) | 14,854,923 (94.13%) |
N Genes Detected < 200 | 6,302 (0.04%) | 14,848,621 (94.09%) |
N Genes Detected > 2,500 | 18,599 (0.12%) | 14,830,022 (93.97%) |
Total Removed | 951,864 (6.03%) | 14,830,022 (93.97%) |
Clustering, class subsetting, and marker-based doublet removal
After QC filtering, all remaining cells were clustered within AIFI_L2 labels using a scanpy workflow that consisted of normalization using scanpy.pp.normalize_total with the parameter target_sum = 1e4; log transformation using scanpy.pp.log1p; highly variable gene detection using scanpy.pp.highly_variable_genes; scaling of highly variable gene data using sc.pp.scale; Dimensionality reduction using scanpy.tl.pca with the parameter svd_solver = 'arpack'; Cohort integration using the Harmony method (harmonypy v0.0.9; Korsunsky, et al.) implemented in scanpy.external.harmony_integrate with the parameter key = 'cohort.cohortGuid'; Nearest neighbor calculation with scanpy.pp.neighbors and parameters n_neighbors = 50, use_rep = 'X_pca_harmony', and n_pcs = 30; UMAP embedding with scanpy.tl.umap; and Leiden clustering (leidenalg v0.10.1; Traag, et al.) using scanpy.tl.leiden. Unless specified, default parameters provided by scanpy v1.9.6 were used.
For most AIFI_L2 cell classes, the processing and clustering described above was performed using cells from all samples. However, 5 cell classes consisting of > 1 million cells were processed separately as 8 sets of cells based on cohort, sex, and CMV status to make file handling and clustering processes more manageable: CD4 naive T cells, CD4 memory T cells, CD8 memory T cells, CD56dim NK cells, and CD14 Monocytes.
Large subset clustering at Intermediate (L2) subset resolution:
View Notebook: 06a-Python_partition_large_cell_classes_BR1_Female.ipynb
View Notebook: 06b-Python_partition_large_cell_classes_BR1_Male.ipynb
View Notebook: 06c-Python_partition_large_cell_classes_BR2_Female.ipynb
View Notebook: 06d-Python_partition_large_cell_classes_BR2_Male.ipynb
All other subsets
View Notebook: 06e-Python_partition_small_cell_classes.ipynb
View Notebook: 07-Python_filter_cell_classes.ipynb
View Notebook: 07a-R_filter_review_plots.ipynb
After clustering, clusters were filtered to remove doublets based on marker gene expression. For example, CD4 naive T cell clusters with a high fraction of MS4A1 expression were removed as B cell doublets. For each cluster, the fraction of cells with detected gene expression for each marker was computed, and clusters meeting a set of cell class-specific criteria were removed. These thresholds are shown in the table below.
Population | Reason | Threshhold | N cells |
---|---|---|---|
ASDC | B cell doublet | MS4A1 > 0.2 | 18 |
ASDC | Platelet doublet | PPBP > 0.4 | 37 |
ASDC | T cell doublet | CD3E > 0.4 | 303 |
CD14 monocyte | B cell doublet | MS4A1 > 0.2 | 6,076 |
CD14 monocyte | Platelet doublet | PPBP > 0.4 | 56,817 |
CD14 monocyte | T cell doublet | CD3E > 0.1 OR IL7R > 0.2 | 115,867 |
CD16 monocyte | B cell doublet | MS4A1 > 0.4 | 4,366 |
CD16 monocyte | Erythrocyte doublet | HBB > 0.2 | 1,677 |
CD16 monocyte | Platelet doublet | PPBP > 0.4 | 15,564 |
CD16 monocyte | T cell doublet | CD3E > 0.2 | 34,617 |
CD56bright NK cell | B cell doublet | MS4A1 > 0.4 | 444 |
CD56bright NK cell | Erythrocyte doublet | HBB > 0.2 | 380 |
CD56bright NK cell | Myeloid doublet | FCN1 > 0.4 | 738 |
CD56bright NK cell | Platelet doublet | PPBP > 0.4 | 1,277 |
CD56bright NK cell | T cell doublet | CD3D > 0.4 | 4,863 |
CD56dim NK cell | B cell doublet | MS4A1 > 0.4 | 4,861 |
CD56dim NK cell | Myeloid doublet | FCN1 > 0.4 | 10,610 |
CD56dim NK cell | Platelet doublet | PPBP > 0.4 | 19,301 |
CD56dim NK cell | T cell doublet | IL7R > 0.4 | 30,614 |
CD8aa | Myeloid doublet | FCN1 > 0.2 | 139 |
CD8aa | Platelet doublet | PPBP > 0.4 | 88 |
cDC1 | Platelet doublet | PPBP > 0.2 | 158 |
cDC1 | T cell doublet | CD3D > 0.2 OR IL7R > 0.2 | 821 |
cDC2 | Erythrocyte doublet | HBB > 0.4 | 391 |
cDC2 | Platelet doublet | PPBP > 0.2 | 1,440 |
cDC2 | T cell doublet | CD3D > 0.2 | 10,075 |
DN T cell | Erythrocyte doublet | HBB > 0.4 | 53 |
DN T cell | Myeloid doublet | FCN1 > 0.2 | 144 |
DN T cell | Platelet doublet | PPBP > 0.4 | 98 |
Effector B cell | Erythrocyte doublet | HBB > 0.4 | 333 |
Effector B cell | Platelet doublet | PPBP > 0.4 | 641 |
Effector B cell | T cell doublet | CD3D > 0.2 OR IL7R > 0.2 | 4,751 |
Erythrocyte | B cell doublet | MS4A1 > 0.4 | 1,985 |
Erythrocyte | Myeloid doublet | FCN1 > 0.2 | 3,138 |
Erythrocyte | Platelet doublet | PPBP > 0.4 | 2,069 |
gdT | Erythrocyte doublet | HBB > 0.2 | 595 |
gdT | Myeloid doublet | FCN1 > 0.2 | 2,381 |
gdT | Platelet doublet | PPBP > 0.4 | 4,822 |
ILC | Erythrocyte doublet | HBB > 0.4 | 8 |
ILC | Platelet doublet | PPBP > 0.2 | 61 |
ILC | T cell doublet | CD3D > 0.4 | 311 |
Intermediate monocyte | B cell doublet | MS4A1 > 0.4 | 342 |
Intermediate monocyte | Erythrocyte doublet | HBB > 0.4 | 414 |
Intermediate monocyte | Platelet doublet | PPBP > 0.2 | 1,928 |
Intermediate monocyte | T cell doublet | CD3E > 0.2 | 9,627 |
MAIT | B cell doublet | MS4A1 > 0.4 | 1,282 |
MAIT | Erythrocyte doublet | HBB > 0.4 | 1,717 |
MAIT | Myeloid doublet | FCN1 > 0.2 | 4,154 |
MAIT | Platelet doublet | PPBP > 0.2 | 8,173 |
Memory B cell | Erythrocyte doublet | HBB > 0.4 | 4,231 |
Memory B cell | Myeloid doublet | FCN1 > 0.2 | 6,726 |
Memory B cell | Platelet doublet | PPBP > 0.2 | 8,168 |
Memory B cell | T cell doublet | CD3D > 0.4 | 6,610 |
Memory CD4 T cell | Erythrocyte doublet | HBB > 0.4 | 5,840 |
Memory CD4 T cell | Myeloid doublet | FCN1 > 0.2 | 13,774 |
Memory CD8 T cell | B cell doublet | MS4A1 > 0.4 | 3,100 |
Memory CD8 T cell | Myeloid doublet | FCN1 > 0.2 | 10,487 |
Memory CD8 T cell | Platelet doublet | PPBP > 0.2 | 16,354 |
Naive B cell | Erythrocyte doublet | HBB > 0.4 | 1,670 |
Naive B cell | Myeloid doublet | FCN1 > 0.2 | 1,005 |
Naive B cell | Platelet doublet | PPBP > 0.2 | 26,396 |
Naive B cell | T cell doublet | CD3D > 0.4 | 7,946 |
Naive CD4 T cell | B cell doublet | MS4A1 > 0.2 | 4,040 |
Naive CD4 T cell | Erythrocyte doublet | HBB > 0.4 | 19,387 |
Naive CD4 T cell | Myeloid doublet | FCN1 > 0.2 | 22,122 |
Naive CD4 T cell | Platelet doublet | PPBP > 0.2 | 49,784 |
Naive CD8 T cell | B cell doublet | MS4A1 > 0.2 | 1,519 |
Naive CD8 T cell | Erythrocyte doublet | HBB > 0.4 | 29,596 |
Naive CD8 T cell | Myeloid doublet | FCN1 > 0.2 | 3,723 |
Naive CD8 T cell | Platelet doublet | PPBP > 0.2 | 21,018 |
pDC | B cell doublet | MS4A1 > 0.2 | 233 |
pDC | Erythrocyte doublet | HBB > 0.4 | 247 |
pDC | Myeloid doublet | FCN1 > 0.2 | 1,367 |
pDC | Platelet doublet | PPBP > 0.2 | 2,869 |
pDC | T cell doublet | CD3D > 0.4 | 3,472 |
Plasma cell | Erythrocyte doublet | HBB > 0.4 | 53 |
Plasma cell | Myeloid doublet | FCN1 > 0.2 | 50 |
Plasma cell | Platelet doublet | PPBP > 0.4 | 39 |
Plasma cell | T cell doublet | CD3D > 0.4 | 1,956 |
Platelet | B cell doublet | MS4A1 > 0.4 | 467 |
Platelet | Erythrocyte doublet | HBB > 0.4 | 3,345 |
Platelet | Myeloid doublet | FCN1 > 0.2 | 4,284 |
Platelet | T cell doublet | CD3D > 0.4 | 5,367 |
Progenitor cell | B cell doublet | MS4A1 > 0.4 | 61 |
Progenitor cell | Erythrocyte doublet | HBB > 0.2 | 437 |
Progenitor cell | Myeloid doublet | FCN1 > 0.2 | 108 |
Progenitor cell | T cell doublet | CD3E > 0.2 | 1,380 |
Proliferating NK cell | B cell doublet | MS4A1 > 0.4 | 102 |
Proliferating NK cell | Erythrocyte doublet | HBB > 0.4 | 63 |
Proliferating NK cell | Platelet doublet | PPBP > 0.2 | 85 |
Proliferating NK cell | T cell doublet | CD3D > 0.4 | 1,311 |
Proliferating T cell | B cell doublet | MS4A1 > 0.4 | 117 |
Proliferating T cell | Myeloid doublet | FCN1 > 0.2 | 901 |
Proliferating T cell | Platelet doublet | PPBP > 0.2 | 186 |
Transitional B cell | Erythrocyte doublet | HBB > 0.4 | 446 |
Transitional B cell | Platelet doublet | PPBP > 0.2 | 1,235 |
Transitional B cell | T cell doublet | CD3D > 0.4 | 1,885 |
Treg | B cell doublet | MS4A1 > 0.4 | 538 |
Treg | Erythrocyte doublet | HBB > 0.4 | 1,765 |
Treg | Myeloid doublet | FCN1 > 0.2 | 2,529 |
In addition to these filters based on fraction of gene detection, we included an additional filter for specific subsets based on mean gene expression of HBB where this better distinguished doublet populations that included Erythrocyte contamination:
Population | Reason | Threshhold | N cells |
---|---|---|---|
CD14 monocyte | Erythrocyte doublet | HBB > 1 | 7,190 |
CD56dim NK cell | Erythrocyte doublet | HBB > 0.7 | 5,004 |
Memory CD8 T cell | Erythrocyte doublet | HBB > 0.7 | 7,711 |
Proliferating T cell | Erythrocyte doublet | HBB > 0.7 | 68 |
Finally, we also removed clusters with low gene detection based on a single threshold: If > 30% of cells in a cluster had < 750 genes detected, the cluster was removed as low quality. Due to their generally lower gene detection, this threshold was not applied to Erythrocyte or Platetelet populations.
Population | Reason | N cells |
---|---|---|
CD14 monocyte | Low gene detection | 5,269 |
CD56bright NK cell | Low gene detection | 498 |
CD56dim NK cell | Low gene detection | 11,835 |
DN T cell | Low gene detection | 370 |
MAIT | Low gene detection | 283 |
Memory B cell | Low gene detection | 1,092 |
Memory CD4 T cell | Low gene detection | 9,538 |
Memory CD8 T cell | Low gene detection | 18,535 |
Naive B cell | Low gene detection | 2,860 |
Naive CD4 T cell | Low gene detection | 48,028 |
Naive CD8 T cell | Low gene detection | 1,020 |
Proliferating NK cell | Low gene detection | 415 |
Treg | Low gene detection | 8,608 |
gdT | Low gene detection | 772 |
The table below presents the total number of cells removed for each reason listed above. In total, 800,059 cells (5.39%) were removed by this filtering process, and 14,029,963 cells (94.61%) were retained for final clean up based on high-resolution AIFI_L3 labels.
Reason | N cells |
---|---|
B cell doublet | 29,551 (0.20%) |
Erythrocyte doublet | 92,621 (0.62%) |
Low gene detection | 109,123 (0.74%) |
Myeloid doublet | 88,380 (0.60%) |
Platelet doublet | 238,608 (1.61%) |
T cell doublet | 241,776 (1.63%) |
Total removed | 800,059 (5.39%) |
Clustering and refinement of high-resolution labels
View Notebook: 09-Python_partition_L3_data.ipynb
After removal of doublets at the AIFI_L2 label resolution, we assembled cells from all samples for each subset in the high-resolution AIFI_L3 labeling results. Separately for each subset, we performed the scanpy data processing procedure described above to enable a final review of high resolution labels. Each subset was examined to identify remaining doublet clusters or clusters of cells that appeared to be mislabeled based on marker gene expression in collaboration with immunological domain experts.
L3 subset clustering
View Notebook: 10a-Python_cluster_L3_b_cell_data.ipynb
View Notebook: 10b-Python_cluster_L3_cd4_t_cell_data.ipynb
View Notebook: 10c-Python_cluster_L3_cd8_t_cell_data.ipynb
View Notebook: 10d-Python_cluster_L3_other_t_data.ipynb
View Notebook: 10e-Python_cluster_L3_myeloid_data.ipynb
View Notebook: 10f-Python_cluster_L3_nk_data.ipynb
View Notebook: 10g-Python_cluster_L3_other_data.ipynb
L3 clustering review
View Notebook: 11a-Python_review_L3_b_cell_data.ipynb
View Notebook: 11b-Python_review_L3_cd4_t_cell_data.ipynb
View Notebook: 11c-Python_review_L3_cd8_t_cell_data.ipynb
View Notebook: 11d-Python_review_L3_other_t_cell_data.ipynb
View Notebook: 11e-Python_review_L3_myeloid_data.ipynb
View Notebook: 11f-Python_review_L3_nk_data.ipynb
View Notebook: 11g-Python_review_L3_other_data.ipynb
L3 filtering
View Notebook: 12a-Python_filter_L3_b_cell_data.ipynb
View Notebook: 12b-Python_filter_L3_cd4_t_cell_data.ipynb
View Notebook: 12c-Python_filter_L3_cd8_t_cell_data.ipynb
View Notebook: 12d-Python_filter_L3_other_t_cell_data.ipynb
View Notebook: 12e-Python_filter_L3_myeloid_data.ipynb
View Notebook: 12f-Python_filter_L3_nk_data.ipynb
View Notebook: 12g-Python_filter_L3_other_data.ipynb
Final review of L3 labels
View Notebook: 13a-Python_review_filtered_L3_b_cell_data.ipynb
View Notebook: 13b-Python_review_filtered_L3_cd4_t_cell_data.ipynb
View Notebook: 13c-Python_review_filtered_L3_cd8_t_cell_data.ipynb
View Notebook: 13d-Python_review_filtered_L3_other_t_cell_data.ipynb
View Notebook: 13e-Python_review_filtered_L3_myeloid_data.ipynb
View Notebook: 13f-Python_review_filtered_L3_nk_data.ipynb
View Notebook: 13g-Python_review_filtered_L3_other_data.ipynb
During this process, the following additional cells were removed from downstream analysis:
CellTypist Population | Reason | N cells (%) |
---|---|---|
ASDC | T cell doublet | 3 (0.08%) |
Adaptive NK cell | Erythrocyte doublet | 1816 (1.69%) |
Adaptive NK cell | Myeloid doublet | 21 (0.02%) |
Adaptive NK cell | T cell doublet | 2503 (2.33%) |
BaEoMaP cell | B cell doublet | 5 (0.81%) |
BaEoMaP cell | T cell doublet | 30 (4.89%) |
C1Q+ CD16 monocyte | CD14 doublet | 349 (1.11%) |
CD4 MAIT | Doublets | 6206 (46.98%) |
CD56bright NK cell | Erythrocyte doublet | 837 (0.95%) |
CD56bright NK cell | T cell doublet | 484 (0.55%) |
CD8 MAIT | Mislabeled GZMH+ | 3346 (0.96%) |
CD95 memory B cell | T cell doublet | 10 (0.07%) |
CLP cell | DC doublet | 18 (0.9%) |
CM CD4 T cell | Low quality | 4649 (0.33%) |
Core CD16 monocyte | CD14/CD16 doublet | 8830 (3.18%) |
Core naive B cell | Myeloid doublet | 7226 (1.1%) |
Core naive CD4 T cell | Erythrocyte doublet | 1107 (0.04%) |
Core naive CD4 T cell | Low quality | 8257 (0.3%) |
DN T cell | B cell Doublet | 42 (0.25%) |
Early memory B cell | T cell doublet | 295 (0.1%) |
Erythrocyte | Platelet doublet | 166 (0.94%) |
GZMB+ Vd2 gdT | Erythrocyte doublet | 13 (0.01%) |
GZMB- CD27- EM CD4 T cell | B cell doublet | 2797 (0.52%) |
GZMB- CD27- EM CD4 T cell | Low quality | 3180 (0.6%) |
GZMK+ CD56dim NK cell | Erythrocyte doublet | 2439 (2.46%) |
GZMK+ CD56dim NK cell | Myeloid doublet | 38 (0.04%) |
GZMK+ CD56dim NK cell | T cell doublet | 1542 (1.55%) |
GZMK+ Vd2 gdT | CD8A Doublet | 2365 (1.83%) |
GZMK+ Vd2 gdT | Doublets | 4978 (3.85%) |
GZMK- CD56dim NK cell | Erythrocyte doublet | 28560 (3.49%) |
GZMK- CD56dim NK cell | T cell doublet | 896 (0.11%) |
IL1B+ CD14 monocyte | Erythrocyte doublet | 87 (0.39%) |
IL1B+ CD14 monocyte | Low quality | 25 (0.11%) |
IL1B+ CD14 monocyte | Platelet doublet | 273 (1.21%) |
IL1B+ CD14 monocyte | T cell doublet | 45 (0.2%) |
ILC | T cell doublet | 105 (1.86%) |
ISG+ CD14 monocyte | Mislabeled IFI27+ | 3776 (1.3%) |
ISG+ CD56dim NK cell | T cell doublet | 824 (4.43%) |
ISG+ memory CD4 T cell | CD8+ cells | 1308 (3.54%) |
ISG+ memory CD4 T cell | Myeloid doublet | 37 (0.1%) |
ISG+ naive B cell | Mislabeled Non-Naive | 998 (4.25%) |
ISG+ naive B cell | Myeloid doublet | 138 (0.59%) |
ISG+ naive B cell | T cell doublet | 107 (0.46%) |
ISG+ naive CD4 T cell | CD8 doublet | 363 (0.83%) |
ISG+ naive CD4 T cell | Myeloid doublet | 239 (0.55%) |
KLRB1+ memory CD4 Treg | Mislabeled CCL5+ | 276 (1.32%) |
KLRB1+ memory CD8 Treg | Mislabeled GNLY+ | 223 (5.37%) |
KLRB1+ memory CD8 Treg | Platelet doublet | 42 (1.01%) |
KLRF1+ effector Vd1 gdT | abT/gdT Doublet | 433 (1.11%) |
KLRF1- GZMB+ CD27- EM CD8 T cell | Erythrocyte doublet | 8751 (1.76%) |
KLRF1- effector Vd1 gdT | Mislabeled TRDC- | 2308 (8.78%) |
KLRF1- effector Vd1 gdT | abT/gdT doublet | 226 (0.86%) |
Naive Vd1 gdT | Mislabeled TRDC- | 1366 (7.76%) |
Naive Vd1 gdT | Mislabeled ZBTB16+ | 564 (3.2%) |
Plasma cell | Mislabeled MS4A1+ | 338 (2.33%) |
Platelet | Erythrocyte doublet | 1665 (2.82%) |
Platelet | Myeloid doublet | 105 (0.18%) |
Platelet | T cell doublet | 31 (0.05%) |
Proliferating NK cell | Myeloid doublet | 23 (0.12%) |
Proliferating NK cell | T cell doublet | 323 (1.66%) |
Proliferating T cell | Mislabeled CD3D- | 475 (2.58%) |
SOX4+ Vd1 gdT | Mislabeled TRDC- | 597 (10.72%) |
SOX4+ Vd1 gdT | Mislabeled ZBTB16+ | 144 (2.59%) |
SOX4+ naive CD4 T cell | CD8 doublet | 1088 (1.09%) |
SOX4+ naive CD4 T cell | gdT doublet | 1386 (1.39%) |
SOX4+ naive CD8 T cell | Mislabeled GZMB+ | 106 (0.54%) |
In total, 234,117 cells were removed at this stage of label refinement, and 13,795,846 cells were retained for downstream analyses. In addition to cells that were removed from the dataset, we reassigned the following cells based on gene expression to generate the final label set:
CellTypist Population | Reassigned Population | N cells (%) |
---|---|---|
CD27+ effector B cell | Core memory B cell | 6279 (12.64%) |
CM CD4 T cell | CD8+ cells | 61331 (4.31%) |
Core CD14 monocyte | CD14+ cDC2 | 10910 (0.62%) |
Core CD16 monocyte | Intermediate monocyte | 15141 (5.44%) |
Core naive B cell | Core memory B cell | 22320 (3.39%) |
GZMB- CD27+ EM CD4 T cell | CD8+ cells | 17779 (2.92%) |
GZMB- CD27- EM CD4 T cell | CD8+ cells | 27230 (5.1%) |
GZMK- CD27+ EM CD8 T cell | GZMK+ CD27+ EM CD8 T cell | 2397 (6.82%) |
HLA-DRhi cDC2 | CD14+ cDC2 | 2494 (4.69%) |
ISG+ CD16 monocyte | ISG+ CD14 monocyte | 256 (0.67%) |
KLRB1+ memory CD4 Treg | GZMK+ memory CD4 Treg | 1071 (5.11%) |
KLRF1+ effector Vd1 gdT | GZMB+ Vd2 gdT | 650 (1.66%) |
KLRF1- GZMB+ CD27- EM CD8 T cell | ISG+ memory CD8 T cell | 6338 (1.27%) |
KLRF1- GZMB+ CD27- memory CD4 T cell | CD8+ cells | 5974 (3.71%) |
Memory CD4 Treg | Proliferating T cell | 1026 (0.71%) |
Naive CD4 Treg | Mislabeled GZMB+ | 184 (0.47%) |
Naive CD4 Treg | Mislabeled GZMK+ | 951 (2.43%) |
In total, 182,331 cells were reassigned in this final pass of cell type refinement.
Population subset frequency analyses
View Notebook: 15-R_assemble_type_frequencies.ipynb
To analyze changes in subset abundance, we tabulated subset frequency for each sample in our dataset for high resolution (L3) label assignments. We also compute proportions of each subset per sample, which were used for Centered Log Ratio (CLR) transformations. To compare abundance between sets of subjects, we used rank-sum Wilcoxon tests on CLR-transformed values. We compared abundance differences between Age Groups (Young Adult vs Older Adult), CMV status (Negative vs Positive), and Biological Sex (Female vs Male) using the first Flu Vaccine time point (Flu Year 1 Day 0) for each subject. For comparisons between longitudinal time points in the flu vaccine series (Day 7 vs Day 0), we used paired signed-rank Wilcoxon tests.
Pseudobulk data assembly
We generated Pseudobulk data by aggregating results for cells with the same high resolution (L3) label within each sample. This aggregation was performed using multiple methods for downstream analysis: summing UMI counts per gene (used for DESeq2 analysis, below), taking the mean of normalized and log transformed values (for visualization).
Differential gene expression tests
We performed differential gene expression tests using the DESeq2 v1.42.0 (Love, Huber, and Anders, 2014) package in R v4.3.2. Prior to differential tests, genes were filtered based on a minimum of 10% gene detection across all single cells used for the comparison. Summed UMI counts from single cells were used for the comparisons. DESeq2 analyses used default parameters. For comparisons of groups of subjects at baseline, we used pseudobulk data from Flu Year 1 Day 0 samples. The DESeq2 design formula for these comparisons was `Aggregated Counts ~ Age + CMV + Sex`, and then contrasts were made for each factor to identify differential results. Genes were considered differentially expressed if their adjusted p-value was below 0.05 and their absolute fold change was greater than 0.1.
References
Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15: 550.
doi:10.1186/s13059-014-0550-8
Traag V, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. arXiv [cs.SI]. 2018. Available: http://arxiv.org/abs/1810.08473
doi:10.48550/arXiv.1810.08473
Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0
Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281–291.e9.
doi:10.1016/j.cels.2018.11.005