Analysis methods

To build our longitudinally sampled PBMC scRNA-seq dataset from 95 healthy subjects, we utilized data and analysis environments within the Human Immune System Explorer (HISE) system to trace data processing, analytical code, and analysis environments from raw FASTQ data to a labeled, high-quality final dataset. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.

This analysis consisted of the following major steps:

Consistent pipeline processing of hashed 10x scRNA-seq data
Sample selection from HISE data storage
Labeling of cells using our CellTypist models and doublet scoring
Separation of cell classes (AIFI_L2 labels) and subclustering to identify cross-class doublets and low-quality clusters
Clustering and label refinement for high-resolution population subset labels (AIFI_L3 labels)
Assembly of final datasets
Differential expression tests

These steps and some additional analyses are available at:
aifimmunology/sound-life-scrna-analysis

Pipeline processing

After sequencing, Gene Expression and Hash Tag Oligonucleotide libraries from pooled samples in our pipeline batches were demultiplexed and assembled as individual files for each biological sample. See our Multiplexed scRNA-seq pipeline documentation for additional details.

Data selection from HISE storage

View Notebook: 00-R_Sample_Selection.ipynb

To assemble the input data for our dataset, we selected all available samples from all subjects in our SoundLife Young Adult (age 25-35) and SoundLife Older Adult (age 55-65) cohorts. We then filtered to remove a batch of samples with previously identified quality problems (B004), as well as subjects with non-healthy or abnormal disease states recorded at the time of any of their sample collection visits. In cases where aliquots from the same blood draw were processed through our pipeline multiple times, we selected the most recently generated data (highest batch number).

In total, 868 samples were selected for use in our dataset (Young Adult, n = 49 subjects; n = 418 samples; Older Adult, n = 47 subjects, n = 450 samples), consisting of 15,781,886 cells before additional QC filtering.

View Notebook: 01-Python_retrieve_cmv_bmi.ipynb

In addition to processed data and sample metadata identification, we retrieved clinical lab results and clinical subject data to assemble CMV infection status for all subjects, as well as BMI for all adult subjects (Sound Life Cohorts).

Labeling and doublet detection

To add cell subset labels, we utilized CellTypist (v1.6.1, Dominguez Conde, et al.) and the CellTypist models we generated using our Immune Health PBMC Atlas dataset at 3 levels of subset label resolution (AIFI_L1, 9 broad cell classes; AIFI_L2, 29 intermediate cell classes, and AIFI_L3, 71 high-resolution cell classes) by following the approach described in the CellTypist reference documentation at https://celltypist.org.

Subset labels were assigned in two stages. See the Github repository for additional details.

L1 and L2 labels:

View Notebook: 02a-Python_label_predictions_celltypist_L1-L2_BR1_Female.ipynb
View Notebook: 02b-Python_label_predictions_celltypist_L1-L2_BR1_Male.ipynb
View Notebook: 02c-Python_label_predictions_celltypist_L1-L2_BR2_Female.ipynb
View Notebook: 02d-Python_label_predictions_celltypist_L1-L2_BR2_Male.ipynb

L3 labels:

View Notebook: 08a-Python_label_predictions_celltypist_L3_BR1_Female.ipynb
View Notebook: 08b-Python_label_predictions_celltypist_L3_BR1_Male.ipynb
View Notebook: 08c-Python_label_predictions_celltypist_L3_BR2_Female.ipynb
View Notebook: 08d-Python_label_predictions_celltypist_L3_BR2_Male.ipynb

View Notebook: 03-Python_doublet_detection.ipynb

While our cell hashing pipeline identifies doublets that occur between multiple samples, we are blind to doublets that occur with the same cell droplet. To assist in identifying these remaining doublets, we performed doublet detection using the scrublet package (v0.2.3, Wolock, et al.) as implemented in the scanpy.external module of scanpy (v1.9.6, Alexander Wolf, et al.).

View Notebook: 04-Python_assembly_of_subsets.ipynb

After subset labeling and doublet detection were performed on each sample, we assembled data, labels, scrublet calls, sample metadata, CMV status, and BMI data across 8 sets of samples defined by the cohort, biological sex, and CMV status of subjects to facilitate downstream analyses.

QC filtering

View Notebook: 05-Python_subset_qc_filtering.ipynb

After assembly of all cells across our samples, we filtered our data against a set of QC criteria designed to remove possible doublets (based on scrublet and high gene detection, > 5,000 genes), low-quality cells (based on low gene detection, < 200 genes), and dying or dead cells (based on mitochondrial UMI content, > 10%). Mitochondrial content was assessed using scanpy by identifying mitochondrial genes (prefixed with "MT-"), and using the scanpy function calculate_qc_metrics, as demonstrated in the scanpy tutorials (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). QC criteria were applied to the dataset in series, as shown in the table below. In total, 951,864 cells were removed (6.03%), and 14,830,022 cells passed QC filtering (93.97%).

QC Criteria	Cells Removed	Cells Remaining
Pre-QC filtering	NA	15,781,886 (100%)
Scrublet doublet call	156,381 (0.99%)	15,625,505 (99.00%)
Mitochondrial UMIs > 10%	770,582 (4.88%)	14,854,923 (94.13%)
N Genes Detected < 200	6,302 (0.04%)	14,848,621 (94.09%)
N Genes Detected > 2,500	18,599 (0.12%)	14,830,022 (93.97%)
Total Removed	951,864 (6.03%)	14,830,022 (93.97%)

Clustering, class subsetting, and marker-based doublet removal

After QC filtering, all remaining cells were clustered within AIFI_L2 labels using a scanpy workflow that consisted of normalization using scanpy.pp.normalize_total with the parameter target_sum = 1e4; log transformation using scanpy.pp.log1p; highly variable gene detection using scanpy.pp.highly_variable_genes; scaling of highly variable gene data using sc.pp.scale; Dimensionality reduction using scanpy.tl.pca with the parameter svd_solver = 'arpack'; Cohort integration using the Harmony method (harmonypy v0.0.9; Korsunsky, et al.) implemented in scanpy.external.harmony_integrate with the parameter key = 'cohort.cohortGuid'; Nearest neighbor calculation with scanpy.pp.neighbors and parameters n_neighbors = 50, use_rep = 'X_pca_harmony', and n_pcs = 30; UMAP embedding with scanpy.tl.umap; and Leiden clustering (leidenalg v0.10.1; Traag, et al.) using scanpy.tl.leiden. Unless specified, default parameters provided by scanpy v1.9.6 were used.

For most AIFI_L2 cell classes, the processing and clustering described above was performed using cells from all samples. However, 5 cell classes consisting of > 1 million cells were processed separately as 8 sets of cells based on cohort, sex, and CMV status to make file handling and clustering processes more manageable: CD4 naive T cells, CD4 memory T cells, CD8 memory T cells, CD56dim NK cells, and CD14 Monocytes.

Large subset clustering at Intermediate (L2) subset resolution:

View Notebook: 06a-Python_partition_large_cell_classes_BR1_Female.ipynb
View Notebook: 06b-Python_partition_large_cell_classes_BR1_Male.ipynb
View Notebook: 06c-Python_partition_large_cell_classes_BR2_Female.ipynb
View Notebook: 06d-Python_partition_large_cell_classes_BR2_Male.ipynb

All other subsets

View Notebook: 06e-Python_partition_small_cell_classes.ipynb

View Notebook: 07-Python_filter_cell_classes.ipynb
View Notebook: 07a-R_filter_review_plots.ipynb

After clustering, clusters were filtered to remove doublets based on marker gene expression. For example, CD4 naive T cell clusters with a high fraction of MS4A1 expression were removed as B cell doublets. For each cluster, the fraction of cells with detected gene expression for each marker was computed, and clusters meeting a set of cell class-specific criteria were removed. These thresholds are shown in the table below.

Population	Reason	Threshhold	N cells
ASDC	B cell doublet	MS4A1 > 0.2	18
ASDC	Platelet doublet	PPBP > 0.4	37
ASDC	T cell doublet	CD3E > 0.4	303
CD14 monocyte	B cell doublet	MS4A1 > 0.2	6,076
CD14 monocyte	Platelet doublet	PPBP > 0.4	56,817
CD14 monocyte	T cell doublet	CD3E > 0.1 OR IL7R > 0.2	115,867
CD16 monocyte	B cell doublet	MS4A1 > 0.4	4,366
CD16 monocyte	Erythrocyte doublet	HBB > 0.2	1,677
CD16 monocyte	Platelet doublet	PPBP > 0.4	15,564
CD16 monocyte	T cell doublet	CD3E > 0.2	34,617
CD56bright NK cell	B cell doublet	MS4A1 > 0.4	444
CD56bright NK cell	Erythrocyte doublet	HBB > 0.2	380
CD56bright NK cell	Myeloid doublet	FCN1 > 0.4	738
CD56bright NK cell	Platelet doublet	PPBP > 0.4	1,277
CD56bright NK cell	T cell doublet	CD3D > 0.4	4,863
CD56dim NK cell	B cell doublet	MS4A1 > 0.4	4,861
CD56dim NK cell	Myeloid doublet	FCN1 > 0.4	10,610
CD56dim NK cell	Platelet doublet	PPBP > 0.4	19,301
CD56dim NK cell	T cell doublet	IL7R > 0.4	30,614
CD8aa	Myeloid doublet	FCN1 > 0.2	139
CD8aa	Platelet doublet	PPBP > 0.4	88
cDC1	Platelet doublet	PPBP > 0.2	158
cDC1	T cell doublet	CD3D > 0.2 OR IL7R > 0.2	821
cDC2	Erythrocyte doublet	HBB > 0.4	391
cDC2	Platelet doublet	PPBP > 0.2	1,440
cDC2	T cell doublet	CD3D > 0.2	10,075
DN T cell	Erythrocyte doublet	HBB > 0.4	53
DN T cell	Myeloid doublet	FCN1 > 0.2	144
DN T cell	Platelet doublet	PPBP > 0.4	98
Effector B cell	Erythrocyte doublet	HBB > 0.4	333
Effector B cell	Platelet doublet	PPBP > 0.4	641
Effector B cell	T cell doublet	CD3D > 0.2 OR IL7R > 0.2	4,751
Erythrocyte	B cell doublet	MS4A1 > 0.4	1,985
Erythrocyte	Myeloid doublet	FCN1 > 0.2	3,138
Erythrocyte	Platelet doublet	PPBP > 0.4	2,069
gdT	Erythrocyte doublet	HBB > 0.2	595
gdT	Myeloid doublet	FCN1 > 0.2	2,381
gdT	Platelet doublet	PPBP > 0.4	4,822
ILC	Erythrocyte doublet	HBB > 0.4	8
ILC	Platelet doublet	PPBP > 0.2	61
ILC	T cell doublet	CD3D > 0.4	311
Intermediate monocyte	B cell doublet	MS4A1 > 0.4	342
Intermediate monocyte	Erythrocyte doublet	HBB > 0.4	414
Intermediate monocyte	Platelet doublet	PPBP > 0.2	1,928
Intermediate monocyte	T cell doublet	CD3E > 0.2	9,627
MAIT	B cell doublet	MS4A1 > 0.4	1,282
MAIT	Erythrocyte doublet	HBB > 0.4	1,717
MAIT	Myeloid doublet	FCN1 > 0.2	4,154
MAIT	Platelet doublet	PPBP > 0.2	8,173
Memory B cell	Erythrocyte doublet	HBB > 0.4	4,231
Memory B cell	Myeloid doublet	FCN1 > 0.2	6,726
Memory B cell	Platelet doublet	PPBP > 0.2	8,168
Memory B cell	T cell doublet	CD3D > 0.4	6,610
Memory CD4 T cell	Erythrocyte doublet	HBB > 0.4	5,840
Memory CD4 T cell	Myeloid doublet	FCN1 > 0.2	13,774
Memory CD8 T cell	B cell doublet	MS4A1 > 0.4	3,100
Memory CD8 T cell	Myeloid doublet	FCN1 > 0.2	10,487
Memory CD8 T cell	Platelet doublet	PPBP > 0.2	16,354
Naive B cell	Erythrocyte doublet	HBB > 0.4	1,670
Naive B cell	Myeloid doublet	FCN1 > 0.2	1,005
Naive B cell	Platelet doublet	PPBP > 0.2	26,396
Naive B cell	T cell doublet	CD3D > 0.4	7,946
Naive CD4 T cell	B cell doublet	MS4A1 > 0.2	4,040
Naive CD4 T cell	Erythrocyte doublet	HBB > 0.4	19,387
Naive CD4 T cell	Myeloid doublet	FCN1 > 0.2	22,122
Naive CD4 T cell	Platelet doublet	PPBP > 0.2	49,784
Naive CD8 T cell	B cell doublet	MS4A1 > 0.2	1,519
Naive CD8 T cell	Erythrocyte doublet	HBB > 0.4	29,596
Naive CD8 T cell	Myeloid doublet	FCN1 > 0.2	3,723
Naive CD8 T cell	Platelet doublet	PPBP > 0.2	21,018
pDC	B cell doublet	MS4A1 > 0.2	233
pDC	Erythrocyte doublet	HBB > 0.4	247
pDC	Myeloid doublet	FCN1 > 0.2	1,367
pDC	Platelet doublet	PPBP > 0.2	2,869
pDC	T cell doublet	CD3D > 0.4	3,472
Plasma cell	Erythrocyte doublet	HBB > 0.4	53
Plasma cell	Myeloid doublet	FCN1 > 0.2	50
Plasma cell	Platelet doublet	PPBP > 0.4	39
Plasma cell	T cell doublet	CD3D > 0.4	1,956
Platelet	B cell doublet	MS4A1 > 0.4	467
Platelet	Erythrocyte doublet	HBB > 0.4	3,345
Platelet	Myeloid doublet	FCN1 > 0.2	4,284
Platelet	T cell doublet	CD3D > 0.4	5,367
Progenitor cell	B cell doublet	MS4A1 > 0.4	61
Progenitor cell	Erythrocyte doublet	HBB > 0.2	437
Progenitor cell	Myeloid doublet	FCN1 > 0.2	108
Progenitor cell	T cell doublet	CD3E > 0.2	1,380
Proliferating NK cell	B cell doublet	MS4A1 > 0.4	102
Proliferating NK cell	Erythrocyte doublet	HBB > 0.4	63
Proliferating NK cell	Platelet doublet	PPBP > 0.2	85
Proliferating NK cell	T cell doublet	CD3D > 0.4	1,311
Proliferating T cell	B cell doublet	MS4A1 > 0.4	117
Proliferating T cell	Myeloid doublet	FCN1 > 0.2	901
Proliferating T cell	Platelet doublet	PPBP > 0.2	186
Transitional B cell	Erythrocyte doublet	HBB > 0.4	446
Transitional B cell	Platelet doublet	PPBP > 0.2	1,235
Transitional B cell	T cell doublet	CD3D > 0.4	1,885
Treg	B cell doublet	MS4A1 > 0.4	538
Treg	Erythrocyte doublet	HBB > 0.4	1,765
Treg	Myeloid doublet	FCN1 > 0.2	2,529

In addition to these filters based on fraction of gene detection, we included an additional filter for specific subsets based on mean gene expression of HBB where this better distinguished doublet populations that included Erythrocyte contamination:

Population	Reason	Threshhold	N cells
CD14 monocyte	Erythrocyte doublet	HBB > 1	7,190
CD56dim NK cell	Erythrocyte doublet	HBB > 0.7	5,004
Memory CD8 T cell	Erythrocyte doublet	HBB > 0.7	7,711
Proliferating T cell	Erythrocyte doublet	HBB > 0.7	68

Finally, we also removed clusters with low gene detection based on a single threshold: If > 30% of cells in a cluster had < 750 genes detected, the cluster was removed as low quality. Due to their generally lower gene detection, this threshold was not applied to Erythrocyte or Platetelet populations.

Population	Reason	N cells
CD14 monocyte	Low gene detection	5,269
CD56bright NK cell	Low gene detection	498
CD56dim NK cell	Low gene detection	11,835
DN T cell	Low gene detection	370
MAIT	Low gene detection	283
Memory B cell	Low gene detection	1,092
Memory CD4 T cell	Low gene detection	9,538
Memory CD8 T cell	Low gene detection	18,535
Naive B cell	Low gene detection	2,860
Naive CD4 T cell	Low gene detection	48,028
Naive CD8 T cell	Low gene detection	1,020
Proliferating NK cell	Low gene detection	415
Treg	Low gene detection	8,608
gdT	Low gene detection	772

The table below presents the total number of cells removed for each reason listed above. In total, 800,059 cells (5.39%) were removed by this filtering process, and 14,029,963 cells (94.61%) were retained for final clean up based on high-resolution AIFI_L3 labels.

Reason	N cells
B cell doublet	29,551 (0.20%)
Erythrocyte doublet	92,621 (0.62%)
Low gene detection	109,123 (0.74%)
Myeloid doublet	88,380 (0.60%)
Platelet doublet	238,608 (1.61%)
T cell doublet	241,776 (1.63%)
Total removed	800,059 (5.39%)

Clustering and refinement of high-resolution labels

View Notebook: 09-Python_partition_L3_data.ipynb

After removal of doublets at the AIFI_L2 label resolution, we assembled cells from all samples for each subset in the high-resolution AIFI_L3 labeling results. Separately for each subset, we performed the scanpy data processing procedure described above to enable a final review of high resolution labels. Each subset was examined to identify remaining doublet clusters or clusters of cells that appeared to be mislabeled based on marker gene expression in collaboration with immunological domain experts.

L3 subset clustering

View Notebook: 10a-Python_cluster_L3_b_cell_data.ipynb
View Notebook: 10b-Python_cluster_L3_cd4_t_cell_data.ipynb
View Notebook: 10c-Python_cluster_L3_cd8_t_cell_data.ipynb
View Notebook: 10d-Python_cluster_L3_other_t_data.ipynb
View Notebook: 10e-Python_cluster_L3_myeloid_data.ipynb
View Notebook: 10f-Python_cluster_L3_nk_data.ipynb
View Notebook: 10g-Python_cluster_L3_other_data.ipynb

L3 clustering review

View Notebook: 11a-Python_review_L3_b_cell_data.ipynb
View Notebook: 11b-Python_review_L3_cd4_t_cell_data.ipynb
View Notebook: 11c-Python_review_L3_cd8_t_cell_data.ipynb
View Notebook: 11d-Python_review_L3_other_t_cell_data.ipynb
View Notebook: 11e-Python_review_L3_myeloid_data.ipynb
View Notebook: 11f-Python_review_L3_nk_data.ipynb
View Notebook: 11g-Python_review_L3_other_data.ipynb

L3 filtering

View Notebook: 12a-Python_filter_L3_b_cell_data.ipynb
View Notebook: 12b-Python_filter_L3_cd4_t_cell_data.ipynb
View Notebook: 12c-Python_filter_L3_cd8_t_cell_data.ipynb
View Notebook: 12d-Python_filter_L3_other_t_cell_data.ipynb
View Notebook: 12e-Python_filter_L3_myeloid_data.ipynb
View Notebook: 12f-Python_filter_L3_nk_data.ipynb
View Notebook: 12g-Python_filter_L3_other_data.ipynb

Final review of L3 labels

View Notebook: 13a-Python_review_filtered_L3_b_cell_data.ipynb
View Notebook: 13b-Python_review_filtered_L3_cd4_t_cell_data.ipynb
View Notebook: 13c-Python_review_filtered_L3_cd8_t_cell_data.ipynb
View Notebook: 13d-Python_review_filtered_L3_other_t_cell_data.ipynb
View Notebook: 13e-Python_review_filtered_L3_myeloid_data.ipynb
View Notebook: 13f-Python_review_filtered_L3_nk_data.ipynb
View Notebook: 13g-Python_review_filtered_L3_other_data.ipynb

During this process, the following additional cells were removed from downstream analysis:

CellTypist Population	Reason	N cells (%)
ASDC	T cell doublet	3 (0.08%)
Adaptive NK cell	Erythrocyte doublet	1816 (1.69%)
Adaptive NK cell	Myeloid doublet	21 (0.02%)
Adaptive NK cell	T cell doublet	2503 (2.33%)
BaEoMaP cell	B cell doublet	5 (0.81%)
BaEoMaP cell	T cell doublet	30 (4.89%)
C1Q+ CD16 monocyte	CD14 doublet	349 (1.11%)
CD4 MAIT	Doublets	6206 (46.98%)
CD56bright NK cell	Erythrocyte doublet	837 (0.95%)
CD56bright NK cell	T cell doublet	484 (0.55%)
CD8 MAIT	Mislabeled GZMH+	3346 (0.96%)
CD95 memory B cell	T cell doublet	10 (0.07%)
CLP cell	DC doublet	18 (0.9%)
CM CD4 T cell	Low quality	4649 (0.33%)
Core CD16 monocyte	CD14/CD16 doublet	8830 (3.18%)
Core naive B cell	Myeloid doublet	7226 (1.1%)
Core naive CD4 T cell	Erythrocyte doublet	1107 (0.04%)
Core naive CD4 T cell	Low quality	8257 (0.3%)
DN T cell	B cell Doublet	42 (0.25%)
Early memory B cell	T cell doublet	295 (0.1%)
Erythrocyte	Platelet doublet	166 (0.94%)
GZMB+ Vd2 gdT	Erythrocyte doublet	13 (0.01%)
GZMB- CD27- EM CD4 T cell	B cell doublet	2797 (0.52%)
GZMB- CD27- EM CD4 T cell	Low quality	3180 (0.6%)
GZMK+ CD56dim NK cell	Erythrocyte doublet	2439 (2.46%)
GZMK+ CD56dim NK cell	Myeloid doublet	38 (0.04%)
GZMK+ CD56dim NK cell	T cell doublet	1542 (1.55%)
GZMK+ Vd2 gdT	CD8A Doublet	2365 (1.83%)
GZMK+ Vd2 gdT	Doublets	4978 (3.85%)
GZMK- CD56dim NK cell	Erythrocyte doublet	28560 (3.49%)
GZMK- CD56dim NK cell	T cell doublet	896 (0.11%)
IL1B+ CD14 monocyte	Erythrocyte doublet	87 (0.39%)
IL1B+ CD14 monocyte	Low quality	25 (0.11%)
IL1B+ CD14 monocyte	Platelet doublet	273 (1.21%)
IL1B+ CD14 monocyte	T cell doublet	45 (0.2%)
ILC	T cell doublet	105 (1.86%)
ISG+ CD14 monocyte	Mislabeled IFI27+	3776 (1.3%)
ISG+ CD56dim NK cell	T cell doublet	824 (4.43%)
ISG+ memory CD4 T cell	CD8+ cells	1308 (3.54%)
ISG+ memory CD4 T cell	Myeloid doublet	37 (0.1%)
ISG+ naive B cell	Mislabeled Non-Naive	998 (4.25%)
ISG+ naive B cell	Myeloid doublet	138 (0.59%)
ISG+ naive B cell	T cell doublet	107 (0.46%)
ISG+ naive CD4 T cell	CD8 doublet	363 (0.83%)
ISG+ naive CD4 T cell	Myeloid doublet	239 (0.55%)
KLRB1+ memory CD4 Treg	Mislabeled CCL5+	276 (1.32%)
KLRB1+ memory CD8 Treg	Mislabeled GNLY+	223 (5.37%)
KLRB1+ memory CD8 Treg	Platelet doublet	42 (1.01%)
KLRF1+ effector Vd1 gdT	abT/gdT Doublet	433 (1.11%)
KLRF1- GZMB+ CD27- EM CD8 T cell	Erythrocyte doublet	8751 (1.76%)
KLRF1- effector Vd1 gdT	Mislabeled TRDC-	2308 (8.78%)
KLRF1- effector Vd1 gdT	abT/gdT doublet	226 (0.86%)
Naive Vd1 gdT	Mislabeled TRDC-	1366 (7.76%)
Naive Vd1 gdT	Mislabeled ZBTB16+	564 (3.2%)
Plasma cell	Mislabeled MS4A1+	338 (2.33%)
Platelet	Erythrocyte doublet	1665 (2.82%)
Platelet	Myeloid doublet	105 (0.18%)
Platelet	T cell doublet	31 (0.05%)
Proliferating NK cell	Myeloid doublet	23 (0.12%)
Proliferating NK cell	T cell doublet	323 (1.66%)
Proliferating T cell	Mislabeled CD3D-	475 (2.58%)
SOX4+ Vd1 gdT	Mislabeled TRDC-	597 (10.72%)
SOX4+ Vd1 gdT	Mislabeled ZBTB16+	144 (2.59%)
SOX4+ naive CD4 T cell	CD8 doublet	1088 (1.09%)
SOX4+ naive CD4 T cell	gdT doublet	1386 (1.39%)
SOX4+ naive CD8 T cell	Mislabeled GZMB+	106 (0.54%)

In total, 234,117 cells were removed at this stage of label refinement, and 13,795,846 cells were retained for downstream analyses. In addition to cells that were removed from the dataset, we reassigned the following cells based on gene expression to generate the final label set:

CellTypist Population	Reassigned Population	N cells (%)
CD27+ effector B cell	Core memory B cell	6279 (12.64%)
CM CD4 T cell	CD8+ cells	61331 (4.31%)
Core CD14 monocyte	CD14+ cDC2	10910 (0.62%)
Core CD16 monocyte	Intermediate monocyte	15141 (5.44%)
Core naive B cell	Core memory B cell	22320 (3.39%)
GZMB- CD27+ EM CD4 T cell	CD8+ cells	17779 (2.92%)
GZMB- CD27- EM CD4 T cell	CD8+ cells	27230 (5.1%)
GZMK- CD27+ EM CD8 T cell	GZMK+ CD27+ EM CD8 T cell	2397 (6.82%)
HLA-DRhi cDC2	CD14+ cDC2	2494 (4.69%)
ISG+ CD16 monocyte	ISG+ CD14 monocyte	256 (0.67%)
KLRB1+ memory CD4 Treg	GZMK+ memory CD4 Treg	1071 (5.11%)
KLRF1+ effector Vd1 gdT	GZMB+ Vd2 gdT	650 (1.66%)
KLRF1- GZMB+ CD27- EM CD8 T cell	ISG+ memory CD8 T cell	6338 (1.27%)
KLRF1- GZMB+ CD27- memory CD4 T cell	CD8+ cells	5974 (3.71%)
Memory CD4 Treg	Proliferating T cell	1026 (0.71%)
Naive CD4 Treg	Mislabeled GZMB+	184 (0.47%)
Naive CD4 Treg	Mislabeled GZMK+	951 (2.43%)

In total, 182,331 cells were reassigned in this final pass of cell type refinement.

Population subset frequency analyses

View Notebook: 15-R_assemble_type_frequencies.ipynb

To analyze changes in subset abundance, we tabulated subset frequency for each sample in our dataset for high resolution (L3) label assignments. We also compute proportions of each subset per sample, which were used for Centered Log Ratio (CLR) transformations. To compare abundance between sets of subjects, we used rank-sum Wilcoxon tests on CLR-transformed values. We compared abundance differences between Age Groups (Young Adult vs Older Adult), CMV status (Negative vs Positive), and Biological Sex (Female vs Male) using the first Flu Vaccine time point (Flu Year 1 Day 0) for each subject. For comparisons between longitudinal time points in the flu vaccine series (Day 7 vs Day 0), we used paired signed-rank Wilcoxon tests.

Pseudobulk data assembly

We generated Pseudobulk data by aggregating results for cells with the same high resolution (L3) label within each sample. This aggregation was performed using multiple methods for downstream analysis: summing UMI counts per gene (used for DESeq2 analysis, below), taking the mean of normalized and log transformed values (for visualization).

Differential gene expression tests

We performed differential gene expression tests using the DESeq2 v1.42.0 (Love, Huber, and Anders, 2014) package in R v4.3.2. Prior to differential tests, genes were filtered based on a minimum of 10% gene detection across all single cells used for the comparison. Summed UMI counts from single cells were used for the comparisons. DESeq2 analyses used default parameters. For comparisons of groups of subjects at baseline, we used pseudobulk data from Flu Year 1 Day 0 samples. The DESeq2 design formula for these comparisons was `Aggregated Counts ~ Age + CMV + Sex`, and then contrasts were made for each factor to identify differential results. Genes were considered differentially expressed if their adjusted p-value was below 0.05 and their absolute fold change was greater than 0.1.

References

Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15: 550.
doi:10.1186/s13059-014-0550-8

Traag V, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. arXiv [cs.SI]. 2018. Available: http://arxiv.org/abs/1810.08473
doi:10.48550/arXiv.1810.08473

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0

Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281–291.e9.
doi:10.1016/j.cels.2018.11.005

Analysis methods

Pipeline processing

Data selection from HISE storage

Labeling and doublet detection

Population Subset Labeling Notebooks

QC filtering

Clustering, class subsetting, and marker-based doublet removal

L2 subset clustering notebooks

Filtering counts per AIFI_L2 label

Clustering and refinement of high-resolution labels

L3 clustering, review, and filtering notebooks

Filtering counts per AIFI_L3 label

Population subset frequency analyses

Pseudobulk data assembly

Differential gene expression tests

References