Analysis methods

To build our longitudinally sampled PBMC scRNA-seq dataset from 95 healthy subjects, we utilized data and analysis environments within the Human Immune System Explorer (HISE) system to trace data processing, analytical code, and analysis environments from raw FASTQ data to a labeled, high-quality final dataset. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.

This analysis consisted of the following major steps:

  • Consistent pipeline processing of hashed 10x scRNA-seq data
  • Sample selection from HISE data storage
  • Labeling of cells using our CellTypist models and doublet scoring
  • Separation of cell classes (AIFI_L2 labels) and subclustering to identify cross-class doublets and low-quality clusters
  • Clustering and label refinement for high-resolution population subset labels (AIFI_L3 labels)
  • Assembly of final datasets
  • Differential expression tests

These steps and some additional analyses are available at: 
aifimmunology/sound-life-scrna-analysis

Pipeline processing

After sequencing, Gene Expression and Hash Tag Oligonucleotide libraries from pooled samples in our pipeline batches were demultiplexed and assembled as individual files for each biological sample. See our Multiplexed scRNA-seq pipeline documentation for additional details.

Data selection from HISE storage

View Notebook: 00-R_Sample_Selection.ipynb

To assemble the input data for our dataset, we selected all available samples from all subjects in our SoundLife Young Adult (age 25-35) and SoundLife Older Adult (age 55-65) cohorts. We then filtered to remove a batch of samples with previously identified quality problems (B004), as well as subjects with non-healthy or abnormal disease states recorded at the time of any of their sample collection visits. In cases where aliquots from the same blood draw were processed through our pipeline multiple times, we selected the most recently generated data (highest batch number).

In total, 868 samples were selected for use in our dataset (Young Adult, n = 49 subjects; n =  418 samples; Older Adult, n = 47 subjects, n = 450 samples), consisting of 15,781,886 cells before additional QC filtering.

View Notebook: 01-Python_retrieve_cmv_bmi.ipynb

In addition to processed data and sample metadata identification, we retrieved clinical lab results and clinical subject data to assemble CMV infection status for all subjects, as well as BMI for all adult subjects (Sound Life Cohorts).

Labeling and doublet detection

To add cell subset labels, we utilized CellTypist (v1.6.1, Dominguez Conde, et al.) and the CellTypist models we generated using our Immune Health PBMC Atlas dataset at 3 levels of subset label resolution (AIFI_L1, 9 broad cell classes; AIFI_L2, 29 intermediate cell classes, and AIFI_L3, 71 high-resolution cell classes) by following the approach described in the CellTypist reference documentation at https://celltypist.org.

View Notebook: 03-Python_doublet_detection.ipynb

While our cell hashing pipeline identifies doublets that occur between multiple samples, we are blind to doublets that occur with the same cell droplet. To assist in identifying these remaining doublets, we performed doublet detection using the scrublet package (v0.2.3, Wolock, et al.) as implemented in the scanpy.external module of scanpy (v1.9.6, Alexander Wolf, et al.).

View Notebook: 04-Python_assembly_of_subsets.ipynb

After subset labeling and doublet detection were performed on each sample, we assembled data, labels, scrublet calls, sample metadata, CMV status, and BMI data across 8 sets of samples defined by the cohort, biological sex, and CMV status of subjects to facilitate downstream analyses.

QC filtering

View Notebook: 05-Python_subset_qc_filtering.ipynb

After assembly of all cells across our samples, we filtered our data against a set of QC criteria designed to remove possible doublets (based on scrublet and high gene detection, > 5,000 genes), low-quality cells (based on low gene detection, < 200 genes), and dying or dead cells (based on mitochondrial UMI content, > 10%). Mitochondrial content was assessed using scanpy by identifying mitochondrial genes (prefixed with "MT-"), and using the scanpy function calculate_qc_metrics, as demonstrated in the scanpy tutorials (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). QC criteria were applied to the dataset in series, as shown in the table below. In total, 951,864 cells were removed (6.03%), and 14,830,022 cells passed QC filtering (93.97%).

QC Criteria Cells Removed Cells Remaining
Pre-QC filtering NA 15,781,886 (100%)
Scrublet doublet call 156,381 (0.99%) 15,625,505 (99.00%)
Mitochondrial UMIs > 10% 770,582 (4.88%) 14,854,923 (94.13%)
N Genes Detected < 200 6,302 (0.04%) 14,848,621 (94.09%)
N Genes Detected > 2,500 18,599 (0.12%) 14,830,022 (93.97%)
Total Removed 951,864 (6.03%) 14,830,022 (93.97%)

 

Clustering, class subsetting, and marker-based doublet removal

After QC filtering, all remaining cells were clustered within AIFI_L2 labels using a scanpy workflow that consisted of normalization using scanpy.pp.normalize_total with the parameter target_sum = 1e4; log transformation using scanpy.pp.log1p; highly variable gene detection using scanpy.pp.highly_variable_genes; scaling of highly variable gene data using sc.pp.scale; Dimensionality reduction using scanpy.tl.pca with the parameter svd_solver = 'arpack'; Cohort integration using the Harmony method (harmonypy v0.0.9; Korsunsky, et al.) implemented in scanpy.external.harmony_integrate with the parameter key = 'cohort.cohortGuid'; Nearest neighbor calculation with scanpy.pp.neighbors and parameters n_neighbors = 50, use_rep = 'X_pca_harmony', and n_pcs = 30; UMAP embedding with scanpy.tl.umap; and Leiden clustering (leidenalg v0.10.1; Traag, et al.) using scanpy.tl.leiden. Unless specified, default parameters provided by scanpy v1.9.6 were used. 

For most AIFI_L2 cell classes, the processing and clustering described above was performed using cells from all samples. However, 5 cell classes consisting of > 1 million cells were processed separately as 8 sets of cells based on cohort, sex, and CMV status to make file handling and clustering processes more manageable: CD4 naive T cells, CD4 memory T cells, CD8 memory T cells, CD56dim NK cells, and CD14 Monocytes.

View Notebook: 07-Python_filter_cell_classes.ipynb
View Notebook: 07a-R_filter_review_plots.ipynb

After clustering, clusters were filtered to remove doublets based on marker gene expression. For example, CD4 naive T cell clusters with a high fraction of MS4A1 expression were removed as B cell doublets. For each cluster, the fraction of cells with detected gene expression for each marker was computed, and clusters meeting a set of cell class-specific criteria were removed. These thresholds are shown in the table below.

Population Reason Threshhold N cells
ASDC B cell doublet MS4A1 > 0.2 18
ASDC Platelet doublet PPBP > 0.4 37
ASDC T cell doublet CD3E > 0.4 303
CD14 monocyte B cell doublet MS4A1 > 0.2 6,076
CD14 monocyte Platelet doublet PPBP > 0.4 56,817
CD14 monocyte T cell doublet CD3E > 0.1 OR IL7R > 0.2 115,867
CD16 monocyte B cell doublet MS4A1 > 0.4 4,366
CD16 monocyte Erythrocyte doublet HBB > 0.2 1,677
CD16 monocyte Platelet doublet PPBP > 0.4 15,564
CD16 monocyte T cell doublet CD3E > 0.2 34,617
CD56bright NK cell B cell doublet MS4A1 > 0.4 444
CD56bright NK cell Erythrocyte doublet HBB > 0.2 380
CD56bright NK cell Myeloid doublet FCN1 > 0.4 738
CD56bright NK cell Platelet doublet PPBP > 0.4 1,277
CD56bright NK cell T cell doublet CD3D > 0.4 4,863
CD56dim NK cell B cell doublet MS4A1 > 0.4 4,861
CD56dim NK cell Myeloid doublet FCN1 > 0.4 10,610
CD56dim NK cell Platelet doublet PPBP > 0.4 19,301
CD56dim NK cell T cell doublet IL7R > 0.4 30,614
CD8aa Myeloid doublet FCN1 > 0.2 139
CD8aa Platelet doublet PPBP > 0.4 88
cDC1 Platelet doublet PPBP > 0.2 158
cDC1 T cell doublet CD3D > 0.2 OR IL7R > 0.2 821
cDC2 Erythrocyte doublet HBB > 0.4 391
cDC2 Platelet doublet PPBP > 0.2 1,440
cDC2 T cell doublet CD3D > 0.2 10,075
DN T cell Erythrocyte doublet HBB > 0.4 53
DN T cell Myeloid doublet FCN1 > 0.2 144
DN T cell Platelet doublet PPBP > 0.4 98
Effector B cell Erythrocyte doublet HBB > 0.4 333
Effector B cell Platelet doublet PPBP > 0.4 641
Effector B cell T cell doublet CD3D > 0.2 OR IL7R > 0.2 4,751
Erythrocyte B cell doublet MS4A1 > 0.4 1,985
Erythrocyte Myeloid doublet FCN1 > 0.2 3,138
Erythrocyte Platelet doublet PPBP > 0.4 2,069
gdT Erythrocyte doublet HBB > 0.2 595
gdT Myeloid doublet FCN1 > 0.2 2,381
gdT Platelet doublet PPBP > 0.4 4,822
ILC Erythrocyte doublet HBB > 0.4 8
ILC Platelet doublet PPBP > 0.2 61
ILC T cell doublet CD3D > 0.4 311
Intermediate monocyte B cell doublet MS4A1 > 0.4 342
Intermediate monocyte Erythrocyte doublet HBB > 0.4 414
Intermediate monocyte Platelet doublet PPBP > 0.2 1,928
Intermediate monocyte T cell doublet CD3E > 0.2 9,627
MAIT B cell doublet MS4A1 > 0.4 1,282
MAIT Erythrocyte doublet HBB > 0.4 1,717
MAIT Myeloid doublet FCN1 > 0.2 4,154
MAIT Platelet doublet PPBP > 0.2 8,173
Memory B cell Erythrocyte doublet HBB > 0.4 4,231
Memory B cell Myeloid doublet FCN1 > 0.2 6,726
Memory B cell Platelet doublet PPBP > 0.2 8,168
Memory B cell T cell doublet CD3D > 0.4 6,610
Memory CD4 T cell Erythrocyte doublet HBB > 0.4 5,840
Memory CD4 T cell Myeloid doublet FCN1 > 0.2 13,774
Memory CD8 T cell B cell doublet MS4A1 > 0.4 3,100
Memory CD8 T cell Myeloid doublet FCN1 > 0.2 10,487
Memory CD8 T cell Platelet doublet PPBP > 0.2 16,354
Naive B cell Erythrocyte doublet HBB > 0.4 1,670
Naive B cell Myeloid doublet FCN1 > 0.2 1,005
Naive B cell Platelet doublet PPBP > 0.2 26,396
Naive B cell T cell doublet CD3D > 0.4 7,946
Naive CD4 T cell B cell doublet MS4A1 > 0.2 4,040
Naive CD4 T cell Erythrocyte doublet HBB > 0.4 19,387
Naive CD4 T cell Myeloid doublet FCN1 > 0.2 22,122
Naive CD4 T cell Platelet doublet PPBP > 0.2 49,784
Naive CD8 T cell B cell doublet MS4A1 > 0.2 1,519
Naive CD8 T cell Erythrocyte doublet HBB > 0.4 29,596
Naive CD8 T cell Myeloid doublet FCN1 > 0.2 3,723
Naive CD8 T cell Platelet doublet PPBP > 0.2 21,018
pDC B cell doublet MS4A1 > 0.2 233
pDC Erythrocyte doublet HBB > 0.4 247
pDC Myeloid doublet FCN1 > 0.2 1,367
pDC Platelet doublet PPBP > 0.2 2,869
pDC T cell doublet CD3D > 0.4 3,472
Plasma cell Erythrocyte doublet HBB > 0.4 53
Plasma cell Myeloid doublet FCN1 > 0.2 50
Plasma cell Platelet doublet PPBP > 0.4 39
Plasma cell T cell doublet CD3D > 0.4 1,956
Platelet B cell doublet MS4A1 > 0.4 467
Platelet Erythrocyte doublet HBB > 0.4 3,345
Platelet Myeloid doublet FCN1 > 0.2 4,284
Platelet T cell doublet CD3D > 0.4 5,367
Progenitor cell B cell doublet MS4A1 > 0.4 61
Progenitor cell Erythrocyte doublet HBB > 0.2 437
Progenitor cell Myeloid doublet FCN1 > 0.2 108
Progenitor cell T cell doublet CD3E > 0.2 1,380
Proliferating NK cell B cell doublet MS4A1 > 0.4 102
Proliferating NK cell Erythrocyte doublet HBB > 0.4 63
Proliferating NK cell Platelet doublet PPBP > 0.2 85
Proliferating NK cell T cell doublet CD3D > 0.4 1,311
Proliferating T cell B cell doublet MS4A1 > 0.4 117
Proliferating T cell Myeloid doublet FCN1 > 0.2 901
Proliferating T cell Platelet doublet PPBP > 0.2 186
Transitional B cell Erythrocyte doublet HBB > 0.4 446
Transitional B cell Platelet doublet PPBP > 0.2 1,235
Transitional B cell T cell doublet CD3D > 0.4 1,885
Treg B cell doublet MS4A1 > 0.4 538
Treg Erythrocyte doublet HBB > 0.4 1,765
Treg Myeloid doublet FCN1 > 0.2 2,529

In addition to these filters based on fraction of gene detection, we included an additional filter for specific subsets based on mean gene expression of HBB where this better distinguished doublet populations that included Erythrocyte contamination:

Population Reason Threshhold N cells
CD14 monocyte Erythrocyte doublet HBB > 1 7,190
CD56dim NK cell Erythrocyte doublet HBB > 0.7 5,004
Memory CD8 T cell Erythrocyte doublet HBB > 0.7 7,711
Proliferating T cell Erythrocyte doublet HBB > 0.7 68

 

Finally, we also removed clusters with low gene detection based on a single threshold: If > 30% of cells in a cluster had < 750 genes detected, the cluster was removed as low quality. Due to their generally lower gene detection, this threshold was not applied to Erythrocyte or Platetelet populations.

Population Reason N cells
CD14 monocyte Low gene detection 5,269
CD56bright NK cell Low gene detection 498
CD56dim NK cell Low gene detection 11,835
DN T cell Low gene detection 370
MAIT Low gene detection 283
Memory B cell Low gene detection 1,092
Memory CD4 T cell Low gene detection 9,538
Memory CD8 T cell Low gene detection 18,535
Naive B cell Low gene detection 2,860
Naive CD4 T cell Low gene detection 48,028
Naive CD8 T cell Low gene detection 1,020
Proliferating NK cell Low gene detection 415
Treg Low gene detection 8,608
gdT Low gene detection 772

 

The table below presents the total number of cells removed for each reason listed above. In total, 800,059 cells (5.39%) were removed by this filtering process, and 14,029,963 cells (94.61%) were retained for final clean up based on high-resolution AIFI_L3 labels.

Reason N cells
B cell doublet 29,551 (0.20%)
Erythrocyte doublet 92,621 (0.62%)
Low gene detection 109,123 (0.74%)
Myeloid doublet 88,380 (0.60%)
Platelet doublet 238,608 (1.61%)
T cell doublet 241,776 (1.63%)
Total removed 800,059 (5.39%)

 

Clustering and refinement of high-resolution labels

View Notebook: 09-Python_partition_L3_data.ipynb

After removal of doublets at the AIFI_L2 label resolution, we assembled cells from all samples for each subset in the high-resolution AIFI_L3 labeling results. Separately for each subset, we performed the scanpy data processing procedure described above to enable a final review of high resolution labels. Each subset was examined to identify remaining doublet clusters or clusters of cells that appeared to be mislabeled based on marker gene expression in collaboration with immunological domain experts.

L3 subset clustering

View Notebook: 10a-Python_cluster_L3_b_cell_data.ipynb
View Notebook: 10b-Python_cluster_L3_cd4_t_cell_data.ipynb
View Notebook: 10c-Python_cluster_L3_cd8_t_cell_data.ipynb
View Notebook: 10d-Python_cluster_L3_other_t_data.ipynb
View Notebook: 10e-Python_cluster_L3_myeloid_data.ipynb
View Notebook: 10f-Python_cluster_L3_nk_data.ipynb
View Notebook: 10g-Python_cluster_L3_other_data.ipynb

L3 clustering review

View Notebook: 11a-Python_review_L3_b_cell_data.ipynb
View Notebook: 11b-Python_review_L3_cd4_t_cell_data.ipynb
View Notebook: 11c-Python_review_L3_cd8_t_cell_data.ipynb
View Notebook: 11d-Python_review_L3_other_t_cell_data.ipynb
View Notebook: 11e-Python_review_L3_myeloid_data.ipynb
View Notebook: 11f-Python_review_L3_nk_data.ipynb
View Notebook: 11g-Python_review_L3_other_data.ipynb

L3 filtering

View Notebook: 12a-Python_filter_L3_b_cell_data.ipynb
View Notebook: 12b-Python_filter_L3_cd4_t_cell_data.ipynb
View Notebook: 12c-Python_filter_L3_cd8_t_cell_data.ipynb
View Notebook: 12d-Python_filter_L3_other_t_cell_data.ipynb
View Notebook: 12e-Python_filter_L3_myeloid_data.ipynb
View Notebook: 12f-Python_filter_L3_nk_data.ipynb
View Notebook: 12g-Python_filter_L3_other_data.ipynb

Final review of L3 labels

View Notebook: 13a-Python_review_filtered_L3_b_cell_data.ipynb
View Notebook: 13b-Python_review_filtered_L3_cd4_t_cell_data.ipynb
View Notebook: 13c-Python_review_filtered_L3_cd8_t_cell_data.ipynb
View Notebook: 13d-Python_review_filtered_L3_other_t_cell_data.ipynb
View Notebook: 13e-Python_review_filtered_L3_myeloid_data.ipynb
View Notebook: 13f-Python_review_filtered_L3_nk_data.ipynb
View Notebook: 13g-Python_review_filtered_L3_other_data.ipynb

During this process, the following additional cells were removed from downstream analysis:

CellTypist Population Reason N cells (%)
ASDC T cell doublet 3 (0.08%)
Adaptive NK cell Erythrocyte doublet 1816 (1.69%)
Adaptive NK cell Myeloid doublet 21 (0.02%)
Adaptive NK cell T cell doublet 2503 (2.33%)
BaEoMaP cell B cell doublet 5 (0.81%)
BaEoMaP cell T cell doublet 30 (4.89%)
C1Q+ CD16 monocyte CD14 doublet 349 (1.11%)
CD4 MAIT Doublets 6206 (46.98%)
CD56bright NK cell Erythrocyte doublet 837 (0.95%)
CD56bright NK cell T cell doublet 484 (0.55%)
CD8 MAIT Mislabeled GZMH+ 3346 (0.96%)
CD95 memory B cell T cell doublet 10 (0.07%)
CLP cell DC doublet 18 (0.9%)
CM CD4 T cell Low quality 4649 (0.33%)
Core CD16 monocyte CD14/CD16 doublet 8830 (3.18%)
Core naive B cell Myeloid doublet 7226 (1.1%)
Core naive CD4 T cell Erythrocyte doublet 1107 (0.04%)
Core naive CD4 T cell Low quality 8257 (0.3%)
DN T cell B cell Doublet 42 (0.25%)
Early memory B cell T cell doublet 295 (0.1%)
Erythrocyte Platelet doublet 166 (0.94%)
GZMB+ Vd2 gdT Erythrocyte doublet 13 (0.01%)
GZMB- CD27- EM CD4 T cell B cell doublet 2797 (0.52%)
GZMB- CD27- EM CD4 T cell Low quality 3180 (0.6%)
GZMK+ CD56dim NK cell Erythrocyte doublet 2439 (2.46%)
GZMK+ CD56dim NK cell Myeloid doublet 38 (0.04%)
GZMK+ CD56dim NK cell T cell doublet 1542 (1.55%)
GZMK+ Vd2 gdT CD8A Doublet 2365 (1.83%)
GZMK+ Vd2 gdT Doublets 4978 (3.85%)
GZMK- CD56dim NK cell Erythrocyte doublet 28560 (3.49%)
GZMK- CD56dim NK cell T cell doublet 896 (0.11%)
IL1B+ CD14 monocyte Erythrocyte doublet 87 (0.39%)
IL1B+ CD14 monocyte Low quality 25 (0.11%)
IL1B+ CD14 monocyte Platelet doublet 273 (1.21%)
IL1B+ CD14 monocyte T cell doublet 45 (0.2%)
ILC T cell doublet 105 (1.86%)
ISG+ CD14 monocyte Mislabeled IFI27+ 3776 (1.3%)
ISG+ CD56dim NK cell T cell doublet 824 (4.43%)
ISG+ memory CD4 T cell CD8+ cells 1308 (3.54%)
ISG+ memory CD4 T cell Myeloid doublet 37 (0.1%)
ISG+ naive B cell Mislabeled Non-Naive 998 (4.25%)
ISG+ naive B cell Myeloid doublet 138 (0.59%)
ISG+ naive B cell T cell doublet 107 (0.46%)
ISG+ naive CD4 T cell CD8 doublet 363 (0.83%)
ISG+ naive CD4 T cell Myeloid doublet 239 (0.55%)
KLRB1+ memory CD4 Treg Mislabeled CCL5+ 276 (1.32%)
KLRB1+ memory CD8 Treg Mislabeled GNLY+ 223 (5.37%)
KLRB1+ memory CD8 Treg Platelet doublet 42 (1.01%)
KLRF1+ effector Vd1 gdT abT/gdT Doublet 433 (1.11%)
KLRF1- GZMB+ CD27- EM CD8 T cell Erythrocyte doublet 8751 (1.76%)
KLRF1- effector Vd1 gdT Mislabeled TRDC- 2308 (8.78%)
KLRF1- effector Vd1 gdT abT/gdT doublet 226 (0.86%)
Naive Vd1 gdT Mislabeled TRDC- 1366 (7.76%)
Naive Vd1 gdT Mislabeled ZBTB16+ 564 (3.2%)
Plasma cell Mislabeled MS4A1+ 338 (2.33%)
Platelet Erythrocyte doublet 1665 (2.82%)
Platelet Myeloid doublet 105 (0.18%)
Platelet T cell doublet 31 (0.05%)
Proliferating NK cell Myeloid doublet 23 (0.12%)
Proliferating NK cell T cell doublet 323 (1.66%)
Proliferating T cell Mislabeled CD3D- 475 (2.58%)
SOX4+ Vd1 gdT Mislabeled TRDC- 597 (10.72%)
SOX4+ Vd1 gdT Mislabeled ZBTB16+ 144 (2.59%)
SOX4+ naive CD4 T cell CD8 doublet 1088 (1.09%)
SOX4+ naive CD4 T cell gdT doublet 1386 (1.39%)
SOX4+ naive CD8 T cell Mislabeled GZMB+ 106 (0.54%)


In total, 234,117 cells were removed at this stage of label refinement, and 13,795,846 cells were retained for downstream analyses. In addition to cells that were removed from the dataset, we reassigned the following cells based on gene expression to generate the final label set:

CellTypist Population Reassigned Population N cells (%)
CD27+ effector B cell Core memory B cell 6279 (12.64%)
CM CD4 T cell CD8+ cells 61331 (4.31%)
Core CD14 monocyte CD14+ cDC2 10910 (0.62%)
Core CD16 monocyte Intermediate monocyte 15141 (5.44%)
Core naive B cell Core memory B cell 22320 (3.39%)
GZMB- CD27+ EM CD4 T cell CD8+ cells 17779 (2.92%)
GZMB- CD27- EM CD4 T cell CD8+ cells 27230 (5.1%)
GZMK- CD27+ EM CD8 T cell GZMK+ CD27+ EM CD8 T cell 2397 (6.82%)
HLA-DRhi cDC2 CD14+ cDC2 2494 (4.69%)
ISG+ CD16 monocyte ISG+ CD14 monocyte 256 (0.67%)
KLRB1+ memory CD4 Treg GZMK+ memory CD4 Treg 1071 (5.11%)
KLRF1+ effector Vd1 gdT GZMB+ Vd2 gdT 650 (1.66%)
KLRF1- GZMB+ CD27- EM CD8 T cell ISG+ memory CD8 T cell 6338 (1.27%)
KLRF1- GZMB+ CD27- memory CD4 T cell CD8+ cells 5974 (3.71%)
Memory CD4 Treg Proliferating T cell 1026 (0.71%)
Naive CD4 Treg Mislabeled GZMB+ 184 (0.47%)
Naive CD4 Treg Mislabeled GZMK+ 951 (2.43%)

In total, 182,331 cells were reassigned in this final pass of cell type refinement.

Population subset frequency analyses

View Notebook: 15-R_assemble_type_frequencies.ipynb

To analyze changes in subset abundance, we tabulated subset frequency for each sample in our dataset for high resolution (L3) label assignments. We also compute proportions of each subset per sample, which were used for Centered Log Ratio (CLR) transformations. To compare abundance between sets of subjects, we used rank-sum Wilcoxon tests on CLR-transformed values. We compared abundance differences between Age Groups (Young Adult vs Older Adult), CMV status (Negative vs Positive), and Biological Sex (Female vs Male) using the first Flu Vaccine time point (Flu Year 1 Day 0) for each subject. For comparisons between longitudinal time points in the flu vaccine series (Day 7 vs Day 0), we used paired signed-rank Wilcoxon tests.

Pseudobulk data assembly

We generated Pseudobulk data by aggregating results for cells with the same high resolution (L3) label within each sample. This aggregation was performed using multiple methods for downstream analysis: summing UMI counts per gene (used for DESeq2 analysis, below), taking the mean of normalized and log transformed values (for visualization).

Differential gene expression tests

We performed differential gene expression tests using the DESeq2 v1.42.0 (Love, Huber, and Anders, 2014) package in R v4.3.2. Prior to differential tests, genes were filtered based on a minimum of 10% gene detection across all single cells used for the comparison. Summed UMI counts from single cells were used for the comparisons. DESeq2 analyses used default parameters. For comparisons of groups of subjects at baseline, we used pseudobulk data from Flu Year 1 Day 0 samples. The DESeq2 design formula for these comparisons was `Aggregated Counts ~ Age + CMV + Sex`, and then contrasts were made for each factor to identify differential results. Genes were considered differentially expressed if their adjusted p-value was below 0.05 and their absolute fold change was greater than 0.1.
 

References

Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15: 550.
doi:10.1186/s13059-014-0550-8

Traag V, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. arXiv [cs.SI]. 2018. Available: http://arxiv.org/abs/1810.08473
doi:10.48550/arXiv.1810.08473

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0

Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281–291.e9.
doi:10.1016/j.cels.2018.11.005