Analysis methods

To build our longitudinally sampled PBMC scRNA-seq dataset from 95 healthy subjects, we utilized data and analysis environments within the Human Immune System Explorer (HISE) system to trace data processing, analytical code, and analysis environments from raw FASTQ data to a labeled, high-quality final dataset. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.

This analysis consisted of the following major steps:

  • Consistent pipeline processing of hashed 10x scRNA-seq data
  • Sample selection from HISE data storage
  • Labeling of cells using our CellTypist models and doublet scoring
  • Separation of cell classes (AIFI_L2 labels) and subclustering to identify cross-class doublets and low-quality clusters
  • Clustering and label refinement for high-resolution population subset labels (AIFI_L3 labels)
  • Assembly of final datasets
  • Differential expression tests

These steps and some additional analyses are available at: 
aifimmunology/sound-life-scrna-analysis

Pipeline processing

After sequencing, Gene Expression and Hash Tag Oligonucleotide libraries from pooled samples in our pipeline batches were demultiplexed and assembled as individual files for each biological sample. See our Multiplexed scRNA-seq pipeline documentation for additional details.

Data selection from HISE storage

View Notebook: 00-R_Sample_Selection.ipynb

To assemble the input data for our dataset, we selected all available samples from all subjects in our SoundLife Young Adult (age 25-35) and SoundLife Older Adult (age 55-65) cohorts. We then filtered to remove a batch of samples with previously identified quality problems (B004), as well as subjects with non-healthy or abnormal disease states recorded at the time of any of their sample collection visits. In cases where aliquots from the same blood draw were processed through our pipeline multiple times, we selected the most recently generated data (highest batch number).

In total, 868 samples were selected for use in our dataset (Young Adult, n = 49 subjects; n =  418 samples; Older Adult, n = 47 subjects, n = 450 samples), consisting of 15,781,886 cells before additional QC filtering.

View Notebook: 01-Python_retrieve_cmv_bmi.ipynb

In addition to processed data and sample metadata identification, we retrieved clinical lab results and clinical subject data to assemble CMV infection status for all subjects, as well as BMI for all adult subjects (Sound Life Cohorts).

Labeling and doublet detection

To add cell subset labels, we utilized CellTypist (v1.6.1, Dominguez Conde, et al.) and the CellTypist models we generated using our Immune Health PBMC Atlas dataset at 3 levels of subset label resolution (AIFI_L1, 9 broad cell classes; AIFI_L2, 29 intermediate cell classes, and AIFI_L3, 71 high-resolution cell classes) by following the approach described in the CellTypist reference documentation at https://celltypist.org.

View Notebook: 03-Python_doublet_detection.ipynb

While our cell hashing pipeline identifies doublets that occur between multiple samples, we are blind to doublets that occur with the same cell droplet. To assist in identifying these remaining doublets, we performed doublet detection using the scrublet package (v0.2.3, Wolock, et al.) as implemented in the scanpy.external module of scanpy (v1.9.6, Alexander Wolf, et al.).

View Notebook: 04-Python_assembly_of_subsets.ipynb

After subset labeling and doublet detection were performed on each sample, we assembled data, labels, scrublet calls, sample metadata, CMV status, and BMI data across 8 sets of samples defined by the cohort, biological sex, and CMV status of subjects to facilitate downstream analyses.

QC filtering

View Notebook: 05-Python_subset_qc_filtering.ipynb

After assembly of all cells across our samples, we filtered our data against a set of QC criteria designed to remove possible doublets (based on scrublet and high gene detection, > 5,000 genes), low-quality cells (based on low gene detection, < 200 genes), and dying or dead cells (based on mitochondrial UMI content, > 10%). Mitochondrial content was assessed using scanpy by identifying mitochondrial genes (prefixed with "MT-"), and using the scanpy function calculate_qc_metrics, as demonstrated in the scanpy tutorials (https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html). QC criteria were applied to the dataset in series, as shown in the table below. In total, 951,864 cells were removed (6.03%), and 14,830,022 cells passed QC filtering (93.97%).

QC Criteria Cells Removed Cells Remaining
Pre-QC filtering NA 15,781,886 (100%)
Scrublet doublet call 156,381 (0.99%) 15,625,505 (99.00%)
Mitochondrial UMIs > 10% 770,582 (4.88%) 14,854,923 (94.13%)
N Genes Detected < 200 6,302 (0.04%) 14,848,621 (94.09%)
N Genes Detected > 2,500 18,599 (0.12%) 14,830,022 (93.97%)
Total Removed 951,864 (6.03%) 14,830,022 (93.97%)

 

Clustering, class subsetting, and marker-based doublet removal

After QC filtering, all remaining cells were clustered within AIFI_L2 labels using a scanpy workflow that consisted of normalization using scanpy.pp.normalize_total with the parameter target_sum = 1e4; log transformation using scanpy.pp.log1p; highly variable gene detection using scanpy.pp.highly_variable_genes; scaling of highly variable gene data using sc.pp.scale; Dimensionality reduction using scanpy.tl.pca with the parameter svd_solver = 'arpack'; Cohort integration using the Harmony method (harmonypy v0.0.9; Korsunsky, et al.) implemented in scanpy.external.harmony_integrate with the parameter key = 'cohort.cohortGuid'; Nearest neighbor calculation with scanpy.pp.neighbors and parameters n_neighbors = 50, use_rep = 'X_pca_harmony', and n_pcs = 30; UMAP embedding with scanpy.tl.umap; and Leiden clustering (leidenalg v0.10.1; Traag, et al.) using scanpy.tl.leiden. Unless specified, default parameters provided by scanpy v1.9.6 were used. 

For most AIFI_L2 cell classes, the processing and clustering described above was performed using cells from all samples. However, 5 cell classes consisting of > 1 million cells were processed separately as 8 sets of cells based on cohort, sex, and CMV status to make file handling and clustering processes more manageable: CD4 naive T cells, CD4 memory T cells, CD8 memory T cells, CD56dim NK cells, and CD14 Monocytes.

View Notebook: 07-Python_filter_cell_classes.ipynb
View Notebook: 07a-R_filter_review_plots.ipynb

After clustering, clusters were filtered to remove doublets based on marker gene expression. For example, CD4 naive T cell clusters with a high fraction of MS4A1 expression were removed as B cell doublets. For each cluster, the fraction of cells with detected gene expression for each marker was computed, and clusters meeting a set of cell class-specific criteria were removed. These thresholds are shown in the table below.

PopulationReasonThreshholdN cells
ASDCB cell doubletMS4A1 > 0.218
ASDCPlatelet doubletPPBP > 0.437
ASDCT cell doubletCD3E > 0.4303
CD14 monocyteB cell doubletMS4A1 > 0.26,076
CD14 monocytePlatelet doubletPPBP > 0.456,817
CD14 monocyteT cell doubletCD3E > 0.1 OR IL7R > 0.2115,867
CD16 monocyteB cell doubletMS4A1 > 0.44,366
CD16 monocyteErythrocyte doubletHBB > 0.21,677
CD16 monocytePlatelet doubletPPBP > 0.415,564
CD16 monocyteT cell doubletCD3E > 0.234,617
CD56bright NK cellB cell doubletMS4A1 > 0.4444
CD56bright NK cellErythrocyte doubletHBB > 0.2380
CD56bright NK cellMyeloid doubletFCN1 > 0.4738
CD56bright NK cellPlatelet doubletPPBP > 0.41,277
CD56bright NK cellT cell doubletCD3D > 0.44,863
CD56dim NK cellB cell doubletMS4A1 > 0.44,861
CD56dim NK cellMyeloid doubletFCN1 > 0.410,610
CD56dim NK cellPlatelet doubletPPBP > 0.419,301
CD56dim NK cellT cell doubletIL7R > 0.430,614
CD8aaMyeloid doubletFCN1 > 0.2139
CD8aaPlatelet doubletPPBP > 0.488
cDC1Platelet doubletPPBP > 0.2158
cDC1T cell doubletCD3D > 0.2 OR IL7R > 0.2821
cDC2Erythrocyte doubletHBB > 0.4391
cDC2Platelet doubletPPBP > 0.21,440
cDC2T cell doubletCD3D > 0.210,075
DN T cellErythrocyte doubletHBB > 0.453
DN T cellMyeloid doubletFCN1 > 0.2144
DN T cellPlatelet doubletPPBP > 0.498
Effector B cellErythrocyte doubletHBB > 0.4333
Effector B cellPlatelet doubletPPBP > 0.4641
Effector B cellT cell doubletCD3D > 0.2 OR IL7R > 0.24,751
ErythrocyteB cell doubletMS4A1 > 0.41,985
ErythrocyteMyeloid doubletFCN1 > 0.23,138
ErythrocytePlatelet doubletPPBP > 0.42,069
gdTErythrocyte doubletHBB > 0.2595
gdTMyeloid doubletFCN1 > 0.22,381
gdTPlatelet doubletPPBP > 0.44,822
ILCErythrocyte doubletHBB > 0.48
ILCPlatelet doubletPPBP > 0.261
ILCT cell doubletCD3D > 0.4311
Intermediate monocyteB cell doubletMS4A1 > 0.4342
Intermediate monocyteErythrocyte doubletHBB > 0.4414
Intermediate monocytePlatelet doubletPPBP > 0.21,928
Intermediate monocyteT cell doubletCD3E > 0.29,627
MAITB cell doubletMS4A1 > 0.41,282
MAITErythrocyte doubletHBB > 0.41,717
MAITMyeloid doubletFCN1 > 0.24,154
MAITPlatelet doubletPPBP > 0.28,173
Memory B cellErythrocyte doubletHBB > 0.44,231
Memory B cellMyeloid doubletFCN1 > 0.26,726
Memory B cellPlatelet doubletPPBP > 0.28,168
Memory B cellT cell doubletCD3D > 0.46,610
Memory CD4 T cellErythrocyte doubletHBB > 0.45,840
Memory CD4 T cellMyeloid doubletFCN1 > 0.213,774
Memory CD8 T cellB cell doubletMS4A1 > 0.43,100
Memory CD8 T cellMyeloid doubletFCN1 > 0.210,487
Memory CD8 T cellPlatelet doubletPPBP > 0.216,354
Naive B cellErythrocyte doubletHBB > 0.41,670
Naive B cellMyeloid doubletFCN1 > 0.21,005
Naive B cellPlatelet doubletPPBP > 0.226,396
Naive B cellT cell doubletCD3D > 0.47,946
Naive CD4 T cellB cell doubletMS4A1 > 0.24,040
Naive CD4 T cellErythrocyte doubletHBB > 0.419,387
Naive CD4 T cellMyeloid doubletFCN1 > 0.222,122
Naive CD4 T cellPlatelet doubletPPBP > 0.249,784
Naive CD8 T cellB cell doubletMS4A1 > 0.21,519
Naive CD8 T cellErythrocyte doubletHBB > 0.429,596
Naive CD8 T cellMyeloid doubletFCN1 > 0.23,723
Naive CD8 T cellPlatelet doubletPPBP > 0.221,018
pDCB cell doubletMS4A1 > 0.2233
pDCErythrocyte doubletHBB > 0.4247
pDCMyeloid doubletFCN1 > 0.21,367
pDCPlatelet doubletPPBP > 0.22,869
pDCT cell doubletCD3D > 0.43,472
Plasma cellErythrocyte doubletHBB > 0.453
Plasma cellMyeloid doubletFCN1 > 0.250
Plasma cellPlatelet doubletPPBP > 0.439
Plasma cellT cell doubletCD3D > 0.41,956
PlateletB cell doubletMS4A1 > 0.4467
PlateletErythrocyte doubletHBB > 0.43,345
PlateletMyeloid doubletFCN1 > 0.24,284
PlateletT cell doubletCD3D > 0.45,367
Progenitor cellB cell doubletMS4A1 > 0.461
Progenitor cellErythrocyte doubletHBB > 0.2437
Progenitor cellMyeloid doubletFCN1 > 0.2108
Progenitor cellT cell doubletCD3E > 0.21,380
Proliferating NK cellB cell doubletMS4A1 > 0.4102
Proliferating NK cellErythrocyte doubletHBB > 0.463
Proliferating NK cellPlatelet doubletPPBP > 0.285
Proliferating NK cellT cell doubletCD3D > 0.41,311
Proliferating T cellB cell doubletMS4A1 > 0.4117
Proliferating T cellMyeloid doubletFCN1 > 0.2901
Proliferating T cellPlatelet doubletPPBP > 0.2186
Transitional B cellErythrocyte doubletHBB > 0.4446
Transitional B cellPlatelet doubletPPBP > 0.21,235
Transitional B cellT cell doubletCD3D > 0.41,885
TregB cell doubletMS4A1 > 0.4538
TregErythrocyte doubletHBB > 0.41,765
TregMyeloid doubletFCN1 > 0.22,529

In addition to these filters based on fraction of gene detection, we included an additional filter for specific subsets based on mean gene expression of HBB where this better distinguished doublet populations that included Erythrocyte contamination:

Population Reason Threshhold N cells
CD14 monocyte Erythrocyte doublet HBB > 1 7,190
CD56dim NK cell Erythrocyte doublet HBB > 0.7 5,004
Memory CD8 T cell Erythrocyte doublet HBB > 0.7 7,711
Proliferating T cell Erythrocyte doublet HBB > 0.7 68

 

Finally, we also removed clusters with low gene detection based on a single threshold: If > 30% of cells in a cluster had < 750 genes detected, the cluster was removed as low quality. Due to their generally lower gene detection, this threshold was not applied to Erythrocyte or Platetelet populations.

Population Reason N cells
CD14 monocyte Low gene detection 5,269
CD56bright NK cell Low gene detection 498
CD56dim NK cell Low gene detection 11,835
DN T cell Low gene detection 370
MAIT Low gene detection 283
Memory B cell Low gene detection 1,092
Memory CD4 T cell Low gene detection 9,538
Memory CD8 T cell Low gene detection 18,535
Naive B cell Low gene detection 2,860
Naive CD4 T cell Low gene detection 48,028
Naive CD8 T cell Low gene detection 1,020
Proliferating NK cell Low gene detection 415
Treg Low gene detection 8,608
gdT Low gene detection 772

 

The table below presents the total number of cells removed for each reason listed above. In total, 800,059 cells (5.39%) were removed by this filtering process, and 14,029,963 cells (94.61%) were retained for final clean up based on high-resolution AIFI_L3 labels.

Reason N cells
B cell doublet 29,551 (0.20%)
Erythrocyte doublet 92,621 (0.62%)
Low gene detection 109,123 (0.74%)
Myeloid doublet 88,380 (0.60%)
Platelet doublet 238,608 (1.61%)
T cell doublet 241,776 (1.63%)
Total removed 800,059 (5.39%)

 

Clustering and refinement of high-resolution labels

View Notebook: 09-Python_partition_L3_data.ipynb

After removal of doublets at the AIFI_L2 label resolution, we assembled cells from all samples for each subset in the high-resolution AIFI_L3 labeling results. Separately for each subset, we performed the scanpy data processing procedure described above to enable a final review of high resolution labels. Each subset was examined to identify remaining doublet clusters or clusters of cells that appeared to be mislabeled based on marker gene expression in collaboration with immunological domain experts.

L3 subset clustering

View Notebook: 10a-Python_cluster_L3_b_cell_data.ipynb
View Notebook: 10b-Python_cluster_L3_cd4_t_cell_data.ipynb
View Notebook: 10c-Python_cluster_L3_cd8_t_cell_data.ipynb
View Notebook: 10d-Python_cluster_L3_other_t_data.ipynb
View Notebook: 10e-Python_cluster_L3_myeloid_data.ipynb
View Notebook: 10f-Python_cluster_L3_nk_data.ipynb
View Notebook: 10g-Python_cluster_L3_other_data.ipynb

L3 clustering review

View Notebook: 11a-Python_review_L3_b_cell_data.ipynb
View Notebook: 11b-Python_review_L3_cd4_t_cell_data.ipynb
View Notebook: 11c-Python_review_L3_cd8_t_cell_data.ipynb
View Notebook: 11d-Python_review_L3_other_t_cell_data.ipynb
View Notebook: 11e-Python_review_L3_myeloid_data.ipynb
View Notebook: 11f-Python_review_L3_nk_data.ipynb
View Notebook: 11g-Python_review_L3_other_data.ipynb

L3 filtering

View Notebook: 12a-Python_filter_L3_b_cell_data.ipynb
View Notebook: 12b-Python_filter_L3_cd4_t_cell_data.ipynb
View Notebook: 12c-Python_filter_L3_cd8_t_cell_data.ipynb
View Notebook: 12d-Python_filter_L3_other_t_cell_data.ipynb
View Notebook: 12e-Python_filter_L3_myeloid_data.ipynb
View Notebook: 12f-Python_filter_L3_nk_data.ipynb
View Notebook: 12g-Python_filter_L3_other_data.ipynb

Final review of L3 labels

View Notebook: 13a-Python_review_filtered_L3_b_cell_data.ipynb
View Notebook: 13b-Python_review_filtered_L3_cd4_t_cell_data.ipynb
View Notebook: 13c-Python_review_filtered_L3_cd8_t_cell_data.ipynb
View Notebook: 13d-Python_review_filtered_L3_other_t_cell_data.ipynb
View Notebook: 13e-Python_review_filtered_L3_myeloid_data.ipynb
View Notebook: 13f-Python_review_filtered_L3_nk_data.ipynb
View Notebook: 13g-Python_review_filtered_L3_other_data.ipynb

During this process, the following additional cells were removed from downstream analysis:

CellTypist PopulationReasonN cells (%)
ASDCT cell doublet3 (0.08%)
Adaptive NK cellErythrocyte doublet1816 (1.69%)
Adaptive NK cellMyeloid doublet21 (0.02%)
Adaptive NK cellT cell doublet2503 (2.33%)
BaEoMaP cellB cell doublet5 (0.81%)
BaEoMaP cellT cell doublet30 (4.89%)
C1Q+ CD16 monocyteCD14 doublet349 (1.11%)
CD4 MAITDoublets6206 (46.98%)
CD56bright NK cellErythrocyte doublet837 (0.95%)
CD56bright NK cellT cell doublet484 (0.55%)
CD8 MAITMislabeled GZMH+3346 (0.96%)
CD95 memory B cellT cell doublet10 (0.07%)
CLP cellDC doublet18 (0.9%)
CM CD4 T cellLow quality4649 (0.33%)
Core CD16 monocyteCD14/CD16 doublet8830 (3.18%)
Core naive B cellMyeloid doublet7226 (1.1%)
Core naive CD4 T cellErythrocyte doublet1107 (0.04%)
Core naive CD4 T cellLow quality8257 (0.3%)
DN T cellB cell Doublet42 (0.25%)
Early memory B cellT cell doublet295 (0.1%)
ErythrocytePlatelet doublet166 (0.94%)
GZMB+ Vd2 gdTErythrocyte doublet13 (0.01%)
GZMB- CD27- EM CD4 T cellB cell doublet2797 (0.52%)
GZMB- CD27- EM CD4 T cellLow quality3180 (0.6%)
GZMK+ CD56dim NK cellErythrocyte doublet2439 (2.46%)
GZMK+ CD56dim NK cellMyeloid doublet38 (0.04%)
GZMK+ CD56dim NK cellT cell doublet1542 (1.55%)
GZMK+ Vd2 gdTCD8A Doublet2365 (1.83%)
GZMK+ Vd2 gdTDoublets4978 (3.85%)
GZMK- CD56dim NK cellErythrocyte doublet28560 (3.49%)
GZMK- CD56dim NK cellT cell doublet896 (0.11%)
IL1B+ CD14 monocyteErythrocyte doublet87 (0.39%)
IL1B+ CD14 monocyteLow quality25 (0.11%)
IL1B+ CD14 monocytePlatelet doublet273 (1.21%)
IL1B+ CD14 monocyteT cell doublet45 (0.2%)
ILCT cell doublet105 (1.86%)
ISG+ CD14 monocyteMislabeled IFI27+3776 (1.3%)
ISG+ CD56dim NK cellT cell doublet824 (4.43%)
ISG+ memory CD4 T cellCD8+ cells1308 (3.54%)
ISG+ memory CD4 T cellMyeloid doublet37 (0.1%)
ISG+ naive B cellMislabeled Non-Naive998 (4.25%)
ISG+ naive B cellMyeloid doublet138 (0.59%)
ISG+ naive B cellT cell doublet107 (0.46%)
ISG+ naive CD4 T cellCD8 doublet363 (0.83%)
ISG+ naive CD4 T cellMyeloid doublet239 (0.55%)
KLRB1+ memory CD4 TregMislabeled CCL5+276 (1.32%)
KLRB1+ memory CD8 TregMislabeled GNLY+223 (5.37%)
KLRB1+ memory CD8 TregPlatelet doublet42 (1.01%)
KLRF1+ effector Vd1 gdTabT/gdT Doublet433 (1.11%)
KLRF1- GZMB+ CD27- EM CD8 T cellErythrocyte doublet8751 (1.76%)
KLRF1- effector Vd1 gdTMislabeled TRDC-2308 (8.78%)
KLRF1- effector Vd1 gdTabT/gdT doublet226 (0.86%)
Naive Vd1 gdTMislabeled TRDC-1366 (7.76%)
Naive Vd1 gdTMislabeled ZBTB16+564 (3.2%)
Plasma cellMislabeled MS4A1+338 (2.33%)
PlateletErythrocyte doublet1665 (2.82%)
PlateletMyeloid doublet105 (0.18%)
PlateletT cell doublet31 (0.05%)
Proliferating NK cellMyeloid doublet23 (0.12%)
Proliferating NK cellT cell doublet323 (1.66%)
Proliferating T cellMislabeled CD3D-475 (2.58%)
SOX4+ Vd1 gdTMislabeled TRDC-597 (10.72%)
SOX4+ Vd1 gdTMislabeled ZBTB16+144 (2.59%)
SOX4+ naive CD4 T cellCD8 doublet1088 (1.09%)
SOX4+ naive CD4 T cellgdT doublet1386 (1.39%)
SOX4+ naive CD8 T cellMislabeled GZMB+106 (0.54%)


In total, 234,117 cells were removed at this stage of label refinement, and 13,795,846 cells were retained for downstream analyses. In addition to cells that were removed from the dataset, we reassigned the following cells based on gene expression to generate the final label set:

CellTypist Population Reassigned Population N cells (%)
CD27+ effector B cell Core memory B cell 6279 (12.64%)
CM CD4 T cell CD8+ cells 61331 (4.31%)
Core CD14 monocyte CD14+ cDC2 10910 (0.62%)
Core CD16 monocyte Intermediate monocyte 15141 (5.44%)
Core naive B cell Core memory B cell 22320 (3.39%)
GZMB- CD27+ EM CD4 T cell CD8+ cells 17779 (2.92%)
GZMB- CD27- EM CD4 T cell CD8+ cells 27230 (5.1%)
GZMK- CD27+ EM CD8 T cell GZMK+ CD27+ EM CD8 T cell 2397 (6.82%)
HLA-DRhi cDC2 CD14+ cDC2 2494 (4.69%)
ISG+ CD16 monocyte ISG+ CD14 monocyte 256 (0.67%)
KLRB1+ memory CD4 Treg GZMK+ memory CD4 Treg 1071 (5.11%)
KLRF1+ effector Vd1 gdT GZMB+ Vd2 gdT 650 (1.66%)
KLRF1- GZMB+ CD27- EM CD8 T cell ISG+ memory CD8 T cell 6338 (1.27%)
KLRF1- GZMB+ CD27- memory CD4 T cell CD8+ cells 5974 (3.71%)
Memory CD4 Treg Proliferating T cell 1026 (0.71%)
Naive CD4 Treg Mislabeled GZMB+ 184 (0.47%)
Naive CD4 Treg Mislabeled GZMK+ 951 (2.43%)

In total, 182,331 cells were reassigned in this final pass of cell type refinement.

Population subset frequency analyses

View Notebook: 15-R_assemble_type_frequencies.ipynb

To analyze changes in subset abundance, we tabulated subset frequency for each sample in our dataset for high resolution (L3) label assignments. We also compute proportions of each subset per sample, which were used for Centered Log Ratio (CLR) transformations. To compare abundance between sets of subjects, we used rank-sum Wilcoxon tests on CLR-transformed values. We compared abundance differences between Age Groups (Young Adult vs Older Adult), CMV status (Negative vs Positive), and Biological Sex (Female vs Male) using the first Flu Vaccine time point (Flu Year 1 Day 0) for each subject. For comparisons between longitudinal time points in the flu vaccine series (Day 7 vs Day 0), we used paired signed-rank Wilcoxon tests.

Pseudobulk data assembly

We generated Pseudobulk data by aggregating results for cells with the same high resolution (L3) label within each sample. This aggregation was performed using multiple methods for downstream analysis: summing UMI counts per gene (used for DESeq2 analysis, below), taking the mean of normalized and log transformed values (for visualization).

Differential gene expression tests

We performed differential gene expression tests using the DESeq2 v1.42.0 (Love, Huber, and Anders, 2014) package in R v4.3.2. Prior to differential tests, genes were filtered based on a minimum of 10% gene detection across all single cells used for the comparison. Summed UMI counts from single cells were used for the comparisons. DESeq2 analyses used default parameters. For comparisons of groups of subjects at baseline, we used pseudobulk data from Flu Year 1 Day 0 samples. The DESeq2 design formula for these comparisons was `Aggregated Counts ~ Age + CMV + Sex`, and then contrasts were made for each factor to identify differential results. Genes were considered differentially expressed if their adjusted p-value was below 0.05 and their absolute fold change was greater than 0.1.
 

References

Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15: 550.
doi:10.1186/s13059-014-0550-8

Traag V, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing well-connected communities. arXiv [cs.SI]. 2018. Available: http://arxiv.org/abs/1810.08473
doi:10.48550/arXiv.1810.08473

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0

Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281–291.e9.
doi:10.1016/j.cels.2018.11.005