Overview

scRNA-seq data was generated on the 10x Genomics 3' scRNA-seq Platform (v3.1). For data collection and processing details, see the Cohorts, Experimental Methods, and Data Analysis Methods sections.

Below, we provide labeled and annotated PBMC scRNA-seq data from our healthy adult cohorts. Because this is a large dataset, we have divided the data into multiple files to make data transfer practical.

All .h5ad files for this project contain sample and subject metadata, as well as cell-level cell type labels and QC metrics. The following values are stored in the .obs section of these .h5ad files as descriptions of observations:

Sample Identifiers
cohort.cohortGuid: A Globally Unique Identifier (GUID) for the Cohort the Subject enrolled in for our study
subject.subjectGuid: A GUID for the Subject
sample.sampleKitGuid: A GUID for the Sample Kit, representing all material collected at a visit
specimen.specimenGuid: A GUID for the specific aliquot used for the experiment
pipeline.fileGuid: A GUID for the specific analysis pipeline output file used for analysis

Subject Metadata
subject.biologicalSex: The biological sex of the Subject
subject.birthYear: The Birth Year of the Subject
subject.ageAtFirstDraw: The Age of the Subject at their first on-study sample collection
subject.ageGroup: The Age Group of the Subject (Young Adult or Older Adult)
subject.race: The self-reported Race of the Subject
subject.ethnicity: The self-reported Ethnicity of the subject
subject.cmv: The CMV Status of the subject, as determined by an HCMV assay
subject.bmi: The BMI of the Subject

Sample Metadata
sample.visitName: The name of the study visit (i.e. time point)
sample.drawDate: The date of the study visit (Month and Year; e.g. 2021-03)
sample.subjectAgeAtDraw: The age of the Subject in years at the time of sample collection

Process Identifiers
batch_id: A GUID for the batch of samples processed together (e.g. B039)
pool_id: A GUID for the pool of samples combined for Cell Hashing (e.g. B039-P1)
chip_id: A GUID for the 10x Genomics chip the cells were loaded into (e.g. B039-P1C2)
well_id: A GUID for the 10x Genomics well the cells were loaded into within the chip (e.g. B039-P1C2W4)
*barcodes: A GUID for the individual cell
original_barcodes: The original, sequence-based barcode generated by 10x Genomics Cell Ranger software
cell_name: A quasi-unique, memorable cell identifier generated using an adjective-adjective-animal structure

*used as the primary cell index in our .h5ad files
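
The example values above suggest that the process identifiers are nested, so that a well_id carries its batch, pool, and chip. A minimal sketch of splitting a well_id on that basis follows; the pattern is inferred only from the examples shown here (B039, B039-P1, B039-P1C2, B039-P1C2W4), and the helper name split_well_id is hypothetical.

```python
import re

# Pattern inferred from the example identifiers in this section; this is an
# assumption and has not been checked against every identifier in the dataset.
WELL_ID_PATTERN = re.compile(
    r"^(?P<batch>B\d+)-(?P<pool>P\d+)(?P<chip>C\d+)(?P<well>W\d+)$"
)


def split_well_id(well_id):
    """Return the batch_id, pool_id, and chip_id implied by a well_id."""
    match = WELL_ID_PATTERN.match(well_id)
    if match is None:
        raise ValueError(f"Unexpected well_id format: {well_id!r}")
    batch_id = match["batch"]
    pool_id = f"{batch_id}-{match['pool']}"
    chip_id = f"{pool_id}{match['chip']}"
    return {
        "batch_id": batch_id,
        "pool_id": pool_id,
        "chip_id": chip_id,
        "well_id": well_id,
    }


print(split_well_id("B039-P1C2W4"))
```

Because batch_id, pool_id, and chip_id are already stored as separate .obs columns, a helper like this is mainly useful as a consistency check.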

Cell QC Metrics
n_reads: Number of reads assigned to the cell barcode
n_umis: Number of Unique Molecular Identifiers (unique molecules) detected
n_genes: Number of genes with at least 1 UMI detected
total_counts_mito: Total number of reads that were assigned to mitochondrial genes
pct_counts_mito: Percent of reads that were assigned to mitochondrial genes
doublet_score: Doublet score assigned by Scrublet for doublet detection

Cell Labeling Results
AIFI_L1: Final broad class cell type label (9 types)
AIFI_L1_score: AIFI_L1 prediction score generated by CellTypist
predicted_AIFI_L1: Predicted AIFI_L1 type assigned by CellTypist
AIFI_L2: Final mid resolution cell type label (29 types)
AIFI_L2_score: AIFI_L2 prediction score generated by CellTypist
predicted_AIFI_L2: Predicted AIFI_L2 type assigned by CellTypist
AIFI_L3: Final high resolution cell type label (71 types)
AIFI_L3_score: AIFI_L3 prediction score generated by CellTypist
predicted_AIFI_L3: Predicted AIFI_L3 type assigned by CellTypist
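
As a quick check after download, the .obs columns of a file can be compared against the descriptions above. The sketch below assumes the third-party anndata package is installed and opens the file in backed mode so that the expression matrix is not loaded into memory. The helper names are hypothetical, and barcodes is left out of the list because it is used as the cell index rather than as a column.

```python
# Column names as documented above. "barcodes" is omitted because it is the
# primary cell index of the .h5ad files, not an .obs column.
EXPECTED_OBS_COLUMNS = [
    "cohort.cohortGuid", "subject.subjectGuid", "sample.sampleKitGuid",
    "specimen.specimenGuid", "pipeline.fileGuid",
    "subject.biologicalSex", "subject.birthYear", "subject.ageAtFirstDraw",
    "subject.ageGroup", "subject.race", "subject.ethnicity", "subject.cmv",
    "subject.bmi",
    "sample.visitName", "sample.drawDate", "sample.subjectAgeAtDraw",
    "batch_id", "pool_id", "chip_id", "well_id",
    "original_barcodes", "cell_name",
    "n_reads", "n_umis", "n_genes",
    "total_counts_mito", "pct_counts_mito", "doublet_score",
    "AIFI_L1", "AIFI_L1_score", "predicted_AIFI_L1",
    "AIFI_L2", "AIFI_L2_score", "predicted_AIFI_L2",
    "AIFI_L3", "AIFI_L3_score", "predicted_AIFI_L3",
]


def missing_obs_columns(columns):
    """List the documented column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_OBS_COLUMNS if name not in present]


def check_h5ad(path):
    """Open an .h5ad file in backed (on-disk) mode and report missing columns."""
    import anndata as ad  # third-party; imported here so the helper above works without it

    adata = ad.read_h5ad(path, backed="r")
    try:
        return missing_obs_columns(adata.obs.columns)
    finally:
        adata.file.close()
```

For example, check_h5ad("SoundLife_other.h5ad") would return an empty list if every documented column is present in that file.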

Subject Group .h5ad files

Here, we group samples based on Cohort, Subject Biological Sex, and CMV status to generate subsets of data for use in analysis.

We are providing our scRNA-seq data in AnnData (.h5ad) format. For more details about AnnData, see the AnnData Documentation Page.

Each file provided below contains a subset of the full dataset of more than 13 million cells. Subject counts, sample counts, cell counts, and approximate file sizes are below:

File Name                                N Subjects  N Samples  N Cells    File Size
SoundLife_OlderAdult_Female_CMVneg.h5ad  10          91         1,350,748  21 GB
SoundLife_OlderAdult_Female_CMVpos.h5ad  17          163        2,596,111  41 GB
SoundLife_OlderAdult_Male_CMVneg.h5ad    12          116        1,750,565  29 GB
SoundLife_OlderAdult_Male_CMVpos.h5ad    8           80         1,322,061  21 GB
SoundLife_YoungAdult_Female_CMVneg.h5ad  18          162        2,616,824  41 GB
SoundLife_YoungAdult_Female_CMVpos.h5ad  10          76         1,234,234  12 GB
SoundLife_YoungAdult_Male_CMVneg.h5ad    12          107        1,712,244  28 GB
SoundLife_YoungAdult_Male_CMVpos.h5ad    9           73         1,206,761  12 GB

Sound Life Subject Group .h5ad Files
File Name    Description    Download Link
Sound_Life_OlderAdult_Female_CMVneg.h5ad Female CMV-negative Older Adult Subjects
Sound_Life_OlderAdult_Female_CMVpos.h5ad Female CMV-positive Older Adult Subjects
Sound_Life_OlderAdult_Male_CMVneg.h5ad Male CMV-negative Older Adult Subjects
Sound_Life_OlderAdult_Male_CMVpos.h5ad Male CMV-positive Older Adult Subjects
Sound_Life_YoungAdult_Female_CMVneg.h5ad Female CMV-negative Young Adult Subjects
Sound_Life_YoungAdult_Female_CMVpos.h5ad Female CMV-positive Young Adult Subjects
Sound_Life_YoungAdult_Male_CMVneg.h5ad Male CMV-negative Young Adult Subjects
Sound_Life_YoungAdult_Male_CMVpos.h5ad Male CMV-positive Young Adult Subjects

Cell population .h5ad files

Here, we group cells by major population category. These files contain cells from all samples.

We are providing our scRNA-seq data in AnnData (.h5ad) format. For more details about AnnData, see the AnnData Documentation Page.

Each file provided below contains a subset of the full dataset of more than 13 million cells. Cell counts and approximate file sizes are below:

File Name                    N Cells    File Size
SoundLife_b_plasma.h5ad      1,205,085  11 GB
SoundLife_dc_monocyte.h5ad   2,643,674  58 GB
SoundLife_nk.h5ad            1,114,975  11 GB
SoundLife_other.h5ad         94,244     0.6 GB
SoundLife_t_cd4_memory.h5ad  2,640,499  40 GB
SoundLife_t_cd4_naive.h5ad   2,839,099  38 GB
SoundLife_t_cd8.h5ad         2,264,102  34 GB
SoundLife_t_other.h5ad       987,870    9 GB
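
Since the subject group files and the cell population files are two ways of dividing the same dataset, their cell counts should add up to the same total. The short sketch below checks this using only the counts listed in the two tables on this page; the variable names are ours.

```python
# (N Subjects, N Samples, N Cells) for each subject group file, from the table above.
SUBJECT_GROUP_FILES = {
    "SoundLife_OlderAdult_Female_CMVneg.h5ad": (10, 91, 1_350_748),
    "SoundLife_OlderAdult_Female_CMVpos.h5ad": (17, 163, 2_596_111),
    "SoundLife_OlderAdult_Male_CMVneg.h5ad": (12, 116, 1_750_565),
    "SoundLife_OlderAdult_Male_CMVpos.h5ad": (8, 80, 1_322_061),
    "SoundLife_YoungAdult_Female_CMVneg.h5ad": (18, 162, 2_616_824),
    "SoundLife_YoungAdult_Female_CMVpos.h5ad": (10, 76, 1_234_234),
    "SoundLife_YoungAdult_Male_CMVneg.h5ad": (12, 107, 1_712_244),
    "SoundLife_YoungAdult_Male_CMVpos.h5ad": (9, 73, 1_206_761),
}

# N Cells for each cell population file, from the table above.
CELL_POPULATION_FILES = {
    "SoundLife_b_plasma.h5ad": 1_205_085,
    "SoundLife_dc_monocyte.h5ad": 2_643_674,
    "SoundLife_nk.h5ad": 1_114_975,
    "SoundLife_other.h5ad": 94_244,
    "SoundLife_t_cd4_memory.h5ad": 2_640_499,
    "SoundLife_t_cd4_naive.h5ad": 2_839_099,
    "SoundLife_t_cd8.h5ad": 2_264_102,
    "SoundLife_t_other.h5ad": 987_870,
}

total_samples = sum(n_samples for _, n_samples, _ in SUBJECT_GROUP_FILES.values())
subject_group_cells = sum(n_cells for _, _, n_cells in SUBJECT_GROUP_FILES.values())
cell_population_cells = sum(CELL_POPULATION_FILES.values())

print(f"Samples across subject group files: {total_samples:,}")
print(f"Cells across subject group files:   {subject_group_cells:,}")
print(f"Cells across cell population files: {cell_population_cells:,}")
```

With the counts listed here, both sets of files sum to 13,789,548 cells, and the subject group files cover 868 samples.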

Sound Life Cell Type Group .h5ad Files
File Name    Description    Download Link
SoundLife_b_plasma.h5ad
SoundLife_dc_monocyte.h5ad
SoundLife_nk.h5ad
SoundLife_other.h5ad
SoundLife_t_cd4_memory.h5ad
SoundLife_t_cd4_naive.h5ad
SoundLife_t_cd8.h5ad
SoundLife_t_other.h5ad

Cell type frequency data

After labeling cell types, we tabulated the cell count for each of the 868 samples utilized in our study at each level of resolution in our Immune Health Atlas. Below, we provide these cell counts, the fraction of counts for each sample, and the centered log ratio (CLR) transformation of those fractions that we utilized for our analyses.

In addition, we utilized the absolute lymphocyte counts (ALC) provided in our clinical blood count lab results for each sample (available below) to compute estimated cell type abundance based on normalization to ALC. For more details, see our Data Analysis Methods.

The columns in the cell type frequency tables are described below:

Sample Metadata columns
cohort.cohortGuid: Cohort ID (BR1 or BR2)
subject.subjectGuid: Subject ID
subject.biologicalSex: Subject Sex (Female or Male)
subject.cmv: Subject CMV Status (Negative or Positive)
subject.bmi: Subject BMI (integer)
subject.race: Subject race
subject.ethnicity: Subject ethnicity
subject.birthYear: Subject Birth Year
subject.ageAtFirstDraw: Subject Age at earliest blood draw in study
sample.sampleKitGuid: Sample Kit ID
sample.visitName: Sample Visit Name
sample.drawDate: Sample Draw Date (Year-Month)
sample.subjectAgeAtDraw: Subject age at time of draw, based on year of Draw Date and Birth Year
specimen.specimenGuid: Specimen ID (pbmc_sample_id in .h5 files)

Frequency-related columns
(for AIFI_L1 as an example; AIFI_L1 is replaced with AIFI_L2 and AIFI_L3 for those levels)

AIFI_L1: Cell Type assignment
AIFI_L1_count: Count of cells within this sample with cell type assignment
total_cells: Total cells within this sample
scrna.lymphocyte_count: Sum of T, NK, and B cells
bc.lymphocyte_count: Absolute Lymphocyte Count (ALC) from clinical Blood Counts (bc.)
alc_ratio: Ratio of ALC to scRNA Lymphocyte Count (bc.lymphocyte_count per scrna.lymphocyte_count)
AIFI_L1_frac_total: Count of cells with this cell type assignment divided by Total cells for this sample
AIFI_L1_alc: ALC estimate for this cell type assignment
AIFI_L1_clr: Centered Log Ratio computed using AIFI_L1_frac_total for all types within this sample
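
To make the relationship between these columns concrete, the sketch below derives them for a single sample. It rests on several assumptions that are not stated in the column descriptions: the ALC estimate is taken to be the cell count multiplied by alc_ratio, the CLR uses natural logarithms, and every cell type has at least one cell (how zero counts are handled is not specified here). The cell type names and numbers are hypothetical.

```python
import math


def frequency_metrics(counts, bc_lymphocyte_count, lymphocyte_types):
    """Derive per-cell-type frequency columns for one sample.

    counts: dict mapping cell type label -> number of cells in the sample.
    bc_lymphocyte_count: clinical absolute lymphocyte count (ALC).
    lymphocyte_types: labels counted as lymphocytes (T, NK, and B cells).
    """
    if any(n <= 0 for n in counts.values()):
        raise ValueError("This sketch assumes every cell type has a non-zero count")

    total_cells = sum(counts.values())
    scrna_lymphocyte_count = sum(counts[t] for t in lymphocyte_types)
    alc_ratio = bc_lymphocyte_count / scrna_lymphocyte_count

    fractions = {t: n / total_cells for t, n in counts.items()}
    # CLR: log of each fraction minus the mean log fraction across all types.
    mean_log = sum(math.log(f) for f in fractions.values()) / len(fractions)

    return {
        t: {
            "count": counts[t],
            "frac_total": fractions[t],
            "alc": counts[t] * alc_ratio,  # assumed form of the ALC estimate
            "clr": math.log(fractions[t]) - mean_log,
        }
        for t in counts
    }


# Hypothetical sample: labels and values are for illustration only.
example = frequency_metrics(
    counts={"T cell": 6000, "B cell": 1000, "NK cell": 1000, "Monocyte": 2000},
    bc_lymphocyte_count=2.0,
    lymphocyte_types=["T cell", "B cell", "NK cell"],
)
for cell_type, values in example.items():
    print(cell_type, values)
```

By construction, the CLR values within a sample sum to zero, which is a convenient sanity check when working with the provided tables.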

Sound Life scRNA-seq Cell Type Frequencies
File Name    Description    Download Link
sound_life_AIFI_L1_frequencies.csv
sound_life_AIFI_L2_frequencies.csv
sound_life_AIFI_L3_frequencies.csv
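
Many analyses need one row per sample and one column per cell type. The sketch below reshapes a frequency table into that form using only the Python standard library. It assumes the CSV files are in long format, with one row per sample and cell type, as the column descriptions above suggest; the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def clr_matrix(csv_file, level="AIFI_L1"):
    """Return {specimen ID: {cell type: CLR value}} from an open CSV file."""
    wide = defaultdict(dict)
    for row in csv.DictReader(csv_file):
        specimen = row["specimen.specimenGuid"]
        wide[specimen][row[level]] = float(row[f"{level}_clr"])
    return dict(wide)


# Hypothetical rows standing in for a downloaded frequency table.
example_csv = io.StringIO(
    "specimen.specimenGuid,AIFI_L1,AIFI_L1_clr\n"
    "SPEC-A,T cell,1.2\n"
    "SPEC-A,B cell,-1.2\n"
    "SPEC-B,T cell,0.8\n"
    "SPEC-B,B cell,-0.8\n"
)
matrix = clr_matrix(example_csv)
print(matrix)
```

With a downloaded file, the same function could be used as: with open("sound_life_AIFI_L1_frequencies.csv", newline="") as handle: matrix = clr_matrix(handle).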

Pseudobulk scRNA-seq data

To perform analysis and display data per cell type for each sample, we assembled pseudobulk expression values for each high resolution (AIFI_L3) cell type in our cell type labels.

For use with differential expression tests using the DESeq2 framework, we assembled total UMI counts for each sample and cell type. For display of average expression, we normalized and log-transformed scRNA-seq count data, then computed mean values for each sample and cell type.

Below, we provide matrices of pseudobulk expression for each of these metrics, as well as sample and cell type metadata that are necessary for analysis.

Sound Life Pseudobulk scRNA-seq Data
File Name    Description    Download Link
sound-life_pseudobulk_mean-log-norm.tar.gz
sound-life_pseudobulk_sum.tar.gz
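
The two pseudobulk metrics described above can be sketched as follows. This is an illustration rather than our pipeline code: the normalization is assumed to be counts scaled to 10,000 per cell followed by log1p, which is a common convention but is not specified in this section, and the input structure and function name are hypothetical.

```python
import math
from collections import defaultdict


def pseudobulk(cells, scale=10_000):
    """Aggregate cells into per-(sample, cell type) pseudobulk values.

    cells: iterable of (sample_id, cell_type, counts), where counts is a list
    of UMI counts per gene, in the same gene order for every cell.
    Returns (sums, means): summed raw UMI counts, and the mean of
    log-normalized values, each keyed by (sample_id, cell_type).
    """
    sums = {}
    log_totals = {}
    n_cells = defaultdict(int)

    for sample_id, cell_type, counts in cells:
        key = (sample_id, cell_type)
        depth = sum(counts)  # assumes each cell has at least one UMI
        log_norm = [math.log1p(c / depth * scale) for c in counts]
        if key not in sums:
            sums[key] = [0] * len(counts)
            log_totals[key] = [0.0] * len(counts)
        sums[key] = [a + b for a, b in zip(sums[key], counts)]
        log_totals[key] = [a + b for a, b in zip(log_totals[key], log_norm)]
        n_cells[key] += 1

    means = {
        key: [total / n_cells[key] for total in totals]
        for key, totals in log_totals.items()
    }
    return sums, means


# Hypothetical two-gene example with three cells.
sums, means = pseudobulk([
    ("S1", "T", [1, 1]),
    ("S1", "T", [3, 1]),
    ("S1", "B", [0, 2]),
])
print(sums)
print(means)
```

Summed UMI counts keep the integer scale that count-based tests such as DESeq2 expect, while the mean of log-normalized values is better suited to display.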

scRNA-seq Batch Controls and QC Reports

As part of our scRNA-seq pipeline, we include a batch control sample, which is an aliquot of PBMCs derived from a single leukapheresis draw. Here, we provide the batch control scRNA-seq data in .h5 format for use in batch comparisons to identify batch effects.

In total, this dataset contained samples from 99 scRNA-seq batches from 143 hashed sample pools. For additional details about our multiplexing and batching approach, see our Multiplexed scRNA-seq methods.

Sound Life Batch Control and QC Report Files
File Name    Description    Download Link
sound_life_batch_control_h5.tar
sound_life_batch_report_html.tar

GEO .h5 data access

In addition to the assembled data above, we have deposited the demultiplexed, per-sample Cell Ranger output .h5 files in the Gene Expression Omnibus (GEO) for public access.

These .h5 files can be found on GEO at accession ID GSE271896.

These files can be read into analysis packages using functions designed for reading 10x Genomics Cell Ranger .h5 files.
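
As one example of such a function, the sketch below uses read_10x_h5 from the third-party scanpy package. The choice of scanpy is ours for illustration, and other packages with Cell Ranger .h5 readers should work as well. The wrapper name and the file path in the usage note are hypothetical.

```python
from pathlib import Path


def read_sample_h5(path):
    """Read one Cell Ranger .h5 file into an AnnData object using scanpy."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No .h5 file found at {path}")

    import scanpy as sc  # third-party; imported only once the file is known to exist

    adata = sc.read_10x_h5(str(path))
    # Gene symbols can repeat in Cell Ranger output; make variable names unique.
    adata.var_names_make_unique()
    return adata
```

After downloading a file from GEO, it could be loaded with adata = read_sample_h5("path/to/sample.h5").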

Raw data in dbGaP

Raw data in FASTQ format will be available in dbGaP for controlled access. We are currently in the process of data deposition.