Overview
scRNA-seq data was generated on the 10x Genomics 3' scRNA-seq Platform (v3.1). For data collection and processing details, see the Cohorts and Experimental Methods sections.
Below, we provide labeled and annotated PBMC scRNA-seq data from healthy, at-risk, and RA subjects in the ALTRA cohort and from healthy controls in the Sound Life cohort. More information about the Sound Life cohort is available on the Dynamics of Immune Health and Age website.
All .h5ad files for this project contain sample and subject metadata, in addition to cell-level cell type labels and QC metrics. The following values are stored in the .obs section of these .h5ad files as descriptions of observations:
Sample Identifiers
cohort.cohortGuid: A Globally Unique Identifier (GUID) of the Cohort the subject enrolled in for our study
subject.subjectGuid: A GUID for the Subject
sample.sampleKitGuid: A GUID for the Sample Kit, representing all material collected at a visit
specimen.specimenGuid: A GUID for the specific aliquot used for the experiment
pipeline.fileGuid: A GUID for the specific analysis pipeline output file used for analysis
Subject Metadata
subject.biologicalSex: The biological sex of the Subject
subject.birthYear: The Birth Year of the Subject
subject.ageAtFirstDraw: The Age of the Subject at their first on-study sample collection
subject.ageGroup: The Age Group of the Subject (Young Adult or Older Adult)
subject.race: The self-reported Race of the Subject
subject.ethnicity: The self-reported Ethnicity of the Subject
subject.cmv: The CMV Status of the Subject, as determined by an HCMV assay
subject.bmi: The BMI of the Subject
Sample Metadata
sample.visitName: The name of the study visit (i.e. time point)
sample.drawDate: The date of the study visit (Month and Year; e.g. 2021-03)
sample.subjectAgeAtDraw: The age of the Subject in years at the time of sample collection
Process Identifiers
batch_id: A GUID for the batch of samples processed together (e.g. B039)
pool_id: A GUID for the pool of samples combined for Cell Hashing (e.g. B039-P1)
chip_id: A GUID for the 10x Genomics chip the cells were loaded into (e.g. B039-P1C2)
well_id: A GUID for the 10x Genomics well the cells were loaded into within the chip (e.g. B039-P1C2W4)
barcodes*: A GUID for the individual cell
original_barcodes: The original, sequence-based barcode generated by 10x Genomics Cell Ranger software
cell_name: A quasi-unique, memorable cell identifier generated using an adjective-adjective-animal structure
*used as the primary cell index in our .h5ad files
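The example values given for the process identifiers suggest that they nest: each well ID appears to contain its chip, pool, and batch IDs as prefixes. The sketch below splits a well ID into those parts. The regular expression is inferred only from the examples above (B039, B039-P1, B039-P1C2, B039-P1C2W4) and is an assumption, not a documented format; `WellParts` and `split_well_id` are hypothetical names.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example IDs in the text; an assumption, not a spec.
WELL_ID_PATTERN = re.compile(r"^(B\d+)-(P\d+)(C\d+)(W\d+)$")


class WellParts(NamedTuple):
    batch_id: str
    pool_id: str
    chip_id: str
    well_id: str


def split_well_id(well_id: str) -> WellParts:
    """Recover the batch, pool, and chip IDs that prefix a well ID."""
    match = WELL_ID_PATTERN.match(well_id)
    if match is None:
        raise ValueError(f"unexpected well_id format: {well_id!r}")
    batch, pool, chip, well = match.groups()
    return WellParts(
        batch_id=batch,
        pool_id=f"{batch}-{pool}",
        chip_id=f"{batch}-{pool}{chip}",
        well_id=f"{batch}-{pool}{chip}{well}",
    )


parts = split_well_id("B039-P1C2W4")
print(parts.batch_id, parts.pool_id, parts.chip_id)  # B039 B039-P1 B039-P1C2
```

Because batch_id, pool_id, and chip_id are already stored as separate .obs columns, a helper like this is mainly useful for sanity-checking that the columns agree with each other.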
Cell QC Metrics
n_reads: Number of reads assigned to the cell barcode
n_umis: Number of Unique Molecular Identifiers (unique molecules) detected
n_genes: Number of genes with at least 1 UMI detected
total_counts_mito: Total number of reads that were assigned to mitochondrial genes
pct_counts_mito: Percent of reads that were assigned to mitochondrial genes
doublet_score: Doublet score assigned by Scrublet for doublet detection
Cell Labeling Results
AIFI_L1: Final broad class cell type label (9 types)
AIFI_L1_score: AIFI_L1 prediction score generated by CellTypist
predicted_AIFI_L1: Predicted AIFI_L1 type assigned by CellTypist
AIFI_L2: Final mid resolution cell type label (29 types)
AIFI_L2_score: AIFI_L2 prediction score generated by CellTypist
predicted_AIFI_L2: Predicted AIFI_L2 type assigned by CellTypist
AIFI_L3: Final high resolution cell type label (71 types)
AIFI_L3_score: AIFI_L3 prediction score generated by CellTypist
predicted_AIFI_L3: Predicted AIFI_L3 type assigned by CellTypist
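Since each level stores both a final label (e.g. AIFI_L1) and the CellTypist prediction (e.g. predicted_AIFI_L1), one simple check is how often the two agree and which predictions were changed. The following is a minimal sketch using only the Python standard library; `label_agreement` is a hypothetical helper, and the labels in the example are toy values, not taken from the dataset.

```python
from collections import Counter


def label_agreement(final_labels, predicted_labels):
    """Return (fraction of matching labels, Counter of (predicted, final) changes)."""
    final_labels = list(final_labels)
    predicted_labels = list(predicted_labels)
    if len(final_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not final_labels:
        raise ValueError("no cells provided")
    # Count every cell whose final label differs from the prediction.
    changed = Counter(
        (pred, final)
        for final, pred in zip(final_labels, predicted_labels)
        if final != pred
    )
    agreement = 1 - sum(changed.values()) / len(final_labels)
    return agreement, changed


# Toy labels for illustration only.
final = ["T cell", "T cell", "B cell", "NK cell"]
predicted = ["T cell", "NK cell", "B cell", "NK cell"]
agreement, changed = label_agreement(final, predicted)
print(agreement)  # 0.75
print(changed.most_common())
```

With an AnnData object in hand, the inputs would be the corresponding .obs columns, for example `adata.obs["AIFI_L1"]` and `adata.obs["predicted_AIFI_L1"]`.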
Subject Group .h5ad files
Here, we group samples based on their disease group designation:
HC1: ACPA- Healthy controls from the ALTRA cohort
HC2: ACPA- Healthy controls from the Sound Life cohort
ARI-CONV: ACPA+ At-Risk Individuals (ARI) that converted to RA during the study period (CONV)
ARI-NONC: ACPA+ At-Risk Individuals (ARI) that did not convert during the study period (NONC)
ERA: ACPA+ subjects with Early clinical RA
ARI-ERA-HC1: All subjects from the ALTRA cohort (all above except HC2).
We are providing our scRNA-seq data in AnnData (.h5ad) format. For more details about AnnData, see the AnnData Documentation Page.
Each file provided below contains either the full set or a subset of the samples used in this study. Sample counts, cell counts, and approximate file sizes are below:
File Name | N Subjects | N Samples | N Cells | File Size |
---|---|---|---|---|
altra_ARI-CONV_subjects.h5ad | 16 | 66 | 508,300 | 14 GB |
altra_ARI-NONC_subjects.h5ad | 31 | 31 | 977,608 | 26 GB |
altra_ERA_subjects.h5ad | 11 | 11 | 149,782 | 4.2 GB |
altra_HC1_subjects.h5ad | 31 | 31 | 394,174 | 12 GB |
altra_all_samples.h5ad | 89 | 139 | 2,029,864 | 75 GB |
sound-life_HC2_subjects.h5ad | 29 | 86 | 1,398,699 | 50 GB |
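Given the file sizes above, it may be impractical to load a whole file into memory. A minimal sketch of one option follows: the `anndata` package can open an .h5ad file in backed mode, which keeps the expression matrix on disk while still exposing .obs. The file path and the helper names (`open_backed`, `missing_obs_columns`) are assumptions for illustration, and the column list is only a sample of the .obs fields described earlier.

```python
# A few of the .obs columns described above that this sketch expects to find.
EXPECTED_OBS_COLUMNS = [
    "subject.subjectGuid",
    "sample.sampleKitGuid",
    "AIFI_L1",
    "AIFI_L2",
    "AIFI_L3",
]


def missing_obs_columns(available, expected=EXPECTED_OBS_COLUMNS):
    """Return the expected column names that are absent from `available`."""
    present = set(available)
    return [name for name in expected if name not in present]


def open_backed(path):
    """Open an .h5ad file read-only, leaving the expression matrix on disk."""
    import anndata as ad  # imported here so the helper above works without anndata

    return ad.read_h5ad(path, backed="r")


# Example usage (assumes the file was downloaded to the working directory):
# adata = open_backed("altra_ERA_subjects.h5ad")
# print(adata.n_obs, adata.n_vars)
# print(missing_obs_columns(adata.obs.columns))
```

Starting with the smallest file (altra_ERA_subjects.h5ad, 4.2 GB) is a reasonable way to check the metadata layout before working with the larger files.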
ALTRA and Sound Life Subject Group .h5ad Files
File Name | Description | Download Link |
---|---|---|
altra_ARI-CONV_subjects.h5ad | Converting At-risk subjects (ARI-CONV) | |
altra_ARI-NONC_subjects.h5ad | Non-converting At-risk subjects (ARI-NONC) | |
altra_ERA_subjects.h5ad | Early RA subjects (ERA) | |
altra_HC1_subjects.h5ad | Healthy Control subjects (HC1) | |
altra_all_samples.h5ad | All ALTRA samples (ARI, ERA, HC1) | |
sound-life_HC2_subjects.h5ad | Healthy Control subjects (HC2) | |
Cell type frequency data
After labeling cell types, we tabulated the cell count for each of the 139 ALTRA samples utilized in our study at each level of resolution in our Immune Health Atlas. We also include frequencies at an additional level of resolution, called AIFI_L3_reviewed, which are derived by manual review and additional subsetting of AIFI_L3 cell populations.
Below, we provide these cell counts, the fraction of total cells that each cell type represents in each sample, and the centered log ratio (CLR) transformation of those fractions that we utilized for our analyses. CLR values are provided both with and without a pseudocount, which is added so that cell types with zero cells in a sample still yield a defined value.
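As an illustration of the transformation, the sketch below computes the CLR as the log of each value minus the mean of the logs across all cell types in a sample. It assumes the natural logarithm and a pseudocount of 1 added to every count; how zero counts are represented in the non-pseudocount columns is not stated here, so this sketch simply rejects them. The function names and counts are hypothetical.

```python
import math


def clr(values):
    """Centered log ratio: log of each value minus the mean log of all values."""
    if any(v <= 0 for v in values):
        raise ValueError("CLR is undefined for zero or negative values")
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    return [lv - mean_log for lv in logs]


def clr_with_pseudocount(counts, pseudocount=1):
    """Add a pseudocount to every count, then apply the CLR."""
    return clr([c + pseudocount for c in counts])


# Toy per-type cell counts for one sample; the third type has no cells.
counts = [120, 30, 0, 50]
print(clr_with_pseudocount(counts))

# The CLR is scale-invariant, so counts and fractions give the same values.
positive = [120, 30, 50]
fractions = [c / sum(positive) for c in positive]
print(clr(positive))
print(clr(fractions))
```

CLR values within a sample always sum to zero, which is a quick consistency check on any recomputed column.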
In addition, we utilized the absolute lymphocyte counts (ALC) provided in our clinical blood count lab results for each sample (available below) to compute estimated cell type abundance based on normalization to ALC. For more details, see our Data Analysis Methods.
The columns in the cell type frequency tables are described below:
Sample Metadata columns
cohort.cohortGuid: Cohort ID (BR1 or BR2)
subject.subjectGuid: Subject ID
subject.biologicalSex: Subject Sex (Female or Male)
subject.cmv: Subject CMV Status (Negative or Positive)
subject.bmi: Subject BMI (integer)
subject.race: Subject race
subject.ethnicity: Subject ethnicity
subject.birthYear: Subject Birth Year
subject.ageAtFirstDraw: Subject Age at earliest blood draw in study
sample.sampleKitGuid: Sample Kit ID
sample.visitName: Sample Visit Name
sample.drawDate: Sample Draw Date (Year-Month)
sample.subjectAgeAtDraw: Subject age at time of draw, based on year of Draw Date and Birth Year
specimen.specimenGuid: Specimen ID (pbmc_sample_id in .h5 files)
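The age at draw is described as being based on the year of the Draw Date and the Birth Year. A minimal sketch of that calculation, assuming it is a plain difference of years and that the Draw Date uses the Year-Month form shown earlier (e.g. 2021-03), is given below. `age_at_draw` is a hypothetical helper and the birth year is a made-up value.

```python
def age_at_draw(draw_date: str, birth_year: int) -> int:
    """Year of the draw date (formatted YYYY-MM) minus the birth year."""
    draw_year = int(draw_date.split("-")[0])
    return draw_year - birth_year


print(age_at_draw("2021-03", 1980))  # 41
```

Because only years are used, the result can differ by one from an age computed from full dates.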
Frequency-related columns
(shown for AIFI_L1 as an example; AIFI_L1 is replaced with AIFI_L2 or AIFI_L3 for those levels)
AIFI_L1: Cell Type assignment
AIFI_L1_count: Count of cells within this sample with this cell type assignment
total_cells: Total cells within this sample
scrna.lymphocyte_count: Sum of T, NK, and B cells
bc.lymphocyte_count: Absolute Lymphocyte Count (ALC) from clinical Blood Counts (bc.)
alc_ratio: ALC per scRNA Lymphocyte Count
AIFI_L1_frac_total: Count of cells with this cell type assignment divided by Total cells for this sample
AIFI_L1_alc: ALC estimate for this cell type assignment
AIFI_L1_clr: Centered Log Ratio computed using AIFI_L1_frac_total for all types within this sample
AIFI_L1_count_pseudo: Count of cells adjusted with a pseudocount (AIFI_L1_count + 1)
AIFI_L1_clr_pseudo: Centered Log Ratio computed using AIFI_L1_count_pseudo
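One reading of the ALC-related columns is that alc_ratio is bc.lymphocyte_count divided by scrna.lymphocyte_count, and that the per-type ALC estimate is the cell count multiplied by that ratio. The sketch below follows that reading, which is an assumption based on the column descriptions rather than a confirmed formula. The cell type names, counts, and clinical ALC value (and its unit) are hypothetical.

```python
def alc_estimates(type_counts, lymphocyte_types, clinical_alc):
    """Scale per-type scRNA-seq counts by the clinical ALC.

    type_counts: dict mapping cell type label to cell count in one sample
    lymphocyte_types: labels counted as lymphocytes (T, NK, and B cells)
    clinical_alc: absolute lymphocyte count from the clinical blood count
    """
    scrna_lymphocytes = sum(type_counts[t] for t in lymphocyte_types)
    if scrna_lymphocytes == 0:
        raise ValueError("sample has no lymphocytes to normalize against")
    # Assumed definition: clinical ALC per lymphocyte observed by scRNA-seq.
    alc_ratio = clinical_alc / scrna_lymphocytes
    return {label: count * alc_ratio for label, count in type_counts.items()}


# Toy sample: 800 of 1,000 cells are lymphocytes.
counts = {"T cell": 600, "B cell": 100, "NK cell": 100, "Monocyte": 200}
estimates = alc_estimates(counts, ["T cell", "B cell", "NK cell"], clinical_alc=1600)
print(estimates["T cell"])  # 1200.0
```

Under this reading, the estimates for the lymphocyte types sum back to the clinical ALC, while non-lymphocyte types are scaled by the same ratio.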
ALTRA scRNA-seq Cell Type Frequencies
File Name | Description | Download Link |
---|---|---|
altra_AIFI_L1_frequencies.csv | Frequency data for broad cell populations (AIFI_L1) | |
altra_AIFI_L2_frequencies.csv | Frequency data for mid-resolution cell populations (AIFI_L2) | |
altra_AIFI_L3_frequencies.csv | Frequency data for high resolution cell populations (AIFI_L3) | |
altra_AIFI_L3_reviewed_frequencies.csv | Frequency data for high resolution manually reviewed cell populations (AIFI_L3_reviewed) | |
GEO .h5 data access
In addition to the assembled data above, we have deposited the demultiplexed, per-sample Cell Ranger output .h5 files in the Gene Expression Omnibus (GEO) for public access.
These .h5 files can be found on GEO at accession ID GSE274680.
These files can be read into analysis packages using functions designed for reading 10x Genomics Cell Ranger .h5 files.
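As one example of such a function, `scanpy.read_10x_h5` reads a Cell Ranger .h5 file into an AnnData object. The sketch below is illustrative only: it assumes the `scanpy` package is installed and that a per-sample file has been downloaded from GEO, and the function name and file name are hypothetical placeholders.

```python
def read_sample_h5(path):
    """Read one Cell Ranger .h5 file into an AnnData object."""
    import scanpy as sc  # imported here so this module loads without scanpy

    # By default only gene expression features are kept (gex_only=True).
    adata = sc.read_10x_h5(path)
    # Gene symbols can repeat in 10x references; make the names unique.
    adata.var_names_make_unique()
    return adata


# Example usage (hypothetical local file name):
# adata = read_sample_h5("sample_from_GSE274680.h5")
# print(adata.shape)
```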
Raw data in dbGaP
Raw data in FASTQ format can be obtained through the controlled access repository dbGaP at accession ID phs003944.v1.p1.