Analysis methods

To analyze data from the ALTRA and Sound Life cohorts, we utilized data and analysis environments within the Human Immune System Explorer (HISE) system to trace data processing, analytical code, and analysis environments from raw data to generate the final analysis results presented here. A graph representation of the analysis trace for this project is available in the Certificate of Reproducibility.

This analysis is broken down by into sections for each major data modality:

  • Plasma protein analysis from Olink proteomics data
    • Batch bridging and normalization
    • Cross-sectional comparison analysis
  • Flow Cytometry
    • Processing, clustering, and differential abundance
  • Single-cell RNA-seq data
    • Preprocessing, cleaning, and label transfer
    • Pseudobulk differential gene expression
    • Differential abundance analysis
    • Longitudinal modeling of progression (scRNA & Olink)
    • B cell IgH isotype analysis
  • TEA-seq data
    • Data preprocessing and cell labeling
    • Peak calling and differential peak analysis
    • chromVAR transcription factor activity analysis

Additional analyses and notebooks used to generate all figures in He, et al. (2025) can be found in our Github repository: 
aifimmunology/ALTRA-manuscript

Plasma proteomics

Protein abundance values were reported as log2 normalized protein expression (NPX) by Olink. Data were reviewed for overall quality prior to analysis. For cross-batch data normalization, bridge offsets were determined for each batch and each analyte separately by taking the median of the per-sample NPX differences between the later batch result and the earliest (reference) batch result for the 12 bridging controls. Offsets were subtracted from the analyte measurements of all samples in the later batch to obtain the normalized NPX values.

View Notebook: 00-R_aim1_ARI_vs_CON_data_selection.ipynb
View Notebook: 00-R_aim2_ERA_vs_CON_data_selection.ipynb

Repeated assays (TNF, IL-6, and CXCL8) with non-significant Spearman correlations (P > 0.05) across panels and batches were excluded from further analysis. Proteins with NPX values > 2 standard deviations from mean when comparing healthy samples from the first and last batches were excluded from analysis.

View Notebook: 01-R_aim2_ERA_vs_CON_modeling_no_converters.ipynb
View Notebook: 02-R_aim1_ARI_vs_CON_modeling.ipynb

Ordinary least squares linear regression models were used to detect differences between groups, utilizing the stats package in R. Models were adjusted for age, sex, BMI (ARI vs. HC1) or age and sex covariates for remaining comparisons. The Storey-Tibshirani procedure (Storey and Tibshirani, 2003) was employed for multiple testing correction, with q-values reported. Repeated assays (TNF, IL-6, and CXCL8) were treated as distinct measurements and analyzed separately, as recommended by Olink. For each comparison, proteins were excluded from the analyses if > 50% of NPX values were below the limit of detection in either group. Longitudinal modeling of abundance changes associated with time to clinical RA are described in the Longitudinal Modeling section below.

View Notebook: Figure_S1E.ipynb

Geneset over-representation analysis was performed on significant proteins for a custom collection of genesets: MSigDB Hallmark v7.2 (Liberzon, et al., 2015), KEGG v7.2 (Kanehisa, et al., 2025) and Reactome v7.2 (Gillespie, et al, 2022) using WebGestaltR (v0.4.6, method = ORA, 104 permutations; Elizarraras, et al., 2024) with all protein coding genes as the background geneset. P values were calculated using hypergeometric tests and adjusted using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995). 

Flow Cytometry

Data Preprocessing, Clustering and Differential Abundance 

Data from each panel were processed and analyzed independently. After spectral unmixing and manual compensation adjustments, cytometry data underwent pre-processing to remove technical artifacts, exclude doublets, and eliminate dead cells using a quadratic discriminant analysis model trained on manually gated data. A logicle transformation was applied to all fluorescent channels.  Pre-trained machine learning models developed from healthy samples based on the Cyanno framework (Kaushik, et al., 2021) were applied to facilitate cell type identification.

View Notebook: RA_02-sample_selection_PB1_flow_celltype_subsets_panels.ipynb
View Notebook: RA_02-sample_selection_PM1_flow_celltype_subsets_panels.ipynb
View Notebook: RA_02-sample_selection_PS1_flow_celltype_subsets_panels.ipynb
View Notebook: RA_02-sample_selection_PT1_flow_celltype_subsets_panels.ipynb

The major populations (T, B, Myeloid, NK cells) were then subsetted from the corresponding target panel and downsampled to 40,000 cells per sample for downstream unsupervised clustering analysis.

View Notebook: RA_03-reclustering_PB1_flow_celltype_subsets_panels.ipynb
View Notebook: RA_03-reclustering_PM1_flow_celltype_subsets_panels.ipynb
View Notebook: RA_03-reclustering_PS1_flow_celltype_subsets_panels.ipynb
View Notebook: RA_03-reclustering_PT1_flow_celltype_subsets_panels.ipynb

Subsequent dimensional reduction, batch-level harmonization and clustering on clinical samples within each panel were performed using Scanpy (Wolf, Angerer, and Theis, 2018). Leiden clustering was performed at following resolutions: B, NK, myeloid cells at 1.0 in (PS, PB, PT, PM panels) and T cells at 2.0 in (PT, PS panels).

View Notebook: RA_05-freq_stats_PB1_flow_celltype_subsets_panel.ipynb
View Notebook: RA_05-freq_stats_PM1_flow_celltype_subsets_panel.ipynb
View Notebook: RA_05-freq_stats_PS1_flow_celltype_subsets_panel.ipynb
View Notebook: RA_05-freq_stats_PT1_flow_celltype_subsets_panel.ipynb

Cluster-level counts and frequencies within each major cell type were calculated per sample and panel. Cluster phenotypes were assessed by comparing marker median expression across clusters and UMAP visualization. Statistical analysis of cellular abundance is detailed in the Differential Abundance Analysis section.

scRNA

Preprocessing, cleaning, and label transfer

Data preprocessing for scRNA was conducted per a computational reproducibility framework so that the analysis steps can be reproduced in the Human Immune System Explorer (Meijer, et al., 2025).

View Notebook: 01-Python_label_predictions_celltypist.ipynb

Cell type label transfer was performed using the Celltypist model (Domínguez Conde, et al., 2022) generated from a recently released Human Immune Health Atlas (Gustafson, et al., 2024). Detailed definition of the cell subsets can be found in a the Human Immune Health Atlas website.

View Notebook: 02-Python_doublet_detection.ipynb

Doublets were removed using Scrublet (Wolock, Lopez, and Klein, 2019) with default settings.

View Notebook: 03-Python_assembly_of_subsets.ipynb

Data from individual samples was assembled into 4 large sets for further processing.

View Notebook: 04-Python_subset_qc_filtering.ipynb

High quality cells were selected based on the following cutoffs: <10% mitochondrial reads, number of genes detected between 200 and 5,000, RNA unique molecular identifiers (UMIs)/cells between 1,500 and 750,000. The transcripts of the remaining cells were first normalized, and then followed by variable feature selection, scaling, PCA, UMAP embedding as described in Scanpy.

View Notebook: 05a-Python_cluster_harmonize_partition_cell_types_0-20.ipynb
View Notebook: 05b-Python_cluster_harmonize_partition_cell_types_21-50.ipynb
View Notebook: 05c-Python_cluster_harmonize_partition_cell_types_51-71.ipynb
View Notebook: 05d-Python_plot_cluster_umaps_partition_cell_types.ipynb

Following the cell type prediction, a secondary manual step was taken to remove the remaining doublet cells.​​ All 71 cell types were individually subsetted and corrected for batch- and subject-level variation using Harmony (Korsunsky, et al., 2019), followed by Leiden clustering.

View Notebook: 05e-Python_reparam_umap_partition_cell_types.ipynb

Subsequently, clusters with gene expression indicative of doublets and high doublet scores, resulted in exclusion of 6.7% of cells from further analysis.

View Notebook: 06.1-Python_deepclean_partition_cell_types.ipynb

Clusters with distinct gene profiles within the predicted cell type were manually reviewed and assigned unique cell type labels to differentiate from predicted cell types. When clustering memory B cells in scRNA-seq data, we excluded immunoglobulin genes in order to focus on phenotypic clustering and subsetting of B cells to best capture functionally distinct populations (Glass, et al., 2022).

We then secondarily analyzed IgH isotype composition within these clusters (see below).

Pseudobulk differential gene expression analysis

Differential gene expression (DEG) analysis between groups was conducted using DESeq2 (Love, et al., 2014).

View Notebook Set: 07-Pseudobulk

For each sample and cell type pair, gene transcripts were aggregated using the get_pseudobulk function from decoupleR (Badia-I-Mompel, et al., 2022). Sample-cell type pairs with fewer than 10 cells or 1,000 total gene counts, as well as genes expressed in fewer than 10% of cells or in less than one-third of samples, or with fewer than 15 counts across all samples, were excluded from downstream analysis. Rare cell types present in fewer than five samples per group were also omitted from DEG analysis.

Comparison Notebook
At-risk individuals (ARI) vs. Healthy controls (HC1) View Notebook: 01_ARIvsHC1_DEG_analysis.ipynb
Early RA (ERA) vs. Healthy controls (HC1) View Notebook: 02_ERAvsHC1_DEG_analysis.ipynb
Non-conv. vs. converting At-Risk Individuals (ARI) View Notebook: 04_CONVvsNONC_baseline_DEG_analysis.ipynb

Pseudobulk counts for the remaining cell types were analyzed with DESeq2 to compare disease and healthy control (HC1) groups with the formula (counts ~ age + sex + BMI + group). Additionally, batch effects, particularly from batch B182, which exhibited a relatively strong effect compared to other batches across multiple cell types during initial quality control, were adjusted for in the model.

For DEG analysis of longitudinal changes associated with time to clinical RA, pseudobulk counts were transformed with variance stabilizing transformation (VST) before being analyzed using a linear mixed model with the lmerTest package (described below). The Storey-Tibshirani procedure was employed for multiple testing correction, with q-values less than 0.1 considered significant.

Differential abundance analysis 

View Notebook: 01_scRNA_AIFI_l3_DA_analysis_certPro.ipynb

Differential abundance analysis of immune cell populations across cytometry and scRNA data was conducted by first extracting counts and frequencies for each cluster or cell subset from all samples. To avoid zeros in ratio calculations, a pseudocount of 1 was added to the raw counts. Adjusted frequencies were then transformed using the centered log ratio (CLR) to account for the compositional nature of cell frequencies, utilizing the compositions package in R (van den Boogaart, Tolosana-Delgado, and Bren, 2023). A linear regression model was employed to compare CLR values between ARI and HC1, adjusting for age, sex, and BMI. Multiple hypothesis testing was controlled using the Benjamini-Hochberg method as the number of tests were small, with FDR < 0.1 considered significant.  

View Notebook: 02a_scCODA_reanalysis_DA_AIM1.ipynb
View Notebook: 02b_scCODA_conv_nonc.ipynb

In addition, we employed scCODA (Dann, et al., 2021) to analyze the differential abundance in scRNA data comparing ARI and HC1. To ensure reproducibility of the model, we performed scCODA model using Hamiltonian Monte Carlo sampling method, with 7 different reference cell types (Core naïve B cell, Core naïve CD8 T cell, GZMB- CD27+ EM CD4 T cell, GZMK+ CD56dim NK cell, GZMK- CD56dim NK cell, KLRB1+ memory CD4 Treg, KLRF1+ GZMB+ CD27- EM CD8 T cell, pDC) whose frequencies remain stable in ARI and HC1 and also with abundant counts (>5 cells) in all samples. Cell types with credible abundance changes in all 7 models at FDR<0.1 threshold were considered significant.

Longitudinal modeling of progression to clinical RA

View Notebook: 01_R_Pseudobulk_longitudinal_DEGs.ipynb

Longitudinal analysis of ARI progression to clinical RA across all modalities (proteomic, scRNA, flow cytometry) for DEG and DA analyses were conducted using a linear mixed model in R (lmer function in lmerTest package; Kuznetsova, Brokhoff, and Christensen, 2017), with Satterthwaite p-value approximation. The model was adjusted for age and BMI at the time of clinical RA diagnosis with subjects as the random effect (intercept only). Given the distinct sex differences and the limited number of pre-diagnosis samples available (fig. S1B), the analysis was restricted to samples from 13 female subjects within 750 days of clinical RA diagnosis. The model formula was as follows:
Y ~ time to clinical RA diagnosis + age  + BMI  + (1|subject)
with the dependent variable (Y) being either the VST-transformed gene expression or the CLR-transformed cell frequency. The Storey-Tibshirani procedure was employed for multiple testing correction, with q-values less than 0.1 considered significant. For visualization, the effect sizes were annualized by multiplying by 365. 

scRNA-seq B cell IgH Isotype Analysis

View Notebook: 01_py_AssignIsotype.ipynb

QC-processed 10x Genomics Chromium 3’ scRNA-seq data was used to estimate proportions of switched IgH isotype B cells in study samples. First, positive expression for each of the nine IgH constant genes (IGHM, IGHD, IGHG1-4, IGHA1-2, and IGHE) was defined per B cell as a normalized UMI count greater than or equal to 0.5 (normalized to 10,000 UMI per cell). Cells positive for either or both IGHM and IGHD were classified as ‘non-switched’ with isotype assignments of IGHM, IGHD, or IGHMD. The remaining cells were classified as ‘class-switched’ if positive for any other IGH gene. In cases of multi-positivity for class-switched genes, the positive gene located nearest the variable region in the genomic DNA was assigned as the isotype to be consistent with class-switch recombination biology (IgH constant region ‘switched’ loci order after the variable region: IGHG3, IGHG1, IGHA1, IGHG2, IGHG4, IGHE, IGHA2). Both the class-switched status and isotype for cells without detected IgH gene expression were denoted as ‘undetermined’. After isotype classification, IgH constant gene germline transcription (GLT) level was defined as the normalized gene counts for those IgH loci downstream of the isotype-determining gene. As an example, IgG3+ isotype B cells may have quantifiable GLT from the IGHG1, IGHA1, IGHG2, IGHG4, IGHE, and IGHA2 genes. IgH constant region isotype gene GLT resulting in spliced, polyadenylated and untranslated transcript is a well-established phenomenon in B cell isotype class-switching (Stavnezer, 1996; Lorenz, Jung, and Radbruch, 1995). This isotype labeling and GLT analysis method cannot be applied to 10x Genomics Chromium 5’ gene expression data with the same accuracy, due to the orientation of the captured mRNA molecule having increased loss of IgH constant gene segments in cDNA fragmentation steps of library preparation. 3’ scRNA-seq data was pseudobulked by B cell label and isotype to perform differential expression analysis of switched IgH genes (minimum 20 cells and 1,000 counts per sample).

View Notebook: Figure_3E.ipynb

DESeq2 was used to compare ARI to HC1, with the model formula:
exp ~ sex + age + status
for each cell population with at least 5 individuals per status or group. P values were adjusted using the Benjamini-Hochberg method.

TEA-seq

Data Preprocessing and Cell Labeling in TEA-Seq

View Notebook: RA_101_Initial_Object_Creation.ipynb

Data preprocessing was performed as previously described (Swanson, et al., 2021). ADT and HTO count matrices from BarCounter were combined into a Mudata object (Bredikhin, Kats, and Stegle, 2022). High quality cells were selected based on the following cutoff: > 250 genes per cell, between 500 and 20,000 RNA UMIs per cell, < 10,000 ADT UMIs per cell, < 30% mitochondrial reads. One well (P1C2W6) was excluded from downstream analysis due to abnormally low RNA UMI counts and gene detection.

View Notebook: RA_102_python_dsb_normalization.ipynb

ADT normalization was performed using the dsb package, which denoises ADT signals from background staining using empty droplets (Mulè, Martins, and Tsang, 2022). RNA normalization, variable selection, scaling, and UMAP embedding were conducted using the standard Scanpy workflow.

View Notebook: RA_101b_ATAC_Initial_Object_Creation.ipynb

Fragment files for ATAC data were processed using the ArchR package (Granja, et al., 2021).

View Notebook: RA_101b_python_construct_object.ipynb

LSI reduced dimensions, UMAP and clusters generated from ArchR were then imported to the Mudata object for integrated analysis. 3-way weighted nearest neighbors and UMAP that incorporated ADT, RNA, and ATAC data were then generated using MUON.

View Notebook: RA_103_python_relabel_immune_health.ipynb

We performed unsupervised clustering within each major cell type (T, B, myeloid, and NK cells) and labeled the clusters based on the top differential expressed genes and ADTs (fig. S11A). In addition, label transfer based on RNA only was performed using the AIFI Immune Cell Atlas as described above in the scRNA section.

scATAC Peak Calling and Differential Peak Analysis

View Notebook: RA_102c_global_ATAC_analysis.ipynb

Peak calling for scATAC data in TEA-Seq was conducted using MOCHA (Rachid Zaim, et al., 2024). In brief, scATAC data were aggregated into cell type-sample pseudobulk matrices, and sample-specific 500 base pair tiles were identified. Consensus peaks were determined as regions that were open in more than 20% of samples. The zero-inflated Wilcoxon test implemented in getDifferentialAccessibleTiles function was then used to determine the differential accessible peaks between ARI and HC2 samples with settings of minimum median intensity (signalThreshold) = 14 and minimum difference in average dropout rates (minZeroDiff) = 0. FDR < 0.1 were considered significant. Peaks were annotated based on UCSC hg38 reference genome and regions within 2,000 base pairs upstream and 100 base pairs downstream of the transcription start site (TSS) were considered as promoter regions.

chromVAR Transcription Factor Activity Analysis

View Notebook: RA_102c_global_ATAC_analysis.ipynb

Single cell and pseudobulk transcription factor (TF) activities were inferred using chromVAR package (Schep, et al., 2017). For single-cell TF activity, MOCHA-identified peaks were imported into the ArchR object, motifs were annotated using the CIS-BP database, and deviations were calculated using the addDeviationsMatrix function in ArchR. For pseudobulk-level analyses, TF deviations within CD4 naïve T cells were calculated using the computeDeviations function, based on the MOCHA-identified pseudobulk sample peak matrix, and compared with GC-matched background peaks. 

References

Badia-I-Mompel P, Vélez Santiago J, Braunger J, Geiss C, Dimitrov D, Müller-Dott S, et al. decoupleR: ensemble of computational methods to infer biological activities from omics data. Bioinform Adv. 2022;2: vbac016.
doi:10.1093/bioadv/vbac016

Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc. 1995;57: 289–300.
doi:10.1111/j.2517-6161.1995.tb02031.x

Bredikhin D, Kats I, Stegle O. MUON: multimodal omics analysis framework. Genome Biol. 2022;23: 42.
doi:10.1186/s13059-021-02577-8

Dann E, Henderson NC, Teichmann SA, Morgan MD, Marioni JC. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat Biotechnol. 2021.
doi:10.1038/s41587-021-01033-z

Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Wells SB, Gomes T, et al. Cross-tissue immune cell analysis reveals tissue-specific features in humans. Science. 2022;376: eabl5197.
doi:10.1126/science.abl5197

Elizarraras JM, Liao Y, Shi Z, Zhu Q, Pico AR, Zhang B. WebGestalt 2024: faster gene set analysis and new support for metabolomics and multi-omics. Nucleic Acids Res. 2024;52: W415–W421.
doi:10.1093/nar/gkae456

Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50: D687–D692.
doi:10.1093/nar/gkab1028

Glass MC, Glass DR, Oliveria J-P, Mbiribindi B, Esquivel CO, Krams SM, et al. Human IL-10-producing B cells have diverse states that are induced from multiple B cell subsets. Cell Rep. 2022;39.
doi:10.1016/j.celrep.2022.110728

Granja JM, Corces MR, Pierce SE, Bagdatli ST, Choudhry H, Chang HY, et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet. 2021;53: 403–411.
doi:10.1038/s41588-021-00790-6

Gustafson CE, Skene PJ, Goldrath AW, Li X-J, Torgerson TR, Becker LA, et al. AIFI Immune Health Atlas. In: Human Immune System Explorer [Internet].
doi:10.57785/e9e1-wh09

Kanehisa M, Furumichi M, Sato Y, Matsuura Y, Ishiguro-Watanabe M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 2025;53: D672–D677.
doi:10.1093/nar/gkae909

Kaushik A, Dunham D, He Z, Manohar M, Desai M, Nadeau KC, et al. CyAnno: A semi-automated approach for cell type annotation of mass cytometry datasets. Bioinformatics. 2021;37: 4164–4171.
doi:10.1093/bioinformatics/btab409

Korsunsky I, Millard N, Fan J, Slowikowski K, Zhang F, Wei K, et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods. 2019;16: 1289–1296.
doi:10.1038/s41592-019-0619-0

Kuznetsova A, Brockhoff PB, Christensen RHB. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software. 2017. pp. 1–26.
doi:10.18637/jss.v082.i13

Liberzon A, Birger C, Thorvaldsdóttir H, Ghandi M, Mesirov JP, Tamayo P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 2015;1: 417–425.
doi:10.1016/j.cels.2015.12.004

Lorenz M, Jung S, Radbruch A. Switch transcripts in immunoglobulin class switching. Science. 1995;267: 1825–1828.
doi:10.1126/science.7892607

Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15: 550.
doi:10.1186/s13059-014-0550-8

Meijer P, Howard N, Liang J, Kelsey A, Subramanian S, Johnson E, et al. Provide proactive reproducible analysis transparency with every publication. R Soc Open Sci. 2025;12: 241936.
doi:10.1098/rsos.241936

Mulè MP, Martins AJ, Tsang JS. Normalizing and denoising protein expression data from droplet-based single cell profiling. Nat Commun. 2022;13: 2099.
doi:10.1038/s41467-022-29356-8

Rachid Zaim S, Pebworth M-P, McGrath I, Okada L, Weiss M, Reading J, et al. MOCHA’s advanced statistical modeling of scATAC-seq data enables functional genomic inference in large human cohorts. Nat Commun. 2024;15: 6828.
doi:10.1038/s41467-024-50612-6

Schep AN, Wu B, Buenrostro JD, Greenleaf WJ. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat Methods. 2017;14: 975–978.
doi:10.1038/nmeth.4401

Stavnezer J. Immunoglobulin class switching. Curr Opin Immunol. 1996;8: 199–205.
doi:10.1016/s0952-7915(96)80058-6

Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100: 9440–9445.
doi:10.1073/pnas.1530509100

Swanson E, Lord C, Reading J, Heubeck AT, Genge PC, Thomson Z, et al. Simultaneous trimodal single-cell measurement of transcripts, epitopes, and chromatin accessibility using TEA-seq. Elife. 2021;10.
doi: 10.7554/eLife.63632

van den Boogaart KG, Tolosana-Delgado R, Bren M. compositions: Compositional Data Analysis. 2023.
Available: https://cran.r-project.org/web/packages/compositions/index.html

Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15.
doi:10.1186/s13059-017-1382-0

Wolock SL, Lopez R, Klein AM. Scrublet: Computational Identification of Cell Doublets in Single-Cell Transcriptomic Data. Cell Syst. 2019;8: 281-291.e9.
doi:10.1016/j.cels.2018.11.005