Methods
Key Resources Table
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| Critical commercial assays | ||
| Library Construction Kit, 16 reactions | 10x Genomics | Cat# PN-1000190 |
| QIAamp DNA Mini Kit | Qiagen | Cat# 51306 |
| Illumina GSAv3.0 array (Infinium Global Screening Array-24 Kit) | Illumina | Cat# 20030770 |
| Chemicals, peptides, and recombinant proteins | ||
| Fetal Bovine Serum | Sigma-Aldrich | Cat# F2442; Lot#s 19G014 and 20A363 |
| ACK lysing buffer | Thermo Fisher Scientific | Cat# A10492 |
| CryoStor CS10 cell freezing media | STEMCELL Technologies | Cat# 100-1061 |
| PBS, pH 7.4 | — | — |
| Buffer EB (Elution Buffer) | — | — |
| Biological samples | ||
| Human peripheral blood mononuclear cells (PBMCs) | This paper | N/A |
| Other | ||
| Chromium Controller | 10x Genomics | Cat# PN-120270 |
| Software and algorithms | ||
| Original code | This paper | — |
| GenomeStudio version 2.0 | Illumina | https://support.illumina.com/array/array_software/genomestudio/downloads.html |
| PLINK Input Report Plug-in v2.1.4 | Illumina | https://support.illumina.com/array/array_software/genomestudio/downloads.html |
| PLINK 2.0 | Chang et al. | https://www.cog-genomics.org/plink/2.0/ |
| bcftools | Danecek et al. | https://github.com/samtools/bcftools |
| R | The R Project | https://www.r-project.org/ |
| Python | Python Software Foundation | https://www.python.org/downloads/ |
| DRAGEN Single-Cell RNA pipeline (v3.8.4) | Illumina | https://sapac.support.illumina.com/sequencing/sequencing_software/dragen-bio-it-platform/downloads.html |
| Cell Ranger version 7.0.1 | 10x Genomics | https://www.10xgenomics.com/support/software/cell-ranger/downloads |
| Demuxlet | Kang et al. | https://github.com/statgen/demuxlet |
| Seurat version 4.1.1 | Hao et al. | https://satijalab.org/seurat/articles/install_v5 |
| RCAv2 | Schmidt et al. | https://github.com/prabhakarlab/RCAv2 |
| DoubletFinder version 2.0.3 | McGinnis et al. | https://github.com/chris-mcginnis-ucsf/DoubletFinder |
| ggplot2 | Wickham et al. | https://ggplot2.tidyverse.org/ |
| scanpy version 1.11.4 | Wolf et al. | https://github.com/scverse/scanpy |
| pertpy version 1.0.1 | Heumos et al. | https://github.com/scverse/pertpy |
| Deposited data | ||
| RIDA gene-cell matrices and metadata | This paper | By data access applications to the corresponding authors |
| RIDA deidentified FASTQ files | This paper | By data access applications to the corresponding authors |
Experimental Model and Study Participant Details
Study Design and Cohort
The data in this study are drawn from the Russian Immune Diversity Atlas version 1.0 (RIDA_v1.0), a multi-site biobank of peripheral blood samples collected across four sites in the Russian Federation: Moscow, Kazan, Grozny, and Yakutsk.
TODO: ethics approval — protocol number to be added.
TODO: cohort description — sample sizes per ancestry group, inclusion/exclusion criteria, demographics (age, sex, BMI) to be added.
PBMC Isolation
Peripheral blood (8 mL) was collected from each participant using CPT tubes containing sodium heparin (BD Vacutainer CPT, cat. no. 362753). PBMCs were isolated following a harmonized protocol applied across all participating sites. Fetal bovine serum (Sigma-Aldrich, cat. no. F2442, lots 19G014 and 20A363) was used during isolation and subsequent pooling and washing steps.
Blood samples in CPT tubes were kept at room temperature and processed within two hours of collection using density-gradient centrifugation. All centrifugation steps were performed with reduced acceleration and deceleration settings to minimize cell stress. Samples were inverted 8–10 times, then centrifuged in a swing-out rotor at 1,500 × g for 30 min at 20 °C. After removal of the plasma fraction, the PBMC layer was harvested and centrifuged again at 300 × g for 15 min at 20 °C. The cell pellet was treated with ACK lysing buffer (Thermo Fisher Scientific, cat. no. A10492) to remove residual erythrocytes.
Cells were washed twice in PBS (pH 7.4) supplemented with 1% FBS and 1 mM EDTA, with centrifugation at 300 × g for 15 min at 20 °C after each wash. PBMCs were cryopreserved in CryoStor CS10 freezing medium (STEMCELL Technologies, cat. no. 079555). Cryovials were cooled at −80 °C overnight in a controlled-rate freezing container and subsequently transferred to liquid nitrogen for long-term storage.
Genetic Multiplexing and Sample Pooling
Thawing, washing, and pooling of PBMCs for genetic multiplexing were performed at the main study site in Moscow. Cryopreserved donor vials were rapidly thawed in a 37 °C water bath for 1–2 min until ice crystals were no longer visible, then transferred into pre-warmed thawing medium consisting of RPMI (Gibco, cat. no. 21870076) supplemented with 5% human serum (Sigma-Aldrich, cat. no. H4522), 1% penicillin–streptomycin (Gibco, cat. no. 15140122), and 1% L-glutamine (Gibco, cat. no. 25030081).
Cells from each vial were centrifuged at 300 × g for 5 min at 21 °C, then washed once with pre-warmed medium containing RPMI, 10% FBS, 1% penicillin–streptomycin, and 1% L-glutamine. Two additional washes were performed using pre-warmed PBS supplemented with 0.04% bovine serum albumin (BSA; Capricorn Scientific, cat. no. BSA-1S). After washing, each sample was filtered through a 30 μm MACS SmartStrainer (Miltenyi Biotec) to eliminate aggregates and debris. All subsequent manipulations were conducted on ice.
Cell counts were determined by mixing samples 1:1 with trypan blue and analyzing with an automated cell counter (Thermo Fisher Countess II FL). Cells were resuspended in PBS containing 0.04% BSA at a final concentration of 1.5 × 10⁶ cells/mL. For each experimental batch, equal numbers and volumes of cells from individual donors were combined. The pooled suspension was recounted prior to proceeding with the 10x Genomics single-cell assays.
Method Details
10x Genomics 5′ v2 RNA-Sequencing
Fifteen donors and one admixed Korean sample (as technical replicate) were pooled per batch.
Genomic DNA Isolation, Genotyping, Genotype Quality Control, and Genotype Imputation
TODO: content to be added.
Single-Cell RNA-Sequencing Pre-processing and Quality Control
Quality control was performed separately for each major cell population to account for lineage-specific transcriptional features and library complexity. Low-quality cells and potential doublets were excluded based on standard metrics, including gene counts, UMI counts, and mitochondrial transcript proportion, with thresholds adapted to each population.
Integration and Quality Control
Batch correction and data integration were carried out using mrVI, enabling joint embedding of cells across donors, ancestries, and disease states while preserving biological variability. Following integration, first-level cell-type annotation was performed using RCAv2, which assigns labels based on reference transcriptomic profiles. Subsequent sub-clustering within major lineages was conducted in the integrated latent space, and refined annotations were assigned manually using established lineage- and state-specific marker genes.
Differential Abundance Analysis
Differential cell-type abundance was assessed using Milo, which performs neighborhood-based testing on single-cell k-nearest neighbor (kNN) graphs. Following data integration and construction of the kNN graph in the joint embedding space, overlapping cellular neighborhoods were defined to capture local cellular states independent of discrete clustering. For each neighborhood, cell counts per sample were aggregated and modeled using a negative binomial generalized linear model, fitted using the edgeR solver, with disease status and ancestry included as covariates where appropriate. Statistical significance was evaluated using spatial false discovery rate (FDR) correction to account for graph connectivity between neighborhoods. Neighborhood-level results were subsequently mapped back onto the UMAP embedding to visualize regions exhibiting significant differential abundance between conditions.
Differential Gene Expression Analysis
To identify differentially expressed genes (DEGs) between T2D and non-T2D subjects, we employed the DESeq2 differential expression pipeline. Pseudobulk data were generated by summing raw counts per donor to enable more robust statistical testing, discarding donor–cell type groups with fewer than 10 cells. Genes detected (non-zero) in fewer than 20% of pseudobulk samples were removed. Batch effects were then corrected separately for each cell type using ComBat-seq (sva package), incorporating sex, age, T2D status, and ancestry as biological covariates. Continuous covariates were standardized prior to modeling.
We did not report the overall T2D effect estimated from the model ~ sex + age + ancestry + T2D, because in the presence of ancestry-specific heterogeneity this coefficient represents a weighted average across groups, where ancestries with larger sample sizes or stronger, more consistent transcriptional responses contribute disproportionately to the estimate. Instead, we focused on ancestry-specific contrasts and concordant effects, which more accurately capture both shared and population-specific transcriptional programs associated with T2D. The corrected counts were subsequently used for DESeq2 analysis with the model design ~ sex + age + ancestry × T2D. The DESeq function was run with fitType = “local” to estimate dispersions via local regression. P values were adjusted using the Benjamini–Hochberg method. DEGs were defined at an adjusted P value threshold of 0.05 and an absolute log₂ fold-change cutoff of ≥ 0.5.
Pathway Activity Analysis
Gene set activity was inferred using the Univariate Linear Model (ULM) framework implemented in the decoupler package. For each differential expression contrast, gene-level statistics were defined as Wald z-scores derived from DESeq2 (log₂ fold change divided by its standard error). All genes with non-missing statistics were included without pre-filtering to preserve the full null distribution. Gene sets were obtained from MSigDB v2026.1 annotations accessed via OmniPath (downloaded January 2026), restricted to Hallmark, KEGG, Reactome, WikiPathways, BioCarta, and Gene Ontology Biological Process collections. Only gene sets containing at least 10 overlapping genes were tested, with gene set size varying between 10 and 400 genes (7,774 gene sets were included in the analysis). ULM fits a linear model for each gene set, regressing gene-level statistics against gene-set membership to estimate pathway activity scores. Statistical significance was assessed using model-derived p-values and corrected for multiple testing using the Benjamini–Hochberg FDR. Pathways with FDR < 0.05 were considered significantly enriched. Dendritic cell populations were not included in pathway activity analysis because the number of cells and corresponding pseudobulk profiles was insufficient for robust statistical testing.
Regulatory Network Inference
Regulons were inferred using pySCENIC (v0.12.0) by running GRNBoost2 on the raw-count scRNA-seq expression matrix 40 times, followed by motif-based pruning using the motifs-v9-nr.hgnc motif-to-TF annotation table (m0.001) and the hg38 RefSeq r80 ±10 kb TSS cisTarget ranking database (hg38__refseq-r80__10kb_up_and_down_tss.mc9nr). Regulons were aggregated across runs, retaining only those recovered in ≥ 25% of runs. Regulon activities were computed with AUCell on the ComBat-seq–corrected pseudobulk profiles to improve robustness and reduce sparsity. For each cell type and ancestry, regulon activities were compared between T2D and healthy samples using a Wilcoxon rank-sum test with FDR correction for multiple testing.
Inference and Analysis of Cell Differentiation Trajectories
Pre-processing. Trajectory analysis was performed independently for major cell types: CD4⁺ T cells, CD8⁺ T cells, B cells, NK cells, and monocytes. To ensure trajectory consistency, specific populations were excluded prior to trajectory inference: cells of “Mixed” ethnicity, Tregs (CD4⁺ T), MAIT cells (CD8⁺ T), atypical/plasma/transitional ZEB2ʰⁱ B cells, and CD14ʰⁱ activated/ISGʰⁱ monocytes. For the remaining cells, the top 4,000 highly variable genes (HVGs) were identified separately for each major cell type based on two criteria: (1) normalized dispersion and (2) expression in at least 10 healthy and 10 T2D cells. Normalization and HVG selection were performed using the normalize_total, log1p, and highly_variable_genes functions (Scanpy) with default parameters.
Trajectory inference and differential expression. Trajectories were reconstructed using the learn_graph function (Monocle3) based on the first two UMAP components and clusters corresponding to Level 4 cell annotations. Parameters for learn_graph were selected manually. Cells were subsequently assigned pseudotime values via the order_cells function with root nodes manually selected within progenitor cell states: CD4⁺ naive core (CD4⁺ T), CD8⁺ naive (CD8⁺ T), naive resting B cells, CD14ʰⁱ homeostasis (monocytes), and CD56ʰⁱ (NK cells).
To identify genes with significant inter-ethnic expression changes along inferred trajectories, we employed the tradeSeq package. Expression patterns of the 4,000 HVGs were modelled as non-linear functions of the Monocle3 pseudotime values via the fitGAM function (knots = 5; conditions assigned to ethnicities; cell weights equal to 1 for each cell; Z-scaled age, sex, and a virtual batch provided as covariates via the U parameter). To isolate inter-ethnic variation, gene expression patterns were fitted separately for cells from the healthy and T2D cohorts. Differential expression between ethnic groups was assessed using the conditionTest function (pairwise = TRUE, l2fc = 1). For each pair of conditions, p-values were adjusted for multiple testing using the Benjamini–Hochberg FDR method. Genes meeting the threshold of FDR < 0.05 were retained for downstream clustering.
Expression profile clustering and functional annotation. To identify co-regulated gene patterns across conditions, smooth expression profiles across 100 equidistant pseudotime points were predicted for significant genes using predictSmooth (nPoints = 100). These values were organized into a gene-by-pseudotime matrix, where rows represented smoothed gene expression across the two compared conditions and columns represented the equidistant pseudotime points. The matrix was row-wise standardized using Z-score transformation. A dissimilarity matrix (d = 1 − ρ) was constructed based on pairwise Spearman’s correlation (ρ). Hierarchical clustering was performed using Ward’s D2 linkage (method = “ward.D2”). The ComplexHeatmap package was used to visualize the gene-by-pseudotime matrix. Gene Ontology functional enrichment analysis was performed for each cluster using clusterProfiler to characterize the biological significance of the identified clusters.
Cell-Cell Communication Analysis
TODO: content to be added.
Quantification and Statistical Analysis
All statistical tests were performed using R, R packages, or Python, with specific testing details listed in the individual method sections. All statistical tests performed were two-tailed.