Differential Gene Expression Analysis Pipeline

This page documents the DESeq2-based DEG pipeline used in this study. See Methods for the manuscript prose.

Pipeline Steps

1. Pseudobulk Aggregation

  • Raw counts summed per donor–cell-type pair
  • Donor–cell-type groups with fewer than 10 cells discarded
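
The aggregation above can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual code: the input layout (one tuple per cell holding donor, cell type, and a gene-to-count mapping) and the function name are assumptions made for the example.

```python
from collections import defaultdict

MIN_CELLS = 10  # threshold stated in the step above


def pseudobulk(cells, min_cells=MIN_CELLS):
    """Sum raw counts per (donor, cell_type) pair; drop pairs with too few cells.

    `cells` is an iterable of (donor, cell_type, counts) tuples, where `counts`
    maps gene name -> raw count for a single cell (hypothetical layout).
    """
    sums = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for donor, cell_type, counts in cells:
        key = (donor, cell_type)
        n_cells[key] += 1
        group = sums[key]  # create the group even if this cell has no counts
        for gene, count in counts.items():
            group[gene] += count
    # Keep only groups that meet the minimum cell count.
    return {
        key: dict(genes)
        for key, genes in sums.items()
        if n_cells[key] >= min_cells
    }
```

Summing raw counts (rather than averaging normalized values) keeps the data in the integer count space that count-based models expect.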

2. Gene Filtering

  • Genes with non-zero counts in fewer than 20% of pseudobulk samples removed
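
A minimal sketch of this detection filter, assuming pseudobulk counts are held as a mapping from sample id to a gene-to-count dictionary (a hypothetical layout chosen for the example). A gene detected in exactly 20% of samples is kept, since only genes below the threshold are removed.

```python
def filter_genes(pseudobulk_counts, min_fraction=0.20):
    """Keep genes with non-zero counts in at least `min_fraction` of samples.

    `pseudobulk_counts` maps sample id -> {gene: summed count}.
    """
    n_samples = len(pseudobulk_counts)
    detected = {}  # gene -> number of samples with a non-zero count
    for counts in pseudobulk_counts.values():
        for gene, count in counts.items():
            detected[gene] = detected.get(gene, 0) + (1 if count > 0 else 0)
    kept = {g for g, n in detected.items() if n / n_samples >= min_fraction}
    return {
        sample: {g: c for g, c in counts.items() if g in kept}
        for sample, counts in pseudobulk_counts.items()
    }
```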

3. Batch Correction

  • ComBat-seq (sva package) applied separately per cell type
  • Covariates: sex, age, T2D status, ancestry
  • Continuous covariates standardized prior to modeling
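
Standardization of a continuous covariate can be sketched as a z-score transform. The page does not say which covariates are treated as continuous or whether the sample or population standard deviation was used; the example below assumes age and the sample standard deviation. ComBat-seq itself is not reproduced here.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation.

    Assumption: sample SD (n - 1 denominator); the source does not specify.
    """
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("covariate is constant; cannot standardize")
    return [(x - mu) / sd for x in values]


# Illustrative ages only, not study data.
ages_z = standardize([30, 40, 50])
```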

4. Model Design

Corrected counts modeled as:

~ sex + age + ancestry × T2D
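
To make the formula concrete, the sketch below expands it into design-matrix columns. It assumes that "×" denotes full crossing (main effects plus interaction, as `*` does in an R formula) and treatment coding with the first ancestry level as reference. The level names "A", "B", "C", the column names, and the sample field names are hypothetical.

```python
def design_columns(ancestry_levels):
    """Column names for ~ sex + age + ancestry * T2D under treatment coding."""
    _ref, *others = ancestry_levels  # first level is the reference
    cols = ["Intercept", "sex_M", "age"]
    cols += [f"ancestry_{a}" for a in others]
    cols += ["T2D"]
    cols += [f"ancestry_{a}:T2D" for a in others]
    return cols


def design_row(sample, ancestry_levels):
    """Numeric design row for one pseudobulk sample (hypothetical field names)."""
    _ref, *others = ancestry_levels
    t2d = 1.0 if sample["t2d"] else 0.0
    row = [1.0, 1.0 if sample["sex"] == "M" else 0.0, sample["age_z"]]
    row += [1.0 if sample["ancestry"] == a else 0.0 for a in others]
    row.append(t2d)
    # Interaction columns are non-zero only for T2D samples of that ancestry.
    row += [t2d if sample["ancestry"] == a else 0.0 for a in others]
    return row
```

Under this coding, the T2D column alone captures the T2D effect in the reference ancestry, and the effect in any other ancestry is the T2D coefficient plus the matching interaction coefficient.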

5. Dispersion Estimation

  • DESeq fitType = "local" (local regression)

6. Contrast Strategy

  • Primary: ancestry-specific contrasts (e.g., T2D vs healthy within each ancestry group)
  • No overall T2D coefficient reported: with ancestry-specific heterogeneity, the overall coefficient is a sample-size-weighted average that can be skewed by high-magnitude signals in a subset of ancestries
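
The concern about a pooled coefficient can be illustrated with simple arithmetic. The numbers below are invented for illustration and are not study results; the sketch treats the overall effect as a plain sample-size-weighted mean of ancestry-specific log₂ fold-changes, which simplifies what a fitted model would report.

```python
def weighted_mean_lfc(lfc_by_group, n_by_group):
    """Sample-size-weighted average of group-specific log2 fold-changes."""
    total = sum(n_by_group.values())
    return sum(lfc_by_group[g] * n_by_group[g] for g in lfc_by_group) / total


# Hypothetical values: two groups show little or no effect, one shows a large one.
lfc = {"A": 0.1, "B": 0.0, "C": 2.0}
n = {"A": 40, "B": 40, "C": 20}

pooled = weighted_mean_lfc(lfc, n)  # about 0.44
```

Here the pooled value of about 0.44 describes none of the three groups: it overstates the effect in A and B and understates it in C, which is the motivation for reporting within-ancestry contrasts.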

7. Significance Thresholds

  • Adjusted P value < 0.05 (Benjamini–Hochberg)
  • Absolute log₂ fold-change ≥ 0.5
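
The two thresholds can be sketched as follows. The Benjamini–Hochberg function is a standalone re-implementation for illustration; in the pipeline the adjusted P values come from DESeq2. Function names are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def is_significant(padj, log2fc, alpha=0.05, min_abs_lfc=0.5):
    """Apply both thresholds: adjusted P < alpha and |log2FC| >= min_abs_lfc."""
    return padj < alpha and abs(log2fc) >= min_abs_lfc
```

Both conditions must hold, so a gene with a very small adjusted P value but a log₂ fold-change of 0.3 is not called significant.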

8. Output for Downstream Analysis

  • Wald z-scores (log₂FC / SE) passed to decoupler's univariate linear model (ULM) method for pathway activity analysis
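
The sketch below shows the Wald z-score and the idea behind a ULM activity score: for one pathway, gene-level statistics are regressed on the pathway's gene weights (zero for genes outside the pathway), and the t-value of the slope is taken as the activity. It is a standalone illustration, not decoupler's implementation; the function names and the input layout are assumptions.

```python
from math import sqrt


def wald_z(log2fc, se):
    """Wald statistic: log2 fold-change divided by its standard error."""
    return log2fc / se


def ulm_score(stats, weights):
    """t-value of the slope from regressing gene stats on pathway weights.

    `stats` maps gene -> Wald z-score; `weights` maps pathway genes -> weight.
    Genes absent from `weights` get weight 0. Needs at least 3 genes, some
    variation in the weights, and a fit that is not perfect (non-zero residuals).
    """
    genes = sorted(stats)
    y = [stats[g] for g in genes]
    x = [weights.get(g, 0.0) for g in genes]
    n = len(genes)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    rss = syy - slope * sxy          # residual sum of squares
    se_slope = sqrt(rss / (n - 2) / sxx)
    return slope / se_slope
```

A positive score indicates that genes with positive pathway weights tend to have higher z-scores than the remaining genes, and a negative score indicates the opposite.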