SubgroupSignalStudy

Simulation-based power analysis for detecting rare gene-environment interactions...

epidemiology, simulation, power-analysis, Python, NumPy, SciPy

Premise

Rare genetic variants might interact with prenatal acetaminophen exposure to increase autism risk in specific subgroups. But existing epidemiological studies aren't designed to detect rare gene-environment interactions: standard power calculations assume main effects, and the sample size needed to detect a subgroup effect diluted across the whole population grows roughly as 1/f², where f is the variant frequency. Before running an empirical study, the question is: what designs and sample sizes could even detect a real effect in a group that might represent 0.1–2% of the population?
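One way to read the 1/f² scaling is as a dilution argument. The sketch below is an illustration under stated assumptions, not a derivation taken from the project: carriers are not analysed separately, the variant is independent of exposure, only carriers (fraction f) have an exposure effect δ on the risk-difference scale, and the usual two-group normal approximation applies. The symbols δ, σ², α and β are introduced here for illustration.

```latex
% Assumptions (not from the source): effect only in carriers, variant
% independent of exposure, standard two-sample normal approximation.
% f      : variant (subgroup) frequency
% \delta : risk difference caused by exposure among carriers
% \sigma^2 \approx p(1-p) : outcome variance at baseline risk p
\begin{align*}
\Delta_{\text{marginal}} &= f\,\delta \\
n_{\text{per arm}} &\approx
  \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\Delta_{\text{marginal}}^{2}}
  = \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{f^{2}\,\delta^{2}}
\end{align*}
```

Under these assumptions, moving from f = 2% to f = 0.1% (a 20-fold drop) multiplies the required sample size by about 20² = 400. If carriers can be genotyped and analysed as their own stratum, the requirement scales closer to 1/f, since what matters is the number of carriers enrolled.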

How it evolved

Started as a power calculation and grew into a full simulation framework when it became clear that a single number wasn't enough: the answer depends heavily on study design. Four designs were implemented in a comparative framework: population-based cohort, sibling-controlled (within-family, McNemar's test), case-control, and RCT with randomization, compliance, and crossover parameters. Later additions: demographic stratification using 2024 CDC autism prevalence data by sex and ethnicity, a Simpson's paradox demonstration with stratified datasets, and production infrastructure for long-running analyses (checkpoint/recovery, parallel processing with independent RNG streams, real-time ETA estimation, and multi-level logging).

Technical crux

Two implementation decisions mattered. First, odds ratio conversion: interaction effects must be applied on the odds scale and converted back to a probability (new_p = p0·OR / (1 − p0 + p0·OR)), not by multiplying the baseline risk as if the OR were a risk ratio. The gap between the two grows with the baseline rate and becomes hard to ignore above ~5%, and autism prevalence is close to that threshold. Second, RNG independence: the original parallel implementation used a global np.random.seed(), causing correlated samples across workers and underestimating variance. np.random.SeedSequence.spawn() fixes this by giving each worker its own independent stream, which matters when 10K+ simulation runs are needed for stable power curves. The sibling-controlled design also uses the exact binomial form of McNemar's test for small samples, which most ad hoc power analyses skip.
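The three pieces above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names (apply_odds_ratio, worker_draws, exact_mcnemar_pvalue), the seed value, and the example inputs are all hypothetical. Only np.random.SeedSequence.spawn(), np.random.default_rng() and scipy.stats.binomtest() are real library calls.

```python
import numpy as np
from scipy.stats import binomtest


def apply_odds_ratio(p0, odds_ratio):
    """Shift a baseline probability p0 by an odds ratio, staying on the odds scale."""
    return p0 * odds_ratio / (1.0 - p0 + p0 * odds_ratio)


def worker_draws(child_seq, size):
    """Draw uniforms from a generator owned by one worker (hypothetical helper)."""
    rng = np.random.default_rng(child_seq)  # independent stream for this worker
    return rng.random(size)


def exact_mcnemar_pvalue(b, c):
    """Two-sided exact McNemar test on discordant pair counts b and c.

    Under the null, each discordant pair falls in either cell with probability 0.5,
    so the test reduces to an exact binomial test of b out of b + c.
    """
    n = int(b) + int(c)
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    return binomtest(int(b), n, 0.5).pvalue


if __name__ == "__main__":
    # Odds-scale conversion vs. naive risk-ratio multiplication (illustrative inputs).
    for p0 in (0.03, 0.20):
        print(p0, apply_odds_ratio(p0, 3.0), p0 * 3.0)

    # One child SeedSequence per worker instead of a shared global seed.
    root = np.random.SeedSequence(12345)  # arbitrary example seed
    children = root.spawn(4)
    for child in children:
        print(worker_draws(child, 3))

    # Exact McNemar on a small, made-up set of discordant sibling pairs.
    print(exact_mcnemar_pvalue(9, 2))
```

In a real parallel run each child sequence would be passed to its own worker process. The loop here only shows that each child yields a distinct, reproducible stream.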

Findings

Power curves across four study designs for subgroup frequencies 0.1–2% and interaction ORs 2.0–4.0. Key result: detecting an interaction with OR 3.0 for a variant at 0.1% frequency requires ~500K participants in an observational cohort; the sibling-controlled design cuts this to ~250K. Demographic stratification by sex and ethnicity demonstrates Simpson's paradox conditions under realistic CDC prevalence parameters. Framework runs in configurable resolution modes (5 min to 2 days) with full checkpoint/recovery. Paper outline drafted.
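For readers unfamiliar with simulation-based power, a single point on such a curve can be estimated roughly as below. This is a deliberately simplified sketch, not the framework's method, and it is not expected to reproduce the sample sizes quoted above. It assumes carriers are genotyped and analysed on their own, exposure is independent of genotype, and a one-sided Fisher exact test compares exposed and unexposed carriers. The function names, the 3% baseline risk, the 50% exposure rate, and the grid values are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import fisher_exact


def apply_odds_ratio(p0, odds_ratio):
    """Shift a baseline probability p0 by an odds ratio."""
    return p0 * odds_ratio / (1.0 - p0 + p0 * odds_ratio)


def estimate_power(n_total, variant_freq, exposure_rate, p0, interaction_or,
                   n_sims=200, alpha=0.05, seed=0):
    """Fraction of simulated cohorts in which the carrier-only test rejects.

    Assumption: the exposure effect exists only in variant carriers, so only
    the carrier stratum is simulated and tested.
    """
    rng = np.random.default_rng(seed)
    p1 = apply_odds_ratio(p0, interaction_or)  # risk for exposed carriers
    rejections = 0
    for _ in range(n_sims):
        n_carriers = int(rng.binomial(n_total, variant_freq))
        n_exposed = int(rng.binomial(n_carriers, exposure_rate))
        n_unexposed = n_carriers - n_exposed
        if n_exposed == 0 or n_unexposed == 0:
            continue  # nothing to compare; counts as a failure to reject
        cases_exposed = int(rng.binomial(n_exposed, p1))
        cases_unexposed = int(rng.binomial(n_unexposed, p0))
        table = [[cases_exposed, n_exposed - cases_exposed],
                 [cases_unexposed, n_unexposed - cases_unexposed]]
        # One-sided test: does exposure raise the odds of the outcome in carriers?
        _, p_value = fisher_exact(table, alternative="greater")
        if p_value < alpha:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    # Small illustrative grid; the parameter values are assumptions, not project settings.
    for freq in (0.001, 0.005, 0.02):
        power = estimate_power(n_total=100_000, variant_freq=freq,
                               exposure_rate=0.5, p0=0.03, interaction_or=3.0)
        print(f"f={freq:.3f}  estimated power={power:.2f}")
```

With only 200 replicates per point the estimate is noisy; a stable curve needs many more runs, which is where independent RNG streams and checkpointing start to matter.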

Open questions

The simulation uses synthetic populations calibrated to 2024 CDC data. Real autism genetic architecture is far more complex than a single rare-variant model. Polygenic background, gene-gene interactions, and exposure measurement error are modeled as noise rather than structure. The practical blocker: whether any existing biobank (SPARK, ABCD, UK Biobank) has both genotype data and prenatal exposure records at sufficient scale for the required sample sizes. The most important unresolved question is not statistical: if the interaction exists and is detectable, what's the intervention? Acetaminophen avoidance recommendations for a genetically undefined subgroup aren't actionable without a viable screening pathway.


Detailed case study in progress.


2024