Lesson 6: From Association to Biological Claims

ID: GWAS-L06
Type: Interpretation Discipline
Audience: Public
Theme: Calibration, plausibility, and responsible reporting

Why This Lesson Exists

A statistically significant association is not automatically a biological discovery.

GWAS identifies statistical relationships between genotype and phenotype. It does not automatically establish causality, mechanism, or clinical relevance.

This lesson focuses on disciplined interpretation.

Recompute Results (Self-Contained)

source("scripts/R/cdi-plot-theme.R")
library(ggplot2)

pheno <- read.csv("data/demo-phenotype.csv", stringsAsFactors = FALSE)
covar <- read.csv("data/demo-covariates.csv", stringsAsFactors = FALSE)
geno  <- read.csv("data/demo-genotypes.csv", row.names = 1)

df <- merge(pheno, covar, by = "sample_id")

common_ids <- intersect(df$sample_id, rownames(geno))
df   <- df[match(common_ids, df$sample_id), ]
geno <- geno[common_ids, , drop = FALSE]

test_snp_lm <- function(x, df){
  if (any(is.na(x))){
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  }
  fit <- lm(trait ~ x + age + sex + PC1 + PC2, data = df)
  co <- summary(fit)$coefficients
  c(
    beta = unname(co["x", "Estimate"]),
    se   = unname(co["x", "Std. Error"]),
    p    = unname(co["x", "Pr(>|t|)"])
  )
}

res <- t(apply(geno, 2, test_snp_lm, df = df))
res <- as.data.frame(res)
res$snp <- rownames(res)
res$logp <- -log10(res$p)

alpha_nominal <- 0.05
res$is_sig <- res$p < alpha_nominal

res <- res[order(res$p), ]
head(res, 5)

            beta        se            p   snp     logp is_sig
rs153  0.6862770 0.1030083 2.714428e-10 rs153 9.566322   TRUE
rs754 -0.7736402 0.1873225 5.388030e-05 rs754 4.268570   TRUE
rs229 -0.4507871 0.1184809 1.901932e-04 rs229 3.720805   TRUE
rs424  0.3982625 0.1161028 7.366738e-04 rs424 3.132725   TRUE
rs593  0.3478399 0.1066971 1.315523e-03 rs593 2.880901   TRUE

Effect Size Distribution

ggplot(res, aes(x = beta)) +
  cdi_geom_histogram(bins = 40, colored = TRUE) +
  cdi_scale_histogram_fill() +
  labs(
    title = "Effect size distribution across SNPs",
    subtitle = "Most SNP effects should be near zero in complex traits",
    x = "Estimated beta (trait change per allele)",
    y = "Number of SNPs"
  ) +
  cdi_theme()

Interpretation

In complex traits, most variants have small effects. A symmetric distribution centered near zero is expected under polygenicity. Large-magnitude effects should be treated with caution and verified carefully.

Volcano-Style Calibration Plot

pal <- cdi_palette()

plot_df <- res
plot_df$sig_label <- ifelse(plot_df$is_sig,
                            "Nominally significant",
                            "Not significant")

ggplot(plot_df, aes(x = beta, y = logp)) +
  geom_point(aes(color = sig_label),
             alpha = 0.8,
             size = 1.8) +
  geom_vline(xintercept = 0,
             linewidth = 0.6,
             color = pal$muted) +
  geom_hline(yintercept = -log10(alpha_nominal),
             linetype = "dashed",
             linewidth = 0.6,
             color = pal$highlight) +
  scale_color_manual(values = c(
    "Not significant" = pal$teal_light,
    "Nominally significant" = pal$highlight
  )) +
  labs(
    title = "Effect size vs statistical evidence",
    subtitle = "Vertical line: no effect. Horizontal line: nominal threshold.",
    x = "Estimated beta",
    y = "-log10(p-value)",
    color = ""
  ) +
  cdi_theme()

Interpretation

The characteristic “V” shape is typical in GWAS.

Points near beta = 0 show little evidence.
Stronger effects with stable estimates rise above the threshold.
Large betas with weak evidence often indicate instability or sparse genotype counts.

Significant points are candidates — not conclusions.

Sensitivity to Population Structure

top_snp <- res$snp[1]
x <- geno[, top_snp]

if (any(is.na(x))){
  x[is.na(x)] <- mean(x, na.rm = TRUE)
}

fit_adjusted   <- lm(trait ~ x + age + sex + PC1 + PC2, data = df)
fit_unadjusted <- lm(trait ~ x + age + sex, data = df)

adj   <- coef(summary(fit_adjusted))["x", ]
unadj <- coef(summary(fit_unadjusted))["x", ]

adj

    Estimate   Std. Error      t value     Pr(>|t|) 
6.862770e-01 1.030083e-01 6.662345e+00 2.714428e-10

unadj

    Estimate   Std. Error      t value     Pr(>|t|) 
6.539695e-01 1.056920e-01 6.187505e+00 3.500449e-09

Interpretation

If adjustment causes large shifts in effect magnitude or direction, the association may reflect ancestry structure rather than locus-specific biology.

Robust signals remain directionally consistent after adjustment.

Interpretation Hierarchy

Responsible GWAS interpretation follows layers:

Statistical evidence
Calibration and inflation diagnostics
Robustness to covariate adjustment
Replication
Biological plausibility
Functional validation

Skipping layers increases the risk of overstatement.

What This Free Track Achieved

You now understand:

Study design logic
Phenotype calibration
Genotype QC
Population structure control
Per-SNP modeling
Diagnostic calibration
Interpretation discipline

The goal was not to produce hits.

The goal was to produce defensible reasoning.