Lesson 4: Association Testing Concepts

ID: GWAS-L04
Type: Statistical Foundations
Audience: Public
Theme: Per-SNP testing, adjustment, and interpretation boundaries

What Association Testing Actually Does

In GWAS, we test whether a genetic variant is associated with a trait.

For a quantitative trait, the simplest model is a linear regression:

Trait ~ SNP + Covariates

The SNP is usually coded as 0, 1, 2, representing allele dosage.

The output provides:

an effect size (beta)
a standard error
a p-value

A small p-value indicates evidence of association, not proof of causality.

Load Analysis Inputs

In this lesson, we combine:

phenotype and covariates (from Lesson 2)
a genotype matrix (from the demo data)

source("scripts/R/cdi-plot-theme.R")
library(ggplot2)

pheno <- read.csv("data/demo-phenotype.csv", stringsAsFactors = FALSE)
covar <- read.csv("data/demo-covariates.csv", stringsAsFactors = FALSE)

df <- merge(pheno, covar, by = "sample_id")

geno <- read.csv("data/demo-genotypes.csv", row.names = 1)

dim(df)

[1] 200   6

dim(geno)

[1]  200 2000

Inspect the Analysis Table

Before fitting any models, confirm the analysis table contains what you think it contains.

str(df)

'data.frame':   200 obs. of  6 variables:
 $ sample_id: chr  "sample-001" "sample-002" "sample-003" "sample-004" ...
 $ trait    : num  1.48 2.49 1.68 1.96 2.23 ...
 $ age      : int  38 42 64 46 47 66 51 30 37 40 ...
 $ sex      : chr  "F" "F" "F" "F" ...
 $ PC1      : num  -0.715 -0.753 -0.939 -1.053 -0.437 ...
 $ PC2      : num  -0.9107 -1.2466 0.2583 -0.0307 -1.4696 ...

head(df)

   sample_id    trait age sex        PC1         PC2
1 sample-001 1.484454  38   F -0.7152422 -0.91065959
2 sample-002 2.485771  42   F -0.7526890 -1.24657225
3 sample-003 1.676582  64   F -0.9385387  0.25830482
4 sample-004 1.963092  46   F -1.0525133 -0.03065893
5 sample-005 2.226643  47   F -0.4371595 -1.46962895
6 sample-006 2.206656  66   F  0.3311792  0.12258954

summary(df$trait)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.3067  1.3695  2.0493  2.1273  2.8598  5.3643

You should see the core fields used throughout this guide:

sample_id
trait
age
sex
PC1
PC2

If these are missing or mis-typed, association testing may still run, but interpretation becomes unreliable.

Align Samples

Genotype data uses row names as sample IDs. We must ensure phenotype rows match genotype rows.

common_ids <- intersect(df$sample_id, rownames(geno))

df <- df[df$sample_id %in% common_ids, ]
geno <- geno[common_ids, , drop = FALSE]

# Match order exactly
df <- df[match(common_ids, df$sample_id), ]

stopifnot(all(df$sample_id == rownames(geno)))

dim(df)

[1] 200   6

dim(geno)

[1]  200 2000

Minimal Adjustment Model

In Lesson 2, we motivated adjustment for age, sex, and principal components.

In this demo, we use:

Trait ~ age + sex + PC1 + PC2

Association testing adds one SNP at a time to this baseline model.

Example: One SNP Test

To make the process concrete, test a single SNP and interpret the output.

snp_name <- colnames(geno)[1]
x <- geno[, snp_name]

# Mean impute for this SNP (simple demo approach)
if (any(is.na(x))){
  x[is.na(x)] <- mean(x, na.rm = TRUE)
}

fit <- lm(trait ~ x + age + sex + PC1 + PC2, data = df)
summary(fit)$coefficients["x", ]

  Estimate Std. Error    t value   Pr(>|t|) 
0.29666525 0.15447184 1.92051353 0.05625948

Interpretation

The coefficient for x is the estimated trait change per allele.
The p-value tests whether that coefficient differs from zero.

This is an association test. It is not yet a biological claim.

Scaling Up: Per-SNP Regression

GWAS repeats the same model across many SNPs.

We will create a small function that returns:

beta
standard error
p-value

test_snp_lm <- function(x, df){
  if (any(is.na(x))){
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  }

  fit <- lm(trait ~ x + age + sex + PC1 + PC2, data = df)
  co <- summary(fit)$coefficients

  c(
    beta = unname(co["x", "Estimate"]),
    se = unname(co["x", "Std. Error"]),
    p = unname(co["x", "Pr(>|t|)"])
  )
}

Now apply it across all SNPs.

res <- t(apply(geno, 2, test_snp_lm, df = df))
res <- as.data.frame(res)
res$snp <- rownames(res)

res <- res[, c("snp", "beta", "se", "p")]
head(res)

    snp        beta        se          p
rs1 rs1  0.29666525 0.1544718 0.05625948
rs2 rs2 -0.16324751 0.1546238 0.29238528
rs3 rs3  0.12014575 0.2072047 0.56269466
rs4 rs4  0.11951910 0.1291166 0.35576846
rs5 rs5 -0.09684449 0.1517165 0.52401469
rs6 rs6  0.05570001 0.1119807 0.61946512

Multiple Testing

A GWAS typically tests hundreds of thousands to millions of SNPs.

With so many tests:

some p-values will be small by chance
naive thresholds (like 0.05) are not meaningful

This is why GWAS uses stringent thresholds (often around 5e-8 in real studies).

In this demo, we focus on concepts rather than strict genome-wide significance.

Effect Size vs Significance

A small p-value can occur with:

a large effect size
a small standard error
a large sample size

A large effect size can be non-significant if sample size is small or variance is large.

Effect size answers:

How strong is the association?

P-value answers:

How surprising is the association under the null?

Both must be interpreted together.

Quick Diagnostic: Top Hits

res <- res[order(res$p), ]
head(res, 10)

          snp       beta        se            p
rs153   rs153  0.6862770 0.1030083 2.714428e-10
rs754   rs754 -0.7736402 0.1873225 5.388030e-05
rs229   rs229 -0.4507871 0.1184809 1.901932e-04
rs424   rs424  0.3982625 0.1161028 7.366738e-04
rs593   rs593  0.3478399 0.1066971 1.315523e-03
rs1925 rs1925 -0.3657837 0.1129199 1.409698e-03
rs1529 rs1529  0.3740211 0.1158913 1.466877e-03
rs900   rs900  0.3483107 0.1094973 1.709060e-03
rs1063 rs1063  0.3584473 0.1127226 1.715172e-03
rs156   rs156  0.4618530 0.1510465 2.545306e-03

At this stage, a top hit is not yet a biological claim. It is a candidate signal that needs validation and calibration.

What This Lesson Adds to the Reasoning Chain

Association testing provides statistical evidence. It does not provide mechanism.

Before interpreting a hit, we still need to ask:

Is the signal robust to covariate adjustment?
Is it driven by structure or outliers?
Is the effect size plausible?
Is it likely replicated, or just a chance finding?

Those checks start in Lesson 5 with visualization and diagnostics.