Lesson 6 Association Testing Models

6.1 What we test in GWAS

In GWAS, we test whether genotype variation at a variant is associated with a phenotype.

In practice, we fit models that relate:

outcome: quantitative trait or case-control status
predictor: genotype (0, 1, 2 for additive coding)
covariates: age, sex, and ancestry principal components

In this lesson, we:

build a small analysis table from the demo dataset
fit a baseline association model for one SNP
run a small multi-SNP scan
generate a QQ plot as a diagnostic summary


from cdi_viz.theme import cdi_notebook_init, show_and_save_mpl

cdi_notebook_init(chapter="06", title_x=0)

6.2 Load the demo dataset


import pandas as pd
import numpy as np

genotypes = pd.read_csv("data/gwas-demo-dataset/genotypes.csv")
phenotypes = pd.read_csv("data/gwas-demo-dataset/phenotypes.csv")

print("genotypes:", genotypes.shape)
print("phenotypes:", phenotypes.shape)
print("genotype columns (first 10):", list(genotypes.columns[:10]))
print("phenotype columns:", list(phenotypes.columns))

genotypes: (120, 41)
phenotypes: (120, 9)
genotype columns (first 10): ['sample_id', 'rs100002', 'rs100005', 'rs100031', 'rs100018', 'rs100021', 'rs100032', 'rs100035', 'rs100003', 'rs100037']
phenotype columns: ['sample_id', 'trait_binary', 'trait_quant', 'age', 'sex', 'pc1', 'pc2', 'pc3', 'batch']

6.3 Merge phenotypes and genotypes

We merge on sample_id and prepare covariates.

If sex is encoded as F/M, we convert to 0/1.


import statsmodels.api as sm

df = phenotypes.merge(genotypes, on="sample_id", how="inner")

if "sex" in df.columns:
    df["sex"] = df["sex"].map({"F": 0, "M": 1})

candidate_covars = ["age", "sex", "pc1", "pc2", "pc3"]
covars = [c for c in candidate_covars if c in df.columns]

trait_col = "trait_quant"
if trait_col not in df.columns:
    raise KeyError(f"Expected phenotype column '{trait_col}' not found. Columns: {list(df.columns)}")

snp_cols = [c for c in genotypes.columns if c != "sample_id"]

print("Covariates used:", covars)
print("Number of SNP columns:", len(snp_cols))
print(df[[trait_col] + covars + snp_cols[:3]].head())

Covariates used: ['age', 'sex', 'pc1', 'pc2', 'pc3']
Number of SNP columns: 40
   trait_quant  age  sex     pc1     pc2     pc3  rs100002  rs100005  rs100031
0      -0.0108   43    0  1.6018 -1.0624 -0.8633       0.0       1.0       0.0
1       2.0082   56    1 -0.2394 -0.5294 -0.1475       1.0       0.0       NaN
2      -0.7331   55    1 -1.0235 -0.8769 -0.1525       1.0       0.0       0.0
3      -0.9815   44    0  0.1793 -0.0943  0.3834       0.0       0.0       0.0
4       1.5433   66    0  0.2200 -1.7577  0.9998       0.0       0.0       0.0

6.4 Baseline association model for one SNP

We demonstrate a single-variant association test using linear regression:

trait_quant ~ genotype + covariates


snp_demo = snp_cols[0]

X_cols = [snp_demo] + covars
X = df[X_cols].astype(float)
X = sm.add_constant(X)
y = df[trait_col].astype(float)

lin_model = sm.OLS(y, X, missing="drop").fit()

print("Demo SNP:", snp_demo)
print(lin_model.summary())

Demo SNP: rs100002
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            trait_quant   R-squared:                       0.373
Model:                            OLS   Adj. R-squared:                  0.339
Method:                 Least Squares   F-statistic:                     10.99
Date:                Thu, 29 Jan 2026   Prob (F-statistic):           1.39e-09
Time:                        01:22:33   Log-Likelihood:                -140.55
No. Observations:                 118   AIC:                             295.1
Df Residuals:                     111   BIC:                             314.5
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.4663      0.247     -5.936      0.000      -1.956      -0.977
rs100002       0.2355      0.152      1.554      0.123      -0.065       0.536
age            0.0292      0.005      5.700      0.000       0.019       0.039
sex            0.1026      0.154      0.665      0.507      -0.203       0.408
pc1            0.3238      0.076      4.284      0.000       0.174       0.474
pc2           -0.1157      0.074     -1.565      0.120      -0.262       0.031
pc3            0.0089      0.078      0.113      0.910      -0.147       0.164
==============================================================================
Omnibus:                        0.088   Durbin-Watson:                   1.874
Prob(Omnibus):                  0.957   Jarque-Bera (JB):                0.237
Skew:                          -0.040   Prob(JB):                        0.888
Kurtosis:                       2.795   Cond. No.                         161.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

6.5 Minimal association summary


assoc_summary = pd.DataFrame({
    "snp": [snp_demo],
    "beta": [float(lin_model.params.get(snp_demo, np.nan))],
    "se": [float(lin_model.bse.get(snp_demo, np.nan))],
    "p_value": [float(lin_model.pvalues.get(snp_demo, np.nan))],
    "n": [int(lin_model.nobs)],
})

print(assoc_summary)

        snp      beta        se   p_value    n
0  rs100002  0.235523  0.151561  0.123035  118

6.6 Scaling up: test many SNPs (teaching subset)

A real GWAS runs the same model across a very large number of variants. Here we test a small subset to illustrate the pattern and to support a QQ plot.


subset_snps = snp_cols[:300]

pvals = []
for snp in subset_snps:
    X_cols = [snp] + covars
    Xs = df[X_cols].astype(float)
    Xs = sm.add_constant(Xs)
    m = sm.OLS(y, Xs, missing="drop").fit()
    pvals.append(float(m.pvalues.get(snp, np.nan)))

pvals = np.array([p for p in pvals if np.isfinite(p)])

print("Number of SNPs tested:", len(subset_snps))
print("Number of finite p values:", int(len(pvals)))
print("Min p value:", float(np.min(pvals)))

Number of SNPs tested: 40
Number of finite p values: 40
Min p value: 0.016010293099204306

6.7 QQ plot

A QQ plot compares expected and observed p values.

Interpretation guide:

alignment with the diagonal suggests well-calibrated tests
early deviation suggests inflation or confounding
tail deviation may indicate true associations

In small demo datasets, patterns can look noisy.


import matplotlib.pyplot as plt

observed = np.sort(pvals)
expected = np.arange(1, len(observed) + 1) / (len(observed) + 1)

plt.figure()
plt.scatter(-np.log10(expected), -np.log10(observed), s=12)
m = max(-np.log10(expected)) if len(expected) else 1
plt.plot([0, m], [0, m])
plt.xlabel("Expected -log10(p value)")
plt.ylabel("Observed -log10(p value)")

show_and_save_mpl()

Saved PNG → figures/06_001.png

6.8 Key takeaways

Association testing fits a model per variant
Covariates help reduce confounding
QQ plots provide a global diagnostic of p value calibration

Continue to → Lesson 07: Interpreting GWAS Results