Before fitting any models, confirm the analysis table contains what you think it contains.
str(df)
'data.frame': 200 obs. of 6 variables:
$ sample_id: chr "sample-001" "sample-002" "sample-003" "sample-004" ...
$ trait : num 1.48 2.49 1.68 1.96 2.23 ...
$ age : int 38 42 64 46 47 66 51 30 37 40 ...
$ sex : chr "F" "F" "F" "F" ...
$ PC1 : num -0.715 -0.753 -0.939 -1.053 -0.437 ...
$ PC2 : num -0.9107 -1.2466 0.2583 -0.0307 -1.4696 ...
head(df)
sample_id trait age sex PC1 PC2
1 sample-001 1.484454 38 F -0.7152422 -0.91065959
2 sample-002 2.485771 42 F -0.7526890 -1.24657225
3 sample-003 1.676582 64 F -0.9385387 0.25830482
4 sample-004 1.963092 46 F -1.0525133 -0.03065893
5 sample-005 2.226643 47 F -0.4371595 -1.46962895
6 sample-006 2.206656 66 F 0.3311792 0.12258954
summary(df$trait)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.3067 1.3695 2.0493 2.1273 2.8598 5.3643
You should see the core fields used throughout this guide:
sample_id
trait
age
sex
PC1
PC2
If these are missing or mis-typed, association testing may still run, but interpretation becomes unreliable.
Align Samples
Genotype data uses row names as sample IDs. We must ensure phenotype rows match genotype rows.
common_ids <-intersect(df$sample_id, rownames(geno))df <- df[df$sample_id %in% common_ids, ]geno <- geno[common_ids, , drop =FALSE]# Match order exactlydf <- df[match(common_ids, df$sample_id), ]stopifnot(all(df$sample_id ==rownames(geno)))dim(df)
[1] 200 6
dim(geno)
[1] 200 2000
Minimal Adjustment Model
In Lesson 2, we motivated adjustment for age, sex, and principal components.
In this demo, we use:
Trait ~ age + sex + PC1 + PC2
Association testing adds one SNP at a time to this baseline model.
Example: One SNP Test
To make the process concrete, test a single SNP and interpret the output.
snp_name <-colnames(geno)[1]x <- geno[, snp_name]# Mean impute for this SNP (simple demo approach)if (any(is.na(x))){ x[is.na(x)] <-mean(x, na.rm =TRUE)}fit <-lm(trait ~ x + age + sex + PC1 + PC2, data = df)summary(fit)$coefficients["x", ]
Estimate Std. Error t value Pr(>|t|)
0.29666525 0.15447184 1.92051353 0.05625948
Interpretation
The coefficient for x is the estimated trait change per allele.
The p-value tests whether that coefficient differs from zero.
This is an association test. It is not yet a biological claim.
Scaling Up: Per-SNP Regression
GWAS repeats the same model across many SNPs.
We will create a small function that returns:
beta
standard error
p-value
test_snp_lm <-function(x, df){if (any(is.na(x))){ x[is.na(x)] <-mean(x, na.rm =TRUE) } fit <-lm(trait ~ x + age + sex + PC1 + PC2, data = df) co <-summary(fit)$coefficientsc(beta =unname(co["x", "Estimate"]),se =unname(co["x", "Std. Error"]),p =unname(co["x", "Pr(>|t|)"]) )}