Q&A 12 How do you perform a genome-wide SNP scan to generate GWAS results?
12.1 Explanation
Once you’ve merged your phenotype, PCA covariates, and genotype matrix into a single data frame (gwas_input), you can perform a genome-wide association scan.
This involves fitting a linear model for each SNP, adjusting for population structure (e.g., PCs), and extracting the effect size and p-value for each SNP. These results are saved in a table for downstream visualization using Manhattan or QQ plots.
Each model looks like:
Plant height ~ PC1 + PC2 + PC3 + SNP_i
We loop over all SNP columns in the dataset that match known prefixes (id, ud, wd, dd, fd) and collect the outputs.
12.2 R Code
# Load required package
library(tidyverse)
# Step 1: Identify SNP columns using prefix pattern
snp_cols <- grep("^(id|ud|wd|dd|fd)[0-9]+$", names(gwas_input), value = TRUE)
# Step 2: Check number of SNPs selected
length(snp_cols) # Should return 3755[1] 3755
[1] "id1000007" "id1000051" "id1000080" "id1000091" "id1000093"
# Step 3: Initialize list to collect GWAS results
gwas_results <- list()
# Step 4: Loop over each SNP and fit linear model
for (snp in snp_cols) {
# Construct formula dynamically
formula <- as.formula(paste("`Plant height` ~ PC1 + PC2 + PC3 +", snp))
# Fit model safely
model <- tryCatch(
lm(formula, data = gwas_input),
error = function(e) NULL
)
# Step 5: If successful, extract coefficient and p-value
if (!is.null(model)) {
coef_table <- summary(model)$coefficients
snp_row <- tail(rownames(coef_table), 1)
gwas_results[[snp]] <- tibble(
SNP = snp,
Estimate = coef_table[snp_row, "Estimate"],
P_value = coef_table[snp_row, "Pr(>|t|)"]
)
}
}
# Step 6: Combine results and sort by p-value
gwas_df <- bind_rows(gwas_results) %>%
arrange(P_value)
# Step 7: Save the GWAS results for visualization
write_csv(gwas_df, "data/gwas_results.csv")
# Step 8: Preview top hits
head(gwas_df, 10)# A tibble: 10 × 3
SNP Estimate P_value
<chr> <dbl> <dbl>
1 id1024159 -16.6 7.55e-19
2 ud10000717 19.6 2.02e-14
3 id10003000 17.7 1.10e-13
4 id6000791 11.9 1.40e-12
5 id3010870 -12.2 1.50e-12
6 id2014431 19.3 3.05e-12
7 id1024488 11.5 3.51e-12
8 id11001027 14.5 3.53e-12
9 id1020565 -14.8 5.70e-12
10 id2001618 14.3 1.04e-11
✅ Takeaway: This loop scans 3,755 SNPs genome-wide and outputs a clean summary table with effect sizes and p-values — ready for visualization using Manhattan and QQ plots.