Q&A 12 How do you perform a genome-wide SNP scan to generate GWAS results?

12.1 Explanation

Once you’ve merged your phenotype, PCA covariates, and genotype matrix into a single data frame (gwas_input), you can perform a genome-wide association scan.

This involves fitting a linear model for each SNP, adjusting for population structure (e.g., PCs), and extracting the effect size and p-value for each SNP. These results are saved in a table for downstream visualization using Manhattan or QQ plots.

Each model looks like:

Plant height ~ PC1 + PC2 + PC3 + SNP_i

We loop over all SNP columns in the dataset that match known prefixes (id, ud, wd, dd, fd) and collect the outputs.

12.2 R Code

# Load required package
library(tidyverse)

# Step 1: Identify SNP columns using prefix pattern
snp_cols <- grep("^(id|ud|wd|dd|fd)[0-9]+$", names(gwas_input), value = TRUE)

# Step 2: Check number of SNPs selected
length(snp_cols)      # Should return 3755

[1] 3755

head(snp_cols, 5)     # Preview first 5 SNPs

[1] "id1000007" "id1000051" "id1000080" "id1000091" "id1000093"

# Step 3: Initialize list to collect GWAS results
gwas_results <- list()

# Step 4: Loop over each SNP and fit linear model
for (snp in snp_cols) {
  # Construct formula dynamically
  formula <- as.formula(paste("`Plant height` ~ PC1 + PC2 + PC3 +", snp))
  
  # Fit model safely
  model <- tryCatch(
    lm(formula, data = gwas_input),
    error = function(e) NULL
  )
  
  # Step 5: If successful, extract coefficient and p-value
  if (!is.null(model)) {
    coef_table <- summary(model)$coefficients
    snp_row <- tail(rownames(coef_table), 1)
    
    gwas_results[[snp]] <- tibble(
      SNP = snp,
      Estimate = coef_table[snp_row, "Estimate"],
      P_value = coef_table[snp_row, "Pr(>|t|)"]
    )
  }
}

# Step 6: Combine results and sort by p-value
gwas_df <- bind_rows(gwas_results) %>%
  arrange(P_value)

# Step 7: Save the GWAS results for visualization
write_csv(gwas_df, "data/gwas_results.csv")

# Step 8: Preview top hits
head(gwas_df, 10)

# A tibble: 10 × 3
   SNP        Estimate  P_value
   <chr>         <dbl>    <dbl>
 1 id1024159     -16.6 7.55e-19
 2 ud10000717     19.6 2.02e-14
 3 id10003000     17.7 1.10e-13
 4 id6000791      11.9 1.40e-12
 5 id3010870     -12.2 1.50e-12
 6 id2014431      19.3 3.05e-12
 7 id1024488      11.5 3.51e-12
 8 id11001027     14.5 3.53e-12
 9 id1020565     -14.8 5.70e-12
10 id2001618      14.3 1.04e-11

✅ Takeaway: This loop scans 3,755 SNPs genome-wide and outputs a clean summary table with effect sizes and p-values — ready for visualization using Manhattan and QQ plots.