Q&A 18 How do you identify genome-wide significant SNP hits and save them for downstream analysis?

18.1 Explanation

Once GWAS results are generated and corrected for multiple testing, the next step is to identify statistically significant SNPs. A common threshold is the Bonferroni-adjusted p-value, which accounts for the number of tests.

You can:

  • Filter SNPs below the Bonferroni threshold
  • Annotate them with chromosome and base-pair position using the .map file
  • Save the output for use in downstream steps like gene annotation or reporting

18.2 R Code

# Load required libraries
library(tidyverse)

# Step 1: Load GWAS results
gwas_df <- read_csv("data/gwas_results.csv")

# Step 2: Calculate Bonferroni threshold
n_tests <- nrow(gwas_df)
bonf_threshold <- 0.05 / n_tests  # ≈ 1.33e-05 for 3755 SNPs

# Step 3: Filter significant hits
significant_hits <- gwas_df %>%
  filter(P_value < bonf_threshold) %>%
  arrange(P_value)

# Step 4: Load SNP position data
map_df <- read_tsv("data/sativas413.map",
                   col_names = c("CHR", "SNP", "GEN_DIST", "BP"),
                   show_col_types = FALSE)

# Step 5: Annotate significant SNPs with position
annotated_hits <- left_join(significant_hits, map_df, by = "SNP") %>%
  select(SNP, CHR, BP, Estimate, P_value)

# Step 6: Save for downstream analysis
write_csv(annotated_hits, "data/significant_snps_bonferroni.csv")

# Step 7: Preview top hits
head(annotated_hits, 10)
# A tibble: 10 × 5
   SNP          CHR       BP Estimate  P_value
   <chr>      <dbl>    <dbl>    <dbl>    <dbl>
 1 id1024159      1 38111539    -16.6 7.55e-19
 2 ud10000717    10 11129237     19.6 2.02e-14
 3 id10003000    10 11508505     17.7 1.10e-13
 4 id6000791      6   905567     11.9 1.40e-12
 5 id3010870      3 24492231    -12.2 1.50e-12
 6 id2014431      2 32420895     19.3 3.05e-12
 7 id1024488      1 38618456     11.5 3.51e-12
 8 id11001027    11  3296722     14.5 3.53e-12
 9 id1020565      1 33058921    -14.8 5.70e-12
10 id2001618      2  2873552     14.3 1.04e-11

✅ Takeaway: Identifying and saving genome-wide significant SNPs ensures a clean input for downstream analysis such as gene annotation, pathway mapping, or publication reporting.