Q&A 7 How do you filter SNPs and samples based on missing data and minor allele frequency?

7.1 Explanation

Before running GWAS, it’s important to apply basic quality control (QC) to the genotype matrix. This ensures that:

  • SNPs with too many missing genotypes are excluded
  • SNPs with very low variability (low minor allele frequency) are removed
  • Samples with excessive missing data (optional) are filtered out

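These steps reduce false associations driven by genotyping error and by unstable estimates at rare variants, and they improve statistical power by cutting the number of uninformative tests.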
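As a rough illustration of the two SNP-level metrics, the sketch below computes the missing rate and the MAF for a single made-up SNP. The vector `snp` and its values are hypothetical, and genotypes are assumed to be coded as 0/1/2 copies of the counted allele.

```r
# Hypothetical genotypes for one SNP across 12 samples (0/1/2 coding, NA = missing)
snp <- c(0, 0, 1, 0, 2, NA, 0, 1, 0, 0, 1, 0)

# Missing rate: fraction of samples with no genotype call
missing_rate <- mean(is.na(snp))

# Frequency of the counted allele among called genotypes
# (each sample carries two alleles, hence the division by 2)
p <- mean(snp, na.rm = TRUE) / 2

# Minor allele frequency is the smaller of p and 1 - p
maf <- min(p, 1 - p)

cat("Missing rate:", round(missing_rate, 3), "\n")
cat("MAF:", round(maf, 3), "\n")
```

Here 1 of 12 calls is missing (about 8%) and the MAF is 5/22 (about 0.23), so this SNP would pass a 10% missingness cutoff and a MAF >= 0.05 cutoff.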

7.2 R Code

# Load required libraries
library(tidyverse)

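# Step 1: Drop the sample ID columns (FID, IID) so only genotype counts remain
# Assumes `genotype_count` is already in memory, with FID and IID in
# columns 1-2 followed by one allele-count column per SNP
count_only <- genotype_count[, -c(1, 2)]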

# Step 2: Filter SNPs by missingness (e.g., keep SNPs with <10% missing values)
snp_missing <- colMeans(is.na(count_only))
snp_keep <- names(snp_missing[snp_missing < 0.1])
filtered_count <- count_only[, snp_keep, drop = FALSE]

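# Step 3: Filter SNPs by minor allele frequency (MAF >= 0.05)
# Genotypes are assumed to be coded as 0/1/2 copies of the counted allele,
# so mean / 2 is that allele's frequency; MAF is the smaller of p and 1 - p
calc_maf <- function(x) {
  p <- mean(x, na.rm = TRUE) / 2
  min(p, 1 - p)
}
snp_maf <- map_dbl(filtered_count, calc_maf)
maf_keep <- names(snp_maf[snp_maf >= 0.05])
final_count <- filtered_count[, maf_keep, drop = FALSE]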

# Step 4: Reattach FID and IID
filtered_geno <- bind_cols(genotype_count[, 1:2], final_count)

# Step 5: Summary of filtering
cat("Original SNPs:", ncol(count_only), "\n")
#> Original SNPs: 36901
cat("After missing filter:", length(snp_keep), "\n")
#> After missing filter: 31443
cat("After MAF filter:", length(maf_keep), "\n")
#> After MAF filter: 3755
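
The code above filters SNPs only. For the optional sample-level filter mentioned in 7.1, a minimal sketch on toy data might look like the following. The data frame `toy_geno`, its values, and the 10% threshold are all illustrative assumptions, not part of the original analysis.

```r
# Hypothetical toy data: FID, IID, then 0/1/2 genotype counts for 4 SNPs
toy_geno <- data.frame(
  FID  = c("F1", "F2", "F3", "F4"),
  IID  = c("S1", "S2", "S3", "S4"),
  snp1 = c(0, 1, NA, 2),
  snp2 = c(1, 1, NA, 0),
  snp3 = c(0, 0, 1, 1),
  snp4 = c(2, 1, NA, 0)
)

# Per-sample missing rate: fraction of SNPs with no call in each row
sample_missing <- rowMeans(is.na(toy_geno[, -c(1, 2)]))

# Keep samples with < 10% missing genotypes (threshold is an assumption)
sample_keep <- sample_missing < 0.1
toy_filtered <- toy_geno[sample_keep, , drop = FALSE]

cat("Original samples:", nrow(toy_geno), "\n")
cat("After sample missing filter:", nrow(toy_filtered), "\n")
```

In this toy example, sample S3 is missing 3 of 4 genotypes and is dropped. On real data the same idea would apply `rowMeans(is.na(...))` to the SNP columns of the genotype table; whether to filter samples before or after SNPs depends on the dataset.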

✅ Takeaway: Apply SNP-level filters for missing data and low MAF to improve data quality. This ensures that only informative and reliable markers are used in your GWAS analysis.