Q&A 7 How do you filter SNPs and samples based on missing data and minor allele frequency?
7.1 Explanation
Before running GWAS, it's important to apply basic quality control (QC) to the genotype matrix. This ensures that:
- SNPs with too many missing genotypes are excluded
- SNPs with very low variability (low minor allele frequency) are removed
- Samples with excessive missing data (optional) are filtered out
These steps improve statistical power and reduce false associations.
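To make the two SNP-level metrics concrete, here is a minimal sketch on a toy matrix. The object `toy` and its values are invented for illustration, and the sketch assumes genotypes are coded as 0, 1, or 2 copies of the counted allele, with `NA` for missing calls.

```r
# Toy genotype matrix (hypothetical data): 4 samples x 3 SNPs, coded 0/1/2
toy <- matrix(
  c(0, 1, 2, 1,     # snp1: fully genotyped, both alleles common
    0, 0, 0, 0,     # snp2: monomorphic (no variation)
    NA, NA, 1, 2),  # snp3: half of the calls are missing
  nrow = 4,
  dimnames = list(paste0("s", 1:4), c("snp1", "snp2", "snp3"))
)

# Per-SNP missing rate: fraction of NA calls in each column
snp_missing <- colMeans(is.na(toy))

# Per-SNP allele frequency: mean allele count divided by 2 (diploid),
# then folded so the rarer allele's frequency is reported
p <- colMeans(toy, na.rm = TRUE) / 2
snp_maf <- pmin(p, 1 - p)

# Apply the same thresholds used in this section (<10% missing, MAF >= 0.05)
keep <- snp_missing < 0.1 & snp_maf >= 0.05
print(round(cbind(missing = snp_missing, maf = snp_maf), 2))
print(names(keep)[keep])
```

In this toy example only `snp1` passes both filters: `snp2` has no variation (MAF of 0) and `snp3` is missing in half of the samples.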
7.2 R Code
# Load required libraries
library(tidyverse)
# Step 1: Drop the sample ID columns (FID, IID) so only genotype counts remain
count_only <- genotype_count[, -c(1, 2)]
# Step 2: Filter SNPs by missingness (e.g., keep SNPs with <10% missing values)
snp_missing <- colMeans(is.na(count_only))
snp_keep <- names(snp_missing[snp_missing < 0.1])
filtered_count <- count_only[, snp_keep]
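# Step 3: Filter SNPs by minor allele frequency (MAF >= 0.05)
# Genotypes are allele counts (0, 1, 2), so the mean count divided by 2
# is the frequency of the counted allele; MAF is the smaller of p and 1 - p
calc_maf <- function(x) {
  p <- mean(x, na.rm = TRUE) / 2
  min(p, 1 - p)
}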
snp_maf <- map_dbl(filtered_count, calc_maf)
maf_keep <- names(snp_maf[snp_maf >= 0.05])
final_count <- filtered_count[, maf_keep]
# Step 4: Reattach FID and IID
filtered_geno <- bind_cols(genotype_count[, 1:2], final_count)
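# Step 5: Summary of filtering
cat("Original SNPs:", ncol(count_only), "\n")
cat("After missing filter:", ncol(filtered_count), "\n")
cat("After MAF filter:", ncol(final_count), "\n")
Output:
Original SNPs: 36901
After missing filter: 31443
After MAF filter: 3755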
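The optional sample-level filter mentioned in the explanation is not part of the code above. A minimal sketch of one way to do it follows, assuming a data frame laid out like `filtered_geno` (FID, IID, then SNP columns). The small `geno` data frame, the name `sample_missing_threshold`, and the 10% cutoff are all hypothetical choices for illustration, not values taken from this analysis.

```r
# Hypothetical stand-in for a filtered genotype table: FID, IID, then SNPs (0/1/2)
geno <- data.frame(
  FID  = c("F1", "F2", "F3"),
  IID  = c("I1", "I2", "I3"),
  snp1 = c(0, NA, 2),
  snp2 = c(1, NA, 1),
  snp3 = c(2, 1, 0),
  snp4 = c(0, 0, 1)
)

# Assumed cutoff: drop samples with 10% or more missing genotypes
sample_missing_threshold <- 0.1

# Per-sample missing rate: fraction of NA calls across the SNP columns
snp_cols <- setdiff(names(geno), c("FID", "IID"))
sample_missing <- rowMeans(is.na(geno[, snp_cols, drop = FALSE]))

# Keep only the samples below the threshold
sample_filtered <- geno[sample_missing < sample_missing_threshold, , drop = FALSE]
cat("Samples kept:", nrow(sample_filtered), "of", nrow(geno), "\n")
```

Here sample `I2` is missing two of four genotypes and is dropped. Whether to filter samples before or after the SNP filters is a judgment call, since each step changes the missing rates the other one sees.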
✅ Takeaway: Apply SNP-level filters for missing data and low MAF to improve data quality. This ensures that only informative and reliable markers are used in your GWAS analysis.