Lesson 3: Genotype QC and Population Structure

ID: GWAS-L03
Type: Conceptual + Implementation
Audience: Public
Theme: Variant filtering, sample QC, and structure diagnostics

Why Genotype QC Matters

Association testing assumes:

Genotypes are measured reliably
Variants are polymorphic
Missingness is not systematic
Population structure is properly characterized

If these assumptions fail, false positives can arise before modeling even begins.

Genotype QC is not a preprocessing formality. It is statistical risk control.

Load Genotype Matrix

For demonstration, we use a simulated genotype matrix where:

Rows = samples
Columns = SNPs
Values = 0, 1, 2 (allele dosage)

geno <- read.csv("data/demo-genotypes.csv", row.names = 1)
dim(geno)

[1]  200 2000

geno[1:5, 1:6]

           rs1 rs2 rs3 rs4 rs5 rs6
sample-001   0   0   0   0   1   2
sample-002   0   0   0   1   0   1
sample-003   0   0   0   1   0   1
sample-004   1   0   0   0   0   1
sample-005   0   0   0   0   0   0

Missingness Per SNP

Variants with excessive missing data can bias association tests.

snp_missing <- colMeans(is.na(geno))
summary(snp_missing)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0250  0.0350  0.0379  0.0500  0.1350

A typical threshold in practice might be 2–5% missingness. For this demo, we filter SNPs with >5% missingness.

geno_filtered <- geno[, snp_missing <= 0.05]
dim(geno_filtered)

[1]  200 1578

Minor Allele Frequency (MAF)

Very rare variants can:

Inflate variance estimates
Produce unstable effect sizes
Generate extreme p-values due to sparse counts

Compute allele frequency and MAF.

allele_freq <- colMeans(geno_filtered, na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)

summary(maf)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0285  0.1601  0.2784  0.2764  0.3952  0.5000

Filter SNPs with MAF < 0.05.

geno_filtered <- geno_filtered[, maf >= 0.05]
dim(geno_filtered)

[1]  200 1559

Interpretation:

Removing very rare variants increases model stability in small datasets. In large biobank-scale GWAS, rare variant strategies differ and require specialized modeling.

Sample-Level Missingness

Samples with excessive missing genotypes can distort association signals.

sample_missing <- rowMeans(is.na(geno_filtered))
summary(sample_missing)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003207 0.008980 0.023412 0.029548 0.042335 0.124439

Filter samples with >5% missingness.

keep_samples <- sample_missing <= 0.05
geno_filtered <- geno_filtered[keep_samples, ]
dim(geno_filtered)

[1]  162 1559

Principal Component Analysis (PCA)

Population structure is estimated from genotype variation.

Before PCA, we use simple mean imputation to avoid dropping large parts of the genotype matrix due to missingness. This is a pragmatic step for structure estimation, not a claim about true genotypes.

geno_mat <- as.matrix(geno_filtered)

for (j in seq_len(ncol(geno_mat))){
  idx <- is.na(geno_mat[, j])
  if (any(idx)){
    mu <- mean(geno_mat[, j], na.rm = TRUE)
    geno_mat[idx, j] <- mu
  }
}

stopifnot(!anyNA(geno_mat))

For teaching stability and speed, we run PCA on a subset of SNPs.

set.seed(1)
m_use <- min(1000, ncol(geno_mat))
keep_snps <- sample(seq_len(ncol(geno_mat)), size = m_use)
geno_mat_pca <- geno_mat[, keep_snps, drop = FALSE]
dim(geno_mat_pca)

[1]  162 1000

Scaling can fail if any SNP has zero variance (all values identical). We drop such SNPs before PCA.

sds <- apply(geno_mat_pca, 2, stats::sd)

keep_var <- is.finite(sds) & sds > 0
geno_mat_pca <- geno_mat_pca[, keep_var, drop = FALSE]

dim(geno_mat_pca)

[1]  162 1000

geno_scaled <- scale(geno_mat_pca)

stopifnot(!anyNA(geno_scaled))
stopifnot(all(is.finite(geno_scaled)))

pca <- prcomp(geno_scaled, center = FALSE, scale. = FALSE)

summary(pca)$importance[2, 1:5]

    PC1     PC2     PC3     PC4     PC5 
0.01193 0.01147 0.01133 0.01127 0.01114

Visualize PC1 vs PC2

source("scripts/R/cdi-plot-theme.R")
library(ggplot2)

pca_df <- data.frame(
  PC1 = pca$x[, 1],
  PC2 = pca$x[, 2]
)

pal <- cdi_palette()

ggplot(pca_df, ggplot2::aes(x = PC1, y = PC2)) +
  ggplot2::geom_point(
    color = pal$teal,
    alpha = 0.85,
    size = 2.4
  ) +
  ggplot2::labs(
    title = "Population structure (PCA)",
    subtitle = "Principal components capture ancestry-related variation",
    x = "PC1",
    y = "PC2"
  ) +
  cdi_theme()

Interpretation

Principal components summarize correlated genotype variation.

Clusters or gradients suggest:

Ancestry differences
Batch effects
Subpopulation structure

Ignoring population structure can produce systematic association inflation.

What QC Achieved

After QC, we have:

Removed poorly genotyped variants
Excluded unstable rare variants
Filtered low-quality samples
Estimated ancestry structure

Association testing now operates on a more stable genotype matrix.