Q&A 10 How do you include PCA covariates in a GWAS model?
10.1 Explanation
To reduce false positives caused by hidden genetic structure, itβs standard in GWAS to include the top principal components (PCs) as covariates in the model. These PCs come from PCA applied to the genotype matrix and must be matched back to each sample using the same identifiers (FID, IID).
This Q&A walks through the process of combining phenotype, PCA, and genotype data, then fitting a single SNP-trait association model.
10.2 R Code
# Load libraries
library(tidyverse)
# Step 1: Load and align phenotype data with sample metadata (by FID)
fam_data <- read_table("data/sativas413.fam",
col_names = c("FID", "IID", "PID", "MID", "sex", "phenotype"),
show_col_types = FALSE)
phenotype_data <- read_tsv("data/sativas413_phenotypes.txt", show_col_types = FALSE) %>%
rename(FID = 1) # Sample IDs in phenotype file match fam_data$FID
sample_metadata <- left_join(fam_data, phenotype_data, by = "FID")
# Step 2: Rename PCA columns to include proper IDs
pca_df <- pca_df %>%
rename(FID = 1, IID = 2)
# Step 3: Merge PCA data into sample metadata
sample_data <- left_join(sample_metadata, pca_df, by = c("FID", "IID"))
# Step 4: Standardize ID columns in genotype data
geno_imputed <- geno_imputed %>%
rename(FID = 1, IID = 2)
# Step 5: Merge genotype with metadata
geno_data <- geno_imputed[, -1] # Drop FID, keep IID and SNPs
gwas_input <- left_join(sample_data, geno_data, by = "IID")
# Step 6: Select trait and covariates
trait <- "Plant height" # Use the column name as a string
covariates <- c("PC1", "PC2", "PC3")
# Save the merged GWAS input to an RDS file for future use
saveRDS(gwas_input, file = "data/gwas_input.rds")
# Step 7: Construct the model formula
snp_name <- names(geno_data)[2] # Replace with desired SNP
formula_str <- paste0("`", trait, "` ~ ", snp_name, " + ", paste(covariates, collapse = " + "))
model <- lm(as.formula(formula_str), data = gwas_input)
# Step 8: View model summary
summary(model)
Call:
lm(formula = as.formula(formula_str), data = gwas_input)
Residuals:
Min 1Q Median 3Q Max
-54.024 -12.429 -0.346 11.594 76.555
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 115.83448 1.04707 110.627 < 2e-16 ***
id1000007 2.17767 1.54100 1.413 0.158
PC1 0.20502 0.02761 7.426 7.49e-13 ***
PC2 -0.19534 0.04436 -4.404 1.39e-05 ***
PC3 -0.29738 0.07399 -4.019 7.05e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.63 on 378 degrees of freedom
(30 observations deleted due to missingness)
Multiple R-squared: 0.2277, Adjusted R-squared: 0.2195
F-statistic: 27.86 on 4 and 378 DF, p-value: < 2.2e-16
β Takeaway: Always ensure PCA scores include correctly labeled
FIDandIIDso they can be merged with metadata and genotype matrices before fitting GWAS models.