Q&A 10 How do you include PCA covariates in a GWAS model?

10.1 Explanation

To reduce false positives caused by hidden genetic structure, it’s standard in GWAS to include the top principal components (PCs) as covariates in the model. These PCs come from PCA applied to the genotype matrix and must be matched back to each sample using the same identifiers (FID, IID).

This Q&A walks through the process of combining phenotype, PCA, and genotype data, then fitting a single SNP-trait association model.

10.2 R Code

# Load libraries
library(tidyverse)

# Step 1: Load and align phenotype data with sample metadata (by FID)
fam_data <- read_table("data/sativas413.fam", 
                       col_names = c("FID", "IID", "PID", "MID", "sex", "phenotype"), 
                       show_col_types = FALSE)

phenotype_data <- read_tsv("data/sativas413_phenotypes.txt", show_col_types = FALSE) %>%
  rename(FID = 1)  # Sample IDs in phenotype file match fam_data$FID

sample_metadata <- left_join(fam_data, phenotype_data, by = "FID")

# Step 2: Rename PCA columns to include proper IDs
pca_df <- pca_df %>%
  rename(FID = 1, IID = 2)

# Step 3: Merge PCA data into sample metadata
sample_data <- left_join(sample_metadata, pca_df, by = c("FID", "IID"))

# Step 4: Standardize ID columns in genotype data
geno_imputed <- geno_imputed %>%
  rename(FID = 1, IID = 2)

# Step 5: Merge genotype with metadata
geno_data <- geno_imputed[, -1]  # Drop FID, keep IID and SNPs
gwas_input <- left_join(sample_data, geno_data, by = "IID")

# Step 6: Select trait and covariates
trait <- "Plant height"  # Use the column name as a string
covariates <- c("PC1", "PC2", "PC3")

# Save the merged GWAS input to an RDS file for future use
saveRDS(gwas_input, file = "data/gwas_input.rds")

# Step 7: Construct the model formula
snp_name <- names(geno_data)[2]  # Replace with desired SNP
formula_str <- paste0("`", trait, "` ~ ", snp_name, " + ", paste(covariates, collapse = " + "))
model <- lm(as.formula(formula_str), data = gwas_input)

# Step 8: View model summary
summary(model)

Call:
lm(formula = as.formula(formula_str), data = gwas_input)

Residuals:
    Min      1Q  Median      3Q     Max 
-54.024 -12.429  -0.346  11.594  76.555 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 115.83448    1.04707 110.627  < 2e-16 ***
id1000007     2.17767    1.54100   1.413    0.158    
PC1           0.20502    0.02761   7.426 7.49e-13 ***
PC2          -0.19534    0.04436  -4.404 1.39e-05 ***
PC3          -0.29738    0.07399  -4.019 7.05e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.63 on 378 degrees of freedom
  (30 observations deleted due to missingness)
Multiple R-squared:  0.2277,    Adjusted R-squared:  0.2195 
F-statistic: 27.86 on 4 and 378 DF,  p-value: < 2.2e-16

βœ… Takeaway: Always ensure PCA scores include correctly labeled FID and IID so they can be merged with metadata and genotype matrices before fitting GWAS models.