Q&A 8 How do you impute missing genotype values before GWAS analysis?

8.1 Explanation

Many GWAS and population structure methods (like PCA or kinship matrix computation) require complete genotype matrices. If you have filtered for missingness but still have a few NA values, a simple approach is to impute missing genotypes using the mean dosage for each SNP.

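This is fast, deterministic, and usually good enough for visualization and many linear models when only a small fraction of calls is missing. Imputed entries are fractional mean dosages rather than discrete 0/1/2 calls.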
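
A toy example makes the arithmetic concrete. The sketch below uses a small made-up table (toy_geno, with hypothetical columns snp_a and snp_b) rather than real data, and applies the same column-mean rule to it.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical): 4 samples, 2 SNPs coded as allele dosages
toy_geno <- tibble(
  snp_a = c(0L, 2L, NA, 2L),  # one missing call
  snp_b = c(0L, 0L, 2L, 2L)   # complete
)

# Replace each NA with the mean of the observed dosages in that column
toy_imputed <- toy_geno %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

# The NA in snp_a becomes mean(c(0, 2, 2)) = 4/3;
# snp_b has no missing values and is left unchanged
toy_imputed
```

Only the column that actually contained an NA changes type (integer to double), because the imputed mean is generally not a whole number.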

8.2 R Code

# Load library
library(tidyverse)

# Step 1: Extract the dosage columns (drop the first two ID columns, FID and IID)
dosage_matrix <- filtered_geno[, -c(1, 2)]

# Step 2: Replace each NA with the mean of the observed dosages for that SNP
# (columns that receive an imputed value change from integer to double)
imputed_matrix <- dosage_matrix %>%
  mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

# Step 3: Add back FID and IID
geno_imputed <- bind_cols(filtered_geno[, 1:2], imputed_matrix)

# Step 4: Preview
head(geno_imputed[, 1:5])

# A tibble: 6 × 5
  X1            X2 id1000007 id1000051 id1000080
  <chr>      <dbl>     <int>     <int>     <int>
1 081215-A05     1         0         0         0
2 081215-A06     3         2         0         2
3 081215-A07     4         2         0         2
4 081215-A08     5         0         0         2
5 090414-A09     6         2         0         2
6 090414-A10     7         0         0         0

glimpse(geno_imputed[1:5, 1:10])

Rows: 5
Columns: 10
$ X1        <chr> "081215-A05", "081215-A06", "081215-A07", "081215-A08", "090…
$ X2        <dbl> 1, 3, 4, 5, 6
$ id1000007 <int> 0, 2, 2, 0, 2
$ id1000051 <int> 0, 0, 0, 0, 0
$ id1000080 <int> 0, 2, 2, 2, 2
$ id1000091 <int> 0, 2, 2, 0, 2
$ id1000093 <int> 0, 2, 2, 2, 2
$ id1000115 <int> 0, 0, 0, 2, 0
$ ud1000033 <dbl> 0, 0, 0, 0, 0
$ id1000264 <int> 0, 2, 0, 0, 0
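
Before passing the table downstream, it may be worth confirming that nothing is still missing. One edge case: if every call for a SNP is NA, mean(., na.rm = TRUE) returns NaN, so that column stays unusable. The sketch below shows one possible check on a made-up stand-in table (geno_check and its snp_* columns are hypothetical, not the real data).

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for an imputed table: 2 ID columns + 3 SNP columns
geno_check <- tibble(
  X1 = c("s1", "s2", "s3"),
  X2 = c(1, 2, 3),
  snp_a = c(0, 4/3, 2),     # column that received an imputed value (double)
  snp_b = c(0L, 2L, 2L),    # untouched column (integer)
  snp_c = c(NaN, NaN, NaN)  # every call was missing, so the mean was NaN
)

# Keep only the SNP columns for the check
snp_cols <- geno_check[, -c(1, 2)]

# Count remaining NA/NaN values per SNP (is.na() is TRUE for NaN as well)
n_missing <- colSums(is.na(snp_cols))

# SNPs that could not be imputed
bad_snps <- names(n_missing)[n_missing > 0]

# Drop them before any analysis that needs a complete matrix
geno_clean <- geno_check %>% select(-all_of(bad_snps))
```

With a dataset that has already been filtered for missingness, bad_snps would normally be empty; the check is cheap insurance rather than a required step.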

✅ Takeaway: Simple mean imputation fills missing genotype values efficiently. It’s suitable for PCA, kinship, and linear models when high accuracy isn’t critical or when advanced imputation isn’t available.
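
As an illustration of the downstream use, here is a minimal sketch of running PCA on an imputed dosage matrix with base R's prcomp(). The matrix dosage is made up for the example, and dropping zero-variance SNPs first is an assumption about preprocessing, not a step taken from the workflow above.

```r
# Hypothetical imputed dosage matrix: 5 samples x 4 SNPs (filled column by column)
dosage <- matrix(
  c(0, 2, 2, 0, 2,     # snp1
    0, 0, 0, 0, 0,     # snp2: monomorphic, carries no information
    0, 2, 4/3, 2, 2,   # snp3: contains one mean-imputed value
    2, 0, 0, 2, 0),    # snp4
  nrow = 5,
  dimnames = list(paste0("sample", 1:5), paste0("snp", 1:4))
)

# Drop SNPs with zero variance across samples
snp_var <- apply(dosage, 2, var)
dosage_poly <- dosage[, snp_var > 0, drop = FALSE]

# PCA on mean-centered dosages; prcomp() requires a matrix with no NA values
pca <- prcomp(dosage_poly, center = TRUE, scale. = FALSE)

# Sample coordinates on the first two principal components
pca$x[, 1:2]
```

For a table like geno_imputed, the dosage columns would first need converting with as.matrix() after removing the two ID columns.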