Q&A 6 How do you recode allele strings into numeric count format for GWAS?
6.1 Explanation
After the .ped genotype matrix has been tidied so that each SNP column holds values like "A A", "A G", or "G G", the genotypes still need to be converted to numbers, because most GWAS tools expect numeric input:
Genotype count format:
- 0 = Homozygous for the major allele
- 1 = Heterozygous
- 2 = Homozygous for the minor allele
- NA = Missing or uncalled genotype
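To make the coding concrete, here is a minimal sketch for one made-up SNP. It assumes "A" is the major allele and "G" the minor allele; `recode_map` and `genotypes` are illustrative names, not objects from this tutorial's data.

```r
# Hypothetical SNP: "A" is the major allele, "G" is the minor allele
recode_map <- c("A A" = 0L, "A G" = 1L, "G A" = 1L, "G G" = 2L)

# Five example genotypes; "0 0" is the usual missing code in .ped files
genotypes <- c("A A", "A G", "G A", "G G", "0 0")

# Look up each genotype; anything not in the map (here "0 0") becomes NA
counts <- unname(recode_map[genotypes])
counts
```

A fixed lookup table like this only works when the minor allele is already known. The next steps show how to work it out from the data.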
To do this, we:
- Identify the two alleles observed at each SNP
- Determine the minor allele (less frequent)
- Count how many copies of the minor allele each individual has (0, 1, or 2)
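The three steps can be sketched on a single toy SNP before applying them to a whole matrix. The vector `snp` is made up for illustration and contains no missing genotypes, so no filtering is needed here.

```r
# Hypothetical genotypes for one SNP across six samples
snp <- c("A A", "A G", "G G", "A A", "A A", "A G")

# Step 1: identify the alleles observed at this SNP
allele_pairs <- strsplit(snp, " ", fixed = TRUE)
allele_counts <- table(unlist(allele_pairs))

# Step 2: the minor allele is the one seen least often
minor_allele <- names(allele_counts)[which.min(allele_counts)]
minor_freq <- min(allele_counts) / sum(allele_counts)

# Step 3: count copies of the minor allele in each sample
copies <- vapply(allele_pairs, function(pair) sum(pair == minor_allele), integer(1))

minor_allele
copies
```

Here "A" appears 8 times and "G" 4 times among the 12 alleles, so "G" is the minor allele and `copies` holds the number of "G" alleles per sample.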
✅ Clarifying the structure of the genotype matrix:
Each pair of alleles (like "A G", "G G", "T C") represents a genotype for a single SNP in a single individual.
So when you load the .ped file and separate it into allele pairs:
- Each pair = one genotype
- Each column = one SNP
- Each row = one sample
🧠 This distinction is important when converting genotype strings to numeric formats for GWAS.
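As a rough picture of that layout, the sketch below builds a tiny stand-in for the tidied matrix. The sample IDs, SNP names, and genotypes are invented; only the shape (two identifier columns followed by one column per SNP) mirrors the structure described above.

```r
# Hypothetical miniature of a tidied genotype table:
# two identifier columns, then one column per SNP, one row per sample
genotype_toy <- data.frame(
  X1   = c("sampleA", "sampleB", "sampleC"),
  X2   = c(1, 2, 3),
  snp1 = c("A A", "A G", "G G"),
  snp2 = c("T T", "T C", "T T"),
  stringsAsFactors = FALSE
)

dim(genotype_toy)          # rows = samples, columns = 2 IDs + 2 SNPs
genotype_toy[2, "snp1"]    # one cell = one genotype (sample 2 at snp1)
```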
6.2 R Code
# Load libraries
library(tidyverse)

# Step 1: Drop FID and IID from genotype_tidy to isolate genotype columns
geno_alleles <- genotype_tidy[, -c(1, 2)]

# Genotype strings treated as missing or malformed
bad_genotypes <- c("0 0", "0 1", "1 0", "1 1", "0", "1")

# Step 2: Convert allele strings to numeric minor allele counts
geno_minor_allele_count <- map_dfc(geno_alleles, function(allele_vec) {
  # Tally alleles from valid genotypes only, so missing codes such as "0"
  # are never mistaken for a real allele
  valid <- allele_vec[!is.na(allele_vec) & !(allele_vec %in% bad_genotypes)]
  alleles <- unlist(str_split(valid, " "))
  allele_counts <- table(alleles)

  # Skip SNPs that are monomorphic or malformed
  if (length(allele_counts) < 2) return(rep(NA_integer_, length(allele_vec)))

  # Identify the minor allele (less frequent)
  minor_allele <- names(sort(allele_counts))[1]

  # Count how many copies of the minor allele are in each genotype
  vapply(allele_vec, function(gt) {
    if (is.na(gt) || gt %in% bad_genotypes) return(NA_integer_)  # missing or malformed
    split_alleles <- unlist(str_split(gt, " "))
    if (length(split_alleles) != 2) return(NA_integer_)
    sum(split_alleles == minor_allele)
  }, integer(1), USE.NAMES = FALSE)
})

# Step 3: Add back sample identifiers
genotype_count <- bind_cols(genotype_tidy[, 1:2], geno_minor_allele_count)

# Step 4: Preview the cleaned matrix
glimpse(genotype_count[, 1:5])
Rows: 413
Columns: 5
$ X1 <chr> "081215-A05", "081215-A06", "081215-A07", "081215-A08", "090…
$ X2 <dbl> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 20, …
$ id1000001 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ id1000003 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ id1000005 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
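A quick sanity check on a recoded matrix can catch problems early. The sketch below runs on a small made-up table (`toy_counts` is hypothetical and only laid out like `genotype_count`). It computes the minor allele frequency per SNP as the mean count divided by 2, plus the fraction of missing calls. Because the counted allele is the less frequent one, the frequency should never exceed 0.5.

```r
# Hypothetical count table: two ID columns, then one count column per SNP
toy_counts <- data.frame(
  X1   = c("s1", "s2", "s3", "s4"),
  X2   = 1:4,
  snp1 = c(0L, 1L, 0L, NA),
  snp2 = c(0L, 0L, 2L, 1L)
)

# Keep only the SNP columns
snp_cols <- toy_counts[, -c(1, 2)]

# Minor allele frequency: each sample carries two alleles, so divide by 2
maf <- colMeans(snp_cols, na.rm = TRUE) / 2

# Fraction of missing genotype calls per SNP
missing_rate <- colMeans(is.na(snp_cols))

maf
missing_rate
all(maf <= 0.5)
```

On the real data, the same three lines could be applied to `genotype_count[, -c(1, 2)]`. SNPs that come back as all NA were skipped as monomorphic or malformed in Step 2.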
✅ Takeaway: Recoding genotype strings into numeric minor allele counts (0, 1, 2, with NA for missing calls) is essential for statistical GWAS models. It standardizes the input and prepares your data for PCA, association testing, or genomic prediction.