Q&A 6 How do you recode allele strings into numeric count format for GWAS?

6.1 Explanation

After tidying, each SNP column of the .ped genotype matrix holds allele strings such as "A A", "A G", or "G G". Most GWAS tools, however, require those genotypes to be numeric:

Genotype count format:
- 0 = Homozygous for major allele (no copies of the minor allele)
- 1 = Heterozygous (one copy of the minor allele)
- 2 = Homozygous for minor allele (two copies)
- NA = Missing or uncalled genotype

To do this, we:

- Identify the two alleles observed at each SNP
- Determine the minor allele (the less frequent of the two)
- Count how many copies of the minor allele each individual has (0, 1, or 2)
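
The three steps above can be traced by hand on a single SNP. The following is a minimal base-R sketch on made-up genotypes; the vector `genotypes` and the helper `count_minor` are hypothetical names used only for illustration, and "0 0" is assumed to mark a missing call.

```r
# Toy genotype calls for one SNP across six samples ("0 0" = missing call)
genotypes <- c("A A", "A G", "G G", "A A", "0 0", "A A")

# Step 1: identify the alleles observed, ignoring missing calls
called <- genotypes[genotypes != "0 0"]
alleles <- unlist(strsplit(called, " "))
allele_counts <- table(alleles)   # A appears 7 times, G appears 3 times

# Step 2: the minor allele is the less frequent one
minor_allele <- names(which.min(allele_counts))   # "G"

# Step 3: count copies of the minor allele in each sample's genotype
count_minor <- function(gt) {
  if (gt == "0 0") return(NA_integer_)
  sum(strsplit(gt, " ")[[1]] == minor_allele)
}
minor_counts <- vapply(genotypes, count_minor, integer(1), USE.NAMES = FALSE)
print(minor_counts)   # 0 1 2 0 NA 0
```

Note that the minor allele is defined relative to the samples in hand, so the same SNP can be coded differently in another dataset.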

✅ Clarifying the structure of the genotype matrix:

Each pair of alleles (like "A G", "G G", "T C") represents a genotype for a single SNP in a single individual.

So when you load the .ped file and separate it into allele pairs:

- Each pair = one genotype
- Each column = one SNP
- Each row = one sample
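
As a rough illustration of that layout, the sketch below pairs up consecutive alleles for one sample. It assumes the allele fields have already been separated from the leading ID columns of the .ped row; `allele_fields` and the SNP names are hypothetical.

```r
# Hypothetical allele fields for one sample: two alleles per SNP, in SNP order
allele_fields <- c("A", "A", "A", "G", "T", "C")

# Pair consecutive alleles: odd positions hold allele 1, even positions allele 2
first  <- allele_fields[seq(1, length(allele_fields), by = 2)]
second <- allele_fields[seq(2, length(allele_fields), by = 2)]

# One genotype string per SNP; this vector is one row of the genotype matrix
genotype_row <- paste(first, second)
names(genotype_row) <- c("snp1", "snp2", "snp3")   # hypothetical SNP names
print(genotype_row)
```

Six allele fields become three genotypes, one per SNP column.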

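🧠 Keeping genotypes (allele pairs) distinct from individual alleles matters when converting to numeric formats for GWAS: allele frequencies are tallied over single alleles, but the 0/1/2 count is assigned per genotype.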

6.2 R Code

# Load libraries
library(tidyverse)

# Step 1: Drop FID and IID from genotype_tidy to isolate genotype columns
geno_alleles <- genotype_tidy[, -c(1, 2)]

# Step 2: Convert allele strings to numeric minor allele counts
geno_minor_allele_count <- map_dfc(geno_alleles, function(allele_vec) {
  # Split all genotype strings (e.g., "A G") into individual alleles
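  alleles <- unlist(str_split(allele_vec, " "))

  # Drop missing calls (NA, or the "0" allele code used for missing data in
  # .ped files) so they are never counted or picked as the minor allele
  alleles <- alleles[!is.na(alleles) & alleles != "0"]
  allele_counts <- table(alleles)

  # Skip SNPs that are monomorphic or malformed
  if (length(allele_counts) < 2) return(rep(NA_integer_, length(allele_vec)))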

  # Identify the minor allele (less frequent)
  minor_allele <- names(sort(allele_counts))[1]

  # Count how many copies of the minor allele are in each genotype
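  # vapply() guarantees an unnamed integer vector, even when every value is NA
  vapply(allele_vec, function(gt) {
    if (gt %in% c("0 0", "0 1", "1 0", "1 1", "0", "1")) return(NA_integer_)  # filter malformed
    split_alleles <- unlist(str_split(gt, " "))
    if (length(split_alleles) != 2) return(NA_integer_)
    sum(split_alleles == minor_allele)
  }, integer(1), USE.NAMES = FALSE)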
})

# Step 3: Add back sample identifiers
genotype_count <- bind_cols(genotype_tidy[, 1:2], geno_minor_allele_count)

# Step 4: Preview the cleaned matrix
glimpse(genotype_count[, 1:5])

Rows: 413
Columns: 5
$ X1        <chr> "081215-A05", "081215-A06", "081215-A07", "081215-A08", "090…
$ X2        <dbl> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 20, …
$ id1000001 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ id1000003 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ id1000005 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
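
Before moving on, it can help to sanity-check the recoded matrix. The sketch below runs on a small made-up table, `toy_count`, standing in for `genotype_count`; its values are invented for illustration and are not results from the dataset above. It checks that every entry is 0, 1, 2, or NA, then computes the minor allele frequency (mean count divided by 2) and the missing-call rate per SNP.

```r
# Toy stand-in for genotype_count: two ID columns plus three SNP columns
toy_count <- data.frame(
  X1   = c("s1", "s2", "s3", "s4"),
  X2   = 1:4,
  snp1 = c(0L, 1L, 0L, NA),
  snp2 = c(0L, 0L, 2L, 1L),
  snp3 = c(NA_integer_, NA, NA, NA),   # e.g. a monomorphic SNP recoded as all NA
  stringsAsFactors = FALSE
)
snp_cols <- toy_count[, -c(1, 2)]

# Every entry should be 0, 1, 2, or NA
all_valid <- all(unlist(snp_cols) %in% c(0, 1, 2, NA))

# Minor allele frequency per SNP: mean count / 2 over called genotypes
# (NaN for a SNP with no called genotypes)
maf <- vapply(snp_cols, function(x) mean(x, na.rm = TRUE) / 2, numeric(1))

# If the minor allele was identified correctly, no frequency exceeds 0.5
maf_ok <- all(maf <= 0.5, na.rm = TRUE)

# Fraction of missing calls per SNP
missing_rate <- colMeans(is.na(snp_cols))

print(all_valid)
print(maf)
print(missing_rate)
```

A frequency above 0.5 would suggest the wrong allele was picked as minor, for example because missing codes were counted as alleles.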

✅ Takeaway: Recoding genotype strings into numeric dosages (0, 1, 2) is essential for statistical GWAS models. It standardizes input and prepares your data for PCA, association testing, or genomic prediction.