Q&A 6 How do you recode allele strings into numeric count format for GWAS?

6.1 Explanation

Once the .ped genotype matrix has been tidied so that each SNP column contains values like "A A", "A G", or "G G", the genotypes still need to be converted to numbers, because most GWAS tools expect numeric input:

Genotype count format:
- 0 = Homozygous for major allele
- 1 = Heterozygous
- 2 = Homozygous for minor allele
- NA = Missing or uncalled genotype

To do this, we:

  1. Identify the two alleles observed at each SNP
  2. Determine the minor allele (less frequent)
  3. Count how many copies of the minor allele each individual has (0, 1, or 2)
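
As a minimal sketch of these three steps on a single SNP, the toy example below uses base R only. The vector `gt` and its values are made up for illustration, and it assumes the usual .ped convention that "0 0" marks a missing genotype.

```r
# Toy genotypes for one SNP across six samples (illustrative values only)
gt <- c("A A", "A G", "G G", "A A", "0 0", "A G")

# Step 1: tally the observed alleles, ignoring the missing code "0"
alleles <- unlist(strsplit(gt, " "))
alleles <- alleles[alleles != "0"]
allele_counts <- table(alleles)

# Step 2: the minor allele is the less frequent of the two
minor_allele <- names(sort(allele_counts))[1]

# Step 3: count copies of the minor allele per sample (NA if the call is missing)
minor_count <- vapply(strsplit(gt, " "), function(a) {
  if (any(a == "0")) return(NA_integer_)
  sum(a == minor_allele)
}, integer(1))

print(allele_counts)
print(minor_allele)
print(minor_count)
```

Here "A" is seen six times and "G" four times, so "G" is the minor allele and each sample is scored by how many "G" alleles it carries.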

Clarifying the structure of the genotype matrix:

Each pair of alleles (like "A G", "G G", "T C") represents a genotype for a single SNP in a single individual.

So when you load the .ped file and separate it into allele pairs:

  • Each pair = one genotype
  • Each column = one SNP
  • Each row = one sample

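🧠 The distinction between an allele and a genotype matters when converting genotype strings to numeric format for GWAS: allele frequencies are tallied over individual alleles, while the 0/1/2 count is assigned per genotype.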
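
To make this layout concrete, here is a small hypothetical stand-in for the tidied matrix. The names `genotype_toy`, `snp1`, and `snp2` and all values are invented for illustration and are not taken from the real dataset.

```r
# Hypothetical tidy genotype matrix: two ID columns followed by one column per SNP
genotype_toy <- data.frame(
  FID  = c("fam1", "fam2", "fam3"),
  IID  = c(1, 2, 3),
  snp1 = c("A A", "A G", "G G"),
  snp2 = c("T C", "T T", "0 0"),
  stringsAsFactors = FALSE
)

n_samples <- nrow(genotype_toy)      # each row is one sample
n_snps    <- ncol(genotype_toy) - 2  # each non-ID column is one SNP

# A single cell holds one genotype, which splits into two alleles
one_genotype <- strsplit(genotype_toy$snp1[2], " ")[[1]]

print(n_samples)
print(n_snps)
print(one_genotype)
```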

6.2 R Code

# Load libraries
library(tidyverse)

# Step 1: Drop FID and IID from genotype_tidy to isolate genotype columns
geno_alleles <- genotype_tidy[, -c(1, 2)]

# Step 2: Convert allele strings to numeric minor allele counts
geno_minor_allele_count <- map_dfc(geno_alleles, function(allele_vec) {
  # Split all genotype strings (e.g., "A G") into individual alleles
  alleles <- unlist(str_split(allele_vec, " "))

  # Drop missing calls (NA, or "0" in .ped files) so they are never tallied as an allele
  alleles <- alleles[!is.na(alleles) & alleles != "0"]
  allele_counts <- table(alleles)

  # Monomorphic or malformed SNPs have no minor allele: return NA for every sample
  if (length(allele_counts) < 2) return(rep(NA_integer_, length(allele_vec)))

  # Identify the minor allele (less frequent)
  minor_allele <- names(sort(allele_counts))[1]

  # Count how many copies of the minor allele are in each genotype
  vapply(allele_vec, function(gt) {
    # Missing or malformed genotypes become NA
    if (is.na(gt) || gt %in% c("0 0", "0 1", "1 0", "1 1", "0", "1")) return(NA_integer_)
    split_alleles <- unlist(str_split(gt, " "))
    if (length(split_alleles) != 2) return(NA_integer_)
    sum(split_alleles == minor_allele)
  }, integer(1), USE.NAMES = FALSE)
})

# Step 3: Add back sample identifiers
genotype_count <- bind_cols(genotype_tidy[, 1:2], geno_minor_allele_count)

# Step 4: Preview the cleaned matrix
glimpse(genotype_count[, 1:5])
Rows: 413
Columns: 5
$ X1        <chr> "081215-A05", "081215-A06", "081215-A07", "081215-A08", "090…
$ X2        <dbl> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 20, …
$ id1000001 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ id1000003 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ id1000005 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …

Takeaway: Recoding genotype strings into minor allele counts (0, 1, 2, with NA for missing calls) gives statistical GWAS models the numeric input they need. It standardizes the data and prepares it for PCA, association testing, or genomic prediction.
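
Before passing the recoded matrix to downstream tools, a quick sanity check can help. The sketch below runs on a toy stand-in (`genotype_count_toy`, with invented values) rather than on the real `genotype_count`. It assumes the first two columns are identifiers and the remaining columns are minor allele counts.

```r
# Toy stand-in for the recoded matrix (hypothetical values, not real data)
genotype_count_toy <- data.frame(
  FID  = c("f1", "f2", "f3", "f4"),
  IID  = 1:4,
  snp1 = c(0L, 1L, 0L, 2L),
  snp2 = c(0L, NA, 1L, 0L)
)
counts <- as.matrix(genotype_count_toy[, -c(1, 2)])

# Every entry should be 0, 1, 2, or NA
all_valid <- all(counts %in% c(0, 1, 2, NA))

# Minor allele frequency per SNP: mean count divided by 2 (two alleles per sample)
maf <- colMeans(counts, na.rm = TRUE) / 2

# Fraction of missing calls per SNP
missing_rate <- colMeans(is.na(counts))

print(all_valid)
print(maf)
print(missing_rate)
```

If the counted allele really is the minor one, each frequency computed this way should be at most 0.5. A larger value suggests the wrong allele was counted for that SNP.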