Q&A 5 How do you tidy the genotype matrix from a .ped file in R?

5.1 Explanation

PLINK .ped files store genotype data in a wide format: each row is one individual, and after six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) every SNP occupies two columns, one per allele. This layout is awkward for downstream analysis, so we convert it to a tidy format where:

  • Each row represents one sample
  • Each column represents one SNP
  • Alleles are combined (e.g., "A A", "A G", "G G")

We also use the SNP metadata from the .map file, loaded as snp_info, to label the columns.
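
To make the target shape concrete, here is a minimal sketch of the same reshaping on a made-up table with two samples and two SNPs. The object names (toy_ped, snp_ids, toy_tidy) and all values are invented for illustration and are not taken from the sativas413 data. It uses base R only, so it runs without loading any packages.

```r
# Hypothetical toy data laid out like a .ped file:
# six leading columns, then two allele columns per SNP.
toy_ped <- data.frame(
  FID = c("F1", "F2"), IID = c(1, 2), PID = 0, MID = 0, SEX = 0, PHENO = -9,
  a1 = c("A", "G"), a2 = c("A", "G"),   # SNP 1: allele 1, allele 2
  b1 = c("C", "C"), b2 = c("T", "C"),   # SNP 2: allele 1, allele 2
  stringsAsFactors = FALSE
)
snp_ids <- c("snp1", "snp2")            # hypothetical SNP names

# Drop the six leading columns, keeping only the allele columns
alleles <- toy_ped[, -(1:6)]

# Index of the first allele column of each SNP: 1, 3, 5, ...
first_cols <- seq(1, ncol(alleles), by = 2)

# Paste each allele pair into one genotype string per sample
calls <- lapply(first_cols, function(i) paste(alleles[[i]], alleles[[i + 1]]))
calls <- setNames(calls, snp_ids)

# One row per sample, one column per SNP
toy_tidy <- cbind(toy_ped[, 1:2], as.data.frame(calls, stringsAsFactors = FALSE))
print(toy_tidy)
```

In the result, sample F1 carries "A A" at snp1 and "C T" at snp2, while sample F2 carries "G G" and "C C". The code in section 5.2 applies the same pairing to the full dataset.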

5.2 R Code

# Load necessary library
library(tidyverse)

# Step 1: Load genotype and SNP info (if not already in memory)
ped_data <- read_rds("data/sativas413.rds")
snp_info <- read_table("data/sativas413.map", 
                       col_names = c("chr", "snp_id", "gen_dist", "bp_pos"), 
                       show_col_types = FALSE)

# Step 2: Separate sample IDs and genotype calls
sample_ids <- ped_data[, 1:2]              # FID and IID
genotype_matrix <- ped_data[, -(1:6)]      # Alleles only

# Step 3: Verify expected SNP count
n_snps <- ncol(genotype_matrix) / 2
stopifnot(n_snps == nrow(snp_info))

# Step 4: Combine each pair of allele columns into genotype strings
genotype_calls <- map_dfc(seq(1, ncol(genotype_matrix), by = 2), function(i) {
  paste(genotype_matrix[[i]], genotype_matrix[[i + 1]])
})
names(genotype_calls) <- snp_info$snp_id  # Use SNP IDs as column names

# Step 5: Combine with sample IDs
genotype_tidy <- bind_cols(sample_ids, genotype_calls)

# Step 6: Preview output
head(genotype_tidy[1:10, 1:5])  # Show FID, IID, and first 3 SNPs (head() keeps the first 6 rows)
# A tibble: 6 × 5
  X1            X2 id1000001 id1000003 id1000005
  <chr>      <dbl> <chr>     <chr>     <chr>    
1 081215-A05     1 T T       T T       C C      
2 081215-A06     3 C C       C C       C C      
3 081215-A07     4 C C       C C       C C      
4 081215-A08     5 C C       C C       T T      
5 090414-A09     6 C C       C C       C C      
6 090414-A10     7 T T       T T       C C      
glimpse(genotype_tidy[1:5, 1:10])  # Show FID, IID, and first 8 SNPs
Rows: 5
Columns: 10
$ X1        <chr> "081215-A05", "081215-A06", "081215-A07", "081215-A08", "090…
$ X2        <dbl> 1, 3, 4, 5, 6
$ id1000001 <chr> "T T", "C C", "C C", "C C", "C C"
$ id1000003 <chr> "T T", "C C", "C C", "C C", "C C"
$ id1000005 <chr> "C C", "C C", "C C", "T T", "C C"
$ id1000007 <chr> "G G", "A A", "A A", "G G", "A A"
$ id1000008 <chr> "T T", "G G", "G G", "G G", "G G"
$ id1000011 <chr> "A A", "0 0", "G G", "A A", "G G"
$ id1000013 <chr> "C C", "C C", "C C", "T T", "C C"
$ id1000015 <chr> "T T", "G G", "G G", "G G", "G G"

Takeaway: Tidying .ped genotype data into a clean sample-by-SNP format makes it much easier to analyze, visualize, or convert to numeric dosages. Use snp_info as the single source of SNP metadata so that column labels stay consistent with the .map file.
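
As a rough illustration of the dosage conversion mentioned above, the sketch below counts copies of a chosen allele in each genotype string. The helper name count_allele is hypothetical, and which allele to count is an assumption you must supply yourself (for example, the minor allele). Calls containing "0", PLINK's default missing-allele code, are returned as NA.

```r
# Hypothetical helper: count copies of `allele` in "X Y" genotype strings.
# Genotypes containing "0" (PLINK's default missing code) become NA.
count_allele <- function(genotypes, allele) {
  parts <- strsplit(genotypes, " ", fixed = TRUE)
  vapply(parts, function(p) {
    if (any(p == "0")) NA_integer_ else sum(p == allele)
  }, integer(1))
}

# Genotype strings for id1000011, copied from the glimpse() output above
calls <- c("A A", "0 0", "G G", "A A", "G G")
dosage_g <- count_allele(calls, "G")
print(dosage_g)  # expected: 0 NA 2 0 2
```

Applying such a helper column by column, for instance over genotype_tidy[, -(1:2)], would give a numeric sample-by-SNP matrix, provided a counted allele has been chosen for every SNP.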