Q&A 5 How do you tidy the genotype matrix from a .ped file in R?
5.1 Explanation
PLINK .ped files store genotype data in a wide format, where each SNP is represented by two columns per individual (one for each allele). This structure is inefficient for downstream analysis, so we convert it to a tidy format where:
- Each row represents one sample
- Each column represents one SNP
- Alleles are combined into one genotype string (e.g., "A A", "A G", "G G")
We also use the SNP metadata from the .map file, loaded as snp_info, to label the columns.
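To make the reshaping concrete, here is a minimal sketch on a made-up table with two samples and two SNPs. The object names (toy_ped, toy_alleles, toy_calls, toy_tidy) and the SNP IDs snpA and snpB are hypothetical stand-ins for the real data; the layout assumes the six leading .ped columns followed by one pair of allele columns per SNP.

```r
library(tidyverse)

# Hypothetical toy data in .ped layout: 6 leading columns, then 2 allele columns per SNP
toy_ped <- tribble(
  ~X1,  ~X2, ~X3, ~X4, ~X5, ~X6, ~X7, ~X8, ~X9, ~X10,
  "s1", 1,   0,   0,   0,   -9,  "A", "A", "C", "T",
  "s2", 2,   0,   0,   0,   -9,  "A", "G", "T", "T"
)

toy_alleles <- toy_ped[, -(1:6)]                  # allele columns only
first_cols  <- seq(1, ncol(toy_alleles), by = 2)  # first allele column of each SNP

# Paste each allele pair into one genotype string per SNP
toy_calls <- map(first_cols, \(i) paste(toy_alleles[[i]], toy_alleles[[i + 1]])) |>
  set_names(c("snpA", "snpB")) |>                 # hypothetical SNP IDs
  as_tibble()

# One row per sample, one column per SNP
toy_tidy <- bind_cols(toy_ped[, 1:2], toy_calls)
toy_tidy
```

Four allele columns collapse into two genotype columns, so the result has one row per sample and one column per SNP alongside the two ID columns.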
5.2 R Code
# Load necessary library
library(tidyverse)
# Step 1: Load genotype and SNP info (if not already in memory)
ped_data <- read_rds("data/sativas413.rds")
snp_info <- read_table("data/sativas413.map",
                       col_names = c("chr", "snp_id", "gen_dist", "bp_pos"),
                       show_col_types = FALSE)
# Step 2: Separate sample IDs and genotype calls
sample_ids <- ped_data[, 1:2] # FID and IID
genotype_matrix <- ped_data[, -(1:6)] # Alleles only
# Step 3: Verify expected SNP count
n_snps <- ncol(genotype_matrix) / 2
stopifnot(n_snps == nrow(snp_info))
# Step 4: Combine each pair of allele columns into genotype strings
genotype_calls <- map_dfc(seq(1, ncol(genotype_matrix), by = 2), function(i) {
  paste(genotype_matrix[[i]], genotype_matrix[[i + 1]])
})
names(genotype_calls) <- snp_info$snp_id # Use SNP IDs as column names
# Step 5: Combine with sample IDs
genotype_tidy <- bind_cols(sample_ids, genotype_calls)
# Step 6: Preview output
head(genotype_tidy[1:10, 1:5]) # Show FID, IID, and the first 3 SNPs
# A tibble: 6 × 5
X1 X2 id1000001 id1000003 id1000005
<chr> <dbl> <chr> <chr> <chr>
1 081215-A05 1 T T T T C C
2 081215-A06 3 C C C C C C
3 081215-A07 4 C C C C C C
4 081215-A08 5 C C C C T T
5 090414-A09 6 C C C C C C
6 090414-A10 7 T T T T C C
Rows: 5
Columns: 10
$ X1 <chr> "081215-A05", "081215-A06", "081215-A07", "081215-A08", "090…
$ X2 <dbl> 1, 3, 4, 5, 6
$ id1000001 <chr> "T T", "C C", "C C", "C C", "C C"
$ id1000003 <chr> "T T", "C C", "C C", "C C", "C C"
$ id1000005 <chr> "C C", "C C", "C C", "T T", "C C"
$ id1000007 <chr> "G G", "A A", "A A", "G G", "A A"
$ id1000008 <chr> "T T", "G G", "G G", "G G", "G G"
$ id1000011 <chr> "A A", "0 0", "G G", "A A", "G G"
$ id1000013 <chr> "C C", "C C", "C C", "T T", "C C"
$ id1000015 <chr> "T T", "G G", "G G", "G G", "G G"
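The preview above includes a "0 0" call (id1000011, second sample). In PLINK text files, 0 is the default missing-allele code, so such calls are normally treated as missing rather than as a real genotype. A minimal sketch of recoding them to NA follows; it uses a hypothetical toy tibble (toy_tidy, with made-up SNP columns snpA and snpB) in place of the real genotype_tidy, and assumes the ID columns are named X1 and X2 as in the output above.

```r
library(tidyverse)

# Hypothetical toy data standing in for genotype_tidy
toy_tidy <- tibble(
  X1   = c("s1", "s2", "s3"),
  X2   = c(1, 2, 3),
  snpA = c("A A", "0 0", "G G"),
  snpB = c("C C", "C T", "0 0")
)

# Recode PLINK's missing genotype ("0 0") to NA in every SNP column
toy_clean <- toy_tidy |>
  mutate(across(-c(X1, X2), \(g) na_if(g, "0 0")))

# Count missing calls per SNP
colSums(is.na(toy_clean[, -(1:2)]))
```

Recoding before any summary step keeps "0 0" from being counted as a third homozygous class.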
✅ Takeaway: Tidying .ped genotype data into a clean sample-by-SNP format makes it much easier to analyze, visualize, or convert to numeric dosages. Use snp_info consistently as the SNP metadata reference to avoid conflicts.
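As one illustration of the dosage conversion mentioned in the takeaway, the sketch below counts copies of a chosen allele in each genotype string. The helper to_dosage() is hypothetical and not part of any package, and which allele to count is an assumption you would supply yourself (for example the minor or alternate allele for each SNP).

```r
# Hypothetical helper: count copies of `allele` in genotype strings such as "A G".
# Missing input (NA) and PLINK's missing code ("0") both yield NA.
to_dosage <- function(genotype, allele) {
  parts <- strsplit(genotype, " ", fixed = TRUE)
  vapply(parts, function(p) {
    if (any(is.na(p)) || any(p == "0")) NA_integer_ else sum(p == allele)
  }, integer(1))
}

to_dosage(c("A A", "A G", "G G", "0 0", NA), allele = "G")
# 0 1 2 NA NA
```

Applied column by column to the tidy table, a helper like this would give a numeric sample-by-SNP matrix, with the counted allele chosen separately for each SNP.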