Lesson 3 Genotype and Phenotype Data Structures
3.1 Why data structure matters in GWAS
GWAS is not only a statistical problem.
It is also a data organization problem.
Genotype and phenotype data often originate from different sources, follow different formats, and evolve over time.
A clear understanding of how these data are structured is essential for reproducible and interpretable analyses.
3.2 Genotype data at a high level
Genotype data encode genetic variation for each individual in the study.
In GWAS, this usually means:
- millions of variants
- thousands to hundreds of thousands of samples
- discrete genotype values representing allele counts
Because of this scale, genotype data are almost always stored in specialized formats rather than simple text tables.
3.3 Common genotype representations
At the conceptual level, genotypes can be represented as:
- 0, 1, or 2 copies of a designated allele (the reference or the alternate allele, depending on the convention in use)
- missing values for uncertain or unavailable calls
These values form a large matrix:
- rows correspond to individuals
- columns correspond to variants
In practice, this matrix is stored in compressed, indexed formats to enable efficient access.
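To make the matrix picture concrete, here is a minimal sketch in plain Python. The sample IDs, variant IDs, and genotype values are invented for illustration, and the counts are assumed to refer to the alternate allele. Real studies would read a compressed, indexed genotype format through a dedicated library rather than hold nested lists in memory.

```python
# Hypothetical toy data: 4 individuals (rows) x 3 variants (columns).
# Each entry is the number of copies of the alternate allele (0, 1, or 2);
# None marks a missing call.
sample_ids = ["S1", "S2", "S3", "S4"]
variant_ids = ["rs_a", "rs_b", "rs_c"]
genotypes = [
    [0, 1, 2],
    [1, None, 2],
    [0, 1, 1],
    [2, 0, None],
]


def variant_summary(matrix, column):
    """Return (call_rate, alt_allele_frequency) for one variant column."""
    calls = [row[column] for row in matrix]
    observed = [g for g in calls if g is not None]
    call_rate = len(observed) / len(calls)
    # Each diploid individual carries two alleles, hence the factor of 2.
    alt_freq = sum(observed) / (2 * len(observed)) if observed else None
    return call_rate, alt_freq


summaries = {v: variant_summary(genotypes, j) for j, v in enumerate(variant_ids)}
for variant, (call_rate, alt_freq) in summaries.items():
    print(variant, call_rate, alt_freq)
```

Missing calls are left out of the allele frequency rather than counted as 0, which keeps "unknown" distinct from "homozygous reference".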
3.4 Phenotype data
Phenotype data describe the traits being analyzed.
Unlike genotype data, phenotype data are typically stored in tabular form, where:
- each row represents an individual
- each column represents a trait, covariate, or metadata field
Phenotype tables often include:
- the primary trait of interest
- demographic variables
- clinical or experimental metadata
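A phenotype table of this shape can be sketched with the standard library alone. The column names (`sample_id`, `height_cm`, `age`, `sex`, `collection_site`), the `NA` missing-value marker, and the values are assumptions chosen for illustration, not a required layout.

```python
import csv
import io

# Hypothetical tab-separated phenotype table: one row per individual,
# one column per trait, covariate, or metadata field.
PHENOTYPE_TSV = (
    "sample_id\theight_cm\tage\tsex\tcollection_site\n"
    "S1\t172.5\t41\tF\tsite_A\n"
    "S2\tNA\t56\tM\tsite_B\n"
    "S3\t160.2\t38\tF\tsite_A\n"
)

NUMERIC_COLUMNS = {"height_cm", "age"}


def load_phenotypes(handle):
    """Return {sample_id: {column: value}}, mapping 'NA' to None."""
    table = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        sample_id = row.pop("sample_id")
        record = {}
        for column, text in row.items():
            if text == "NA":
                record[column] = None          # keep missing values explicit
            elif column in NUMERIC_COLUMNS:
                record[column] = float(text)   # numeric trait or covariate
            else:
                record[column] = text          # categorical or metadata field
        table[sample_id] = record
    return table


phenotypes = load_phenotypes(io.StringIO(PHENOTYPE_TSV))
print(phenotypes["S2"])
```

Keying the table by sample ID makes the later link to genotype data a simple lookup.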
3.5 Linking genotypes and phenotypes
A critical requirement in GWAS is that genotype and phenotype data align correctly.
This alignment is achieved through:
- unique sample identifiers
- consistent naming conventions
- careful handling of missing or excluded samples
Mismatches between genotype and phenotype records are a common source of silent errors.
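One way to turn such silent errors into visible ones is to compare the two ID lists explicitly before any analysis. The sketch below uses hypothetical helper names (`find_duplicates`, `align_samples`) and invented IDs. Note the lowercase `s2`, which stands in for a naming-convention mismatch.

```python
from collections import Counter


def find_duplicates(ids):
    """Return IDs that occur more than once, sorted for stable output."""
    return sorted(sid for sid, n in Counter(ids).items() if n > 1)


def align_samples(genotype_ids, phenotype_ids):
    """Compare two sample ID lists and report how they overlap."""
    geno_set, pheno_set = set(genotype_ids), set(phenotype_ids)
    return {
        # Shared IDs are kept in genotype order so rows can be matched directly.
        "shared": [sid for sid in genotype_ids if sid in pheno_set],
        "genotype_only": sorted(geno_set - pheno_set),
        "phenotype_only": sorted(pheno_set - geno_set),
        "duplicated": find_duplicates(genotype_ids) + find_duplicates(phenotype_ids),
    }


# Hypothetical IDs: "s2" differs from "S2" only by case, and S5 has no genotypes.
genotype_ids = ["S1", "S2", "S3", "S4"]
phenotype_ids = ["S3", "S1", "s2", "S5"]

report = align_samples(genotype_ids, phenotype_ids)
for key, value in report.items():
    print(key, value)
```

Reporting the unmatched and duplicated IDs, rather than quietly dropping them, leaves a record of which samples were excluded and why.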
3.6 Covariates as structured data
Covariates are part of the phenotype data but play a special role.
They are:
- included in association models
- used to control for confounding
- often derived from external or transformed variables
Keeping covariates clearly documented and versioned is important for reproducibility.
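As a rough illustration of derived and versioned covariates, the sketch below builds two transformed variables and computes a content hash that can serve as a version label. The specific covariates (`age_squared`, `is_female`) and the use of a SHA-256 digest are assumptions made for the example, not a prescribed workflow.

```python
import hashlib
import json

# Hypothetical raw covariate records keyed by sample ID.
raw = {
    "S1": {"age": 41, "sex": "F"},
    "S2": {"age": 56, "sex": "M"},
}


def derive_covariates(records):
    """Build transformed covariates without altering the input records."""
    derived = {}
    for sample_id, rec in records.items():
        derived[sample_id] = {
            "age": rec["age"],
            "age_squared": rec["age"] ** 2,        # transformed variable
            "is_female": int(rec["sex"] == "F"),   # numeric coding of sex
        }
    return derived


def fingerprint(table):
    """Return a SHA-256 digest of a canonical JSON dump of the table."""
    canonical = json.dumps(table, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


covariates = derive_covariates(raw)
covariate_version = fingerprint(covariates)
print(covariate_version[:12])
```

Storing the digest alongside the results ties each analysis to the exact covariate table it used. Any change to a value or a column produces a different digest.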
3.7 Metadata and provenance
Beyond genotypes and phenotypes, GWAS relies on metadata.
Examples include:
- sample collection information
- genotyping platform details
- batch identifiers
- processing versions
Metadata provide context and are essential for diagnosing artifacts and interpreting results.
3.8 Key takeaways
- GWAS depends on careful organization of genotype and phenotype data
- Genotypes are stored in specialized, large-scale formats
- Phenotypes and covariates are usually tabular
- Correct sample alignment is critical
- Metadata support interpretation and reproducibility
Continue to → Lesson 04: Quality Control Decisions