Lesson 3 Genotype and Phenotype Data Structures

3.1 Why data structure matters in GWAS

GWAS is not only a statistical problem.
It is also a data organization problem.

Genotype and phenotype data often originate from different sources, follow different formats, and evolve over time.
A clear understanding of how these data are structured is essential for reproducible and interpretable analyses.

3.2 Genotype data at a high level

Genotype data encode genetic variation for each individual in the study.

In GWAS, this usually means:

  • millions of variants
  • thousands to hundreds of thousands of samples
  • discrete genotype values representing allele counts

Because of this scale, genotype data are almost always stored in specialized formats rather than simple text tables.
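A rough back-of-the-envelope calculation makes the scale concrete. The study sizes below are illustrative assumptions, not figures from a particular dataset, and the 2-bit packing is a generic scheme rather than the layout of any specific file format.

```python
# Illustrative sizes only (assumed, not taken from a specific study).
n_variants = 1_000_000
n_samples = 100_000

# One genotype call per sample per variant.
n_genotypes = n_variants * n_samples

# Plain text table: one character per genotype plus one separator character.
text_bytes = n_genotypes * 2

# Packed: the four states (0, 1, 2, missing) fit in 2 bits,
# so four genotype calls fit in a single byte.
packed_bytes = n_genotypes // 4

print(f"genotype calls: {n_genotypes:.1e}")
print(f"text table:     {text_bytes / 1e9:.0f} GB")
print(f"2-bit packed:   {packed_bytes / 1e9:.0f} GB")
```

Even before compression, the packed representation is roughly eight times smaller than the text table under these assumptions.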

3.3 Common genotype representations

At the conceptual level, genotypes can be represented as:

  • 0, 1, or 2 copies of a designated allele (either the reference or the alternate allele, chosen consistently across the dataset)
  • missing values for uncertain or unavailable calls

These values form a large matrix:

  • rows correspond to individuals
  • columns correspond to variants

In practice, this matrix is stored in compressed, indexed formats to enable efficient access.
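The conceptual matrix can be sketched in a few lines of plain Python. This is a toy illustration with made-up values, assuming diploid genotypes at biallelic sites; the helper `variant_summary` is a hypothetical name, and real analyses would read the data from a specialized format rather than from nested lists.

```python
# Toy genotype matrix: rows are individuals, columns are variants.
# Each entry counts copies of one designated allele (0, 1, or 2);
# None marks a missing call. All values are made up for illustration.
genotypes = [
    [0, 1, 2, None],
    [1, 1, 0, 0],
    [2, None, 0, 1],
]


def variant_summary(matrix, col):
    """Return (allele_frequency, missing_rate) for one variant column."""
    calls = [row[col] for row in matrix]
    observed = [g for g in calls if g is not None]
    missing_rate = 1 - len(observed) / len(calls)
    # Each observed (diploid) individual contributes two alleles at the site.
    freq = sum(observed) / (2 * len(observed)) if observed else None
    return freq, missing_rate


for j in range(len(genotypes[0])):
    print(j, variant_summary(genotypes, j))
```

Missing calls are excluded from the allele frequency rather than treated as zero, which is why they need a representation of their own.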

3.4 Phenotype data

Phenotype data describe the traits being analyzed.

Unlike genotype data, phenotype data are typically stored in tabular form, where:

  • each row represents an individual
  • each column represents a trait, covariate, or metadata field

Phenotype tables often include:

  • the primary trait of interest
  • demographic variables
  • clinical or experimental metadata
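A phenotype table of this kind might be loaded as follows. This is a minimal sketch: the column names, the sample IDs, the values, and the use of `NA` as the missing-value code are all assumptions made for illustration, and `load_phenotypes` is a hypothetical helper.

```python
import csv
import io

# Hypothetical tab-separated phenotype table; all names and values are invented.
PHENO_TSV = """sample_id\theight_cm\tage\tsex\tsite
S001\t172.5\t44\tF\tA
S002\tNA\t51\tM\tB
S003\t165.0\t38\tF\tA
"""


def load_phenotypes(text):
    """Parse the table into {sample_id: {column: value}}, mapping 'NA' to None."""
    rows = {}
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        sample_id = rec.pop("sample_id")
        if sample_id in rows:
            # Duplicate IDs would make later alignment ambiguous.
            raise ValueError(f"duplicate sample ID: {sample_id}")
        rows[sample_id] = {k: (None if v == "NA" else v) for k, v in rec.items()}
    return rows


phenotypes = load_phenotypes(PHENO_TSV)
print(phenotypes["S002"])
```

Values are kept as strings here; converting each column to a numeric or categorical type would be a separate, explicit step.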

3.5 Linking genotypes and phenotypes

A critical requirement in GWAS is that genotype and phenotype data align correctly.

This alignment is achieved through:

  • unique sample identifiers
  • consistent naming conventions
  • careful handling of missing or excluded samples

Mismatches between genotype and phenotype records are a common source of silent errors.
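One way to make such mismatches visible is to align records explicitly by sample ID and report everything that fails to match. The sketch below assumes the genotype sample order is available as a list of IDs and the phenotypes as a dictionary keyed by ID; `align_samples` and all IDs and values are hypothetical.

```python
def align_samples(genotype_ids, phenotype_by_id):
    """Align phenotype records to the genotype sample order.

    Returns (kept_ids, aligned_records, genotype_only, phenotype_only).
    """
    geno_set = set(genotype_ids)
    if len(geno_set) != len(genotype_ids):
        raise ValueError("duplicate sample IDs in genotype data")
    # Keep genotype order so row i of the genotype matrix matches record i.
    kept = [s for s in genotype_ids if s in phenotype_by_id]
    aligned = [phenotype_by_id[s] for s in kept]
    # Report unmatched samples instead of dropping them silently.
    genotype_only = [s for s in genotype_ids if s not in phenotype_by_id]
    phenotype_only = sorted(set(phenotype_by_id) - geno_set)
    return kept, aligned, genotype_only, phenotype_only


genotype_ids = ["S003", "S001", "S004"]
phenotype_by_id = {
    "S001": {"trait": 1.2},
    "S002": {"trait": 0.7},
    "S003": {"trait": 2.1},
}

kept, aligned, geno_only, pheno_only = align_samples(genotype_ids, phenotype_by_id)
print("kept:", kept)
print("genotype only:", geno_only)
print("phenotype only:", pheno_only)
```

Joining on IDs rather than relying on row position means a reordered or filtered file produces a reported discrepancy instead of a shifted, plausible-looking result.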

3.6 Covariates as structured data

Covariates are part of the phenotype data but play a special role.

They are:

  • included in association models
  • used to control for confounding
  • often derived from external or transformed variables

Keeping covariates clearly documented and versioned is important for reproducibility.
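As a small illustration of derived and versioned covariates, the sketch below adds a transformed variable and computes a content fingerprint for the resulting table. The choice of age squared as the derived covariate, the records, and the helper names `derive_covariates` and `fingerprint` are assumptions for the example; a hash is only one of several possible versioning approaches.

```python
import hashlib
import json


def derive_covariates(records):
    """Return new records with a transformed covariate (age squared) added."""
    return [{**r, "age_sq": r["age"] ** 2} for r in records]


def fingerprint(records):
    """Short content hash so a covariate table can be referred to by version."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Hypothetical raw covariates; IDs and values are invented.
raw = [
    {"sample_id": "S001", "age": 44, "sex": "F"},
    {"sample_id": "S002", "age": 51, "sex": "M"},
]

covariates = derive_covariates(raw)
print(covariates[0]["age_sq"], fingerprint(covariates))
```

Recording the fingerprint alongside the analysis output makes it possible to check later which version of the covariate table a result was computed from.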

3.7 Metadata and provenance

Beyond genotypes and phenotypes, GWAS relies on metadata.

Examples include:

  • sample collection information
  • genotyping platform details
  • batch identifiers
  • processing versions

Metadata provide context and are essential for diagnosing artifacts and interpreting results.
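A simple use of metadata for diagnosing artifacts is to cross-tabulate batch against case/control status: if the two are strongly associated, a batch effect can be difficult to separate from a true signal. The records, field names, and the helper `batch_by_status` below are hypothetical.

```python
from collections import Counter

# Hypothetical per-sample metadata; all fields and values are invented.
metadata = [
    {"sample_id": "S001", "batch": "B1", "platform": "array_v1", "status": "case"},
    {"sample_id": "S002", "batch": "B1", "platform": "array_v1", "status": "case"},
    {"sample_id": "S003", "batch": "B2", "platform": "array_v1", "status": "control"},
    {"sample_id": "S004", "batch": "B2", "platform": "array_v1", "status": "control"},
    {"sample_id": "S005", "batch": "B1", "platform": "array_v1", "status": "control"},
]


def batch_by_status(records):
    """Count samples for each (batch, status) combination."""
    return Counter((r["batch"], r["status"]) for r in records)


counts = batch_by_status(metadata)
for (batch, status), n in sorted(counts.items()):
    print(batch, status, n)
```

In this toy table, batch B2 contains only controls, which is the kind of imbalance worth noting before interpreting association results.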

3.8 Key takeaways

  • GWAS depends on careful organization of genotype and phenotype data
  • Genotypes are stored in specialized, large-scale formats
  • Phenotypes and covariates are usually tabular
  • Correct sample alignment is critical
  • Metadata support interpretation and reproducibility