A Reproducibility and Reference Material

A.1 Purpose of this appendix

This appendix collects supporting material that does not belong in the main lesson flow but is essential for good GWAS practice.

The main lessons focus on concepts and workflows.
This appendix focuses on reproducibility, conventions, and references that learners can return to as needed.

A.2 Reproducibility principles

All analyses in this guide aim to be:

  • reproducible on a fresh system
  • explicit about inputs and assumptions
  • transparent about decisions that affect results

Key practices include:

  • fixing random seeds where applicable
  • recording explicit software versions
  • saving intermediate results when appropriate
  • separating exploratory work from reported results
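
As a minimal sketch of the first practice, the example below seeds a NumPy random generator so that a randomized step gives the same result on every run. The seed value and the helper names make_rng and permute_phenotype are illustrative assumptions, not part of the guide's own code.

```python
import numpy as np

# Arbitrary illustrative seed; record whatever value you use with the results.
SEED = 20240101


def make_rng(seed: int = SEED) -> np.random.Generator:
    """Return a NumPy Generator seeded for repeatable draws."""
    return np.random.default_rng(seed)


def permute_phenotype(phenotype: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Shuffle phenotype values, for example to build a permutation null."""
    return rng.permutation(phenotype)


phenotype = np.array([0.3, 1.2, -0.7, 0.9, 0.1])

# Two generators created from the same seed produce the same shuffle.
first = permute_phenotype(phenotype, make_rng())
second = permute_phenotype(phenotype, make_rng())
same = bool(np.array_equal(first, second))
print(same)
```

Passing the generator explicitly, rather than relying on global random state, keeps it visible which steps depend on randomness.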

A.3 Software environment

A typical environment for this guide includes:

  • Python (3.10+)
  • NumPy, pandas
  • matplotlib
  • statsmodels
  • domain-specific tools introduced later in the guide

Exact versions will change over time.
When results matter, always record the environment used, including the Python version and the version of each package.
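
One way to record the environment is sketched below using only the Python standard library. The package list mirrors the one above; the function name describe_environment and the shape of the returned dictionary are assumptions made for illustration.

```python
import json
import platform
from importlib import metadata

# Packages named in this appendix; extend the list for your own project.
PACKAGES = ["numpy", "pandas", "matplotlib", "statsmodels"]


def describe_environment(packages=PACKAGES) -> dict:
    """Collect the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


env = describe_environment()
print(json.dumps(env, indent=2))
```

Saving this output next to the results (for example as a JSON file) makes it possible to check later which versions produced them.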

A.4 Data conventions used in this guide

Throughout the guide, we use consistent conventions:

  • samples are identified by a unique sample_id
  • genotypes are coded additively as 0, 1, or 2, the number of copies of the counted allele
  • missing values are represented explicitly, never as a value that could be mistaken for data
  • phenotypes and covariates are stored in tidy tables

These conventions make it easier to reason about models and results.
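
The conventions above can be checked mechanically. The sketch below assumes a pandas table with one row per sample and uses NaN as the explicit missing marker; the column names snp_a, snp_b, and phenotype and the function check_conventions are hypothetical.

```python
import numpy as np
import pandas as pd


def check_conventions(table: pd.DataFrame, genotype_cols: list[str]) -> list[str]:
    """Return a list of convention violations (empty if none are found)."""
    problems = []
    if "sample_id" not in table.columns:
        problems.append("missing sample_id column")
    elif table["sample_id"].duplicated().any():
        problems.append("sample_id values are not unique")
    for col in genotype_cols:
        observed = table[col].dropna()  # NaN marks a missing genotype
        if not observed.isin([0, 1, 2]).all():
            problems.append(f"{col}: values outside 0, 1, 2")
    return problems


# Tidy layout: one row per sample, one column per variable.
samples = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3", "S4"],
    "snp_a": [0, 1, 2, np.nan],
    "snp_b": [2, 2, 1, 0],
    "phenotype": [1.4, 0.2, np.nan, 0.9],
})

problems = check_conventions(samples, ["snp_a", "snp_b"])
print(problems)
```

Running a check like this immediately after loading data catches duplicated identifiers or stray genotype codes before they reach a model.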

A.5 On thresholds and defaults

Many GWAS steps involve thresholds:

  • missingness cutoffs
  • minor allele frequency filters
  • significance thresholds

Defaults shown in examples are illustrative, not universal.

Always consider:

  • study design
  • sample size
  • population structure
  • downstream goals

There is no single correct set of thresholds.
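
Because thresholds are choices, it helps to make them explicit parameters rather than constants buried in code. The sketch below assumes a samples-by-variants NumPy array coded 0, 1, 2 with NaN for missing calls. The function names and the default cutoffs (0.05 missingness, 0.01 minor allele frequency) are illustrative assumptions, not recommendations.

```python
import numpy as np


def variant_qc_mask(genotypes, max_missing=0.05, min_maf=0.01):
    """Return a boolean mask of variants passing missingness and MAF filters."""
    g = np.asarray(genotypes, dtype=float)
    missing_rate = np.isnan(g).mean(axis=0)
    called = (~np.isnan(g)).sum(axis=0)   # non-missing calls per variant
    allele_sum = np.nansum(g, axis=0)     # copies of the counted allele
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_sum / (2 * called)  # NaN if no calls at all
    maf = np.minimum(freq, 1 - freq)
    return (missing_rate <= max_missing) & (maf >= min_maf)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


genotypes = np.array([
    [0, 0, 1, np.nan],
    [1, 0, 2, np.nan],
    [2, 0, 1, 0],
    [1, 0, 0, 1],
])

keep_default = variant_qc_mask(genotypes)
keep_lenient = variant_qc_mask(genotypes, max_missing=0.5)
print(keep_default.tolist())  # [True, False, True, False]
print(keep_lenient.tolist())  # [True, False, True, True]
print(bonferroni_threshold(0.05, 1_000_000))
```

The same data yield different variant sets under different cutoffs, which is why the values used should be reported with the results. With alpha 0.05 and one million tests, the Bonferroni threshold is about 5e-8, the value often quoted as the conventional genome-wide threshold.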

A.6 Reporting GWAS results

When reporting GWAS findings:

  • distinguish statistical significance from biological relevance
  • report effect sizes and uncertainty, not only p-values
  • describe quality control and model choices clearly
  • avoid overinterpretation of single-study signals

Clear reporting is as important as correct analysis.
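
To illustrate reporting an effect size with its uncertainty, the sketch below fits a simple linear regression of a quantitative phenotype on an additively coded genotype. It uses simulated data, no covariates, and a normal approximation for the confidence interval and p-value; all three are simplifying assumptions, and the function name additive_association is hypothetical. A real analysis would normally use a dedicated modeling package.

```python
import math

import numpy as np


def additive_association(genotype, phenotype):
    """Ordinary least squares of phenotype on genotype (with intercept)."""
    g = np.asarray(genotype, dtype=float)
    y = np.asarray(phenotype, dtype=float)
    keep = ~np.isnan(g) & ~np.isnan(y)  # complete cases only
    g, y = g[keep], y[keep]
    n = g.size
    X = np.column_stack([np.ones(n), g])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - 2)    # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    beta = float(coef[1])
    se = float(np.sqrt(cov[1, 1]))
    z = beta / se
    return {
        "n": int(n),
        "beta": beta,
        "se": se,
        # Normal approximation; a t distribution is more exact for small n.
        "ci95": (beta - 1.96 * se, beta + 1.96 * se),
        "p_normal_approx": math.erfc(abs(z) / math.sqrt(2)),
    }


# Simulated data with a true per-allele effect of 0.5 (illustration only).
rng = np.random.default_rng(1)
genotype = rng.integers(0, 3, size=200).astype(float)
phenotype = 0.5 * genotype + rng.normal(size=200)

result = additive_association(genotype, phenotype)
print(f"n={result['n']}, beta={result['beta']:.3f}, se={result['se']:.3f}")
print(f"95% CI: {result['ci95'][0]:.3f} to {result['ci95'][1]:.3f}")
print(f"p (normal approx.): {result['p_normal_approx']:.2e}")
```

Reporting the estimate, its standard error, and the interval together lets readers judge both the size of the effect and how precisely it was measured.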

A.7 Further reading

Learners are encouraged to consult primary sources and reviews, including:

  • GWAS methodology reviews
  • papers on population structure and mixed models
  • best-practice guidelines for reproducible research

Specific references are listed in the References section of this guide.