Q&A 2 How do you prepare a public GWAS dataset for R-based analysis?
2.1 Explanation
To support reproducible and cross-platform workflows, it's helpful to separate raw data preparation from your analysis code. In this example, we use a Bash script to download and organize a publicly available rice diversity panel dataset from Zhao et al. (2011).
The script performs the following steps:
- 📥 Downloads PLINK-formatted genotype files (.ped, .map, .fam)
- 📦 Unzips and flattens the directory structure for easy access
- 🏷️ Renames the phenotype file for clarity (sativas413_phenotypes.txt)
- 📁 Moves all outputs into a consistent data/ directory with a sativa413_ prefix
This setup creates a clean and well-organized foundation for downstream R-based GWAS analysis.
2.2 Bash Script
#!/bin/bash
# Prepare rice GWAS genotype and phenotype data (PLINK format)
# --- Paths and filenames ---
ZIP_URL="http://ricediversity.org/data/sets/44kgwas/RiceDiversity.44K.MSU6.Genotypes_PLINK.zip"
ZIP_FILE="data/rice_gwas_genotypes.zip"
EXTRACT_DIR="data/RiceDiversity_44K_Genotypes_PLINK"
PHENO_URL="http://www.ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt"
PHENO_OUT="data/sativas413_phenotypes.txt"
# --- Step 1: Ensure data folder exists ---
mkdir -p data
# --- Step 2: Download genotype zip file (if not already present) ---
if [ ! -f "$ZIP_FILE" ]; then
  echo "⬇️ Downloading genotype data..."
  wget --no-check-certificate -O "$ZIP_FILE" "$ZIP_URL"
else
  echo "✅ Genotype zip already exists: $ZIP_FILE"
fi
# --- Step 3: Unzip genotype data ---
echo "Extracting genotype files..."
unzip -o "$ZIP_FILE" -d data/
# --- Step 4: Move files up and clean nested folder ---
if [ -d "$EXTRACT_DIR" ]; then
  mv "$EXTRACT_DIR"/* data/
  rm -rf "$EXTRACT_DIR"
fi
rm -rf data/__MACOSX
rm -f "$ZIP_FILE"
# --- Step 5: Download phenotype file and rename ---
echo "⬇️ Downloading phenotype file..."
wget --no-check-certificate -P data/ "$PHENO_URL"
mv data/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt "$PHENO_OUT"
echo "✅ GWAS data successfully prepared in the data/ folder."
The Bash script is saved as script/gwas_data.sh (see the file structure in 2.3).
To run it:
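A minimal sketch of the run step, assuming the script lives at script/gwas_data.sh (the location shown in the file structure in 2.3) and that you start from the project root, since the script writes to a relative data/ path. The SCRIPT variable is introduced here for illustration only.

```shell
# Assumed script location, taken from the file tree in section 2.3.
SCRIPT="script/gwas_data.sh"

# Run from the project root so the script's relative data/ paths resolve there.
if [ -f "$SCRIPT" ]; then
  bash "$SCRIPT"
else
  echo "Script not found: $SCRIPT (run this from the project root)" >&2
fi
```

Calling the script through bash avoids having to mark it executable first; running chmod +x on it and invoking it directly would work as well.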
2.3 File Structure
After execution, the data/ folder contains:
project/
├── script/
│   └── gwas_data.sh
└── data/
    ├── sativa413.map
    ├── sativa413.ped
    ├── sativa413.fam
    └── sativas413_phenotypes.txt
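To confirm that the layout above is actually in place before switching to R, a small check along these lines could be used. This is a sketch only: the function name check_gwas_files is hypothetical, and the file names are taken from the tree above.

```shell
# Report expected files that are missing or empty in a data directory.
# Returns 0 when all four files are present and non-empty, 1 otherwise.
check_gwas_files() {
  local dir="${1:-data}"
  local missing=0
  local f
  for f in sativa413.map sativa413.ped sativa413.fam sativas413_phenotypes.txt; do
    if [ ! -s "$dir/$f" ]; then
      echo "Missing or empty: $dir/$f" >&2
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Hypothetical usage from the project root:
#   check_gwas_files data && echo "All GWAS input files are in place."
```

Testing with -s rather than -f also catches zero-byte files, which is what an interrupted download can leave behind.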
Takeaway: Using a Bash script to automate dataset download and organization keeps your R analysis environment clean, standardized, and reproducible. The genotype files are now available in the data/ folder under the sativa413 prefix, alongside the phenotype file sativas413_phenotypes.txt.
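As a quick sanity check on the prepared files, line counts give the dataset dimensions, because PLINK text files store one marker per line in .map and one sample per line in .ped. The helper below is a sketch; count_records is a hypothetical name, and no expected counts are asserted here.

```shell
# Print the number of lines in a file, without surrounding whitespace.
count_records() {
  wc -l < "$1" | tr -d '[:space:]'
}

# Hypothetical usage with the files prepared above:
#   echo "Markers: $(count_records data/sativa413.map)"
#   echo "Samples: $(count_records data/sativa413.ped)"
```

The two numbers can later be compared with the dimensions reported when the genotypes are loaded in R.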