Q&A 2 How do you prepare a public GWAS dataset for R-based analysis?

2.1 Explanation

To support reproducible and cross-platform workflows, it’s helpful to separate raw data preparation from your analysis code. In this example, we use a Bash script to download and organize the publicly available rice diversity panel dataset from Zhao et al. (2011).

The script performs the following steps:

📥 Downloads PLINK-formatted genotype files (.ped, .map, .fam)
📦 Unzips and flattens the directory structure for easy access
🏷️ Renames the phenotype file for clarity (sativas413_phenotypes.txt)
📁 Places all outputs in a single data/ directory, where the genotype files share the sativa413 basename

This setup creates a clean and well-organized foundation for downstream R-based GWAS analysis.
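
The steps above depend on two command-line tools, wget and unzip, being installed. A pre-flight check along the following lines could catch a missing tool before any download starts. This is a minimal sketch and not part of the original script; the function name require_tools is hypothetical.

```bash
# Hypothetical helper: report any tools that are not available on PATH.
# Returns 0 if every tool is found, 1 otherwise.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "Missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling require_tools wget unzip || exit 1 before the first download step would stop the run early with a readable message.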

2.2 Bash Script

#!/bin/bash
set -euo pipefail  # Stop on the first failed command, unset variable, or failed pipe stage

# 🚀 Prepare rice GWAS genotype and phenotype data (PLINK format)

# --- Paths and filenames ---
ZIP_URL="http://ricediversity.org/data/sets/44kgwas/RiceDiversity.44K.MSU6.Genotypes_PLINK.zip"
ZIP_FILE="data/rice_gwas_genotypes.zip"
EXTRACT_DIR="data/RiceDiversity_44K_Genotypes_PLINK"
PHENO_URL="http://www.ricediversity.org/data/sets/44kgwas/RiceDiversity_44K_Phenotypes_34traits_PLINK.txt"
PHENO_OUT="data/sativas413_phenotypes.txt"

# --- Step 1: Ensure data folder exists ---
mkdir -p data

# --- Step 2: Download genotype zip file (if not already present) ---
if [ ! -f "$ZIP_FILE" ]; then
  echo "⬇️ Downloading genotype data..."
  wget --no-check-certificate -O "$ZIP_FILE" "$ZIP_URL"
else
  echo "✅ Genotype zip already exists: $ZIP_FILE"
fi

# --- Step 3: Unzip genotype data ---
echo "📂 Extracting genotype files..."
unzip -o "$ZIP_FILE" -d data/

# --- Step 4: Move files up and clean nested folder ---
if [ -d "$EXTRACT_DIR" ]; then
  mv "$EXTRACT_DIR"/* data/
  rm -rf "$EXTRACT_DIR"
fi
rm -rf data/__MACOSX
rm -f "$ZIP_FILE"

# --- Step 5: Download phenotype file directly under its new name ---
echo "⬇️ Downloading phenotype file..."
wget --no-check-certificate -O "$PHENO_OUT" "$PHENO_URL"

echo "✅ GWAS data successfully prepared in the data/ folder."

The Bash script is saved as:

script/gwas_data.sh

To run it:

bash script/gwas_data.sh
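
Because the script uses relative paths such as data/, it writes to whichever directory it is invoked from, so the command above is meant to be run from the project root. One way to make the script location-independent is to resolve the project root from the script's own path. The sketch below is illustrative and not part of the original script: the function name project_root_of is hypothetical, and it assumes the script lives in a script/ folder directly under the project root, as in this example.

```bash
# Hypothetical helper: given the path to a script stored in <root>/script/,
# print the absolute path of <root>.
project_root_of() {
  local script_path="$1"
  # The subshell keeps the caller's working directory unchanged
  (cd "$(dirname "$script_path")/.." && pwd)
}
```

Adding cd "$(project_root_of "${BASH_SOURCE[0]}")" near the top of gwas_data.sh would then place data/ in the project root regardless of the caller's working directory.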

2.3 File Structure

After execution, the project has the following layout:

project/
├── script/
│   └── gwas_data.sh
└── data/
    ├── sativa413.map
    ├── sativa413.ped
    ├── sativa413.fam
    └── sativas413_phenotypes.txt
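
A quick sanity check can confirm that this layout was actually produced. The sketch below is illustrative and not part of the original script: the function name check_gwas_files is hypothetical, and the expected file list is taken from the tree above.

```bash
# Hypothetical helper: verify that the expected GWAS files exist and are non-empty.
# Returns 0 if all four files are present, 1 otherwise.
check_gwas_files() {
  local dir="${1:-data}"
  local missing=0
  local f
  for f in sativa413.map sativa413.ped sativa413.fam sativas413_phenotypes.txt; do
    if [ -s "$dir/$f" ]; then
      # wc -l < file prints only the count; tr strips padding added by some platforms
      echo "OK       $f ($(wc -l < "$dir/$f" | tr -d ' ') lines)"
    else
      echo "MISSING  $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Run check_gwas_files data from the project root; a non-zero exit status means at least one file is missing or empty. The line counts are also informative: in PLINK text format the .ped and .fam files hold one line per sample, so their counts should match, while the .map file holds one line per SNP.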

✅ Takeaway: Using a Bash script to automate dataset download and organization keeps your R analysis environment clean, standardized, and reproducible. All required files are now in the data/ folder: the genotype files share the sativa413 basename, and the phenotypes are in sativas413_phenotypes.txt.