Module 10 — Capstone: One-Command QC Script
Time: 60–90 min
Goal: Write a robust script that down-samples (optional), runs fastqc and multiqc, and writes a text summary.
What you'll submit
- Your script
run_qc.sh - The
.outsummary it produces - 4–5 sentence reflection (what surprised you, one improvement)
1) Directory prep
mkdir -p ~/de-onramp/capstone/{data,results,logs}
cp ~/de-onramp/lesson5/data/SRR*.fastq.gz ~/de-onramp/capstone/data/
cd ~/de-onramp/capstone
conda activate rnaseq101
2) Script workflow overview
graph TD
A[Input FASTQ] --> B{Downsample?}
B -->|Yes| C[seqtk sample]
B -->|No| D[Use original]
C --> E[FastQC Analysis]
D --> E
E --> F[MultiQC Report]
F --> G[Calculate Stats]
G --> H[Generate Summary]
H --> I[Output Files]
I --> J[fastqc/ directory]
I --> K[multiqc_report.html]
I --> L[summary.txt]
style A fill:#e3f2fd
style I fill:#e8f5e8
style H fill:#fff3e0
3) Script skeleton (start here, then extend)
Create run_qc.sh:
#!/usr/bin/env bash
set -euo pipefail
usage() {
echo "Usage: $0 -i <in.fastq(.gz)> [-o <outdir>] [-p <fraction>]"
echo " -p fraction: optional downsample fraction (0< p <=1), default: 1.0"
exit 1
}
IN=""; OUT="results"; P=1.0
while getopts ":i:o:p:" opt; do
case $opt in
i) IN="$OPTARG";;
o) OUT="$OPTARG";;
p) P="$OPTARG";;
*) usage;;
esac
done
[[ -z "${IN}" ]] && usage
mkdir -p "$OUT"/{fastqc,logs}
BASENAME=$(basename "$IN" .gz); BASENAME=${BASENAME%.fastq}
WORK="$OUT/${BASENAME}"
# Downsample (if P<1) to a temp file
TMP="${WORK}.tmp.fastq.gz"
if awk "BEGIN{exit !($P < 1.0)}"; then
seqtk sample -s 1 "$IN" "$P" | gzip > "$TMP"
else
ln -sf "$(realpath "$IN")" "$TMP"
fi
# Run FastQC
fastqc -o "$OUT/fastqc" "$TMP"
# Aggregate
multiqc -o "$OUT" "$OUT/fastqc" > "$OUT/logs/multiqc.log" 2>&1 || true
# Summaries: reads, bases, GC (rough), file size
# Use gzcat for macOS compatibility
READS=$(( (uname -s | grep -q Darwin) && gzcat "$TMP" || zcat "$TMP" ) | awk 'END{print NR/4}')
BASES=$(( (uname -s | grep -q Darwin) && gzcat "$TMP" || zcat "$TMP" ) | awk 'NR%4==2{bp+=length($0)} END{print bp+0}')
GC=$(( (uname -s | grep -q Darwin) && gzcat "$TMP" || zcat "$TMP" ) | awk 'NR%4==2{gsub(/[^GgCc]/,"");gc+=length($0);t+=length($0)} END{if(t) printf("%.2f",100*gc/t); else print 0}')
SIZE=$(ls -lh "$TMP" | awk '{print $5}')
{
echo "Input: $IN"
echo "Downsample fraction: $P"
echo "Tmp file: $TMP (size: $SIZE)"
echo "Reads: $READS"
echo "Bases: $BASES"
echo "GC%: $GC"
echo "FastQC out: $OUT/fastqc"
echo "MultiQC report: $OUT/multiqc_report.html"
} > "$OUT/summary.txt"
Make it executable:
4) Run it (examples)
./run_qc.sh -i data/SRR.fastq.gz -o results -p 0.05
# or, if you created SRR.10k.fastq.gz
./run_qc.sh -i data/SRR.10k.fastq.gz -o results
5) Check outputs
Open results/multiqc_report.html.
6) (Optional) Hardening ideas
- Detect paired-end and run FastQC on both mates.
- Add
-tto FastQC for threads; surface-tas a script arg. - Write a CSV/TSV summary for multiple inputs.
Exit Ticket (email)
Subject: DE M10 Exit Ticket –
Attach:
run_qc.shresults/summary.txt- 4–5 sentence reflection (what you'd change next time)