Module 6 — FASTQ 101 (Hands-on)
Time: ~60 min
Goal: Understand 4-line records, count reads safely, and compute simple stats.
Type it, don't paste it
This one is worth typing—head, wc, awk, and zcat/gzcat will become muscle memory.
1) FASTQ anatomy (recap)
Each read = 4 lines:
- @ header (ID + metadata)
- Sequence (A/C/G/T/N)
- + (optional header repeat)
- Quality string (ASCII characters encoding Phred scores)
NCBI and EBI both publish good overview references for the format.
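The four-line rule can be turned into a quick structural check. The sketch below is one possible way to do it, not a standard tool: `check_fastq` is a hypothetical helper name, and it assumes unwrapped FASTQ (exactly four lines per read, sequence and quality each on a single line). It is meant for a small file such as the tiny.fq created in the next step.

```shell
# check_fastq FILE
# Hypothetical helper: reports lines that break the 4-line record layout.
# Assumes unwrapped FASTQ (one sequence line and one quality line per read).
check_fastq() {
  awk '
    NR % 4 == 1 && !/^@/ { printf("line %d: header does not start with @\n", NR); bad = 1 }
    NR % 4 == 2          { seqlen = length($0) }
    NR % 4 == 3 && !/^\+/ { printf("line %d: separator does not start with +\n", NR); bad = 1 }
    NR % 4 == 0 && length($0) != seqlen {
      printf("line %d: quality length differs from sequence length\n", NR); bad = 1
    }
    END {
      if (NR % 4 != 0) { printf("line count %d is not a multiple of 4\n", NR); bad = 1 }
      exit bad + 0   # exit status 0 = no problems found, 1 = at least one problem
    }' "$1"
}

# Usage (skipped if tiny.fq has not been created yet).
if [ -f tiny.fq ]; then
  check_fastq tiny.fq && echo "tiny.fq looks well-formed"
fi
```

A clean file prints nothing from the check itself and returns exit status 0; each problem found is printed with its line number.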
2) Make a tiny FASTQ
mkdir -p ~/de-onramp/lesson3 && cd ~/de-onramp/lesson3
cat > tiny.fq << 'EOF'
@r1
ACGTTGCA
+
IIIIHHHF
@r2
GGGTTTAA
+
FFFFFIII
EOF
3) Sanity checks (robust counting)
Avoid counting reads with grep '^@': @ is also a valid quality character, so a quality line can start with it and inflate the count. Count records as lines ÷ 4 instead:
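A minimal sketch of the lines ÷ 4 count, assuming an uncompressed, unwrapped FASTQ file. `count_reads` is a hypothetical helper name introduced here for illustration; the warning about a line count that is not a multiple of 4 is an extra safety check, not part of the original lesson.

```shell
# count_reads FILE
# Hypothetical helper: number of FASTQ records = number of lines / 4.
count_reads() {
  # The outer $(( )) strips the leading spaces some wc versions print.
  lines=$(( $(wc -l < "$1") ))
  if [ $((lines % 4)) -ne 0 ]; then
    echo "warning: $lines lines is not a multiple of 4 (truncated file?)" >&2
  fi
  echo $((lines / 4))
}

# Usage with the file from step 2 (skipped if it is not present).
if [ -f tiny.fq ]; then
  count_reads tiny.fq   # tiny.fq has 8 lines, so this prints 2
fi
```

To see why the grep shortcut is risky, try a file whose quality line starts with @: `grep -c '^@'` counts that line as a read, while the line count does not change.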
For gz files:
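A sketch of the same count for compressed input, streaming the decompressed data instead of writing it to disk. It uses `gzip -dc`, which behaves like zcat on Linux and gzcat on macOS; `count_reads_gz` is a hypothetical helper name, and tiny.fq.gz is a compressed copy made here only for practice.

```shell
# count_reads_gz FILE.gz
# Hypothetical helper: decompress to stdout, count lines, divide by 4.
count_reads_gz() {
  echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Usage: make a compressed copy of the step 2 file, then count it.
if [ -f tiny.fq ]; then
  gzip -c tiny.fq > tiny.fq.gz   # -c writes to stdout, so tiny.fq is kept
  count_reads_gz tiny.fq.gz      # same 8 lines after decompression, so this prints 2
fi
```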
4) Peek at sequences & lengths
# Show the first two records (8 lines)
head -n 8 tiny.fq
# Longest read length
awk 'NR%4==2{ if(length($0)>m) m=length($0) } END{ print m }' tiny.fq
5) Quick GC%
GC% for the first N reads (e.g., 1000), or for all reads if the file has fewer than N:
N=1000
awk -v N="$N" 'NR%4==2{
seq=$0; gc_seq=seq; gsub(/[^GgCc]/,"",gc_seq)
gc+=length(gc_seq); bp+=length(seq); n++
if(n==N) exit
} END{ if(bp==0) print 0; else printf("%.2f\n", 100*gc/bp) }' tiny.fq
For gzipped input:
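One way to do this is to pipe the decompressed stream into the same awk program, so no uncompressed copy is written. This is a sketch: `gc_percent_gz` is a hypothetical helper name, and `gzip -dc` stands in for zcat/gzcat.

```shell
# gc_percent_gz FILE.gz [N]
# Hypothetical helper: GC% over the first N reads (default 1000) of a gzipped FASTQ.
gc_percent_gz() {
  gzip -dc "$1" | awk -v N="${2:-1000}" 'NR%4==2{
    seq=$0; gc_seq=seq; gsub(/[^GgCc]/,"",gc_seq)   # keep only G and C
    gc+=length(gc_seq); bp+=length(seq); n++
    if(n==N) exit                                    # stop early; END still runs
  } END{ if(bp==0) print 0; else printf("%.2f\n", 100*gc/bp) }'
}

# Usage (needs tiny.fq.gz, e.g. made with: gzip -c tiny.fq > tiny.fq.gz).
if [ -f tiny.fq.gz ]; then
  gc_percent_gz tiny.fq.gz 1000   # for the step 2 reads: 7 G/C out of 16 bases = 43.75
fi
```

On large files awk exits before gzip has finished writing, so gzip may be stopped by a broken pipe. That is harmless here, but it can make the pipeline report a failure if `set -o pipefail` is active.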
6) Quality strings (concept check)
Quality characters map to Phred scores (usually Phred+33). You won't decode them by hand, but recognize: higher ASCII → higher quality. (We'll rely on fastqc/multiqc next to visualize.)
EBI/NCBI format notes cover header & quality conventions.
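To make the mapping concrete, here is a small sketch that decodes each quality string and prints the mean Phred score per read. It assumes Phred+33 encoding (score = ASCII code - 33) and unwrapped FASTQ; `mean_quality` is a hypothetical helper name, and the lookup table is needed because POSIX awk has no built-in function for a character's ASCII code.

```shell
# mean_quality FILE
# Hypothetical helper: prints "read_id<TAB>mean Phred score" for each record.
# Assumes Phred+33 encoding.
mean_quality() {
  awk '
    BEGIN {
      # Build a character -> ASCII code table for the printable range.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 1 { id = substr($1, 2) }          # header without the leading @
    NR % 4 == 0 && length($0) > 0 {
      sum = 0
      for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
      printf("%s\t%.2f\n", id, sum / length($0))
    }' "$1"
}

# Usage with the file from step 2 (skipped if it is not present).
if [ -f tiny.fq ]; then
  mean_quality tiny.fq
fi
```

As a hand check for r1 (IIIIHHHF): I is ASCII 73 (Q40), H is 72 (Q39), and F is 70 (Q37), so the mean is (4×40 + 3×39 + 37) / 8 = 39.25.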
7) Mini-lab
- Create toy.fq with 3 reads of different lengths.
- Count reads robustly (lines/4).
- Compute longest read length and GC% of first 3 reads.
- Write a one-sentence note: what would you look at in raw output before writing a processing loop?
Exit Ticket (email)
Subject: DE M6 Exit Ticket –
Paste:
- Your read count, longest length, and GC% results.
- The one-sentence "look before you loop" habit, in your own words.