Module 6 — FASTQ 101 (Hands-on)

Time: ~60 min
Goal: Understand 4-line records, count reads safely, and compute simple stats.

Type it, don't paste it

This one is worth typing—head, wc, awk, and zcat/gzcat will become muscle memory.

1) FASTQ anatomy (recap)

Each read = 4 lines:

@ header (ID + metadata)
Sequence (A/C/G/T/N)
+ (optional header repeat)
Quality string (ASCII characters encoding Phred scores)

NCBI and EBI both publish good overviews of the format.
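As a quick illustration of the 4-line layout, the sketch below labels each line of a stream by its position within the record. The single record and the role names (header, sequence, separator, quality) are made up for this example, and it assumes the common layout where no record wraps across extra lines.

```shell
# Label each line by its role in the 4-line record (NR%4 gives the position).
# The record piped in here is a made-up example, not real data.
labeled=$(printf '@r1\nACGT\n+\nIIII\n' | awk '
  NR%4==1 { role = "header" }
  NR%4==2 { role = "sequence" }
  NR%4==3 { role = "separator" }
  NR%4==0 { role = "quality" }
  { printf "%-9s %s\n", role, $0 }')
echo "$labeled"
```

The same NR%4 test is what the later awk one-liners use to pick out sequence lines (NR%4==2).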

2) Make a tiny FASTQ

mkdir -p ~/de-onramp/lesson3 && cd ~/de-onramp/lesson3
cat > tiny.fq << 'EOF'
@r1
ACGTTGCA
+
IIIIHHHF
@r2
GGGTTTAA
+
FFFFFIII
EOF

3) Sanity checks (robust counting)

Avoid grep '^@': @ is also a valid quality character (Phred+33 score 31), so a quality line can start with it and inflate the count. Count records as lines ÷ 4 instead:

wc -l tiny.fq
# wc prints the line count L before the filename; reads = L/4 (here 8/4 = 2)
awk 'END{print NR/4}' tiny.fq
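To see the difference in practice, here is a small sketch using a made-up file, at_demo.fq (a hypothetical name), whose second quality line happens to start with @. It also adds one possible safety check: refusing to report a count when the line total is not a multiple of 4, which usually points to a truncated or malformed file.

```shell
# Made-up two-read file; the second quality line starts with "@".
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n@III\n' > at_demo.fq

# grep counts every line that starts with "@", including that quality line.
grep_count=$(grep -c '^@' at_demo.fq)

# Lines/4 counts records; complain if the line count is not a multiple of 4.
safe_count=$(awk 'END{
  if (NR % 4) { print "line count " NR " is not a multiple of 4" > "/dev/stderr"; exit 1 }
  print NR / 4
}' at_demo.fq)

echo "grep: $grep_count  lines/4: $safe_count"
```

On this file grep reports 3 while lines/4 reports the true 2.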

For gz files:

Linux/WSL:

gzip -c tiny.fq > tiny.fq.gz
zcat tiny.fq.gz | awk 'END{print NR/4}'

macOS:

gzip -c tiny.fq > tiny.fq.gz
gzcat tiny.fq.gz | awk 'END{print NR/4}'
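If you would rather not remember which of zcat or gzcat a given machine has, one option is gzip -dc (decompress to standard output), which should behave the same wherever gzip itself is installed. A minimal sketch, using a hypothetical file name portable.fq so it does not touch tiny.fq:

```shell
# Same two reads as tiny.fq, written to a separate made-up file name.
printf '@r1\nACGTTGCA\n+\nIIIIHHHF\n@r2\nGGGTTTAA\n+\nFFFFFIII\n' > portable.fq
gzip -c portable.fq > portable.fq.gz

# -d decompresses, -c writes to stdout; the .gz file is left in place.
gz_reads=$(gzip -dc portable.fq.gz | awk 'END{print NR/4}')
echo "$gz_reads"
```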

4) Peek at sequences & lengths

# Show the first two records (8 lines)
head -n 8 tiny.fq

# Longest read length
awk 'NR%4==2{ if(length($0)>m) m=length($0) } END{ print m }' tiny.fq
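The longest-read one-liner extends naturally to a small length summary. The sketch below is one way to do it, run on three made-up reads of lengths 4, 6, and 8; the output format (reads=, min=, max=, mean=) is just a choice for this example.

```shell
# Track min, max, and total length over the sequence lines (NR%4==2).
len_stats=$(printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGTACGT\n+\nIIIIIIII\n' |
awk 'NR%4==2 {
  len = length($0)
  if (n == 0 || len < min) min = len   # first read seeds the minimum
  if (len > max) max = len
  total += len; n++
}
END {
  if (n == 0) print "no reads"
  else printf "reads=%d min=%d max=%d mean=%.2f\n", n, min, max, total / n
}')
echo "$len_stats"
```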

5) Quick GC%

GC% over the first N reads (e.g., 1000), or over all reads if the file has fewer:

N=1000
awk -v N="$N" 'NR%4==2{
  seq=$0; gc_seq=seq; gsub(/[^GgCc]/,"",gc_seq)
  gc+=length(gc_seq); bp+=length(seq); n++
  if(n==N) exit
} END{ if(bp==0) print 0; else printf("%.2f\n", 100*gc/bp) }' tiny.fq

For gzipped input:

Linux/WSL:

zcat tiny.fq.gz | awk -v N=1000 'NR%4==2{ ...same body... }'

macOS:

gzcat tiny.fq.gz | awk -v N=1000 'NR%4==2{ ...same body... }'
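One way to avoid retyping the awk body for each input type is to wrap it in a shell function that reads from standard input. The function name gc_percent and the file name gc_demo.fq are hypothetical, chosen for this sketch; the awk body is the one from the plain-file command above.

```shell
# gc_percent [N]: GC% over the first N reads on stdin (default 1000).
gc_percent() {
  awk -v N="${1:-1000}" 'NR%4==2{
    seq=$0; gc_seq=seq; gsub(/[^GgCc]/,"",gc_seq)   # keep only G and C
    gc+=length(gc_seq); bp+=length(seq); n++
    if(n==N) exit                                    # exit still runs END
  } END{ if(bp==0) print 0; else printf("%.2f\n", 100*gc/bp) }'
}

# Same two reads as tiny.fq, under a separate made-up file name.
printf '@r1\nACGTTGCA\n+\nIIIIHHHF\n@r2\nGGGTTTAA\n+\nFFFFFIII\n' > gc_demo.fq
gzip -c gc_demo.fq > gc_demo.fq.gz

gc_plain=$(gc_percent 1000 < gc_demo.fq)             # plain input
gc_gz=$(gzip -dc gc_demo.fq.gz | gc_percent 1000)    # gzipped input
gc_first=$(gc_percent 1 < gc_demo.fq)                # first read only
echo "$gc_plain $gc_gz $gc_first"
```

Both reads together contain 7 G/C bases out of 16, so the plain and gzipped runs should agree at 43.75, while the first read alone gives 50.00.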

6) Quality strings (concept check)

Quality characters map to Phred scores, usually as Phred+33: score = ASCII code − 33, so I (ASCII 73) is Q40. You won't decode them by hand, but recognize: higher ASCII → higher quality. (We'll rely on fastqc/multiqc next to visualize.)

EBI/NCBI format notes cover header & quality conventions.
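For the curious, the mapping can be checked with awk. This is only a sketch and assumes Phred+33 encoding; it builds a character-to-code lookup table because awk has no built-in function for that, then prints the mean quality of each read. The record is r1 from tiny.fq.

```shell
# Mean Phred score per read, ASSUMING Phred+33 (score = ASCII code - 33).
mean_q=$(printf '@r1\nACGTTGCA\n+\nIIIIHHHF\n' | awk '
  BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
  NR%4==0 {
    if (length($0) == 0) next          # skip empty quality lines
    total = 0
    for (j = 1; j <= length($0); j++) total += ord[substr($0, j, 1)] - 33
    printf "%.2f\n", total / length($0)
  }')
echo "$mean_q"
```

Here I, H, and F decode to 40, 39, and 37, giving a mean of 39.25 for the eight bases.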

7) Mini-lab

Create toy.fq with 3 reads of different lengths.
Count reads robustly (lines/4).
Compute longest read length and GC% of first 3 reads.
Write a one-sentence note: what would you look at in raw output before writing a processing loop?

Exit Ticket (email)

Subject: DE M6 Exit Ticket –
Paste:

Your read count, longest length, and GC% results.
The one-sentence "look before you loop" habit, in your own words.