Module 5 — Pipes, Redirects & Filters
Time: 90–120 min (this is a big one!) Goal: Master the Unix philosophy: combine simple tools using pipes to accomplish complex data processing tasks. Exit ticket (email me): Your solutions to the RNA-seq sample analysis challenge showing command pipelines and outputs.
What You'll Learn
By the end of this module, you will understand:
- How programs communicate: standard input, output, and error streams
- How to redirect output to files with >, >>, and 2>
- How to chain commands together with pipes (|)
- Essential filter commands: grep, cut, sort, uniq, wc
- How to use tee to split output
- How to build complex data processing pipelines
- Real-world bioinformatics data wrangling with TSV files
The Unix Philosophy
Before we dive into commands, let's understand the philosophy behind Unix tools:
"Write programs that do one thing and do it well. Write programs to work together." — Doug McIlroy, inventor of Unix pipes
Instead of one massive program that does everything, Unix provides many small, focused tools that each do one thing well. You combine them like LEGO blocks to solve complex problems.
Example:
Rather than a single "analyze-genomic-data" program, you chain together:
- grep (find patterns)
- cut (extract columns)
- sort (order data)
- uniq (find unique entries)
- awk (process text)
This modular approach is perfect for bioinformatics, where data formats and analyses vary wildly.
Understanding Standard Streams
Every program in Unix has three communication channels called streams:
1. Standard Input (stdin)
stdin is where a program reads its input data.
- By default: keyboard input
- Can be: redirected from a file or another program
2. Standard Output (stdout)
stdout is where a program writes its normal output.
- By default: printed to your terminal screen
- Can be: redirected to a file or another program
3. Standard Error (stderr)
stderr is where a program writes error messages.
- By default: printed to your terminal screen (same as stdout)
- Can be: redirected separately from stdout
Visual representation:
┌─────────────┐
│ Program │
├─────────────┤
│ stdin (0) ←┼─ Input data
│ stdout (1) ─┼→ Normal output
│ stderr (2) ─┼→ Error messages
└─────────────┘
Each stream has a number:
- stdin = 0
- stdout = 1
- stderr = 2
Why separate stdout and stderr?
So you can:
- Save normal output to a file while still seeing errors on screen
- Process output with other commands without error messages interfering
- Log errors separately from results
Output Redirection: Saving Results to Files
Redirect stdout with >
The > operator redirects stdout to a file, overwriting the file if it exists.
Example:
cd ~/bioinfo-course
mkdir -p module05
cd module05
# Redirect output to a file
ls -lh > directory_listing.txt
# View the file
cat directory_listing.txt
The ls output went to the file instead of your screen.
Another example:
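# Save the current date to a file (a simple illustrative example)
date > today.txt
cat today.txt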
Append with >>
The >> operator redirects stdout to a file, appending to it (adds to the end without erasing existing content).
Example:
echo "Experiment started: $(date)" > lab_notes.txt
echo "Sample 1 processed" >> lab_notes.txt
echo "Sample 2 processed" >> lab_notes.txt
cat lab_notes.txt
Output:
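Experiment started: <your current date and time>
Sample 1 processed
Sample 2 processed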
Single > vs Double >>
- > overwrites the file (destroys existing content)
- >> appends to the file (adds to the end)
Common mistake:
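# Using > when you meant >> wipes out everything already in the file
echo "Sample 3 processed" > lab_notes.txt
cat lab_notes.txt    # only "Sample 3 processed" remains — the earlier notes are gone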
Redirect stderr with 2>
To redirect error messages separately from normal output, use 2>:
Example:
# Try to list a nonexistent file
ls /nonexistent/path > output.txt 2> errors.txt
# Check both files
cat output.txt # Empty (no normal output)
cat errors.txt # Contains the error message
Redirect both stdout and stderr
Redirect both to the same file:
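command > all_output.txt 2>&1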
2>&1 means "redirect stderr (2) to wherever stdout (1) is going."
Modern simpler syntax (bash 4+):
Example:
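# Same effect, shorter syntax
ls /nonexistent/path ~ &> combined_output.txt
cat combined_output.txt    # contains both the error and the listing of your home directory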
Discard output with /dev/null
/dev/null is a special "black hole" file that discards everything sent to it.
Example: Run a command silently (no output at all):
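command > /dev/null 2>&1    # discard both stdout and stderr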
Example: Keep only errors:
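command > /dev/null         # stdout discarded; errors still appear on screen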
Example: Keep only normal output:
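command 2> /dev/null        # error messages discarded; normal output still shown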
Practical Example: Logging an Analysis
# Run an analysis and log everything
analysis_script.sh > results.txt 2> errors.log
# Or capture both in one log
analysis_script.sh &> full_log.txt
Pipes: Chaining Commands Together
A pipe (|) connects the stdout of one command to the stdin of another.
The output of command1 becomes the input to command2.
Visual representation:
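command1 | command2

┌──────────┐   stdout → stdin   ┌──────────┐
│ command1 │ ───────── | ─────► │ command2 │
└──────────┘      (the pipe)    └──────────┘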
Your First Pipe
Count lines in a directory listing:
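ls | wc -l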
Breaking it down:
1. ls lists files → stdout
2. | pipes that output to...
3. wc -l counts the lines
Example:
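# Count the entries in your module05 directory (the exact number depends on what you've created so far)
ls ~/bioinfo-course/module05 | wc -l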
Output: 10 (your count will depend on how many entries the directory contains)
Chaining Multiple Pipes
You can chain as many pipes as you want:
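command1 | command2 | command3 | command4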
Example: Find the 5 largest files
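ls -lh | sort -k 5 -hr | head -n 5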
Breaking it down:
1. ls -lh lists files with human-readable sizes
2. sort -k5 -hr sorts by column 5 (size), human-readable, reverse order (largest first)
3. head -n 5 shows only the first 5 lines
Pipes vs Redirection
Pipes (|) connect commands:
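ls | wc -l                 # the output of ls flows into wc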
Redirection (>) connects commands to files:
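ls > file_list.txt         # the output of ls is written to a file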
You can use both:
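ls | grep '\.txt$' > text_files.txt    # filter with a pipe, then save with a redirect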
Essential Filter Commands
"Filter commands" are programs that read from stdin, process data, and write to stdout. They're perfect for pipes!
1. grep — Search for Patterns
grep searches for lines matching a pattern.
grep stands for "globally search for a regular expression and print."
Basic usage:
# Create a sample file
cat > samples.txt << 'EOF'
sample001 tumor pass
sample002 normal pass
sample003 tumor fail
sample004 normal pass
sample005 tumor pass
EOF
# Search for lines containing "tumor"
grep 'tumor' samples.txt
Output:
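sample001 tumor pass
sample003 tumor fail
sample005 tumor pass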
Common grep options:
| Option | What It Does | Example |
|---|---|---|
| `-i` | Case-insensitive search | `grep -i 'TUMOR' samples.txt` |
| `-v` | Invert match (show lines NOT matching) | `grep -v 'tumor' samples.txt` |
| `-c` | Count matching lines | `grep -c 'tumor' samples.txt` |
| `-n` | Show line numbers | `grep -n 'tumor' samples.txt` |
| `-w` | Match whole words only | `grep -w 'pass' samples.txt` |
| `-A N` | Show N lines after match | `grep -A 2 'fail' samples.txt` |
| `-B N` | Show N lines before match | `grep -B 1 'fail' samples.txt` |
| `-C N` | Show N lines before and after | `grep -C 1 'fail' samples.txt` |
| `-E` | Extended regex (more powerful patterns) | `grep -E 'tumor\|normal' samples.txt` |
Examples:
# Case-insensitive search
grep -i 'PASS' samples.txt
# Find lines that DON'T contain "tumor"
grep -v 'tumor' samples.txt
# Count how many samples passed
grep -c 'pass' samples.txt
# Show line numbers
grep -n 'fail' samples.txt
Using grep in pipes:
# List all Python scripts in the current directory
ls | grep '\.py$'
# Find running processes related to python
ps aux | grep python
# Count how many .txt files exist
ls | grep '\.txt$' | wc -l
2. cut — Extract Columns
cut extracts specific columns (fields) from each line.
For tab-separated files:
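cut -f 1 file.tsv            # field 1; tab is the default delimiter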
For comma-separated files:
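cut -d ',' -f 1 file.csv     # tell cut to split on commas instead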
Common options:
| Option | What It Does |
|---|---|
| `-f N` | Select field(s) N (columns) |
| `-d 'X'` | Use X as delimiter (default is tab) |
| `-f 1,3,5` | Select columns 1, 3, and 5 |
| `-f 1-3` | Select columns 1 through 3 |
Example:
# Create a TSV file (make sure the columns are separated by TAB characters, not spaces)
cat > data.tsv << 'EOF'
sample001 tumor 85 pass
sample002 normal 92 pass
sample003 tumor 45 fail
sample004 normal 88 pass
EOF
# Extract sample IDs (column 1)
cut -f 1 data.tsv
# Extract sample ID and quality score (columns 1 and 3)
cut -f 1,3 data.tsv
# Extract columns 1-3
cut -f 1-3 data.tsv
Using cut in pipes:
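# Sample IDs of samples labeled "pass" (assumes data.tsv is tab-separated)
grep 'pass' data.tsv | cut -f 1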
Working with TSV vs CSV
- TSV (tab-separated): cut -f 1 (the default delimiter is tab)
- CSV (comma-separated): cut -d ',' -f 1
TSV is often easier to work with because commas can appear inside fields.
3. sort — Sort Lines
sort arranges lines in order.
Basic usage:
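For example, sorting a small (illustrative) list of gene names:
cat > gene_list.txt << 'EOF'
TP53
BRCA1
MYC
EGFR
EOF
sort gene_list.txt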
Wait, that gives:
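BRCA1
EGFR
MYC
TP53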
Perfect! But what about...
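# A small file of numbers (values are illustrative)
cat > scores.txt << 'EOF'
10
2
100
25
EOF
sort scores.txt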
Output:
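10
100
2
25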
That's wrong! It's sorting as text (lexicographic), not numbers.
Fix: numeric sort
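sort -n scores.txt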
Output:
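2
10
25
100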
Common sort options:
| Option | What It Does | Example |
|---|---|---|
| `-n` | Numeric sort | `sort -n numbers.txt` |
| `-r` | Reverse order | `sort -r file.txt` |
| `-k N` | Sort by column N | `sort -k 2 file.tsv` |
| `-t 'X'` | Use X as field delimiter | `sort -t ',' -k 2 file.csv` |
| `-u` | Unique (remove duplicates) | `sort -u file.txt` |
| `-h` | Human-readable numbers (1K, 2M, 3G) | `sort -h sizes.txt` |
Examples:
# Sort by second column
sort -k 2 data.tsv
# Sort by third column numerically, reverse order
sort -k 3 -nr data.tsv
# Sort by multiple columns: first by column 2, then by column 3
sort -k 2,2 -k 3,3n data.tsv
Using sort in pipes:
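# The five largest files in the current directory
ls -lh | sort -k 5 -hr | head -n 5
# Unique QC statuses in data.tsv (assumes tab-separated columns)
cut -f 4 data.tsv | sort -u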
4. uniq — Find Unique Lines
uniq removes or reports duplicate adjacent lines.
Important: uniq only removes adjacent duplicates, so you almost always want to sort first!
Example:
cat > fruits.txt << 'EOF'
apple
banana
apple
apple
orange
banana
EOF
# Without sorting (only adjacent duplicates removed)
uniq fruits.txt
Output:
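apple
banana
apple
orange
banana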
Not what we wanted!
With sorting:
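sort fruits.txt | uniq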
Output:
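apple
banana
orange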
Perfect!
Common uniq options:
| Option | What It Does | Example |
|---|---|---|
| `-c` | Count occurrences | `sort file.txt \| uniq -c` |
| `-d` | Show only duplicates | `sort file.txt \| uniq -d` |
| `-u` | Show only unique lines (no duplicates) | `sort file.txt \| uniq -u` |
| `-i` | Ignore case | `sort file.txt \| uniq -i` |
Examples:
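# Count how many times each fruit appears
sort fruits.txt | uniq -c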
Output:
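      3 apple
      2 banana
      1 orange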
Using uniq in pipes:
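# Which sample type is most common in data.tsv? (assumes tab-separated columns)
cut -f 2 data.tsv | sort | uniq -c | sort -nr | head -n 1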
Breaking it down:
1. cut -f 2 data.tsv → extract sample types
2. sort → group same types together
3. uniq -c → count each type
4. sort -nr → sort by count (numeric, reverse)
5. head -n 1 → show the top one
5. wc — Count Lines, Words, Characters
We've used wc before, but let's see it in detail:
Default: shows lines, words, and characters:
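wc data.tsv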
Output (for our data.tsv):
  4  16 104 data.tsv
This means: 4 lines, 16 words, 104 characters (the exact character count depends on the whitespace in your copy of the file).
Common options:
| Option | What It Does |
|---|---|
| `-l` | Count lines only |
| `-w` | Count words only |
| `-c` | Count characters (bytes) |
| `-m` | Count characters (multibyte-aware) |
Using wc in pipes:
# Count tumor samples
grep 'tumor' data.tsv | wc -l
# Count unique sample types
cut -f 2 data.tsv | sort -u | wc -l
6. tee — Split Output
tee reads from stdin and writes to both stdout AND a file.
Think of a "T" pipe junction—the data flows in one end and splits to two outputs.
Example:
# Save intermediate results while continuing the pipeline
grep 'tumor' data.tsv | tee tumor_samples.txt | wc -l
This:
1. Filters for tumor samples
2. Saves them to tumor_samples.txt
3. Continues piping to wc -l to count them
Multiple outputs:
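# tee accepts several filenames and writes to all of them (the names here are just examples)
grep 'pass' data.tsv | tee pass_copy1.txt pass_copy2.txt | wc -l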
Append mode:
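echo "QC review complete: $(date)" | tee -a lab_notes.txt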
The -a flag appends instead of overwriting.
Building Complex Pipelines
Now let's combine everything to solve real problems!
Example 1: Analyze a Gene List
Setup:
cd ~/bioinfo-course/module05
cat > genes.txt << 'EOF'
BRCA1 tumor_suppressor chr17 pass
TP53 tumor_suppressor chr17 pass
MYC oncogene chr8 pass
EGFR oncogene chr7 fail
KRAS oncogene chr12 pass
PTEN tumor_suppressor chr10 pass
BRAF oncogene chr7 pass
BRCA2 tumor_suppressor chr13 pass
RB1 tumor_suppressor chr13 fail
EOF
Task: Find tumor suppressor genes that passed QC, sorted by chromosome
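grep 'tumor_suppressor' genes.txt | grep 'pass' | sort -k 3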
Breaking it down:
1. grep 'tumor_suppressor' → filter for tumor suppressors
2. grep 'pass' → filter for those that passed QC
3. sort -k 3 → sort by chromosome (column 3)
Task: Count genes per chromosome
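# Column 3 is the chromosome (assumes genes.txt is tab-separated)
cut -f 3 genes.txt | sort | uniq -c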
Output:
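      1 chr10
      1 chr12
      2 chr13
      2 chr17
      2 chr7
      1 chr8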
Task: Find the most common gene type
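cut -f 2 genes.txt | sort | uniq -c | sort -nr | head -n 1
# Output:       5 tumor_suppressor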
Example 2: Process a Sample Manifest
Setup:
cat > manifest.tsv << 'EOF'
SampleID Type Date QC_Score Status
S001 tumor 2024-01-10 95 pass
S002 normal 2024-01-10 88 pass
S003 tumor 2024-01-11 45 fail
S004 normal 2024-01-11 92 pass
S005 tumor 2024-01-12 87 pass
S006 normal 2024-01-12 91 pass
S007 tumor 2024-01-13 30 fail
S008 normal 2024-01-13 89 pass
S009 tumor 2024-01-14 94 pass
S010 normal 2024-01-14 90 pass
EOF
Task: Extract sample IDs that passed QC
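# Skip the header, keep passing samples, extract the SampleID column (assumes tab-separated columns)
tail -n +2 manifest.tsv | grep 'pass' | cut -f 1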
Task: Calculate average QC score for passed samples
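One way to do it, previewing awk (assumes tab-separated columns; column 4 is QC_Score; printf rounds to the nearest whole number):
grep 'pass' manifest.tsv | cut -f 4 | awk '{ sum += $1; n++ } END { printf "%.0f\n", sum/n }'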
Output: 91
(We'll learn awk properly in a future module!)
Task: Count samples by type and status
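tail -n +2 manifest.tsv | cut -f 2,5 | sort | uniq -c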
Breaking it down:
1. tail -n +2 manifest.tsv → skip the header line (start from line 2)
2. cut -f 2,5 → extract Type and Status columns
3. sort → group same combinations
4. uniq -c → count each combination
Output:
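      5 normal	pass
      2 tumor	fail
      3 tumor	pass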
The three groups account for all 10 samples: 5 normal/pass, 3 tumor/pass, and 2 tumor/fail.
As an alternative to skipping the header with tail -n +2, you can exclude it with grep -v:
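# -v drops any line containing "SampleID" — here, just the header
grep -v 'SampleID' manifest.tsv | cut -f 2,5 | sort | uniq -c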
Practical Exercise: Genomic Data Wrangling
Let's work with realistic genomic data!
Setup: Variant Call Data
cd ~/bioinfo-course/module05
cat > variants.tsv << 'EOF'
Chr Pos Ref Alt Gene Type Quality Filter
chr1 12345 A G GENE1 SNP 98 PASS
chr1 67890 C T GENE2 SNP 45 LowQual
chr2 11111 G A GENE3 SNP 99 PASS
chr2 22222 A T GENE3 SNP 87 PASS
chr3 33333 C G GENE4 SNP 32 LowQual
chr3 44444 T A GENE5 SNP 95 PASS
chr3 55555 G C GENE5 SNP 91 PASS
chr4 66666 A C GENE6 SNP 88 PASS
chr4 77777 C T GENE6 SNP 41 LowQual
chr5 88888 G A GENE7 SNP 97 PASS
chr5 99999 T C GENE8 SNP 93 PASS
chrX 11111 A G GENE9 SNP 89 PASS
chrX 22222 C A GENE10 SNP 48 LowQual
EOF
Challenges
For each challenge, build a pipeline and record your command!
Challenge 1: Count how many variants passed QC (Filter = PASS)
Challenge 2: Extract chromosome and position for all failed variants
Challenge 3: List unique genes that have variants
Solution
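One possible solution (assumes tab-separated columns; column 5 is Gene):
tail -n +2 variants.tsv | cut -f 5 | sort -u
Or, equivalently:
tail -n +2 variants.tsv | cut -f 5 | sort | uniq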
Challenge 4: Count variants per chromosome
Challenge 5: Find the variant with the highest quality score
Challenge 6: List genes on chr3 that have passing variants
Challenge 7: Count how many passing variants are on chromosome X
Challenge 8: Create a file with only passing variants, sorted by quality (highest first)
Solution
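One possible solution (the output filename is a suggestion; assumes tab-separated columns with Quality in column 7):
head -n 1 variants.tsv > passing_variants.tsv
grep 'PASS' variants.tsv | sort -k 7 -nr >> passing_variants.tsv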
The Exit Ticket Challenge: RNA-seq Sample Analysis
Scenario
You've received an RNA-seq sample manifest from the sequencing core. Your task: analyze the manifest and generate summary statistics using pipes and filters.
Setup
cd ~/bioinfo-course/module05
cat > rnaseq_manifest.tsv << 'EOF'
SampleID Condition Batch ReadCount MappingRate Status Notes
P001_T1 control Batch1 25000000 95.2 pass good
P001_T2 control Batch1 26000000 94.8 pass good
P002_T1 treated Batch1 24500000 93.1 pass good
P002_T2 treated Batch1 23000000 89.3 pass acceptable
P003_T1 control Batch2 22000000 96.1 pass good
P003_T2 control Batch2 21500000 95.8 pass good
P004_T1 treated Batch2 19000000 88.7 pass acceptable
P004_T2 treated Batch2 17000000 72.4 fail low_mapping
P005_T1 control Batch3 26500000 94.3 pass good
P005_T2 control Batch3 25800000 93.9 pass good
P006_T1 treated Batch3 24300000 91.2 pass good
P006_T2 treated Batch3 15000000 68.1 fail low_mapping
P007_T1 control Batch3 27000000 96.5 pass excellent
P007_T2 control Batch3 26800000 95.9 pass excellent
P008_T1 treated Batch3 23900000 90.8 pass good
P008_T2 treated Batch3 16500000 70.3 fail low_mapping
EOF
Your Tasks
Build pipelines to answer these questions. Record each command and its output.
Task 1: How many samples passed QC?
Task 2: How many samples failed QC?
Task 3: List all sample IDs that failed (just the IDs, one per line)
Task 4: Count samples by condition (control vs treated)
Task 5: Count samples by batch
Task 6: What is the average read count for samples that passed? (Hint: you'll need awk)
Task 7: Find the sample with the highest read count (show the whole line)
Task 8: Find the sample with the lowest mapping rate (show sample ID and mapping rate)
Task 9: Create a new file called qc_pass_samples.tsv containing:
- The header line
- Only samples that passed QC
- Sorted by read count (highest first)
Task 10: Generate a summary report file summary.txt containing:
- Total number of samples
- Number that passed
- Number that failed
- Number of control samples
- Number of treated samples
(You'll need to run multiple commands and redirect outputs to build this file)
Quick Reference
Redirection
| Operator | What It Does | Example |
|---|---|---|
| `>` | Redirect stdout (overwrite) | `ls > files.txt` |
| `>>` | Redirect stdout (append) | `echo "data" >> log.txt` |
| `2>` | Redirect stderr | `command 2> errors.log` |
| `&>` | Redirect both stdout and stderr | `command &> all.log` |
| `2>&1` | Redirect stderr to stdout | `command > all.log 2>&1` |
| `<` | Redirect stdin from file | `command < input.txt` |
| `\|` | Pipe stdout to next command | `ls \| wc -l` |
Essential Filters
| Command | What It Does | Common Options |
|---|---|---|
| `grep 'pattern'` | Search for patterns | `-i`, `-v`, `-c`, `-n`, `-E` |
| `cut -f N` | Extract columns | `-f`, `-d` |
| `sort` | Sort lines | `-n`, `-r`, `-k`, `-u` |
| `uniq` | Remove adjacent duplicates | `-c`, `-d`, `-u` |
| `wc` | Count lines/words/characters | `-l`, `-w`, `-c` |
| `tee file` | Split output to file and stdout | `-a` (append) |
Common Patterns
# Count unique values
cut -f 2 file.tsv | sort -u | wc -l
# Find most common value
cut -f 2 file.tsv | sort | uniq -c | sort -nr | head -n 1
# Filter and extract
grep 'pattern' file.tsv | cut -f 1,3
# Sort by numeric column
sort -k 3 -nr file.tsv
# Remove header, process data
tail -n +2 file.tsv | cut -f 2 | sort | uniq -c
Exit Ticket
To complete this module, send me an email with:
Subject: Bioinfo M5 Exit Ticket – [Your Name]
Content:
- Your commands and outputs for all 10 tasks in the RNA-seq Sample Analysis Challenge
- The contents of your qc_pass_samples.tsv file
- The contents of your summary.txt file
- One paragraph (3-5 sentences) explaining the Unix philosophy and why pipes are powerful for bioinformatics
Summary
Congratulations! You now understand:
✓ How Unix programs communicate via stdin, stdout, and stderr
✓ How to redirect output to files with >, >>, and 2>
✓ How to chain commands with pipes (|)
✓ Essential filter commands: grep, cut, sort, uniq, wc
✓ How to use tee to split output
✓ How to build complex data processing pipelines
✓ Real-world bioinformatics data wrangling techniques
This is one of the most important skills in Unix! Pipes let you combine simple tools to solve complex problems, and this modular approach is perfect for the varied analyses you'll do in bioinformatics.
In the next module, we'll learn about regular expressions (regex) for even more powerful pattern matching.