Module 8 — Text Editing & File Compression
Time: 75–90 min
Goal: Master basic text editing with nano, work efficiently with compressed files, and verify file integrity with checksums.
Exit ticket (email me): Your completed configuration file, checksum verification output, and analysis script.
What You'll Learn
By the end of this module, you will understand:
- How to edit text files with nano
- Essential nano keyboard shortcuts
- How to safely edit configuration and script files
- Why bioinformatics files are compressed
- How to compress and decompress files with gzip
- How to work with compressed files without decompressing them
- How to verify file integrity with checksums (md5sum, sha256sum)
- Best practices for handling large genomic datasets
Part 1: Text Editing with Nano
Why Learn a Terminal Text Editor?
In bioinformatics, you'll frequently need to:
- Edit configuration files on remote servers
- Fix script errors
- Create analysis parameter files
- Add comments to data files
You need a text editor that works in the terminal (no GUI required).
Popular terminal editors:
- nano: Beginner-friendly, simple
- vim/vi: Powerful but steep learning curve
- emacs: Very powerful but complex
We'll learn nano because it:
- Is pre-installed on most systems
- Is easy to use immediately
- Displays helpful hints at the bottom of the screen
- Is perfect for quick edits
Starting Nano
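Start nano by giving it a file name, for example nano my_notes.txt. If the file already exists, nano opens it for editing; if it doesn't, nano starts with an empty buffer and creates the file when you save.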
Practice:
You'll see the nano editor interface:
GNU nano 5.4 my_notes.txt
[Blank editing area]
^G Get Help ^O Write Out ^W Where Is ^K Cut Text ^J Justify
^X Exit ^R Read File ^\ Replace ^U Paste Text ^T To Spell
The ^ symbol means Ctrl. So ^X means Ctrl+X.
Basic Nano Operations
Typing Text
Just start typing! Nano works like a simple text editor.
Moving Around
| Key | Action |
|---|---|
| Arrow keys | Move cursor |
| Ctrl+A | Jump to beginning of line |
| Ctrl+E | Jump to end of line |
| Ctrl+Y | Page up |
| Ctrl+V | Page down |
| Ctrl+_ (underscore) | Go to line number |
Saving (Writing Out)
- Press Ctrl+O (Write Out)
- Nano asks: File Name to Write: my_notes.txt
- Press Enter to confirm
You'll see: [ Wrote 3 lines ]
Exiting
Press Ctrl+X
If you have unsaved changes, nano asks whether to save the modified buffer:
- Y: Save and exit
- N: Exit without saving
- Ctrl+C: Cancel and return to editing
Practice: Your First Edit
- Type several lines of text
- Press Ctrl+O to save
- Press Enter to confirm filename
- Press Ctrl+X to exit
Verify your file:
Essential Nano Shortcuts
Cutting, Copying, and Pasting
| Shortcut | Action | How It Works |
|---|---|---|
| Ctrl+K | Cut line | Cuts current line to clipboard |
| Alt+6 | Copy line | Copies current line without deleting |
| Ctrl+U | Paste | Pastes clipboard content |
Practice:
Type:
- Move cursor to Line 2
- Press Ctrl+K (cuts the line)
- Move cursor to end of file
- Press Ctrl+U (pastes it)
Result:
To cut multiple lines:
1. Press Ctrl+K several times in a row, without moving the cursor in between
2. All cut lines are in the clipboard
3. Press Ctrl+U to paste them all
Searching
| Shortcut | Action |
|---|---|
| Ctrl+W | Search ("Where Is") |
| Alt+W | Find next occurrence |
| Ctrl+\ | Search and replace |
Practice:
Type:
Save it (Ctrl+O, Enter).
Now:
1. Press Ctrl+W
2. Type dog
3. Press Enter
Cursor jumps to the first "dog".
Press Alt+W to find the next occurrence.
Search and Replace
- Press Ctrl+\ (backslash)
- At the Search prompt, type what to find (e.g., dog), then press Enter
- At the Replace with prompt, type the replacement (e.g., cat), then press Enter
- Nano asks "Replace this instance?" for each match:
  - Y: Replace this one
  - N: Skip this one
  - A: Replace all
  - Ctrl+C: Cancel
Practice: Replace all "dog" with "cat" in your file.
Undoing and Redoing
| Shortcut | Action |
|---|---|
| Alt+U | Undo |
| Alt+E | Redo |
Note: On some systems, these might not work. Nano's undo is limited compared to GUI editors, so save frequently!
Getting Help
Press Ctrl+G to see the full help system. Press Ctrl+X to exit help.
Complete Nano Keyboard Shortcut Table
| Shortcut | Action |
|---|---|
| Ctrl+O | Write Out (save) |
| Ctrl+X | Exit |
| Ctrl+G | Get Help |
| Ctrl+K | Cut line |
| Ctrl+U | Paste (Uncut) |
| Alt+6 | Copy line |
| Ctrl+W | Search (Where Is) |
| Alt+W | Find next |
| Ctrl+\ | Search and replace |
| Ctrl+A | Beginning of line |
| Ctrl+E | End of line |
| Ctrl+Y | Page up |
| Ctrl+V | Page down |
| Ctrl+_ | Go to line number |
| Alt+U | Undo |
| Alt+E | Redo |
| Ctrl+C | Show cursor position / Cancel |
| Ctrl+R | Read (insert) another file |
Real Bioinformatics Use Cases
Use Case 1: Editing a Configuration File
Type:
# Analysis Configuration
reference_genome=/data/refs/hg38.fa
input_fastq=/data/samples/patient001_R1.fastq.gz
output_dir=/results/patient001
threads=8
min_quality=20
Save and exit.
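Because each line is a simple key=value pair, a script can read the settings back with standard tools. The following is a minimal sketch, assuming the file is saved as analysis_config.txt (the name used in the exit ticket) and that no key contains an "=" sign. The scratch directory created with mktemp is only there so the example does not overwrite anything.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the configuration file shown above
cat > analysis_config.txt << 'EOF'
# Analysis Configuration
reference_genome=/data/refs/hg38.fa
input_fastq=/data/samples/patient001_R1.fastq.gz
output_dir=/results/patient001
threads=8
min_quality=20
EOF

# Match the key at the start of a line, then keep everything after the first '='
threads=$(grep '^threads=' analysis_config.txt | cut -d= -f2-)
reference=$(grep '^reference_genome=' analysis_config.txt | cut -d= -f2-)

echo "Using $threads threads with reference $reference"
```

Anchoring the pattern with ^ keeps the comment line and any similarly named keys from matching.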
Use Case 2: Creating a Sample Manifest
Type:
SampleID Condition Batch Replicate
S001 control Batch1 Rep1
S002 control Batch1 Rep2
S003 treated Batch1 Rep1
S004 treated Batch1 Rep2
(Use Tab key between columns for proper TSV format)
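Spaces typed by accident look the same as tabs on screen, so it can help to check the file after saving. A small sketch, assuming the manifest is named sample_manifest.tsv (the name used in the exit ticket); printf is used here only to build a known-good file without opening an editor.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"

# printf turns \t into a real tab character between columns
printf 'SampleID\tCondition\tBatch\tReplicate\n' > sample_manifest.tsv
printf 'S001\tcontrol\tBatch1\tRep1\n' >> sample_manifest.tsv
printf 'S003\ttreated\tBatch1\tRep1\n' >> sample_manifest.tsv

# Count tab-separated fields on every line and list the distinct counts;
# a clean 4-column TSV yields the single value 4
field_counts=$(awk -F'\t' '{ print NF }' sample_manifest.tsv | sort -u)
echo "Distinct field counts: $field_counts"
```

If a line was typed with spaces instead of tabs, it shows up as an extra count (usually 1) in the list.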
Use Case 3: Quick Script Fix
You have a script with an error:
Original:
Fix:
1. Press Ctrl+W, search for resutls
2. Press Ctrl+\, replace resutls with results
3. Save and exit
Part 2: File Compression
Why Compress Files in Bioinformatics?
Genomic data files are huge:
- A single human genome FASTQ: ~200 GB uncompressed
- Compressed with gzip: ~60 GB (70% reduction!)
- A single RNA-seq sample: 5-20 GB compressed
Compression saves:
- Disk space
- Transfer time
- Storage costs
Common compressed formats:
- .gz — gzip compression (most common)
- .bz2 — bzip2 compression (better compression, slower)
- .zip — ZIP archive (common on Windows)
- .tar.gz or .tgz — Compressed archive (multiple files)
In bioinformatics, gzip (.gz) is the standard.
Compressing Files with gzip
gzip compresses files and adds a .gz extension.
Important: This deletes the original file!
Practice:
cd ~/bioinfo-course/module08
# Create a test file
echo "This is a test file with some repeated text text text text" > test.txt
# Check size
ls -lh test.txt
# Compress it
gzip test.txt
# Check size and name
ls -lh test.txt.gz
The original test.txt is gone, replaced by test.txt.gz.
Keep the Original: gzip -c
To keep the original, use -c (write to stdout) and redirect:
Now both test.txt and test.txt.gz exist.
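The redirect described above can be sketched as follows. The scratch directory is an assumption for illustration; in the course you would run the gzip line inside ~/bioinfo-course/module08.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
echo "This is a test file with some repeated text text text text" > test.txt

# -c sends the compressed bytes to stdout; the redirect stores them in a new file
gzip -c test.txt > test.txt.gz

# Both files should now be listed
ls -lh test.txt test.txt.gz
```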
Decompress with gunzip
This deletes test.txt.gz and creates test.txt.
Or keep the compressed file:
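Both variants, sketched on a scratch file. The directory and the name test_copy.txt are assumptions for illustration. On gzip 1.6 or newer, gunzip -k is another way to keep the compressed file.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
echo "This is a test file" > test.txt
gzip test.txt                          # leaves only test.txt.gz

# Keep the compressed file: write the decompressed data to stdout and redirect it
gunzip -c test.txt.gz > test_copy.txt

# Plain gunzip replaces test.txt.gz with test.txt
gunzip test.txt.gz

ls -l
```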
Useful gzip Options
| Option | What It Does | Example |
|---|---|---|
| -c | Write to stdout (keep original) | gzip -c file > file.gz |
| -d | Decompress (same as gunzip) | gzip -d file.gz |
| -k | Keep original file | gzip -k file |
| -1 to -9 | Compression level (1=fast/big, 9=slow/small) | gzip -9 file |
| -l | List compressed file info | gzip -l file.gz |
| -r | Recursively compress directory | gzip -r directory/ |
| -t | Test integrity | gzip -t file.gz |
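A few of these options can be tried together on a scratch file. This is only a sketch: the file name repeats.txt and its contents are made up, chosen because highly repetitive text compresses well.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"

# Build a repetitive 1000-line file
awk 'BEGIN { for (i = 0; i < 1000; i++) print "ATGCATGCATGCATGC" }' > repeats.txt

# Maximum compression, keeping the original via -c and a redirect
gzip -9 -c repeats.txt > repeats.txt.gz

# -t exits with status 0 when the compressed file is intact
if gzip -t repeats.txt.gz; then
    integrity="ok"
else
    integrity="corrupt"
fi
echo "Integrity: $integrity"

# -l prints the compressed size, uncompressed size, and ratio
gzip -l repeats.txt.gz
```

Because -t reports through its exit status, it fits naturally into an if statement in a script.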
Check Compression Ratio
Output:
Working with Compressed Files WITHOUT Decompressing
This is crucial for bioinformatics! You don't want to decompress a 60 GB FASTQ file just to look at it!
View with zcat (Compressed cat)
Example:
cd ~/bioinfo-course/module08
# Create and compress a test file
cat > sequences.txt << 'EOF'
>Seq1
ATGCGCTAG
>Seq2
GCTAGCTAC
>Seq3
TACGATCGA
EOF
gzip sequences.txt
# View without decompressing
zcat sequences.txt.gz
Output:
>Seq1
ATGCGCTAG
>Seq2
GCTAGCTAC
>Seq3
TACGATCGA
The file stays compressed!
View with zless (Compressed less)
Same navigation as less (Space, b, q, /search, etc.).
Perfect for previewing large compressed FASTQ files:
Search with zgrep (Compressed grep)
Example:
Output: 3
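One command that produces this count, assuming the three-sequence sequences.txt.gz from the zcat example, counts the FASTA header lines. This is a sketch, not necessarily the command originally shown here.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
printf '>Seq1\nATGCGCTAG\n>Seq2\nGCTAGCTAC\n>Seq3\nTACGATCGA\n' | gzip > sequences.txt.gz

# -c counts matching lines; '^>' matches only lines that start with '>'
header_count=$(zgrep -c '^>' sequences.txt.gz)
echo "$header_count"
```

The pattern is quoted so the shell does not treat > as a redirect.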
Process with Pipes
# Count lines in compressed file
zcat file.gz | wc -l
# Extract sequences from compressed FASTA
zcat sequences.txt.gz | grep -v '^>'
# Get first 1000 reads from compressed FASTQ
zcat sample_R1.fastq.gz | head -n 4000 | less
OS-Specific Note for macOS
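On many macOS versions, the built-in zcat expects files ending in .Z, so zcat file.gz can fail there; gzcat and gzip -dc are the usual substitutes. That note comes from general experience with macOS rather than from this module's original text. A portable sketch, where zcat_portable is a made-up helper name:

```shell
# Hypothetical helper: decompress to stdout the same way on Linux and macOS
zcat_portable() {
    gzip -dc "$@"
}

# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
printf '>Seq1\nATGCGCTAG\n' | gzip > demo.fa.gz

first_line=$(zcat_portable demo.fa.gz | head -n 1)
echo "$first_line"
```

For checksums on macOS, shasum -a 256 takes the place of sha256sum, as listed in the Quick Reference.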
Practical Example: FASTQ File Inspection
cd ~/bioinfo-course/module08
# Create a mock FASTQ file
cat > sample.fastq << 'EOF'
@Read1
ATGCGCTAG
+
IIIIIIIII
@Read2
GCTAGCTAC
+
IIIIIIIII
@Read3
TACGATCGA
+
IIIIIIIII
EOF
# Compress it
gzip sample.fastq
# Preview first read (4 lines)
zcat sample.fastq.gz | head -n 4
# Count total reads (number of lines / 4)
echo "$(( $(zcat sample.fastq.gz | wc -l) / 4 )) reads"
# Count reads whose quality line contains 'IIIIIIIII' ('I' = Phred 40 in Phred+33 encoding)
zgrep -c 'IIIIIIIII' sample.fastq.gz
# Extract just the sequences (line 2 of every 4-line record)
zcat sample.fastq.gz | awk 'NR % 4 == 2'
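The same record arithmetic supports a quick sanity check: in valid FASTQ, the sequence line and the quality line of a record have the same length. A minimal sketch, using a deliberately broken two-read file (check.fastq.gz is a made-up name, and Read2's quality string is one character short on purpose):

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
printf '@Read1\nACGT\n+\nIIII\n@Read2\nACGT\n+\nIII\n' | gzip > check.fastq.gz

# Line 1 of each record is the ID, line 2 the sequence, line 4 the quality string;
# print the ID of every record whose sequence and quality lengths differ
mismatches=$(gzip -dc check.fastq.gz | awk '
    NR % 4 == 1 { id = $0 }
    NR % 4 == 2 { seqlen = length($0) }
    NR % 4 == 0 && length($0) != seqlen { print id }
')
echo "Records with mismatched lengths: $mismatches"
```

This does not replace a real validator, but it catches truncated or hand-edited records early.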
Part 3: Checksums — Verify File Integrity
What is a Checksum?
A checksum is a short, fixed-length "fingerprint" computed from a file's contents. Changing even a single byte produces a completely different checksum, and two different files are extremely unlikely to share one by accident.
Uses:
- Verify that a file wasn't corrupted during download/transfer
- Confirm that a file hasn't been modified
- Compare files quickly by comparing their checksums instead of their full contents
Common algorithms:
- MD5: 128-bit hash (fast, but cryptographically weak)
- SHA-256: 256-bit hash (secure, standard for verification)
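To see how sensitive a checksum is, compare two files that differ in a single character. A small sketch with made-up file names, using sha256sum (Linux/WSL; on macOS substitute shasum -a 256):

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
printf 'ATGCGCTAG\n' > original.txt
printf 'ATGCGCTAC\n' > altered.txt     # last base changed from G to C

# Keep only the hash (first field); the rest of the line is the file name
sum_original=$(sha256sum original.txt | cut -d' ' -f1)
sum_altered=$(sha256sum altered.txt | cut -d' ' -f1)

echo "$sum_original"
echo "$sum_altered"
```

The two printed hashes differ throughout, not just in one position.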
Generating Checksums
MD5 Checksum
Example:
cd ~/bioinfo-course/module08
echo "Important data" > data.txt
# Generate MD5 checksum (Linux/WSL)
md5sum data.txt
Output:
The first part is the checksum (32 hexadecimal characters).
SHA-256 Checksum (Recommended)
Example:
Output:
(64 hex characters)
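A sketch of the SHA-256 step, reusing the data.txt file from the MD5 example. The scratch directory is an assumption for illustration.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
echo "Important data" > data.txt

# Generate SHA-256 checksum (Linux/WSL); on macOS use: shasum -a 256 data.txt
sha256sum data.txt

# The hash is the first field of the output; count its characters
hash=$(sha256sum data.txt | awk '{ print $1 }')
echo "${#hash} hex characters"
```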
Saving and Verifying Checksums
Save Checksum to File
Output:
Verify Checksum Later
Output:
This confirms the file hasn't changed.
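The save-then-verify cycle can be sketched as follows, assuming the data.txt file from the earlier example and the data.txt.sha256 name used in the next section:

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"
echo "Important data" > data.txt

# Save the checksum line (hash plus file name) for later
sha256sum data.txt > data.txt.sha256

# -c recomputes the hash of each listed file and compares it with the saved one
verify_output=$(sha256sum -c data.txt.sha256)
echo "$verify_output"
```

For an unchanged file this prints data.txt: OK. Because the checksum file stores the file name, run the check from the directory that contains data.txt.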
Test: Modify the File
# Change the file slightly
echo "Modified data" > data.txt
# Verify again
sha256sum -c data.txt.sha256
Output:
The checksum detected the change!
Real Bioinformatics Use Case: Download Verification
When downloading genomic data, you should always verify checksums.
Example workflow:
# Download FASTQ file (simulated)
curl -O ftp://data.server.edu/sample001_R1.fastq.gz
# Download checksum file
curl -O ftp://data.server.edu/sample001_R1.fastq.gz.md5
# Verify
md5sum -c sample001_R1.fastq.gz.md5
If it says "OK", your download is complete and uncorrupted!
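In a script, the exit status of md5sum -c is more useful than the printed text. The sketch below contacts no server: it creates stand-ins for the two downloaded files locally and then simulates a damaged download. The --status flag is the GNU coreutils option that suppresses all output.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"

# Stand-ins for the downloaded data file and its published checksum file
echo "mock FASTQ content" | gzip > sample001_R1.fastq.gz
md5sum sample001_R1.fastq.gz > sample001_R1.fastq.gz.md5

# --status prints nothing; the exit code alone says whether the file matched
if md5sum -c --status sample001_R1.fastq.gz.md5; then
    first_check="OK"
else
    first_check="FAILED"
fi

# Simulate a damaged download by appending a stray byte, then check again
printf 'x' >> sample001_R1.fastq.gz
if md5sum -c --status sample001_R1.fastq.gz.md5; then
    second_check="OK"
else
    second_check="FAILED"
fi
echo "before damage: $first_check, after damage: $second_check"
```

A pipeline can use the same test to stop before analyzing a corrupted file.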
Generate Checksums for Multiple Files
# Generate checksums for all FASTQ files
sha256sum *.fastq.gz > all_samples.sha256
# Verify all at once later
sha256sum -c all_samples.sha256
Output:
sample001_R1.fastq.gz: OK
sample001_R2.fastq.gz: OK
sample002_R1.fastq.gz: OK
sample002_R2.fastq.gz: OK
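With hundreds of files, the OK lines can bury a single failure. A sketch using the GNU coreutils --quiet option, which prints only the files that do not verify; the mock sample files are assumptions for illustration.

```shell
# Work in a throwaway directory (assumption: for illustration only)
workdir=$(mktemp -d)
cd "$workdir"

# Create a few small mock compressed files
for name in sample001_R1 sample001_R2 sample002_R1; do
    echo "mock reads for $name" | gzip > "$name.fastq.gz"
done
sha256sum *.fastq.gz > all_samples.sha256

# --quiet suppresses the "OK" lines, so only problems are printed
problems=$(sha256sum -c --quiet all_samples.sha256)
echo "Problems found: ${problems:-none}"
```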
Practice Exercises
Exercise 1: Create and Edit a Script
Type:
#!/bin/bash
# Quality control script for RNA-seq samples
SAMPLE_ID="sample001"
INPUT_DIR="/data/raw_fastq"
OUTPUT_DIR="/results/qc"
echo "Starting QC for $SAMPLE_ID"
mkdir -p "$OUTPUT_DIR"
# Placeholder for FastQC
echo "Running FastQC..."
# fastqc "${INPUT_DIR}/${SAMPLE_ID}_R1.fastq.gz" -o "$OUTPUT_DIR"
echo "QC complete!"
Save (Ctrl+O, Enter), then exit (Ctrl+X).
Verify:
Exercise 2: Compress and Verify
# Create a data file
cat > experiment_results.txt << 'EOF'
Sample,ReadCount,MappingRate
S001,25000000,95.2
S002,26000000,94.8
S003,24500000,93.1
EOF
# Generate checksum BEFORE compression
sha256sum experiment_results.txt > experiment_results.txt.sha256
# Compress
gzip -k experiment_results.txt
# Verify original (should work)
sha256sum -c experiment_results.txt.sha256
# Try to view compressed without decompressing
zcat experiment_results.txt.gz | head
Exercise 3: FASTQ Simulation
# Create a larger mock FASTQ
cat > reads.fastq << 'EOF'
@Read1
ATGCGCTAGCTAGCTAGCTAGCTAGCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIII
@Read2
GCTAGCTACGATCGATCGATCGATCG
+
IIIIIIIIIIIIIIIIIIIIIIIIII
@Read3
TACGATCGAGCTAGCTAGCTAGCTAG
+
IIIIIIIIIIIIIIIIIIIIIIIIII
@Read4
ATCGATCGATCGATCGATCGATCGAT
+
IIIIIIIIIIIIIIIIIIIIIIIIII
@Read5
GCTAGCTAGCTAGCTAGCTAGCTAGC
+
IIIIIIIIIIIIIIIIIIIIIIIIII
EOF
# Compress
gzip reads.fastq
# Count reads
echo "$(( $(zcat reads.fastq.gz | wc -l) / 4 )) reads total"
# Extract just sequence IDs
zcat reads.fastq.gz | grep '^@' | cut -c 2-
# Extract just sequences
zcat reads.fastq.gz | awk 'NR % 4 == 2'
# Find reads containing "GCTAGC" motif
zgrep 'GCTAGC' reads.fastq.gz
The Exit Ticket Challenge: Complete Workflow
Scenario
You're setting up a bioinformatics analysis. Create configuration files, compress data, and verify integrity.
Tasks
Task 1: Create an analysis configuration file
Include:
- Reference genome path
- Input FASTQ paths (sample001_R1.fastq.gz, sample001_R2.fastq.gz)
- Output directory
- Number of threads (8)
- Minimum quality score (20)
Task 2: Create a sample manifest in TSV format
Include 5 samples with columns:
- SampleID
- Condition (control or treated)
- Batch
- ReadCount
- QC_Status (pass/fail)
Task 3: Create a mock FASTQ file with 10 reads
Use proper FASTQ format (4 lines per read).
Task 4: Compress the FASTQ file, keeping the original
Task 5: Generate SHA-256 checksums for:
- The uncompressed FASTQ
- The compressed FASTQ
- The manifest file
Save all checksums to checksums.sha256
Task 6: Verify all checksums
Task 7: Without decompressing, count the number of reads in the compressed FASTQ
Task 8: Create a simple analysis script (run_analysis.sh) that:
- Prints "Analysis started"
- Prints the sample ID and paths from your config file (just echo statements)
- Prints "Analysis complete"
Make the script executable and test it:
Quick Reference
Nano Shortcuts
| Shortcut | Action |
|---|---|
| Ctrl+O | Save (Write Out) |
| Ctrl+X | Exit |
| Ctrl+K | Cut line |
| Ctrl+U | Paste (Uncut) |
| Ctrl+W | Search |
| Ctrl+\ | Replace |
| Ctrl+A | Start of line |
| Ctrl+E | End of line |
| Alt+U | Undo |
| Ctrl+G | Help |
Compression
| Command | What It Does |
|---|---|
| gzip file | Compress (deletes original) |
| gzip -k file | Compress (keeps original) |
| gzip -c file > file.gz | Compress to stdout |
| gunzip file.gz | Decompress (deletes .gz) |
| zcat file.gz | View compressed file |
| zless file.gz | Page through compressed file |
| zgrep pattern file.gz | Search compressed file |
| gzip -l file.gz | Show compression info |
Checksums
| Command | What It Does |
|---|---|
| md5sum file | Generate MD5 checksum (Linux) |
| sha256sum file | Generate SHA-256 checksum (Linux) |
| shasum -a 256 file | Generate SHA-256 checksum (macOS) |
| sha256sum -c file.sha256 | Verify checksum (Linux) |
| shasum -a 256 -c file.sha256 | Verify checksum (macOS) |
Exit Ticket
To complete this module, send me an email with:
Subject: Bioinfo M8 Exit Ticket – [Your Name]
Content:
- Contents of your analysis_config.txt
- Contents of your sample_manifest.tsv
- Output of checksum verification (sha256sum -c checksums.sha256)
- The command you used to count reads in the compressed FASTQ
- Contents of your run_analysis.sh script
- One paragraph (3-4 sentences) explaining why compression and checksums are important in bioinformatics
Summary
Congratulations! You now understand:
✓ How to edit text files with nano
✓ Essential nano keyboard shortcuts for efficient editing
✓ How to safely edit configuration files and scripts
✓ Why compression is critical for genomic data
✓ How to compress and decompress files with gzip
✓ How to work with compressed files without decompressing (zcat, zless, zgrep)
✓ How to verify file integrity with checksums (md5sum, sha256sum)
✓ Best practices for handling large genomic datasets
These skills are foundational! You'll use them constantly when:
- Editing configuration files for bioinformatics tools
- Working with compressed FASTQ/BAM files
- Verifying downloaded datasets
- Creating and modifying analysis scripts
- Managing disk space efficiently
In the next module, we'll dive into advanced text processing with awk—a powerful tool for manipulating tabular data.