
Beginner’s Guide to Useful Bash Commands for Handling FASTA Files

Understanding FASTA Files

FASTA files are standard text-based files in bioinformatics used to represent nucleotide or protein sequences. They include:

  1. A header line starting with > followed by an identifier and optional description.
  2. One or more lines of sequence data (nucleotides or amino acids).
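
To make the format concrete, the sketch below writes a small two-record file and labels each line by type. The file name example.fasta, the IDs, the sequences, and the helper name classify_fasta_lines are all made up for illustration.

```shell
# Write a tiny FASTA file; IDs and sequences are invented for illustration.
cat > example.fasta <<'EOF'
>seq1 example nucleotide record
ACGTACGTAC
GTTGCA
>seq2 second record
GGGCCCAT
EOF

# Label each line as a header or a sequence line.
classify_fasta_lines() {
  awk '/^>/ {print "header:   " $0; next} {print "sequence: " $0}' "$1"
}

classify_fasta_lines example.fasta
```

Note that seq1 is wrapped over two lines, which is why many commands in this guide either join sequence lines first or expect linearized input.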

Why Bash Commands for FASTA?

Using Bash:

  • Simplifies repetitive tasks.
  • Allows rapid and efficient data manipulation without large software packages.
  • Is highly customizable and versatile.

Bash Commands for Basic FASTA Manipulations

1. Count the Number of Sequences

Count sequences (headers start with >):

grep -c "^>" file.fasta

2. Extract Headers Only

grep "^>" file.fasta

3. Count Total Residues Across All Sequences

Strip newlines first so that only sequence characters are counted:

grep -v "^>" file.fasta | tr -d '\n' | wc -c
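
A single total hides how long each individual sequence is. As a rough sketch, the function below (the name seq_lengths is our own, not a standard tool) prints one ID and length per record and should work on wrapped files as well. It assumes the ID is the first word of the header.

```shell
# Print "ID<TAB>length" for every record in a FASTA file.
# Works for wrapped and unwrapped input; the ID is the first word of the header.
seq_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # flush the previous record
      id = substr($1, 2)                # drop the leading ">"
      len = 0
      next
    }
    { len += length($0) }               # add this sequence line
    END { if (id != "") print id "\t" len }
  ' "$1"
}
```

For example, seq_lengths file.fasta | sort -k2,2nr | head lists the longest sequences first.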

4. Linearize FASTA (Remove Wrapping)

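awk '/^>/ {if (NR>1) printf "\n"; print; next} {printf "%s", $0} END {if (NR) printf "\n"}' file.fasta > output.fasta

The result has one header line and one sequence line per record, which simpler commands (such as grep -A1) rely on.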

5. Extract Specific Sequence by ID

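On a linearized file (see command 4), anchor the pattern to the start of the header so the ID is not matched inside descriptions. The pattern still matches any ID that begins with ID_OF_SEQUENCE:

grep -A1 "^>ID_OF_SEQUENCE" output.fasta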
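
The grep approach prints only one line after the header, so it truncates wrapped sequences. One possible alternative, sketched here with a hypothetical helper named extract_by_id, matches the ID exactly and prints every line up to the next header. It assumes the ID is the first word of the header line.

```shell
# Print the record whose ID (first word of the header, without ">") equals $1.
# Handles wrapped sequences; printing stops at the next header.
extract_by_id() {
  awk -v id="$1" '
    /^>/ { keep = (substr($1, 2) == id) }  # exact match on the ID only
    keep { print }                         # print lines inside the wanted record
  ' "$2"
}
```

Usage: extract_by_id seq1 file.fasta > seq1.fasta. Because the comparison is exact, seq1 does not also select seq10.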

Intermediate Bash Commands

6. Find Sequences Starting with Specific Base

To list the headers of sequences that start with A, use linearized input (see command 4) and keep only the header lines:

grep -B1 "^A" output.fasta | grep "^>"

7. Add Prefix or Suffix to Headers

Add a prefix:

sed 's/^>/&PREFIX_/' file.fasta > updated.fasta

Add a suffix:

sed 's/^>.*/&SUFFIX/' file.fasta > updated.fasta

8. Remove Duplicate Sequences

This keeps the first record for each distinct sequence and expects linearized input (see command 4):

awk '/^>/ {header=$0; next} !seen[$0]++ {print header; print}' output.fasta > unique.fasta
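
If the input is still wrapped, the sequence lines have to be joined before they can be compared. The following is a minimal sketch of that idea; the function name dedupe_fasta is our own, and it assumes that duplicates are exact, case-sensitive matches.

```shell
# Keep the first record for each distinct sequence; later duplicates are dropped.
# Wrapped lines are joined, so output sequences are written on a single line.
dedupe_fasta() {
  awk '
    function emit() {
      if (header != "" && !seen[seq]++) print header "\n" seq
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }          # join wrapped lines
    END { emit() }
  ' "$1"
}
```

Usage: dedupe_fasta file.fasta > unique.fasta.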

Advanced Bash Commands

9. Compute GC Content for Each Sequence

Each header is printed, followed by the GC percentage of its sequence:

awk '/^>/ {if (len) printf "%.2f\n", 100*gc/len; print; gc=0; len=0; next} {len+=length($0); gc+=gsub(/[GCgc]/, "")} END {if (len) printf "%.2f\n", 100*gc/len}' file.fasta
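
For downstream sorting or filtering, a tab-separated table is often easier to work with than alternating header and value lines. This sketch (gc_content is a hypothetical helper name) prints one ID and GC percentage per record. It assumes that every character, including N and other ambiguity codes, counts toward the sequence length.

```shell
# Print "ID<TAB>GC%" per record. N and other ambiguity codes count toward length.
gc_content() {
  awk '
    function report() {
      if (id != "") printf "%s\t%.2f\n", id, (len ? 100 * gc / len : 0)
    }
    /^>/ { report(); id = substr($1, 2); gc = 0; len = 0; next }
    {
      len += length($0)
      line = $0
      gc += gsub(/[GCgc]/, "", line)   # gsub returns the number of replacements
    }
    END { report() }
  ' "$1"
}
```

Usage: gc_content file.fasta | sort -k2,2n orders records from lowest to highest GC content.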

10. Reverse Complement a Sequence

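The sequence is complemented with gsub and then reversed with a loop. Use linearized input (see command 4) so that each sequence is reversed as a whole:

awk '/^>/ {print; next} {seq=toupper($0); gsub(/A/, "t", seq); gsub(/T/, "a", seq); gsub(/C/, "g", seq); gsub(/G/, "c", seq); rc=""; for (i=length(seq); i>0; i--) rc=rc substr(seq, i, 1); print toupper(rc)}' output.fasta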
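
A version that accepts wrapped input and preserves letter case might look like the sketch below. The function name revcomp_fasta is our own. It assumes plain DNA: only A, C, G, and T (either case) are complemented, and any other character, such as N, is passed through unchanged.

```shell
# Reverse-complement every record; output sequences are written on one line.
revcomp_fasta() {
  awk '
    BEGIN {
      # Complement table; characters not listed (for example N) pass through unchanged.
      split("A T C G a t c g", from, " ")
      split("T A G C t a g c", to, " ")
      for (i = 1; i <= 8; i++) comp[from[i]] = to[i]
    }
    function revcomp(s,    i, c, out) {
      out = ""
      for (i = length(s); i > 0; i--) {   # walk the sequence from the end
        c = substr(s, i, 1)
        out = out ((c in comp) ? comp[c] : c)
      }
      return out
    }
    function emit() { if (header != "") print header "\n" revcomp(seq) }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }                      # join wrapped lines
    END { emit() }
  ' "$1"
}
```

Usage: revcomp_fasta file.fasta > revcomp.fasta.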

11. Extract Sequences Longer Than X

Set min to the length threshold X (100 in this example):

awk -v min=100 '/^>/ {if (length(seq) > min) print header "\n" seq; header=$0; seq=""; next} {seq=seq $0} END {if (length(seq) > min) print header "\n" seq}' file.fasta

Recent Tools and Trends

12. Use seqkit for Enhanced Functionality

seqkit is a fast command-line toolkit for FASTA/FASTQ processing; it is installed separately from Bash. Example commands:

  • Count sequences and summarize lengths:
    seqkit stats file.fasta
  • Extract subsequences by position (1-based, for example 1:100):
    seqkit subseq -r start:end file.fasta
  • Filter by length:
    seqkit seq -m 100 file.fasta > filtered.fasta

13. Parallel Processing with GNU Parallel

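Split a large file into chunks and process them on several cores. The command is quoted so that the > in the pattern is not treated as a redirection, and -k keeps the output in input order:

parallel --pipe -k 'grep "^>"' < file.fasta > headers.txt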

14. Randomize or Shuffle Sequences

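Randomize record order using shuf. Each record is first joined onto a single tab-separated line so that headers stay paired with their sequences (this assumes headers contain no tab characters):

awk '/^>/ {if (NR>1) printf "\n"; printf "%s\t", $0; next} {printf "%s", $0} END {if (NR) printf "\n"}' file.fasta | shuf | tr '\t' '\n' > shuffled.fasta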


Practical Applications

  1. Genome Analysis: Filter out short or duplicate sequences, or extract subsequences.
  2. Sequence Metrics: Compute values such as sequence length or, for nucleotide sequences, GC content.
  3. Pipeline Integration: Use Bash commands in bioinformatics pipelines for preprocessing data.
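
As an illustration of the pipeline use case, the sketch below wraps a length filter in a reusable function and reports record counts on standard error. The function name preprocess_fasta, its argument order, and the default threshold of 100 are assumptions chosen for this example, not part of any existing pipeline.

```shell
# Hypothetical preprocessing step: keep records of at least min_len residues
# and report how many records went in and came out.
preprocess_fasta() {
  local input="$1" output="$2" min_len="${3:-100}"
  awk -v min="$min_len" '
    function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }                      # join wrapped lines
    END { emit() }
  ' "$input" > "$output"
  # Counts go to stderr so they do not mix with data on stdout.
  printf 'input records:  %s\n' "$(grep -c "^>" "$input")" >&2
  printf 'output records: %s\n' "$(grep -c "^>" "$output")" >&2
}
```

Usage: preprocess_fasta raw.fasta filtered.fasta 200 keeps records with at least 200 residues; the file names here are placeholders.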

Why Learn These Commands?

  • Handle complex sequence data directly in the terminal.
  • Process large-scale sequence datasets efficiently.
  • Combine standard Unix tools with dedicated toolkits such as seqkit.

This guide gives beginners a working set of Bash commands for handling FASTA files and points more experienced users to tools such as seqkit and GNU Parallel.
