AI-bioinformatics

Beginner’s Guide to Useful Bash Commands for Handling FASTA Files

December 28, 2024, by admin

Understanding FASTA Files

FASTA is a standard plain-text format in bioinformatics for storing nucleotide or protein sequences. Each record consists of:

  1. A header line starting with > followed by an identifier and optional description.
  2. One or more lines of sequence data (nucleotides or amino acids).
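As a concrete illustration of that layout, the sketch below writes a tiny two-record FASTA file and reads back its first line. The temporary file and the sequences are made up for this example.

```bash
# Write a small, made-up FASTA file to a temporary location.
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>seq1 first example record
ACGTACGTAC
GTACGT
>seq2 second example record
GGGCCCAAAT
EOF

# The first line of a FASTA file should be a header starting with ">".
first_line=$(head -n 1 "$fasta")
n_lines=$(wc -l < "$fasta" | tr -d ' ')

echo "$first_line"
echo "$n_lines"
rm -f "$fasta"
```

Note that seq1 spans two sequence lines: wrapped sequences like this are common, and several commands in this guide behave differently on wrapped and unwrapped files.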

Why Bash Commands for FASTA?

Using Bash:

  • Simplifies repetitive tasks.
  • Allows rapid and efficient data manipulation without large software packages.
  • Is highly customizable and versatile.

Bash Commands for Basic FASTA Manipulations

1. Count the Number of Sequences

Count sequences (headers start with >):

bash
grep -c "^>" file.fasta

2. Extract Headers Only

bash
grep "^>" file.fasta
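The two commands above can be combined into a small sketch that counts records and lists bare IDs (the first word of each header, without the >). The sample records and variable names are assumptions made for illustration.

```bash
fasta=$(mktemp)
printf '>seq1 sample A\nACGT\n>seq2 sample B\nGGCC\n' > "$fasta"

# Number of records = number of header lines.
n_records=$(grep -c '^>' "$fasta")

# Bare IDs: drop the leading ">" and everything after the first space.
ids=$(grep '^>' "$fasta" | sed 's/^>//; s/ .*//')

echo "$n_records"
echo "$ids"
rm -f "$fasta"
```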

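3. Count Total Sequence Characters

Count residues across all sequences, stripping newlines first so that wc does not count them:

bash
grep -v "^>" file.fasta | tr -d '\n' | wc -c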
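A total is often less useful than a per-record breakdown. The following is a minimal sketch, assuming the ID is the first word of each header; the sample data and variable names are hypothetical.

```bash
fasta=$(mktemp)
printf '>seq1\nACGTAC\nGT\n>seq2\nGGCC\n' > "$fasta"

# Total residues, with newlines removed so they are not counted.
total=$(grep -v '^>' "$fasta" | tr -d '\n' | wc -c | tr -d ' ')

# Per-record lengths: emit "ID<TAB>length" whenever a new header starts.
lengths=$(awk '
  /^>/ { if (id != "") print id "\t" n; id = substr($1, 2); n = 0; next }
       { n += length($0) }
  END  { if (id != "") print id "\t" n }
' "$fasta")

echo "$total"
echo "$lengths"
rm -f "$fasta"
```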

4. Linearize FASTA (Remove Wrapping)

bash
awk '/^>/ {if (NR>1) printf "\n"; print; next} {printf "%s", $0} END {printf "\n"}' file.fasta > output.fasta
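To see what linearizing does, here is a small worked sketch on made-up data. The function name linearize is hypothetical; it reads FASTA on standard input and uses an explicit "%s" format so that sequence text is never interpreted as a printf format string.

```bash
# Join wrapped sequence lines so each record is one header plus one line.
linearize() {
  awk '
    /^>/ { if (NR > 1) printf "\n"; print; next }
         { printf "%s", $0 }
    END  { printf "\n" }
  '
}

fasta=$(mktemp)
printf '>seq1\nACGT\nACGT\n>seq2\nGG\nCC\n' > "$fasta"

linear=$(linearize < "$fasta")
echo "$linear"
rm -f "$fasta"
```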

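5. Extract Specific Sequence by ID

On a linearized FASTA (see command 4), print a header and the sequence line that follows it. Anchoring the pattern with ^> restricts the match to header lines; it still matches any ID that begins with the same text:

bash
grep -A1 "^>ID_OF_SEQUENCE" output.fasta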
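When the file is still wrapped, or when IDs share a prefix (seq1 and seq10, for example), an exact comparison in awk is a safer option. This is a sketch under the assumption that the ID is the first word after >; the variable want and the sample records are illustrative.

```bash
fasta=$(mktemp)
printf '>seq1 desc\nACGT\nACGT\n>seq10 desc\nGGCC\n' > "$fasta"

# Print the record whose ID equals `want` exactly.
# `keep` is switched on or off at every header and applies to the lines after it.
record=$(awk -v want="seq1" '
  /^>/ { keep = (substr($1, 2) == want) }
  keep
' "$fasta")

echo "$record"
rm -f "$fasta"
```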

Intermediate Bash Commands

6. Find Sequences Starting with Specific Base

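On a linearized FASTA (see command 4), print each record whose sequence starts with A. On a wrapped file, any continuation line beginning with A would also match. The second grep drops the -- separators that grep prints between groups of matches:

bash
grep -B1 "^A" output.fasta | grep -v "^--$"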
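For wrapped files, one option is to inspect only the first sequence line after each header. The sketch below assumes that approach; the variable base and the sample data are hypothetical.

```bash
fasta=$(mktemp)
printf '>seq1\nACGT\nTTTT\n>seq2\nGGCC\nAAAA\n>seq3\nATAT\n' > "$fasta"

# After each header, test only the first sequence line.
hits=$(awk -v base="A" '
  /^>/  { header = $0; first = 1; next }
  first { if (substr($0, 1, 1) == base) print header; first = 0 }
' "$fasta")

echo "$hits"
rm -f "$fasta"
```

Here seq2 is not reported even though its second line starts with A, because only the first line after the header is tested.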

7. Add Prefix or Suffix to Headers

Add a prefix:

bash
sed 's/^>/&PREFIX_/' file.fasta > updated.fasta

Add a suffix:

bash
sed 's/^>.*/&SUFFIX/' file.fasta > updated.fasta
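A common use of header prefixes is tagging every record with a sample name before merging files. The following sketch assumes a label stored in a shell variable (sample, a made-up name) that contains no / or & characters, since both are special inside this sed expression.

```bash
fasta=$(mktemp)
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$fasta"

sample="sampleA"   # hypothetical sample label

# Double quotes let the shell expand $sample inside the sed expression.
tagged=$(sed "s/^>/>${sample}_/" "$fasta")

echo "$tagged"
rm -f "$fasta"
```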

8. Remove Duplicate Sequences

bash
awk 'function emit() {if (header != "" && !seen[seq]++) print header "\n" seq} /^>/ {emit(); header=$0; seq=""; next} {seq=seq $0} END {emit()}' file.fasta > unique.fasta
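Deduplication only works reliably when whole sequences are compared, so wrapped lines have to be joined before the lookup. A minimal sketch on made-up records, where a and c carry the same sequence wrapped differently (the helper name emit is illustrative):

```bash
fasta=$(mktemp)
printf '>a\nACGT\nAC\n>b\nGGCC\n>c\nACG\nTAC\n' > "$fasta"

# Build each full sequence, then keep only the first record that carries it.
unique=$(awk '
  function emit() { if (h != "" && !seen[s]++) print h "\n" s }
  /^>/ { emit(); h = $0; s = ""; next }
       { s = s $0 }
  END  { emit() }
' "$fasta")

echo "$unique"
rm -f "$fasta"
```

The comparison is case-sensitive and exact, so acgt and ACGT count as different sequences in this sketch.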

Advanced Bash Commands

9. Compute GC Content for Each Sequence

bash
awk '/^>/ {if (n) print 100*gc/n; gc=0; n=0; print; next} {n+=length($0); gc+=gsub(/[GCgc]/, "")} END {if (n) print 100*gc/n}' file.fasta
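GC content is the number of G and C bases divided by the sequence length, times 100. The sketch below prints one "ID, tab, percentage" row per record; the tabular layout, the helper name report, and the sample data are choices made for this example. Note that the length is added before gsub runs, because gsub edits the line in place.

```bash
fasta=$(mktemp)
printf '>seq1\nGGCC\nAATT\n>seq2\nGCGC\n' > "$fasta"

gc_table=$(awk '
  function report() { if (n > 0) printf "%s\t%.1f\n", id, 100 * gc / n }
  /^>/ { report(); id = substr($1, 2); gc = 0; n = 0; next }
       { n += length($0); gc += gsub(/[GCgc]/, "") }   # gsub returns the match count
  END  { report() }
' "$fasta")

echo "$gc_table"
rm -f "$fasta"
```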

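10. Reverse Complement a Sequence

awk has no built-in reverse function, so the command defines one. Run it on a linearized FASTA (see command 4) so that each sequence is reversed as a whole rather than line by line:

bash
awk 'function rev(s,  i, r) {for (i=length(s); i>0; i--) r = r substr(s, i, 1); return r} /^>/ {print; next} {seq=toupper($0); gsub(/A/, "t", seq); gsub(/T/, "a", seq); gsub(/C/, "g", seq); gsub(/G/, "c", seq); print toupper(rev(seq))}' output.fasta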
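For a single sequence, the same idea can be wrapped in a shell function. This is a sketch, not a full IUPAC-aware implementation: the function name revcomp is hypothetical, and only A, C, G and T are complemented while other symbols such as N pass through unchanged.

```bash
# Reverse-complement one DNA string (case-insensitive input, uppercase output).
revcomp() {
  printf '%s\n' "$1" | awk '{
    s = toupper($0); out = ""
    for (i = length(s); i > 0; i--) {      # walk the string backwards
      b = substr(s, i, 1)
      if      (b == "A") b = "T"
      else if (b == "T") b = "A"
      else if (b == "C") b = "G"
      else if (b == "G") b = "C"
      out = out b
    }
    print out
  }'
}

rc=$(revcomp "AACGT")
echo "$rc"
```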

11. Extract Sequences Longer Than X

bash
awk -v min=100 '/^>/ {if (header != "" && length(seq) > min) print header "\n" seq; header=$0; seq=""; next} {seq=seq $0} END {if (header != "" && length(seq) > min) print header "\n" seq}' file.fasta
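A worked sketch of the length filter on made-up data, with the threshold passed in through awk's -v option. The variable min, the helper emit, and the record names are illustrative.

```bash
fasta=$(mktemp)
printf '>short\nACGT\n>long\nACGTACGT\nACGT\n' > "$fasta"

# Keep records whose full (joined) sequence is longer than `min`.
filtered=$(awk -v min=10 '
  function emit() { if (h != "" && length(s) > min) print h "\n" s }
  /^>/ { emit(); h = $0; s = ""; next }
       { s = s $0 }
  END  { emit() }
' "$fasta")

echo "$filtered"
rm -f "$fasta"
```

The record named long is kept because its two lines are joined before measuring (12 bases), even though neither line alone exceeds the threshold.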

Recent Tools and Trends

12. Use seqkit for Enhanced Functionality

seqkit is a recent and powerful toolkit for FASTA/FASTQ processing. Example commands:

  • Count sequences:
    bash
    seqkit stats file.fasta
  • Extract subsequences:
    bash
    seqkit subseq -r start:end file.fasta
  • Filter by length:
    bash
    seqkit seq -m 100 file.fasta > filtered.fasta

13. Parallel Processing with GNU Parallel

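Speed up operations on large FASTA files by splitting the input into chunks that are processed in parallel. The -k option keeps output in input order, and the single quotes stop the shell from treating the > in the pattern as a redirection:

bash
parallel --pipe -k 'grep "^>"' < file.fasta > headers.txt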

14. Randomize or Shuffle Sequences

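Randomize record order using shuf. Because shuf shuffles individual lines, each record is first placed on a single line (header, tab, sequence) and split again afterwards; this assumes headers contain no tab characters:

bash
awk '/^>/ {if (NR>1) printf "\n"; printf "%s\t", $0; next} {printf "%s", $0} END {printf "\n"}' file.fasta | shuf | tr '\t' '\n' > shuffled.fasta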
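A quick way to check that a shuffle kept records intact is to compare header and sequence pairs afterwards. The sketch below assumes GNU shuf is installed and that headers contain no tabs; the sample data and variable names are made up.

```bash
fasta=$(mktemp)
printf '>a\nAC\nGT\n>b\nGG\n>c\nTT\n' > "$fasta"

# One record per line (header<TAB>sequence), shuffle, then split again.
shuffled=$(awk '
  /^>/ { if (NR > 1) printf "\n"; printf "%s\t", $0; next }
       { printf "%s", $0 }
  END  { printf "\n" }
' "$fasta" | shuf | tr '\t' '\n')

# The order varies from run to run, but the sorted pairs must not.
pairs=$(printf '%s\n' "$shuffled" | paste - - | sort)

echo "$pairs"
rm -f "$fasta"
```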

Applications

  1. Genome Analysis: Filter out low-quality sequences or subsequences.
  2. Sequence Analysis: Compute metrics such as sequence length or, for nucleotide data, GC content.
  3. Pipeline Integration: Use Bash commands in bioinformatics pipelines for preprocessing data.
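As an illustration of pipeline integration, the commands from this guide can be wrapped in small functions and chained with pipes. This is a sketch, not a production pipeline: the function names (linearize, min_length, dedupe) and the sample data are hypothetical, and the last two functions assume linearized input.

```bash
# Join wrapped lines: one header line plus one sequence line per record.
linearize() { awk '/^>/ {if (NR>1) printf "\n"; print; next} {printf "%s", $0} END {printf "\n"}'; }

# Keep records with at least $1 residues (expects linearized input).
min_length() { awk -v min="$1" '/^>/ {h=$0; next} length($0) >= min {print h "\n" $0}'; }

# Keep the first record for each distinct sequence (expects linearized input).
dedupe() { awk '/^>/ {h=$0; next} !seen[$0]++ {print h "\n" $0}'; }

fasta=$(mktemp)
printf '>a\nACGT\nACGT\n>b\nGG\n>c\nACGTACGT\n>d\nTTTTTTTTTT\n' > "$fasta"

result=$(linearize < "$fasta" | min_length 8 | dedupe)
kept=$(printf '%s\n' "$result" | grep -c '^>')

echo "$result"
echo "$kept"
rm -f "$fasta"
```

Keeping each step as a filter that reads standard input and writes standard output makes it easy to reorder steps or swap one of them for a dedicated tool later.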

Why Learn These Commands?

  • Handle complex data directly in the terminal.
  • Efficiently process large-scale sequence datasets.
  • Leverage cutting-edge tools and techniques for modern research.

This guide gives beginners a working set of Bash commands for handling and manipulating FASTA files, and introduces tools such as seqkit and GNU Parallel for larger workloads.
