bioinformatics

Bioinformatics Bash Scripting Tutorial: Processing and Analyzing FASTA Sequences

September 26, 2023 Off By admin
Shares

Bash scripting is crucial in bioinformatics for automating repetitive tasks and data processing. Below, I am providing a step-by-step guide, from starting a bash script to processing a hypothetical bioinformatics dataset.

1. Starting a Bash Script

To start writing a Bash script:

  • Open the terminal
  • Navigate to the directory where you want to create your script.
bash
cd /path/to/directory
  • Create a new script file using a text editor like nano or vi.
bash
nano my_script.sh
  • At the top of your script file, write the shebang line to tell the system to interpret your script as a bash script.
bash
#!/bin/bash
  • Write your script. When you finish, save and close the text editor. If you are using nano, press CTRL + X, then Y to save changes, and Enter to exit.
  • Make your script executable.
bash
chmod +x my_script.sh

2. Scenario: Processing a FASTA File

Assume you have a FASTA file, sequences.fasta, containing DNA sequences, and you want to perform the following tasks:

  • Extract sequences longer than 100 bases.
  • Count the number of sequences.
  • Calculate GC content for each sequence.

3. Step-by-Step Tutorial

Step 3.1: Read the FASTA File Create a Bash script process_fasta.sh and start with the shebang line.

bash
#!/bin/bash

Read the FASTA file line by line.

bash
while read -r line; do
# Process each line here
done < sequences.fasta

Step 3.2: Extract Sequences Longer than 100 Bases Within the while loop, you can extract the sequence identifier and the sequence itself, and check the length of the sequence.

bash
#!/bin/bash

sequence=""
header=""
while read -r line; do
if [[ ${line:0:1} == '>' ]]; then
if [[ -n $sequence ]]; then
length=${#sequence}
if [[ $length -gt 100 ]]; then
echo $header
echo $sequence
fi
fi
header=$line
sequence=""
else
sequence+=$line
fi
done < sequences.fasta

Step 3.3: Count the Number of Sequences and Calculate GC Content Modify the script to count the sequences and calculate the GC content for each sequence.

bash
#!/bin/bash

sequence=""
header=""
count=0
while read -r line; do
if [[ ${line:0:1} == '>' ]]; then
if [[ -n $sequence ]]; then
length=${#sequence}
if [[ $length -gt 100 ]]; then
count=$((count + 1))
gc_count=$(echo $sequence | grep -o "[GCgc]" | wc -l)
gc_content=$(echo "scale=2; ($gc_count / $length) * 100" | bc)
echo -e "$header\nLength: $length\nGC Content: $gc_content%"
fi
fi
header=$line
sequence=""
else
sequence+=$line
fi
done < sequences.fasta

echo "Total number of sequences longer than 100 bases: $count"

Step 3.4: Run the Script Save the script and run it by typing:

bash
chmod +x process_fasta.sh
./process_fasta.sh

This tutorial should give you a general idea of how to write a Bash script for bioinformatics text processing. Please adjust the script to your specific needs and add error handling and input validation as necessary.

Shares