Bioinformatics Bash Scripting Tutorial: Processing and Analyzing FASTA Sequences
September 26, 2023Bash scripting is crucial in bioinformatics for automating repetitive tasks and data processing. Below, I am providing a step-by-step guide, from starting a bash script to processing a hypothetical bioinformatics dataset.
1. Starting a Bash Script
To start writing a Bash script:
- Open the terminal
- Navigate to the directory where you want to create your script.
cd /path/to/directory
- Create a new script file using a text editor like
nano
orvi
.
nano my_script.sh
- At the top of your script file, write the shebang line to tell the system to interpret your script as a bash script.
- Write your script. When you finish, save and close the text editor. If you are using
nano
, pressCTRL + X
, thenY
to save changes, andEnter
to exit. - Make your script executable.
chmod +x my_script.sh
2. Scenario: Processing a FASTA File
Assume you have a FASTA file, sequences.fasta
, containing DNA sequences, and you want to perform the following tasks:
- Extract sequences longer than 100 bases.
- Count the number of sequences.
- Calculate GC content for each sequence.
3. Step-by-Step Tutorial
Step 3.1: Read the FASTA File Create a Bash script process_fasta.sh
and start with the shebang line.
Read the FASTA file line by line.
while read -r line; do
# Process each line here
done < sequences.fasta
Step 3.2: Extract Sequences Longer than 100 Bases Within the while loop, you can extract the sequence identifier and the sequence itself, and check the length of the sequence.
sequence=""
header=""
while read -r line; do
if [[ ${line:0:1} == '>' ]]; then
if [[ -n $sequence ]]; then
length=${#sequence}
if [[ $length -gt 100 ]]; then
echo $header
echo $sequence
fi
fi
header=$line
sequence=""
else
sequence+=$line
fi
done < sequences.fasta
Step 3.3: Count the Number of Sequences and Calculate GC Content Modify the script to count the sequences and calculate the GC content for each sequence.
sequence=""
header=""
count=0
while read -r line; do
if [[ ${line:0:1} == '>' ]]; then
if [[ -n $sequence ]]; then
length=${#sequence}
if [[ $length -gt 100 ]]; then
count=$((count + 1))
gc_count=$(echo $sequence | grep -o "[GCgc]" | wc -l)
gc_content=$(echo "scale=2; ($gc_count / $length) * 100" | bc)
echo -e "$header\nLength: $length\nGC Content: $gc_content%"
fi
fi
header=$line
sequence=""
else
sequence+=$line
fi
done < sequences.fasta
echo "Total number of sequences longer than 100 bases: $count"
Step 3.4: Run the Script Save the script and run it by typing:
chmod +x process_fasta.sh
./process_fasta.sh
This tutorial should give you a general idea of how to write a Bash script for bioinformatics text processing. Please adjust the script to your specific needs and add error handling and input validation as necessary.