Step-by-Step Guide: Convert Multiline FASTA to Single-Line FASTA

December 27, 2024 By admin

Introduction

FASTA and FASTQ are standard sequence formats in bioinformatics. Some software expects each FASTA sequence on a single line, and single-line records are also easier to inspect by hand. This beginner-friendly guide gives step-by-step instructions and scripts for converting multiline FASTA to single-line FASTA using UNIX commands and Perl.

Prerequisites

Basic understanding of the FASTA format:
- A header line starting with > followed by sequence description.
- Sequence lines spanning multiple lines.
Access to a UNIX/Linux shell or Windows Subsystem for Linux (WSL). Alternatively, Perl must be installed on your system.
A sample FASTA file, e.g., input.fasta.
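If no file is at hand, a small multiline FASTA file for experimenting can be created as sketched below. The file name example.fasta and both records are made up for illustration; they are not part of the original guide.

```shell
# Create a tiny multiline FASTA file (hypothetical name and sequences).
cat > example.fasta <<'EOF'
>seq1 first example record
ACGTACGTAC
GTACGTACGT
ACGT
>seq2 second example record
TTGGCCAATT
GGCC
EOF

# Two header lines but five sequence lines: the sequences are wrapped.
grep -c '^>' example.fasta      # number of records
grep -vc '^>' example.fasta     # number of sequence lines
```

After a successful conversion, the two counts should be equal, because each record then has exactly one sequence line.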

Step-by-Step Instructions

Option 1: Using `awk` (Linux/UNIX Shell)

Open a terminal and navigate to the directory containing the FASTA file.
Run the following awk command:
bash
awk '/^>/ {if (NR > 1) printf("\n"); printf("%s\n", $0); next;} {printf("%s", $0);} END {printf("\n");}' input.fasta > output.fasta
- Explanation:
  - ^>: Identifies lines starting with > (headers).
  - if (NR > 1) printf("\n");: Adds a newline before each new header (except the first).
  - printf("%s", $0);: Concatenates sequence lines without a newline.
- Output: A single-line FASTA file named output.fasta.
Verify the output:
bash
cat output.fasta
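To see what the awk program does without touching a real file, it can be fed a few lines from printf. This is only a sketch; the two records below are invented for illustration.

```shell
# Feed a two-record multiline FASTA (made-up data) through the awk program.
linearized=$(printf '>a\nAC\nGT\n>b\nTT\nGG\nCC\n' |
  awk '/^>/ {if (NR > 1) printf("\n"); printf("%s\n", $0); next;}
       {printf("%s", $0);}
       END {printf("\n");}')

printf '%s\n' "$linearized"
# >a
# ACGT
# >b
# TTGGCC
```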

Option 2: Using `perl` (One-Liner Script)

Open a terminal and navigate to the directory containing the FASTA file.
Run the following Perl command:
bash
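perl -pe '$. > 1 and /^>/ ? print "\n" : chomp; END { print "\n" }' input.fasta > output.fasta
- Explanation:
  - $. > 1: $. is the current line number, so the first line is left untouched and no blank line is printed before the first header.
  - /^>/: Detects header lines. For each later header, print "\n" ends the preceding sequence line.
  - chomp: Removes the newline from sequence lines so that consecutive lines are joined.
  - END { print "\n" }: Terminates the last sequence line, which would otherwise lack a final newline.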
Verify the output:
bash
cat output.fasta

Option 3: Using a Dedicated Perl Script

Create a new Perl script file, e.g., linearize_fasta.pl.
Add the following script content:
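perl
#!/usr/bin/perl
use strict;
use warnings;

# Print each header on its own line and join the sequence lines that follow it.
while (<>) {
    if (/^>/) {
        print "\n" if $. > 1;   # end the previous sequence line
        print;
    } else {
        chomp;                  # drop the line break inside a sequence
        print;
    }
}
print "\n";                     # terminate the last sequence line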
Save the script and make it executable:
bash
chmod +x linearize_fasta.pl
Run the script on your input file:
bash
./linearize_fasta.pl input.fasta > output.fasta

Option 4: Using `sed`

For quick inline edits using sed:

Explanation:
- Combines all lines until a header (>) is encountered.
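One possible way to get the behavior described above, joining lines until the next header, is to collect sequence lines in sed's hold space. The sketch below is an assumption about how this could be done, not necessarily the command this option had in mind. The file names and records are hypothetical.

```shell
# Made-up input for illustration.
printf '>a\nAC\nGT\n>b\nTT\nGG\nCC\n' > sed_example.fasta

# Sequence lines are appended to the hold space (H). Each header, and the
# end of the file, flushes the collected sequence as one line: swap it into
# the pattern space (x), delete the embedded newlines, and print it.
sed -n '
/^>/ {
  x
  s/\n//g
  /./p
  x
  p
  s/.*//
  x
  d
}
H
$ {
  x
  s/\n//g
  p
}
' sed_example.fasta > sed_example.single.fasta

cat sed_example.single.fasta
```

The awk and Perl options are easier to read and modify; the sed variant is mainly useful where only sed is available.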

Optional: Quality Check

After converting the FASTA file:

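List the distinct sequence-line lengths:
bash
grep -v "^>" output.fasta | awk '{if(length($0)>0) print length($0);}' | sort -nu
- Prints each distinct sequence-line length once. In a single-line FASTA these are the sequence lengths themselves, so they are not expected to be uniform.
Count headers and sequence lines. If every record has a sequence, the two numbers match once each sequence is on a single line:
bash
grep -c "^>" output.fasta
grep -vc "^>" output.fasta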
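A further check, sketched here under the assumption that the conversion should change only line breaks, is to confirm that the total number of residues is the same before and after. The helper name count_residues and the file names are hypothetical.

```shell
# Made-up multiline input and its linearized counterpart.
printf '>a\nAC\nGT\n>b\nTT\nGG\nCC\n' > check_in.fasta
awk '/^>/ {if (NR > 1) printf("\n"); printf("%s\n", $0); next;} {printf("%s", $0);} END {printf("\n");}' \
  check_in.fasta > check_out.fasta

# Count residues: drop headers, drop line breaks, count what remains.
count_residues() {
  grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

before=$(count_residues check_in.fasta)
after=$(count_residues check_out.fasta)

if [ "$before" -eq "$after" ]; then
  echo "OK: $after residues before and after"
else
  echo "MISMATCH: $before residues before, $after after" >&2
fi
```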

Windows Users

Use WSL to access a UNIX shell and follow the steps above.
Alternatively, install Strawberry Perl, a Perl distribution for Windows, and run the script from Option 3 with perl linearize_fasta.pl input.fasta > output.fasta.
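Files created or edited on Windows often use CRLF line endings. If the carriage returns are left in place, they end up inside the joined sequence lines. One way to handle this, sketched here as an assumption rather than as part of the original instructions, is to strip them with tr before linearizing. The file names are hypothetical.

```shell
# Made-up FASTA file with Windows (CRLF) line endings.
printf '>a\r\nAC\r\nGT\r\n>b\r\nTT\r\n' > windows_example.fasta

# Remove carriage returns first, then apply the same awk program as in Option 1.
tr -d '\r' < windows_example.fasta |
  awk '/^>/ {if (NR > 1) printf("\n"); printf("%s\n", $0); next;} {printf("%s", $0);} END {printf("\n");}' \
  > windows_example.single.fasta

cat windows_example.single.fasta
```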

Conclusion

Converting a multiline FASTA file into a single-line FASTA is simple and can be accomplished using awk, perl, or sed scripts. These approaches are efficient, flexible, and suitable for various operating systems.