
Step-by-Step Guide: Convert Multiline FASTA to Single-Line FASTA

December 27, 2024, by admin

Introduction

FASTA, like FASTQ, is a standard text format for sequence data in bioinformatics. Some tools expect each sequence on a single line, and single-line records are also easier to inspect by eye and to process with line-oriented commands such as grep. This beginner-friendly guide gives step-by-step instructions and scripts for converting multiline FASTA to single-line FASTA using UNIX commands (awk, sed) and Perl.


Prerequisites

  1. Basic understanding of the FASTA format:
    • A header line starting with >, followed by a sequence description.
    • One or more lines of sequence data after each header.
  2. Access to a UNIX/Linux shell or Windows Subsystem for Linux (WSL) for the awk and sed options, or a working Perl installation for the Perl options.
  3. A sample FASTA file, e.g., input.fasta.
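
To make the format concrete, the following sketch writes a small two-record multiline FASTA file. The record names and sequences are invented for illustration only; the file name input.fasta simply matches the name used in the rest of this guide.

```shell
# Write a small multiline FASTA file for testing.
# The records below are made up for illustration only.
cat > input.fasta <<'EOF'
>seq1 hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
RQLEERLGLI
>seq2 another hypothetical record
MSDNELKQ
EOF

# Count the header lines and the total number of lines.
grep -c '^>' input.fasta
wc -l < input.fasta
```

The file has two header lines and six lines in total, so the first record spans three sequence lines and the second spans one.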

Step-by-Step Instructions

Option 1: Using awk (Linux/UNIX Shell)

  1. Open a terminal and navigate to the directory containing the FASTA file.
  2. Run the following awk command:
    bash
    awk '/^>/ {if (NR > 1) printf("\n"); printf("%s\n", $0); next;} {printf("%s", $0);} END {printf("\n");}' input.fasta > output.fasta
    • Explanation:
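      • /^>/: Matches lines starting with > (headers).
      • if (NR > 1) printf("\n");: Ends the previous sequence line before each new header (except the first).
      • printf("%s", $0);: Prints each sequence line without a newline, so consecutive lines are joined.
      • END {printf("\n");}: Terminates the last sequence line.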
    • Output: A single-line FASTA file named output.fasta.
  3. Verify the output:
    bash
    cat output.fasta
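
As a worked example, the sketch below applies the same awk program from step 2 to a small made-up file. The program is laid out over several lines for readability; the function name linearize_fasta and the file demo_multiline.fasta are hypothetical names chosen for this illustration.

```shell
# linearize_fasta: hypothetical helper wrapping the awk program from step 2.
linearize_fasta() {
    awk '
        /^>/ {                        # header line
            if (NR > 1) printf("\n")  # end the previous sequence line
            print                     # print the header itself
            next
        }
        { printf("%s", $0) }          # sequence line: append without newline
        END { printf("\n") }          # terminate the last sequence line
    ' "$1"
}

# Made-up two-record input for demonstration.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGGC\nAA\n' > demo_multiline.fasta

linearize_fasta demo_multiline.fasta
# Output:
# >seq1
# ACGTTTGA
# >seq2
# GGGCAA
```

Wrapping the command in a function makes it easy to reuse on several files, for example linearize_fasta input.fasta > output.fasta.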

Option 2: Using perl (One-Liner Script)

  1. Open a terminal and navigate to the directory containing the FASTA file.
  2. Run the following Perl command:
    bash
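    perl -pe '$. > 1 and /^>/ ? print "\n" : chomp; END { print "\n" }' input.fasta > output.fasta
    • Explanation:
      • -p: Loops over the input and prints each line after the code has run.
      • $. > 1: Skips the first line, so no blank line is printed before the first header.
      • /^>/ ? print "\n" : chomp: On a later header, prints a newline to end the previous sequence; on any other line, removes the trailing newline so that sequence lines are joined.
      • END { print "\n" }: Terminates the last sequence line, which would otherwise lack a final newline.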
  3. Verify the output:
    bash
    cat output.fasta

Option 3: Using a Dedicated Perl Script

  1. Create a new Perl script file, e.g., linearize_fasta.pl.
  2. Add the following script content:
    perl
    #!/usr/bin/perl
    use strict;
    use warnings;

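    # Read the input line by line (files named on the command line, or STDIN).
    while (<>) {
        if (/^>/) {
            # Header: end the previous sequence line, except before the first record.
            print "\n" if $. > 1;
            print;
        } else {
            # Sequence line: strip the newline so consecutive lines are joined.
            chomp;
            print;
        }
    }
    # Terminate the last sequence line.
    print "\n";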

  3. Save the script and make it executable:
    bash
    chmod +x linearize_fasta.pl
  4. Run the script on your input file:
    bash
    ./linearize_fasta.pl input.fasta > output.fasta

Option 4: Using sed

For quick inline edits using sed:

bash
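sed -e ':a' -e '$!N' -e '/^>/!s/\n\([^>]\)/\1/' -e 'ta' -e 'P' -e 'D' input.fasta > output.fasta
  • Explanation:
    • $!N: Appends the next input line to the current one (except on the last line).
    • /^>/!s/\n\([^>]\)/\1/: Unless the current line is a header, removes the newline when the appended line is not a header. ta then repeats the loop, so sequence lines keep being joined until the next header.
    • P and D: Print the completed line, delete it, and continue with the remainder.
    • Headers stay on their own line. Because each join rescans the accumulated sequence, this can be slow for very long sequences; the awk option is a better fit for large genomes.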

Optional: Quality Check

After converting the FASTA file:

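  1. List the distinct sequence lengths:
    bash
    grep -v "^>" output.fasta | awk '{if(length($0)>0) print length($0);}' | sort -nu
    • Each sequence now occupies one line, so these are full sequence lengths. They will usually differ between records; use the list to spot unexpectedly short sequences.
  2. Check that each header is followed by a single line of sequence by counting header lines and sequence lines. The two counts should match:
    bash
    grep -c "^>" output.fasta
    grep -vc "^>" output.fasta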
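
Beyond counting lines, it can be useful to confirm that no sequence characters were lost or altered during conversion. The following is a minimal sketch under stated assumptions: the demo files check_input.fasta and check_output.fasta stand in for your own input and output, and the helper name seq_checksum is hypothetical.

```shell
# Made-up demo files standing in for input.fasta and output.fasta.
printf '>seq1\nACGT\nTTGA\n>seq2\nGGGC\nAA\n' > check_input.fasta
printf '>seq1\nACGTTTGA\n>seq2\nGGGCAA\n'     > check_output.fasta

# seq_checksum: hypothetical helper that checksums all sequence characters,
# ignoring headers and line breaks.
seq_checksum() {
    grep -v '^>' "$1" | tr -d '\n' | cksum
}

# 1. Header lines and sequence lines should be equal in number.
headers=$(grep -c '^>' check_output.fasta)
seq_lines=$(grep -vc '^>' check_output.fasta)

# 2. Sequence content should be identical before and after conversion.
if [ "$headers" -eq "$seq_lines" ] &&
   [ "$(seq_checksum check_input.fasta)" = "$(seq_checksum check_output.fasta)" ]; then
    echo "OK: $headers records, sequence content unchanged"
else
    echo "MISMATCH: inspect check_output.fasta"
fi
```

Note that the checksum covers the concatenated sequences only, so it does not detect a shifted record boundary on its own; the line-count comparison covers that case. A file that contains records with empty sequences would report a mismatch in the counts even if the conversion was correct.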

Windows Users

  1. Use WSL to access a UNIX shell and follow the above steps.
  2. Alternatively, install Strawberry Perl, a Perl distribution for Windows, and use the Perl options.
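
Files created or edited on Windows may use CRLF line endings, in which case the carriage returns would end up embedded in the joined sequence. A minimal sketch, using made-up file names, that strips them with tr before applying the awk program from Option 1:

```shell
# Made-up demo file with Windows (CRLF) line endings.
printf '>seq1\r\nACGT\r\nTTGA\r\n' > crlf_input.fasta

# Strip carriage returns, then join sequence lines with the awk program
# from Option 1.
tr -d '\r' < crlf_input.fasta |
    awk '/^>/ {if (NR > 1) printf("\n"); printf("%s\n", $0); next;} {printf("%s", $0);} END {printf("\n");}' \
    > crlf_output.fasta

# Count the carriage returns left in the output; this should print 0.
tr -cd '\r' < crlf_output.fasta | wc -c
```

If you are unsure whether a file uses CRLF line endings, running the same tr -cd '\r' count on the input file shows how many carriage returns it contains.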

Conclusion

Converting a multiline FASTA file into a single-line FASTA is simple and can be accomplished with awk, perl, or sed. The awk and sed commands run in any UNIX-like shell, including WSL, and the Perl options run wherever Perl is installed, including Windows.
