Best bioinformatics one-liners

January 9, 2025 By admin

This step-by-step guide collects practical tips, tricks, and some of the best bioinformatics one-liners for common tasks. Each one is meant to be quick and efficient for simple jobs, using Linux command-line tools such as awk, perl, rev, tr, paste, csplit, and xargs.

1. Get Sequence Length Distribution from a FASTQ File

This command calculates the distribution of sequence lengths in a FASTQ file.

zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'

Explanation:

zcat: Decompresses the .gz file on the fly.
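awk: Processes the second line of every four-line record (NR%4 == 2), which is the sequence line in FASTQ.
lengths[length($0)]++: Counts occurrences of each sequence length.
END: Prints each length and its count after the last line has been read; the order is not sorted, so append | sort -n if needed.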
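
To see the output format on a small input, the sketch below wraps the same awk program in a shell function and adds sort -n so the lengths come out in ascending order. The function name fastq_length_dist and the three toy reads are made up for illustration.

```shell
# Hypothetical helper: read FASTQ on stdin, print "length count" sorted by length.
fastq_length_dist() {
    awk 'NR % 4 == 2 { lengths[length($0)]++ }
         END { for (l in lengths) print l, lengths[l] }' | sort -n
}

# Toy FASTQ with three reads of lengths 4, 4 and 6 (made-up data).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nACGTAC\n+\nIIIIII\n' | fastq_length_dist
# prints:
# 4 2
# 6 1
```

The first column is the read length and the second is the number of reads with that length.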

2. Reverse Complement a DNA Sequence

This command reverse complements a DNA sequence.

echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'

Explanation:

rev: Reverses the sequence.
tr 'ACTG' 'TGAC': Translates each nucleotide to its complement. Only uppercase A, C, T, and G are translated; N and any other characters pass through unchanged.
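
The tr sets in the command above cover only uppercase bases. As a minimal sketch, assuming lowercase (soft-masked) bases should be complemented as well, a hypothetical helper named revcomp could extend both sets:

```shell
# Hypothetical helper: reverse complement each line on stdin.
# Handles upper- and lowercase A, C, G, T; every other character (such as N) is left as-is.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

printf 'ATTGCTATGCTNNNT\n' | revcomp
# prints: ANNNAGCATAGCAAT
```

Because rev works line by line, this is meant for bare sequences, not for FASTA records with header lines.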

3. Split a Multi-FASTA File into Individual Files

This command splits a multi-FASTA file into individual files using csplit.

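csplit -z -q -n 4 -f sequence_ sequences.fasta '/^>/' '{*}'

Explanation:

csplit: Splits the file at each line that starts with > (a FASTA header); '{*}' repeats the split for every match, which is a GNU csplit feature. Quoting both patterns keeps the shell from interpreting them.
-z: Removes empty output files, such as the empty piece before the first header.
-q: Suppresses the size report that csplit otherwise prints for each output file.
-f sequence_: Prefix for output files.
-n 4: Uses 4 digits for numbering.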
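
A quick way to check the result is to split a toy file in a scratch directory. The sketch below assumes GNU csplit (for '{*}' and -z); the two records and the temporary directory are made up for illustration.

```shell
# Work in a throwaway directory so the demo does not touch real data.
workdir=$(mktemp -d)
printf '>seq1\nACGT\nACGT\n>seq2\nTTTT\n' > "$workdir/sequences.fasta"

# Same options as the command above; the patterns are quoted so the shell passes them unchanged.
( cd "$workdir" && csplit -z -q -n 4 -f sequence_ sequences.fasta '/^>/' '{*}' )

# Report how many header lines each output piece contains (one per file is expected).
grep -c '^>' "$workdir"/sequence_*
```

Each output file should hold exactly one record; remove the scratch directory with rm -rf "$workdir" when done.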

4. Linearize a FASTA File

This command joins each multi-line FASTA record onto a single line, with the header and the sequence separated by a tab.

awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}' file.fasta

Explanation:

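awk: Processes each line.
/^>/: Matches FASTA headers; every header except the first is preceded by a newline that ends the previous record.
printf: Prints each header followed by a tab, then appends the sequence lines without line breaks.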
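
As a small illustration of the output, the sketch below puts the same logic into a hypothetical function named linearize_fasta that reads from standard input. One deliberate difference: the final newline is printed only if at least one header was seen. The toy records are made up.

```shell
# Hypothetical helper: one line per record, "header<TAB>sequence".
linearize_fasta() {
    awk '/^>/ { if (n > 0) printf "\n"; n++; printf "%s\t", $0; next }
         { printf "%s", $0 }
         END { if (n > 0) printf "\n" }'
}

printf '>seq1\nACGT\nAC\n>seq2\nTT\n' | linearize_fasta
# prints (with a tab between header and sequence):
# >seq1	ACGTAC
# >seq2	TT
```

Piping the result through tr '\t' '\n' turns it into a FASTA file with exactly two lines per record.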

5. Convert FASTQ to FASTA

This command converts a FASTQ file to FASTA format.

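zcat file.fastq.gz | paste - - - - | perl -F'\t' -ane 'print ">", substr($F[0], 1), "\n$F[1]\n";' | gzip -c > file.fasta.gz

Explanation:

paste - - - -: Combines every 4 lines into a single tab-separated line.
perl -F'\t': Splits each line on tabs, so a header that contains spaces stays in one field; the code replaces the leading @ with > and prints the header and the sequence.
gzip -c: Compresses the output.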
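
The same conversion can be sketched without perl. The hypothetical fastq_to_fasta function below uses awk instead (a swapped-in technique, not the pipeline above) and assumes strict four-line FASTQ records; the toy reads are made up.

```shell
# Hypothetical helper: FASTQ on stdin, FASTA on stdout.
# Line 1 of each record becomes the header (leading @ replaced by >), line 2 is the sequence.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }'
}

printf '@read1 1:N:0\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' | fastq_to_fasta
# prints:
# >read1 1:N:0
# ACGT
# >read2
# GGCC
```

For compressed input and output it can sit between zcat and gzip -c in the same way as the perl step.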

6. Extract Sequences by IDs from a FASTA File

This command extracts sequences from a FASTA file based on a list of IDs.

cut -c 2- ids.txt | xargs -n 1 samtools faidx file.fasta > out.fasta

Explanation:

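cut -c 2-: Drops the first character of each line, which assumes every ID in ids.txt starts with >.
xargs -n 1: Passes one ID at a time to samtools faidx.
samtools faidx: Extracts the named sequence, building the file.fasta.fai index first if it does not exist yet.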
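
Where samtools is not installed, a rough awk-only alternative is possible. This is a different technique from samtools faidx: it scans the whole FASTA file once instead of using an index. The function name extract_by_id is hypothetical, and the sketch assumes that ids.txt is not empty and that an ID is the first whitespace-delimited word of a header.

```shell
# Hypothetical helper. usage: extract_by_id ids.txt file.fasta
extract_by_id() {
    awk 'NR == FNR { sub(/^>/, ""); if ($1 != "") want[$1] = 1; next }  # first file: load IDs
         /^>/      { keep = (substr($1, 2) in want) }                     # header: wanted or not
         keep' "$1" "$2"                                                  # print lines of wanted records
}

# Toy data in a scratch directory (made-up records).
demo=$(mktemp -d)
printf '>seq2\n' > "$demo/ids.txt"
printf '>seq1 first\nAAAA\n>seq2 second\nCCCC\nGGGG\n>seq3\nTTTT\n' > "$demo/file.fasta"
extract_by_id "$demo/ids.txt" "$demo/file.fasta"
# prints:
# >seq2 second
# CCCC
# GGGG
rm -rf "$demo"
```

Records come out in the order they appear in the FASTA file, not in the order of ids.txt.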

7. Reproducible Subsampling of a FASTQ File

This command subsamples a FASTQ file reproducibly.

cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq

Explanation:

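paste - - - -: Puts each four-line read on one tab-separated line so that a read is kept or dropped as a unit.
srand(1234): Sets a fixed seed, so the same reads are selected on every run.
rand() < 0.01: Keeps each read with probability 0.01, so about 1% of reads are retained.
tr '\t' '\n': Restores FASTQ format.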
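
The sketch below parameterizes the seed and the fraction and checks that two runs give identical output. The function name subsample_fastq and the generated reads are made up; like the pipeline above, it assumes headers contain no tab characters.

```shell
# Hypothetical helper. usage: subsample_fastq SEED FRACTION < in.fq > out.fq
subsample_fastq() {
    paste - - - - |
        awk -v seed="$1" -v frac="$2" 'BEGIN { srand(seed) } rand() < frac' |
        tr '\t' '\n'
}

# Generate 200 toy reads (made-up data).
make_toy_fastq() {
    awk 'BEGIN { for (i = 1; i <= 200; i++) printf "@r%d\nACGT\n+\nIIII\n", i }'
}

# Two runs with the same seed select the same reads.
make_toy_fastq | subsample_fastq 1234 0.5 > run1.tmp
make_toy_fastq | subsample_fastq 1234 0.5 > run2.tmp
cmp -s run1.tmp run2.tmp && echo "identical"
rm -f run1.tmp run2.tmp
```

Running the same seed and fraction on both files of a paired-end data set should keep the pairs in sync, provided both files list the reads in the same order.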

8. Count the Number of Reads in a FASTQ File

This command counts the number of reads in a FASTQ file.

cat file.fq | echo $((`wc -l`/4))

Explanation:

wc -l: Counts the total number of lines.
echo $((.../4)): Divides by 4 to get the number of reads.
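
Dividing by 4 silently gives a wrong answer when the file is truncated. As a hedged sketch, a hypothetical count_reads function could refuse input whose line count is not a multiple of 4 (this assumes strict four-line records without wrapped sequences):

```shell
# Hypothetical helper: count FASTQ reads on stdin, failing on incomplete records.
count_reads() {
    awk 'END {
        if (NR % 4 != 0) {
            print "line count is not a multiple of 4" > "/dev/stderr"
            exit 1
        }
        print NR / 4
    }'
}

printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n' | count_reads
# prints: 2
```

For compressed files, the same function can read from zcat file.fastq.gz.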

9. Convert Multi-Line FASTA to 60 Characters per Line

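This command rewraps a FASTA file to 60 characters per line. Each input line is wrapped independently, so the sequences should already be on one line each.

awk -v FS= '/^>/{print;next}{for (i=0;i<NF/60;i++) {for (j=1;j<=60;j++) printf "%s", $(i*60 +j); print ""}}' file.fa

Explanation:

FS=: Treats each character as a separate field (this works in gawk and mawk, but an empty FS is not guaranteed by POSIX).
for (i=0;i<NF/60;i++): Splits the sequence into 60-character chunks; using < rather than <= avoids an extra blank line when the length is an exact multiple of 60.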
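
For input whose sequences span several lines, a variant that first joins the lines of a record and then cuts them with substr may be easier to reason about. The sketch below is an alternative technique, not the command above; the names wrap_fasta and emit_wrapped are hypothetical, and the width argument defaults to 60.

```shell
# Hypothetical helper. usage: wrap_fasta [WIDTH] < in.fa
wrap_fasta() {
    awk -v w="${1:-60}" '
        # Print the buffered sequence in chunks of w characters, then clear the buffer.
        function emit_wrapped(   i) {
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { emit_wrapped(); print; next }   # new record: flush the previous sequence
        { seq = seq $0 }                       # sequence line: append to the buffer
        END { emit_wrapped() }'
}

# Width 4 keeps the toy example short (made-up records).
printf '>s1\nACGTAC\nGT\nA\n>s2\nTT\n' | wrap_fasta 4
# prints:
# >s1
# ACGT
# ACGT
# A
# >s2
# TT
```

This holds one whole sequence in memory at a time, which is usually fine for reads and genes but worth keeping in mind for chromosome-sized records.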

10. Parallelize Tasks with `xargs`

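This command submits one cluster job per file, using xargs to build each submission. The parallelism comes from the scheduler, which runs the jobs side by side.

ls Sample_*/*.fastq.gz | xargs -I {} sh -c 'echo "fastqc {}" | qsub -cwd'

Explanation:

xargs -I {}: Runs the command once per input line, replacing {} with the file name (-I {} is the current spelling of the deprecated -i).
sh -c '...': Runs a small pipeline per file in which echo writes the fastqc command line and qsub reads it from standard input as the job script.
qsub -cwd: Submits the job to the cluster and runs it from the current working directory (a Grid Engine option).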

11. Immutable Files with `chattr`

This command makes files immutable to prevent accidental changes.

sudo chattr +i file.fq  # Archive a file
sudo chattr -i file.fq  # Unarchive it

Explanation:

chattr +i: Makes the file immutable, so it cannot be modified, renamed, or deleted, even by root, until the flag is cleared.
chattr -i: Clears the immutable flag again.

12. Convert BAM to BED

This command converts a BAM file to BED format.

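samtools view -F 4 file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=0;$tmp=$F[5];$tmp =~ s/(\d+)[MDN=X]/$length+=$1/eg;$start=$F[3]-1;print "$F[2]\t$start\t".($start+$length)."\t$F[0]\t0\t$strand\n";' > file.bed

Explanation:

samtools view -F 4: Streams the alignments as SAM text, skipping unmapped reads.
perl: Derives the strand from flag bit 16 and sums the CIGAR operations that consume the reference (M, D, N, =, X) to get the aligned length.
Coordinates: SAM positions are 1-based, while BED intervals are 0-based and half-open, so the start is POS - 1 and the end is the start plus the aligned length.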
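
The coordinate arithmetic is easiest to check on a toy record. The sketch below redoes the conversion in awk on SAM text (the kind of output samtools view produces), so it runs without a BAM file. It is an illustration, not the perl one-liner above: the function name sam_to_bed and both records are made up, and it assumes BED's 0-based, half-open intervals with the reference-consuming CIGAR operations M, D, N, = and X.

```shell
# Hypothetical helper: SAM text on stdin, six-column BED on stdout.
sam_to_bed() {
    awk -F '\t' '
        /^@/      { next }                       # skip SAM header lines
        $3 == "*" { next }                       # skip unmapped records
        {
            cigar = $6; reflen = 0
            # Consume the CIGAR one operation at a time, e.g. "5M", then "2D", ...
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                n  = substr(cigar, 1, RLENGTH - 1) + 0
                op = substr(cigar, RLENGTH, 1)
                if (op ~ /[MDN=X]/) reflen += n  # only these advance along the reference
                cigar = substr(cigar, RLENGTH + 1)
            }
            strand = (int($2 / 16) % 2) ? "-" : "+"   # flag bit 16 marks the reverse strand
            printf "%s\t%d\t%d\t%s\t0\t%s\n", $3, $4 - 1, $4 - 1 + reflen, $1, strand
        }'
}

# read1: POS 100, CIGAR 5M2D3M1I2M covers 5+2+3+2 = 12 reference bases -> BED 99..111
# read2: reverse strand (flag 16), POS 1, CIGAR 4M                     -> BED 0..4
printf 'read1\t0\tchr1\t100\t60\t5M2D3M1I2M\t*\t0\t0\tACGTACGTACG\tIIIIIIIIIII\nread2\t16\tchr2\t1\t60\t4M\t*\t0\t0\tACGT\tIIII\n' | sam_to_bed
# prints (tab-separated):
# chr1	99	111	read1	0	+
# chr2	0	4	read2	0	-
```

The insertion (1I) adds to the read length but not to the interval, which is why it is left out of the sum.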

13. Deinterleave a FASTQ File

This command splits an interleaved FASTQ file into two files.

cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq

Explanation:

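paste: Combines 8 lines (one read pair) into one tab-separated line.
tee >(...): Copies the stream into a process substitution, a bash feature, that writes the first read of each pair, while the original stream continues down the pipe.
cut -f1-4 and cut -f5-8: Select the first and second read of each pair; tr '\t' '\n' then restores one field per line.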
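
In shells without process substitution, the same split can be sketched with a single awk pass that routes lines by their position within each eight-line block. This is an alternative technique, not the tee pipeline above; the function name deinterleave_fastq, the file names, and the toy reads are made up.

```shell
# Hypothetical helper. usage: deinterleave_fastq out1.fq out2.fq < interleaved.fq
deinterleave_fastq() {
    awk -v out1="$1" -v out2="$2" '
        (NR - 1) % 8 < 4 { print > out1; next }   # lines 1-4 of each block: first read
        { print > out2 }                          # lines 5-8 of each block: second read
    '
}

# Toy interleaved file with two read pairs, written to a scratch directory.
demo=$(mktemp -d)
printf '@a/1\nAC\n+\nII\n@a/2\nGT\n+\nII\n@b/1\nCC\n+\nII\n@b/2\nGG\n+\nII\n' |
    deinterleave_fastq "$demo/out1.fq" "$demo/out2.fq"
grep '^@' "$demo/out1.fq"
# prints:
# @a/1
# @b/1
rm -rf "$demo"
```

Both output files end up with the same number of reads as long as the input really is interleaved pair by pair.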

14. Calculate Sequence Lengths in a FASTA File

This command calculates the length of each sequence in a FASTA file.

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa

Explanation:

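awk: Prints each header, adds up the lengths of the sequence lines that follow it, and prints the total when the next header or the end of the file is reached.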
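
For downstream tools, a two-column table is often handier than alternating header and length lines. The sketch below is a variation under that assumption: a hypothetical fasta_lengths function that prints the sequence ID (the first word of the header) and the length separated by a tab, and that also reports empty records as 0.

```shell
# Hypothetical helper: FASTA on stdin, "id<TAB>length" on stdout.
fasta_lengths() {
    awk '/^>/ { if (n++) print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (n) print id "\t" len }'
}

# Toy records (made-up data); s2 has no sequence lines at all.
printf '>s1 some description\nACGT\nAC\n>s2\n>s3\nT\n' | fasta_lengths
# prints (tab-separated):
# s1	6
# s2	0
# s3	1
```

The output can be sorted by length with sort -k2,2n.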

15. Run FastQC in Parallel

This command runs FastQC on multiple FASTQ files in parallel.

ls Sample_*/*.fastq.gz | xargs -P 8 -I {} fastqc {}

Explanation:

xargs -P 8: Runs up to 8 jobs in parallel.
-I {}: Starts one fastqc process per input line, with {} replaced by the file name.
fastqc {}: Runs FastQC on each file.
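
Parsing ls output breaks on file names that contain spaces. A hedged variant, assuming a find and xargs that support -print0, -0 and -P (GNU and BSD versions do), passes NUL-delimited names instead. The function name run_on_fastqs is hypothetical.

```shell
# Hypothetical helper. usage: run_on_fastqs DIR JOBS COMMAND [ARGS...]
# Runs COMMAND once per *.fastq.gz file found under DIR, with up to JOBS processes at a time.
run_on_fastqs() {
    fq_dir=$1
    fq_jobs=$2
    shift 2
    find "$fq_dir" -name '*.fastq.gz' -print0 | xargs -0 -n 1 -P "$fq_jobs" "$@"
}

# Example (requires FastQC to be installed):
#   run_on_fastqs . 8 fastqc
```

Unlike the Sample_*/ glob, find searches all subdirectories of the given directory, so point it at the directory that holds the samples.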

These one-liners cover many routine bioinformatics tasks. For more complex workflows, consider writing scripts or using specialized tools such as Biopython or Snakemake.