Best bioinformatics one-liners
January 9, 2025
Here’s a step-by-step manual with recent tips, tricks, and some of the best bioinformatics one-liners for common tasks. These one-liners are designed to be efficient and quick for simple tasks, leveraging Linux command-line tools like awk, sed, rev, tr, csplit, and more.
1. Get Sequence Length Distribution from a FASTQ File
This command calculates the distribution of sequence lengths in a FASTQ file.
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'
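Explanation:
zcat: Decompresses the .gz file on the fly.
awk 'NR%4 == 2': Selects the second line of every four-line record (the sequence line in FASTQ).
lengths[length($0)]++: Counts occurrences of each sequence length.
END: Prints the length distribution after the last line. The order of the output lines is unspecified; append | sort -n to sort by length.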
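The same counting logic can be wrapped in a small reusable function. This is a minimal sketch, assuming uncompressed four-line FASTQ on standard input; the function name fastq_length_hist is hypothetical, and the output is sorted by length because awk does not guarantee any order for array traversal.

```shell
# Print "length<TAB>count" for every read length, shortest first.
# Assumes uncompressed FASTQ on stdin with exactly 4 lines per record.
fastq_length_hist() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' |
        sort -n
}

# Usage (file name is a placeholder):
#   zcat file.fastq.gz | fastq_length_hist
```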
2. Reverse Complement a DNA Sequence
This command reverse complements a DNA sequence.
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'
Explanation:
rev: Reverses the sequence.
tr 'ACTG' 'TGAC': Translates each nucleotide to its complement. Characters not listed, such as N, pass through unchanged.
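If the input may contain lowercase bases or IUPAC ambiguity codes, the translation table can be extended. The sketch below is one way to do that; the function name revcomp is hypothetical, and it assumes one sequence per input line with no FASTA headers.

```shell
# Reverse complement each input line.
# Handles upper/lowercase ACGT and the IUPAC pairs R/Y, K/M, B/V, D/H.
# S, W and N are their own complements, so they pass through unchanged.
revcomp() {
    rev | tr 'ACGTacgtRYKMBVDHrykmbvdh' 'TGCAtgcaYRMKVBHDyrmkvbhd'
}

# Usage:
#   echo 'ATTGCTATGCTNNNT' | revcomp
```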
3. Split a Multi-FASTA File into Individual Files
This command splits a multi-FASTA file into individual files using csplit.
csplit -z -q -n 4 -f sequence_ sequences.fasta '/^>/' '{*}'
Explanation:
csplit: Splits the file at each line that starts with > (a FASTA header); '{*}' repeats the split for every match.
-z: Suppresses empty output files.
-q: Suppresses the byte counts that csplit normally prints.
-f sequence_: Prefix for output files.
-n 4: Uses 4 digits for numbering.
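csplit numbers its output files. When files named after the sequence IDs are more convenient, an awk-based splitter is a possible alternative. This is a sketch rather than a drop-in replacement: the function name split_fasta_by_id is hypothetical, the output directory must already exist, and it assumes IDs are unique and contain no slash.

```shell
# Write each FASTA record to <dir>/<id>.fa, where <id> is the header
# text up to the first whitespace, without the leading ">".
# $1 = input FASTA, $2 = existing output directory.
split_fasta_by_id() {
    awk -v dir="$2" '
        /^>/ {
            if (out != "") close(out)      # avoid running out of file handles
            out = dir "/" substr($1, 2) ".fa"
        }
        out != "" { print > out }' "$1"
}

# Usage (paths are placeholders):
#   mkdir -p split && split_fasta_by_id sequences.fasta split
```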
4. Linearize a FASTA File
This command converts a multi-line FASTA file into a single-line format.
awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}' file.fasta
Explanation:
awk: Processes each line.
/^>/: Matches FASTA headers.
printf: Prints each header and its joined sequence on one line, separated by a tab.
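One reason to linearize is that each record can then be filtered with ordinary line tools. As an illustration, the sketch below keeps only sequences of at least a given length and writes them back out as two-line FASTA. The function name fasta_min_length is hypothetical, and it assumes headers contain no tab characters.

```shell
# Keep FASTA records whose sequence has at least $1 bases.
# Reads FASTA on stdin, writes two-line FASTA (header, sequence) on stdout.
fasta_min_length() {
    awk '/^>/ { if (n > 0) printf "\n"; n++; printf "%s\t", $0; next }
         { printf "%s", $0 }
         END { if (n > 0) printf "\n" }' |
        awk -F '\t' -v min="$1" 'length($2) >= min { print $1; print $2 }'
}

# Usage (file name is a placeholder):
#   fasta_min_length 200 < file.fasta
```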
5. Convert FASTQ to FASTA
This command converts a FASTQ file to FASTA format.
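zcat file.fastq.gz | paste - - - - | perl -F'\t' -ane 'print ">", substr($F[0], 1), "\n$F[1]\n";' | gzip -c > file.fasta.gz
Explanation:
paste - - - -: Combines every 4 lines into a single tab-separated line.
perl -F'\t' -ane: Splits each line on tabs, strips the leading @ from the header, and prints the header and sequence as a FASTA record. Splitting on tabs rather than on any whitespace keeps the sequence in the second field whether or not the header contains spaces.
gzip -c: Compresses the output.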
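The conversion can also be done with awk alone by acting on line numbers instead of joining lines. This is a minimal sketch, assuming four-line FASTQ records on standard input; the function name fastq_to_fasta is hypothetical.

```shell
# Convert 4-line FASTQ on stdin to FASTA on stdout.
# Line 1 of each record: replace the leading "@" with ">".
# Line 2 of each record: the sequence, printed unchanged.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }'
}

# Usage (file names are placeholders):
#   zcat file.fastq.gz | fastq_to_fasta | gzip -c > file.fasta.gz
```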
6. Extract Sequences by IDs from a FASTA File
This command extracts sequences from a FASTA file based on a list of IDs.
cut -c 2- ids.txt | xargs -n 1 samtools faidx file.fasta > out.fasta
Explanation:
cut -c 2-: Removes the leading > from each ID (this assumes every line in ids.txt starts with >).
xargs -n 1: Passes each ID to samtools faidx, one call per ID.
samtools faidx: Extracts each named sequence.
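Where samtools is not available, a plain awk filter can do a similar job. Note that this swaps in a different technique: it scans the whole FASTA file instead of using an index, and it emits records in FASTA order rather than in ID-list order. The function name fasta_extract_ids is hypothetical, and the sketch assumes a non-empty ID file with one ID per line, with or without a leading >.

```shell
# Print the FASTA records whose ID appears in the ID list.
# $1 = file with one ID per line (leading ">" optional), $2 = FASTA file.
fasta_extract_ids() {
    awk 'NR == FNR {                        # first file: load the wanted IDs
             id = $1; sub(/^>/, "", id)
             if (id != "") want[id] = 1
             next
         }
         /^>/ { keep = (substr($1, 2) in want) }   # header decides the record
         keep' "$1" "$2"
}

# Usage (file names are placeholders):
#   fasta_extract_ids ids.txt file.fasta > out.fasta
```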
7. Reproducible Subsampling of a FASTQ File
This command subsamples a FASTQ file reproducibly.
cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq
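Explanation:
paste - - - -: Joins each 4-line record into one tab-separated line, so a read is kept or dropped as a unit.
srand(1234): Sets a seed for reproducibility.
rand() < 0.01: Keeps each read with probability 0.01, so roughly 1% of reads.
tr '\t' '\n': Restores FASTQ format.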
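To reuse the same idea with different fractions and seeds, both values can be passed in as parameters. This is a sketch under the same assumptions as the one-liner (four-line records, no tabs inside the lines); the function name subsample_fastq is hypothetical. Results are reproducible for a given awk implementation, but different awk implementations may pick different reads for the same seed.

```shell
# Keep each read with probability $1, using $2 as the random seed.
# Reads 4-line FASTQ on stdin, writes FASTQ on stdout.
subsample_fastq() {
    paste - - - - |
        awk -v frac="$1" -v seed="$2" 'BEGIN { srand(seed) } rand() < frac' |
        tr '\t' '\n'
}

# Usage (file names are placeholders):
#   subsample_fastq 0.01 1234 < file.fq > out.fq
```

For paired-end data, running the function on both files with the same seed and fraction keeps the same positions in each, provided the two files list their reads in the same order.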
8. Count the Number of Reads in a FASTQ File
This command counts the number of reads in a FASTQ file.
echo $(( $(wc -l < file.fq) / 4 ))
Explanation:
wc -l: Counts the total number of lines.
echo $((.../4)): Divides by 4 to get the number of reads, assuming four lines per record.
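Dividing the line count by 4 silently gives a wrong answer for a truncated file. The sketch below adds a sanity check for that case; the function name count_fastq_reads is hypothetical, and it still assumes that every record occupies exactly four lines.

```shell
# Print the number of reads in 4-line FASTQ read from stdin.
# Exits with status 1 if the line count is not a multiple of 4.
count_fastq_reads() {
    awk 'END {
             if (NR % 4 != 0) {
                 print "line count is not a multiple of 4" > "/dev/stderr"
                 exit 1
             }
             print NR / 4
         }'
}

# Usage (file name is a placeholder):
#   zcat file.fastq.gz | count_fastq_reads
```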
9. Convert Multi-Line FASTA to 60 Characters per Line
This command reformats a FASTA file to 60 characters per line.
awk '/^>/{print;next}{for (i=1;i<=length($0);i+=60) print substr($0,i,60)}' file.fa
Explanation:
/^>/{print;next}: Prints header lines unchanged.
for (i=1;i<=length($0);i+=60) print substr($0,i,60): Splits each sequence line into 60-character chunks. Each input line is wrapped on its own, so the input should have every sequence on a single line.
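When the input sequences already span several lines of some other width, the lines of each record need to be joined before they are re-wrapped. The following sketch does both steps in one pass; the function name wrap_fasta is hypothetical, and it assumes the width argument is a positive integer.

```shell
# Re-wrap FASTA sequences to a fixed line width.
# $1 = line width (positive integer, e.g. 60). Reads stdin, writes stdout.
wrap_fasta() {
    awk -v w="$1" '
        # Print the buffered sequence in chunks of w characters, then reset it.
        function emit(   i) {
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { emit(); print; next }   # new record: flush the previous one
        { seq = seq $0 }               # join the sequence lines
        END { emit() }'
}

# Usage (file name is a placeholder):
#   wrap_fasta 60 < file.fa
```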
10. Parallelize Tasks with xargs
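This command parallelizes tasks using xargs.
ls Sample_*/*.fastq.gz | xargs -I {} sh -c 'echo fastqc {} | qsub -cwd'
Explanation:
xargs -I {}: Replaces {} with each input line (-I supersedes the deprecated -i).
sh -c '...': Runs the pipe once per file, so the fastqc command line is fed to qsub instead of only being printed.
qsub -cwd: Submits the job to a cluster scheduler and runs it from the current working directory.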
11. Immutable Files with chattr
This command makes files immutable to prevent accidental changes.
sudo chattr +i file.fq # Archive a file
sudo chattr -i file.fq # Unarchive it
Explanation:
chattr +i: Makes the file immutable, so it cannot be modified, renamed, or deleted until the flag is cleared.
chattr -i: Reverses the immutability.
12. Convert BAM to BED
This command converts a BAM file to BED format.
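samtools view -F 4 file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=0;$tmp=$F[5];$tmp =~ s/(\d+)[MDN=X]/$length+=$1/eg;print "$F[2]\t".($F[3]-1)."\t".($F[3]-1+$length)."\t$F[0]\t0\t$strand\n";' > file.bed
Explanation:
samtools view -F 4: Extracts alignments, skipping unmapped reads.
perl: Derives the strand from flag bit 16 and the reference span from the CIGAR operations that consume the reference (M, D, N, =, X), then writes 0-based, half-open BED coordinates (start = POS - 1, end = start + span).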
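The coordinate arithmetic is easier to follow when written out step by step. The sketch below does the conversion in awk on SAM text, assuming tab-separated SAM on standard input; the function name sam_to_bed is hypothetical. SAM positions are 1-based while BED intervals are 0-based and half-open, so the start is POS - 1 and the end is the start plus the number of reference bases covered by the alignment.

```shell
# Convert SAM text on stdin to 6-column BED on stdout.
sam_to_bed() {
    awk -F '\t' -v OFS='\t' '
        /^@/ { next }                      # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }      # flag bit 4: read is unmapped
        {
            cigar = $6; span = 0
            # Walk the CIGAR string one operation at a time and add up
            # the operations that consume reference bases: M, D, N, =, X.
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                len = substr(cigar, 1, RLENGTH - 1) + 0
                op  = substr(cigar, RLENGTH, 1)
                if (op ~ /[MDN=X]/) span += len
                cigar = substr(cigar, RLENGTH + 1)
            }
            strand = (int($2 / 16) % 2 == 1) ? "-" : "+"   # flag bit 16
            print $3, $4 - 1, $4 - 1 + span, $1, 0, strand
        }'
}

# Usage (file names are placeholders):
#   samtools view file.bam | sam_to_bed > file.bed
```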
13. Deinterleave a FASTQ File
This command splits an interleaved FASTQ file into two files.
cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq
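Explanation:
paste - - - - - - - -: Combines every 8 lines (one read pair) into one tab-separated line.
tee >(...): Sends a copy of the stream to a second pipeline through process substitution, which requires bash or zsh.
cut -f1-4 and cut -f5-8: Select the first and second read of each pair.
tr '\t' '\n': Restores the four-line FASTQ layout.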
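In shells without process substitution, awk can route lines to the two output files directly by line number. This is a sketch, assuming strictly alternating four-line records (first mate, then second mate); the function name deinterleave_fastq is hypothetical.

```shell
# Split interleaved FASTQ on stdin into two files.
# $1 = output file for first mates, $2 = output file for second mates.
deinterleave_fastq() {
    awk -v out1="$1" -v out2="$2" '
        {
            # Lines 1-4 of every 8-line block belong to the first mate.
            if ((NR - 1) % 8 < 4) print > out1
            else                  print > out2
        }'
}

# Usage (file names are placeholders):
#   deinterleave_fastq out1.fq out2.fq < file.fq
```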
14. Calculate Sequence Lengths in a FASTA File
This command calculates the length of each sequence in a FASTA file.
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa
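Explanation:
awk: Prints each header, adds up the lengths of the sequence lines that follow, and prints the total before the next header and at the end of the file.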
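For downstream sorting or plotting, a two-column table is often easier to work with than alternating header and length lines. The sketch below prints one ID and length per line and also reports empty records as length 0; the function name fasta_lengths is hypothetical.

```shell
# Print "id<TAB>length" for every record of a FASTA file on stdin.
# The id is the header text up to the first whitespace, without ">".
fasta_lengths() {
    awk '/^>/ {
             if (id != "") printf "%s\t%d\n", id, len
             id = substr($1, 2); len = 0
             next
         }
         { len += length($0) }
         END { if (id != "") printf "%s\t%d\n", id, len }'
}

# Usage (file name is a placeholder):
#   fasta_lengths < file.fa | sort -k2,2nr | head
```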
15. Run FastQC in Parallel
This command runs FastQC on multiple FASTQ files in parallel.
ls Sample_*/*.fastq.gz | xargs -P 8 -I {} fastqc {}
Explanation:
xargs -P 8: Runs up to 8 jobs in parallel.
-I {} fastqc {}: Runs FastQC once per file, substituting the file name for {}.
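The same pattern works for any command that takes one file at a time. The sketch below wraps it in a function and converts newlines to NUL bytes so that file names containing spaces stay intact. The function name run_parallel is hypothetical, and the -0 and -P options assume GNU or BSD xargs.

```shell
# Run a command once per input line, with up to $1 jobs at a time.
# $1 = number of parallel jobs; remaining arguments = command to run.
# Each input line is passed to the command as a single argument.
run_parallel() {
    n_jobs=$1
    shift
    tr '\n' '\0' | xargs -0 -P "$n_jobs" -n 1 "$@"
}

# Usage (glob is a placeholder):
#   ls Sample_*/*.fastq.gz | run_parallel 8 fastqc
```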
These one-liners are powerful tools for bioinformatics tasks. For more complex workflows, consider scripting or using specialized tools like Biopython or Snakemake.