
Best bioinformatics one-liners

January 9, 2025 By admin

Here’s a step-by-step guide to some of the most useful bioinformatics one-liners for common tasks. They are quick, efficient solutions for simple jobs, built on Linux command-line tools such as awk, sed, rev, tr, and csplit.


1. Get Sequence Length Distribution from a FASTQ File

This command calculates the distribution of sequence lengths in a FASTQ file.

bash
zcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'

Explanation:

  • zcat: Decompresses the .gz file on the fly.
  • awk: Processes the second line of every four-line record (the sequence line in FASTQ).
  • lengths[length($0)]++: Counts occurrences of each sequence length.
  • END: Prints the length distribution at the end.
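
awk walks its arrays in no particular order, so the table above can come out unsorted. Here is a minimal sketch that sorts it by length, assuming an uncompressed FASTQ stream on standard input. The function name fastq_length_dist is made up for illustration.

```bash
# Hypothetical helper: print "length<TAB>count", sorted by length.
# Reads uncompressed FASTQ (4 lines per read) from standard input.
fastq_length_dist() {
    awk 'NR % 4 == 2 { lengths[length($0)]++ }
         END { for (l in lengths) printf "%d\t%d\n", l, lengths[l] }' |
        sort -n
}

# Usage with a compressed file:
# zcat file.fastq.gz | fastq_length_dist
```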

2. Reverse Complement a DNA Sequence

This command reverse complements a DNA sequence.

bash
echo 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'

Explanation:

  • rev: Reverses the sequence.
  • tr 'ACTG' 'TGAC': Translates each nucleotide to its complement.
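
The tr call above only covers uppercase bases. Here is a small sketch that also handles lowercase (soft-masked) sequence, wrapped in a function whose name, revcomp, is just an illustration. N and any other character pass through unchanged.

```bash
# Hypothetical helper: reverse complement every line of standard input,
# preserving case. Characters other than A, C, G, T are left as they are.
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

echo 'ATTGCTATGCTNNNT' | revcomp   # prints ANNNAGCATAGCAAT
```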

3. Split a Multi-FASTA File into Individual Files

This command splits a multi-FASTA file into individual files using csplit.

bash
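csplit -z -q -n 4 -f sequence_ sequences.fasta '/^>/' '{*}'

Explanation:

  • csplit: Splits the file at each line starting with > (a FASTA header); '{*}' repeats the split for every match.
  • -z: Removes empty output files.
  • -q: Suppresses the per-file byte counts that csplit would otherwise print.
  • -f sequence_: Sets the prefix for output file names.
  • -n 4: Uses 4 digits for the numeric suffix.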
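
csplit numbers its output files, which is not always what you want. As an alternative (plain awk rather than csplit), this sketch names each file after its sequence ID. It assumes IDs are unique and contain no "/" characters. The function name split_fasta_by_id and the .fa suffix are illustrative choices.

```bash
# Hypothetical helper: write each record of a multi-FASTA file to
# <outdir>/<ID>.fa, where ID is the first word of the header line.
# Assumes IDs are unique and are safe to use as file names.
split_fasta_by_id() {
    fasta=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -v dir="$outdir" '
        # New header: close the previous file and pick the next name.
        /^>/ { if (out) close(out); out = dir "/" substr($1, 2) ".fa" }
        # Write every line (header included) to the current file.
        out  { print > out }
    ' "$fasta"
}

# Usage: split_fasta_by_id sequences.fasta split_out
```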

4. Linearize a FASTA File

This command converts a multi-line FASTA file into a single-line format.

bash
awk '/^>/{if(N>0) printf("\n"); ++N; printf("%s\t",$0);next;} {printf("%s",$0);}END{printf("\n");}' file.fasta

Explanation:

  • awk: Processes each line.
  • /^>/: Matches FASTA headers.
  • printf: Prints each header and its full sequence on one line, separated by a tab.
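
The tab-separated form is handy because each record becomes one line that other tools can filter. As a sketch of that idea, the function below keeps only records of at least a given length and writes them back out as FASTA. The name filter_fasta_by_length is hypothetical, and it assumes headers contain no tab characters.

```bash
# Hypothetical helper: keep FASTA records whose sequence is at least
# min_len bases long. Reads FASTA on standard input and writes
# single-line FASTA on standard output.
filter_fasta_by_length() {
    min_len=$1
    # Step 1: linearize to "header<TAB>sequence".
    awk '/^>/ { if (N > 0) printf("\n"); ++N; printf("%s\t", $0); next }
         { printf("%s", $0) }
         END { if (N > 0) printf("\n") }' |
        # Step 2: filter on sequence length and restore two-line records.
        awk -F '\t' -v min="$min_len" 'length($2) >= min { print $1; print $2 }'
}

# Usage: filter_fasta_by_length 200 < file.fasta > long.fasta
```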

5. Convert FASTQ to FASTA

This command converts a FASTQ file to FASTA format.

bash
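zcat file.fastq.gz | paste - - - - | perl -F'\t' -ane 'print ">", substr($F[0], 1), "\n$F[1]\n";' | gzip -c > file.fasta.gz

Explanation:

  • paste - - - -: Combines every 4 lines into a single tab-separated line.
  • perl -F'\t': Splits each record on tabs, so headers containing spaces stay intact, then prints the header (with the leading @ replaced by >) and the sequence.
  • gzip -c: Compresses the output.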
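
If perl is not at hand, the same conversion can be sketched in awk alone. The version below swaps the leading @ of each header for > and keeps the rest of the header line as is. It assumes strict 4-line FASTQ records. The function name fastq_to_fasta is made up for illustration.

```bash
# Hypothetical helper: convert 4-line FASTQ on standard input to FASTA.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: "@" becomes ">"
         NR % 4 == 2 { print }'                    # sequence line
}

# Usage: zcat file.fastq.gz | fastq_to_fasta | gzip -c > file.fasta.gz
```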

6. Extract Sequences by IDs from a FASTA File

This command extracts sequences from a FASTA file based on a list of IDs.

bash
cut -c 2- ids.txt | xargs -n 1 samtools faidx file.fasta > out.fasta

Explanation:

  • cut -c 2-: Drops the first character of each line, which assumes every ID in ids.txt starts with >.
  • xargs -n 1: Passes one ID at a time to samtools faidx.
  • samtools faidx: Extracts sequences.
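
For small files, or on a machine without samtools, a plain awk sketch can do a similar job. This is a different technique from samtools faidx: it scans the whole FASTA instead of using an index, and it returns records in file order rather than in ID-list order. The function name extract_by_id is hypothetical. IDs may be listed with or without the leading >, and the ID list must not be empty.

```bash
# Hypothetical helper: print FASTA records whose ID (first word of the
# header, without ">") appears in ids_file, one ID per line.
extract_by_id() {
    ids_file=$1
    fasta=$2
    awk '
        # First file: remember each wanted ID.
        NR == FNR { sub(/^>/, ""); if (NF) wanted[$1] = 1; next }
        # Second file: at each header, decide whether to keep the record.
        /^>/ { keep = (substr($1, 2) in wanted) }
        keep
    ' "$ids_file" "$fasta"
}

# Usage: extract_by_id ids.txt file.fasta > out.fasta
```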

7. Reproducible Subsampling of a FASTQ File

This command subsamples a FASTQ file reproducibly.

bash
cat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() < 0.01) print $0}' | tr '\t' '\n' > out.fq

Explanation:

  • srand(1234): Sets a seed for reproducibility.
  • rand() < 0.01: Keeps each read with probability 0.01, so roughly 1% of reads survive.
  • tr '\t' '\n': Restores FASTQ format.
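
A fixed seed also helps with paired-end data: running the same seed and fraction over both mate files selects the same read positions in each, as long as the two files list their reads in the same order. A sketch, with the function name subsample_fastq and the file names in the usage lines being illustrative only:

```bash
# Hypothetical helper: keep roughly `fraction` of the reads in a FASTQ
# stream, using `seed` so that repeated runs pick the same reads.
subsample_fastq() {
    fraction=$1
    seed=$2
    paste - - - - |
        awk -v f="$fraction" -v s="$seed" 'BEGIN { srand(s) } rand() < f' |
        tr '\t' '\n'
}

# Same seed and fraction on both mates keeps the pairs in sync:
# subsample_fastq 0.01 1234 < R1.fq > sub_R1.fq
# subsample_fastq 0.01 1234 < R2.fq > sub_R2.fq
```

The exact reads chosen depend on the awk implementation's random number generator, so results are reproducible on one system but not necessarily across different awk versions.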

8. Count the Number of Reads in a FASTQ File

This command counts the number of reads in a FASTQ file.

bash
echo $(( $(wc -l < file.fq) / 4 ))

Explanation:

  • wc -l: Counts the total number of lines.
  • echo $((.../4)): Divides by 4 to get the number of reads.
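
Dividing by 4 silently rounds down when a file is truncated. As a sketch, the function below reports the read count only when the line count is a multiple of 4 and otherwise fails with a message. It assumes strict 4-line records. The name count_reads is made up for illustration.

```bash
# Hypothetical helper: count reads in a FASTQ stream on standard input.
# Exits with status 1 if the line count is not a multiple of 4.
count_reads() {
    awk 'END {
        if (NR % 4 != 0) {
            print "line count " NR " is not a multiple of 4" > "/dev/stderr"
            exit 1
        }
        print NR / 4
    }'
}

# Usage: zcat file.fastq.gz | count_reads
```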

9. Convert Multi-Line FASTA to 60 Characters per Line

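This command reformats a FASTA file to 60 characters per line. Each input line is wrapped on its own, so linearize the file first (see one-liner 4) if its sequences already span several lines.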

bash
awk '/^>/{print;next}{for (i=1;i<=length($0);i+=60) print substr($0,i,60)}' file.fa

Explanation:

  • /^>/{print;next}: Prints header lines unchanged.
  • for (i=1;i<=length($0);i+=60): Steps through the sequence 60 characters at a time.
  • substr($0,i,60): Extracts the chunk of up to 60 characters starting at position i.
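
Since the one-liner wraps each input line separately, a file that is already wrapped at some other width needs its lines joined first. This sketch does both in a single awk pass, collecting each record's sequence before re-wrapping it. The function name wrap_fasta and its optional width argument are assumptions for illustration.

```bash
# Hypothetical helper: rewrap FASTA sequences on standard input to
# `width` columns (default 60), whatever the input line width was.
wrap_fasta() {
    width=${1:-60}
    awk -v w="$width" '
        # Print the collected sequence in chunks of w characters.
        function emit(    i) {
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { emit(); print; next }   # finish previous record, print header
        { seq = seq $0 }               # accumulate sequence lines
        END { emit() }'
}

# Usage: wrap_fasta 60 < file.fa > wrapped.fa
```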

10. Parallelize Tasks with xargs

This command uses xargs to submit one cluster job per input file.

bash
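ls Sample_*/*.fastq.gz | xargs -I {} sh -c 'echo fastqc {} | qsub -cwd'

Explanation:

  • xargs -I {}: Replaces {} with each input line.
  • sh -c: Runs the quoted pipeline in a shell, so the fastqc command line is actually piped to qsub instead of just being printed.
  • qsub -cwd: Submits the job to the cluster scheduler and runs it from the current working directory.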

11. Immutable Files with chattr

This command makes files immutable to prevent accidental changes.

bash
sudo chattr +i file.fq  # Archive a file
sudo chattr -i file.fq  # Unarchive it

Explanation:

  • chattr +i: Makes the file immutable, so it cannot be modified, renamed, or deleted (even by root) until the flag is cleared.
  • chattr -i: Reverses the immutability.

12. Convert BAM to BED

This command converts a BAM file to BED format.

bash
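samtools view -F 4 file.bam | perl -F'\t' -ane '$strand=($F[1]&16)?"-":"+";$length=0;$tmp=$F[5];$tmp =~ s/(\d+)[MDN=X]/$length+=$1/eg;print "$F[2]\t".($F[3]-1)."\t".($F[3]-1+$length)."\t$F[0]\t0\t$strand\n";' > file.bed

Explanation:

  • samtools view -F 4: Streams the alignments as SAM text, skipping unmapped reads.
  • perl: Takes the strand from flag bit 16 and sums the reference-consuming CIGAR operations (M, D, N, =, X) to get the alignment span, then prints 0-based, half-open BED coordinates (start = POS - 1).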
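
The CIGAR arithmetic is easier to follow when spelled out. Here is a sketch in awk, assuming SAM text on standard input. BED coordinates are 0-based and half-open, so the start is POS - 1 and the end is the start plus the number of reference bases the alignment covers. The function name sam_to_bed is hypothetical.

```bash
# Hypothetical helper: convert SAM text on standard input to BED6.
# The reference span is the sum of the M, D, N, = and X operations.
sam_to_bed() {
    awk -F '\t' -v OFS='\t' '
        /^@/ { next }                    # skip header lines
        int($2 / 4) % 2 == 1 { next }    # skip unmapped reads (flag bit 4)
        {
            span = 0
            cigar = $6
            # Consume one "<number><operation>" pair per iteration.
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                n  = substr(cigar, 1, RLENGTH - 1) + 0
                op = substr(cigar, RLENGTH, 1)
                if (op ~ /[MDN=X]/) span += n
                cigar = substr(cigar, RLENGTH + 1)
            }
            strand = (int($2 / 16) % 2 == 1) ? "-" : "+"
            print $3, $4 - 1, $4 - 1 + span, $1, 0, strand
        }'
}

# Usage: samtools view file.bam | sam_to_bed > file.bed
```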

13. Deinterleave a FASTQ File

This command splits an interleaved FASTQ file into two files.

bash
cat file.fq | paste - - - - - - - - | tee >(cut -f1-4 | tr '\t' '\n' > out1.fq) | cut -f5-8 | tr '\t' '\n' > out2.fq

Explanation:

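  • paste - - - - - - - -: Joins each group of 8 lines (one read pair) into a single tab-separated line.
  • tee >(...): Sends a copy of the stream to a bash process substitution that writes the first mate, while the original stream continues down the pipe.
  • cut: Selects fields 1-4 (first mate) or 5-8 (second mate); tr then turns the tabs back into newlines.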
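
Process substitution is a bash feature, so the command above will not run under a plain POSIX sh. A sketch of the same split in awk, which works in any shell: lines 1-4 of every 8-line block go to one file and lines 5-8 to the other. The function name deinterleave_fastq is made up for illustration, and it assumes strict 4-line records with mates strictly alternating.

```bash
# Hypothetical helper: split interleaved FASTQ on standard input into
# two files, one per mate.
deinterleave_fastq() {
    out1=$1
    out2=$2
    awk -v o1="$out1" -v o2="$out2" '
        (NR - 1) % 8 < 4 { print > o1; next }   # lines 1-4: first mate
        { print > o2 }                          # lines 5-8: second mate
    '
}

# Usage: deinterleave_fastq out1.fq out2.fq < file.fq
```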

14. Calculate Sequence Lengths in a FASTA File

This command calculates the length of each sequence in a FASTA file.

bash
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa

Explanation:

  • awk: Prints each header, then the total length of the sequence lines that follow it.
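
For downstream processing, a two-column table is often easier to work with than alternating header and length lines. A sketch that prints the ID and length on one line per record, and also reports records with no sequence as length 0. The function name fasta_lengths is hypothetical.

```bash
# Hypothetical helper: print "ID<TAB>length" for each FASTA record on
# standard input. ID is the first word of the header, without ">".
fasta_lengths() {
    awk '/^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (id != "") print id "\t" len }'
}

# Usage: fasta_lengths < file.fa | sort -k2,2nr | head
```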

15. Run FastQC in Parallel

This command runs FastQC on multiple FASTQ files in parallel.

bash
ls Sample_*/*.fastq.gz | xargs -P 8 -I {} fastqc {}

Explanation:

  • xargs -P 8 -I {}: Runs up to 8 jobs at a time, substituting each file name for {}.
  • fastqc {}: Runs FastQC on each file.

These one-liners handle many everyday bioinformatics tasks quickly. For more complex workflows, consider writing a script or using dedicated tools such as Biopython or Snakemake.
