Using awk scripts in bioinformatics analysis
February 27, 2024

Table of Contents
Introduction to awk and its basic syntax
Explanation of how awk processes input and splits it into fields
awk is a powerful text processing language that is commonly used for processing structured text data. When awk reads in a line of input, it splits the line into fields based on a field separator, which is a regular expression that matches the delimiter between fields. By default, the field separator is any amount of whitespace, so awk will split the input into fields based on spaces, tabs, and newlines. Here’s an example of how awk splits input into fields:
$ echo "This is a test" | awk '{print $1, $2, $3}'
This is a
In this example, the input string “This is a test” is split into four fields based on the default field separator: the first field is “This”, the second is “is”, the third is “a”, and the fourth is “test”. The awk command {print $1, $2, $3} prints out the first three fields, which results in the output “This is a”. You can also change the field separator using the -F option. For example, to split the input into fields based on a comma, you can use the following command:
$ echo "This,is,a,test" | awk -F, '{print $1, $2, $3, $4}'
This is a test
In this example, the input string “This,is,a,test” is split into four fields based on the comma field separator. The first field is “This”, the second is “is”, the third is “a”, and the fourth is “test”. The awk command {print $1, $2, $3, $4} prints out all four fields, which results in the output “This is a test”, the fields now joined by the default output field separator, a space.
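The output counterpart of the field separator is the built-in OFS variable, which controls how fields are joined on output. A minimal sketch that reads comma-separated input and prints it tab-separated (the $1 = $1 assignment forces awk to rebuild the record with the new separator):

$ echo "This,is,a,test" | awk -F, -v OFS="\t" '{$1 = $1; print}'
This	is	a	test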
Basic commands such as print, printf, and variables
Here are some basic awk commands that you can use to process your data:
- print: This command is used to print the contents of a variable or a string. If no variable or string is specified, it will print the current input line.
Example:
$ echo "This is a test" | awk '{print $1, $2, $3}'
This is a
- printf: This command is similar to print, but it allows you to format the output using placeholders.
Example:
$ echo "This is a test" | awk '{printf "%s %s %s\n", $1, $2, $3}'
This is a
- Variables: In awk, you can use variables to store values. Variables are case-sensitive, and they are assigned a value using the = operator.
Example:
$ echo "This is a test" | awk '{first=$1; second=$2; third=$3; print first, second, third}'
This is a
- if-else statements: You can use if-else statements in awk to conditionally execute code based on the value of a variable or expression.
Example:
$ echo "This is a test" | awk '{if ($1 == "This") {print "First word is This"}}'
First word is This
- for loops: You can use for loops in awk to iterate over a range of values.
Example:
$ echo "This is a test" | awk '{for (i=1; i<=NF; i++) {print $i}}'
This
is
a
test
- while loops: You can use while loops in awk to repeatedly execute a block of code as long as a condition is true.
Example:
$ echo "This is a test" | awk '{i=1; while (i<=NF) {print $i; i++}}'
This
is
a
test
These are just a few basic commands and constructs that you can use in awk to process your data. There are many more commands and features available in awk, so I encourage you to explore the documentation and examples to learn more.
Using conditional statements to perform actions based on certain conditions
In awk, you can use conditional statements to perform actions based on certain conditions. Here are some examples:
- if-else statement: The if-else statement allows you to execute a block of code if a condition is true, and another block of code if the condition is false.
Example:
$ echo "10" | awk '{if ($1 > 5) {print "The number is greater than 5"}}'
The number is greater than 5
- if-else if-else statement: The if-else if-else statement allows you to execute a block of code if a condition is true, another block of code if a different condition is true, and yet another block of code if all the conditions are false.
Example:
$ echo "10" | awk '{if ($1 > 10) {print "The number is greater than 10"} else if ($1 > 5) {print "The number is greater than 5 but less than or equal to 10"}}'
The number is greater than 5 but less than or equal to 10
- switch statement: The switch statement allows you to execute a block of code based on the value of an expression. It is a gawk (GNU awk) extension, and, as in C, each case falls through to the next unless it ends with a break.
Example:
$ echo "3" | awk '{switch ($1) {case 1: print "The number is 1"; break; case 2: print "The number is 2"; break; case 3: print "The number is 3"; break; default: print "The number is not 1, 2, or 3"}}'
The number is 3
- ternary operator: The ternary operator (condition ? value1 : value2) selects one of two values based on a condition, similar to a compact if-else statement.
Example:
$ echo "10" | awk '{print ($1 > 5 ? "The number is greater than 5" : "The number is less than or equal to 5")}'
The number is greater than 5
These are just a few examples of how you can use conditional statements in awk to perform actions based on certain conditions. There are many more ways to use conditional statements in awk, so I encourage you to explore the documentation and examples to learn more.
Advanced awk features for bioinformatics analysis
Using regular expressions to manipulate data
In awk, you can use regular expressions to manipulate data. Here are some examples:
- Matching a regular expression: You can use the ~ operator to match a regular expression against a string; if the string matches the regular expression, the expression evaluates to true. A bare /regex/ pattern is shorthand for $0 ~ /regex/.
Example:
$ echo "hello world" | awk '$0 ~ /hello/ {print "Match found"}'
Match found
- Substituting a regular expression: You can use the sub() function to replace the first match of a regular expression with a string.
Example:
$ echo "hello world" | awk '{sub(/hello/, "hi"); print}'
hi world
- Global substitution of a regular expression: You can use the gsub() function to replace every match of a regular expression with a string.
Example:
$ echo "hello hello world" | awk '{gsub(/hello/, "hi"); print}'
hi hi world
- Splitting a string using a regular expression: You can use the split() function to split a string into an array, using a regular expression as the delimiter.
Example:
$ echo "a,b,c" | awk '{n = split($1, a, /,/); for (i=1; i<=n; i++) {print a[i]}}'
a
b
c
- Extracting substrings using regular expressions: You can use the match() function to extract substrings using regular expressions.
Example:
$ echo "hello world" | awk '{match($0, /hello/); print substr($0, RSTART, RLENGTH)}'
hello
These are just a few examples of how you can use regular expressions in awk to manipulate data. There are many more regular expression functions and operators available in awk, so I encourage you to explore the documentation and examples to learn more.
Note: In the examples above, $0 refers to the current input line, RSTART is a built-in variable that contains the starting position of the match, and RLENGTH is a built-in variable that contains the length of the match.
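gawk additionally provides gensub(), which, unlike sub() and gsub(), supports backreferences to parenthesized groups in the replacement and returns the modified string rather than editing it in place. A minimal sketch, assuming GNU awk and a made-up annotation string:

$ echo "gene=BRCA1;type=exon" | gawk '{print gensub(/gene=([^;]+).*/, "\\1", "g", $0)}'
BRCA1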
Working with arrays and associative arrays
In awk, you can work with arrays and associative arrays to store and manipulate data. Here are some examples:
- Creating an array: You can create an array in awk simply by assigning a value to an element with the array[index] syntax; no separate declaration is needed.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; print array[1]}'
apple
- Accessing array elements: You can access an element of an array using the array[index] syntax.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; print array[2]}'
banana
- Adding elements to an array: You can add elements to an array by assigning a value to a new index.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; array[4] = "date"; print array[4]}'
date
- Deleting array elements: You can delete an element of an array using the delete keyword.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; delete array[2]; print array[2]}'

$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; delete array[2]; print (length(array) == 2 ? "Array has 2 elements" : "Array has more than 2 elements")}'
Array has 2 elements
- Iterating over an array: You can iterate over an array using the for (index in array) loop. The traversal order is unspecified, so do not rely on it; a gawk sketch for imposing an order appears after this list.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; for (i in array) {print array[i]}}'
apple
banana
cherry
- Associative arrays: In awk, all arrays are in fact associative, so you can index them by strings as well as by integers (integer indices are converted to strings internally).
Example:
$ awk 'BEGIN {fruit["apple"] = 1; fruit["banana"] = 2; fruit["cherry"] = 3; print fruit["banana"]}'
2
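When the traversal order matters, gawk lets you choose one through the PROCINFO["sorted_in"] setting (a gawk extension). A minimal sketch that walks the array in ascending numeric index order:

$ gawk 'BEGIN {array[3] = "cherry"; array[1] = "apple"; array[2] = "banana"; PROCINFO["sorted_in"] = "@ind_num_asc"; for (i in array) print i, array[i]}'
1 apple
2 banana
3 cherry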
These are just a few examples of how you can work with arrays and associative arrays in awk. There are many more array functions and operators available in awk, so I encourage you to explore the documentation and examples to learn more.
Note: In the examples above, length(array) returns the number of elements in the array; calling length() on an array is a gawk extension.
Reading and writing files
In awk, you can read and write files to perform input and output operations. Here are some examples:
- Reading a file: You can read a file in awk using the getline function. getline returns 1 when it reads a line, 0 at end of file, and -1 on error, so it is best called in a loop that checks its return value.
Example:
$ awk 'BEGIN {while ((getline line < "file.txt") > 0) print line}'
- Writing to a file: You can write to a file in awk using the print or printf statement with the > operator.
Example:
$ awk 'BEGIN {print "Hello, World!" > "output.txt"}'
- Appending to a file: You can append to a file in awk using the print or printf statement with the >> operator.
Example:
$ awk 'BEGIN {print "Hello, World!" >> "output.txt"}'
- Reading and writing to different files: You can read from one file and write to another file in awk by combining the getline function with output redirection.
Example:
$ awk 'BEGIN {while ((getline line < "input.txt") > 0) print line > "output.txt"}'
- Reading and writing to the same file: Reading from and writing to the same file in a single awk program is unreliable, because the > operator truncates the file the first time it is opened for output, so you can destroy the very data you meant to read. A safer pattern is to write to a temporary file and replace the original afterwards.
Example:
$ awk '{print $1}' file.txt > file.txt.tmp && mv file.txt.tmp file.txt
Note: In the examples above, file.txt and output.txt are example file names; replace them with the actual file names you want to use. Also, be careful when writing to files in awk: the > operator truncates an existing file when it is first opened (subsequent prints in the same run append to the still-open stream), so use the >> operator when you want to append to a file instead of overwriting it. If you need to re-read a file you have just written, close it first with close(), as the sketch below shows.
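A minimal sketch of close() in action, using a scratch file name tmp.txt: awk keeps redirected output files open for the whole run, so the file must be closed before getline can read it back from the beginning.

$ awk 'BEGIN {print "hello" > "tmp.txt"; close("tmp.txt"); while ((getline line < "tmp.txt") > 0) print "read:", line}'
read: hello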
These are just a few examples of how you can read and write files in awk. There are many more file I/O functions and operators available in awk, so I encourage you to explore the documentation and examples to learn more.
Using built-in variables such as NR, NF, and FILENAME
In awk, there are several built-in variables that you can use to manipulate data. Here are some examples of using the NR, NF, and FILENAME variables:
- NR variable: The NR variable contains the total number of input records processed so far. An input record is typically a line of text, but it can be any unit of input specified by the RS (record separator) variable; a FASTA sketch using RS appears after this list.
Example:
$ awk '{print NR}' file.txt
- NF variable: The NF variable contains the number of fields in the current input record.
Example:
$ awk '{print NF}' file.txt
- FILENAME variable: The FILENAME variable contains the name of the current input file.
Example:
$ awk '{print FILENAME}' file1.txt file2.txt
- Using multiple files: When processing multiple files in awk, the FILENAME variable changes for each file. The related FNR variable resets to 1 at the start of each file, while NR keeps counting across all files; the FNR == NR idiom is sketched after this list.
Example:
$ awk '{print FILENAME, NR, $0}' file1.txt file2.txt
- Using FILENAME with getline: You can use the FILENAME variable with the getline function to read from the file currently being processed through a second, independent input stream.
Example:
$ awk '{getline line < FILENAME; print line}' file.txt
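Since RS defines what counts as a record, it is also handy for multi-line formats. A minimal sketch, assuming a FASTA file named seqs.fasta, that treats each entry as one record by setting RS to ">" (the NR > 1 pattern skips the empty record before the first ">"):

$ awk 'BEGIN {RS = ">"} NR > 1 {print "Sequence name:", $1}' seqs.fasta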
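The interplay of FNR and NR also gives the common two-file idiom: while the first file is being read, FNR == NR holds, so you can load it into an array and then filter the second file against it. A sketch with hypothetical files ids.txt (a list of IDs) and data.txt (a table whose first column is an ID):

$ awk 'FNR == NR {keep[$1] = 1; next} $1 in keep' ids.txt data.txt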
These are just a few examples of how you can use the NR, NF, and FILENAME variables in awk. There are many more built-in variables available in awk, so I encourage you to explore the documentation and examples to learn more.
Note: In the examples above, file.txt, file1.txt, and file2.txt are example file names. Replace them with the actual file names you want to use.
Practical examples and case studies
Parsing and filtering genomic data files (e.g. VCF, BAM)
Here are some examples of how to parse and filter genomic data files using awk:
- Remove the “chr” prefix from variant lines in a VCF:
awk '{gsub(/^chr/,""); print}' infile.vcf > infile.no_chr.vcf
- Add the “chr” prefix to variant lines in a VCF (leaving header lines untouched):
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' infile.no_chr.vcf > infile.vcf
- Sort a VCF karyotypically (version 1): keep the header, then sort the body:
grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | sort -V -k1,1 -k2,2n >> out.vcf
- Sort a VCF karyotypically (version 2): use -V, a natural sort of (version) numbers within text:
sort -V -k1,1 -k2,2n infile.vcf > outfile.vcf
- Sort a VCF karyotypically (version 3): use vcf-sort (from vcftools):
vcf-sort infile.vcf > outfile.vcf
- Replace spaces with tabs (assigning $1=$1 forces awk to rebuild each record with the new output field separator):
awk -v OFS="\t" '$1=$1' tmp1 > tmp2
- Prepare a header:
echo -e "#CHR\tPOS\tID\tREF\tALT\tnumGT.Unknowns\tnumGT.HomRef\tnumGT.Het\tnumGT.HomAlt\tnumGT.(Het+HomAlt)" > header
cat header tmp2 > infile.variant-genotypes.counts
- Extract variants that have a genotype count equal to or higher than a genotype threshold:
awk -F'[ ]' '{ if ($10 >= 5) print $3 }' infile.variant-genotypes.counts > variant-list
- Count the lines of each file in a directory and get the sum of all lines:
find ${indir} -type f -name "*.selected-files" -print0 | wc -l --files0-from=-
- Count the total number of lines in the selected files of a directory:
find ${indir} -type f -name "*.selected-files" -exec cat {} + | wc -l
- Grab the body of a file, excluding the header:
tail -n +2 ${infile} > file-without-header
- Count the number of lines in a file:
wc -l ${infile}
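The heading above also mentions BAM files, which are binary and cannot be read by awk directly; the usual pattern is to stream them through samtools and filter the resulting SAM text with awk. A minimal sketch, assuming samtools is installed, that keeps header lines and alignments with a mapping quality (SAM column 5) of at least 30:

samtools view -h infile.bam | awk '/^@/ || $5 >= 30' > filtered.sam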
These are just a few examples of how you can parse and filter genomic data files using awk. There are many more ways to manipulate and analyze genomic data using awk and other command-line tools.
Calculating summary statistics and generating reports
Here are some examples of how to calculate summary statistics and generate reports using awk:
- Calculate the number of lines in a file:
awk 'END {print NR}' infile.txt
- Calculate the number of fields in each line of a file:
awk '{print NF}' infile.txt
- Calculate the sum of a column of numbers:
awk '{sum += $1} END {print sum}' infile.txt
- Calculate the average of a column of numbers (a standard-deviation sketch follows this list):
awk '{sum += $1; count++} END {print sum/count}' infile.txt
- Calculate the minimum value of a column of numbers (the running minimum is seeded from the first line; comparing against an uninitialized variable would silently give the wrong answer):
awk 'NR == 1 {min = $1} $1 < min {min = $1} END {print min}' infile.txt
- Calculate the maximum value of a column of numbers (seeded the same way):
awk 'NR == 1 {max = $1} $1 > max {max = $1} END {print max}' infile.txt
- Calculate the median value of a column of numbers (asort() is a gawk extension):
awk '{a[NR] = $1} END {n = asort(a); if (NR % 2 == 1) print a[int(NR/2)+1]; else print (a[int(NR/2)] + a[int(NR/2)+1])/2}' infile.txt
- Generate a report of the number of occurrences of each unique value in a column:
awk '{count[$1]++} END {for (i in count) print i, count[i]}' infile.txt
- Generate a report of the number of occurrences of each unique value in a column, sorted by count:
awk '{count[$1]++} END {for (i in count) print count[i], i}' infile.txt | sort -rn
- Generate a report of the number of occurrences of each unique value in a column, sorted alphabetically:
awk '{count[$1]++} END {for (i in count) print i, count[i]}' infile.txt | sort
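In the same spirit as the average example above, a one-pass sketch for the population standard deviation of a column, accumulating the sum and the sum of squares:

awk '{sum += $1; sumsq += $1 * $1; n++} END {mean = sum / n; print sqrt(sumsq / n - mean * mean)}' infile.txt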
These are just a few examples of how you can calculate summary statistics and generate reports using awk. There are many more ways to manipulate and analyze data using awk and other command-line tools.
Automating workflows and pipelines
Here are some examples of how to automate workflows and pipelines using shell scripts and command-line tools.
- Create a shell script to automate a workflow:
#!/bin/bash

# Define variables
indir="/path/to/input/files"
outdir="/path/to/output/files"

# Create the output directory if it doesn't exist
mkdir -p "$outdir"

# Process each file in the input directory
for infile in "$indir"/*.txt; do
    # Extract the base filename without the extension
    base=$(basename "$infile" .txt)

    # Perform some operations on the input file:
    # calculate the number of lines, the fields per line, and the unique values in the first column
    awk 'END {print NR}' "$infile" > "$outdir/$base.lines"
    awk '{print NF}' "$infile" > "$outdir/$base.fields"
    awk '{a[$1]++} END {for (i in a) print i, a[i]}' "$infile" | sort > "$outdir/$base.unique"
done
- Create a shell script to automate a pipeline:
#!/bin/bash

# Define variables
indir="/path/to/input/files"
outdir="/path/to/output/files"

# Create the output directory if it doesn't exist
mkdir -p "$outdir"

# Process each file in the input directory
for infile in "$indir"/*.txt; do
    # Extract the base filename without the extension
    base=$(basename "$infile" .txt)

    # Perform operations on the input file using a pipeline:
    # filter lines based on a condition, calculate a summary statistic, and generate a count report
    awk '{if ($1 > 5) print $0}' "$infile" > "$outdir/$base.filtered"
    awk '{sum += $1} END {print sum}' "$infile" > "$outdir/$base.sum"
    awk '{count[$1]++} END {for (i in count) print count[i], i}' "$infile" | sort -rn > "$outdir/$base.report"
done
These are just a few examples of how you can automate workflows and pipelines using shell scripts and command-line tools. There are many more ways to automate and streamline your data analysis and processing workflows.