Using awk scripts in bioinformatics analysis
February 27, 2024

Table of Contents
Introduction to awk and its basic syntax
Explanation of how awk processes input and splits it into fields
awk is a powerful text processing language that is commonly used for processing structured text data. When awk reads in a line of input, it splits the line into fields based on a field separator, which is a regular expression that matches the delimiter between fields. By default, the field separator is any amount of whitespace, so awk will split the input into fields based on spaces, tabs, and newlines. Here’s an example of how awk splits input into fields:
$ echo "This is a test" | awk '{print $1, $2, $3}'
This is a
In this example, the input string “This is a test” is split into four fields based on the default field separator: the first field is “This”, the second is “is”, the third is “a”, and the fourth is “test”. The awk command {print $1, $2, $3} prints out the first three fields, which results in the output “This is a”. You can also change the field separator using the -F option. For example, to split the input into fields based on a comma, you can use the following command:
$ echo "This,is,a,test" | awk -F, '{print $1, $2, $3, $4}'
This is a test
In this example, the input string “This,is,a,test” is split into four fields based on the comma field separator. The first field is “This”, the second is “is”, the third is “a”, and the fourth is “test”. The awk command {print $1, $2, $3, $4} prints out all four fields, which results in the output “This is a test”, the fields now joined by the default output field separator, a space.
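The output counterpart of the field separator is the built-in OFS variable, which controls how fields are joined on output. A minimal sketch that reads comma-separated input and prints it tab-separated (the $1 = $1 assignment forces awk to rebuild the record with the new separator):

$ echo "This,is,a,test" | awk -F, -v OFS="\t" '{$1 = $1; print}'
This	is	a	test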
Basic commands such as print, printf, and variables
Here are some basic awk commands that you can use to process your data:
- print: This command is used to print the contents of a variable or a string. If no variable or string is specified, it will print the current input line.
Example:
$ echo "This is a test" | awk '{print $1, $2, $3}'
This is a
- printf: This command is similar to print, but it allows you to format the output using placeholders.
Example:
$ echo "This is a test" | awk '{printf "%s %s %s\n", $1, $2, $3}'
This is a
- Variables: In awk, you can use variables to store values. Variables are case-sensitive, and they are assigned a value using the = operator.
Example:
$ echo "This is a test" | awk '{first=$1; second=$2; third=$3; print first, second, third}'
This is a
- if-else statements: You can use if-else statements in awk to conditionally execute code based on the value of a variable or expression.
Example:
$ echo "This is a test" | awk '{if ($1 == "This") {print "First word is This"}}'
First word is This
- for loops: You can use for loops in awk to iterate over a range of values.
Example:
$ echo "This is a test" | awk '{for (i=1; i<=NF; i++) {print $i}}'
This
is
a
test
- while loops: You can use while loops in awk to repeatedly execute a block of code as long as a condition is true.
Example:
$ echo "This is a test" | awk '{i=1; while (i<=NF) {print $i; i++}}'
This
is
a
test
These are just a few basic commands and constructs that you can use in awk to process your data. There are many more commands and features available in awk, so I encourage you to explore the documentation and examples to learn more.
Using conditional statements to perform actions based on certain conditions
In awk, you can use conditional statements to perform actions based on certain conditions. Here are some examples:
- if-else statement: The if-else statement allows you to execute a block of code if a condition is true, and another block of code if the condition is false.
Example:
$ echo "10" | awk '{if ($1 > 5) {print "The number is greater than 5"}}'
The number is greater than 5
- if-else if-else statement: The if-else if-else statement allows you to execute a block of code if a condition is true, another block of code if a different condition is true, and yet another block of code if all the conditions are false.
Example:
$ echo "10" | awk '{if ($1 > 10) {print "The number is greater than 10"} else if ($1 > 5) {print "The number is greater than 5 but less than or equal to 10"}}'
The number is greater than 5 but less than or equal to 10
- switch statement: The switch statement allows you to execute a block of code based on the value of an expression. It is a gawk (GNU awk) extension, and, as in C, each case falls through to the next unless it ends with a break.
Example:
$ echo "3" | awk '{switch ($1) {case 1: print "The number is 1"; break; case 2: print "The number is 2"; break; case 3: print "The number is 3"; break; default: print "The number is not 1, 2, or 3"}}'
The number is 3
- ternary operator: The ternary operator (condition ? value1 : value2) selects one of two values based on a condition, similar to a compact if-else statement.
Example:
$ echo "10" | awk '{print ($1 > 5 ? "The number is greater than 5" : "The number is less than or equal to 5")}'
The number is greater than 5
These are just a few examples of how you can use conditional statements in awk to perform actions based on certain conditions. There are many more ways to use conditional statements in awk, so I encourage you to explore the documentation and examples to learn more.
Advanced awk features for bioinformatics analysis
Using regular expressions to manipulate data
In awk, you can use regular expressions to manipulate data. Here are some examples:
- Matching a regular expression: You can use the ~ operator to match a regular expression against a string; if the string matches the regular expression, the expression evaluates to true. A bare /regex/ pattern is shorthand for $0 ~ /regex/.
Example:
$ echo "hello world" | awk '$0 ~ /hello/ {print "Match found"}'
Match found
- Substituting a regular expression: You can use the sub() function to replace the first match of a regular expression with a string.
Example:
$ echo "hello world" | awk '{sub(/hello/, "hi"); print}'
hi world
- Global substitution of a regular expression: You can use the gsub() function to replace every match of a regular expression with a string.
Example:
$ echo "hello hello world" | awk '{gsub(/hello/, "hi"); print}'
hi hi world
- Splitting a string using a regular expression: You can use the split() function to split a string into an array, using a regular expression as the delimiter.
Example:
$ echo "a,b,c" | awk '{n = split($1, a, /,/); for (i=1; i<=n; i++) {print a[i]}}'
a
b
c
- Extracting substrings using regular expressions: You can use the match() function to extract substrings using regular expressions.
Example:
$ echo "hello world" | awk '{match($0, /hello/); print substr($0, RSTART, RLENGTH)}'
hello
These are just a few examples of how you can use regular expressions in awk to manipulate data. There are many more regular expression functions and operators available in awk, so I encourage you to explore the documentation and examples to learn more.
Note: In the examples above, $0 refers to the current input line, RSTART is a built-in variable that contains the starting position of the match, and RLENGTH is a built-in variable that contains the length of the match.
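gawk additionally provides gensub(), which, unlike sub() and gsub(), supports backreferences to parenthesized groups in the replacement and returns the modified string rather than editing it in place. A minimal sketch, assuming GNU awk and a made-up annotation string:

$ echo "gene=BRCA1;type=exon" | gawk '{print gensub(/gene=([^;]+).*/, "\\1", "g", $0)}'
BRCA1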
Working with arrays and associative arrays
In awk, you can work with arrays and associative arrays to store and manipulate data. Here are some examples:
- Creating an array: You can create an array in awk simply by assigning a value to an element with the array[index] syntax; no separate declaration is needed.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; print array[1]}'
apple
- Accessing array elements: You can access an element of an array using the array[index] syntax.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; print array[2]}'
banana
- Adding elements to an array: You can add elements to an array by assigning a value to a new index.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; array[4] = "date"; print array[4]}'
date
- Deleting array elements: You can delete an element of an array using the delete keyword.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; delete array[2]; print array[2]}'

$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; delete array[2]; print (length(array) == 2 ? "Array has 2 elements" : "Array has more than 2 elements")}'
Array has 2 elements
- Iterating over an array: You can iterate over an array using the for (index in array) loop. The traversal order is unspecified, so do not rely on it; a gawk sketch for imposing an order appears after this list.
Example:
$ awk 'BEGIN {array[1] = "apple"; array[2] = "banana"; array[3] = "cherry"; for (i in array) {print array[i]}}'
apple
banana
cherry
- Associative arrays: In awk, all arrays are in fact associative, so you can index them by strings as well as by integers (integer indices are converted to strings internally).
Example:
$ awk 'BEGIN {fruit["apple"] = 1; fruit["banana"] = 2; fruit["cherry"] = 3; print fruit["banana"]}'
2
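When the traversal order matters, gawk lets you choose one through the PROCINFO["sorted_in"] setting (a gawk extension). A minimal sketch that walks the array in ascending numeric index order:

$ gawk 'BEGIN {array[3] = "cherry"; array[1] = "apple"; array[2] = "banana"; PROCINFO["sorted_in"] = "@ind_num_asc"; for (i in array) print i, array[i]}'
1 apple
2 banana
3 cherry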
These are just a few examples of how you can work with arrays and associative arrays in awk. There are many more array functions and operators available in awk, so I encourage you to explore the documentation and examples to learn more.
Note: In the examples above, length(array) returns the number of elements in the array; calling length() on an array is a gawk extension.
Reading and writing files
In awk, you can read and write files to perform input and output operations. Here are some examples:
- Reading a file: You can read a file in awk using the getline function. getline returns 1 when it reads a line, 0 at end of file, and -1 on error, so it is best called in a loop that checks its return value.
Example:
$ awk 'BEGIN {while ((getline line < "file.txt") > 0) print line}'
- Writing to a file: You can write to a file in awk using the print or printf statement with the > operator.
Example:
$ awk 'BEGIN {print "Hello, World!" > "output.txt"}'
- Appending to a file: You can append to a file in awk using the print or printf statement with the >> operator.
Example:
$ awk 'BEGIN {print "Hello, World!" >> "output.txt"}'
- Reading and writing to different files: You can read from one file and write to another file in awk by combining the getline function with output redirection.
Example:
$ awk 'BEGIN {while ((getline line < "input.txt") > 0) print line > "output.txt"}'
- Reading and writing to the same file: Reading from and writing to the same file in a single awk program is unreliable, because the > operator truncates the file the first time it is opened for output, so you can destroy the very data you meant to read. A safer pattern is to write to a temporary file and replace the original afterwards.
Example:
$ awk '{print $1}' file.txt > file.txt.tmp && mv file.txt.tmp file.txt
Note: In the examples above, file.txt and output.txt are example file names; replace them with the actual file names you want to use. Also, be careful when writing to files in awk: the > operator truncates an existing file when it is first opened (subsequent prints in the same run append to the still-open stream), so use the >> operator when you want to append to a file instead of overwriting it. If you need to re-read a file you have just written, close it first with close(), as the sketch below shows.
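A minimal sketch of close() in action, using a scratch file name tmp.txt: awk keeps redirected output files open for the whole run, so the file must be closed before getline can read it back from the beginning.

$ awk 'BEGIN {print "hello" > "tmp.txt"; close("tmp.txt"); while ((getline line < "tmp.txt") > 0) print "read:", line}'
read: hello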
These are just a few examples of how you can read and write files in awk. There are many more file I/O functions and operators available in awk, so I encourage you to explore the documentation and examples to learn more.
Using built-in variables such as NR, NF, and FILENAME
In awk, there are several built-in variables that you can use to manipulate data. Here are some examples of using the NR, NF, and FILENAME variables:
- NR variable: The NR variable contains the total number of input records processed so far. An input record is typically a line of text, but it can be any unit of input specified by the RS (record separator) variable; a FASTA sketch using RS appears after this list.
Example:
$ awk '{print NR}' file.txt
- NF variable: The NF variable contains the number of fields in the current input record.
Example:
$ awk '{print NF}' file.txt
- FILENAME variable: The FILENAME variable contains the name of the current input file.
Example:
$ awk '{print FILENAME}' file1.txt file2.txt
- Using multiple files: When processing multiple files in awk, the FILENAME variable changes for each file. The related FNR variable resets to 1 at the start of each file, while NR keeps counting across all files; the FNR == NR idiom is sketched after this list.
Example:
$ awk '{print FILENAME, NR, $0}' file1.txt file2.txt
- Using FILENAME with getline: You can use the FILENAME variable with the getline function to read from the file currently being processed through a second, independent input stream.
Example:
$ awk '{getline line < FILENAME; print line}' file.txt
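Since RS defines what counts as a record, it is also handy for multi-line formats. A minimal sketch, assuming a FASTA file named seqs.fasta, that treats each entry as one record by setting RS to ">" (the NR > 1 pattern skips the empty record before the first ">"):

$ awk 'BEGIN {RS = ">"} NR > 1 {print "Sequence name:", $1}' seqs.fasta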
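The interplay of FNR and NR also gives the common two-file idiom: while the first file is being read, FNR == NR holds, so you can load it into an array and then filter the second file against it. A sketch with hypothetical files ids.txt (a list of IDs) and data.txt (a table whose first column is an ID):

$ awk 'FNR == NR {keep[$1] = 1; next} $1 in keep' ids.txt data.txt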
These are just a few examples of how you can use the NR, NF, and FILENAME variables in awk. There are many more built-in variables available in awk, so I encourage you to explore the documentation and examples to learn more.
Note: In the examples above, file.txt, file1.txt, and file2.txt are example file names. Replace them with the actual file names you want to use.
Practical examples and case studies
Parsing and filtering genomic data files (e.g. VCF, BAM)
Here are some examples of how to parse and filter genomic data files using awk:
- Remove the “chr” prefix from variant lines in a VCF:
awk '{gsub(/^chr/,""); print}' infile.vcf > infile.no_chr.vcf
- Add the “chr” prefix to variant lines in a VCF (leaving header lines untouched):
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' infile.no_chr.vcf > infile.vcf
- Sort a VCF karyotypically (version 1): keep the header, then sort the body:
grep '^#' in.vcf > out.vcf && grep -v '^#' in.vcf | sort -V -k1,1 -k2,2n >> out.vcf
- Sort a VCF karyotypically (version 2): use -V, a natural sort of (version) numbers within text:
sort -V -k1,1 -k2,2n infile.vcf > outfile.vcf
- Sort a VCF karyotypically (version 3): use vcf-sort (from vcftools):
vcf-sort infile.vcf > outfile.vcf
- Replace spaces with tabs (assigning $1=$1 forces awk to rebuild each record with the new output field separator):
awk -v OFS="\t" '$1=$1' tmp1 > tmp2
- Prepare a header:
echo -e "#CHR\tPOS\tID\tREF\tALT\tnumGT.Unknowns\tnumGT.HomRef\tnumGT.Het\tnumGT.HomAlt\tnumGT.(Het+HomAlt)" > header
cat header tmp2 > infile.variant-genotypes.counts
- Extract variants that have a genotype count equal to or higher than a genotype threshold:
awk -F'[ ]' '{ if ($10 >= 5) print $3 }' infile.variant-genotypes.counts > variant-list
- Count the lines of each file in a directory and get the sum of all lines:
find ${indir} -type f -name "*.selected-files" -print0 | wc -l --files0-from=-
- Count the total number of lines in the selected files of a directory:
find ${indir} -type f -name "*.selected-files" -exec cat {} + | wc -l
- Grab the body of a file, excluding the header:
tail -n +2 ${infile} > file-without-header
- Count the number of lines in a file:
wc -l ${infile}
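The heading above also mentions BAM files, which are binary and cannot be read by awk directly; the usual pattern is to stream them through samtools and filter the resulting SAM text with awk. A minimal sketch, assuming samtools is installed, that keeps header lines and alignments with a mapping quality (SAM column 5) of at least 30:

samtools view -h infile.bam | awk '/^@/ || $5 >= 30' > filtered.sam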
These are just a few examples of how you can parse and filter genomic data files using awk. There are many more ways to manipulate and analyze genomic data using awk and other command-line tools.
Calculating summary statistics and generating reports
Here are some examples of how to calculate summary statistics and generate reports using awk:
- Calculate the number of lines in a file:
awk 'END {print NR}' infile.txt
- Calculate the number of fields in each line of a file:
awk '{print NF}' infile.txt
- Calculate the sum of a column of numbers:
awk '{sum += $1} END {print sum}' infile.txt
- Calculate the average of a column of numbers (a standard-deviation sketch follows this list):
awk '{sum += $1; count++} END {print sum/count}' infile.txt
- Calculate the minimum value of a column of numbers (the running minimum is seeded from the first line; comparing against an uninitialized variable would silently give the wrong answer):
awk 'NR == 1 {min = $1} $1 < min {min = $1} END {print min}' infile.txt
- Calculate the maximum value of a column of numbers (seeded the same way):
awk 'NR == 1 {max = $1} $1 > max {max = $1} END {print max}' infile.txt
- Calculate the median value of a column of numbers (asort() is a gawk extension):
awk '{a[NR] = $1} END {n = asort(a); if (NR % 2 == 1) print a[int(NR/2)+1]; else print (a[int(NR/2)] + a[int(NR/2)+1])/2}' infile.txt
- Generate a report of the number of occurrences of each unique value in a column:
awk '{count[$1]++} END {for (i in count) print i, count[i]}' infile.txt
- Generate a report of the number of occurrences of each unique value in a column, sorted by count:
awk '{count[$1]++} END {for (i in count) print count[i], i}' infile.txt | sort -rn
- Generate a report of the number of occurrences of each unique value in a column, sorted alphabetically:
awk '{count[$1]++} END {for (i in count) print i, count[i]}' infile.txt | sort
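In the same spirit as the average example above, a one-pass sketch for the population standard deviation of a column, accumulating the sum and the sum of squares:

awk '{sum += $1; sumsq += $1 * $1; n++} END {mean = sum / n; print sqrt(sumsq / n - mean * mean)}' infile.txt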
These are just a few examples of how you can calculate summary statistics and generate reports using awk. There are many more ways to manipulate and analyze data using awk and other command-line tools.
Automating workflows and pipelines
Here are some examples of how to automate workflows and pipelines using shell scripts and command-line tools.
- Create a shell script to automate a workflow:
#!/bin/bash

# Define variables
indir="/path/to/input/files"
outdir="/path/to/output/files"

# Create the output directory if it doesn't exist
mkdir -p "$outdir"

# Process each file in the input directory
for infile in "$indir"/*.txt; do
    # Extract the base filename without the extension
    base=$(basename "$infile" .txt)

    # Perform some operations on the input file:
    # calculate the number of lines, the fields per line, and the unique values in the first column
    awk 'END {print NR}' "$infile" > "$outdir/$base.lines"
    awk '{print NF}' "$infile" > "$outdir/$base.fields"
    awk '{a[$1]++} END {for (i in a) print i, a[i]}' "$infile" | sort > "$outdir/$base.unique"
done
- Create a shell script to automate a pipeline:
#!/bin/bash

# Define variables
indir="/path/to/input/files"
outdir="/path/to/output/files"

# Create the output directory if it doesn't exist
mkdir -p "$outdir"

# Process each file in the input directory
for infile in "$indir"/*.txt; do
    # Extract the base filename without the extension
    base=$(basename "$infile" .txt)

    # Perform operations on the input file using a pipeline:
    # filter lines based on a condition, calculate a summary statistic, and generate a count report
    awk '{if ($1 > 5) print $0}' "$infile" > "$outdir/$base.filtered"
    awk '{sum += $1} END {print sum}' "$infile" > "$outdir/$base.sum"
    awk '{count[$1]++} END {for (i in count) print count[i], i}' "$infile" | sort -rn > "$outdir/$base.report"
done
These are just a few examples of how you can automate workflows and pipelines using shell scripts and command-line tools. There are many more ways to automate and streamline your data analysis and processing workflows.