Sed, awk, and regular expressions
January 5, 2024In the course of doing bioinformatics, you will be dealing with myriad different text files. As we noted in the previous chapters, Unix, with its file I/O model, piping capabilities, and numerous utilities, is well-suited to handling large text files. Two utilities found on every Unix installation—awk
and sed
—merit special attention in this context. awk
is a lightweight scripting language that lets you write succinct programs to operate line-by-by line on the contents of text files. It is particularly useful for handling text files that have columns of data separated by white spaces or tabs. sed
on the other hand, is particularly useful for automating “find-and-replace” operations on text files. Each of them is optimized to handle large files without storing a lot of information in memory, so they can be useful for quick operations on large bioinformatic data sets. Neither is a fully-featured programming language that you would want to write large, complex programs in (that said, I did once implement a complete program for full-sibling inference from multiallelic markers in awk
); however they do share many of the useful text-manipulation capabilities of such languages, such as Perl and Python. Additionally, awk
and sed
are deployed in a consistent fashion across most Unix operating systems, and they don’t require much time to learn to use effectively for common text-processing tasks. As a consequence awk
and sed
are a useful addition to the bioinformatician’s toolbox.
Both awk
and sed
rely heavily on regular expressions to describe patterns in text upon which some operation should be performed. You can think of regular expressions as providing a succinct language for performing very advanced “find” and “find-and-replace” operations in a text file.
In this chapter we will only scratch the surface of what can be done with awk
and sed
. Our goal here is to provide an introduction to a few basic maneuvers with both awk
and sed
and to describe instances where they can be useful, as well as to give an introduction to many (but certainly not all) of the patterns that can be expressed using regular expressions. We will start with a basic overview of how awk
works. Then we will have a short look at regular expressions, then we will use those further in awk
, and finally we will play with sed
a little bit.
All of the examples in this chapter that use such external files assume that the current working directory is the repository directory awk-and-sed-inputs
.
awk
Line-cycling, tests and actions
awk
operates by cycling through a file (or an input stream from stdin) line-by-line. At each new line awk
tests whether it should do anything with the contents of the line. If the answer to that test is “yes”, then it executes the code that describes what should be done to the contents of the line. The scope of ways that awk
can operate on text is quite wide, but, the most common use of awk
is to print parts of the line, or do small calculations on parts of the line. The instructions that describe the tests and the actions are included in the awk
“script”, though, in practice, this script is often given to awk
not as a file, but, rather, as text between single-quotes on the command line. The basic syntax looks like this:
% awk '
test1 {action1}
test2 {action2}
' file
Where file
is the path of the file you want awk
to read, and the “tests” and “actions” above are just placeholders showing where those parts of the script are written. The layout makes it clear that if test1
is TRUE, action1
will be executed, and if test2
is TRUE, then action2
is executed. Code describing the actions must always appear within a set of curly braces. We will refer to the text within those curly braces as an action block.
Two things to note: first, different tests and actions do not have to be written on separate lines. That makes things easier to read, but if you are writing these short scripts on the command line, it is often easier to put everything on a single line, like 'test1 {action1} test2 {action2}
, and that is fine. Second, if you don’t supply a file path to awk
it expects (and gladly processes) data from stdin.
This makes it easy to pipe data into awk
. For example, putting the above two points together, the above awk
script skeleton could have been written this way:
% cat file | awk 'test1 {action1} test2 {action2}'
Before we get too abstract talking about tests and actions, let’s look at a few examples:
# print lines starting with @SQ in the SAM header of a file
awk '/^@SQ/ {print}' data/DPCh_plate1_F12_S72.sam
# print only those @SQ lines with a sequence name tag starting with "NC_"
awk '/^@SQ/ && /SN:NC_/ {print}' data/DPCh_plate1_F12_S72.sam
# same as above, but quit processing the file as soon as you hit
# an @SQ line with a sequence name starting with "NW_"
awk '
/^@SQ/ && /SN:NC_/ {print}
/^@SQ/ && /SN:NW_/ {exit}
' data/DPCh_plate1_F12_S72.sam
# print only lines 101, and 103 from the fastq file
gzcat data/DPCh_plate1_F12_S72.R1.fq.gz | awk 'NR==101 || NR==102 {print}'
The above examples show only two different actions (print
and exit
) and a variety of tests based on matching substrings in each line, or on which line number (NR
, which stands for “number of the record”).
Take a moment to make sure you understand which parts are the tests, and which are the actions in the above examples.
Column splitting, fields, -F
, $
, NF
, print
, OFS
and BEGIN
Every time awk
processes a line of text it breaks it into different fields, which you can think of as columns (as in a spreadsheet). By default, any number of whitespace (space or TAB) characters constitutes a break between columns. If you want field-splitting to be done using more specific characters, you can specify that on the command line with the the -F
option. For example:
awk -F"\t" 'test {action}' file # split lines into fields on single TAB characters
awk -F"," 'test {action}' file # split lines into fields on single commas
awk -F":" 'test {action}' file # split lines into fields on single colons
awk -F";" 'test {action}' file # split lines into fields on single semicolons
# and so forth...
While in all those examples, the field separator is a single character, in full-blown practice, the field separator can be specified as a regular expression (see below).
Within an action block, you can access the different fields that awk
has split out of the line using a $
followed immediately by a field number, i.e., $1
, $2
, $3
, … . When you get to fields with indexes greater than 9, you have to wrap the number in parentheses, like $(10)
, $(11)
,… . The special value $0
denotes the whole line, not just one of its fields. (Note that if you give the action print
without any arguments, that also just prints the whole line.) The special variable NF
is the “Number of Fields”, so $NF
is the “last column.”
Let’s say we have a tab-separated file in data/wgs-chinook-samples.tsv
that looks like this:
vcf_name ID_Berk NMFS_DNA_ID BOX_ID BOX_POSITION Population Concentration (ng/ul) run_type
DPCh_plate1_A01_S1 CH_Plate1_1A T144767 T1512 1A Salmon River Fall 23.4 Fall
DPCh_plate1_A02_S2 CH_Plate1_2A T144804 T1512 5F Salmon River Fall 67.4 Fall
DPCh_plate1_A03_S3 CH_Plate1_3A T145109 T1515 7G Salmon River Spring 3.52 Spring
DPCh_plate1_A04_S4 CH_Plate1_4A T145118 T1515 8H Salmon River Spring 10.3 Spring
DPCh_plate1_A05_S5 CH_Plate1_5A T144863 T1513 1A Feather River Hatchery Fall 220 Fall
Then, if we wanted to print the first and the last columns (fields) we could do
awk -F"\t" '{print $1, $NF}' data/wgs-chinook-samples.tsv
# the first few lines of output look like:
vcf_name run_type
DPCh_plate1_A01_S1 Fall
DPCh_plate1_A02_S2 Fall
DPCh_plate1_A03_S3 Spring
DPCh_plate1_A04_S4 Spring
Notice that if there is no test
before the action block, then the action is done on every line.
The print
command prints variables to stdout, if you separate those variables with a comma, then in the output, they will be separated by the output field sepator which, by default, is a single space. You can set the output field separator using the OFS
variable, in an action block that is forced to run at the beginning of execution using the special BEGIN
test keyword:
awk -F"\t" '
BEGIN {OFS=";"}
{print $1, $NF}
' data/wgs-chinook-samples.tsv
# makes output like:
vcf_name;run_type
DPCh_plate1_A01_S1;Fall
DPCh_plate1_A02_S2;Fall
DPCh_plate1_A03_S3;Spring
DPCh_plate1_A04_S4;Spring
DPCh_plate1_A05_S5;Fall
The comma between the arguments to print
is what makes awk
print the output field separator between the items. If you separate arguments to print
with just a space (and not a comma), there will be nothing printed between the two arguments on output. This, coupled with the fact that print
will happily print any strings and numbers (in addition to variables!) you pass to it, provides a way to do some light formatting of the output text:
awk -F"\t" 'NR > 1 {print "sample_name:" $1, "run_type:" $NF}' data/wgs-chinook-samples.tsv
# gives output like:
sample_name:DPCh_plate1_A01_S1 run_type:Fall
sample_name:DPCh_plate1_A02_S2 run_type:Fall
sample_name:DPCh_plate1_A03_S3 run_type:Spring
sample_name:DPCh_plate1_A04_S4 run_type:Spring
sample_name:DPCh_plate1_A05_S5 run_type:Fall
Note that if you don’t provide an action block after a test, the default action, if the test is true, is assumed to be “print the whole line as is.” Thus you can use awk
to print matching lines simply like this:
awk '/regex/'
where /regex/
just means “some regular expression,” as is explained in the next section.
Now it’s your turn!
# 1. Using the file data/wgs-chinook-samples.csv, print out the
# NMFS_DNA_ID the BOX_ID and the BOX_POSITION, separated by periods
A brief introduction to regular expressions
In a few of the above examples you will see tests that look like: /^@SQ/
. This is a regular expression. In awk
, regular expressions are enclosed in forward slashes, so the actual regular expression part in the above is ^@SQ
, the enclosing forward slashes are just delimiters that are telling awk
, “Hey! The stuff inside here should be interpreted as a regular expression.”
At this stage, you can think of a regular expression as a “search-string” that awk
will try to match against the text in a line. At its simplest, a regular expression just describes how characters should match between the regular expression and the line being matched. For example:
/ACGGTC/
Is saying, “search for the word ACGGTC
,” which is something that might find in a DNA string.
If all that regular expressions did was express a search word (like your familiar find function in Microslop Word, for example), then they would be very easy to learn, but also very limited in utility. The good news is that all your standard numerals and upper- and lowercase letters work in regular expressions just like they do in your vanilla “find” function. So, all of the following regular expressions are just requesting that the word enclosed in the /
’s be found:
/Fall/
/A01/
/plate/
/LN/
Regular expressions get more complicated (and useful) because some of the familiar punctuation marks have special meaning within a regular expression. These are called metacharacters. The fundamental metacharacters in awk
are listed in Table 6.1. It is worth getting familiar with all of these as most are common to all languages that use regular expressions, such as R, python, and perl, so learning these will be helpful not just in using awk
, but also in your programming, in general.
Metacharacter(s) | Description |
---|---|
. | Match any single character |
[ ] | Match any single character from a character class (see below) |
^ | Signifies/matches the beginning of the line or word (remember /^@SQ/ ) |
$ | Signifies/matches the end of the line of word |
* | Match zero or more occurrences of the character (or grouped pattern) immediately to the left |
? | Match zero or one occurrences of the character (or grouped pattern) immediately to the left |
+ | Match one or more occurrences of the character (or grouped pattern) immediately to the left |
{n} | Match n occurrences of the character (or grouped pattern) immediately to the left |
{m,n} | Match any number between m and n occurrences of character (or grouped pattern) immediately to the left |
| | Combine regular expressions with an OR |
\ | Use it if you want to match a literal metacharacter. For example \. matches an actual period, and \? matches an acutal question mark |
( ) | Used to group characters into a single unit to which modifiers like * or + can be applied, to delimit the extent of | ’s (or to be used in more advanced expressions for inserting replacement groups) |
We note in the above table that more explanation is needed for the concept of character classes defined by [ ]
. These are similar to what you are already familiar with in terms of globbing file names. In short, if you put a variety of characters between square brackets, it means “match any one of these characters.” For example, /[aceACE]/
means “match any upper- or lowercase a
, c
, or e
. Within those square brackets, ^
and -
have special meaning depending on where between the square brackets they occur:
- A
-
, when between letters or numbers, indicates a range:/[a-zA-Z]/
means any letter; /[0-5]/ means any numeral between 0 and 5. - A
-
at the beginnging or end of the characters inside the square brackets just means-
:/[-;ab]/
means match any of the characters-
,;
,a
, orb
. - A
^
at the beginning of the characters inside the square brackets negates the character class, meaning the match will be to anything not within the character class, i.e.,[^ABC]
means match any character that is notA
norB
norC
. If the^
is not at the beginning of the characters within the[ ]
then it carries no special meaning:[:;^%]
matches any of those four punctuation characters.’ - All characters, except
-
and^
in the correct positions, and the backslash\
, are interpreted literally (i.e., not as metacharacters) within the[ ]
. So you can match any one of?
,*
,(
,)
, or+
, for instance, with/[?*()+]/
.
All these metacharacters might seem a little cryptic, so I provide a few examples, here, that might help to make it more clear. They are presented (i.e. “match any line”) as if they are part of an awk
test.
# match any line that starts with @RG
/^@RG/
# match any line that starts with @ followed by any two characters
/^@../
# match any line that starts with @ followed by any
# two uppercase letters
/^@[A-Z][A-Z]/
# the above could also be written as:
/^@[A-Z]{2}/
# This matches phone numbers formatted either like
# (###) ###-####, or ###-###-####
/\(?[0-9]{3}(\) |-)[0-9]{3}-[0-9]{4}/
# And if you can parse that out, you get an A for the day!
# match anything that starts with T456, then has any
# number (including 0) of any other characters, then
# ends with ".fq"
/T456.*\.fq/
# Note the use of backslash to escape the .
# Match either "big mess" or "huge mess"
/(big|huge) mess/
Now it’s your turn
Here are some things to try.
# 1. use alternation (the |) to match lines in the SAM file data/DPCh_plate1_F12_S72.sam
# that start with either @PG or @RG
# 2. search through the fastq file data/DPCh_plate1_F12_S72.R1.fq.gz
# to find any line with exactly two consecutive occurrences of any of the following characters:
# ! " # $ % &
# What are we searching for in the above?
A variety of tests
When you want to test whether a line will be processed by an action block, or not, you have many options. Among the major ones (each given with an example or two) are:
- match a regular expression anywhere in a line:
awk '/^@RG/' awk '/Fall/'
- match a regular expression in a specific column/field
awk '$3 ~ /Sacramento/' awk '$1 ~ /Chr[123]/'
- test for equality, using
==
, of a single column to a string, or a number. Note these are not regular expressions, but actual statements of equality. Strings in such a context are surrounded by double quotes.awk '$1 == "NC_07124.1"' awk '$5 == 100'
- Use comparison operators
<
,<=
,>
,>=
, and!=
(not equals) to compare a column to a string (compared by lexicographical order) or a number (compared by numerical order)awk '$2 <= "Aardvark"' awk '$7 > 25'
- test the value of a user-defined variable that may be changing as lines are getting processed.
awk 'n > 356' awk 'my_word == "Loony"'
- test the value of an internal variable, like
NR
orNF
awk 'NR < 25' awk 'NF == 13'
As these tests are effectively things that return a value of TRUE or FALSE, they can be combined with logical AND and logical OR operators. In awk, the logical AND is &&
and the logical OR is ||
. To make the intent of long combinations clear, and to specify specific combinations, the tests can be grouped with parentheses:
awk '(units = "days" && n > 356) || (units == "months" && n > 12) {print "More than a year!"}'
Code in the action blocks
Within the action blocks you write computer code to do things, and awk
has many features you expect in a programming language.
Separate lines of code can be ended with a line return (i.e., in scripts), or they can be ended with a semicolon. Variable assignment is done with the =
sign. Unlike the bash shell, you can have spaces around the =
. One interesting aspect of awk
is that the variables are untyped. This means that you don’t have to tell awk
ahead of time whether a variable is going to hold a number, or a string, etc. Everything is stored as a string, but when used in a numeric context, if it makes sense, the variable will be treated as a number. So, you don’t have to worry too much about the type of variables.
If a value has not yet been assigned to a variable, but it is used in a numeric context, the value is assumed to be 0; if used in a string context, its value is assumed to be the empty string ""
.
You can use for
loops within awk
. They have the syntax of the C language:
for(var = initial; test; increment)
For example, this cycles i
over values starting from 1, until 10, each time incrementing the value by 1:
for(i=1;i<=10;i++)
Note that the ++
means “add one to the variable to my left.”
You can also increment by larger amounts. What do you think this would do?
echo boing | awk '{for(i=5; i<=25; i+=5) print i}'
If the body of the loop consists of multiple lines of code (or even just one) those lines can be grouped using curly braces. Here is an examle of using a for
loop to print the columns of the Chinook sample sheet in row format:
awk -F"," '
NR == 1 {for(i=1;i<=NF;i++) head[i] = $i; next}
{for(i=1;i<=NF;i++) {
print i ". " head[i] ": " $i;
}
print "----------------------"
}
' data/wgs-chinook-samples.csv
Arrays are implemented as associative arrays. Thus, rather than being arrays with elements indexed (and accessed) by natural numbers (i.e. array[1]
, array[2]
, etc.), arrays are indexed by any string. This is sometimes called a hash. In python it is called a dictionary. So, within an awk action block you could see
n[$3]++
which means “find the element of the array n
that is associated with the key $3, and then add 1 to that element”. This can be quite useful for counting up the occurrences of different strings or values in a column. Of course, in order to actually see what the values are, after they have been counted up, you need to be able to cycle over all the different keys of the array, and print the key and the value for each. In awk
, you cycle over the keys of an array using
for (i in array)
where for
and in
are keywords, i
is any variable name you want, and array
is the name of an array variable. Unfortunately, you have no control over the order in which the different keys in the array are visited! Example:
# count the number of fish from different sampling collections
# in the wgs-chinook-samples.tsv file:
awk -F"\t" '
NR > 1 {n[$6]++}
END {for(i in n) print i ":", n[i]}
' data/wgs-chinook-samples.tsv
# gives us:
Coleman Hatchery Late Fall: 16
Feather River Hatchery Spring: 16
Feather River Hatchery Fall: 16
Salmon River Spring: 16
Trinity River Hatchery Fall: 16
Trinity River Hatchery Spring: 16
Butte Creek Spring: 16
Sacramento River Winter: 16
Salmon River Fall: 16
San Joaquin River Fall: 16
The above demonstrates the use of the very important END
specifier in the test position. The END
there means “perform the actions in the action block once all the lines of the file have been processed.”
Now it’s your turn!
We want to cycle through the file DPCh_plate1_F12_S72.R1.fq.gz
and count up the number of times different base-quality sequences occur in the file. The base quality scores occur on lines 4, 8, 12, … (so, the line number divided by 4 has no remainder. Note that x % 4
gives the remainder when x
is divided by 4. We want the output on each line to be in the format:
Number_of_occurrences Base_quality_score_sequence
# go for it!
After doing that, you might wish that things were sorted differently, like highest to lowest in terms of # of occurrences. Try piping the output into:
sort -n -b -r -k 1
Pipe that to less and look through it. It is actually pretty cool.
Conditional tests use if
and else
, with blocking by curly braces.
Below we will demonstrate the use of if
and else
and, how they can be put together to make an ifelse
-like construction. At the same time we demonstrate how output from print
statements in awk
can be redirected to files, from within the awk
script itself. This uses much the same syntax as the shell.
Imagine that we want to put the IDs (called the vcf_names
) of the fish in data/wgs-chinook-samples.csv
into separate files, one for each of the run_types
of “Fall”, “Winter”, and “Spring”, and another file for anything else. We could do that like this:
awk -F"," '
NR > 1 {
if($NF == "Fall") {
print $1, $NF > "fall.txt"
} else if($NF == "Winter") {
print $1, $NF > "winter.txt"
} else if($NF == "Spring") {
print $1, $NF > "spring.txt"
} else {
print $1, $NF > "other.txt"
}
}
' data/wgs-chinook-samples.csv
Now look at the four files produced.
NB: Parsing CSV files with awk
is not always so straightforward as doing -F","
because the CSV specification allows for commas that do not separate fields to be hidden within quotation marks. Don’t expect to parse complex CSV files made by Excel, for example, to be parsed this easily with awk
. It is better to save them as TAB separate files, typically.
Mathematical operations: awk
has got them all: +
, -
, *
, /
, %
, as well as “operate and reassign” versions: +=
, -=
, *=
, /=
, as well as: exp
, log
, sqrt
, sin
, cos
, and atan2
Built in functions: awk
has a limited set of built-in functions. See man awk
for a full listing. The ones I find I use all the time are:
length(n)
: return the number of characters of the variablen
substr(s, m, n)
: return the portion of strings
that begins at positionm
(counted starting from 1), and isn
characters longsub(r, t, s)
: substitute stringt
for the first occurrence of the regular expressionr
in the strings
. Ifs
is not given,$0
is used.gsub
: just likesub
but replaces all (rather than just the first) occurrences of the regular expressionr
in strings
.split(s, a, fs)
: split the strings
into an array variable nameda
at occurrences of the regular expressionfs
. The function returns the number of pieces the string was split into. Afterward each part of the string can be accessed likea[1]
,a[2]
, and so forth. (Good example = splitting out fields from a column in a VCF file.)
Take control of output formatting with the printf()
function:
I still need to write this.
Using awk
to assign to shell variables
We leave our discussion of awk
by noting that its terse syntax makes it a perfect tool for creating shell variables that we can do things (like cycling over them) with.
For example, imagine we wanted to do something to all the FASTQ files associated with the fish in data/wgs-chinook-samples.csv
that are of run_type
= “Winter”. We can get all their IDs using command substitution and then cycle over them in the shell like this:
WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)
for i in $WINTERS; do
echo "FASTQS are: $i.R1.fq.gz $i.R2.fq.gz"
done
Passing Variables into awk
with -v
Gotta let people know about this!
Writing awk scripts in files
Need to do this.
sed
sed
is a “stream editor”. Though it is capable of all sorts of things, to be honest, I use it almost exclusively to do simple find-and-replace operations. The syntax for that is:
sed 's/regex/replacement/g;' file
where regex
is a regular expression and replacement
is the string you wish to replace any segments of file
that match regex
. The s
means “substitute” and is like a command to sed
, and the g
means “globally”, without which only the first match of regex
on each line would be replaced.
Multiple instances of the “s” command can be given, like:
sed '
s/regex1/replacement1/g;
s/regex2/replacement2/g;
s/regex3/replacement3/g;
' file
The separate commands are done in order, line by line.
Now it’s your turn!
Take the output of:
WINTERS=$(awk -F"," '$NF == "Winter" {print $1}' data/wgs-chinook-samples.csv)
for i in $WINTERS; do
echo "FASTQS are: $i.R1.fq.gz $i.R2.fq.gz"
done
and pipe it into sed
to changes the R1
and R2
into r1
and r2
, and to remove the .gz
from the end of each file name. Remember to backslash-escape that period!