Cutting-Edge Bioinformatics Techniques

Essential Unix/Linux Terminal Knowledge

January 5, 2024 Off By admin
Shares

Unix originated at AT&T Bell Labs in the 1960s and, while formally a trademarked operating system, colloquially refers to the shell – the text-command-driven interface facilitating interaction between Unix users and computers.

The enduring appeal of the Unix shell lies in its consistent functionality over many decades. Learning it means embracing a time-tested mode of computer interaction that remains a cornerstone for large computer systems, defying transient trends.

In the realm of bioinformatics, Unix stands out as the preferred tool for several reasons. Firstly, it enables complex data analyses with minimal commands. Secondly, Unix facilitates task automation, particularly for repetitive operations. Thirdly, its standard command set includes tools for managing large files, inspecting, and manipulating text files. Furthermore, Unix efficiently expresses and executes multiple, consecutive analyses on a single data stream, often without intermediate disk writes. Additionally, Unix’s optimization for limited computing resources during its development makes it well-suited for handling massive genomic datasets on modern high-performance computing systems. Most cutting-edge bioinformatic tools are tailored for Unix environments. Lastly, nearly every high-performance computer cluster operates on a Unix variant, making Unix proficiency crucial for cluster-based analyses.

Getting a bash shell on your system

A special part of the Unix operating system is the “shell.” This is the system that interprets commands from the user. At times it behaves like an interpreted programming language, and it also has a number of features that help to minimize the amount of typing the user must do to complete any particular tasks. There are a number of different “shells” that people use. We will focus on one called “bash,” which stands for the “Bourne again shell.” Many of the shells share a number of features.

Many common operating systems are built upon Unix or upon Linux—an open-source flavor of Unix that is, in many scenarios, indistinguishable. Hereafter we will refer to both Unix and Linux as “Unix” systems). For example all Apple Macintosh computers are built on top of the Berkeley Standard Distribution of Unix and bash is the default shell. Many people these days use laptops that run a flavor of Linux like Ubuntu, Debian, or RedHat. Linux users should ensure that they are running the bash shell. This can be done by typing “bash” at the command line, or inserting that into their profile. To know what shell is currently running you can type:

echo $0

at the Unix command line. If you are running bash the result should be

-bash

PCs running Microsoft Windows are something of the exception in the computer world, in that they are not running an operating system built on Unix. However, Windows 10 now allows for a Linux Subsystem to be run. For Windows, it is also possible to install a lightweight implementation of bash (like Git Bash). This is helpful for learning how to use Unix, but it should be noted that most bioinformatic tools are still difficult to install on Windows.

Navigating the Unix filesystem

Most computer users will be familiar with the idea of saving documents into “folders.” These folders are typically navigated using a “point-and-click” interface like that of the Finder in Mac OS X or the File Explorer in a Windows system. When working in a Unix shell, such a point-and-click interface is typically not available, and the first hurdle that new Unix users must surmount is learning to quickly navigate in the Unix filesystem from a terminal prompt. So, we begin our foray into Unix and its command prompt with this essential skill.

When you start a Unix shell in a terminal window you get a command prompt that might look something like this:

my-laptop:~ me$

or, perhaps something as simple as:

$

or maybe something like:

/~/--%

We will adopt the convention in this book that, unless we are intentionally doing something fancier, the Unix command prompt is given by a percent sign, and this will be used when displaying text typed at a command prompt, followed by output from the command. For example

% pwd
/Users/eriq

shows that I issued the Unix command pwd, which instructs the computer to print working directory, and the computer responded by printing /Users/eriq, which, on my Mac OS X system is my home directory. In Unix parlance, rather than speaking of “folders,” we call them “directories;” however, the two are essentially the same thing. Every user on a Unix system has a home directory. It is the domain on a shared computer in which the user has privileges to create and delete files and do work. It is where most of your work will happen. When you are working in the Unix shell there is a notion of a current working directory—that is to say, a place within the hierarchy of directories where you are “currently working.” This will become more concrete after we have encountered a few more concepts.

The specification /Users/eriq is what is known as an absolute path, as it provides the “address” of my home directory, eriq, on my laptop, starting from the root of the filesystem. Every Unix computer system has a root directory (you can think of it as the “top-most” directory in a hierarchy), and on every Unix system this root directory always has the special name, /. The address of a directory relative to the root is specified by starting with the root (/) and then naming each subsequent directory that you must go inside of in order to get to the destination, each separated by a /. For example, /Users/eriq tells us that we start at the root (/) and then we go into the Users directory (Users) and then, from there, into the eriq directory. Note that / is used to mean the root directory when at the beginning of an absolute path, but in the remainder of the path its meaning is different: it is used merely as a separator between directories nested within one another. Figure 4.1 shows an example hierarchy of some of the directories that are found on the author’s laptop

Editing text files at the terminal

Sometimes you might need to edit text files at the command line.

The easiest text editor to use on the command line is nano. Try typing nano a-file.txt and type a few things. It is pretty self explanatory.

Customizing your Environment

Previously we saw how to modify your command prompt to tell you what the current working directory is (remember PS1='[\W]--% '). The limitation of giving that command on the command line is that if you logout and then log back in again, or open a new Terminal window, you will have to reissue that command in order to achieve the desired look of your command prompt. Quite often a Unix user would like to make a number of customization to the look, feel, and behavior of their Unix shell. The bash shell allows these customization to be specified in two different files that are read by the system so as to invoke the customization. The two files are hidden files in the home directory: ~/.bashrc and ~/.bash_profile. They are used by the Unix system in two slightly different contexts, but for most purposes, you, the user, will not need or even want to distinguish between the different contexts. Managing two separate files of customization is unnecessary and requires duplication of your efforts, and can lead to inconsistent and confusing results, so here is what we will do:

  1. Keep all of our customization in ~/.bashrc.
  2. Insert commands in ~/.bash_profile that say, “Hey computer! If you are looking for customization in here, don’t bother, just get them straight out of ~/.bashrc.

We take care of #2, by creating the file ~/.bash_profile to have the following lines in it:


if [ -f ~/.bashrc ]; then
    source ~/.bashrc
fi

Taking care of #1 is now just a matter of writing commands into ~/.bashrc. In the following are some recommended customization.

Appearances matter

Some customization just change the way your shell looks or what type of output is given from different commands. Here are some lines to add to your ~/.bashrc along with some discussion of each.

export PS1='[\W]--% '

This gives a tidier and more informative command prompt. The export command before it tells the system to pass the value of this environment variablePS1, along to any other shells that get spawned by the current one.

alias ls='ls -GFh'

This makes it so that each time you invoke the ls command, you do so with the options -G-F, and -h. To find out on your own what those options do, you can type man ls at the command line and read the output, but briefly: -G causes directories and different file types to be printed in different colors, -F causes a / to be printed after directory names, and other characters to be printed at the end of the names of different file types, and -h causes file sizes to be printed in an easily human-readable form when using the -l option.

 Where are my programs/commands at?!

We saw in Section 4.3.1 that bash searches the directories listed in the PATH variable to find commands and executables. You can modify the PATH variable to include directories where you have installed different programs. In doing so, you want to make sure that you don’t lose any of the other directories in PATH, so there is a certain way to go about redefining PATH. If you want to add the path /a-new/program/directory to your PATH variable you do it like this:

PATH=$PATH:/a-new/program/directory

 A Few More Important Keystrokes

If a command “gets stuck” or is running longer than it should be, you can usually kill/quit it by doing cntrl-c.

Once you have given a command, it gets stored in your bash history. You can use the up-arrow key to cycle backward through different commands in your history. This is particularly useful if you are building up complex pipelines on the command line piece by piece, looking at the output of each to make sure it is correct. Rather than re-typing what you did for the last command line, you just up-arrow it.

Once you have done an up-arrow or two, you can cycle back down through your history with a down-arrow.

Finally, you can search through your bash history by typing cntrl-r and then typing the word/command you are looking for. For example, if, 100 command lines back, you used a command that involved the program awk, you can search for that by typing cntrl-r and then typing awk.

One last big thing to note: the # is considered a comment character in bash. This means that any text following a # (unless it is backslash-escaped or inside quotation marks), until the next line ending, will be ignored by the shell.

A short list of additional useful commands.

Everyone should be familiar with the following commands, and the options that follow them on each line below. One might even think of scanning the manual page for each of these:

  • echo
  • cat
  • head-n, -c
  • tail-n
  • less
  • sort-n -b -k
  • paste
  • cut-d
  • tar-cvf, -xvf
  • gzip-c
  • du-h -C,
  • wc
  • date
  • uniq
  • chmodu+xug+x
  • grep

Two important computing concepts

Compression

Most file storage types (like text files) are a bit wasteful in terms of file space: every character in a text file takes the same number of bytes to store, whether it is a character that is used a lot, like s or e, or whether it is a character that is seldom seen in many text files, like ^Compression is the art of creating a code for different types of data that uses fewer bits to encode “letters” (or “chunks” of data) that occur frequently and it reserves codewords of more bits to encode less frequently occurring chunks in the data. The result is that the total file size is smaller than the uncompressed version. However, in order to read it, the file must be decompressed.

In bioinformatics, many of the files you deal with will be compressed, because that can save many terabytes of disk space. Most often, files will be compressed using the gzip utility, and they can be uncompressed with the gunzip command. Sometimes you might want to just look at the fist part of a compressed file. If the file is compressed with gzip, you can decompress to stdout by using gzcat and then pipe it to head, for example.

A central form of compression in bioinformatics is called bgzip compression which compresses files into a series of blocks of the same size, situated in such a way that it is possible to index the contents of the file so that certain parts of the file can be accessed without decompressing the whole thing. We will encounter indexed compressed files a lot when we start dealing with BAM and vcf.gz files.

Hashing

The final topic we will cover here is the topic of hashing, an in particular the idea of “fingerprinting” files on ones computer. This process is central to how the git version control system works, and it is well worth knowing about.

Any file on your computer can be thought of as a series of bits, 0’s and 1’s, as, fundamentally, that is what the file is. A hashing algorithm is an algorithm that maps a series of bits (or arbitrary length) to a short sequence of bits. The SHA1 hashing algorithm maps arbitrary sequences of bits to a sequence of 160 bits.

There are 2160≈1.46×1048 possible bit sequences of length 160. That is a vast number. If your hashing algorithm is well randomized, so that bit sequences are hashed into 160 bits in a roughly uniform distribution, then it is exceedingly unlikely that any two bit sequences (i.e. files on your filesystem) will have the same hash (“fingerprint”) unless they are perfectly identical. As hashing algorithms are often quite fast to compute, this provides an exceptionally good way to verify that two files are identical.

The SHA1 algorithm is implemented with the shasum command. In the following, as a demonstration, I store the recursive listing of my git-repos directory into a file and I hash it. Then I add just a single line ending to the end of the file, and hash that, to note that the two hashes are not at all similar even though the two files differ by only one character:

[~]--% ls -R Documents/git-repos/* > /tmp/gr-list.txt 
[~]--% # how many lines is that?
[~]--% wc /tmp/gr-list.txt 
   93096   88177 2310967 /tmp/gr-list.txt
[~]--% shasum /tmp/gr-list.txt 
1396f2fec4eebdee079830e1eff9e3a64ba5588c  /tmp/gr-list.txt
[~]--% # now add a line ending to the end
[~]--% (cat /tmp/gr-list.txt; echo) > /tmp/gr-list2.txt 
[~]--% # hash both and compare
[~]--% shasum /tmp/gr-list.txt /tmp/gr-list2.txt 
1396f2fec4eebdee079830e1eff9e3a64ba5588c  /tmp/gr-list.txt
23bff8776ff86e5ebbe39e11cc2f5e31c286ae91  /tmp/gr-list2.txt
[~]--% # whoa! cool.
Shares