Getting started in linux-Bioinformatics

July 16, 2019 Off By admin

Basic Linux/Unix Command for bioinformatics

The objective of this activity is to promote understanding of basic UNIX commands. It is structured as a command reference for better continued learning effectiveness. MacOS X and various flavors of Linux are UNIX-based operating systems. Because many of the programs run from the command line, one needs to know how to use some basic UNIX commands.

Among these are how to create files, move files, copy files, delete files, and so on. This page gives an overview of UNIX basics for those who are unfamiliar with them.

Getting help
List a directory
Navigating the directory structure
Create a directory
Change a directory
Locate file in filesystem
Viewing files
Copying files
Concatenate files
Moving files
Deleting files
Deleting directories
Searching files for a pattern
Working with tabular files
Combining commands into a pipeline
Basic shell scripting
Modifying text files using the command line
Connecting to a remote computer
Copying files between machines
Text file conversion
Change file permissions
Uncompressing files and directories
Emacs text editor

Getting help
While the Workshop is in session, one of your best sources of help are the teaching assistants and support staff. This page will still be accessible after the course, but contains help on only a few basic UNIX commands. Some material may not apply to your home systems. On most systems, typing the man command invokes information about commands.

For example, to
learn how cp works, type:

man cp

To exit the man page, hit:

Note: man pages are often a bit obtuse, but are worth reading nevertheless.

List a directory
To list all directories and files within a current directory, type:

To see a detailed list of all directories and files sorted by creation date, type:

ls -lrt

Navigating the directory structure
To work with a file or directory, it is important to tell the system where you are. There are two ways to do that: the absolute and the relative path. The absolute path gives the whole location of the file, for example:

/mnt/chromosome/home/plasma01
The relative path is described with the signs
“.” and “/”:
./ = current directory
../= parent directory of ./ (one directory up)
/ = root directory (top directory)
When no location is specified, UNIX assumes we mean “./”.
~ is your home directory.

In Ubuntu Linux, this will be /home/username
(for example /mnt/chromosome/home/res01 if your user id is res01)

To find out which directory you currently are in, type:

pwd

This will show an output such as: /mnt/chromosome/home/res01

Creating directories
To create a new directory, type:

mkdir directory1

This will create a directory called directory1

Changing directories
To go “down” one directory (in this example to go from plasma01 to directory1), type:

cd directory1

To go “up” one directory (in this example, to go from directory1 to plasma01), type:

cd ..

Now cd to directory1, create directory2 in it and cd to directory2. To go “straight back” to your home directory from directory2, type:

Go to the UNIX tutorial example directory:

cd ~/basic_bioinfo_2019/practical_1/

Viewing files
There are many ways to view a text file. One way for simple viewing is to type:

less hsscl.gff

where “ hsscl.gff” is a file in your current directory. If you cannot open the file with less, check that you are located in the right directory:

pwd (should return ~/basic_bioinfo_2019/practical_1)

The less command displays as much of the file as can fit onto the screen. To scroll up and down within the document, use the arrow keys. Hitting the space bar will bring a new screenfull of information.

To search forward in the file for a given pattern, enter:
/pattern

where “pattern” represents the character string you wish to find. To search for the next occurrence of the “pattern”, press:

To exit the less program and return to the prompt, press:
q

Test
this by opening hsscl.gff using less, and search for the identifier:

MAP17

Viewing part of a file
The head command displays the first few lines at the top of a file. But you can tell it how many lines to display. Example:

head -10 hsscl.gff

(shows the first 10 lines of that file) The tail command displays the last few lines of a file. By default tail will show the last ten lines of a file, but you can tell it how many lines to display.

Example:
tail -10 hsscl.gff
(shows the last ten lines of that file)

Copying files
To copy a file using a terminal, use:
cp original-file new-file
For example:
cp hsscl.gff newfile.gff

This makes a new file (called newfile.gff) that contains the same information as the original file (hsscl.gff ) with the name indicated. Check with ls that you have generated the
newfile.gff

cp can also be used to copy a file from one directory to another. For example:
cp ~/basic_bioinfo_2019/practical_2/reads_01.fastq ./newfile.fastq
This command will take reads_01.fastq in ~/basic_bioinfo_2019/practical_2/ and copy it to
~/basic_bioinfo_2019/practical_1/. It names the copy newfile.fastq. If a name is not specified for the file, a copy with the same name as the original is created.

For example:
cp ./newfile.fastq ~/basic_bioinfo_2012/practical_1/

Instead of using the whole path, you can also use a relative path in copying, for example, we can copy “newfile.fastq” from the current directory to its parent directory using, either:

cp ~/basic_bioinfo_2012/practical_1/newfile.fastq ./newfile2.fastq
or:
cp ./newfile.fastq ../ or cp newfile2.fastq ../

Again, check with ls to see that the file has been generated.
To copy an entire directory, use the -r option:

cp -r dir-to-copy/ new-dir/

For example:
cp -r ~/basic_bioinfo_2019/practical_1/directory1 directory2
copies the directory ~/basic_bioinfo_2019/practical_1/directory1 to directory2 in your current working directory.

Note: on some computers cp -r does not copy dot-files (files that start with ‘.’, for example .bashrc). Also, you may run into trouble copying to or from directories on which you do not have appropriate permissions.

Concatenating files
The cat (short for concatenate) command is a frequently used command. It is used to print the entire content of a file to your screen (use ctrl-c to break if you run cat on a too large file). It can also be used to concatenate text files together.

cat file1 file2 >file3

The contents of file1 and file2 are now in file3.

Check your directory and pick two gff files to concatenate into a file called concatenate.gff.

Check the results using less and see that the content of both files are found in the newly generated concatenate.gff.

Moving files
On a UNIX file system, moving a file is the same as renaming it. The command mv works like:

mv file1 file2
For example:
mv newfile.fastq new.fastq

takes newfile.fastq and renames it new.fastq. Files can be moved from one location to another:
mv location1/file1 location2/file1

For example:
mv new.fastq ../new2.fastq

This command moves the new.fastq file one directory up and calls the file new2.fastq. To move a file and also rename it, specify the name after the second location. For example:

mv ../new.fastq ./movedfile.fastq

Deleting files
rm is the command to remove a file. There are flags (options) that can be used with rm. To delete a file, type:

rm filename

For example, to delete the file called movedfile.fastq, type:
rm movedfile.fastq

To delete a file from a different directory, supply rm with the path. For example:
rm ~/basic_bioinfo_2019/practical_1/movedfile.fastq
Note that you can of course only remove the file once!

Deleting directories
To delete a directory, type:
rm -r directory1

This will remove the directory and all the files and directories in it, so use this command carefully.

Count lines in a file
To count the number of lines in a file, type:

wc -l

The -l option restricts the output to only giving the number of lines.
Locate file in the filesystem

Used to find the location of files and directories
Note that you would need locate database to be generated for the locate command to work. If that is the case follow the instructions given by the locate program or ignore this part of the tutorial.

Type locate followed by the name of the file you are looking for:
locate
Example:
locate reads_01

Search files for a pattern
Grep is an acronym for General Regular Expression Print.
In other words, the grep command is excellent for searching for a particular pattern in a file and outputting the results to the screen. This pattern can be literal (exact characters), such as
chr_1, but could also be a regular expression (often called regexp). Regular expressions are a powerful way to find and replace strings of a particular format. For example, regular expressions can be used to search for a pattern such as “chr_” followed by any number (e.g.
chr_1 as well as chr_999 etc should match). Note: A regexp provides more flexibility in a search, but can be difficult to get right in the beginning so make sure to check that your results are what you expect them to be. It is highly recommended to put the pattern you are
searching for within single quotes:

grep ‘pattern’ filename

For example to retrieve all lines matching fastq headers:
grep ‘@’ fastqfile

Count the total number of lines in the file using: wc -l reads_01.fastq

The file is a fastq file from Illumina. To retrieve a list of all the lines matching the fastq

headers:
grep ‘@’ reads_01.fastq

Adding the option –color highlights the match on the screen.Try the grep again:

grep –color ‘@’ reads_01.fastq

Match a number in the fastq file:
grep -E ‘[0-9]’ reads_01.fastq

The -E option extends the regular expression capabilities of grep and [0-9] matches any digit.

Add the color option (–color as described above) to see what you matched. The color option must come right in front of the pattern you are searching for (in this case [0-9]).

To count how many matches grep got, you can use the -c option:
grep -E -c ‘[0-9]’ reads_01.fastq

To search specifically for headers having 4 digits in their id, you can use the “#” in your

regexp:
grep -E -c ‘[0-9]{4}#’ reads_01.fastq
grep -E ‘[0-9]{4}#’ reads_01.fastq

Split large files
With the large data sizes typically generated in genomics, it can sometimes be an advantage to split a file into several smaller ones:

split -l 10000 reads_01.fastq

will split the file called filename into 4 separate files with 10000 lines in each. The files will be starting with x (for example xaa). Check with ls that you have these files and cound the lines using

wc -l x*

Make sure that you split files so that the sequence, quality and header information are kept intact in the same file. You can check the split outputs by using tail:

tail x*

Working with tabular files
To generate a file containing a subset of columns from a tabular (tab) file you can use:

cut
The “-f” flag allows you to specify which field (i.e. column) to extract. To get the first column from the file:

cut –f 1 filename

to get the first and the third columns from the file :

cut –f 1,3 filename

Example: to get the query, hit and evalue from a tabular BLAST output (output option – outfmt 6):
cut -f 1,2,9 hsscl.gff > threecolsfile

The “>” character will write the results of the cut command to the file output_file. Check the results using less.

It is also possible to combine columns from different files using paste:
paste hsscl.gff threecolsfile > pastefile

will combine the columns in hsscl.gff and threecolsfile into pastefile with tabular separation.

Sorting columns of a tabular file can be useful for digesting large data outputs. For example
running:
sort -k 3 hsscl.gff > hsscl.sorted

will sort the lines alphabetically in hsscl.gff by the third column and outputs to hsscl.sorted. If you instead would like to sort it numerically, you need to use the -n option. The sort command is often used with the -u option to get the unique set. Alternatively, uniq could be used on a sorted list:

uniq -f 8 hsscl.sorted

The -f option makes uniq compare column 8, which in the gff file is the sequence names.

This way we can count how many unique sequences we have in the gff file.

Combining commands into a pipeline
UNIX lets you combine virtually any commands by “glueing” them together with the pipe symbol “|”. The output of the first command will be used as input of the next command.

Combining grep and wc will give you the number of lines having a particular pattern:

grep pattern filename | wc -l

or generating a unique list from a file:
sort filename | uniq

For example, counting the number of sequences in a fastq file:
grep ‘MAP17’ hsscl.gff | wc -l

(Note that you can also count by using the -c command in grep (i.e grep -c), but here we are illustrating how to combine commands).

Basic shell scripting
Another way to combine commands is to use a shell script. This way you can save the commands and settings for reuse on future datasets. One of the most common shell scripting languages is bash. Open emacs (or other suitable text editor). The first line of a bash script must contain the shebang line, which references the bash interpreter in the operating system:

#!/bin/bash

Modifying text files using the command line
You can use awk (or gawk) to search through a file and make changes.
awk ‘condition {action}’ inputfile where condition is a pattern and action is one or several commands to be executed.

For example, print the protein ids matching the pattern “exon” from the hsscl.gff file located in

~/wg_jan2012/activities/Unix_Tutorial/:

awk ‘/exon/ {print $1}’ hsscl.gff

How did your output look compared to the complete file? Check with less or head.

The $1 only prints the first column, but you can choose to print all columns by using:

awk ‘/exon/ {print $0}’ hsscl.gff

In awk you can also pick several columns:
awk ‘/exon/ {print $1, “\t”, $3, “\t” $10}’ hsscl.gff

Another way to process files is to use a perl one liner. Perl is a programming language

commonly used in bioinformatics. It is possible to run simple perl commands directly on the command line (perl one liners):

perl -ne < inputfile ‘condition{action}’

For example, print all lines matching “exon”:

perl -ne < hsscl.gff ‘if(/exon/){print $_;}’

Connecting to a remote machine
ssh is a program to connect to another computer. For Linux and MacOS X, there are two ways of using ssh from a terminal:

ssh user@computer

ssh computer -l user

For example:
ssh [email protected]

ssh 10.10.10.16 -l darwin

See also Bioinformatics glossary - O

You will then be asked for your password. If you are using a notebook computer that does not already have an ssh client program, you can get one for free.

If you are running MacOS X or Linux, simply open a terminal. If you are running Windows, try PuTTY.

Copying files between machines
Some of you may be familiar with FTP (File Transfer Protocol), which we do not recommend because it is a relatively insecure method of transferring files between computers.

To transfer files between machines, you should use:
scp

This command works very similarly to cp, except that the files you are copying reside on different machines. To copy a file from a remote computer to the computer you are working at, type:

scp user@remote-computer:/remote/path/remote-file /local/path/file

You will then be prompted for the user’s password on the remote computer. After you enterit, the file will be copied.

For example:
scp [email protected]:/mnt/chromosome/home/plasma01/info /home/bioinfo/facts

This command will copy the file called ‘info’ that lives on Altix4700 server at

/mnt/chromosome/home/plasma01/info to /home/bioinfo

on my current machine and will rename it ‘facts’. For those of you familiar with FTP, this is like ‘get remote-file local-file’.

Note: if you do not specify a local name, scp will name the file the same name as the remote file. To copy a local file onto a remote computer, you type:

scp /local/path/local-file user@remote-computer:/remote/path/remote-file

For those of your familiar with FTP, this is like ‘put local-file remote-file’. Note: As with cp, you need to have read/execute permissions for the two files. Also, you should only specify
one computer name (either the local host or the remote host). As with the cp command, you can specify the -r flag to recursively enter directories.

Note: this does not work with Windows machines.
scp works only between UNIX based operating systems. SSH File Transfer
Protocol or SFTP is a network protocol that provides file transfer and manipulation functionality over any reliable data stream. To copy files from a remote computer to a local computer, first navigate to the local directory where you want to copy or retrieve files, then type:

sftp user@remote-computer

You will then be prompted for the user’s password on the remote computer. After you enter it, you will be logged into the user’s home directory on the remote computer. Use UNIX commands to navigate to the remote directory where you want copy or retrieve files. To copy
a local file onto a remote computer, you type:

put local-file remote-file

To copy a remote file onto the local computer, you type:

get remote-file local-file

To copy multiple remote files onto the local computer, you type:

mget remote-file local-file

Text file conversion

When text files are created in Windows, they are given line endings formatted for DOS.

These files are unable to be utilized by programs run in UNIX, and must be converted. There are many ways to acomplish this; however the simplest is by utilizing the following command, where you replace dosfile.txt with the name of your file, and unixfile.txt with the
new file name:

tr -d ‘\15\32′ < dosfile.txt > unixfile.txt

Similarly, to convert from Mac line endings to UNIX, you type:

tr ‘\r’ ‘\n’ < macfile.txt > unixfile.txt

Change file permissions

chmod – change file access permissions

Change file permissions to make a file executable:

chmod +x filename

Uncompressing files and directories

tar : create tape archives and add or extract files.

Example:
tar –zvxf tarfile.tar.gz

unzip : extract all files of the archive into the current directory and subdrectories:

unzip zippedfile.zip

gunzip : uncompress a gzip’d file:

gunzip gzippedfile.gz

Emacs text editor
Finally, open the text editor called emacs:

emacs

You should follow the tutorial by typing:
ctrl-h and then t

You can also open the bash_profile.txt in

~/basic_bioinfo_2019/practical_1/

Go through the text and test the commands on the command line (remove the # when running the commands). See what happens at the prompt in the terminal window. If you like any of the results, you can add the various commands to your own ~/.bash_profile file on your system, but make sure to make a copy of the original .bash_profile before you start.

The Quantum Leap in Bioinformatics and Structural Biology

What's Next in Bioinformatics: 2023 Innovations in AI, Genomics, and Healthcare

Bioinformatics in 2023: The Job Landscape Unraveled

Bioinformatics glossary - W

Bioinformatics courses in Africa

Single-cell Biology: A Comprehensive Overview

Advanced Topics in Enzymology for Bioinformatics

Top 100 Global Universities for Bioinformatics and Computational Biology: A Deep Dive

Python for Non-Programmers

Genomic Pioneers: Cord Blood Research Empowered by Bioinformatics Tools

Structural Bioinformatics: An Overview of Methods and Applications

Essential Tools and Software in Bioinformatics: BLAST, FASTA, and Clustal