Getting started in linux-Bioinformatics
July 16, 2019Basic Linux/Unix Command for bioinformatics
The objective of this activity is to promote understanding of basic UNIX commands. It is structured as a command reference for better continued learning effectiveness. MacOS X and various flavors of Linux are UNIX-based operating systems. Because many of the programs run from the command line, one needs to know how to use some basic UNIX commands.
Among these are how to create files, move files, copy files, delete files, and so on. This page gives an overview of UNIX basics for those who are unfamiliar with them.
Getting help
List a directory
Navigating the directory structure
Create a directory
Change a directory
Locate file in filesystem
Viewing files
Copying files
Concatenate files
Moving files
Deleting files
Deleting directories
Searching files for a pattern
Working with tabular files
Combining commands into a pipeline
Basic shell scripting
Modifying text files using the command line
Connecting to a remote computer
Copying files between machines
Text file conversion
Change file permissions
Uncompressing files and directories
Emacs text editor
Getting help
While the Workshop is in session, one of your best sources of help are the teaching assistants and support staff. This page will still be accessible after the course, but contains help on only a few basic UNIX commands. Some material may not apply to your home systems. On most systems, typing the man command invokes information about commands.
For example, to
learn how cp works, type:
man cp
To exit the man page, hit:
q
Note: man pages are often a bit obtuse, but are worth reading nevertheless.
List a directory
To list all directories and files within a current directory, type:
ls
To see a detailed list of all directories and files sorted by creation date, type:
ls -lrt
Navigating the directory structure
To work with a file or directory, it is important to tell the system where you are. There are two ways to do that: the absolute and the relative path. The absolute path gives the whole location of the file, for example:
/mnt/chromosome/home/plasma01
The relative path is described with the signs
“.” and “/”:
./ = current directory
../= parent directory of ./ (one directory up)
/ = root directory (top directory)
When no location is specified, UNIX assumes we mean “./”.
~ is your home directory.
In Ubuntu Linux, this will be /home/username
(for example /mnt/chromosome/home/res01 if your user id is res01)
To find out which directory you currently are in, type:
pwd
This will show an output such as: /mnt/chromosome/home/res01
Creating directories
To create a new directory, type:
mkdir directory1
This will create a directory called directory1
Changing directories
To go “down” one directory (in this example to go from plasma01 to directory1), type:
cd directory1
To go “up” one directory (in this example, to go from directory1 to plasma01), type:
cd ..
Now cd to directory1, create directory2 in it and cd to directory2. To go “straight back” to your home directory from directory2, type:
cd
Go to the UNIX tutorial example directory:
cd ~/basic_bioinfo_2019/practical_1/
Viewing files
There are many ways to view a text file. One way for simple viewing is to type:
less hsscl.gff
where “ hsscl.gff” is a file in your current directory. If you cannot open the file with less, check that you are located in the right directory:
pwd (should return ~/basic_bioinfo_2019/practical_1)
The less command displays as much of the file as can fit onto the screen. To scroll up and down within the document, use the arrow keys. Hitting the space bar will bring a new screenfull of information.
To search forward in the file for a given pattern, enter:
/pattern
where “pattern” represents the character string you wish to find. To search for the next occurrence of the “pattern”, press:
n
To exit the less program and return to the prompt, press:
q
Test
this by opening hsscl.gff using less, and search for the identifier:
MAP17
Viewing part of a file
The head command displays the first few lines at the top of a file. But you can tell it how many lines to display. Example:
head -10 hsscl.gff
(shows the first 10 lines of that file) The tail command displays the last few lines of a file. By default tail will show the last ten lines of a file, but you can tell it how many lines to display.
Example:
tail -10 hsscl.gff
(shows the last ten lines of that file)
Copying files
To copy a file using a terminal, use:
cp original-file new-file
For example:
cp hsscl.gff newfile.gff
This makes a new file (called newfile.gff) that contains the same information as the original file (hsscl.gff ) with the name indicated. Check with ls that you have generated the
newfile.gff
cp can also be used to copy a file from one directory to another. For example:
cp ~/basic_bioinfo_2019/practical_2/reads_01.fastq ./newfile.fastq
This command will take reads_01.fastq in ~/basic_bioinfo_2019/practical_2/ and copy it to
~/basic_bioinfo_2019/practical_1/. It names the copy newfile.fastq. If a name is not specified for the file, a copy with the same name as the original is created.
For example:
cp ./newfile.fastq ~/basic_bioinfo_2012/practical_1/
Instead of using the whole path, you can also use a relative path in copying, for example, we can copy “newfile.fastq” from the current directory to its parent directory using, either:
cp ~/basic_bioinfo_2012/practical_1/newfile.fastq ./newfile2.fastq
or:
cp ./newfile.fastq ../ or cp newfile2.fastq ../
Again, check with ls to see that the file has been generated.
To copy an entire directory, use the -r option:
cp -r dir-to-copy/ new-dir/
For example:
cp -r ~/basic_bioinfo_2019/practical_1/directory1 directory2
copies the directory ~/basic_bioinfo_2019/practical_1/directory1 to directory2 in your current working directory.
Note: on some computers cp -r does not copy dot-files (files that start with ‘.’, for example .bashrc). Also, you may run into trouble copying to or from directories on which you do not have appropriate permissions.
Concatenating files
The cat (short for concatenate) command is a frequently used command. It is used to print the entire content of a file to your screen (use ctrl-c to break if you run cat on a too large file). It can also be used to concatenate text files together.
cat file1 file2 >file3
The contents of file1 and file2 are now in file3.
Check your directory and pick two gff files to concatenate into a file called concatenate.gff.
Check the results using less and see that the content of both files are found in the newly generated concatenate.gff.
Moving files
On a UNIX file system, moving a file is the same as renaming it. The command mv works like:
mv file1 file2
For example:
mv newfile.fastq new.fastq
takes newfile.fastq and renames it new.fastq. Files can be moved from one location to another:
mv location1/file1 location2/file1
For example:
mv new.fastq ../new2.fastq
This command moves the new.fastq file one directory up and calls the file new2.fastq. To move a file and also rename it, specify the name after the second location. For example:
mv ../new.fastq ./movedfile.fastq
Deleting files
rm is the command to remove a file. There are flags (options) that can be used with rm. To delete a file, type:
rm filename
For example, to delete the file called movedfile.fastq, type:
rm movedfile.fastq
To delete a file from a different directory, supply rm with the path. For example:
rm ~/basic_bioinfo_2019/practical_1/movedfile.fastq
Note that you can of course only remove the file once!
Deleting directories
To delete a directory, type:
rm -r directory1
This will remove the directory and all the files and directories in it, so use this command carefully.
Count lines in a file
To count the number of lines in a file, type:
wc -l
The -l option restricts the output to only giving the number of lines.
Locate file in the filesystem
Used to find the location of files and directories
Note that you would need locate database to be generated for the locate command to work. If that is the case follow the instructions given by the locate program or ignore this part of the tutorial.
Type locate followed by the name of the file you are looking for:
locate
Example:
locate reads_01
Search files for a pattern
Grep is an acronym for General Regular Expression Print.
In other words, the grep command is excellent for searching for a particular pattern in a file and outputting the results to the screen. This pattern can be literal (exact characters), such as
chr_1, but could also be a regular expression (often called regexp). Regular expressions are a powerful way to find and replace strings of a particular format. For example, regular expressions can be used to search for a pattern such as “chr_” followed by any number (e.g.
chr_1 as well as chr_999 etc should match). Note: A regexp provides more flexibility in a search, but can be difficult to get right in the beginning so make sure to check that your results are what you expect them to be. It is highly recommended to put the pattern you are
searching for within single quotes:
grep ‘pattern’ filename
For example to retrieve all lines matching fastq headers:
grep ‘@’ fastqfile
Count the total number of lines in the file using: wc -l reads_01.fastq
The file is a fastq file from Illumina. To retrieve a list of all the lines matching the fastq
headers:
grep ‘@’ reads_01.fastq
Adding the option –color highlights the match on the screen.Try the grep again:
grep –color ‘@’ reads_01.fastq
Match a number in the fastq file:
grep -E ‘[0-9]’ reads_01.fastq
The -E option extends the regular expression capabilities of grep and [0-9] matches any digit.
Add the color option (–color as described above) to see what you matched. The color option must come right in front of the pattern you are searching for (in this case [0-9]).
To count how many matches grep got, you can use the -c option:
grep -E -c ‘[0-9]’ reads_01.fastq
To search specifically for headers having 4 digits in their id, you can use the “#” in your
regexp:
grep -E -c ‘[0-9]{4}#’ reads_01.fastq
grep -E ‘[0-9]{4}#’ reads_01.fastq
Split large files
With the large data sizes typically generated in genomics, it can sometimes be an advantage to split a file into several smaller ones:
split -l 10000 reads_01.fastq
will split the file called filename into 4 separate files with 10000 lines in each. The files will be starting with x (for example xaa). Check with ls that you have these files and cound the lines using
wc -l x*
Make sure that you split files so that the sequence, quality and header information are kept intact in the same file. You can check the split outputs by using tail:
tail x*
Working with tabular files
To generate a file containing a subset of columns from a tabular (tab) file you can use:
cut
The “-f” flag allows you to specify which field (i.e. column) to extract. To get the first column from the file:
cut –f 1 filename
to get the first and the third columns from the file :
cut –f 1,3 filename
Example: to get the query, hit and evalue from a tabular BLAST output (output option – outfmt 6):
cut -f 1,2,9 hsscl.gff > threecolsfile
The “>” character will write the results of the cut command to the file output_file. Check the results using less.
It is also possible to combine columns from different files using paste:
paste hsscl.gff threecolsfile > pastefile
will combine the columns in hsscl.gff and threecolsfile into pastefile with tabular separation.
Sorting columns of a tabular file can be useful for digesting large data outputs. For example
running:
sort -k 3 hsscl.gff > hsscl.sorted
will sort the lines alphabetically in hsscl.gff by the third column and outputs to hsscl.sorted. If you instead would like to sort it numerically, you need to use the -n option. The sort command is often used with the -u option to get the unique set. Alternatively, uniq could be used on a sorted list:
uniq -f 8 hsscl.sorted
The -f option makes uniq compare column 8, which in the gff file is the sequence names.
This way we can count how many unique sequences we have in the gff file.
Combining commands into a pipeline
UNIX lets you combine virtually any commands by “glueing” them together with the pipe symbol “|”. The output of the first command will be used as input of the next command.
Combining grep and wc will give you the number of lines having a particular pattern:
grep pattern filename | wc -l
or generating a unique list from a file:
sort filename | uniq
For example, counting the number of sequences in a fastq file:
grep ‘MAP17’ hsscl.gff | wc -l
(Note that you can also count by using the -c command in grep (i.e grep -c), but here we are illustrating how to combine commands).
Basic shell scripting
Another way to combine commands is to use a shell script. This way you can save the commands and settings for reuse on future datasets. One of the most common shell scripting languages is bash. Open emacs (or other suitable text editor). The first line of a bash script must contain the shebang line, which references the bash interpreter in the operating system:
#!/bin/bash
Modifying text files using the command line
You can use awk (or gawk) to search through a file and make changes.
awk ‘condition {action}’ inputfile where condition is a pattern and action is one or several commands to be executed.
For example, print the protein ids matching the pattern “exon” from the hsscl.gff file located in
~/wg_jan2012/activities/Unix_Tutorial/:
awk ‘/exon/ {print $1}’ hsscl.gff
How did your output look compared to the complete file? Check with less or head.
The $1 only prints the first column, but you can choose to print all columns by using:
awk ‘/exon/ {print $0}’ hsscl.gff
In awk you can also pick several columns:
awk ‘/exon/ {print $1, “\t”, $3, “\t” $10}’ hsscl.gff
Another way to process files is to use a perl one liner. Perl is a programming language
commonly used in bioinformatics. It is possible to run simple perl commands directly on the command line (perl one liners):
perl -ne < inputfile ‘condition{action}’
For example, print all lines matching “exon”:
perl -ne < hsscl.gff ‘if(/exon/){print $_;}’
Connecting to a remote machine
ssh is a program to connect to another computer. For Linux and MacOS X, there are two ways of using ssh from a terminal:
ssh user@computer
or
ssh computer -l user
For example:
ssh [email protected]
or
ssh 10.10.10.16 -l darwin
You will then be asked for your password. If you are using a notebook computer that does not already have an ssh client program, you can get one for free.
If you are running MacOS X or Linux, simply open a terminal. If you are running Windows, try PuTTY.
Copying files between machines
Some of you may be familiar with FTP (File Transfer Protocol), which we do not recommend because it is a relatively insecure method of transferring files between computers.
To transfer files between machines, you should use:
scp
This command works very similarly to cp, except that the files you are copying reside on different machines. To copy a file from a remote computer to the computer you are working at, type:
scp user@remote-computer:/remote/path/remote-file /local/path/file
You will then be prompted for the user’s password on the remote computer. After you enterit, the file will be copied.
For example:
scp [email protected]:/mnt/chromosome/home/plasma01/info /home/bioinfo/facts
This command will copy the file called ‘info’ that lives on Altix4700 server at
/mnt/chromosome/home/plasma01/info to /home/bioinfo
on my current machine and will rename it ‘facts’. For those of you familiar with FTP, this is like ‘get remote-file local-file’.
Note: if you do not specify a local name, scp will name the file the same name as the remote file. To copy a local file onto a remote computer, you type:
scp /local/path/local-file user@remote-computer:/remote/path/remote-file
For those of your familiar with FTP, this is like ‘put local-file remote-file’. Note: As with cp, you need to have read/execute permissions for the two files. Also, you should only specify
one computer name (either the local host or the remote host). As with the cp command, you can specify the -r flag to recursively enter directories.
Note: this does not work with Windows machines.
scp works only between UNIX based operating systems. SSH File Transfer
Protocol or SFTP is a network protocol that provides file transfer and manipulation functionality over any reliable data stream. To copy files from a remote computer to a local computer, first navigate to the local directory where you want to copy or retrieve files, then type:
sftp user@remote-computer
You will then be prompted for the user’s password on the remote computer. After you enter it, you will be logged into the user’s home directory on the remote computer. Use UNIX commands to navigate to the remote directory where you want copy or retrieve files. To copy
a local file onto a remote computer, you type:
put local-file remote-file
To copy a remote file onto the local computer, you type:
get remote-file local-file
To copy multiple remote files onto the local computer, you type:
mget remote-file local-file
Text file conversion
When text files are created in Windows, they are given line endings formatted for DOS.
These files are unable to be utilized by programs run in UNIX, and must be converted. There are many ways to acomplish this; however the simplest is by utilizing the following command, where you replace dosfile.txt with the name of your file, and unixfile.txt with the
new file name:
tr -d ‘\15\32′ < dosfile.txt > unixfile.txt
Similarly, to convert from Mac line endings to UNIX, you type:
tr ‘\r’ ‘\n’ < macfile.txt > unixfile.txt
Change file permissions
chmod – change file access permissions
Change file permissions to make a file executable:
chmod +x filename
Uncompressing files and directories
tar : create tape archives and add or extract files.
Example:
tar –zvxf tarfile.tar.gz
unzip : extract all files of the archive into the current directory and subdrectories:
unzip zippedfile.zip
gunzip : uncompress a gzip’d file:
gunzip gzippedfile.gz
Emacs text editor
Finally, open the text editor called emacs:
emacs
You should follow the tutorial by typing:
ctrl-h and then t
You can also open the bash_profile.txt in
~/basic_bioinfo_2019/practical_1/
Go through the text and test the commands on the command line (remove the # when running the commands). See what happens at the prompt in the terminal window. If you like any of the results, you can add the various commands to your own ~/.bash_profile file on your system, but make sure to make a copy of the original .bash_profile before you start.