Learn Next-Generation Sequencing Data Analysis
Next-Generation Sequencing Data Analysis
Introduction
Recent advancements in DNA sequencing technologies have revolutionized genomics by making massively parallel sequencing both widely accessible and cost-effective. The cost of sequencing a genome has plummeted, enabling individual researchers to generate gigabases to terabases of data with ease. These cutting-edge next-generation sequencing (NGS) technologies are transforming biological research, facilitating comprehensive and routine analyses of genomes and transcriptomes at an unprecedented scale.
This is an exciting and transformative period for the field, with rapid advancements in sequencing technologies driving innovation in data analysis approaches. The sheer volume of generated data has necessitated the development of sophisticated algorithms and computational tools for efficient and accurate analysis.
This workshop is designed to equip participants with essential NGS data analysis strategies to address a wide range of biological questions. Key topics will include preprocessing NGS data, aligning sequencing reads to reference genomes, de novo assembly of short reads, and whole-transcriptome assembly. Participants will gain hands-on experience and insights into the evolving landscape of NGS analysis methods.
Practical 1: Basic UNIX Commands
The objective of this activity is to promote understanding of basic UNIX commands. It is structured as a command reference so that it remains useful for continued learning after the workshop. Mac OS X and various flavors of Linux are UNIX-based operating systems. Because many of the programs are run from the command line, one needs to know how to use some basic UNIX commands. Among these are how to create files, move files, copy files, delete files, and so on. This page gives an overview of UNIX basics for those who are unfamiliar with them. For an extensive list of other UNIX commands, try one of the numerous resources available on the internet.
- Getting help
- List a directory
- Navigating the directory structure
- Create a directory
- Change a directory
- Locate file in filesystem
- Viewing files
- Copying files
- Concatenate files
- Moving files
- Deleting files
- Deleting directories
- Searching files for a pattern
- Working with tabular files
- Combining commands into a pipeline
- Basic shell scripting
- Modifying text files using the command line
- Connecting to a remote computer
- Copying files between machines
- Text file conversion
- Change file permissions
- Uncompressing files and directories
- Emacs text editor
Getting help
While the workshop is in session, your best sources of help are the teaching assistants and support staff. This page will still be accessible after the course, but contains help on only a few basic UNIX commands. Some material may not apply to your home systems. On most systems, typing the man command shows information about commands. For example, to learn how cp works, type:
man cp
To exit the man page, hit:
q
Note: man pages are often a bit obtuse, but are worth reading nevertheless.
List a directory
To list all directories and files within the current directory, type:
ls
To see a detailed list of all directories and files sorted by modification time (oldest first), type:
ls -lrt
Navigating the directory structure
To work with a file or directory, it is important to tell the system where you are. There are two ways to do that: the absolute and the relative path. The absolute path gives the whole location of the file, for example:
/mnt/chromosome/home/plasma10
The relative path is described with the special names "." and "..":
./ = current directory
../ = parent directory of ./ (one directory up)
/ = root directory (the top of the filesystem; note that this is an absolute location)
When no location is specified, UNIX assumes we mean “./”.
~ is your home directory. In Ubuntu Linux, this will be /home/username (for example /mnt/chromosome/home/plasma01 if your user id is plasma01)
To find out which directory you currently are in, type:
pwd
This will show an output such as: /mnt/chromosome/home/plasmaXX
Creating directories
To create a new directory, type:
mkdir directory1
This will create a directory called directory1
Changing directories
To go “down” one directory (in this example to go from plasmaXX to directory1), type:
cd directory1
To go “up” one directory (in this example, to go from directory1 to plasmaXX), type:
cd ..
Now cd to directory1, create directory2 in it and cd to directory2. To go “straight back” to your home directory from directory2, type:
cd
Go to the UNIX tutorial example directory:
cd ~/NGS-Sept2014/practical_1/
Here you should find the files we will be using in this tutorial; check with the ls command.
Viewing files
There are many ways to view a text file. One way for simple viewing is to type:
less hsscl.gff
where "hsscl.gff" is a file in your current directory. If you cannot open the file with less, check that you are located in the right directory:
pwd (should return ~/NGS-Sept2014/practical_1)
The less command displays as much of the file as can fit onto the screen. To scroll up and down within the document, use the arrow keys. Hitting the space bar will bring a new screenful of information. To search forward in the file for a given pattern, enter:
/pattern
where “pattern” represents the character string you wish to find. To search for the next occurrence of the “pattern”, press:
n
To exit the less program and return to the prompt, press:
q
Test this by opening hsscl.gff using less, and search for the identifier:
MAP17
How many occurrences of this identifier do you find in the file (press “n” to move down to the next match)?
Answer:
Eight
Viewing part of a file
The head command displays the first few lines at the top of a file. By default it shows the first ten lines, but you can tell it how many lines to display. Example:
head -10 hsscl.gff
(shows the first 10 lines of that file) The tail command displays the last few lines of a file. By default tail will show the last ten lines of a file, but you can tell it how many lines to display. Example:
tail -10 hsscl.gff
(shows the last ten lines of that file)
Copying files
To copy a file using a terminal, use:
cp original-file new-file
For example:
cp hsscl.gff newfile.gff
This makes a new file (called newfile.gff) that contains the same information as the original file (hsscl.gff). Check with ls that you have generated newfile.gff.
cp can also be used to copy a file from one directory to another. For example:
cp ~/NGS-Sept2014/practical_2/data_P1.fastq ./newfile.fastq
This command will take data_P1.fastq in ~/NGS-Sept2014/practical_2/ and copy it to
~/NGS-Sept2014/practical_1/. It names the copy newfile.fastq. If a name is not specified for the destination, a copy with the same name as the original is created. For example:
cp ~/NGS-Sept2014/practical_2/data_P1.fastq .
You can use either absolute or relative paths when copying. For example, to make a second copy of newfile.fastq in the current directory using the absolute path of the source file:
cp ~/NGS-Sept2014/practical_1/newfile.fastq ./newfile2.fastq
or to copy a file to the parent directory using relative paths:
cp ./newfile.fastq ../ or cp newfile2.fastq ../
Again, check with ls to see that the file has been generated. To copy an entire directory, use the -r option:
cp -r dir-to-copy/ new-dir/
For example:
cp -r ~/directory1 directory2
copies the directory ~/directory1 (created earlier in this tutorial) to directory2 in your current working directory.
Note: on some computers cp -r does not copy dot-files (files that start with ‘.’, for example .bashrc). Also, you may run into trouble copying to or from directories on which you do not have appropriate permissions.
Concatenating files
The cat (short for concatenate) command is a frequently used command. It prints the entire content of a file to your screen (use ctrl-c to interrupt if you run cat on a file that is too large). It can also be used to concatenate text files together:
cat file1 file2 > file3
The contents of file1 and file2 are now in file3.
Check your directory and pick two gff files to concatenate into a file called concatenate.gff. Check the result using less and see that the contents of both files are found in the newly generated concatenate.gff.
Moving files
On a UNIX file system, moving a file is the same as renaming it. The command mv works like:
mv file1 file2
For example:
mv newfile.fastq new.fastq
takes newfile.fastq and renames it new.fastq. Files can be moved from one location to another:
mv location1/file1 location2/file1
For example:
mv new.fastq ../new2.fastq
This command moves the new.fastq file one directory up and calls the file new2.fastq. To move a file and also rename it, specify the name after the second location. For example:
mv ../new.fastq ./movedfile.fastq
Deleting files
rm is the command to remove a file. There are flags (options) that can be used with rm. To delete a file, type:
rm filename
For example, to delete the file called movedfile.fastq, type:
rm movedfile.fastq
To delete a file from a different directory, supply rm with the path. For example:
rm ~/NGS-Sept2014/practical_1/movedfile.fastq
Note that you can of course only remove the file once!
Deleting directories
To delete a directory, type:
rm -r directory1
This will remove the directory and all the files and directories in it, so use this command carefully.
Count lines in a file
To count the number of lines in a file, type:
wc -l filename
The -l option restricts the output to only giving the number of lines.
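For example, counting the lines in the GFF file used elsewhere in this tutorial:
wc -l hsscl.gff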
Locate file in the filesystem
The locate command is used to find the location of files and directories.
Note that the locate database must have been generated for the locate command to work. If it has not been, follow the instructions given by the locate program or skip this part of the tutorial.
Type locate followed by the name of the file you are looking for:
locate filename
Example:
locate reads_R1
Search files for a pattern
Grep is an acronym for Global Regular Expression Print.
In other words, the grep command is excellent for searching for a particular pattern in a file and outputting the results to the screen. This pattern can be literal (exact characters), such as chr_1, but could also be a regular expression (often called regexp). Regular expressions are a powerful way to find and replace strings of a particular format. For example, regular expressions can be used to search for a pattern such as “chr_” followed by any number (e.g. chr_1 as well as chr_999 etc should match). Note: A regexp provides more flexibility in a search, but can be difficult to get right in the beginning so make sure to check that your results are what you expect them to be. It is highly recommended to put the pattern you are searching for within single quotes:
grep 'pattern' filename
For example to retrieve all lines matching fastq headers:
grep '@' fastqfile
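For instance, the "chr_" pattern described above could be matched with a regular expression like the one below (an illustrative example; the files used in this practical may not contain such identifiers):
grep -E 'chr_[0-9]+' filename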
Count the total number of lines in the file using: wc -l data_P1.fastq
The file is a fastq file from Illumina. To retrieve a list of all the lines matching the fastq headers:
grep '@' data_P1.fastq
Adding the option --color highlights the match on the screen. Try the grep again:
grep --color '@' data_P1.fastq
Match a number in the fastq file:
grep -E '[0-9]' data_P1.fastq
The -E option extends the regular expression capabilities of grep, and [0-9] matches any digit. Add the color option (--color as described above) to see what you matched. The color option should be placed before the pattern you are searching for (in this case [0-9]).
To count how many matches grep got, you can use the -c option:
grep -E -c '[0-9]' data_P1.fastq
To search specifically for headers having 4 digits in their id, you can use the "#" in your regexp:
grep -E -c '[0-9]{4}#' data_P1.fastq
grep -E '[0-9]{4}#' data_P1.fastq
Split large files
With the large data sizes typically generated in genomics, it can sometimes be an advantage to split a file into several smaller ones:
split -l 100 data_P1.fastq
will split data_P1.fastq into separate files of 100 lines each. The names of the output files start with x (for example xaa). Check with ls that you have these files and count the lines using
wc -l x*
Make sure that you split files so that the sequence, quality and header information are kept intact in the same file; each fastq record spans four lines, so the number given to -l should be a multiple of four. You can check the split outputs by using tail:
tail x*
Working with tabular files
To generate a file containing a subset of columns from a tabular (tab) file you can use:
cut
The “-f” flag allows you to specify which field (i.e. column) to extract. To get the first column from the file:
cut -f 1 filename
to get the first and the third columns from the file :
cut -f 1,3 filename
Example: to extract the first, second and ninth columns from the GFF file (the same approach works on any tabular output, e.g. BLAST results produced with -outfmt 6):
cut -f 1,2,9 hsscl.gff > threecolsfile
The ">" character writes the results of the cut command to the file threecolsfile. Check the results using less.
It is also possible to combine columns from different files using paste:
paste hsscl.gff threecolsfile > pastefile
will combine the columns in hsscl.gff and threecolsfile into pastefile with tabular separation.
Sorting columns of a tabular file can be useful for digesting large data outputs. For example running:
sort -k 3 hsscl.gff > hsscl.sorted
will sort the lines alphabetically in hsscl.gff by the third column and outputs to hsscl.sorted. If you instead would like to sort it numerically, you need to use the -n option. The sort command is often used with the -u option to get the unique set. Alternatively, uniq could be used on a sorted list:
uniq -f 8 hsscl.sorted
The -f 8 option makes uniq skip the first eight fields when comparing lines, so lines are compared from the ninth column onwards, which in a GFF file is the attribute field containing the sequence identifiers. This way we can see how many unique identifiers are in the file (add the -c option to uniq to also count the occurrences of each).
Combining commands into a pipeline
UNIX lets you combine virtually any commands by “glueing” them together with the pipe symbol “|”. The output of the first command will be used as input of the next command. Combining grep and wc will give you the number of lines having a particular pattern:
grep pattern filename | wc -l
or generating a unique list from a file:
sort filename | uniq
For example, counting the number of lines matching the identifier MAP17 in the GFF file:
grep 'MAP17' hsscl.gff | wc -l
(Note that you can also count matches using grep's -c option (i.e. grep -c), but here we are illustrating how to combine commands.)
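As a slightly longer example combining three of the commands introduced above, the following pipeline lists how many lines of each feature type (third column) the GFF file contains (a suggested extra exercise, not part of the original instructions):
cut -f 3 hsscl.gff | sort | uniq -c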
Basic shell scripting
Another way to combine commands is to use a shell script. This way you can save the commands and settings for reuse on future datasets. One of the most common shell scripting languages is bash. Open emacs (or other suitable text editor). The first line of a bash script must contain the shebang line, which references the bash interpreter in the operating system:
#!/bin/bash
Below this line you can put any shell commands. For example, it is possible to make a bash script to explore the exp_cov parameter in Velvet (see Practical 4 on how to run velvetg):
velvetg cov_99 -exp_cov 99.0 -ins_length1 354 -ins_length1_sd 43 -ins_length2 4252 -ins_length2_sd 821
velvetg cov_95 -exp_cov 95.0 -ins_length1 354 -ins_length1_sd 43 -ins_length2 4252 -ins_length2_sd 821
velvetg cov_90 -exp_cov 90.0 -ins_length1 354 -ins_length1_sd 43 -ins_length2 4252 -ins_length2_sd 821
Save the four lines (the shebang line plus the three velvetg commands) as a text file in plain text format (not a Word document or .rtf, etc.!) called:
velvetg_explore.bash
Close the text editor (in emacs, ctrl-x ctrl-s to save and ctrl-x ctrl-c to quit). Make the file executable by typing the following on the command line:
chmod u+x velvetg_explore.bash
You are now able to run the bash script, which will execute 3 velvetg runs with different -exp_cov values and output the results into the respective directories (cov_99, cov_95 and cov_90). Execute the file (it could take some time to finish):
./velvetg_explore.bash
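As a variant, the same exploration can be written with a loop. A minimal sketch, assuming the cov_99, cov_95 and cov_90 directories have already been prepared with velveth as in the Velvet practical:
#!/bin/bash
# explore three -exp_cov values; each run writes into its own cov_* directory
for cov in 99 95 90; do
    velvetg cov_${cov} -exp_cov ${cov}.0 -ins_length1 354 -ins_length1_sd 43 -ins_length2 4252 -ins_length2_sd 821
done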
Modifying text files using the command line
You can use awk (or gawk) to search through a file and make changes.
awk 'condition {action}' inputfile
where condition is a pattern and action is one or several commands to be executed. For example, to print the first column of the lines matching the pattern "exon" in the hsscl.gff file located in ~/NGS-Sept2014/practical_1/:
awk '/exon/ {print $1}' hsscl.gff
How did your output look compared to the complete file? Check with less or head. The $1 only prints the first column, but you can choose to print all columns by using:
awk '/exon/ {print $0}' hsscl.gff
In awk you can also pick several columns:
awk '/exon/ {print $1 "\t" $3 "\t" $10}' hsscl.gff
Another way to process files is to use a perl one liner. Perl is a programming language commonly used in bioinformatics. It is possible to run simple perl commands directly on the command line (perl one liners):
perl -ne 'condition {action}' inputfile
For example, print all lines matching “exon”:
perl -ne 'if(/exon/){print $_;}' hsscl.gff
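The heading of this section mentions modifying files; a common way to do that with a Perl one-liner is an in-line substitution. A minimal sketch (the old/new strings and file names are placeholders for illustration):
perl -pe 's/old/new/g' inputfile > outputfile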
Connecting to a remote machine
ssh is a program to connect to another computer. For Linux and MacOS X, there are two ways of using ssh from a terminal:
ssh user@computer
or
ssh computer -l user
For example:
ssh plasma01@X.X.X.X
or
ssh 10.10.10.14 -l plasma01
You will then be asked for your password. If you are using a notebook computer that does not already have an ssh client program, you can get one for free. If you are running MacOS X or Linux, simply open a terminal. If you are running Windows, try PuTTY.
Copying files between machines
Some of you may be familiar with FTP (File Transfer Protocol), which we do not recommend because it is a relatively insecure method of transferring files between computers. To transfer files between machines, you should use:
scp
This command works very similarly to cp, except that the files you are copying reside on different machines. To copy a file from a remote computer to the computer you are working at, type:
scp user@remote-computer:/remote/path/remote-file /local/path/file
You will then be prompted for the user’s password on the remote computer. After you enter it, the file will be copied. For example:
scp plasma10@10.10.10.16:/mnt/chromosome/home/plasma01/info /home/mgis/facts
This command will copy the file called 'info' that lives on the remote server (10.10.10.16) at /mnt/chromosome/home/plasma01/info to /home/mgis on my current machine and will rename it 'facts'. For those of you familiar with FTP, this is like 'get remote-file local-file'. Note: if you do not specify a local name, scp will give the copy the same name as the remote file. To copy a local file onto a remote computer, you type:
scp /local/path/local-file user@remote-computer:/remote/path/remote-file
For those of you familiar with FTP, this is like 'put local-file remote-file'. Note: as with cp, you need appropriate permissions on the files and directories involved. Also, you should specify a computer name for only one side of the copy (either the local host or the remote host). As with the cp command, you can specify the -r flag to recursively copy directories. Note: scp as shown here works between UNIX-based operating systems; on Windows you can use a client such as WinSCP or the pscp program that ships with PuTTY.
SSH File Transfer Protocol (SFTP) is a network protocol that provides file transfer and manipulation functionality over any reliable data stream. To copy files from a remote computer to a local computer, first navigate to the local directory where you want to copy or retrieve files, then type:
sftp user@remote-computer
You will then be prompted for the user’s password on the remote computer. After you enter it, you will be logged into the user’s home directory on the remote computer. Use UNIX commands to navigate to the remote directory where you want copy or retrieve files. To copy a local file onto a remote computer, you type:
put local-file remote-file
To copy a remote file onto the local computer, you type:
get remote-file local-file
To copy multiple remote files onto the local computer, you type mget with a filename pattern (no local name is given):
mget remote-files
Text file conversion
When text files are created in Windows, they are given DOS-style line endings. Some UNIX programs do not handle these files correctly, so they must be converted. There are many ways to accomplish this; the simplest is the following command, where you replace dosfile.txt with the name of your file and unixfile.txt with the new file name:
tr -d '\15\32' < dosfile.txt > unixfile.txt
Similarly, to convert from Mac line endings to UNIX, you type:
tr '\r' '\n' < macfile.txt > unixfile.txt
Change file permissions
chmod – change file access permissions
Change file permissions to make a file executable:
chmod +x filename
Uncompressing files and directories
tar : create tape archives and add or extract files. Example:
tar -zxvf tarfile.tar.gz
unzip : extract all files of the archive into the current directory and subdirectories:
unzip zippedfile.zip
gunzip : uncompress a gzip'd file:
gunzip gzippedfile.gz
Emacs text editor
Finally, open the text editor called emacs:
emacs
You should follow the tutorial by typing:
ctrl-h and then t
Practical 2: NGS Data pre-processing
Pre-processing Steps:
- Uncompress sequence files: gunzip <filename>.fastq.gz (produces <filename>.fastq)
- Count reads in every fastq file (grep -c) > TOTAL READS
- Trim reads based on base quality (DynamicTrim.pl)
- Trim reads based on sequence length, min 50 bp (LengthSort.pl)
- Count reads in every trimmed fastq file (grep -c) > TOTAL HIGH QUALITY READS (QV ≥ 20, length ≥ 50 bp)
- Screen for phiX reads (phiX spike) (bowtie2, phix-174)
- Count reads in every unaligned fastq file (grep -c) > TOTAL CLEAN READS (minus low-quality and short reads, minus phiX genome)
- Select fastq paired sequences and singletons (select_paired_fq.pl)
- Shuffle fastq paired-end sequences (shuffleSequences_fastq.pl)
- Combine any singleton reads into a single file.
Practical:
- Uncompress Illumina NGS sequence data.
gunzip R1.fq.gz
gunzip R2.fq.gz
- Count reads in every fastq file, to get info on TOTAL READS (see also the note on counting reads after this command list).
find R1.fq -exec grep -c -H "@HWI-" {} \; >> TOTAL_READS
find R2.fq -exec grep -c -H "@HWI-" {} \; >> TOTAL_READS
- Trim reads based on base quality, Qphred = 20.
DynamicTrim.pl -h 20 R1.fq
DynamicTrim.pl -h 20 R2.fq
- Trim reads based on sequence length, min 50 bp
LengthSort.pl -l 50 R1.fq.trimmed
LengthSort.pl -l 50 R2.fq.trimmed
- Count reads in every trimmed fastq file, to get info on TOTAL HIGH QUALITY READS (QV ≥ 20, length ≥ 50 bp).
find R1.fq.trimmed.single -exec grep -c -H "@HWI-" {} \; >> TOTAL_HIGH_QUALITY_READS
find R2.fq.trimmed.single -exec grep -c -H "@HWI-" {} \; >> TOTAL_HIGH_QUALITY_READS
- Screen for phiX reads (phiX spike) or other contaminants (i.e. vector sequence)
bowtie2 -x ../bowtie_index/phix-174 -U R1.fq.trimmed.single --un R1.cleaned.fq
bowtie2 -x ../bowtie_index/phix-174 -U R2.fq.trimmed.single --un R2.cleaned.fq
- Count reads for cleaned fastq files, to get info on TOTAL CLEAN READS
find R1.cleaned.fq -exec grep -c -H "@HWI-" {} \; >> TOTAL_CLEAN_READS
find R2.cleaned.fq -exec grep -c -H "@HWI-" {} \; >> TOTAL_CLEAN_READS
- Select fastq paired sequences and singletons
select_paired_fq.pl R1.cleaned.fq R2.cleaned.fq R1.PE.fq R2.PE.fq
- Shuffle fastq paired-end sequences
shuffleSequences_fastq.pl R1.PE.fq R2.PE.fq reads.PE.shuffled.fq
- Combine any singleton reads into a single file.
cat R1.PE.fq.single R2.PE.fq.single > reads.SG.fq
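A note on counting reads (referred to in the steps above): grep -c "@HWI-" works as long as that pattern occurs only in header lines. Because fastq quality strings can also contain "@", a pattern-independent alternative is to divide the total number of lines by four (each fastq record spans four lines). A minimal sketch for one file:
echo $(( $(wc -l < R1.fq) / 4 ))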
Now you are ready with shuffled paired-end reads (reads.PE.shuffled.fq) and singleton reads (reads.SG.fq) for your next analysis steps. View the TOTAL_READS, TOTAL_HIGH_QUALITY_READS and TOTAL_CLEAN_READS file for your reads statistics. Fill in the table below with info that you got from this analysis pipeline.
                          | R1 | R2 |
Total reads               |    |    |
Total high quality reads  |    |    |
Total clean reads         |    |    |
Total paired reads        |    |    |
Total singleton reads     |    |    |
Practical 3: Reference Genome Assembly
This exercise will focus on reference genome mapping using Bowtie 2.
Prepare the exercise files.
Log in to the server (ip = x.x.x.x) using your id and password as provided.
cd ~/NGS-Sept2014/practical_3
ln -s ../practical_2/R1.PE.fq .
ln -s ../practical_2/R2.PE.fq .
ln -s ../practical_2/reads.SG.fq .
- Build the reference index.
bowtie2-build builds a Bowtie index from a set of DNA sequences. bowtie2-build outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence files are no longer used by Bowtie once the index is built.
Usage: bowtie2-build [options] <reference_in> <bt2_outfile_base>
bowtie2-build reference.fasta reference
This will create 6 index files:
- reference.1.bt2
- reference.2.bt2
- reference.3.bt2
- reference.4.bt2
- reference.rev.1.bt2
- reference.rev.2.bt2
Mapping NGS reads to the reference sequence.
bowtie2 takes an index and a set of reads as input and outputs a list of alignments. (Note: the -n, -l and -e options described below belong to the original Bowtie, which by default enforces an alignment policy similar to Maq's default quality-aware policy, -n 2 -l 28 -e 70; Bowtie 2 uses a score-based policy with analogous seed options, -N and -L.)
-n Maximum number of mismatches permitted in the "seed", i.e. the first L base pairs of the read (where L is set with -l). This may be 0, 1, 2 or 3; the default is 2.
-l The "seed length", i.e. the number of bases on the high-quality end of the read to which the -n ceiling applies. The lowest permitted setting is 5 and the default is 28. bowtie is faster for larger values of -l.
-e Maximum permitted total of quality values at all mismatched read positions throughout the entire alignment, not just in the "seed". The default is 70. Like Maq, bowtie rounds quality values to the nearest 10 and saturates at 30.
Usage:
bowtie2 [options] -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
<bt2-idx> Index filename prefix (minus trailing .X.bt2).
<m1> Files with #1 mates, paired with files in <m2>.
<m2> Files with #2 mates, paired with files in <m1>.
<r> Files with unpaired reads.
<sam> File for SAM output (default: stdout)
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be specified many times.
E.g. '-U file1.fq,file2.fq -U file3.fq'
bowtie2 -x reference -S reads.out.sam -p 8 -1 R1.PE.fq -2 R2.PE.fq \
-U reads.SG.fq
The alignment output is in SAM format.
View the alignment using samtools.
A SAM file is a simple text file that follows the SAM specification, but it takes a lot of space. To store the reads more compactly, convert the SAM file into a BAM file. Most programs expect a BAM file to be sorted and indexed for faster access, which you can do with the samtools package.
# SAM -> BAM
samtools view -bS -o reads.out.bam reads.out.sam
# sort BAM file
samtools sort reads.out.bam reads.out.sorted
# Index the sorted BAM file
samtools index reads.out.sorted.bam
# View the BAM file
samtools view reads.out.sorted.bam | less -S
samtools tview reads.out.sorted.bam
# ? for help, q to quit
Have you had a look at the mapping qualities?
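One way to inspect them, reusing the cut/sort/uniq commands from Practical 1 (a suggestion, not part of the original protocol), is to tabulate the mapping quality (MAPQ) values, which are stored in the fifth column of each SAM record:
samtools view reads.out.sorted.bam | cut -f 5 | sort -n | uniq -c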
A BAM file can also be used to print the alignments in the pileup format. This can also be used to identify SNPs.
# Index fasta file
samtools faidx reference.fasta
# create pileup from BAM file
samtools mpileup -f reference.fasta reads.out.sorted.bam > reads.out.sorted.pileup
less -S reads.out.sorted.pileup
What other options can you use with the pileup? (see samtools mpileup)
View the alignment using IGV.
IGV is another visualisation tool. IGV also provides tools for data manipulation and functionality to serve data, which won't be covered in this course. For now, use IGV to visualise the alignment.
igv
# Genomes -> Load Genome from File … (select file reference.fasta), click "Open" … (the genome file will be loaded)
# Load sorted BAM file
# File -> Load from File… ~/NGS-Sept2014/practical_3/reads.out.sorted.bam
In order to see the reads, you have to zoom into a region, e.g. by marking a range with the mouse on the top scale.
Practical 4: de novo Genome Assembly
This exercise will focus on de novo genome assembly using velvet.
Prepare the exercise files.
Log in to the server (ip = 10.10.10.14) using your id and password as provided.
cd ~/NGS-Sept2014/practical_4
ln -s ../practical_2/reads.PE.shuffled.fq .
ln -s ../practical_2/reads.SG.fq .
- Running velveth
Velveth helps you construct the dataset for the following program, velvetg, and indicate to the system what each sequence file represents. If, on the command line, you forget the syntax, you can print out a short help message:
velveth
Velveth takes in a number of sequence files, produces a hashtable, then outputs two files in an output directory (creating it if necessary), Sequences and Roadmaps, which are necessary to velvetg. The syntax is as follows:
velveth output_directory hash_length [[-file_format][-read_type] filename]
The hash length, also known as k-mer length, corresponds to the length, in base pairs, of the words being hashed.
velveth auto 71,77,2 -fastq -shortPaired reads.PE.shuffled.fq -short reads.SG.fq
Running velvetg
Velvetg is the core of Velvet where the de Bruijn graph is built then manipulated. Note that although velvetg saves some files during the process to avoid useless recalculations, the parameters are not saved from one run to the next.
Therefore:
velvetg output_directory/ -cov_cutoff 4
velvetg output_directory/ -min_contig_lgth 100
. . . is different from:
velvetg output_directory/ -cov_cutoff 4 -min_contig_lgth 100
This means you can freely play around with parameters, without re-doing most of the calculations.
On the other hand, within a single velvetg command, the order of parameters is not important. Finally, if you have any doubt at the command line, you can obtain a short help message by typing: velvetg
velvetg auto_71 -exp_cov auto
… and repeat for the rest of the analysis folders (i.e. auto_73 and auto_75).
This command will automatically estimate the expected kmer coverage (note this is different from the actual read coverage), the insert sizes of the individual paired-end runs and the coverage cut-off, a parameter that is used to filter low coverage contigs from your assembly. All these parameters can be manually specified. Please look up the estimated kmer coverage, insert sizes and their standard deviations in the on-screen output of velvet and write them down. When velvetg finishes it will output the number of nodes, n50, and max and total size of the assembly created. If you look in the auto_* directory, you will also see a few files:
contigs.fa LastGraph PreGraph Sequences Graph2 Log Roadmaps stats.txt
These files are explained in detail in the manual, but the most useful files for post-analysis are the contigs.fa, Log, and stats.txt files.
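To quickly compare the assemblies across the different auto_* directories, you can, for example, count the contigs in each contigs.fa (a small suggestion, not part of the original text; the assembly summaries are also printed in the Log and stats.txt files):
grep -c '>' auto_*/contigs.fa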
Re-run velvetg with the estimated expected coverage and insert length information.
velvetg auto_71 -exp_cov 20 -ins_length 450 -ins_length_sd 50
Optimise the Velvet parameters using the VelvetOptimiser.pl script
This script automatically finds the optimal parameter settings for Velvet. Usage:
/bioapps/velvet_1.2.10/contrib/VelvetOptimiser-2.2.4/VelvetOptimiser.pl [options] -f 'velveth input line'
-help This help.
-v|verbose+ Verbose logging, includes all velvet output in the logfile. (default ‘0’).
-s|hashs=i The starting (lower) hash value (default ’19’).
-e|hashe=i The end (higher) hash value (default ‘161’).
-x|step=i The step in the hash value search. Min 2, no odd numbers (default '2').
-f|velvethfiles=s The file section of the velveth command line. (default ‘0’).
-a|amosfile! Turn on velvet’s read tracking and amos file output. (default ‘0’).
-o|velvetgoptions=s Extra velvetg options to pass through. eg. -long_mult_cutoff -max_coverage etc (default ”).
-t|threads=i The maximum number of simultaneous velvet instances to run. (default '80').
-g|genomesize=f The approximate size of the genome to be assembled in megabases. Only used in memory use estimation. If not specified, memory use estimation will not occur. If memory use is estimated, the results are shown and then program exits. (default ‘0’).
-k|optFuncKmer=s The optimisation function used for k-mer choice. (default ‘n50’).
-c|optFuncCov=s The optimisation function used for cov_cutoff optimisation. (default ‘Lbp’).
-p|prefix=s The prefix for the output filenames; the default is the date and time in the format DD-MM-YYYY-HH-MM_. (default 'auto').
-d|dir_final=s The name of the directory to put the final output into. (default ‘.’).
Advanced!: Changing the optimisation function(s)
The VelvetOptimiser assembly optimisation function can be built from the following variables:
LNbp = The total number of Ns in large contigs
Lbp = The total number of base pairs in large contigs
Lcon = The number of large contigs
max = The length of the longest contig
n50 = The n50
ncon = The total number of contigs
tbp = The total number of basepairs in contigs
Examples are:
'Lbp' = Just the total basepairs in contigs longer than 1kb
'n50*Lcon' = The n50 times the number of long contigs.
'n50*Lcon/tbp+log(Lbp)' = The n50 times the number of long contigs divided by the total bases in all contigs plus the log of the number of bases in long contigs.
We will search for the optimal Velvet parameters over a range of hash lengths as below:
/bioapps/velvet_1.2.10/contrib/VelvetOptimiser-2.2.4/VelvetOptimiser.pl \
-s 51 -e 85 -f '-fastq -shortPaired reads.PE.shuffled.fq -short reads.SG.fq' \
-o '-ins_length 500 -ins_length_sd 100' -t 3 -a
Practical 5: Reference Transcriptome Assembly
This exercise will focus on differential gene and transcript expression analysis of RNA- seq experiments with TopHat and Cufflinks.
Overview
Although RNA-seq experiments can serve many purposes, this practical aims to compare the transcriptome profiles of two or more biological conditions, such as wild-type versus mutant or control versus knockdown experiments. For simplicity, we assume that the experiment compares only two biological conditions, although the software is designed to support many more, including time-course experiments.
This exercise begins with raw RNA-seq reads and concludes with publication-ready visualization of the analysis. First, reads for each condition are mapped to the reference genome with TopHat. Many RNA-seq users are also interested in gene or splice variant discovery, and the failure to look for new transcripts can bias expression estimates and reduce accuracy. Thus, we include transcript assembly with Cufflinks as a step in the workflow. After running TopHat, the resulting alignment files are provided to Cufflinks to generate a transcriptome assembly for each condition. These assemblies are then merged together using the Cuffmerge utility, which is included with the Cufflinks package. This merged assembly provides a uniform basis for calculating gene and transcript expression in each condition. The reads and the merged assembly are fed to Cuffdiff, which calculates expression levels and tests the statistical significance of observed changes. Cuffdiff also performs an additional layer of differential analysis. By grouping transcripts into biologically meaningful groups (such as transcripts that share the same transcription start site (TSS)), Cuffdiff identifies genes that are differentially regulated at the transcriptional or post-transcriptional level. These results are reported as a set of text files and can be displayed in the plotting environment of your choice.
This exercise also utilises a plotting tool called CummeRbund (http://compbio.mit.edu/cummeRbund/), which provides functions for creating commonly used expression plots such as volcano, scatter and box plots. CummeRbund also handles the details of parsing Cufflinks output file formats to connect Cufflinks and the R statistical computing environment. CummeRbund transforms Cufflinks output files into R objects suitable for analysis with a wide variety of other packages available within the R environment and can also be accessed through the Bioconductor website (http://www.bioconductor.org/).
Procedure pipeline.
Log in to the server (ip = 10.10.10.14) using your id and password as provided.
cd ~/NGS-Sept2014/practical_5
ln -s ../bowtie_index/ref_genome.fa .
A – Align the RNA-seq reads to the genome
- Run TopHat mapping: map the reads for each sample replicate to the reference genome
tophat2 -p 16 -G ref_genome.gtf -o C1_thout ../bowtie_index/ref_genome C1.R1.PE.fq C1.R2.PE.fq
tophat2 -p 16 -G ref_genome.gtf -o C2_thout ../bowtie_index/ref_genome C2.R1.PE.fq C2.R2.PE.fq
tophat2 -p 16 -G ref_genome.gtf -o C3_thout ../bowtie_index/ref_genome C3.R1.PE.fq C3.R2.PE.fq
tophat2 -p 16 -G ref_genome.gtf -o T1_thout ../bowtie_index/ref_genome T1.R1.PE.fq T1.R2.PE.fq
tophat2 -p 16 -G ref_genome.gtf -o T2_thout ../bowtie_index/ref_genome T2.R1.PE.fq T2.R2.PE.fq
tophat2 -p 16 -G ref_genome.gtf -o T3_thout ../bowtie_index/ref_genome T3.R1.PE.fq T3.R2.PE.fq
- Run Cufflinks: assemble transcripts for each sample replicate
cufflinks -p 16 -G ref_genome.gtf -o C1_clout C1_thout/accepted_hits.bam
cufflinks -p 16 -G ref_genome.gtf -o C2_clout C2_thout/accepted_hits.bam
cufflinks -p 16 -G ref_genome.gtf -o C3_clout C3_thout/accepted_hits.bam
cufflinks -p 16 -G ref_genome.gtf -o T1_clout T1_thout/accepted_hits.bam
cufflinks -p 16 -G ref_genome.gtf -o T2_clout T2_thout/accepted_hits.bam
cufflinks -p 16 -G ref_genome.gtf -o T3_clout T3_thout/accepted_hits.bam
- Create a file called assemblies.txt that lists the assembly file for each sample (you can type it by hand, or see the suggested command after this list). The file should contain the following lines:
./C1_clout/transcripts.gtf
./C2_clout/transcripts.gtf
./C3_clout/transcripts.gtf
./T1_clout/transcripts.gtf
./T2_clout/transcripts.gtf
./T3_clout/transcripts.gtf
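Rather than typing the list by hand, the file can also be generated with a single command (a suggested shortcut, assuming the *_clout directories created in the previous step):
ls ./*_clout/transcripts.gtf > assemblies.txt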
- Run Cuffmerge on all your assemblies to create a single merged transcriptome annotation
cuffmerge -s ref_genome.fa -p 16 assemblies.txt
B – Identify differentially expressed genes and transcripts
- Run Cuffdiff using the merged transcriptome assembly along with the BAM files from TopHat for each sample
cuffdiff -o diff_out -b ref_genome.fa -p 16 -L ctrl,trtmt -u merged_asm/merged.gtf \
./C1_thout/accepted_hits.bam,./C2_thout/accepted_hits.bam,./C3_thout/accepted_hits.bam \
./T1_thout/accepted_hits.bam,./T2_thout/accepted_hits.bam,./T3_thout/accepted_hits.bam
C – Explore differential analysis results with CummeRbund
- Open a new plotting script file in the editor of your choice, or use the R interactive shell
R
- Load the CummeRbund package into the R environment:
library(cummeRbund)
- Create a CummeRbund database from the Cuffdiff output:
cuff <- readCufflinks('diff_out')
- Plot the distribution of expression levels for each sample:
csDensity(genes(cuff))
- Compare the expression of each gene in two conditions with a scatter plot:
csScatter(genes(cuff), 'ctrl', 'trtmt')
- Create a volcano plot to inspect differentially expressed genes:
csVolcano(genes(cuff), 'ctrl', 'trtmt', alpha=0.05, showSignificant=T, xlimits = c(-10, 10))
- Count reads that map to each chromosome:
for f in *thout/accepted_hits.bam; do echo $f; samtools index $f; done
for f in *thout/accepted_hits.bam; do echo $f; samtools idxstats $f; done
The first command creates a searchable index for each map file so that you can quickly extract the alignments for a particular region of the genome or collect statistics on the entire alignment file. The second command reports the number of fragments that map to each chromosome.
D – Record differentially expressed genes and transcripts to files for use in downstream analysis.
- Inspect the number of genes and transcripts that are differentially expressed between two samples. The R code below loads the results of Cuffdiff’s analysis and reports the number of differentially expressed genes:
R
library(cummeRbund)
cuff <- readCufflinks('diff_out')
gene_diff_data <- diff(genes(cuff))
sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
nrow(sig_gene_data)
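To actually record the significant genes to a file, as the heading of this part suggests, the data frame can be written out from R. A minimal sketch (the file name diff_genes.txt is just an example):
write.table(sig_gene_data, 'diff_genes.txt', sep='\t', row.names=F, col.names=T, quote=F)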
Reference:
Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold Pimentel, Steven L Salzberg, John L Rinn and Lior Pachter. 2012. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7: 562–578.
Practical 6: de novo Transcriptome Assembly
This exercise will focus on de novo transcriptome assembly using Oases.
Prepare the exercise files.
Log in to the server (ip = 10.10.10.14) using your id and password as provided.
cd ~/NGS-Sept2014/practical_6
ls
(this folder should contain your RNA-seq reads, i.e. RNASeq.PE.fq)
Running Oases
To run Oases it is recommended to run an array of single-k assemblies, for different values of k
(i.e. the hash length). These assemblies are then merged into a final assembly.
- Single-k assemblies
Each single-k assembly consists of a simple Velvet run, followed by an Oases run. As in Velvet, the hash length k is an odd integer. We will use k values from 51 to 75 for this exercise. Velveth is run normally and velvetg is run with only one option:
velveth dir_k k -shortPaired -fastq RNASeq.PE.fq
velvetg dir_k -read_trkg yes
Oases is then run on that directory:
oases dir_k -ins_length 250
(Note: replace k with each odd value from 51 to 75; a loop that automates this is sketched below.)
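Running the three commands once per value of k is tedious; a minimal bash sketch of the same single-k assemblies as a loop (odd k from 51 to 75, assuming velveth, velvetg and oases are on your PATH):
for k in $(seq 51 2 75); do
    velveth dir_${k} ${k} -shortPaired -fastq RNASeq.PE.fq
    velvetg dir_${k} -read_trkg yes
    oases dir_${k} -ins_length 250
done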
- Assembly merging
After running the previous process for different values of k it is necessary to merge the results of all these assemblies (contained in the transcripts.fa files) into a single, non-redundant assembly. To realize this merge, it is necessary to choose a value of K. Experiments have shown that K = 27 works nicely for most organisms. Higher values of K should be used if problems with misassemblies are observed.
Assume we want to create a merged assembly in a folder called MergedAssembly.
velveth MergedAssembly K -long dir_*/transcripts.fa
velvetg MergedAssembly -read_trkg yes -conserveLong yes
oases MergedAssembly -merge yes
NOTE that the transcripts.fa files need to be given as -long reads. Replace K with the value 27.
- Using the supplied python script
With version 0.2 of Oases comes a new python script that runs the individual single-k assemblies and the new merge module. We highly recommend using the script for the analyses, but please read the Oases manual for the Oases parameters that you need to supply to the script. As the script also runs Velvet, please consult the Velvet manual for more insight.
We have paired-end reads with an insert length of 250 and we want to compute a merged assembly over k from 51 to 75. We want to save the assemblies in folders whose names start with pairedEnd.
python oases_pipeline.py -m 51 -M 75 -o pairedEnd \
-d " -shortPaired -fastq RNASeq.PE.fq " -p " -ins_length 250 "
- Output files
Oases produces two output files. The predicted transcript sequences can be found in transcripts.fa and the file contig-ordering.txt explains the composition of these transcripts, see below.
- new directory/transcripts.fa : a FASTA file containing the transcripts imputed directly from trivial clusters of contigs (loci with fewer than two transcripts and confidence values = 1) and the highly expressed transcripts imputed by dynamic programming (loci with more than 2 transcripts and confidence values < 1).
- new directory/contig-ordering-highly-expressed.txt : a hybrid file which describes the contigs contained in each locus in FASTA format, interlaced with one-line summaries of the transcripts generated by dynamic programming. Each line is a string of atoms defined as:
$contig_id:$cumulative_length-($distance_to_next_contig)->next item
Here the cumulative length is the total length of the transcript assembly from its 5′ end to the 3′ end of that contig. This allows you to locate the contig sequence within the transcript sequence.