Introduction to Next Generation Sequencing Technologies
March 14, 2024
Definition and brief history of NGS
Next-generation sequencing (NGS), also known as high-throughput sequencing, refers to a set of modern sequencing technologies that enable rapid and cost-effective sequencing of large amounts of DNA or RNA. NGS has revolutionized genomics research and has had a profound impact on fields such as personalized medicine, oncology, agriculture, and evolutionary biology.
Brief History:
- First Generation Sequencing: Sanger sequencing, developed in the 1970s, was the first method used for DNA sequencing. While revolutionary at the time, Sanger sequencing was slow, expensive, and labor-intensive, limiting its use for large-scale sequencing projects.
- Next-Generation Sequencing: The term “next-generation sequencing” was coined in the early 2000s to describe a new wave of sequencing technologies that promised to overcome the limitations of Sanger sequencing. These technologies, including those developed by companies like Illumina, Roche, and Thermo Fisher, introduced massively parallel sequencing, allowing for the simultaneous sequencing of millions of DNA fragments.
- Advancements and Adoption: Since the early 2000s, NGS technologies have undergone rapid advancements, leading to increases in sequencing speed, read lengths, and cost-effectiveness. These advancements have led to the widespread adoption of NGS in research, clinical diagnostics, and various other fields.
- Third Generation Sequencing: Recent years have seen the emergence of third-generation sequencing technologies, such as nanopore sequencing (Oxford Nanopore) and single-molecule real-time (SMRT) sequencing (PacBio). These technologies offer long read lengths and real-time sequencing capabilities, further advancing the field of genomics.
NGS has enabled a wide range of applications, including whole-genome sequencing, exome sequencing, transcriptome profiling, epigenetic analysis, and metagenomics. It continues to drive groundbreaking discoveries in biology and medicine, paving the way for personalized and precision approaches to healthcare.
A Brief History of DNA Sequencing
In 1962, James Watson, Francis Crick, and Maurice Wilkins jointly received the Nobel Prize in Physiology or Medicine for their discoveries of the structure of deoxyribonucleic acid (DNA) and its significance for information transfer in living material. The secret of DNA in orchestrating living activities lies in the arrangement of the four bases (i.e. adenine, thymine, guanine, and cytosine). The linear sequence of the four bases can be considered the language of life, with each word specified by a codon made up of three bases. It was an interesting puzzle to figure out how codons specify amino acids. In 1968, Robert W. Holley, Har Gobind Khorana, and Marshall W. Nirenberg were awarded the Nobel Prize in Physiology or Medicine for solving the genetic code puzzle. It is now known that the collection of codons directs what, where, when, and how much protein should be made. Since the discovery of the structure of DNA and the genetic code, deciphering the meaning of DNA sequences has been an ongoing quest by many scientists to understand the intricacies of life.
The ability to read a DNA sequence is a prerequisite to deciphering its meaning. Not surprisingly then, there has been intense competition to develop better tools to sequence DNA. In the 1970s, the first revolution in DNA sequencing technology began, and there were two major competitors in this area. One was the commonly known Sanger sequencing method2,3 and another was the Maxam–Gilbert sequencing method.4 Over time, the popularity of the Sanger sequencing method and its modifications grew so much that it overshadowed other methods until perhaps 2005, when Next Generation Sequencing (NGS) began to take off.
In 1977, Sanger and colleagues successfully used their sequencing method to sequence the first DNA-based genome, the bacteriophage ΦX174, which is approximately 5375 bp.5 This discovery heralded the start of the genomics era. Initially, the Sanger sequencing method of 1975 used a two-phase DNA synthesis reaction. In the first phase, a DNA polymerase was used to partially extend a primer bound onto a single-stranded DNA template to generate DNA fragments of random lengths. In phase two, the partially extended templates from the earlier reaction were split into four parallel DNA synthesis reactions, where each reaction had only three of the four deoxyribonucleotide triphosphates (dNTPs; dATP, dCTP, dGTP, and dTTP). Due to the missing deoxyribonucleotide triphosphate (e.g. dATP), the DNA synthesis reaction would stop with its 3′ end just one position before where the missing base was supposed to be incorporated. All of these synthesized DNA fragments could then be separated by size using electrophoresis on an acrylamide gel. The DNA sequence could be read off an autoradiograph, since DNA synthesis took place with the incorporation of radiolabeled nucleotides (e.g. ³⁵S-dATP).
There were many problems with the initial version of the Sanger sequencing method that required further innovations before its widespread use, a scenario akin to what is happening in recent NGS technological developments. Problems of the early Sanger sequencing method included the cumbersome two-phase procedure, the short length of DNA sequence that could be determined, the requirement of a primer (meaning that some sequence of the template had to be known), the use of hazardous radiolabeled nucleotides, and the lack of an automated way to read off a DNA sequence. Sanger and colleagues rapidly improved on the method described in 1975 by eliminating the two-phase procedure with the use of dideoxynucleotides as chain terminators. Briefly, the improved method started with four reaction mixtures that already had the single-stranded DNA template hybridized to a primer. In each reaction, DNA synthesis proceeded with four deoxyribonucleotide triphosphates (one with a radiolabeled nucleotide) and one dideoxynucleotide (ddNTP). Whenever a dideoxynucleotide was incorporated, the reaction terminated, thereby producing a mixture of truncated fragments of varying lengths. These DNA fragments were then separated by electrophoresis and read off from an autoradiograph. By adjusting the concentration of ddNTPs, chain termination can be manipulated to produce a longer sequence read.
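The chain-termination idea can be made concrete with a small simulation. The sketch below is purely illustrative (the template, termination probability, and number of template copies are arbitrary assumptions, not values from the original protocol): many copies of a template are extended, extension stops probabilistically wherever the matching ddNTP is incorporated, and the fragment lengths from the four reactions are then read off in order, much as bands would be read from a gel.

```python
import random

def sanger_reaction(template, ddntp_base, termination_prob=0.2):
    """Simulate one dideoxy chain-termination reaction.

    Extension proceeds along the template; whenever the complementary base
    equals `ddntp_base`, the chain terminates with some probability, producing
    truncated fragments whose lengths mark where the ddNTP went in.
    """
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    fragment_lengths = set()
    for _ in range(10_000):  # many template copies in the reaction tube
        for position, base in enumerate(template, start=1):
            if complement[base] == ddntp_base and random.random() < termination_prob:
                fragment_lengths.add(position)
                break
    return fragment_lengths

template = "TACGGAT"  # hypothetical single-stranded template
lanes = {b: sanger_reaction(template, b) for b in "ACGT"}

# Reading the "gel": for each fragment length, the lane (ddNTP) it appears in
# gives the base at that position of the newly synthesized strand.
read = "".join(
    next(b for b, lengths in lanes.items() if pos in lengths)
    for pos in range(1, len(template) + 1)
)
print(read)  # the strand complementary to the template, i.e. ATGCCTA
```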
To solve the requirement of knowing some template sequence for primer design, cloning was introduced. For example, the M13 sequencing vector is commonly used as a holder for a DNA insert, and known primers that bind to the vector sequence are available to sequence the unknown DNA insert. One major innovation to the Sanger sequencing method was the replacement of radioactive labels with fluorescent dyes.6 Four different dye colors are available for the four dideoxynucleotide chain terminators, so DNA fragments that terminate at all four bases can be generated in a single reaction and analyzed on a single lane of an acrylamide gel. The electrophoresis is coupled to a fluorescent detector connected to a computer, so sequence data can be collected automatically. In 1986, Applied Biosystems commercialized the first automated DNA sequencer (Model 370A) based on the Sanger sequencing method. For an animation of the Sanger sequencing method, the reader should refer to the Wellcome Trust Sanger Institute (http://www.wellcome.ac.uk/Education-resources/Education-and-learning/Resources/Animation/WTDV026689.htm). Due to limitations of the chain terminator chemistry and the resolution of the electrophoresis method,
the Sanger sequencing method is only capable of sequencing a read of about 500–800 bases. Most genes and other interesting DNA sequences are longer than that. Therefore, a method is required to first break up a longer DNA molecule into fragments, sequence the individual fragments, and then piece them together to create a contiguous sequence (i.e. a contig). In one approach, known as shotgun sequencing, the long DNA fragment is randomly sheared and then cloned for sequencing. A computer program is then used to assemble the sequences by finding overlaps. Finding sequence overlaps is challenging when thousands to millions of DNA fragments are generated. The problem requires alignment algorithms, and notable examples of early work in this area include the Needleman–Wunsch algorithm and the Smith–Waterman algorithm.
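For readers unfamiliar with these algorithms, the following is a minimal sketch of Needleman–Wunsch global alignment scoring, the dynamic-programming idea underlying such overlap finding; the scoring values and the two example reads are arbitrary illustrative choices, not part of the original algorithms' definitions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via dynamic programming (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning the prefix a[:i] with the prefix b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

print(needleman_wunsch("GATTACA", "GATCACA"))  # two reads that overlap with one mismatch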
Comparison with traditional Sanger sequencing
Basic Principles of NGS:
- Fragmentation: The DNA or RNA sample is fragmented into smaller pieces, typically ranging from a few hundred to a few thousand base pairs in length.
- Library Preparation: Adapters are ligated to the fragmented DNA or RNA, allowing for the attachment of sequencing primers and identification of individual fragments during sequencing.
- Clonal Amplification: In some NGS platforms, such as Illumina sequencing, DNA fragments are amplified to create clusters of identical fragments, each representing a single original fragment.
- Sequencing: The sequencing process involves the addition of fluorescently labeled nucleotides to the DNA template. As each nucleotide is incorporated into the growing DNA strand, the fluorescence is detected and recorded.
- Data Analysis: The raw sequencing data is processed to remove errors and artifacts, and the sequence reads are aligned to a reference genome or assembled de novo to reconstruct the original sequence (a small FASTQ-handling sketch follows this list).
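As a concrete starting point for the data analysis step above, here is a minimal sketch that parses a FASTQ file (four lines per read, Phred+33 quality encoding) and keeps reads whose mean base quality is at least Q20; the file name and the Q20 cut-off are illustrative assumptions, not fixed requirements of any platform.

```python
def read_fastq(path):
    """Yield (read_id, sequence, phred_qualities) from a FASTQ file (Phred+33)."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()                          # the '+' separator line
            quals = handle.readline().rstrip()
            yield header[1:], seq, [ord(c) - 33 for c in quals]

# Keep reads whose mean base quality is at least Q20 (about 99% per-base accuracy).
good_reads = [
    (rid, seq) for rid, seq, q in read_fastq("sample.fastq")   # hypothetical file name
    if sum(q) / len(q) >= 20
]
```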
Comparison with Sanger Sequencing:
- Throughput: NGS allows for the simultaneous sequencing of millions of DNA fragments, whereas Sanger sequencing is limited to sequencing individual DNA fragments.
- Speed: NGS is much faster than Sanger sequencing, with the ability to sequence entire genomes in a matter of days or weeks, compared to months or years with Sanger sequencing.
- Cost: NGS is more cost-effective than Sanger sequencing, particularly for large-scale sequencing projects, due to its high throughput and automation.
- Read Length: While early NGS technologies had shorter read lengths compared to Sanger sequencing, recent advancements in NGS have resulted in longer read lengths, approaching or exceeding those of Sanger sequencing.
- Applications: NGS is well-suited for a wide range of applications, including whole-genome sequencing, transcriptome analysis, and metagenomics, whereas Sanger sequencing is more commonly used for sequencing individual genes or regions of interest.
Next Generation Sequencing Technologies
One of the goals of the Human Genome Project (HGP) was to support advancements in DNA sequencing technology. Although the HGP was completed with the Sanger sequencing method, many groups of researchers were already tinkering with new ideas to increase the throughput and decrease the cost of sequencing prior to the announcement of the first human genome draft in 2001. For example, developments in nanopore sequencing can be traced back to 1996, when researchers experimented with α-hemolysin. After years of experimentation, the second DNA sequencing technology revolution finally took off in 2005 and ended the dominance of Sanger sequencing in the marketplace. The revolution is still ongoing at the time of this writing, as can be seen from the rapid decline in the cost of sequencing since the introduction of NGS technologies (Figure 1).
The sequencing technologies associated with the second revolution are referred to by various names, including second-generation sequencing, NGS, and high-throughput sequencing.
Figure 1. The cost to sequence one million bases of a specified quality (i.e. a minimum Phred score of Q20 for Sanger sequencing and an equivalent of Q20 or higher accuracy for NGS data) according to the National Human Genome Research Institute (NHGRI).12 The cost of sequencing only began its rapid decline from 2008 onwards.
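For context on the Q20 threshold mentioned in the caption, the Phred score Q and the per-base error probability p are related by Q = -10 log10(p), so Q20 corresponds to a 1% error rate. A quick sketch of the conversion:

```python
import math

def phred_to_error_prob(q):
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score implied by a per-base error probability."""
    return -10 * math.log10(p)

print(phred_to_error_prob(20))     # 0.01, i.e. 1 error in 100 bases
print(error_prob_to_phred(0.001))  # 30.0, i.e. Q30
```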
Although high-throughput sequencing is perhaps the most appropriate term, NGS seems to be more commonly used to categorize such technologies and hence this term is used in the book. For the purpose of this book, NGS technology refers to platforms that are able to sequence massive amounts of DNA in parallel with a simultaneous sequence detection method and that overall achieve a much cheaper cost per base than Sanger sequencing. These platforms include 454, ABI Supported Oligo Ligation Detection (SOLiD), Illumina, and Ion Torrent.
There is a third revolution in sequencing technology underway with the commercialization of third-generation sequencing technologies such as those from Pacific Biosciences and Oxford Nanopore Technologies. Third-generation sequencing is defined as the sequencing of single DNA molecules without the need to halt between read steps, whether enzymatic or otherwise. There are three categories of single-molecule sequencing: (i) sequencing-by-synthesis methods in which base detection occurs in real time (e.g. PacBio), (ii) nanopore technologies in which DNA threads through a nanopore and is detected as it passes through (e.g. Oxford Nanopore), and (iii) direct imaging of DNA molecules using advanced microscopy (e.g. Halcyon Molecular, a company that has since shut down).
The DNA sequence data generation process may share similarities among different sequencing platforms, such as the general "wash and scan" approach, but the platforms may differ in terms of cost, runtime, and detection methods. The sequence data from different platforms have different characteristics, such as error patterns, and different tools are used to process the raw data into FASTQ format. Much of the internal workings of NGS sequencers are proprietary, and users generally rely on providers to supply their own tools for base calls as well as error calls. After that, a sequence is assumed to be "correct" and researchers proceed to analyze it. The subsequent sections aim to introduce the background and some details of commercially available platforms, which include 454, ABI SOLiD, Illumina, Ion Torrent, PacBio, and Oxford Nanopore. Besides these six platforms, there are other companies that also innovate in this space, such as SeqLL, GnuBIO, Complete Genomics, and others, but they will not be covered here. For a list of available sequencing companies, readers are encouraged to read a news article by Michael Eisenstein published by Nature Biotechnology in 2012, which detailed 14 NGS companies.
454
A company named 454 Life Sciences Corporation made the first move in the NGS revolution. The company was initially majority owned by CuraGen, and it was from this company that the name "454" originated, which was simply a code name for a project; 454 was later acquired by Roche in 2007. The company made a public announcement in 2003 that it had managed to sequence the entire genome of a virus in a single day. Then in 2005, scientists using 454 technology published an article in Nature on the complete sequencing and de novo assembly of the Mycoplasma genitalium genome with 96% coverage and 99.96% accuracy in one run of the machine. In the same year, the company made a system named Genome Sequencer 20 (GS20) commercially available. This breakthrough in sequencing throughput and speed was an incredible feat compared to the Sanger technology, and it created a lot of excitement.
The principle behind 454 relies on pyrosequencing, a technology licensed from Pyrosequencing AB. This method depends on the generation of inorganic pyrophosphate (PPi) during DNA synthesis when a complementary base is incorporated17 (Figure 2).
Figure 2. 454 pyrosequencing method. (a) In brief, the method starts with a single-stranded library that has adaptors on both ends. (b) The adaptor sequence is used to bind to a bead. This is followed by emulsion PCR to generate millions of copies of a single DNA fragment on each bead. (c) After that, beads are placed into a device known as a PicoTiter Plate for sequencing by detection of base incorporation during DNA synthesis. (d) Whenever a base is incorporated, inorganic pyrophosphate (PPi) is generated. PPi is converted to ATP by sulfurylase, and luciferase uses the ATP to convert luciferin to oxyluciferin and light.
PPi is converted to ATP by sulfurylase, and luciferase uses the ATP to convert luciferin to oxyluciferin and light. The reaction occurs very fast, in the range of milliseconds, and the light produced can be detected by a charge-coupled device (CCD) camera. One of the key innovations of 454 technology is the miniaturization of the pyrosequencing reactions, thereby allowing parallel sequencing reactions to occur in a small space using smaller volumes of reagents. Another innovation is the simultaneous detection of the light signals from many individual reactions.
One of the key drawbacks of the 454 pyrosequencing chemistry is the difficulty in detecting the actual number of bases in a homopolymer tract (e.g. AAAAA). There is no blocking mechanism to prevent the incorporation of multiple identical bases during DNA elongation, and thus light signals are stronger in longer homopolymer tracts. The light signal is actually a light intensity that is converted to a flow value in the 454 system. It is difficult to distinguish how many bases there are once the homopolymer is more than 8 bases long.16 The presence of homopolymers is the reason why 454 sequence reads do not have fixed lengths, unlike the Illumina platform, which includes a blocking mechanism that allows the reading of only a single base at a time. Another shortcoming of the 454 system is the artificial amplification of replicate sequences during the PCR step. It was estimated in a metagenomics study that this type of error is between 11% and 35%.
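A toy illustration of the homopolymer problem described above (this is not the actual 454 signal-processing model; the flow order, noise level, and template are invented for demonstration): each nucleotide flow incorporates a whole homopolymer run at once, the signal grows with the run length, and base calling amounts to rounding a noisy value, which becomes unreliable for long runs.

```python
import random

FLOW_ORDER = "TACG"  # nucleotides are flowed one at a time, in a fixed cycle

def flowgram(template, noise=0.15):
    """Simulate flow values: each flow incorporates a whole homopolymer run at once."""
    values, i = [], 0
    for flow_base in FLOW_ORDER * 4:  # a few cycles, enough for a short template
        run = 0
        while i < len(template) and template[i] == flow_base:
            run += 1
            i += 1
        # signal roughly proportional to run length, plus noise that grows with the run
        values.append(max(0.0, random.gauss(run, noise * max(run, 1))))
    return values

def call_bases(values):
    """Round each flow value to an integer run length -- ambiguous for long runs."""
    read = ""
    for base, value in zip(FLOW_ORDER * 4, values):
        read += base * round(value)
    return read

print(call_bases(flowgram("TTTAACGGGGGGGG")))  # the 8-base G run is the risky part
```

With the noise level used here, the short runs are called reliably while the 8-base G run is occasionally called as 7 or 9 bases, mirroring the homopolymer-length ambiguity described in the text.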
Although a pioneer in NGS, 454 has officially lost the sequencing race. As seen in Figure 3, which compares NGS platforms, the trend for 454 sequencing in articles tracked by Google Scholar has reached a plateau. It used to hold a lot of promise for revolutionizing sequencing, and it was even regarded by some as the technology that had won the sequencing race. Roche announced the closing down of 454 in 2013, and sequencers from 454 started being phased out in the middle of 2016.
ABI SOLiD
The initial success story of 454 sequencers challenged the dominance of Applied Biosystems (AB), which was the main supplier of Sanger-based sequencing machines for the HGP. The ABI PRISM 3700 was a very popular system, and many researchers who needed to perform sequencing prior to 2005 were familiar with it. In 2006, ABI completed the acquisition of Agencourt Personal Genomics, which allowed it to market a novel NGS technology known as SOLiD. Currently, Thermo Fisher Scientific owns the SOLiD sequencing technology after it acquired Life Technologies, a company formed from the merger of Invitrogen and AB. From Figure 3, it seems that SOLiD sequencing is not as popular an NGS platform as the others, even though it has been available since 2006. To our knowledge, SOLiD is the only NGS platform that employs ligation-based chemistry with a unique di-base fluorescent probe system.
Figure 3. Comparison of the popularity of NGS platforms over the years, using keywords as search terms in PubMed. The keywords for the searches are as follows: 454, "454 Sequencing"; Illumina, "Illumina"; PacBio, "Pacbio"; SOLiD, "SOLiD sequencing"; Ion Torrent, "Ion torrent"; Oxford Nanopore, "Oxford nanopore".
Understanding the SOLiD sequencing system is akin to solving a jigsaw puzzle due to the di-base encoding system. The sample preparation steps prior to probe ligation are very similar in concept to the 454 system. Briefly, a genomic DNA library is sheared into smaller fragments and both ends of each fragment are tagged with different adaptors (e.g. Adaptor P1 — Fragment 1 — Adaptor P2).
Figure 4. An overview of the SOLiD sequencing process. (a) Each ligation cycle starts with the 8-mer probe binding to the template, which is then ligated for its detection. Then cleavage occurs to remove three nucleotides and the tagged dye. (b) The structure of the 8-mer probe. (c) An illustration of the sequence determination process during each ligation cycle of the primer rounds. Position 0 is part of the adaptor sequence and the template sequence is only revealed from position 1 onwards.
Then emulsion PCR takes place to create beads enriched with copies of the same DNA fragment on each bead. The beads are then attached to a glass slide through covalent bonds. From here, ligation and detection of bases take place (Figure 4(a)). Firstly, a universal sequencing primer (n) binds to the known adaptor sequence. Then a specific 8-mer probe, with the sequence structure depicted in Figure 4(b), outcompetes other probes for binding immediately after the primer-binding site. Ligation then occurs, and the identity of the bound probe is detected by distinguishing which fluorescent dye is tagged at the probe's 5′ end. Then cleavage occurs at a position between the fifth and sixth nucleotides of the probe. After cleavage is complete, subsequent ligation is possible, as a free phosphate group is now available at the fifth base of the probe. The reason why only one particular 8-mer probe wins the binding site is the specific di-base sequence at its 3′ end that distinguishes the collection of probes. Only four types of fluorescent dyes are used, and each 8-mer probe with a specific di-base sequence is tagged by a dye at the 5′ end. This system is unique in the sense that a di-base sequence is detected in each ligation cycle.
The ligation and cleavage process can be repeated many times to achieve the desired sequence length. However, it only gives sequence information two bases at a time, with a gap of three bases in between. Next, the ligate-cleave-detect process is repeated with a new universal primer (n–1), which binds exactly one base further upstream toward the 5′ end of the adaptor sequence. This ligate-cleave-detect process that cycles a few times with a new primer is also known as a reset. The entire process is repeated another three rounds with universal primers (n–2), (n–3), and (n–4), so altogether five different universal primers are used. Figure 4(c) shows an example of sequence determination after five rounds of reset. Note that each base is called twice in independent primer rounds, which increases the accuracy of base calls. A check for concordance of the two calls for the same base represents an in-built error-checking property of this system and allows it to achieve an overall accuracy greater than 99.94%. Although the SOLiD system is unique in that it can store the sequence of oligo color calls (i.e. color space) to be used for mutation calls, this does introduce challenges to bioinformatics analysis, as most tools are based on base calls rather than the color-space model.
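The di-base logic can be illustrated with the commonly described two-base (color-space) encoding, in which each base is given a 2-bit code and the color of a dinucleotide is the XOR of the codes of its two bases; the example sequence below is arbitrary. Note how decoding requires the known first base from the adaptor, and how a single miscalled color would corrupt every downstream base, which is one reason analysis tools for this platform work directly in color space.

```python
# Two-base (color-space) encoding as commonly described for SOLiD:
# each base maps to a 2-bit value and the color of a dinucleotide is the
# XOR of the two values, so each color is shared by four base pairs.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode_colors(seq):
    """Convert a base sequence into its color calls."""
    return [CODE[x] ^ CODE[y] for x, y in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Recover bases from color calls; the first base comes from the known adaptor."""
    seq = first_base
    for color in colors:
        seq += BASES[CODE[seq[-1]] ^ color]
    return seq

colors = encode_colors("TACGGAT")   # one color call per adjacent base pair
print(colors)
print(decode_colors("T", colors))   # round-trips back to "TACGGAT"
```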
Illumina
In the mid-1990s, Shankar Balasubramanian and David Klenerman, both from the University of Cambridge, conceived the idea of massively parallel sequencing of short reads on a solid phase using reversible terminators. They formed Solexa in 1998 after successfully securing funding from a venture capital firm. The sequencing approach by Solexa is also known as sequencing-by-synthesis. The company launched its first sequencer, the Genome Analyzer, in 2006; the machine was capable of producing 1 Gb of data in a single run. Figure 5 shows an overview of the Illumina sequencing-by-synthesis method.
Figure 5. An overview of the Illumina sequencing process. (a) Genomic DNA is sheared, size selected, and then attached to adaptors at both ends. (b) The DNA library is placed on the flow cell to allow complementary binding at one end of the adaptor to probes coated on the surface. Bridge amplification in the solid phase then occurs to generate clusters of single DNA fragments. After that, reverse strands are cleaved and washed away. A cluster of clonal sequences is required to achieve a high signal-to-noise ratio during base detection. (c) Sequencing begins with a primer binding to the remaining forward strand, and a DNA polymerase is used to incorporate the correct fluorescently labeled nucleotide among the four possible options (i.e. A, C, T, or G). At each cycle, only one nucleotide is incorporated due to the use of reversible terminator chemistry, and detection occurs at this stage. This is followed by a cleavage step, and the next cycle is ready to go.
Illumina acquired Solexa in 2007. Soon after the acquisition, there were at least three high-profile research publications in Nature 2008, volume 456, which highlighted the capabilities of the Genome Analyzer in sequencing human genomes (e.g. an African genome, a Chinese genome, and a cancer patient genome). In the subsequent years, the popularity of this system grew so much that by 2020, the cumulative number of articles that cited Illumina was over twenty thousand (Figure 3). To quote a marketing brochure by Illumina in 2015, "More than 90% of the world's sequencing data is generated using Illumina sequencing-by-synthesis method." The company is also very creative at developing and marketing its products, with sequencing systems (e.g. MiniSeq, MiSeq, MiSeqDx, NextSeq 500, HiSeq 2500, HiSeq 3000, HiSeq 4000, HiSeq X Ten, HiSeq X Five) that suit researchers who operate on different budgets and require different levels of sequencing throughput. The Illumina systems can be used for a wide range of applications, including resequencing, whole-genome sequencing, exome sequencing, metagenomics, epigenetic studies, and sequencing of panels of genes such as those linked to cancer (e.g. TruSight Cancer).
One of the key strengths of the Illumina platform is its ability to produce a high throughput of DNA sequence data at a lower cost, despite only producing short sequences (e.g. paired-end reads of 35 bp in the African genome sequencing). Improvements in bioinformatics methods allow researchers to do far more with short, accurate reads than was once thought possible. Nowadays, the Illumina system can produce paired-end sequences of 300 bp at each end, which further enhances the power of this technology. Besides the advantage of high-throughput, low-cost sequencing, it also performs better than the 454 system with respect to homopolymer sequencing errors because it uses reversible terminator sequencing chemistry. Only a single base is incorporated each time prior to detection in the Illumina system, whereas 454 allows multiple bases to be incorporated in a homopolymer tract. However, the Illumina system also comes with drawbacks. The 3′ end of a read tends to be of lower quality than the 5′ end, which means bases at the 3′ end should be filtered out if they fall below a set quality threshold. There can also be tile-associated errors when the flow cell is affected by bubbles in reagents or other unknown causes. In addition, sequence-specific errors have also been found for inverted repeats and GGC sequences.24 Furthermore, in a study of 16S rRNA amplicon sequencing on the MiSeq, the library preparation method and choice of primers significantly influenced the error patterns.
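A minimal sketch of the 3′-end quality trimming just described (the quality values and the Q20 cut-off are illustrative assumptions; real pipelines typically rely on dedicated trimming tools):

```python
def trim_3prime(seq, quals, min_q=20):
    """Trim low-quality bases from the 3' end of a read.

    `quals` are Phred scores, one per base; trimming stops at the last base
    from the 3' end that meets the threshold.
    """
    keep = len(seq)
    while keep > 0 and quals[keep - 1] < min_q:
        keep -= 1
    return seq[:keep], quals[:keep]

seq   = "ACGTACGTAC"
quals = [38, 37, 36, 35, 30, 28, 22, 15, 10, 8]   # typical drop-off toward the 3' end
print(trim_3prime(seq, quals))  # ('ACGTACG', [38, 37, 36, 35, 30, 28, 22])
```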
Ion Torrent
Besides SOLiD sequencing, Thermo Fisher Scientific has another NGS platform in its portfolio known as Ion Torrent, which was acquired from Life Technologies. Initially, Life Technologies developed the platform and released the Ion Personal Genome Machine (PGM) in 2010. The launch of this machine created much excitement among researchers who wanted affordable sequencers for their laboratories. It was sold at just $49,500 per sequencer and utilized cheap disposable chips costing about $250. In addition, it ran faster than competing machines such as the HiSeq from Illumina. However, in terms of DNA data throughput, it loses out to the Illumina HiSeq.
Like the 454 and SOLiD systems, the Ion Torrent workflow includes library preparation and emulsion PCR on beads. The main difference lies in the detection of nucleotide incorporation, which is not based on fluorescence or chemiluminescence but instead measures the H+ ions released during the process. In other words, detection of nucleotide incorporation is done by a miniature semiconductor pH sensor. Since each of the four DNA bases is supplied sequentially for DNA elongation, a signal is detected if the base matches the template. For a homopolymer region in the template, the signal is amplified, but accurate detection of the actual number of bases is challenging. Only natural nucleotides are needed, and no high-resolution camera or complicated image processing is required; taken together, these are some of the reasons for a faster runtime and lower machine cost. For a video on the Ion Torrent method, the reader should refer to the Thermo Fisher Scientific sequencing education webpage (http://www.thermofisher.com/my/en/home/life-science/sequencing/sequencing-education.html#).
Following the release of the Ion PGM, the Ion Torrent product line has grown to include the Ion Proton, Ion Chef, and Ion S5 systems. There is a diverse range of applications for these systems, such as targeted sequencing, exome sequencing, transcriptome sequencing, and bacterial and viral typing. However, genomic studies that involve de novo assembly of larger genomes (e.g. >1 Gbp) do not seem to be target areas for Ion Torrent. The popularity of Ion Torrent had been steadily rising but seems to have reached a plateau (Figure 3).
Pacific Biosciences
The second-generation sequencing technologies are generally characterized by the "wash and scan" procedure, which is much slower than the natural rate of DNA elongation by DNA polymerase. Furthermore, the length of contiguous DNA that can be sequenced is rather short (e.g. <1 kb). If one could observe DNA polymerization in real time and detect which base was incorporated at each DNA polymerase event, faster sequencing runtimes and longer read lengths could be achieved. However, detecting base incorporation during real-time DNA polymerase activity is challenging because the events happen too fast.
Pacific Biosciences, which was founded in 2004, made two key innovations that enabled real-time observation of DNA synthesis. One of them is the use of phospholinked nucleotides. Each phospholinked nucleotide has a fluorescent dye attached to the phosphate chain rather than to the base. During DNA elongation, the phosphate chain is cleaved and the dye label diffuses away, leaving the DNA template ready to accept the next nucleotide. Another key innovation is the use of the zero-mode waveguide (ZMW) as the platform for detecting base incorporation. These ZMWs are housed inside a SMRT Cell. A ZMW can be thought of as a well with a very tiny hole at the bottom, which enables visible laser light to pass through. However, the light intensity decays exponentially, and thus it can only illuminate the bottom of the well. With a DNA polymerase immobilized at the bottom of the well, its activity can be monitored as it is illuminated. This is akin to having a miniature microscope placed at the bottom to peek at the DNA polymerase activity on top of it. Phospholinked nucleotides diffuse into the well, and when the right one is encountered by the DNA polymerase, it is incorporated into the growing strand. The simple diffusion of phospholinked nucleotides happens in the microsecond range, but when a nucleotide is incorporated into the growing DNA strand, it stays longer at the site of incorporation (i.e. in the millisecond range). It is from this longer stay by a particular phospholinked nucleotide that a signal is detected against a background of other freely moving nucleotides.
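The dwell-time idea can be caricatured in a few lines: real incorporation events linger for milliseconds while diffusing nucleotides flash by in microseconds, so a simple duration threshold separates them. The pulse list and threshold below are invented for illustration and are not PacBio's actual signal-processing pipeline.

```python
# Hypothetical fluorescence pulses observed at one ZMW: (dye_base, duration_seconds).
pulses = [("G", 2e-6), ("A", 8e-3), ("A", 3e-6), ("C", 5e-3), ("T", 1e-6), ("G", 9e-3)]

DWELL_THRESHOLD = 1e-3  # ~milliseconds: an incorporated nucleotide lingers roughly this long

# Keep only the long-dwell pulses, which correspond to real incorporations.
read = "".join(base for base, duration in pulses if duration >= DWELL_THRESHOLD)
print(read)  # "ACG"
```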
Another interesting aspect of the PacBio technology is the observation of the kinetics of DNA polymerase activity. Kinetics data can be collected directly from the system, and this allows for the investigation of favorable DNA polymerase mutations that lower the sequencing error rate. In addition, environmental parameters such as pH, temperature, and the concentration of inhibitors that affect the kinetics of DNA polymerase can also be optimized. For researchers interested in epigenetics, the PacBio system is able to detect epigenetic marks such as base methylation (e.g. N6-methyladenine (m6A) and N4-methylcytosine (m4C)) because such modifications to the DNA template affect the kinetics of the DNA polymerase.
A detailed report on the PacBio technology was first published in Science in 2009. The company released its commercial platform, the PacBio RS, in 2011 and the PacBio RS II in 2013. It is rather impressive that the combination of the PacBio RS II with P6-C4 chemistry can achieve an average read length of 10–15 kb. The combination of an upgrade of the PacBio machine to the higher-throughput Sequel and the newer v3 sequencing chemistry has enabled average polymerase read lengths of 30 kb. The library size for SMRT sequencing ranges from 250 bp to 50 kbp. As the main advantage of the PacBio system is its long read length, researchers have used its sequence data alone, or in combination with other sequence data, to de novo assemble various genomes, including bacteria (e.g. Escherichia coli), yeast (e.g. Saccharomyces cerevisiae), plants (e.g. Arabidopsis thaliana), and animals (e.g. D. melanogaster, Homo sapiens).29 It is now known that PacBio technology is particularly good for closing gaps in de novo assembled genomes, resolving phase among haplotypes, producing full-length RNA transcript isoform sequences, identifying structural variants, and sequencing complex regions with repeats. However, its main disadvantages are its low throughput, high cost per sequenced base, and high error rate (~11–15%). The errors are not biased towards homopolymers but appear random, with indel errors more common than substitution errors. Owing to this random error feature, if there is enough PacBio sequence coverage of a particular template, the consensus sequence can achieve a much higher accuracy than a single sequencing pass. In late 2015, PacBio announced the release of the Sequel System, which has a redesigned SMRT Cell containing 1 million ZMWs. It provides 7× higher sequencing throughput than the PacBio RS II, and this development is exciting in terms of highlighting the scalability of this technology. Since then, PacBio has upgraded the Sequel machine to handle 8 million ZMWs. Additionally, the read accuracy of this platform has substantially improved with the introduction of HiFi reads (https://www.pacb.com/smrt-science/smrt-sequencing/hifi-reads-for-highly-accurate-long-read-sequencing/). For more information on the PacBio system, readers should refer to the company's website: http://www.pacb.com.
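To see why random errors wash out with coverage, consider a crude model in which each read is independently wrong at a given position with probability 0.13 and the consensus is a simple majority vote; the numbers are illustrative only, and real consensus callers model errors far more carefully (wrong calls being split among three alternative bases also helps further).

```python
from math import comb

def consensus_error(per_read_error, coverage):
    """Probability that a simple majority vote at one position is wrong,
    treating each read as independently right or wrong (ties counted as correct)."""
    p = per_read_error
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(coverage // 2 + 1, coverage + 1)
    )

for cov in (1, 5, 15, 30):
    print(f"{cov}x coverage -> consensus error ~ {consensus_error(0.13, cov):.2e}")
```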
Oxford Nanopore Technologies
Besides PacBio, there is another new entrant to the sequencing race that also belongs to the third-generation sequencing category: Oxford Nanopore Technologies. The company was spun off from the University of Oxford in 2005, and its goal is to democratize sequencing by making it affordable and portable (https://nanoporetech.com). The company's sequencers made their debut in 2012 at the Advances in Genome Biology and Technology meeting. The MinION sequencer was introduced during the meeting, but it was only in 2014 that a limited number of participants in the MinION Access Programme (MAP) received their first sequencers for performance testing. Then in 2015, the first nanopore sensing conference, known as London Calling, was held, and researchers gathered to find out more about the MinION technology. In that same year, the MinION was made commercially available. At the time of this writing, the company also has two other systems in development, the PromethION and the GridION. Although new, the technology has occupied a rather interesting niche where the portability of DNA sequencers is required, such as real-time genomic surveillance of the Ebola outbreak32 and DNA sequencing in space to monitor changes to microbes and humans in spaceflight, as well as other astrobiological applications.
The methodology behind the MinION technology was described in a whole-genome shotgun sequencing study of a reference Escherichia coli strain, and the DNA library preparation method was elaborated in that paper. An ideal DNA fragment for sequencing has a DNA hairpin loop ligated to one end to join the two strands together. One of the strands then traverses into a protein nanopore anchored on an electrically resistant polymer membrane. The setup of the nanopore is such that any analyte that passes through it or approaches its opening creates a disruption in the current. Measuring the characteristics of this disruption then leads to the identification of which nucleotides have passed through the pore. After the first strand has moved through, the other strand follows. Similar to PacBio, it is also possible to identify epigenetic modifications to the DNA using this method. The sequencing process is scalable by using more nanopores for the simultaneous detection of DNA fragments moving through them.
The procedures involved sound simple and allow for the sequencing of a single long DNA molecule without amplification or the use of fluorescent dyes that require expensive imaging. This is clearly a case of a disruptive technology in the making, but the technology is still characterized by high sequencing error rates. In a paper that compared sequencing errors, the error rate of the Oxford Nanopore technology was in the range of 20–25%. More time is needed for the technology to mature and for the error rate to improve. Oxford Nanopore sequencing delivers the longest sequencing reads, up to 2.3 Mb, and it usually takes library sizes ranging from 10 kb to 30 kb.
Advantages and Limitations of NGS
Advantages of NGS:
- High Throughput and Speed: NGS allows for the rapid sequencing of large amounts of DNA or RNA, enabling the analysis of entire genomes or transcriptomes in a fraction of the time required by traditional sequencing methods.
- Cost-Effectiveness: NGS has become increasingly cost-effective, with the cost per base of sequencing significantly lower than traditional methods. This has made large-scale sequencing projects more accessible to researchers.
- Versatility: NGS can be applied to a wide range of applications, including whole-genome sequencing, exome sequencing, transcriptome profiling, epigenetic analysis, and metagenomics.
- Single-Cell Sequencing: NGS technologies enable the sequencing of individual cells, allowing for the study of cellular heterogeneity and the identification of rare cell populations.
Limitations of NGS:
- Data Analysis Challenges: NGS generates large amounts of raw sequencing data that require complex bioinformatics analysis to process and interpret. This can be challenging and time-consuming, particularly for researchers without bioinformatics expertise.
- Error Rates: NGS technologies can introduce errors during the sequencing process, which can affect the accuracy of the resulting sequence data. This is particularly true for regions with high GC content or homopolymeric regions.
- Detection of Certain Variants: NGS may have limitations in detecting certain types of genetic variants, such as large structural variants or variants in repetitive regions of the genome. Additional validation and confirmation may be required for these variants.
- Data Storage and Management: NGS generates large amounts of data that require careful management and storage to ensure data integrity and accessibility. This can strain computational resources and require specialized infrastructure.
- Quality Control: NGS workflows require careful quality control measures to ensure the accuracy and reliability of the sequencing data, including library preparation, sequencing, and data analysis steps.
Future Directions of NGS
Emerging technologies in the field of genomics and sequencing, such as nanopore sequencing, are advancing rapidly and offering new possibilities for research and applications. Here are some key emerging technologies:
- Nanopore sequencing: Nanopore sequencing, offered by Oxford Nanopore Technologies, is a technology that allows direct, real-time sequencing of DNA or RNA molecules as they pass through a nanopore. This technology has the potential for long reads, portability, and the ability to sequence native DNA or RNA without the need for amplification or fragmentation.
- Spatial transcriptomics: Spatial transcriptomics technologies allow researchers to analyze gene expression within the context of tissue architecture. These technologies provide spatially resolved gene expression data, enabling the study of cellular interactions and functional relationships within tissues.
- Cryo-electron microscopy (Cryo-EM): Cryo-EM is a powerful imaging technique that allows researchers to visualize biological molecules and complexes at near-atomic resolution. Cryo-EM is revolutionizing structural biology, enabling the study of complex molecular structures and their functions.
- CRISPR-based technologies: CRISPR-Cas systems are being harnessed for applications beyond genome editing, including diagnostics, epigenome editing, and gene regulation. CRISPR-based technologies offer precise and efficient tools for manipulating genetic information.
- Single-cell multi-omics: Single-cell multi-omics technologies enable the simultaneous analysis of multiple types of omics data (such as genomics, transcriptomics, epigenomics, and proteomics) from individual cells. These technologies provide comprehensive insights into cellular heterogeneity and function.
- Synthetic biology: Synthetic biology combines principles from biology, engineering, and computer science to design and construct biological parts, devices, and systems for useful purposes. Synthetic biology has applications in biotechnology, medicine, and environmental conservation.
- Artificial intelligence (AI) and machine learning: AI and machine learning algorithms are increasingly being applied to analyze and interpret large-scale genomic and omics data. These technologies are improving our understanding of complex biological systems and driving discoveries in genomics and personalized medicine.
- Quantum sequencing: Quantum sequencing is an emerging field that explores the use of quantum computing principles for DNA sequencing. Quantum sequencing has the potential to dramatically increase sequencing speed and efficiency compared to classical computing methods.
These emerging technologies are expanding the capabilities of genomics and sequencing, driving new discoveries, and opening up exciting opportunities for research and applications in biology, medicine, and beyond.
Integration with other omics data
Integration of genomic data with other omics data, such as transcriptomics, proteomics, metabolomics, and epigenomics, is crucial for gaining a comprehensive understanding of biological systems. Here’s how integration with other omics data can enhance genomic analysis:
- Transcriptomics: Integrating genomic data with transcriptomic data (gene expression) can help identify genes that are actively transcribed, providing insights into gene regulation, cellular processes, and disease mechanisms.
- Proteomics: Integrating genomic data with proteomic data (protein expression) can reveal how genetic variations impact protein expression, post-translational modifications, and protein-protein interactions, leading to a more complete understanding of cellular functions and disease pathways.
- Metabolomics: Integrating genomic data with metabolomic data (metabolite profiles) can elucidate how genetic variations influence metabolic pathways, providing insights into metabolic disorders, drug metabolism, and cellular responses to environmental stimuli.
- Epigenomics: Integrating genomic data with epigenomic data (epigenetic modifications) can help understand how genetic variations affect epigenetic patterns, gene regulation, and cellular differentiation, contributing to the understanding of development, disease, and aging processes.
- Multi-omics integration: Integrating multiple omics data sets (multi-omics) can provide a holistic view of biological systems, enabling the identification of complex interactions between genes, proteins, metabolites, and epigenetic modifications. Multi-omics integration can lead to the discovery of biomarkers, therapeutic targets, and personalized treatment strategies for complex diseases.
- Systems biology: Integrating genomic data with other omics data is central to systems biology, which aims to understand biological systems as integrated networks of genes, proteins, metabolites, and other molecules. Systems biology approaches can reveal emergent properties of biological systems and help predict how perturbations in one component can affect the entire system.
Overall, integrating genomic data with other omics data is essential for advancing our understanding of biology, disease, and personalized medicine. It enables researchers to uncover complex relationships within biological systems and develop targeted interventions for improved health outcomes.
Personalized medicine and precision oncology
Personalized medicine and precision oncology are approaches that aim to tailor medical treatment to individual characteristics, such as genetic makeup, molecular profiles, and environmental factors. These approaches offer the potential to improve treatment outcomes and reduce adverse effects by selecting therapies that are most likely to be effective for a particular patient. Here’s how personalized medicine and precision oncology are applied in cancer treatment:
- Genomic profiling: Genomic profiling of tumors helps identify genetic alterations that drive cancer growth. This information can be used to select targeted therapies that specifically inhibit these alterations, leading to more effective treatment with fewer side effects.
- Companion diagnostics: Companion diagnostics are tests that help identify patients who are most likely to benefit from a particular therapy. These tests are often used in conjunction with targeted therapies to ensure that the right treatment is given to the right patient.
- Liquid biopsies: Liquid biopsies are non-invasive tests that analyze tumor-derived material, such as circulating tumor DNA (ctDNA), in blood or other bodily fluids. Liquid biopsies can provide real-time information about tumor dynamics and help monitor treatment response and detect resistance mutations.
- Immunotherapy: Immunotherapy uses the body’s immune system to fight cancer. Personalized approaches to immunotherapy involve identifying biomarkers, such as PD-L1 expression or tumor mutational burden, to predict which patients are likely to respond to immunotherapy.
- Combination therapies: Personalized medicine and precision oncology also involve combining different therapies to target multiple pathways involved in cancer growth. By understanding the molecular profile of a tumor, oncologists can tailor combination therapies to target specific vulnerabilities in the cancer cells.
- Clinical trials: Personalized medicine has led to the development of basket trials and umbrella trials, which test therapies based on specific biomarkers rather than tumor types. These trials allow for more targeted and effective treatments for individual patients.
- Data integration: Integrating genomic data with other omics data, such as transcriptomics and proteomics, can provide a more comprehensive view of the molecular landscape of a tumor, leading to more personalized treatment strategies.
Personalized medicine and precision oncology are rapidly evolving fields that hold great promise for improving cancer treatment outcomes. By tailoring treatment approaches to individual patients, these approaches have the potential to revolutionize cancer care and improve patient survival and quality of life.
Informatics Challenges
Advances in sequencing technologies have enabled the scientific community to decode the genomes of more than 65,000 organisms, and the trend towards more sequence data is likely to continue unabated. According to Raymond McCauley of Singularity University, "It turns out that one human genome wasn't worth much, but thousands upon thousands represent an invaluable pool of data to be sifted for patterns and correlated with diseases, treatments, and outcomes." Sifting through massive amounts of sequence data is a challenge, and to begin to address this problem, we need to increase the supply of skilled bioinformaticians. This is in fact one of the main reasons for writing this book. Beginners who need to use second- or third-generation sequencing technologies will likely face informatics challenges in terms of knowing how each sequencer produces its raw sequence output, converting sequence data to FASTQ format, quality checking, alignment to a reference or de novo assembly, and interpretation of results (e.g. the impact of SNPs, indels, etc.). Therefore, the subsequent chapters will focus on developing the skills needed to navigate seas of NGS data in order to help answer biological questions.
Ethical and Legal Considerations
Privacy concerns, data sharing and ownership, and regulatory issues are significant considerations in the context of personalized medicine and precision oncology. Here’s an overview of these key aspects:
- Privacy concerns: Personalized medicine relies heavily on the collection and analysis of personal health data, including genomic data. Privacy concerns arise due to the sensitive nature of this data and the potential for misuse or unauthorized access. Ensuring patient privacy requires robust data protection measures, such as encryption, anonymization, and secure data storage practices.
- Data sharing and ownership: Data sharing is essential for advancing research and improving patient care in personalized medicine. However, issues related to data ownership, intellectual property rights, and data access can complicate data sharing efforts. Establishing clear guidelines and frameworks for data sharing, including data use agreements and data access policies, is crucial for fostering collaboration while protecting data rights.
- Regulatory issues: Regulatory frameworks play a crucial role in overseeing personalized medicine and precision oncology practices. In the United States, the Food and Drug Administration (FDA) regulates the approval and use of genomic tests and targeted therapies. Additionally, compliance with privacy laws, such as the Health Insurance Portability and Accountability Act (HIPAA), is essential for protecting patient data.
- Informed consent: Obtaining informed consent from patients is critical in personalized medicine, as it involves the collection and use of personal health data for research and treatment purposes. Informed consent should clearly communicate the purpose of data collection, how the data will be used, and any potential risks or benefits to the patient.
- Data security: Ensuring the security of patient data is paramount in personalized medicine. Data breaches can lead to privacy violations and undermine patient trust. Implementing robust data security measures, such as encryption, access controls, and regular security audits, can help mitigate these risks.
- Ethical considerations: Ethical considerations, such as transparency, fairness, and equity, are central to personalized medicine and precision oncology. Ensuring that data collection, analysis, and treatment decisions are conducted ethically and with consideration for patient autonomy and well-being is essential.
Addressing these privacy, data sharing, and regulatory challenges is crucial for realizing the full potential of personalized medicine and precision oncology while ensuring patient safety, privacy, and ethical standards are upheld.
Key terms and definitions used in NGS
Next-generation sequencing (NGS) is a rapidly evolving field with many specialized terms and concepts. Here are some key terms and their definitions:
- Sequencing: The process of determining the precise order of nucleotides (A, T, C, G) in a DNA or RNA molecule.
- Library preparation: The process of preparing a DNA or RNA sample for sequencing, including fragmentation, adapter ligation, and amplification.
- Read: A short segment of DNA or RNA sequence obtained from a sequencing experiment.
- Read length: The number of nucleotides in a sequencing read.
- Coverage: The average number of times that each base in the genome is sequenced (a short worked calculation follows this list).
- Depth: The number of reads covering a given position, averaged across the genome; often used interchangeably with coverage and expressed as a fold value such as 30×.
- Alignment: The process of mapping sequencing reads to a reference genome or transcriptome.
- Variant calling: The process of identifying genetic variations, such as single nucleotide polymorphisms (SNPs) or insertions/deletions (indels), from sequencing data.
- De novo assembly: The process of reconstructing a genome or transcriptome from sequencing reads without a reference sequence.
- Reference genome: A complete, annotated sequence of a genome used as a standard for comparison in sequencing experiments.
- RNA-Seq: A sequencing method used to study the transcriptome, the complete set of RNA transcripts produced by the genome.
- ChIP-Seq: A sequencing method used to study protein-DNA interactions, such as transcription factor binding.
- Whole-genome sequencing (WGS): A sequencing method used to determine the complete DNA sequence of an organism’s genome.
- Whole-exome sequencing (WES): A sequencing method used to selectively sequence the protein-coding regions of the genome.
- Metagenomics: A sequencing method used to study the genetic material of microbial communities.
- Phasing: The process of determining which alleles at different loci are on the same chromosome, important for understanding genetic inheritance patterns.
- Long-read sequencing: Sequencing technologies that produce longer reads than traditional short-read sequencing technologies, such as PacBio and Oxford Nanopore sequencing.
- Short-read sequencing: Traditional sequencing technologies that produce short reads, typically less than 500 base pairs in length.
- Mate-pair sequencing: A sequencing method that sequences DNA fragments with a known distance between the paired reads, useful for studying structural variations in the genome.
- Single-cell sequencing: A sequencing method used to sequence the DNA or RNA of individual cells, allowing for the study of cellular heterogeneity within a sample.
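As a worked example of the coverage and depth terms above (the read count and genome size are illustrative; this is the simple total-bases-over-genome-size calculation, ignoring mapping losses and uneven coverage):

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average per-base coverage (depth): total sequenced bases / genome size."""
    return num_reads * read_length / genome_size

# e.g. 600 million 150-bp reads against a 3 Gb human genome -> 30x coverage
print(mean_coverage(600_000_000, 150, 3_000_000_000))
```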
These are just a few of the key terms used in NGS. The field is constantly evolving, and new terms and concepts are continually being introduced as technologies and methodologies advance.