Plugging the Gaps in the Human Genome
Monday, April 22, 2019
A combination of new DNA sequencing technologies and supercomputers like the IRP’s Biowulf enables geneticists to sequence and study parts of the human genome that have long remained a mystery.
While the Human Genome Project accomplished a remarkable feat in sequencing all the genes in the human genome, technological limitations still left significant swaths of our genetic blueprints unexplored. Recent advances in DNA sequencing are starting to fill in those gaps, but these new technologies require new computational tools to make sense of the data they generate. That’s where computer scientists like the IRP’s Adam Phillippy, Ph.D., come in.
DNA sequencing requires breaking up a long DNA strand into numerous short segments, each of which is sequenced individually to determine the arrangement of the four different components, called bases, that make up the complete molecule. The individual segments’ sequences are then pieced together by computer algorithms that identify identical, overlapping sequences on the individual pieces.
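The piecing-together step can be illustrated with a toy example. The sketch below is a deliberately simple greedy approach, not the method used by any particular assembler or by Dr. Phillippy's group: it repeatedly merges the two segments whose ends share the longest identical overlap. The function names, the minimum overlap of three bases, and the example reads are all made up for illustration, and the sketch assumes the reads contain no errors.

```python
def overlap_length(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    fragments = list(reads)
    while len(fragments) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i != j:
                    size = overlap_length(left, right, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; leave the pieces separate
        # Join the two fragments, keeping the shared bases only once.
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [f for k, f in enumerate(fragments)
                     if k not in (best_i, best_j)]
        fragments.append(merged)
    return fragments


# Three invented reads that overlap end to end.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(reads))  # ['ATGGCGTACCTGA']
```

Real assembly software has to cope with sequencing errors, reads from both DNA strands, and millions of segments, so it relies on far more sophisticated data structures than this pairwise comparison.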
At first, the DNA segments could be no longer than a couple hundred bases, though by the late 2000s technological developments had increased that limit to 700 bases. Unfortunately, segments this short make it extremely difficult to locate the overlapping areas that permit algorithms to stitch together the pieces into a full DNA sequence. Regions of DNA in which the same sequence of bases is repeated dozens or even hundreds of times have particularly bedeviled geneticists, leading to gaps in those parts of the human genome.
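A small experiment shows why repeats are so troublesome for short segments. In the sketch below, two invented DNA sequences differ only in how many times a three-base unit is repeated. The flanking sequences, the repeat counts, and the read lengths of 6 and 28 bases are assumptions chosen to keep the example tiny. When every read is shorter than the repeated region, the two sequences produce exactly the same set of reads, so no algorithm could tell them apart from the reads alone. Reads long enough to span the whole repeat remove the ambiguity.

```python
def all_reads(sequence, read_length):
    """Return the set of every substring of `read_length` bases."""
    return {
        sequence[start:start + read_length]
        for start in range(len(sequence) - read_length + 1)
    }


# Hypothetical flanking sequences around a repeated three-base unit.
LEFT_FLANK = "ACGTTGTT"
RIGHT_FLANK = "TTGACCAT"

five_copies = LEFT_FLANK + "CAG" * 5 + RIGHT_FLANK
eight_copies = LEFT_FLANK + "CAG" * 8 + RIGHT_FLANK

# Short reads: the two sequences are indistinguishable.
print(all_reads(five_copies, 6) == all_reads(eight_copies, 6))    # True

# Reads longer than the repeated region: the difference is visible.
print(all_reads(five_copies, 28) == all_reads(eight_copies, 28))  # False
```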
Fortunately, the past decade has seen the advent of so-called ‘long-read sequencing’ technologies that can sequence DNA segments one thousand to one million bases in length. These new techniques allow geneticists to ‘read’ portions of human and other organisms’ DNA that they could not before, including highly repetitive regions. With the help of the IRP’s supercomputer, Biowulf, Dr. Phillippy’s team spends its days developing computational methods for sorting through modern sequencing data.
“Whenever new technologies emerge, either the data is of a different type, there’s a higher error rate, there’s a lot more data than people are used to, or the sequencing reads are longer than we’ve been used to in the past,” Dr. Phillippy says. “We try to develop tools that correct the problems caused by new technology.”
Because numerous disorders have been linked to repetitive regions of DNA, such as the neurodegenerative diseases ALS and Huntington’s disease, long-read sequencing has the potential to dramatically expand scientists’ understanding of how such illnesses develop.
Without the sort of cutting-edge computational tools being developed in the lab of IRP investigator Adam Phillippy, significant portions of our genetic code would be extremely difficult to sequence efficiently and accurately.
“And those are the ones we know we’re looking for — there’s a number of diseases that could be linked to the variants within these repetitive sequences that are yet to be found,” Dr. Phillippy says. “Because we can resolve the repeats, now we can look at the variation in these highly repetitive areas and examine what role that variation plays in disease.”
No technology is perfect, though, and long-read sequencing is actually more error-prone than some older techniques. To catch and correct these mistakes, geneticists sequence the same DNA multiple times. The computational tools developed in Dr. Phillippy’s lab examine the results of all the sequencing attempts in order to determine which base truly occupies each location in a DNA molecule. Moreover, even long-read sequencing techniques require DNA strands to be broken up into shorter segments, albeit much longer segments than older technologies could handle. Consequently, Dr. Phillippy’s algorithms must also look for areas of overlap in the individual pieces in order to put them back together in the right order.
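The idea of combining repeated sequencing attempts can be sketched as a simple majority vote. The example below is only an illustration, not the lab's actual software: it assumes the reads have already been lined up so that position 1 in every read corresponds to the same location in the DNA, and that errors only swap one base for another. Real long-read errors often insert or delete bases, which requires an alignment step first. The function name and the example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Pick the most common base at each position across aligned reads.

    Assumes all reads are the same length and already aligned.
    """
    if not aligned_reads:
        raise ValueError("at least one read is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one tuple of bases per position (a "column").
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


# Five invented reads of the same stretch of DNA; three contain one error each.
reads = [
    "ACGTTAGC",
    "ACGTTAGC",
    "ACCTTAGC",  # error at the third base
    "ACGTTAGA",  # error at the last base
    "ACGATAGC",  # error at the fourth base
]
print(majority_consensus(reads))  # ACGTTAGC
```

Because each error shows up in only one of the five reads, the correct base wins the vote at every position. The more often the same DNA is sequenced, the less likely it is that random errors will outvote the true base.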
NIH’s supercomputing resources make developing such computational tools much easier because they allow Dr. Phillippy’s team to focus first on making sure their algorithms produce accurate DNA sequences without regard to how efficiently they do so. With a less powerful computer than the IRP’s Biowulf, testing an inefficient algorithm would take an untenable amount of time. Once an algorithm has been vetted, Dr. Phillippy’s team begins refining it so the code can be run more quickly. Finally, once an algorithm can analyze genetic data in a reasonable amount of time, his group can scale it up, taking advantage of Biowulf’s massive computing power to analyze thousands of genomes at once.
Dr. Phillippy’s group has applied its computational expertise to a number of projects, including a collaboration with IRP senior investigator Christopher Buck, Ph.D., that searched through National Center for Biotechnology Information (NCBI) databases for DNA sequences that resembled tumor-causing viruses called polyomaviruses. The partnership ended up discovering a new member of this virus family. In addition, Sergey Koren, Ph.D., a staff scientist in Dr. Phillippy’s lab, worked with researchers at the University of California, Santa Cruz, to produce a gapless sequence of the human X chromosome, marking the first time a human chromosome has been completely sequenced from end to end. Once this sequence is proven to be accurate, it could aid studies designed to shed light on traits and diseases linked to the X chromosome.
“Biowulf is just a tremendous force multiplier for us,” Dr. Phillippy says. “There are things we can do here that we wouldn’t be able to do if we were somewhere else where we didn’t have access to these analysis resources. One of my most pleasant surprises joining the NIH intramural program is how well-run that resource is and how much of a fantastic resource it is for my group.”