Three Billion Base Pairs vs. One Powerful Computer

Tuesday, November 27, 2018

IRP investigator Dr. Daniel Levy relies on the NIH’s supercomputing resources to help him sift through the billions of base pairs that make up the human genome.

The human genome comprises roughly three billion base pairs and around 20,000 protein-coding genes, according to recent estimates. That’s a lot of information crammed into the tiny nucleus of a cell, and it doesn’t even include the many genes that do not produce a protein or the fact that most genes come in multiple flavors that vary in different individuals. Add to that the phenomenon of an identical gene being either more or less active in two different people and you can quickly end up with genomic datasets that would overload nearly any computer. Fortunately for IRP senior investigator Daniel Levy, M.D., the NIH IRP has one of the few computer systems in the world that can handle this mountain of information.

Dr. Levy’s specific interests lie in determining how genes and their activity levels, also called gene expression, influence the risk for cardiovascular conditions like heart disease, high blood pressure, and high cholesterol. Since 1984, he has been working on the Framingham Heart Study, including serving for more than two decades as its director. The 70-year-old Framingham study, run by the NIH’s National Heart, Lung, and Blood Institute (NHLBI), was originally designed to probe the origins of cardiovascular disease but has expanded dramatically in scope since its inception. This has spurred a parallel evolution in Dr. Levy’s own research to encompass not just the genome but also other ‘omic’ data, including analyses of the chemical markers that influence gene expression, known as the epigenome, and the set of RNA molecules that translate DNA’s instructions into proteins, called the transcriptome.

“Increasingly, my lab is focusing on ‘big data’ projects,” Dr. Levy says. “Many observational and epidemiology studies have been collecting various types of ‘omic’ data, and it’s in this area that we rely on the IRP’s computing resources to analyze big data in large numbers of individuals.”

For example, Dr. Levy has used data from the Framingham Heart Study to pursue genome-wide association studies, or GWAS – a type of study that examines the entire human genome in an attempt to establish relationships between particular genetic variants and specific traits or diseases. But his research team took the traditional GWAS a step further by attempting to understand how genetic variation affects the expression of thousands of different genes.¹

“This was really hard computationally because we needed to do millions upon millions of computations,” explains Roby Joehanes, Ph.D., a staff scientist in Dr. Levy’s lab. “At that time, we had about eight million genetic variants and about 285,000 fragments of gene expression. I took just a tiny piece of our data and tried to gauge how long it would take to accomplish this computation and I initially came up with an estimated time of about 3,000 years.”

But by utilizing Biowulf, the IRP’s state-of-the-art supercomputer, Dr. Joehanes and his colleagues were able to analyze a dataset that would have been otherwise unmanageable.

“Those 3,000 years of computations shrunk to about nine days using Biowulf, and we’re enormously grateful for that,” Dr. Joehanes says.
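To put those figures in perspective, the short Python sketch below works through the arithmetic implied by the quotes. It is only a back-of-the-envelope illustration: it assumes every genetic variant is tested against every gene expression fragment and uses a 365.25-day year, neither of which the article states, and the function names are hypothetical.

```python
# Figures quoted in the article; everything else is an illustrative assumption.
N_VARIANTS = 8_000_000      # "about eight million genetic variants"
N_EXPRESSION = 285_000      # "about 285,000 fragments of gene expression"
SERIAL_YEARS = 3_000        # initial single-machine estimate
BIOWULF_DAYS = 9            # reported run time on Biowulf
DAYS_PER_YEAR = 365.25      # assumption: average calendar year


def pairwise_tests(n_variants: int, n_features: int) -> int:
    """Number of tests if every variant is paired with every expression
    fragment (an assumption about the study design, not a stated fact)."""
    return n_variants * n_features


def speedup(serial_years: float, parallel_days: float,
            days_per_year: float = DAYS_PER_YEAR) -> float:
    """Ratio of the estimated serial run time to the reported run time."""
    return serial_years * days_per_year / parallel_days


total_tests = pairwise_tests(N_VARIANTS, N_EXPRESSION)
factor = speedup(SERIAL_YEARS, BIOWULF_DAYS)

print(f"{total_tests:.2e} variant-expression pairs")
print(f"roughly {factor:,.0f} times faster than the initial estimate")
```

Under these assumptions the analysis involves on the order of two trillion variant-expression pairs, and the reported nine-day run is about five orders of magnitude faster than the initial estimate.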

Dr. Levy’s interest in gene expression has also led his team to examine chemical markers on DNA called methyl groups, which influence gene expression. In one recent study, his team sifted through data from more than 13,000 individuals to identify hundreds of genetic sites with levels of methylation that differed based on the amount of alcohol the participants routinely consumed.² Another study examined methylation at more than 400,000 genetic locations in several thousand participants and identified 83 locations where methylation was associated with body-mass index,³ a measure that is used to determine whether a person is overweight or obese and therefore at increased risk of related conditions like heart disease and diabetes.

“By identifying genes associated with these diseases, we may be able to better understand mechanisms of disease and we may be able to better predict who’s likely to develop the disease,” Dr. Levy says. “And, of course, identifying genes associated with disease may allow us eventually to identify novel therapies for affected individuals.”

As computationally complex as Dr. Levy’s past studies were, his future research promises to push the boundaries of the IRP’s high-performance-computing resources even further. His team is currently planning to analyze the entire genomes of 4,200 participants from the Framingham Heart Study to uncover additional genetic influences on cardiovascular conditions. The researchers will examine the relationship between cardiovascular disease and DNA methylation at 450,000 locations in those participants’ genomes, as well as analyze gene expression in 1,500 people to see how it influences cardiovascular disease risk.

“The data requirements, both for storage and for analysis, are going up by orders of magnitude compared to anything we have done before,” Dr. Levy says. “Biowulf allows us to do very demanding and complex analyses that we otherwise might not be able to do at all, and as the types of data we’re analyzing have evolved, the scalability of Biowulf has allowed us to keep up with those increased needs.”


References:

[1] Integrated genome-wide analysis of expression quantitative trait loci aids interpretation of genomic association studies. Joehanes R, Zhang X, Huan T, Yao C, Ying SX, Nguyen QT, Demirkale CY, Feolo ML, Sharopova NR, Sturcke A, Schäffer AA, Heard-Costa N, Chen H, Liu PC, Wang R, Woodhouse KA, Tanriverdi K, Freedman JE, Raghavachari N, Dupuis J, Johnson AD, O'Donnell CJ, Levy D, Munson PJ. Genome Biol. 2017 Jan 25;18(1):16. doi: 10.1186/s13059-016-1142-6.

[2] A DNA methylation biomarker of alcohol consumption. Liu C, Marioni RE, Hedman ÅK, Pfeiffer L, Tsai, Reynolds LM, Just AC, Duan Q, Boer CG, Tanaka T, Elks CE, Aslibekyan S, Brody JA, Kühnel B, Herder C, Almli LM, Zhi D, Wang Y, Huan T, Yao C, Mendelson MM, Joehanes R, Liang L, Love SA, Guan W, Shah S, McRae AF, Kretschmer A, Prokisch H, Strauch K, Peters A, Visscher PM, Wray NR, Guo X, Wiggins KL, Smith AK, Binder EB, Ressler KJ, Irvin MR, Absher DM, Hernandez D, Ferrucci L, Bandinelli S, Lohman K, Ding J, Trevisi L, Gustafsson S, Sandling JH, Stolk L, Uitterlinden AG, Yet I, Castillo-Fernandez JE, Spector TD, Schwartz JD, Vokonas P, Lind L, Li Y, Fornage M, Arnett DK, Wareham NJ, Sotoodehnia N, Ong KK, van Meurs JBJ, Conneely KN, Baccarelli AA, Deary IJ, Bell JT, North KE, Liu Y, Waldenberger M, London SJ, Ingelsson E, Levy D. Mol Psychiatry. 2018 Feb;23(2):422-433. doi: 10.1038/mp.2016.192.

[3] Association of Body Mass Index with DNA Methylation and Gene Expression in Blood Cells and Relations to Cardiometabolic Disease: A Mendelian Randomization Approach. Mendelson MM, Marioni RE, Joehanes R, Liu C, Hedman ÅK, Aslibekyan S, Demerath EW, Guan W, Zhi D, Yao C, Huan T, Willinger C, Chen B, Courchesne P, Multhaup M, Irvin MR, Cohain A, Schadt EE, Grove ML, Bressler J, North K, Sundström J, Gustafsson S, Shah S, McRae AF, Harris SE, Gibson J, Redmond P, Corley J, Murphy, Starr JM, Kleinbrink E, Lipovich L, Visscher PM, Wray NR, Krauss RM, Fallin D, Feinberg A, Absher DM, Fornage M, Pankow JS, Lind L, Fox C, Ingelsson E, Arnett D, Boerwinkle E, Liang L, Levy D, Deary IJ. PLoS Med. 2017 Jan 17;14(1):e1002215. doi: 10.1371/journal.pmed.1002215.

