From the Deputy Director for Intramural Research
Breaking Through the Data Bottleneck
BY ANDY BAXEVANIS, OIR, AND MICHAEL GOTTESMAN, DDIR
One of the hallmarks of our intramural research environment is the rapid development and implementation of cutting-edge technologies that drive discovery and innovation, greatly facilitating our ability to respond to urgent public-health needs as they arise. These new high-throughput, data-intensive technologies have found their way into almost all areas of scientific inquiry within the intramural research program (IRP), including genomics, computational chemistry, molecular modeling and simulations, structural biology, biomedical imaging, proteomics, metabolomics, and systems biology. The pace at which IRP scientists within these fields are generating biological and biomedical data continues to increase at breakneck speed—but that, in turn, poses its own problem: a data bottleneck. Breaking through that bottleneck requires easy and effective access to novel computational approaches that allow us to analyze this flood of data.
Back in 2013, with the ever-growing amount of data being generated across the IRP, it became obvious that we were quickly reaching the point at which we would no longer be able to meet the computational needs of our scientists without significantly expanding Biowulf, our large-scale central biomedical-computing resource. We were at a critical juncture where not having enough horsepower to pursue computationally intensive projects could have had a devastating effect on our research goals as well as a serious impact on our ability to attract (and retain) the best and brightest scientists within the IRP. A collective “call to action” was issued to develop a plan to ensure that we had the kind of world-class computational support that would make us a leader in biomedical computing. The concerted effort to achieve this rather lofty goal involved more than 100 individuals from all corners of the NIH and an investment of $70 million in funding from NIH’s Capital Improvement Fund. Our computational landscape now looks vastly different. In November 2017, Biowulf became the first supercomputer completely dedicated to advancing biomedical research to be listed among the top 100 most-powerful computers in the world.
The architecture of “Biowulf 2.0” is notable in that it is designed for general-purpose scientific computing, providing both the power and the flexibility to meet the wide variety of computational needs of IRP investigators. With 100,000 computer cores, 35 petabytes of storage, a 100-gigabit connection to the modernized NIH network, and over 600 available scientific software applications, Biowulf is being put to good use. Half of all IRP investigators’ research programs are now actively using Biowulf to process and analyze their research data, double the share of just three years ago. This increased usage is reflected in the number of manuscripts published by our scientists, with 5 percent of all peer-reviewed papers from the IRP based on data that were generated or analyzed using Biowulf. Many of these papers are featured on the CIT High-Performance Computing (HPC) Team’s Twitter feed (@nih_hpc), where you can learn more about the studies described in these papers by searching on the very appropriate hashtag #PoweredByBiowulf.
Beyond the impressive capabilities of Biowulf 2.0, the IRP also benefits immensely from the HPC team’s expertise, which greatly facilitates our ability to make the best use of this remarkable resource. The team members spend a significant amount of time providing training and support to IRP scientists, both in formal classroom settings and through their monthly “Coffee Shop Consults” across campus. In response to the increasing demand for training, the HPC team has also released an online, self-paced “Introduction to Biowulf” course that covers a wide variety of basics accompanied by hands-on exercises. Whether a beginner or a seasoned coder, any researcher can benefit from the expertise and diligent assistance of the Biowulf staff. You can learn more about their services by visiting the HPC website at https://hpc.nih.gov.
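To give a flavor of what that training covers: Biowulf is a Slurm-scheduled cluster whose preinstalled applications are made available through environment modules. The sketch below shows what a minimal batch job might look like; the application, resource figures, and file names are illustrative assumptions only, and current, authoritative examples live at https://hpc.nih.gov.

```shell
# Write a minimal Slurm batch script of the general shape Biowulf accepts.
# The module name (blast), resource requests, and input/output file names
# here are hypothetical placeholders, not recommendations.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=blast-demo     # a label for the job in the queue
#SBATCH --cpus-per-task=8         # CPUs requested (illustrative figure)
#SBATCH --mem=16g                 # memory requested (illustrative figure)
#SBATCH --time=04:00:00           # wall-clock limit

# Load one of the cluster's preinstalled applications via the module system,
# then run it using the CPU count Slurm actually granted.
module load blast
blastn -query input.fa -db nt -num_threads "$SLURM_CPUS_PER_TASK" -out results.txt
EOF

# On the cluster itself, the script would be submitted with:
#   sbatch myjob.sh
echo "wrote myjob.sh"
```

The pattern of requesting resources through `#SBATCH` directives and loading software with `module load` is the core workflow the “Introduction to Biowulf” course teaches; everything else is application-specific detail.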
Of course, when it comes to a fast-paced arena such as supercomputing, there is no rest for the weary. As we come to the end of the IRP’s current five-year plan for HPC, we have already started developing a new five-year plan to sustain and capitalize on our investments. The new plan will focus not only on continuing to meet increasing demand but, more importantly, on taking our HPC program in new directions based on the ever-changing computational needs of our researchers. This effort would necessarily include investigating new architectures and technologies that would allow us to move more strongly into areas such as deep learning and artificial intelligence. These kinds of approaches have already shown great promise in several clinically relevant areas such as diagnostic pathology and in the study of molecular dynamics. Any new vision for HPC should also take advantage of the significant advances that have been made in cloud computing, particularly in the context of advancing multicenter collaborative studies. The NIH has already taken a significant first step in this direction with the launch of the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative in July 2018. The STRIDES Initiative is already helping to advance NIH’s goals for increased data sharing and enhanced collaboration through the use of cloud-based tools and services.
As biomedical research becomes even more computationally intensive, the creativity of NIH computational scientists and thoughtful investments by NIH leadership will be essential to ensure that the IRP remains a cutting-edge research institution. As we look forward, we welcome your input as we identify new challenges and set priorities for the future of biomedical computing in the IRP. More importantly, we encourage all of you to take advantage of these shared computational resources, incorporating them into your research as an “enabling technology” that can jump-start major initiatives and truly push the boundaries of biomedical science.
Andy Baxevanis is the Director of Computational Biology in the Office of Intramural Research. He is also a senior scientist in the Computational and Statistical Genomics Branch of the National Human Genome Research Institute. For more on Biowulf, go to https://hpc.nih.gov; or read the article in the May–June 2018 issue of the NIH Catalyst and the IRP Blog post of June 13, 2018.