An aptly named paper brings genomic data into perspective.
The amount of data contained in just a few molecules of DNA could fill a computer hard drive. According to scientists from the University of Illinois and Cold Spring Harbor Laboratory, genomic data will require computational and storage capabilities beyond anything previously anticipated. By 2025, genomic data needs are likely to outstrip those of astronomy, Twitter, and YouTube (the current Big Data leader, generating about 100 petabytes a year). And, unlike Internet data, genomic data come in numerous (nonstandard) formats, adding a layer of complexity to their storage.
It’s estimated that so far, like YouTube, genomics has produced data on the petabyte scale (a petabyte is equal to 1 million gigabytes). If all of the human sequence data generated to date were to be put in one place, it would amount to about 25 petabytes. But over the last decade, the amount of sequencing data alone (just one type of genomic data) doubled about every seven months. That pace is only going to pick up, likely reaching the exabyte scale within ten years. One exabyte is about a million times more data than can be stored on a home computer.
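The extrapolation above can be checked with back-of-envelope arithmetic. The sketch below (not from the paper itself) assumes the article's figures hold exactly: roughly 25 petabytes accumulated to date and a constant doubling time of seven months.

```python
import math

PETABYTE = 1e15  # bytes
EXABYTE = 1e18   # bytes; 1,000 petabytes

current = 25 * PETABYTE   # ~25 PB of human sequence data to date (per the article)
doubling_months = 7       # observed doubling time of sequencing output

# How many doublings until the total crosses one exabyte?
doublings_needed = math.log2(EXABYTE / current)
months = doublings_needed * doubling_months

print(f"{doublings_needed:.1f} doublings, about {months / 12:.1f} years, to reach one exabyte")
```

At a constant seven-month doubling, the exabyte threshold would arrive in only about three years; "within ten years" is therefore a conservative framing, allowing for the growth rate to slow.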
Their paper, “Big Data: Astronomical or Genomical?” was published in July in PLoS Biology.
NIH IS ONBOARD TO HELP WITH BIG GENOMIC DATA
The National Institutes of Health (NIH) recently awarded $1.3 million to researchers at the University of Illinois and Stanford University to develop new data compression approaches. The grant is one of several new software development efforts within NIH’s Big Data to Knowledge Initiative. NIH is dividing a total of $6.5 million among 15 winning recipient programs in this fiscal year. In addition to data compression, the awards fell into the categories of data provenance, data visualization, and data wrangling.
The University of Illinois/Stanford data compression effort will focus on more efficient ways of representing genomic information stored in a dataset. For example, a long sequence of “A’s” could be represented as “A times 50.” Handily, genomic data lend themselves to compression: because sequences are drawn from a relatively small alphabet (A, G, C, and T), they often contain a great deal of repetition. The researchers said their primary goal is to develop a suite of data compression software that will handle several different types of genomic data, including DNA sequence data. It is of “paramount importance,” according to NIH, to find ways to efficiently, accurately, and quickly compress data and to identify techniques for sharing, accessing, visualizing, and searching variously formatted genomic data.
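The “A times 50” idea in the paragraph above is run-length encoding, the simplest form of the repetition-based compression being described. The sketch below is illustrative only, not the grantees' actual software, and real genomic compressors use far more sophisticated, often reference-based schemes.

```python
def rle_encode(seq):
    """Run-length encode a DNA string: 'AAAAGG' -> [('A', 4), ('G', 2)]."""
    runs = []
    for base in seq:
        if runs and runs[-1][0] == base:
            # Extend the current run of identical bases.
            runs[-1] = (base, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((base, 1))
    return runs

def rle_decode(runs):
    """Reverse the encoding losslessly."""
    return "".join(base * count for base, count in runs)

sequence = "A" * 50 + "GATTACA"
encoded = rle_encode(sequence)
print(encoded[0])                        # ('A', 50) -- i.e., "A times 50"
assert rle_decode(encoded) == sequence   # compression must be lossless
```

Lossless round-tripping matters here: clinical and research use both require that decompression reproduce the original sequence exactly.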
NIH data compression awards also went to researchers at UCSD, Case Western Reserve University, and the University of Arizona. Better compression software will reduce the cost of data storage and analysis. Additionally, by requiring that these tools be open source, these NIH awards open the door to future innovations and improvements based upon the initial developments. Other federal agencies are also targeting Big Data for the biosciences. The National Science Foundation (NSF) has said that its Division of Mathematical Sciences will collaborate with the NIH Big Data initiative to address biomedical data science projects.
And last month, the National Cancer Institute (NCI) said it will fund multiple research projects related to the development and management of informatics technologies for cancer research. The agency largely echoed the University of Illinois/Cold Spring Harbor Laboratory researchers, noting that "Over the last decade major advances in biology, coupled with innovations in information technology, have led to an explosive growth of biological and biomedical information," particularly as they relate to genomics. According to a recent analysis by BCC Research, the market for sequencing-based cancer applications is currently valued at $206.3 million and growing at a CAGR of 34.7%. At this rate, sequencing-based cancer diagnostics will be a $915.7 million market in just five years.
The NIH noted that while recent advances in informatics (e.g., the use of cloud computing to support Big Data analysis) have benefited cancer research, a lack of tools and related resources limit the routine use of informatics in research. In response, the NCI is expanding its Informatics Technology for Cancer Research program with several new funding opportunities. One such opportunity is focused on projects improving the user experience and availability of existing, widely used informatics tools and resources that have demonstrated their impact on cancer research. Included in this category are resources for data compression, storage, organization, and transmission.
PRIVATE SECTOR STRIVES TO CLOSE GAP BETWEEN GENOMIC DATA AND BEDSIDE
BCC Research has analyzed the next-generation sequencing (NGS) clinical informatics industry. As noted earlier, the mammoth amounts of data generated must be not only stored, but put into a format that is relevant to physicians so that actionable decisions can be made. A recent BCC Research report highlights numerous companies that are providing software tools for managing Big Data. These include businesses focusing on informatics, general life science/medical informatics, NGS informatics, and clinical NGS informatics.
In a relatively short period of time, the challenge has shifted from generating genomic data in the first place to storing it in such a way that it is clinically actionable. It’s going to be vital not to miss the forest for the trees. This torrent of genomic bytes must be carefully managed to ensure that health and well-being benefit from its enormous value. With both the generation of data and the breakthroughs it enables coming so fast and furiously, constant progress on the informatics side is key.