
Comin' Atcha!
The volume and rate of data produced today in the quest for medical breakthroughs, better biofuels, a cleaner environment and a deeper understanding of the world we live in can make researchers feel as if they are trying to sip water from a fire hose.
The reason behind the data flood? The life sciences are undergoing a revolution.
Traditionally, biologists have reduced systems to discrete components, such as a single protein produced under specific conditions.
"This left scientists knowing more and more about less and less," said Karin Rodland, senior research scientist at Pacific Northwest National Laboratory. "And it has not led to the robust predictions we need."
Now, with the advent of powerful instruments and new techniques, biologists are studying whole systems (looking at all of the genes and proteins expressed by microorganisms and human cells) to be able to predict, not just observe, behavior. This system-wide approach is generating unprecedented volumes of data. In a single day, one mass spectrometer can generate a terabyte (one thousand gigabytes) of data. That's roughly equivalent to the text contained in one million books. A single experimental study can produce 60 terabytes of data.
Enter data-intensive computing, the process of building predictive models and analytic tools based on massive volumes of complex and heterogeneous data from a wide variety of sources, explains Deborah Gracio, lead for PNNL's Data-Intensive Computing Initiative. "It is the enabling technology for analyses where theoretical relationships do not exist or are not known for the system being studied," she says.
High-end computer systems are required to store and process growing datasets that are reaching the petascale (one million gigabytes) range. However, the cost of these systems can be prohibitive. PNNL is deploying hardware accelerators to address specific bottlenecks and features of the problem. "Data-intensive computing even at the petascale still requires us to maximize the work done per dollar," says Jarek Nieplocha, a PNNL fellow.
A team of computational scientists at the lab is working with biologists to develop intelligent hardware and software solutions that analyze, store and share data in ways that enable innovative research. The effort aims to reduce both the cost and time involved in scientific discoveries that lead to solutions for challenges in energy, health and national security.
Instrumentation data rates are growing at an alarming pace (consider ion mobility mass spectrometers that produce gigabytes of data per second), but many institutions simply do not have the capacity to keep every piece of experimental data online. One solution has been to build information management systems such as the active space manager in PNNL's Proteomics Research Information System and Management, or PRISM. The space manager monitors disk storage space and moves raw data to an offline archive. If interest in the archived datasets is later renewed, PRISM recalls the data, balancing demand against available online storage.
PRISM's space manager was called into action for a human blood plasma study, where the software identified and stored data on 4,000 proteins, a critical step in cataloging markers for early diagnosis of cancer and other diseases. While not commercially available at this time, PRISM is available to users of the Environmental Molecular Sciences Laboratory, a Department of Energy national user facility located at the lab. (www.emsl.pnl.gov)
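The article does not describe how PRISM's space manager decides what to move offline. As a rough illustration of the general idea only, the sketch below assumes a simple least-recently-used policy: when online disk space runs short, the dataset that has gone longest without being touched is archived, and recalling an archived dataset may push others offline in turn. The class names, the capacity figure and the policy itself are hypothetical and are not taken from PRISM.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float
    last_access: int  # logical clock tick; larger means used more recently


class SpaceManager:
    """Toy online/offline storage balancer (assumed LRU policy, not PRISM's)."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.online = {}   # name -> Dataset currently on disk
        self.archive = {}  # name -> Dataset moved to offline storage
        self.clock = 0

    def used_gb(self):
        return sum(d.size_gb for d in self.online.values())

    def _make_room(self, needed_gb):
        # Archive least recently used datasets until the new one fits.
        while self.online and self.used_gb() + needed_gb > self.capacity_gb:
            oldest = min(self.online.values(), key=lambda d: d.last_access)
            self.archive[oldest.name] = self.online.pop(oldest.name)

    def store(self, name, size_gb):
        if size_gb > self.capacity_gb:
            raise ValueError("dataset is larger than total online capacity")
        self.clock += 1
        self._make_room(size_gb)
        self.online[name] = Dataset(name, size_gb, self.clock)

    def recall(self, name):
        """Return a dataset, bringing it back online if it was archived."""
        self.clock += 1
        if name in self.online:
            self.online[name].last_access = self.clock
            return self.online[name]
        dataset = self.archive.pop(name)  # KeyError if the name is unknown
        self._make_room(dataset.size_gb)
        dataset.last_access = self.clock
        self.online[name] = dataset
        return dataset
```

A production system would weigh more than recency (dataset size, project priority, retrieval cost from the archive), which is presumably part of the balancing act the article mentions.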
Another solution is to manage the data at the source: the instruments. Scientists are "teaching" computers to discern patterns in the data, including patterns the scientists don't even know exist. Using support vector machines, they're training computers to recognize precursors of important data signatures and then save those select signatures for the analyst.
To extract knowledge from the masses of data, researchers must complete time-consuming processes. For example, many software programs are not optimized for supercomputers, and protein identification from mass spectra (comparing the proteins found in an organism with extensive characterization databases) can take months or years.
To accelerate the analyses, computer scientists are modifying software to divide the computational work and assign tasks to many processors in parallel, dramatically reducing the time needed for sequence identification and other basic analysis tasks. Recently, PNNL's parallelized ScalaBLAST software was used to analyze 1.6 million protein sequences against more than 3 million proteins in under 19 hours for the Joint Genome Institute. This work would have taken three years using nonparallelized software.
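The divide-and-assign idea can be sketched in a few lines of Python. This is not ScalaBLAST and does not implement the BLAST algorithm: the similarity score here is a stand-in that simply counts shared three-letter fragments, and every function name is hypothetical. What the sketch does show is the pattern the paragraph describes: the query sequences are split into chunks, each chunk is handed to a separate worker, and the partial results are combined at the end.

```python
from concurrent.futures import ProcessPoolExecutor


def kmers(seq, k=3):
    """Set of all length-k fragments of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_hit(query, database, k=3):
    """Best database match for one query; toy score = number of shared k-mers."""
    query_kmers = kmers(query, k)
    best_name, best_score = None, -1
    for name, seq in database.items():
        score = len(query_kmers & kmers(seq, k))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


def search_chunk(chunk, database):
    """Work done by one processor: search every query in its chunk."""
    return {qname: best_hit(qseq, database) for qname, qseq in chunk}


def split(items, n):
    """Deal items round-robin into n roughly equal chunks."""
    return [items[i::n] for i in range(n)]


def parallel_search(queries, database, workers=4,
                    executor_cls=ProcessPoolExecutor):
    """Divide the queries among workers and merge their partial results."""
    chunks = [c for c in split(list(queries.items()), workers) if c]
    results = {}
    with executor_cls(max_workers=workers) as pool:
        partials = pool.map(search_chunk, chunks, [database] * len(chunks))
        for partial in partials:
            results.update(partial)
    return results
```

Because each query is scored independently, the work divides cleanly and the speedup can approach the number of processors. On a real supercomputer the same pattern would typically be spread across many nodes, with the database itself distributed rather than copied to every worker as it is here.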
A team of scientists used the parallelization technique to develop a high-throughput analytics workflow and identified key proteins involved in the virulence of Salmonella using 800,000 mass spectra. This parallelized, data-intensive workflow provides analysis and interpretation of massive amounts of data in less time than it took to generate the data. The project will be demonstrated in the analytics challenge at Supercomputing 2006, an international conference on high-performance computing, networking and storage.
Yet even with faster analyses researchers must mine mountains of heterogeneous data: simulation data, sequence data and endless types of experimental data, including microarrays, high-throughput imaging, mass spectra, and others. They must extract meaningful features and then fuse those features with supporting information from other sources.
One example of simplifying this knowledge acquisition process is PNNL's Bioinformatics Resource Manager. This software links heterogeneous data sources and automates data mining. Researchers can retrieve experimental data, extract relevant gene and protein identifiers from public files, add information from other sources and link to visualization tools. The software also facilitates merging files using identifier matching and cross-reference information. Developed at the lab, this software will be available by the end of the year.
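Merging files by identifier matching and cross-reference information can be pictured as a lookup join. The sketch below is an assumption-laden illustration, not the Bioinformatics Resource Manager's actual code or data model: it supposes one file keyed by gene identifier, another keyed by protein identifier, and a cross-reference table linking the two. All identifiers and field names are made up.

```python
def merge_by_identifier(expression, annotations, xref):
    """Join two datasets that use different identifier schemes.

    expression:  gene_id -> measured value (e.g., from a microarray)
    annotations: protein_id -> description (e.g., from a public database)
    xref:        gene_id -> protein_id cross-reference table
    Returns (merged_rows, unmatched_gene_ids).
    """
    merged, unmatched = [], []
    for gene_id, value in expression.items():
        protein_id = xref.get(gene_id)
        if protein_id is None or protein_id not in annotations:
            # Keep track of records that could not be linked.
            unmatched.append(gene_id)
            continue
        merged.append({
            "gene_id": gene_id,
            "protein_id": protein_id,
            "expression": value,
            "description": annotations[protein_id],
        })
    return merged, unmatched


# Hypothetical example data
expression = {"geneA": 2.5, "geneB": 0.4, "geneC": 1.1}
annotations = {"protX": "membrane transporter", "protY": "kinase"}
xref = {"geneA": "protX", "geneB": "protY"}

rows, missing = merge_by_identifier(expression, annotations, xref)
```

Reporting the unmatched identifiers rather than silently dropping them matters in practice, since gaps in cross-reference tables are a common reason merged datasets end up smaller than expected.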
BRM can provide the framework to enable data-intensive computing applications that work synergistically within a problem-solving environment. Developing these types of applications is the goal of the Data-Intensive Computing for Complex Biological Systems project, a collaboration between PNNL and Oak Ridge National Laboratory funded by the DOE Office of Advanced Scientific Computing Research. Even with the significant patterns and features extracted, the data are often still very large and complex. To avoid getting lost in the abyss and gain insights into the complex datasets, researchers are looking to visualization technologies.
Established as the global leader in the field of visual analytics (PNNL is headquarters for the National Visualization and Analytics Center), the lab is combining its expertise in visual analytics, computational science and biological science to develop computational environments that will help systems biology answer important questions in environmental remediation, energy production, and human health.
Moe Khaleel, director of PNNL's computational sciences and mathematics division, sums up the impact of the technologies and techniques needed for the big data problem in the life sciences: "Data-intensive computing is not just an evolutionary change, it is a revolutionary change in the way information is gathered and processed." From the way hardware and algorithms are used to the presentation of information to scientists, data-intensive computing is already changing researchers' approach to life science.
Kristin Manke is a writer at Pacific Northwest National Laboratory.
