Where Bioscience, Math and Computing Meet

The National Center for Genome Resources (NCGR) in Santa Fe is a unique provider of professionally engineered software and computational discovery platforms for the scientific community. Operating as a private, non-profit research institute, NCGR translates research at intersections between bioscience, computing and mathematics into improvements in global health and nutrition.

At its inception in 1994, NCGR's mandate was to develop the Genome Sequence Data Base (GSDB), the first publicly accessible, relational database of human genome sequences generated by scientists at Los Alamos National Laboratory. The commercial importance of computationally processing raw genome sequence data led to the spinoff of Molecular Informatics, Inc., formed by NCGR, and its 1997 sale to Perkin Elmer. As a result of subsequent acquisitions, algorithms originally developed at NCGR were used at Celera Genomics, Inc., to complete the first draft of the human genome in June 2000.

Since that time, a revolution has taken place in the biological sciences with the advent of team-based, industrial-scale, systematic research, an approach now termed "systems biology." Biologists no longer work on one gene at a time or one drug target at a time; today they are equipped with information about the entire human genome and biotechnologies capable of measuring thousands of genes or proteins simultaneously. As a result, biomedical scientists can ask questions on a systems level and, perhaps, predict the efficacy and safety of drugs before they go into clinical trials or into the marketplace. Plant biologists can use comparative genomics methods to identify plant allergens or to identify traits that have the greatest impact on global nutrition.

Cutting-edge informatics and computational resources have become centrally important to bioscience researchers worldwide. Over the last decade, the volume of information generated by genome sequencing centers outstripped the effectiveness of collecting and managing data in scientific notebooks and spreadsheet software. Other high-throughput biotechnologies that generate information on a large scale, such as microarray technologies and molecular library screening, created the need to find associations between large sets of different data types. Collaborative research teams working in this new data environment consist of specialists at diverse locations, linked by a network or the World Wide Web. With new data at the systems level and the ability to collect, manage and analyze biological information on a large, or industrial, scale, researchers can only now begin tackling complex biological problems that were intractable a decade ago.

Currently, many bioinformatic tools used in medical schools and academic settings are based on non-portable C++ code and platform-dependent clients.
Biological databases are notorious for "going stale" due to lack of interoperability and link-ability. Cutting-edge biotechnologies generate experimental data that is orphaned on the instrument and not integrated effectively into research programs. Even large pharmaceutical companies have struggled with linking both legacy databases and new biotechnologies.
Cross-disciplinary efforts using clinical data are particularly hampered by the paper culture that dominates patient care settings. The National Institutes of Health has published strategic "Roadmap" goals (nihroadmap.nih.gov/overview.asp, September 2003) envisioning a national software engineering system for bioscience not unlike an integrated office software bundle. Likewise, the Food and Drug Administration issued a report on "Challenge and Opportunity on the Critical Path to New Medical Products" in March 2004 (http://www.fda.gov/oc/initiatives/criticalpath/) that identified an "urgent need for additional public-private collaborative work on technologies such as genomics, proteomics and bioinformatics systems."

As its name suggests, the National Center for Genome Resources is particularly suited to responding to these challenges and emerging opportunities. NCGR has designed, implemented and maintained computational resources throughout the paradigm shift biology has experienced over the last decade. Software engineers, IT professionals and scientists at NCGR have collaborated to build a series of information resources as new data and annotation are layered onto genomic sequences. A decade at the cutting edge has given NCGR an impressive experience base, recently augmented by new hires from biotech companies and biological software providers. NCGR is actively seeking biotechnology and clinical partners, both in industry and academia, and offers software solutions to assist in the development of practical improvements in global health and nutrition.

Today NCGR is a leader in providing integrated, internet-based biological information resources and in developing innovative bioscience software that addresses the growing need to integrate and analyze research results generated at different locations, at different times, and with disparate technologies.
Interestingly, many of the systems biology processes and data types now being adopted and exploited by the biomedical community were first used effectively by the plant biology community. As a result, NCGR pioneered a series of information resources that serve the worldwide plant biology community and underpin fundamental research related to global nutrition. These resources allow in-depth model plant genome analysis, cross genome analysis, and insight into host-pathogen interactions.

NCGR is aiming to broaden the audience for biological information resources.
Currently, biological databases require expert knowledge to find information or make connections among data. Next-generation resources should not require such expertise. Indeed, for systems biology projects to succeed, we must bring together teams of specialists and provide them with tools so they may make meaningful and valid connections in the data, regardless of their background or location.

NCGR is placing additional emphasis on ease of use and seeking to provide engaging, effective presentation of data and information. It should be apparent that one simply cannot screen databases containing tens of millions of fields to gain knowledge. Information resources require the inclusion of innovative, proprietary tools to analyze data and query information. NCGR employs scientists who develop mathematical methods to infer relationships among disparate, sparse datasets to formulate new hypotheses. Combining theoretical computer science, control engineering and mathematics, NCGR is developing innovative analysis methods for application to biological networks. NCGR is also developing engines for next-generation query and analysis tools. Conventional search engines, like Google, are constrained to matching literal, specific words. NCGR has collaborated with New Mexico Tech on a Department of Defense-funded project to use the Unified Medical Language System to translate molecular biological terms into medical terms and vice versa. The resulting tool can search the biomedical literature and return results based on concepts, using controlled vocabularies and logical combinations of words.
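
The difference between literal keyword matching and concept-based search can be illustrated with a minimal Python sketch. It is not NCGR's tool and does not use the Unified Medical Language System itself: the vocabulary, the concept identifiers, the sample sentences and the function names are all hypothetical, and the grouping of terms into concepts is deliberately coarse.

```python
# Toy controlled vocabulary: surface term -> concept identifier.
# Hypothetical identifiers for illustration only; these are NOT real UMLS codes,
# and the groupings are intentionally coarse.
VOCABULARY = {
    "herg": "C_POTASSIUM_CHANNEL",
    "kcnh2": "C_POTASSIUM_CHANNEL",
    "long qt syndrome": "C_ARRHYTHMIA",
    "cardiac arrhythmia": "C_ARRHYTHMIA",
    "torsades de pointes": "C_ARRHYTHMIA",
}

# Made-up example sentences standing in for literature records.
DOCUMENTS = [
    "Variants in KCNH2 alter potassium channel function.",
    "Drug-induced torsades de pointes was reported in two patients.",
    "hERG blockade is a known cause of cardiac arrhythmia.",
    "Peanut allergen sensitization in children.",
]


def concepts_in(text):
    """Return the concept IDs whose terms occur in the text (simple substring match)."""
    lowered = text.lower()
    return {concept for term, concept in VOCABULARY.items() if term in lowered}


def literal_search(query, documents):
    """Keyword search: a document matches only if it contains the query string."""
    needle = query.lower()
    return [doc for doc in documents if needle in doc.lower()]


def concept_search(query, documents, require_all=True):
    """Match documents by shared concepts rather than shared words.

    require_all=True behaves like a logical AND over the query's concepts;
    require_all=False behaves like a logical OR.
    """
    wanted = concepts_in(query)
    if not wanted:
        return []
    results = []
    for doc in documents:
        found = concepts_in(doc)
        matched = wanted <= found if require_all else bool(wanted & found)
        if matched:
            results.append(doc)
    return results


if __name__ == "__main__":
    print(literal_search("hERG", DOCUMENTS))
    print(concept_search("hERG", DOCUMENTS))
    print(concept_search("KCNH2 long QT syndrome", DOCUMENTS))
```

In this sketch a literal search for "hERG" finds only the sentence containing that exact string, while the concept search also returns the sentence that uses the gene symbol KCNH2, because both terms map to the same concept. A production system would need a real vocabulary, tokenization and disambiguation, none of which is attempted here.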

Looking forward, NCGR has partnered with a diverse set of public and private institutions to answer the computational challenges of today's biomedical research. Collaborative, multi-institutional teams have been arrayed to study the genetics of autism; the genetics of drug-induced cardiac arrhythmias, a very common and serious drug side effect; and the genetics underlying immune response to vaccination and peanut allergens. Through these collaborations and others, NCGR seeks to provide tools, algorithms and computational solutions to tackle the challenges of the post-genomic era and make an impact on global health and nutrition.

Susan M. Baxter, Ph.D., is chief operating officer for the National Center for Genome Resources. The web site is www.ncgr.org.