NCI High Performance Computing Data Management Services Common APIs
One of the most significant challenges in supporting high performance computing (HPC) is effective data management, i.e., the tracking, annotation, and staging of digital datasets, accompanied by a data life cycle policy for those datasets. Although data management is frequently not considered an HPC challenge or opportunity, an effective solution is needed to contain the cost of stored data while increasing the scientific usefulness of data created in the era of ‘big data’, where analysis of datasets can take days and the total cost of storing and maintaining large datasets continues to tax personnel and financial resources. Without a reliable, managed dataset solution, large datasets are frequently maintained as multiple isolated copies across physical storage, leading to unnecessary expense as additional storage is required for analysis and for new data. A managed, secured, and high-availability solution will minimize the need to maintain unnecessarily redundant copies of these datasets. Even with projected declines in the cost of physical storage, an investment in managing stored data without associated annotation will provide, at best, minimal long-term scientific usefulness or support for advancing the mission of the NCI.
Annotation and registration of large datasets are essential if managed datasets are to deliver broader scientific impact and advance the mission of the NCI. Consistent with efforts already underway at the NIH within the Big Data To Knowledge (BD2K) program, annotation and registration will make managed datasets useful to the extended community of current and future cancer investigators. Creating and delivering metadata, and tracking how datasets are used, will provide key insights into the scientific impact of each maintained dataset.
Without an effective data management solution, the HPC effort will struggle to stage data for analysis and to recover generated datasets, and will suffer inefficiencies caused by insufficient physical storage and by recomputing results that have already been produced. Therefore, we believe that:
• NCI is in critical need of advancing its core scientific and technological means of data management and services for large, diverse, distributed, and heterogeneous datasets.
• Large datasets are currently maintained as multiple isolated copies across physical storage, leading to unnecessary expense.
• Annotation and registration of datasets are essential if managed datasets are to deliver broader scientific impact and enable the full power of personalized medicine.
• Strategically, the absence of an effective data management solution is a barrier to emerging efforts to leverage the breadth of generated datasets in developing computationally and data-intensive predictive models, as well as to efforts to use cloud resources for collaboration and analysis.
The NCI HPC DME Data Management initiative aims to overcome these challenges and pave the way for big-data-based personalized medicine and for innovative exploration and discovery in cancer treatment and prevention.