This document summarizes interfaces that are instrumental for the interaction with Clouds, Containers, and High Performance Computing (HPC) systems to manage virtual clusters to support the NIST Big Data Reference Architecture (NBDRA). The REpresentational State Transfer (REST) paradigm is used to define these interfaces, allowing easy integration and adoption by a wide variety of frameworks.
Big Data is a term used to describe extensive datasets, primarily in the characteristics of volume, variety, velocity, and/or variability. While opportunities exist with Big Data, the data characteristics can overwhelm traditional technical approaches, and the growth of data is outpacing scientific and technological advances in data analytics. To advance progress in Big Data, the NIST Big Data Public Working Group (NBD-PWG) is working to develop consensus on important fundamental concepts related to Big Data. The results are reported in the NIST Big Data Interoperability Framework (NBDIF) series of volumes. This volume, Volume 8, uses the work performed by the NBD-PWG to identify objects instrumental for the NIST Big Data Reference Architecture (NBDRA) which is introduced in the NBDIF: Volume 6, Reference Architecture.
**Keywords:** Adoption; barriers; implementation; interfaces; market maturity; organizational maturity; project maturity; system modernization.
**Acknowledgements**
This document reflects the contributions and discussions by the membership of the NBD-PWG, co-chaired by Wo Chang (NIST ITL), Bob Marcus (ET-Strategies), and Chaitan Baru (San Diego Supercomputer Center; National Science Foundation). For all versions, the Subgroups were led by the following people: Nancy Grady (SAIC), Natasha Balac (San Diego Supercomputer Center), and Eugene Luster (R2AD) for the Definitions and Taxonomies Subgroup; Geoffrey Fox (Indiana University) and Tsegereda Beyene (Cisco Systems) for the Use Cases and Requirements Subgroup; Arnab Roy (Fujitsu), Mark Underwood (Krypton Brothers; Synchrony Financial), and Akhil Manchanda (GE) for the Security and Privacy Subgroup; David Boyd (InCadence Strategic Solutions), Orit Levin (Microsoft), Don Krapohl (Augmented Intelligence), and James Ketner (AT&T) for the Reference Architecture Subgroup; and Russell Reinsch (Center for Government Interoperability), David Boyd (InCadence Strategic Solutions), Carl Buffington (Vistronix), and Dan McClary (Oracle) for the Standards Roadmap Subgroup.
The following milestone releases exist:
- Version 2.1: A previous version of this volume defined the schema based on examples only. It was easier to read but included only the definition of the resources and not the interaction with the resources. This version was in place until June 2018.
- Version 2.2: This version was significantly changed and used OpenAPI 2.0 to specify the interfaces between the various services and components.
- Version 3.1.1: This version includes significant improvements to the object specifications but still uses OpenAPI 2.0.
- Version 3.2.0: All specifications have been updated to OpenAPI 3.0.2. Significant updates have been made to a number of specifications.
The editors for these documents are:
- Gregor von Laszewski (Indiana University)
- Wo Chang (NIST).
Laurie Aldape (Energetics Incorporated) and Elizabeth Lennon (NIST) provided editorial assistance across all NBDIF volumes.
NIST SP 1500-9, Draft NIST Big Data Interoperability Framework: Volume 8, Reference Architecture Interfaces, Version 2 has been collaboratively authored by the NBD-PWG. As of the date of publication, there are over six hundred NBD-PWG participants from industry, academia, and government. Federal agency participants include the National Archives and Records Administration (NARA), National Aeronautics and Space Administration (NASA), National Science Foundation (NSF), and the U.S. Departments of Agriculture, Commerce, Defense, Energy, Census, Health and Human Services, Homeland Security, Transportation, Treasury, and Veterans Affairs.
NIST would like to acknowledge the specific contributions to this volume, during Version 1 and/or Version 2 activities. Contributors are members of the NIST Big Data Public Working Group who dedicated great effort to prepare and gave substantial time on a regular basis to research and development in support of this document. This includes the following NBD-PWG members:
- Gregor von Laszewski, Indiana University
- Wo Chang, National Institute of Standards and Technology
- Fugang Wang, Indiana University
- Geoffrey C. Fox, Indiana University
- Shirish Joshi, Indiana University
- Badi Abdul-Wahid, formerly Indiana University
- Alicia Zuniga-Alvarado, Consultant
- Robert C. Whetsel, DISA/NBIS
- Pratik Thakkar, Philips
Executive Summary
The NIST Big Data Interoperability Framework (NBDIF): Volume 8, Reference Architecture Interfaces document was prepared by the NIST Big Data Public Working Group (NBD-PWG) Reference Architecture Subgroup to identify interfaces in support of the NIST Big Data Reference Architecture (NBDRA). The interfaces define resources that are part of the NBDRA. These resources are formulated in OpenAPI 3.0.2 format and can be easily integrated into a REpresentational State Transfer (REST) framework or an object-based framework.
The resources were categorized in groups that are identified by the NBDRA set forth in the NBDIF: Volume 6, Reference Architecture document. While the NBDIF: Volume 3, Use Cases and General Requirements document provides *application*-oriented high-level use cases, the use cases defined in this document are subsets of them and focus on interface use cases. The interface use cases are not meant to be complete examples, but showcase why a resource has been defined. Hence, the interface use cases are only representative and do not encompass the entire spectrum of Big Data usage. All the interfaces were openly discussed in the NBD-PWG [@www-bdra-working-group].
The NBDIF was released in three versions, which correspond to the three stages of the NBD-PWG work. Version 3 (current version) of the NBDIF volumes resulted from Stage 3 work with major emphasis on the validation of the NBDRA Interfaces and content enhancement. Stage 3 work built upon the foundation created during Stage 2 and Stage 1. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data. The three stages (in reverse order) aim to achieve the following with respect to the NIST Big Data Reference Architecture (NBDRA).
- Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces;
- Stage 2: Define general interfaces between the NBDRA components; and
- Stage 1: Identify the high-level Big Data reference architecture key components, which are technology-, infrastructure-, and vendor-agnostic.
The NBDIF consists of nine volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The nine volumes are as follows:
- Volume 1, Definitions [@www-vol1-v3]
- Volume 2, Taxonomies [@www-vol2-v3]
- Volume 3, Use Cases and General Requirements [@www-vol3-v3]
- Volume 4, Security and Privacy [@www-vol4-v3]
- Volume 5, Architectures White Paper Survey [@www-vol5-v3]
- Volume 6, Reference Architecture [@www-vol6-v3]
- Volume 7, Standards Roadmap [@www-vol7-v3]
- Volume 8, Reference Architecture Interfaces (this volume)
- Volume 9, Adoption and Modernization [@www-vol9-v3]
During Stage 1, Volumes 1 through 7 were conceptualized, organized, and written. The finalized Version 1 documents can be downloaded from the V1.0 Final Version page of the NBD-PWG website (https://bigdatawg.nist.gov/V1_output_docs.php).
During Stage 2, the NBD-PWG developed Version 2 of the NBDIF Version 1 volumes, with the exception of Volume 5, which contained the completed architecture survey work that was used to inform Stage 1 work of the NBD-PWG. The goals of Version 2 were to enhance the Version 1 content, define general interfaces between the NBDRA components by aggregating low-level interactions into high-level general interfaces, and demonstrate how the NBDRA can be used. As a result of the Stage 2 work, the need for NBDIF Volume 8 and NBDIF Volume 9 was identified and the two new volumes were created. Version 2 of the NBDIF volumes, resulting from Stage 2 work, can be downloaded from the V2.0 Final Version page of the NBD-PWG website (https://bigdatawg.nist.gov/V2_output_docs.php).
This document is the result of Stage 3 work of the NBD-PWG. Coordination of the group is conducted on the NBD-PWG web page [@www-bdra-working-group].
There is broad agreement among commercial, academic, and government leaders about the potential of Big Data to spark innovation, fuel commerce, and drive progress. Big Data is the common term used to describe the deluge of data in today's networked, digitized, sensor-laden, and information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach, including the following:
- How can a potential pandemic reliably be detected early enough to intervene?
- Can new materials with advanced properties be predicted before these materials have ever been synthesized?
- How can the current advantage of the attacker over the defender in guarding against cybersecurity threats be reversed?
There is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The growth rates for data volumes, speeds, and complexity are outpacing scientific and technological advances in data analytics, management, transport, and data user spheres.
Despite widespread agreement on the inherent opportunities and current limitations of Big Data, a lack of consensus on some important fundamental questions continues to confuse potential users and stymie progress. These questions include the following:
- How is Big Data defined?
- What attributes define Big Data solutions?
- What is new in Big Data?
- What is the difference between Big Data and bigger data that has been collected for years?
- How is Big Data different from traditional data environments and related applications?
- What are the essential characteristics of Big Data environments?
- How do these environments integrate with currently deployed architectures?
- What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust, secure Big Data solutions?
Within this context, on March 29, 2012, the White House announced the Big Data Research and Development Initiative (The White House Office of Science and Technology Policy, "Big Data is a Big Deal," OSTP Blog, accessed February 21, 2014) [@www-whitehous-bigdata]. The initiative's goals include helping to accelerate the pace of discovery in science and engineering, strengthening national security, and transforming teaching and learning by improving analysts' ability to extract knowledge and insights from large and complex collections of digital data.
Six federal departments and their agencies announced more than $200 million in commitments spread across more than 80 projects, which aim to significantly improve the tools and techniques needed to access, organize, and draw conclusions from huge volumes of digital data. The initiative also challenged industry, research universities, and nonprofits to join with the federal government to make the most of the opportunities created by Big Data.
Motivated by the White House initiative and public suggestions, the National Institute of Standards and Technology (NIST) accepted the challenge to stimulate collaboration among industry professionals to further the secure and effective adoption of Big Data. As a result of NIST's Cloud and Big Data Forum held on January 15--17, 2013, there was strong encouragement for NIST to create a public working group for the development of a Big Data Standards Roadmap. Forum participants noted that this roadmap should define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytics, and technology infrastructure. In doing so, the roadmap would accelerate the adoption of the most secure and effective Big Data techniques and technology.
On June 19, 2013, the NIST Big Data Public Working Group (NBD-PWG) was launched with extensive participation by industry, academia, and government from across the nation. The scope of the NBD-PWG involves forming a community of interests from all sectors---including industry, academia, and government---with the goal of developing consensus on definitions, taxonomies, secure reference architectures, security and privacy, and, from these, a standards roadmap. Such a consensus would create a vendor-neutral, technology- and infrastructure-independent framework that would enable Big Data stakeholders to identify and use the best analytics tools for their processing and visualization requirements on the most suitable computing platform and cluster, while also allowing added value from Big Data service providers.
The NIST Big Data Interoperability Framework (NBDIF) was released in three versions, which correspond to the three stages of the NBD-PWG work. Version 3 (current version) of the NBDIF volumes resulted from Stage 3 work with major emphasis on the validation of the NBDRA Interfaces and content enhancement. Stage 3 work built upon the foundation created during Stage 2 and Stage 1. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data. The three stages (in reverse order) aim to achieve the following with respect to the NIST Big Data Reference Architecture (NBDRA).
- Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces;
- Stage 2: Define general interfaces between the NBDRA components; and
- Stage 1: Identify the high-level Big Data reference architecture key components, which are technology-, infrastructure-, and vendor-agnostic.
The NBDIF consists of nine volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The nine volumes are as follows:
- Volume 1, Definitions [@www-vol1-v3]
- Volume 2, Taxonomies [@www-vol2-v3]
- Volume 3, Use Cases and General Requirements [@www-vol3-v3]
- Volume 4, Security and Privacy [@www-vol4-v3]
- Volume 5, Architectures White Paper Survey [@www-vol5-v3]
- Volume 6, Reference Architecture [@www-vol6-v3]
- Volume 7, Standards Roadmap [@www-vol7-v3]
- Volume 8, Reference Architecture Interfaces (this volume) [@www-vol8-v3]
- Volume 9, Adoption and Modernization [@www-vol9-v3]
During Stage 1, Volumes 1 through 7 were conceptualized, organized and written. The finalized Version 1 documents can be downloaded from the V1.0 Final Version page of the NBD-PWG website [@www-nist-bdra-v1-wegpage].
During Stage 2, the NBD-PWG developed Version 2 of the NBDIF Version 1 volumes, with the exception of Volume 5, which contained the completed architecture survey work that was used to inform Stage 1 work of the NBD-PWG. The goals of Version 2 were to enhance the Version 1 content, define general interfaces between the NBDRA components by aggregating low-level interactions into high-level general interfaces, and demonstrate how the NBDRA can be used. As a result of the Stage 2 work, the need for NBDIF Volume 8 and NBDIF Volume 9 was identified and the two new volumes were created. Version 2 of the NBDIF volumes, resulting from Stage 2 work, can be downloaded from the V2.0 Final Version page of the NBD-PWG website [@www-nist-bdra-v2-wegpage].
Reference architectures provide "an authoritative source of information about a specific subject area that guides and constrains the instantiations of multiple architectures and solutions" [@www-dodaf-arch]. Reference architectures generally serve as a foundation for solution architectures and may also be used for comparison and alignment of instantiations of architectures and solutions.
The goal of the NBD-PWG Reference Architecture Subgroup is to develop an open reference architecture for Big Data that achieves the following objectives:
- Provides a common language for the various stakeholders;
- Encourages adherence to common standards, specifications, and patterns;
- Provides consistent methods for implementation of technology to solve similar problem sets;
- Illustrates and improves understanding of the various Big Data components, processes, and systems, in the context of a vendor- and technology-agnostic Big Data conceptual model;
- Provides a technical reference for U.S. government departments, agencies, and other consumers to understand, discuss, categorize, and compare Big Data solutions; and
- Facilitates analysis of candidate standards for interoperability, portability, reusability, and extendibility.
The NBDRA is a high-level conceptual model crafted to serve as a tool to facilitate open discussion of the requirements, design structures, and operations inherent in Big Data. The NBDRA is intended to facilitate the understanding of the operational intricacies in Big Data. It does not represent the system architecture of a specific Big Data system, but rather is a tool for describing, discussing, and developing system-specific architectures using a common framework of reference. The model is not tied to any specific vendor products, services, or reference implementation, nor does it define prescriptive solutions that inhibit innovation.
The NBDRA does not address the following:
- Detailed specifications for any organization's operational systems;
- Detailed specifications of information exchanges or services; and
- Recommendations or standards for integration of infrastructure products.
The goals of the Subgroup were realized throughout the three planned phases of the NBD-PWG work, as outlined in @sec:production.
The NBDIF: Volume 8, Reference Architecture Interfaces is one of nine volumes, whose overall aims are to define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytic techniques, and technology infrastructure to support secure and effective adoption of Big Data. The overall goals of this volume are to define and specify interfaces to implement the Big Data Reference Architecture. This volume arose from discussions during the weekly NBD-PWG conference calls. Topics included in this volume began to take form in Phase 2 of the NBD-PWG work. During the discussions, the NBD-PWG identified the need to specify a variety of interfaces.
Phase 3 work, which built upon the groundwork developed during Phase 2, included an early specification based on resource object specifications that provided a simplified version of an API interface design.
To enable interoperability between the NBDRA components, a list of well-defined NBDRA interfaces is needed. These interfaces are documented in this volume. To introduce them, the NBDRA structure will be followed, focusing on interfaces that allow bootstrapping of the NBDRA. The document begins with a summary of requirements that will be integrated into the specifications. Subsequently, each section introduces a number of objects that build the core of the interface addressing a specific aspect of the NBDRA. A selected number of interface use cases will be showcased to outline how the specific interface can be used in a reference implementation of the NBDRA. Validation of this approach can be achieved by applying it to the application use cases that have been gathered in the NBDIF: Volume 3, Use Cases and Requirements document. These application use cases have considerably contributed towards the design of the NBDRA. Hence, the expectation is that (1) the interfaces can be used to help implement a Big Data architecture for a specific use case, and (2) a proper implementation can facilitate subsequent analysis and comparison of the use cases.
The organization of this document roughly corresponds to the process used by the NBD-PWG to develop the interfaces. Following the introductory material presented in @sec:introduction, the remainder of this document is organized as follows:
- @sec:interface-requirements presents the interface requirements;
- @sec:spec-paradigm presents the specification paradigm that is used;
- @sec:specification presents several objects grouped by functional use while providing a summary table of selected proposed objects in @sec:spec-table.
While each NBDIF volume was created with a specific focus within Big Data, all volumes are interconnected. During creation, the volumes gave and/or received input from other volumes. Broad topics (e.g., definition, architecture) may be discussed in several volumes with the discussion circumscribed by the volume’s particular focus. Arrows shown in @fig:nist-doc-nav indicate the main flow of input/output. Volumes 2, 3, and 5 (blue circles) are essentially standalone documents that provide output to other volumes (e.g., to Volume 6). These volumes contain the initial situational awareness research. Volumes 4, 7, 8, and 9 (green circles) primarily received input from other volumes during the creation of the particular volume. Volumes 1 and 6 (red circles) were developed using the initial situational awareness research and continued to be modified based on work in other volumes. These volumes also provided input to the green circle volumes.
The development of a Big Data reference architecture requires a thorough understanding of current techniques, issues, and concerns. To this end, the NBD-PWG collected use cases to gain an understanding of current applications of Big Data, conducted a survey of reference architectures to understand commonalities within Big Data architectures in use, developed a taxonomy to understand and organize the information collected, and reviewed existing technologies and trends relevant to Big Data. The results of these NBD-PWG activities were used in the development of the NBDRA (@fig:arch) and the interfaces presented herein. Detailed descriptions of these activities can be found in the other volumes of the NBDIF.
This vendor-neutral, technology- and infrastructure-agnostic conceptual model, the NBDRA, is shown in @fig:arch and represents a Big Data system composed of five logical functional components connected by interoperability interfaces (i.e., services). Two fabrics envelop the components, representing the interwoven nature of management and security and privacy with all five of the components. These two fabrics provide services and functionality to the five main roles in the areas specific to Big Data and are crucial to any Big Data solution. Note: None of the terminology or diagrams in these documents is intended to be normative or to imply any business or deployment model. The terms provider and consumer as used are descriptive of general roles and are meant to be informative in nature.
The NBDRA is organized around five major roles and multiple sub-roles aligned along two axes representing the two Big Data value chains: the Information Value (horizontal axis) and the Information Technology (IT; vertical axis). Along the Information Value axis, the value is created by data collection, integration, analysis, and applying the results following the value chain. Along the IT axis, the value is created by providing networking, infrastructure, platforms, application tools, and other IT services for hosting of and operating the Big Data in support of required data applications. At the intersection of both axes is the Big Data Application Provider role, indicating that data analytics and its implementation provide the value to Big Data stakeholders in both value chains. The term provider as part of the Big Data Application Provider and Big Data Framework Provider is there to indicate that those roles provide or implement specific activities and functions within the system. It does not designate a service model or business entity.
The DATA arrows in @fig:arch show the flow of data between the system's main roles. Data flows between the roles either physically (i.e., by value) or by providing its location and the means to access it (i.e., by reference). The SW arrows show transfer of software tools for processing of Big Data in situ. The Service Use arrows represent software programmable interfaces. While the main focus of the NBDRA is to represent the run-time environment, all three types of communications or transactions can happen in the configuration phase as well. Manual agreements (e.g., service-level agreements) and human interactions that may exist throughout the system are not shown in the NBDRA.
Detailed information on the NBDRA conceptual model is presented in the NBDIF: Volume 6, Reference Architecture document.
Prior to outlining the specific interfaces, general requirements are introduced and the interfaces are defined.
This section focuses on the high-level requirements of the interface approach that are needed to implement the reference architecture depicted in @fig:arch.
Due to the many different tools, services, and infrastructures available in the general area of Big Data, an interface ought to be as vendor-independent as possible while, at the same time, being able to leverage best practices. Hence, a methodology is needed that allows extension of the interfaces to adapt to and leverage existing approaches, but also keeps the interfaces simple enough to assist in the formulation and definition of the NBDRA.
As Big Data is not just about hosting data, but about analyzing data, the interfaces provided herein must encapsulate a rich infrastructure environment that is used by data scientists. This includes the ability to integrate (or plug-in) various compute resources and services to provide the necessary compute power to analyze the data. These resources and services include the following:
- Access to hierarchy of compute resources from the laptop/desktop, servers, data clusters, and clouds;
- The ability to integrate special-purpose hardware such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) that are used in accelerated analysis of data; and
- The integration of services including microservices that allow the analysis of the data by delegating them to hosted or dynamically deployed services on the infrastructure of choice.
From review of the use case collection, presented in the NBDIF: Volume 3, Use Cases and General Requirements document [@www-vol3-v3], the need arose to address the mechanism of preparing suitable infrastructures for various use cases. As not every infrastructure is suited for every use case, a custom infrastructure may be needed. As such, this document is not attempting to deliver a single deployed NBDRA, but allow the setup of an infrastructure that satisfies the particular use case. To achieve this task, it is necessary to provision software stacks and services while orchestrating their deployment and leveraging infrastructures. It is not the focus of this document to replace existing orchestration software and services, but provide an interface to them to leverage them as part of defining and creating the infrastructure. Various orchestration frameworks and services could therefore be leveraged, even as part of the same framework, and work in orchestrated fashion to achieve the goal of preparing an infrastructure suitable for one or more applications.
The creation of the infrastructure suitable for Big Data applications provides the basic computing environment. However, Big Data applications may require the creation of sophisticated applications as part of interactive experiments to analyze and probe the data. For this purpose, the applications must be able to orchestrate and interact with experiments conducted on the data while assuring reproducibility and correctness of the data. For this purpose, a System Orchestrator (either the data scientists or a service acting on behalf of the data scientist) is used as the command center to interact on behalf of the Big Data Application Provider to orchestrate dataflow from Data Provider, carry out the Big Data application life cycle with the help of the Big Data Framework Provider, and enable the Data Consumer to consume Big Data processing results. An interface is needed to describe these interactions and to allow leveraging of experiment management frameworks in scripted fashion. A customization of parameters is needed on several levels. On the highest level, application-motivated parameters are needed to drive the orchestration of the experiment. On lower levels, these high-level parameters may drive and create service-level agreements, augmented specifications, and parameters that could even lead to the orchestration of infrastructure and services to satisfy experiment needs.
The interfaces provided must encourage reusability of the infrastructure, services, and experiments described by them. This includes (1) reusability of available analytics packages and services for adoption; (2) deployment of customizable analytics tools and services; and (3) operational adjustments that allow the services and infrastructure to be adapted while at the same time allowing for reproducible experiment execution.
One of the important aspects of distributed Big Data services can be that the data served is simply too big to be moved to a different location. Instead, an interface could allow the description and packaging of analytics algorithms, and potentially also tools, as a payload to a data service. This is best achieved not by sending the detailed execution itself, but by sending an interface description that specifies how such an algorithm or tool can be created on the server and executed under security considerations (i.e., with authentication and authorization in mind).
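As a non-normative illustration, the following OpenAPI 3.0 sketch shows how such a payload interface could be expressed. The endpoint and object names (e.g., `/analytics`, `AnalyticsPayload`) are assumptions made for this example and are not part of the specification tables later in this document.

```yaml
# Hypothetical sketch only: /analytics and AnalyticsPayload are illustrative
# assumptions and not part of the normative NBDRA resource specifications.
openapi: "3.0.2"
info:
  title: Analytics payload service (illustrative sketch)
  version: "0.0.1"
paths:
  /analytics:
    post:
      summary: Register an analytics algorithm to be executed close to the data
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/AnalyticsPayload"
      responses:
        "201":
          description: The algorithm was registered and scheduled for execution.
components:
  schemas:
    AnalyticsPayload:
      type: object
      properties:
        name:
          type: string
          description: Name of the algorithm or tool.
        source:
          type: string
          description: Location of the algorithm description (by reference, not by value).
        parameters:
          type: object
          description: Application-level parameters that drive the execution.
```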
Although the focus of this document is not security and privacy, which are documented in the NBDIF: Volume 4, Security and Privacy [@www-vol4-v3], the interfaces defined herein must be capable of integration into a secure reference architecture that supports secure execution, secure data transfer, and privacy. Consequently, the interfaces defined herein can be augmented with frameworks and solutions that provide such mechanisms. Thus, diverse requirement needs stemming from different use cases addressing security need to be distinguished. To illustrate how drastically security requirements can vary between applications, the following example is provided. Although many of the interfaces and their objects to support Big Data applications in physics are similar to those in healthcare, they differ in the integration of security interfaces and policies. While in physics the protection of data is less of an issue, it is a stringent requirement in healthcare. Thus, architectural frameworks derived for both may use largely similar components, but addressing security issues will be very different. The security of interfaces may be addressed in other documents. In this document, they are considered an advanced use case showcasing that the validity of the specifications introduced here is preserved, even if security and privacy requirements differ vastly among application use cases.
This section summarizes the requirements for the interfaces of the NBDRA components. The five components are listed in @fig:arch and addressed in @sec:system-orchestrator-requirements (System Orchestrator Interface Requirements) through @sec:app-provider-requirements (Big Data Application Provider to Big Data Framework Provider Interface) of this document. The five main functional components of the NBDRA represent the different technical roles within a Big Data system and are the following:
- System Orchestrator: Defines and integrates the required data application activities into an operational vertical system (see @sec:system-orchestrator-requirements);
- Data Provider: Introduces new data or information feeds into the Big Data system (see @sec:data-provider-requirements);
- Data Consumer: Includes end users or other systems that use the results of the Big Data Application Provider (see @sec:data-consumer-requirements);
- Big Data Application Provider: Executes a data life cycle to meet security and privacy requirements as well as System Orchestrator-defined requirements (see @sec:data-application-requirements);
- Big Data Framework Provider: Establishes a computing framework in which to execute certain transformation applications while protecting the privacy and integrity of data (see @sec:provider-requirements); and
- Big Data Application Provider to Framework Provider Interface: Defines an interface between the application specification and the provider (see @sec:app-provider-requirements).
The System Orchestrator role includes defining and integrating the required data application activities into an operational vertical system. Typically, the System Orchestrator involves a collection of more specific roles, performed by one or more actors, which manage and orchestrate the operation of the Big Data system. These actors may be human components, software components, or some combination of the two. The function of the System Orchestrator is to configure and manage the other components of the Big Data architecture to implement one or more workloads that the architecture is designed to execute. The workloads managed by the System Orchestrator may be assigning/provisioning framework components to individual physical or virtual nodes at the lower level, or providing a graphical user interface that supports the specification of workflows linking together multiple applications and components at the higher level. The System Orchestrator may also, through the Management Fabric, monitor the workloads and system to confirm that specific quality-of-service requirements are met for each workload, and may elastically assign and provision additional physical or virtual resources to meet workload requirements resulting from changes/surges in the data or number of users/transactions. The interface to the System Orchestrator must be capable of specifying the task of orchestrating the deployment, configuration, and execution of applications within the NBDRA. A simple vendor-neutral specification to coordinate the various parts, either as simple parallel language tasks or as a workflow specification, is needed to facilitate the overall coordination. Integration of existing tools and services into the System Orchestrator as extensible interfaces is desirable.
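As a non-normative sketch, such a vendor-neutral orchestration interface could, for example, expose a workflow resource as shown below. The `/workflow` path and the `Workflow` object are assumptions made for this illustration.

```yaml
# Illustrative sketch only: /workflow and Workflow are assumed names used to
# indicate the style of a vendor-neutral orchestration interface.
openapi: "3.0.2"
info:
  title: System Orchestrator interface (illustrative sketch)
  version: "0.0.1"
paths:
  /workflow:
    post:
      summary: Submit a workflow coordinating deployment, configuration, and execution
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/Workflow"
      responses:
        "201":
          description: Workflow accepted for orchestration.
  /workflow/{name}:
    get:
      summary: Return the status of a named workflow
      parameters:
        - name: name
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: Status of the workflow.
components:
  schemas:
    Workflow:
      type: object
      properties:
        name:
          type: string
        tasks:
          type: array
          description: Parallel tasks or workflow steps referencing other NBDRA resources.
          items:
            type: string
```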
The Data Provider role introduces new data or information feeds into the Big Data system for discovery, access, and transformation by the Big Data system. New data feeds are distinct from the data already in use by the system and residing in the various system repositories. Similar technologies can be used to access both new data feeds and existing data. The Data Provider actors can be anything from a sensor, to a human inputting data manually, to another Big Data system. Interfaces for data providers must be able to specify a data provider so it can be located by a data consumer. It also must include enough details to identify the services offered so they can be pragmatically reused by consumers. Interfaces to describe pipes and filters must be addressed.
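A minimal, non-normative sketch of a data provider registration interface is given below; the `/dataprovider` resource and its attributes are assumptions chosen for this example.

```yaml
# Illustrative sketch only: /dataprovider and its attributes are assumed names
# that indicate how a provider could be registered and located by consumers.
openapi: "3.0.2"
info:
  title: Data Provider registry (illustrative sketch)
  version: "0.0.1"
paths:
  /dataprovider:
    get:
      summary: List registered data providers so that consumers can locate them
      responses:
        "200":
          description: A list of registered data providers.
    post:
      summary: Register a new data provider
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/DataProvider"
      responses:
        "201":
          description: Data provider registered.
components:
  schemas:
    DataProvider:
      type: object
      properties:
        name:
          type: string
        endpoint:
          type: string
          description: Location (URL) at which the offered data services can be reached.
        services:
          type: array
          description: Services offered, e.g., files, streams, or filters.
          items:
            type: string
```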
Like the Data Provider, the role of Data Consumer within the NBDRA can be an actual end user or another system. In many ways, this role is the mirror image of the Data Provider, with the entire Big Data framework appearing like a Data Provider to the Data Consumer. The activities associated with the Data Consumer role include the following:
- Search and Retrieve,
- Download,
- Analyze Locally,
- Reporting,
- Visualization, and
- Data to Use for Their Own Processes.
The interface for the data consumer must be able to describe the consuming services and how they retrieve information or leverage data consumers.
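The following non-normative sketch indicates how a consuming service could search for and retrieve results; the `/result` resource is an assumption introduced only for this illustration.

```yaml
# Illustrative sketch only: /result is an assumed resource showing how a data
# consumer could search for and retrieve results of a Big Data application.
openapi: "3.0.2"
info:
  title: Data Consumer access (illustrative sketch)
  version: "0.0.1"
paths:
  /result:
    get:
      summary: Search for available results
      parameters:
        - name: query
          in: query
          required: false
          schema:
            type: string
      responses:
        "200":
          description: A list of matching results.
  /result/{name}:
    get:
      summary: Retrieve a named result, either by value or by reference
      parameters:
        - name: name
          in: path
          required: true
          schema:
            type: string
      responses:
        "200":
          description: The requested result or a reference to it.
```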
The Big Data Application Provider role executes a specific set of operations along the data life cycle to meet the requirements established by the System Orchestrator, as well as meeting security and privacy requirements. The Big Data Application Provider is the architecture component that encapsulates the business logic and functionality to be executed by the architecture. The interfaces to describe Big Data applications include interfaces for the various subcomponents including collections, preparation/curation, analytics, visualization, and access. Some of the interfaces used in these subcomponents can be reused from other interfaces, which are introduced in other sections of this document. Where appropriate, application-specific interfaces will be identified and examples provided with a focus on use cases as identified in the NBDIF: Volume 3, Use Cases and General Requirements.
In general, the collection activity of the Big Data Application Provider handles the interface with the Data Provider. This may be a general service, such as a file server or web server configured by the System Orchestrator to accept or perform specific collections of data, or it may be an application-specific service designed to pull data or receive pushes of data from the Data Provider. Since this activity is receiving data at a minimum, it must store/buffer the received data until it is persisted through the Big Data Framework Provider. This persistence need not be to physical media but may simply be to an in-memory queue or other service provided by the processing frameworks of the Big Data Framework Provider. The collection activity is likely where the extraction portion of the Extract, Transform, Load (ETL)/Extract, Load, Transform (ELT) cycle is performed. At the initial collection stage, sets of data (e.g., data records) of similar structure are collected (and combined), resulting in uniform security, policy, and other considerations. Initial metadata is created (e.g., subjects with keys are identified) to facilitate subsequent aggregation or look-up methods.
The preparation activity is where the transformation portion of the ETL/ELT cycle is likely performed, although analytics activity will also likely perform advanced parts of the transformation. Tasks performed by this activity could include data validation (e.g., checksums/hashes, format checks), cleaning (e.g., eliminating bad records/fields), outlier removal, standardization, reformatting, or encapsulating. This activity is also where source data will frequently be persisted to archive storage in the Big Data Framework Provider and provenance data will be verified or attached/associated. Verification or attachment may include optimization of data through manipulations (e.g., deduplication) and indexing to optimize the analytics process. This activity may also aggregate data from different Data Providers, leveraging metadata keys to create an expanded and enhanced data set.
The analytics activity of the Big Data Application Provider includes the encoding of the low-level business logic of the Big Data system (with higher-level business process logic being encoded by the System Orchestrator). The activity implements the techniques to extract knowledge from the data based on the requirements of the vertical application. The requirements specify the data processing algorithms to produce new insights that will address the technical goal. The analytics activity will leverage the processing frameworks to implement the associated logic. This typically involves the activity providing software that implements the analytic logic to the batch and/or streaming elements of the processing framework for execution. The messaging/communication framework of the Big Data Framework Provider may be used to pass data or control functions to the application logic running in the processing frameworks. The analytic logic may be broken up into multiple modules to be executed by the processing frameworks which communicate, through the messaging/communication framework, with each other and other functions instantiated by the Big Data Application Provider.
The visualization activity of the Big Data Application Provider prepares elements of the processed data and the output of the analytic activity for presentation to the Data Consumer. The objective of this activity is to format and present data in such a way as to optimally communicate meaning and knowledge. The visualization preparation may involve producing a text-based report or rendering the analytic results as some form of graphic. The resulting output may be a static visualization and may simply be stored through the Big Data Framework Provider for later access. However, the visualization activity frequently interacts with the access activity, the analytics activity, and the Big Data Framework Provider (processing and platform) to provide interactive visualization of the data to the Data Consumer based on parameters provided to the access activity by the Data Consumer. The visualization activity may be completely application-implemented, leverage one or more application libraries, or may use specialized visualization processing frameworks within the Big Data Framework Provider.
The access activity within the Big Data Application Provider is focused on the communication/interaction with the Data Consumer. Like the collection activity, the access activity may be a generic service such as a web server or application server that is configured by the System Orchestrator to handle specific requests from the Data Consumer. This activity would interface with the visualization and analytic activities to respond to requests from the Data Consumer (who may be a person) and uses the processing and platform frameworks to retrieve data to respond to Data Consumer requests. In addition, the access activity confirms that descriptive and administrative metadata and metadata schemes are captured and maintained for access by the Data Consumer and as data is transferred to the Data Consumer. The interface with the Data Consumer may be synchronous or asynchronous in nature and may use a pull or push paradigm for data transfer.
Data for Big Data applications are delivered through data providers. They can be either local providers (e.g., data contributed by a user) or distributed data providers (e.g., data on the Internet). This interface must be able to provide the following functionality (a non-normative sketch follows the list):
- Interfaces to files,
- Interfaces to virtual data directories,
- Interfaces to data streams, and
- Interfaces to data filters.
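The sketch below indicates, in a non-normative way, how these four kinds of resources could be exposed; the path names are assumptions for this example, and the normative definitions are given by the specifications included later in this document.

```yaml
# Illustrative sketch only: the path names below are assumptions indicating how
# file, virtual directory, stream, and filter resources could be exposed.
openapi: "3.0.2"
info:
  title: Data Provider resources (illustrative sketch)
  version: "0.0.1"
paths:
  /file:
    get:
      summary: List the files offered by the data provider
      responses:
        "200":
          description: List of files, returned by value or by reference.
  /virtualdirectory:
    get:
      summary: List the entries of the virtual data directories
      responses:
        "200":
          description: Entries of the virtual directories.
  /stream:
    get:
      summary: List the available data streams
      responses:
        "200":
          description: Connection information for the available streams.
  /filter:
    get:
      summary: List the filters that can be applied to files or streams
      responses:
        "200":
          description: Descriptions of the available filters.
```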
This Big Data Framework Provider element provides all the resources necessary to host/run the activities of the other components of the Big Data system. Typically, these resources consist of some combination of physical resources, which may host/support similar virtual resources. The NBDRA needs interfaces that can be used to deal with the underlying infrastructure to address networking, computing, and storage.
As part of the NBDRA platforms, interfaces are needed that can address platform needs and services for data organization, data distribution, indexed storage, and file systems.
The processing frameworks for Big Data provide the necessary infrastructure software to support implementation of applications that can deal with the volume, velocity, variety, and variability of data. Processing frameworks define how the computation and processing of the data is organized. Big Data applications rely on various platforms and technologies to meet the challenges of scalable data analytics and operation. A requirement is the ability to interface easily with computing services that offer specific analytics services, batch processing capabilities, interactive analysis, and data streaming.
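As a non-normative example, a batch analytics job could be submitted to a processing framework through an interface such as the one sketched below; the `/mapreduce` path and its attributes are assumptions, and the normative specifications appear in the specification sections of this document.

```yaml
# Illustrative sketch only: /mapreduce and its attributes are assumed names for
# an interface to a batch processing service.
openapi: "3.0.2"
info:
  title: Processing framework interface (illustrative sketch)
  version: "0.0.1"
paths:
  /mapreduce:
    post:
      summary: Submit a batch analytics job to the processing framework
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                name:
                  type: string
                function:
                  type: string
                  description: Reference to the analytics function to execute.
                data:
                  type: string
                  description: Reference to the input data set.
      responses:
        "201":
          description: Job accepted for execution.
```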
Several crosscutting interface requirements within the Big Data Framework Provider include messaging, communication, and resource management. Often these services may be hidden from explicit interface use as they are part of larger systems that expose higher-level functionality through their interfaces. However, such interfaces may also be exposed on a lower level in case finer-grained control is needed. The need for such crosscutting interface requirements will be extracted from the NBDIF: Volume 3, Use Cases and General Requirements document.
Messaging and communications frameworks have their roots in the High Performance Computing environments long popular in the scientific and research communities. Messaging/Communications Frameworks were developed to provide application programming interfaces (APIs) for the reliable queuing, transmission, and receipt of data.
As Big Data systems have evolved and become more complex, and as businesses work to leverage limited computation and storage resources to address a broader range of applications and business challenges, the requirement to effectively manage those resources has grown significantly. While tools for resource management and elastic computing have expanded and matured in response to the needs of cloud providers and virtualization technologies, Big Data introduces unique requirements for these tools. However, Big Data frameworks tend to fall more into a distributed computing paradigm, which presents additional challenges.
## Big Data Application Provider to Big Data Framework Provider Interface {#sec:app-provider-requirements}
The Big Data Framework Provider typically consists of one or more hierarchically organized instances of the components in the NBDRA IT value chain (@fig:arch). There is no requirement that all instances at a given level in the hierarchy be of the same technology. In fact, most Big Data implementations are hybrids that combine multiple technology approaches to provide flexibility or meet the complete range of requirements, which are driven from the Big Data Application Provider.
This section summarizes the elementary specification paradigm.
To avoid vendor lock-in, Big Data systems must be able to deal with hybrid and multiple frameworks. This is true not only for Clouds, containers, and DevOps, but also for the components of the NBDRA.
A resource-oriented architecture represents a software architecture and programming paradigm for designing and developing software in the form of resources. It is often associated with REpresentational State Transfer (REST) interfaces. The resources are software components which can be reused in concrete reference implementations. The service specification is conducted with OpenAPI, allowing it to be provided in a very general form that is independent of the framework or computer language in which the services can be specified. Note that OpenAPI defines RESTful services; the previous version of this document specified only the resource objects.
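The following minimal OpenAPI 3.0.2 sketch illustrates the style in which the resources in this document are specified. The resource name `sample` is a placeholder introduced only for this illustration; the normative resource definitions are included in the specification sections below.

```yaml
# Minimal sketch of the specification style used in this document, assuming a
# hypothetical resource named "sample".
openapi: "3.0.2"
info:
  title: Sample resource (illustrative sketch)
  version: "0.0.1"
paths:
  /sample:
    get:
      summary: Return the list of sample resources
      responses:
        "200":
          description: List of sample resources.
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: "#/components/schemas/Sample"
    post:
      summary: Create a new sample resource
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: "#/components/schemas/Sample"
      responses:
        "201":
          description: Sample resource created.
components:
  schemas:
    Sample:
      type: object
      properties:
        name:
          type: string
        description:
          type: string
```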
To accelerate discussion among the NBD-PWG members, contributors to this document are encouraged to also provide the NBD-PWG with examples.
Previous work that shaped the current version of this volume is documented in GitHub [@nist-github-bdra-issues], together with prior versions of Volume 8 [@nist-vol8-hist; @nist-vol8-draft] and Cloudmesh [@nist-cloudmesh], in support of the NIST Big Data Architecture Framework [@www-bdra-working-group].
During the design phase and development period of each version of this document, enhancements are managed through GitHub and community contributions are managed via GitHub issues. This allows preservation of the history of this document. When a new version is ready, the version will be tagged in GitHub. Older versions will, through this process, also be available as historical documents. Discussions about objects in written form are communicated as GitHub issues.
Due to the easy extensibility of the resource objects specified in this document and their interfaces, it is important to introduce a terminology that allows the definition of interface compliance. We define three levels of interface compliance as follows:
- Full Compliance: These are reference implementations that provide full compliance to the objects defined in this document. A version number is added to assure that the snapshot in time of the objects is associated with the version. A fully compliant framework implements all objects.
- Partial Compliance: These are reference implementations that provide partial compliance to the objects defined in this document. A version number is added to assure that the snapshot in time of the objects is associated with the version. Such a reference implementation implements a partial list of the objects and interfaces. A document is to be added that specifies the differences from a fully compliant implementation.
- Extended Compliance: In addition to full and partial compliance, additional resources can be identified by documenting additional resource objects and interfaces that are not included in the current specification. The extended compliance document can lead to additional improvements of the current specification.
Documents generated during a reference implementation can be forwarded to the Reference Architecture Subgroup for further discussion and for possible future modifications based on additional practical user feedback.
The specifications in this section are provided through an automated document creation process using the actual OpenAPI specification YAML files as the source. All OpenAPI specifications are located in the cloudmesh/cloudmesh-nist/spec/ directory in GitHub [@nist-github-bdra-spec].
Limitations of the current implementation are as follows. It is a demonstration that showcases the generation of a fully functioning REST service based on the specifications provided in this document. However, it is expected that scalability, distribution of services, and other advanced options need to be addressed based on application requirements.
The following table lists the current set of resource objects that are defined in this draft. Additional objects are also available in GitHub [@nist-github-bdra-spec].
@tbl:spec-table shows the list of specifications currently included in this version of the document.
{include=./specstable.md}
@fig:bdra-provider-view shows the provider view of the specification resources.
@fig:spec shows the resources view of the specification resources.
Mechanisms are not included in this specification to manage authentication to external services. However, the working group has shown multiple solutions to this as part of Cloudmesh. These include the following possibilities:
- Local configuration file: A configuration file is managed locally to allow access to the clouds. It is the designer's responsibility not to expose such credentials.
- Session-based authentication: No passwords are stored in the configuration file and access is granted on a per-session basis where the password needs to be entered.
- Service-based authentication: The authentication is delegated to an external process. The service that acts on behalf of the user needs to have access to the appropriate cloud provider credentials.
An example for a configuration file is provided at [@cloudmesh4-yaml].
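For orientation, a hypothetical local configuration file might look like the sketch below. The keys and values shown are assumptions made for this illustration and do not reproduce the actual configuration file referenced above.

```yaml
# Hypothetical sketch of a local configuration file holding cloud credentials.
# The keys and values are assumptions for illustration only; the designer is
# responsible for not exposing such credentials.
cloudmesh:
  profile:
    user: jane-doe
  cloud:
    examplecloud:
      cm:
        active: true
        kind: openstack            # kind of cloud provider (assumed)
      credentials:
        auth_url: https://cloud.example.org:5000/v3
        username: jane-doe
        password: TBD              # entered per session if session-based authentication is used
        project_id: example-project
```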
In case of an error or a successful response, the response header contains an HTTP code (see https://tools.ietf.org/html/rfc7231). The response body usually contains the following:
- The HTTP response code;
- An accompanying message for the HTTP response code; and
- A field or object where the error occurred.
Table 1: HTTP Response Codes
| HTTP Response    | Operation              | Description                                                 |
|------------------|------------------------|-------------------------------------------------------------|
| 200 OK           | GET, PUT, DELETE       | No error, operation successful.                             |
| 201 Created      | POST                   | Successful creation of a resource.                          |
| 204 No Content   | GET, PUT, DELETE       | Successful, but no content.                                 |
| 400 Bad Request  | GET, POST, PUT, DELETE | The request could not be understood.                        |
| 401 Unauthorized | GET, POST, PUT, DELETE | User must authorize.                                        |
| 403 Forbidden    | GET, POST, PUT, DELETE | The request has been refused due to authorization failure.  |
| 404 Not Found    | GET, POST, PUT, DELETE | The requested resource could not be found.                  |
| 405 Not Allowed  | GET, POST, PUT, DELETE | The method is not allowed on the resource.                  |
| 500 Server Error | GET, POST, PUT         | Internal server error.                                      |
In the specifications, such responses are indicated, and if a simple response is returned, the term Message is used.
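A minimal, non-normative sketch of such a Message object is shown below as an OpenAPI schema fragment; the property names are assumptions made for this illustration.

```yaml
# Illustrative schema fragment only: an assumed Message object carrying the
# HTTP code, an accompanying message, and the field or object in error.
components:
  schemas:
    Message:
      type: object
      properties:
        code:
          type: integer
          description: The HTTP response code.
          example: 404
        message:
          type: string
          description: An accompanying message for the HTTP response code.
          example: The requested resource could not be found.
        field:
          type: string
          description: The field or object where the error occurred.
```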
Resources
Timestamps can be used in conjunction with any server-side implementation of the interfaces. It can be useful to return information about when a particular resource was created, updated, or accessed. To simplify the specifications in this document, a timestamp is not explicitly listed as part of each resource, but it can be assumed that one may be added as part of the service implementation. To obtain an example timestamp, a simple get function is provided.
{include=./spec/timestamp.md}
As part of services an identity often needs to be specified. In addition, such persons [@www-eduperson] are often part of groups. Thus, three important terms related to the identity are distinguished as follows:
- Organization: The information representing an Organization that manages a Big Data Service (@sec:spec-organization)
- Group: A group that a person may belong to that is important to define access to services (included in @sec:spec-organization)
- User: The information identifying the profile of a person (@sec:spec-user)
{include=./spec/organization.md}
{include=./spec/user.md}
{include=./spec/account.md}
### Public Key Store {#sec:spec-publickeystore}
{include=./spec/publickeystore.md}
#### publickeystore.yaml
```{include=./spec/publickeystore.yaml}
```
{include=./spec/alias.md}
{include=./spec/variables.md}
{include=./spec/default.md}
{include=./spec/filestore.md}
{include=./spec/replica.md}
{include=./spec/database.md}
{include=./spec/virtualdirectory.md}
{include=./spec/virtualcluster.md}
{include=./spec/non.md}
{include=./spec/scheduler.md}
{include=./spec/queue.md}
This section summarizes a basic interface specification of virtual machines.
{include=./spec/image.md}
{include=./spec/flavor.md}
{include=./spec/vm.md}
{include=./spec/secgroup.md}
{include=./spec/nic.md}
{include=./spec/containers.md}
{include=./spec/mapreduce.md}
{include=./spec/microservice.md}
{include=./spec/reservation.md}
{include=./spec/stream.md}
{include=./spec/filter.md}
{include=./spec/deployment.md}