Quick Start to Registering Data
Using herd is all about registering and processing data. This tutorial outlines the steps required to start registering data and provides background and examples. Let's start with terminology. Details are available on our terminology page, but for the purposes of this tutorial we should establish:
- Storage Platform - a platform that has the ability to store data, e.g. S3
- Storage - a named instance and location of some Storage Platform, e.g. 'S3_Storage'
- Namespace - a designation for organization and ownership of objects within the catalog, e.g. 'Application_A'
- Business Object Definition (BDef) - the name and associated meta-data definition of an object, e.g. 'Orders'. We will ultimately register data against this object definition.
- Business Object Format - information required to understand the Business Object Data; includes file type, schema, etc.
- Business Object Data (BData) - a representation of actual data files associated with an object. Includes a format definition and contains references to files in some Storage.
The steps in this tutorial include setup prerequisites, one-time activities such as defining formats and object definitions, and finally registration of actual data objects.
This tutorial includes examples of using REST calls. These examples use cURL, the freely available command-line tool for making HTTP requests. You can also use browser-based tools such as Postman or Advanced REST Client.
All cURL examples have a placeholder your_hostname, which you should replace with the appropriate value for your environment. Additionally, if you do not have xmllint in your environment, you can omit the | xmllint --format - at the end of each command. Everything will run fine, but the response output will not be formatted.
If there is a body to the REST call, the cURL command will reference a file which appears above the cURL command in the example. All examples in this tutorial are in XML, with the typical XML declaration preamble omitted to keep them readable. All REST interfaces also accept JSON requests, as documented on our API Usage page.
The following must be complete prior to using this tutorial:
- Demo Install steps should be complete
Let's start by using the REST interfaces to create the highest-level organizational structure and register a Storage instance where we'll write files and register data. These are operations that are only required when you first instantiate distinct Storage instances or Namespaces.
First, let's create the named Storage instance that everyone will use to register data objects. You may decide to have multiple named Storages in your environment for various reasons such as storage-level permissions or encryption schemes. For now, let's add 'S3_Storage' and include a piece of metadata for its S3 bucket name. You should fill in the placeholder name your_bucket_name with the actual bucket name you are using.
Each Storage is associated with a Storage Platform. In this example we are using the 'S3' Storage Platform that was created during the Demo Install process.
Add Storage XML body 'storage.xml':
<storageCreateRequest>
  <name>S3_Storage</name>
  <storagePlatformName>S3</storagePlatformName>
  <attributes>
    <attribute>
      <name>bucket.name</name>
      <value>your_bucket_name</value>
    </attribute>
  </attributes>
</storageCreateRequest>
Add Storage cURL syntax:
cat storage.xml | curl -H "Content-Type: application/xml" -d @- -X POST http://your_hostname/herd-app/rest/storages | xmllint --format -
More details about this service are available in the Add Storage REST documentation.
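If you prefer to generate 'storage.xml' rather than write it by hand, the following is a minimal sketch using Python's standard library. The function name build_storage_request is hypothetical and not part of herd; the element names are taken from the request body above.

```python
import xml.etree.ElementTree as ET


def build_storage_request(name, platform, attributes):
    """Return a storageCreateRequest XML string (hypothetical helper)."""
    root = ET.Element("storageCreateRequest")
    ET.SubElement(root, "name").text = name
    ET.SubElement(root, "storagePlatformName").text = platform
    # Each key/value pair becomes one <attribute> with <name> and <value>.
    attrs = ET.SubElement(root, "attributes")
    for key, value in attributes.items():
        attr = ET.SubElement(attrs, "attribute")
        ET.SubElement(attr, "name").text = key
        ET.SubElement(attr, "value").text = value
    return ET.tostring(root, encoding="unicode")


storage_xml = build_storage_request(
    "S3_Storage", "S3", {"bucket.name": "your_bucket_name"}
)
print(storage_xml)
```

Passing a dictionary keeps the door open for Storages that carry more than one attribute, although this tutorial only needs bucket.name.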
Now it's time to add a Namespace, the top-level organizational structure of the data catalog. You may add as many Namespaces as you like for your desired organizational structure, but let's begin with one for 'Application_A'.
Add Namespace XML body 'namespace.xml':
<namespaceCreateRequest>
  <namespaceCode>Application_A</namespaceCode>
</namespaceCreateRequest>
Add Namespace cURL syntax:
cat namespace.xml | curl -H "Content-Type: application/xml" -d @- -X POST http://your_hostname/herd-app/rest/namespaces | xmllint --format -
More details about this service are available in the Add Namespace REST documentation.
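The cURL pattern used throughout this tutorial (an XML body, a Content-Type header, and a POST to a resource under /herd-app/rest/) can also be expressed in Python. This is only a sketch under assumptions: build_post_request is a hypothetical helper, and the request is built but not sent, so it runs without a herd instance.

```python
import urllib.request


def build_post_request(hostname, resource, xml_body):
    """Build (but do not send) a POST mirroring the tutorial's cURL calls."""
    url = f"http://{hostname}/herd-app/rest/{resource}"
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


namespace_xml = (
    "<namespaceCreateRequest>"
    "<namespaceCode>Application_A</namespaceCode>"
    "</namespaceCreateRequest>"
)
request = build_post_request("your_hostname", "namespaces", namespace_xml)
print(request.get_method(), request.full_url)
```

To actually submit the request against your environment, pass it to urllib.request.urlopen and read the response.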
At this point the higher-level structures are in place and it's time to move on to the world of Business Objects.
Prior to registering BData instances we must register object definitions and formats. These operations are only required the first time you create a new object definition or any time you want to define new formats.
The BDef contains information about the object such as its name, description, Data Provider, and Namespace to which it belongs. For this example, let's use our previously defined entities to construct a BDef with the name 'Definition_B'. The 'EXCHANGE' Data Provider was created by the Demo Install process.
Create BDef XML body 'definition.xml':
<businessObjectDefinitionCreateRequest>
  <namespace>Application_A</namespace>
  <businessObjectDefinitionName>Definition_B</businessObjectDefinitionName>
  <dataProviderName>EXCHANGE</dataProviderName>
  <description>BRIEF DESCRIPTION OF THE OBJECT</description>
</businessObjectDefinitionCreateRequest>
Create BDef cURL syntax:
cat definition.xml | curl -H "Content-Type: application/xml" -d @- -X POST http://your_hostname/herd-app/rest/businessObjectDefinitions | xmllint --format -
More details about this service are available in the Create Business Object Definition REST documentation.
Now that the Business Object is defined, we must define at least one Format before registering data against the object.
The Business Object Format provides all the details needed for a user or application to read and process a particular set of data files (e.g. file format and the internal schema). Each registered Format is a metadata object describing the specifics of the data layout and partitioning for a previously registered Business Object Definition.
Multiple Formats registered against the same BDef have different Usages to distinguish themselves (e.g. 'SRC' or 'INCOMING' for source data or 'PRC' for processed data). There is no pre-defined list of Usages, so the division of formats is entirely in your control. For this example, let's use the BDef we just created and make a Format with the Usage 'SRC', and partition key name 'MARKET_CODE'. Note that the Demo Install process created a few File Types including 'GZ' used in this example.
The result is version 0 of the Format. Revising this particular Format will create new Format instances with incremented version numbers.
Add Business Object Format XML body 'format.xml':
<businessObjectFormatCreateRequest>
  <namespace>Application_A</namespace>
  <businessObjectDefinitionName>Definition_B</businessObjectDefinitionName>
  <businessObjectFormatUsage>SRC</businessObjectFormatUsage>
  <businessObjectFormatFileType>GZ</businessObjectFormatFileType>
  <partitionKey>MARKET_CODE</partitionKey>
</businessObjectFormatCreateRequest>
Add Business Object Format cURL syntax:
cat format.xml | curl -H "Content-Type: application/xml" -d @- -X POST http://your_hostname/herd-app/rest/businessObjectFormats | xmllint --format -
We omitted the optional 'schema' component for this example. More details about this service are available in the Add Business Object Format REST documentation.
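Because Formats for the same BDef are distinguished by Usage, it can be handy to generate one request body per Usage. The sketch below is illustrative only: build_format_request is a hypothetical helper, and the 'PRC' Format reusing the 'GZ' File Type and 'MARKET_CODE' partition key is an assumption made for the example, not something this tutorial registers.

```python
import xml.etree.ElementTree as ET


def build_format_request(namespace, definition, usage, file_type, partition_key):
    """Return a businessObjectFormatCreateRequest XML string (hypothetical helper)."""
    root = ET.Element("businessObjectFormatCreateRequest")
    fields = [
        ("namespace", namespace),
        ("businessObjectDefinitionName", definition),
        ("businessObjectFormatUsage", usage),
        ("businessObjectFormatFileType", file_type),
        ("partitionKey", partition_key),
    ]
    # Children are appended in the same order as the tutorial's format.xml.
    for tag, text in fields:
        ET.SubElement(root, tag).text = text
    return ET.tostring(root, encoding="unicode")


# One request body per Usage; 'PRC' is an assumed second Usage.
format_requests = {
    usage: build_format_request(
        "Application_A", "Definition_B", usage, "GZ", "MARKET_CODE"
    )
    for usage in ("SRC", "PRC")
}
print(format_requests["SRC"])
```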
Now that the Business Object Format is defined, we can proceed to register Business Object Data!
In most organizations, the previous steps occur infrequently compared to registering Business Object Data instances. Consider an example where you process data related to market exchanges. You create some BDefs and Formats before you begin any processing. On a regular basis you would register new Business Object Data instances belonging to a specified Format and partition them based on the key you named when you registered the Format. Our ongoing example uses 'MARKET_CODE' as the partition key, but it could be any quality that creates distinct partitions, such as the dates you received the data.
The BData represents actual data stored in a set of files. Each BData instance is identified by a primary partition value (required), optional sub-partition values, and a previously registered Business Object Format that provides all the details needed to read and process a particular set of data files.
Our example below will combine all the elements we've defined so far, most notably in the naming scheme for the files. Let's register a BData instance with the primary partition value 'Exchange_C' and Format version 0. We will use the Storage 'S3_Storage' to define a Storage Unit that belongs to the BData instance. This example assumes you have already uploaded the files 'market_data_1.gz' and 'market_data_2.gz' to the correct file paths in your S3 bucket.
Our request creates version 0 of the BData instance. As with Formats, revising this BData will create new BData instances with incremented version numbers.
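The file paths used in the registration request appear to combine the Namespace, Data Provider, BDef name, Usage, File Type, Format version, BData version, and partition key/value. The sketch below reproduces that pattern as inferred from this tutorial's example paths only; it is an assumption, not herd's documented naming rule, and build_file_path is a hypothetical helper.

```python
def build_file_path(namespace, data_provider, definition, usage, file_type,
                    format_version, data_version, partition_key,
                    partition_value, file_name):
    """Assemble a file path following the pattern of the tutorial's example.

    Assumption: catalog names are lowercased while the partition value
    keeps its original case, as in the example paths.
    """
    parts = [
        namespace.lower(),
        data_provider.lower(),
        definition.lower(),
        usage.lower(),
        file_type.lower(),
        f"schm-v{format_version}",   # Format (schema) version
        f"data-v{data_version}",     # BData version
        f"{partition_key.lower()}={partition_value}",
        file_name,
    ]
    return "/".join(parts)


path = build_file_path("Application_A", "EXCHANGE", "Definition_B", "SRC", "GZ",
                       0, 0, "MARKET_CODE", "Exchange_C", "market_data_1.gz")
print(path)
# application_a/exchange/definition_b/src/gz/schm-v0/data-v0/market_code=Exchange_C/market_data_1.gz
```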
Add Business Object Data XML body 'data.xml':
<businessObjectDataCreateRequest>
  <namespace>Application_A</namespace>
  <businessObjectDefinitionName>Definition_B</businessObjectDefinitionName>
  <businessObjectFormatUsage>SRC</businessObjectFormatUsage>
  <businessObjectFormatFileType>GZ</businessObjectFormatFileType>
  <businessObjectFormatVersion>0</businessObjectFormatVersion>
  <partitionKey>MARKET_CODE</partitionKey>
  <partitionValue>Exchange_C</partitionValue>
  <storageUnits>
    <storageUnit>
      <storageName>S3_Storage</storageName>
      <storageFiles>
        <storageFile>
          <filePath>application_a/exchange/definition_b/src/gz/schm-v0/data-v0/market_code=Exchange_C/market_data_1.gz</filePath>
          <fileSizeBytes>9511</fileSizeBytes>
          <rowCount>1000</rowCount>
        </storageFile>
        <storageFile>
          <filePath>application_a/exchange/definition_b/src/gz/schm-v0/data-v0/market_code=Exchange_C/market_data_2.gz</filePath>
          <fileSizeBytes>263</fileSizeBytes>
          <rowCount>20</rowCount>
        </storageFile>
      </storageFiles>
    </storageUnit>
  </storageUnits>
</businessObjectDataCreateRequest>
Add BData cURL syntax:
cat data.xml | curl -H "Content-Type: application/xml" -d @- -X POST http://your_hostname/herd-app/rest/businessObjectData | xmllint --format -
More details about this service are available in the Register Business Object Data REST documentation.
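The fileSizeBytes values in 'data.xml' have to match the files you uploaded, so it may help to derive them from local copies instead of typing them in. The following is a minimal sketch, assuming you still have the files on disk and already know each row count; build_storage_files is a hypothetical helper, and the temporary file below is only a stand-in for a real 'market_data_1.gz'.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_storage_files(prefix, local_files):
    """Build a storageFiles element from (local_path, row_count) pairs."""
    storage_files = ET.Element("storageFiles")
    for local_path, row_count in local_files:
        entry = ET.SubElement(storage_files, "storageFile")
        file_name = os.path.basename(local_path)
        ET.SubElement(entry, "filePath").text = f"{prefix}/{file_name}"
        # File size is read from disk so it cannot drift from the real file.
        size = os.path.getsize(local_path)
        ET.SubElement(entry, "fileSizeBytes").text = str(size)
        ET.SubElement(entry, "rowCount").text = str(row_count)
    return storage_files


prefix = ("application_a/exchange/definition_b/src/gz/"
          "schm-v0/data-v0/market_code=Exchange_C")

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in file: 9511 placeholder bytes, not real gzip content.
    local_path = os.path.join(workdir, "market_data_1.gz")
    with open(local_path, "wb") as handle:
        handle.write(b"\0" * 9511)
    element = build_storage_files(prefix, [(local_path, 1000)])
    storage_files_xml = ET.tostring(element, encoding="unicode")

print(storage_files_xml)
```

The resulting storageFiles element can be placed inside the storageUnit of the request body shown above.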
With the registration of object data, this tutorial is complete. Stay tuned for tutorials on future topics including data availability queries, DDL generation, cluster management, and job orchestration.