Skip to content

Data Schema

Bradley Huffaker edited this page Dec 22, 2022 · 3 revisions

Schema

Goals:

  • Provide a clean high-level representation of the data in each dataset
  • Provide a method for searching datasets by the [data category](Data Category) of data they contain
  • Provide a method to discover datasets that share a data category, and thus could support joining of them.

Example

We will represent each Dataset as a Schema represented as a collection of relational Tables. This row, and each column in the table may have an optional set of References which indicate what the table or ids in column reference.

Reference

A Reference is made of a Category and optional Namespace. The Category identifies the type of category references (ASN, car, city, IP FLow, etc), while the Namespace classifies the namespace containing the reference's category identifiers (City Names, City Airport codes).

Consider two tables that both contain organization information: "ASes owned by an Organization" |org_id (Organization)| member (ASN)| and "Profits in 2022" |stock symbol (Organization)| name | profits|. The column names alone (org_id, stock symbol) do not reveal that the two tables describe the same type of entity (Organization). The shared Category notation enables the data curator to indicate that both org_id and stock symbol are organization identifiers. They could further annotate the namespace as caida.org_id and stock_market.stock_symobl, which means that if one had a mapping from org_id to stock symbol, one could join the two tables.

Category_id (City.name)

Optionally, one can add further granularity to the Category, using two strings separated by a dot (Category.Category_id). For example, the Category "city" has refinements "city.name" (San Diego, CA) and "city.airport_code" (SAN).

Format (JSON Schema in YAML):

We store each DataTable as a YAML representation of a JSON Schema Object. Each YAML document "---" will represent a separate DataTable as a JSON Schema object.

We use JSONschema's concept of a UI Schema to express how the form should be rendered (while the JSON schema tells what). The UI will only take special notice of the keys defined below.

Features:

  • title: name of the DataTable
  • description: description of the DataTable
  • primaryKeys: set of keys that identify the row
  • category: category identified by the primaryKeys (requires primary keys) [Optional]
  • properties: represented as table columns
  • Property: metadata for a property storied in the DataTable
    • title: overwrites the property key in the UI
    • descriptive: describes information in the property
    • category: category of information in the property [Optional]
    • propertyOrder: provides an ordering of the properties [Optional]
    • primaryKeys: set of keys that identify the object in the property (if property is an object)

Properties that do not contain a propertyOrder will appear after those with propertyOrder, in alphabetic order.

---
title: AS Information
description: Autonomous System Information
category: asn
type: object
properties:
  asn:
    description: Autonomous system
    type: integer
    category: asn
    examples: [23]
    propertyOrder: 1
  country:
    description: ASN's announed prefixes geolocated country
    type: string
    category: country.alpha-3
    examples: [USA]
    propertyOrder: 2

-----------------------------------------
| AS Information (asn)                  |
| asn (asn) |  country (country.alpha-3)| 
| 23        |            USA            |
-----------------------------------------

Nested Properties

If a property is itself an object, the table representation nests the object's properties.

# single column
ui:order
properties:
  asn:
    description: Autonomous system
    type: integer
    category: asn.number
    examples: [23]
  organization:
    description: ASN company's organization
    type: object
    properties:
      org_id:
        description: org_id from CAIDA's org2as database
        type: string
        category: organization.id
        examples: [level3]
        propertyOrder: 1
      country:
        description: Organization's headquater location
        type: string
        category: country.name
        examples: [Italy]
        propertyOrder: 2

--------------------------
| asn |   organizatoin   | 
--------------------------
|     | org_id | country | 
| 23  | level3 |  Italy  | 
--------------------------

Example

YAML

---
title: City Roads
type: object
primaryKeys: [from, to.name] 
category: road
properties:
  from:
    description: city name
    category: city
    type: string
    examples: [San Diego]
    propertyOrder: 1
  to:
    category: city
    type: object
    propertyOrder: 2
    properties:
      name:
        category: city.iata
        description: nearest airport code
        type: string
        examples: [LAX]
        propertyOrder: 1
      size:
        type: string
        examples: [1.5M]
        propertyOrder: 2
  length:
    type: number
    examples: [500]
    propertyOrder: 3

|  City Roads  (road)                                      |
|                    |  To  ( city )              | length |
|  *From*  ( city )  | *name* (city.iata)  | size |        |
| San Diego          | LAX                 | 1.5M | 500    |
* primary keys 

(hide/show)
| name    | description                         |
| From    | The start of the road               |
| To.name | The name of the to city             |
| to.size | the total population of the to city |
| length  | the road length                     |

Two key example

|-----------------------------------------------------------------------------------|
|                             (Road)                                                |
|                               |       (city.region_name)                          |----------|
| from_city (city.iata)         | to_city_name (city.name) | to_state (region.ansi) | distance |
|-------------------------------|--------------------------|------------------------|----------|

---
title: Roads Distances
type: object
properties:
  road:
    category: road
    primaryKeys: [from, to_city_name, to_state]
    propertyOrder: 1
    properties:
      from_city:
        category: city.iata
        examples: [SAN]
        propertyOrder: 1
      to_city:
        category: city.region_name
        type: object
        primaryKeys: [to_city_name, to_state]
        propertyOrder: 2
        properties:
          to_city_name:
            category: city.name
            description: nearest airport code
            type: string
            examples: [Los Angeles]
            propertyOrder: 1
          to_region:
            category: region.ansi
            type: string
            examples: [CA]
            propertyOrder: 2
  distance:
    type: number
    examples: [500]
    propertyOrder: 2