-
Notifications
You must be signed in to change notification settings - Fork 6
Data Schema
Schema
- Provide a clean high-level representation of the data in each dataset
- Provide a method for searching datasets by the [data category](Data Category) of data they contain
- Provide a method to discover datasets that share a data category, and thus could support joining of them.
We will represent each Dataset as a Schema represented as a collection of relational Tables. This row, and each column in the table may have an optional set of References which indicate what the table or ids in column reference.
A Reference is made of a Category and optional Namespace. The Category identifies the type of category references (ASN, car, city, IP FLow, etc), while the Namespace classifies the namespace containing the reference's category identifiers (City Names, City Airport codes).
Consider two tables that both contain organization information: "ASes owned by an Organization" |org_id (Organization)| member (ASN)|
and "Profits in 2022" |stock symbol (Organization)| name | profits|
. The column names alone (org_id, stock symbol) do not reveal that the two tables describe the same type of entity (Organization). The shared Category notation enables the data curator to indicate that both org_id and stock symbol are organization identifiers. They could further annotate the namespace as caida.org_id and stock_market.stock_symobl, which means that if one had a mapping from org_id to stock symbol, one could join the two tables.
Optionally, one can add further granularity to the Category, using two strings separated by a dot (Category.Category_id). For example, the Category "city" has refinements "city.name" (San Diego, CA) and "city.airport_code" (SAN).
We store each DataTable as a YAML representation of a JSON Schema Object. Each YAML document "---" will represent a separate DataTable as a JSON Schema object.
We use JSONschema's concept of a UI Schema to express how the form should be rendered (while the JSON schema tells what). The UI will only take special notice of the keys defined below.
- title: name of the DataTable
- description: description of the DataTable
- primaryKeys: set of keys that identify the row
- category: category identified by the primaryKeys (requires primary keys) [Optional]
- properties: represented as table columns
-
Property: metadata for a property storied in the DataTable
- title: overwrites the property key in the UI
- descriptive: describes information in the property
- category: category of information in the property [Optional]
- propertyOrder: provides an ordering of the properties [Optional]
- primaryKeys: set of keys that identify the object in the property (if property is an object)
Properties that do not contain a propertyOrder will appear after those with propertyOrder, in alphabetic order.
---
title: AS Information
description: Autonomous System Information
category: asn
type: object
properties:
asn:
description: Autonomous system
type: integer
category: asn
examples: [23]
propertyOrder: 1
country:
description: ASN's announed prefixes geolocated country
type: string
category: country.alpha-3
examples: [USA]
propertyOrder: 2
-----------------------------------------
| AS Information (asn) |
| asn (asn) | country (country.alpha-3)|
| 23 | USA |
-----------------------------------------
If a property is itself an object, the table representation nests the object's properties.
# single column
ui:order
properties:
asn:
description: Autonomous system
type: integer
category: asn.number
examples: [23]
organization:
description: ASN company's organization
type: object
properties:
org_id:
description: org_id from CAIDA's org2as database
type: string
category: organization.id
examples: [level3]
propertyOrder: 1
country:
description: Organization's headquater location
type: string
category: country.name
examples: [Italy]
propertyOrder: 2
--------------------------
| asn | organizatoin |
--------------------------
| | org_id | country |
| 23 | level3 | Italy |
--------------------------
---
title: City Roads
type: object
primaryKeys: [from, to.name]
category: road
properties:
from:
description: city name
category: city
type: string
examples: [San Diego]
propertyOrder: 1
to:
category: city
type: object
propertyOrder: 2
properties:
name:
category: city.iata
description: nearest airport code
type: string
examples: [LAX]
propertyOrder: 1
size:
type: string
examples: [1.5M]
propertyOrder: 2
length:
type: number
examples: [500]
propertyOrder: 3
| City Roads (road) |
| | To ( city ) | length |
| *From* ( city ) | *name* (city.iata) | size | |
| San Diego | LAX | 1.5M | 500 |
* primary keys
(hide/show)
| name | description |
| From | The start of the road |
| To.name | The name of the to city |
| to.size | the total population of the to city |
| length | the road length |
|-----------------------------------------------------------------------------------|
| (Road) |
| | (city.region_name) |----------|
| from_city (city.iata) | to_city_name (city.name) | to_state (region.ansi) | distance |
|-------------------------------|--------------------------|------------------------|----------|
---
title: Roads Distances
type: object
properties:
road:
category: road
primaryKeys: [from, to_city_name, to_state]
propertyOrder: 1
properties:
from_city:
category: city.iata
examples: [SAN]
propertyOrder: 1
to_city:
category: city.region_name
type: object
primaryKeys: [to_city_name, to_state]
propertyOrder: 2
properties:
to_city_name:
category: city.name
description: nearest airport code
type: string
examples: [Los Angeles]
propertyOrder: 1
to_region:
category: region.ansi
type: string
examples: [CA]
propertyOrder: 2
distance:
type: number
examples: [500]
propertyOrder: 2