how to contribute
CAIDA's catalog has the following main objects: Datasets, Software, Papers, Media, Recipes, and Groups. It also has supporting objects: licenses, authors, and venues.
- A Dataset represents a static collection of information, often in the form of flat files.
- Software provides a way to access a dataset (API), process a dataset, or present a human interface to data (UI).
- Papers include both published and unpublished papers.
- Media include presentations, images, videos, and interactive visualizations.
- A Recipe is a write-up of some useful bit of information and often includes bits of code.
- A Group is a higher-level grouping of other objects.
Any object can link to any other object. All objects can be tagged.
Each object has a type and an id (we call this the "short id" in this document). These can be combined to make a "full id" of the format "type:id" (type + : + id). Each full id must be unique. The short id is only used in the object's source JSON. All other instances in the catalog will use the full id.
For example, for CAIDA AS Rank the type is "software" and the short id is "asrank". The catalog refers to it by its full id, "software:asrank".
A valid short id includes only lowercase characters (a-z), numbers (0-9), or underscores (_).
import re

# Create a full id, replacing each run of illegal (non-alphanumeric) characters with an underscore.
re_id_illegal = re.compile(r"[^a-zA-Z\d]+")

def id_create(type_, name):
    name = re_id_illegal.sub("_", name)
    # Trim leading and trailing underscores.
    name = name.strip("_")
    id_ = type_ + ":" + name
    return id_.lower()
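The short id rule can also be checked directly. The following is a minimal sketch, not part of the catalog scripts: the names `is_valid_short_id` and `split_full_id` are hypothetical helpers introduced here for illustration, and the only rules they encode are the ones stated above (lowercase a-z, 0-9, underscore, and the "type:id" format).

```python
import re

# Hypothetical helpers for illustration; they are not part of the catalog scripts.
SHORT_ID_RE = re.compile(r"[a-z0-9_]+")

def is_valid_short_id(short_id):
    """Return True if short_id uses only a-z, 0-9, and underscore."""
    return SHORT_ID_RE.fullmatch(short_id) is not None

def split_full_id(full_id):
    """Split "type:id" into (type, short_id); raise ValueError if malformed."""
    type_, sep, short_id = full_id.partition(":")
    if not sep or not type_ or not is_valid_short_id(short_id):
        raise ValueError(f"not a valid full id: {full_id!r}")
    return type_, short_id
```

For instance, `split_full_id("software:asrank")` separates the AS Rank full id back into its type and short id, while a name such as "AS-Rank" fails the short id check because of the uppercase letters and the hyphen.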
Except for Recipes (and CAIDA papers/presentations, see below), all objects exist as a single JSON file in the sources/(type) directory (e.g. sources/software). The type and file name need to match the type and short_id respectively.
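Because the directory and file name encode the type and short id, the full id of a JSON object can be read off its path. Here is a small sketch of that mapping; `full_id_from_path` is a hypothetical name, and the sketch assumes the `sources/<type>/<short_id>.json` layout described above and nothing else about the repository.

```python
from pathlib import PurePosixPath

def full_id_from_path(path):
    """Derive "type:short_id" from a path like sources/<type>/<short_id>.json."""
    p = PurePosixPath(path)
    # Assumed layout: the last three components are sources/<type>/<file>.json
    if p.suffix != ".json" or len(p.parts) < 3 or p.parts[-3] != "sources":
        raise ValueError(f"unexpected object path: {path!r}")
    return f"{p.parts[-2]}:{p.stem}"

print(full_id_from_path("sources/software/asrank.json"))
```

A check like this could be compared against the type and id stored inside the file to catch a misnamed file before building.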
Each Recipe is stored as a directory with the same name as its short_id, with the content written in Markdown in that directory's Readme.md.
More details are in how to contribute a recipe.
For now, CAIDA Papers and parts of Media (i.e., CAIDA Presentations) are stored and maintained in the CAIDA Publications Database (PubDB). Their JSON files are created programmatically from PubDB and held in catalog-data in /data/pubdb*.json, so no JSON file should be created for them by hand at this time.
If you find a CAIDA paper/presentation that is not in PubDB, please email [email protected]; they should add the paper/presentation through PubDB and generate the JSON. Make sure to link your objects to these papers and presentations using full ids, which you can derive from the year and directory in the URL:
- https://www.caida.org/publications/papers/2020/policy_challenges_mapping_internet/ => paper:2020_policy_challenges_mapping_internet
- https://www.caida.org/publications/presentations/2020/kismet_nsfexpo/ => media:2020_kismet_nsfexpo
- https://www.caida.org/publications/presentations/2019/ioda-np_paridine_final/ => media:2019_ioda_np_paridine_final
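The URL-to-id mapping in these examples can be sketched in a few lines. This is an illustration inferred only from the three examples above, not the PubDB export code: `pubdb_full_id` and `KIND_TO_TYPE` are hypothetical names, and the sketch assumes URLs always have the shape `/publications/<papers|presentations>/<year>/<directory>/`.

```python
import re
from urllib.parse import urlparse

# Mapping inferred from the examples above; other URL kinds are not covered.
KIND_TO_TYPE = {"papers": "paper", "presentations": "media"}

def pubdb_full_id(url):
    """Derive a full id from a CAIDA publications URL (year + directory)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected shape: publications/<kind>/<year>/<directory>
    if len(parts) != 4 or parts[0] != "publications" or parts[1] not in KIND_TO_TYPE:
        raise ValueError(f"unexpected publications URL: {url!r}")
    kind, year, directory = parts[1:]
    # Same idea as id_create: runs of illegal characters become one underscore.
    name = re.sub(r"[^a-z0-9]+", "_", f"{year}_{directory}".lower()).strip("_")
    return f"{KIND_TO_TYPE[kind]}:{name}"

print(pubdb_full_id("https://www.caida.org/publications/presentations/2019/ioda-np_paridine_final/"))
```

Note how the hyphen in `ioda-np_paridine_final` becomes an underscore, matching the third example.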
All links are bidirectional, regardless of the object in which they are specified. A link in B to C will also create a link from C to B.
Version 1 of the UI does not support link labels, but it would be nice to include them now if possible for future use. There are three ways to specify a link label:
- label
- to_label
- from_label
The label is assumed to be bidirectional, while the from_label and to_label allow a different label to be used depending on the direction the link is read.
{
    "from": "dataset:as2org",
    "to": "dataset:asrank",
    "from_label": "used to create",
    "to_label": "created from"
}
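To make the bidirectional reading concrete, the sketch below expands one link into the two directions it implies. It is an interpretation rather than the catalog's implementation: `link_views` is a hypothetical helper, and it assumes that from_label is the label read from the "from" object toward the "to" object, that to_label is the label read in the opposite direction, and that a plain label is used for any direction without its own label.

```python
def link_views(link):
    """Return (source, target, label) for each direction of a link."""
    shared = link.get("label")  # bidirectional label, if given
    return [
        # Assumption: from_label is read from the "from" object toward "to".
        (link["from"], link["to"], link.get("from_label", shared)),
        # Assumption: to_label is read from the "to" object back toward "from".
        (link["to"], link["from"], link.get("to_label", shared)),
    ]

example = {
    "from": "dataset:as2org",
    "to": "dataset:asrank",
    "from_label": "used to create",
    "to_label": "created from",
}
for source, target, label in link_views(example):
    print(source, label, target)
```

Under these assumptions the example reads as "dataset:as2org used to create dataset:asrank" in one direction and "dataset:asrank created from dataset:as2org" in the other.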
For a refresher on Git branching, read Branching in a Nutshell.
- Select a name for your update. It could cover a single file or a group of files.
- Create a branch (internal user) or a fork (external user) for your update.
- Create an issue for your update.
- Update or create the files in place as necessary.
- Check that your files compile:
  - Run python3 scripts/data-build.py in the root of this repo. It should end by "writing" the compiled JSON files.
- When you are finished, post a request for review on your issue.
- Once it is reviewed, make a pull request against master (a request to merge your branch/fork into master).
- Someone will review the pull request.
- Use the makefile to check for error messages. These will be missing ids or bad JSON.
# In this example two recipes have missing ids for dataset:ipv6_dnsnames_dataset and dataset:as_classification.
> make
(lots of output that you can ignore)
error sources/recipe/how_to_annotate_an_ark_traceroute_with_hostnames/Readme.md missing id dataset:ipv6_dnsnames_dataset
error sources/solution/how_to_parse_as_classification/README.md missing id dataset:as_classification
- Resolve conflicts from the merge with master, merge, remove the branch, and create a pull request against v1.
- Someone will review the pull request.
- Merge with v1.
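When the make output is long, the "missing id" errors from the checklist above can be pulled out with a short script. This is a sketch that assumes the error lines have exactly the form shown in the example output (`error <path> missing id <full id>`); `missing_ids` is a hypothetical helper, and other kinds of errors, such as bad JSON, are not handled because their format is not shown here.

```python
import re

# Line format taken from the example make output; other error formats are ignored.
ERROR_RE = re.compile(r"^error (?P<path>\S+) missing id (?P<missing>\S+)$")

def missing_ids(output):
    """Return (source file, missing full id) pairs found in make output."""
    found = []
    for line in output.splitlines():
        m = ERROR_RE.match(line.strip())
        if m:
            found.append((m.group("path"), m.group("missing")))
    return found

sample = """(lots of output that you can ignore)
error sources/recipe/how_to_annotate_an_ark_traceroute_with_hostnames/Readme.md missing id dataset:ipv6_dnsnames_dataset
error sources/solution/how_to_parse_as_classification/README.md missing id dataset:as_classification
"""
for path, full_id in missing_ids(sample):
    print(f"{full_id} is referenced in {path} but not defined")
```

Each reported id needs either a corrected reference in the listed file or a new object with that full id.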
Below are examples of most of the objects, using the AS Rank Group.