The tag2domain project is a framework for creating mappings of tags (labels, annotations) to domain names. To get an overview over the concepts you can read this blogpost.
Tags are meant to be like little sticky notes that are attached to some entity, for example, a domain name. The tags that belong to a certain topic of interest are grouped together into taxonomies. Additionally, each tag can be associated with a value that further specifies the tag. Within a taxonomy tags can be grouped into categories. For the purpose of this documentation we describe a tag by writing
(entity) taxonomy : category :: tag = value
As an example take the taxonomy "Website Language" that describes the language a website presents itself in. The tags are the languages, categories are language families and values are the confidence that the language has been detected correctly. Some example tags would then be
(domain_example_one.at) Website Languages : Indo-European :: German = confidence-high
(domain_example_two.at) Website Languages : Afro-Asiatic :: Arabic = manually-tagged
The same Website could present itself with multiple languages, so one website can be associated with multiple tags from the same taxonomy. Additionally, the set of languages a homepage is available in could change so each tag has a start timestamp and an end timestamp. The end timestamp may be unspecified to indicate that a property still applies (i.e. its absence has not yet been observed).
This library provides a database schema to maintain such a tagging infrastructure, services that convert individual measurements to tags, and an API that provides access to the gathered data.
Usually, properties are detected in discrete measurements, e.g. a web crawler visits a homepage and determines that a web page is mostly written in english. In the context of tag2domain, measurements are represented by JSON objects that have the following fields:
MEASUREMENT
key | example | description | required |
---|---|---|---|
version | 1 | version of the measurement format | yes |
tag_type | "domain" | tag type - each type corresponds to an intersection table | yes |
tagged_id | 42 | ID of the entity the measurement has been taken of. This ID must exist in the entity table that corresponds to the tag type | yes |
taxonomy | "Website Languages" | taxonomy that has been measured | yes |
producer | "Language Detection Bot" | name of the process that did the measurement | yes |
measured_at | "2020-12-20T12:00:00" | time when the measurement was taken | yes |
measurement_id | "ldb:576894" | optional measurement ID | no |
autogenerate_tags | true | if true tag types that are not already in the taxonomy are inserted. This requires that the allows_autotags column of the taxonomy is set to true. |
no |
autogenerate_values | false | if true values that are not already in the taxonomy are inserted. This requires that the allows_autovalues column of the taxonomy is set to true. |
no |
tags | [tag_1, tag_2, ...] | list of tags that have been found | yes |
Each tag is itself a JSON object:
TAG
key | example | description | required |
---|---|---|---|
tag | "de" | name or ID of the tag to be set | yes |
value | "confidence:high" | value to be set | no |
description | "Language: German" | dsecription of the tag. Required if autogenerate_tags == true | yes if autogenerate_tags is true |
extras | {} | JSON object that contains further details about the tag | yes if autogenerate_tags is true |
tag2domain assumes that a measurement describes a whole taxonomy. This means, that if a tag (e.g. "en") is not in a measurement of the taxonomy "Website Languages" then tag2domain assumes that there is no english language on the website.
Examples of measurements can be found here. The schema for measurements can be found here.
This package provides
- a postgres database schema that implements tag2domain,
- the tag2domain-api, a REST interface that allows querying of the tag2domain database and that (optionally) receives property measurements, and
- msm2tag2domain, a script that reads measurements from a kafka queue or a file and turns them into tag2domain tags.
The basic database schema is shown in the picture below. tag2domain is based on three core tables:
- taxonomy: contains the available taxonomies
- tags: contains the available tags
- taxonomy_tag_val: contains the available values
The tags defined in these tables are assigned to entities using (possibly multiple) intersection tables. In the example below the intersection table is named domain_tags and assigns tags to domains. These domains are themselves rows in a table named domains. This domains table must be maintained by the user (i.e. by some other program) and is not modified by tag2domain.
The glue section combines information from the intersection tables and the rest of the database into a single table that provides data to the API. In addition entities can be filtered by other criteria by setting up appropriate filter tables. See Advanced DB Configuration for details of this configuration.
tag2domain is written in python and uses PostgreSQL databases. The included demos use docker to run the code.
tag2domain is written so that it can be integrated into an existing database. Doing so requires some knowledge of PostgreSQL databases.
To get started clone the repository and navigate to the root directory:
git clone https://gitlab.sbg.nic.at/labs/dwh/tag2domain.git
cd tag2domain/
The three main components of the repository can be found in these subfolders:
- py_tag2domain (
py_tag2domain/
) - a python library that manages tags - tag2domain-api (
tag2domain_api/
) - a REST API for creating and fetching tags - msm2tag2domain (
msm2tag2domain/
) - a service that reads measurements from kafka queue and writes them to the database
To use tag2domain you will likely integrate its database tables into an existing database. A tutorial to do this integration can be found below. To get you started quickly some "all-in-one" demo setups are included that come with a preconfigured database. In this setup you can explore the features that are available and use the configuration files as a template for your own configuration.
The demo setups use docker and docker-compose to run the services. Please first make sure both are set up and working.
The all-in-one setup consists of a mock database that mimics a registry and the tag2domain-api that allows inserting and reading tags. To start the demo move to the root folder of this repository and run the following commands:
# copy the environment file
cp docker/all-in-one-demo/example.env ./.env
# open the .env file with your favourite editor and set the POSTGRES_USER and
# POSTGRES_PASSWORD options
# build the tag2domain-api container
docker-compose -f docker-compose.all-in-one.yml build
# bring up the database and tag2domain-api
docker-compose -f docker-compose.all-in-one.yml up -d
# check that the containers are running (both containers should be Up)
docker-compose -f docker-compose.all-in-one.yml ps
Open http://localhost:8001/docs in your browser (or the equivalent address if you don't run the containers on your local machine) and you should see the API documentation.
To inspect the database tables you can either make the mock database available to the outside of the docker network by modifying docker-compose.all-in-one.yml or you can exec into the db container:
# open a bash process in the db container
docker-compose -f docker-compose.all-in-one.yml exec db bash
# inside the container open psql
# See the .env file for POSTGRES_USER setting
db> psql -U <POSTGRES_USER> tag2domain_mock_db
A good place to start is the /api/v1/meta/taxonomies
endpoint that gives you
an overview over the taxonomies that are available in the mock database. The
/api/v1/domains/bytaxonomy
endpoint with taxonomy colors
can then be used
to retrieve a list of domains with this tag. To retrieve the tags that are set
for domain_test1.at
use the /api/v1/bydomain/
endpoint.
This setup uses the same mock database as the all-in-one setup but measurements are fetched from a kafka topic by the msm2tag2domain service. This setup requires a preexisting kafka installation.
First we have to configure the kafka setup:
# Run these commands from the root folder of the repository
# copy the env file
cp docker/all-in-one-demo/example.env ./.env
# open the .env file with your favourite editor and set the POSTGRES_USER and
# POSTGRES_PASSWORD options
# copy the configuration for msm2tag2domain
cp msm2tag2domain/docker/msm2tag2domain.cfg.example msm2tag2domain/docker/msm2tag2domain.cfg
# open msm2tag2domain/docker/msm2tag2domain.cfg and configure the database
# connection and the parameters for the kafka connection
# build the tag2domain-api and the msm2tag2domain container
docker-compose -f docker-compose.all-in-one-kafka.yml build
# bring up the database, tag2domain-api, and msm2tag2domain
docker-compose -f docker-compose.all-in-one-kafka.yml up -d
# check that the containers are running (all three containers should be up)
docker-compose -f docker-compose.all-in-one-kafka.yml ps
See the measurement examples for a guide on how to submit measurements to the kafka queue.
In practice you will likely want to integrate tag2domain into an existing database so the tags can refer to entities that already exist. In this section we show how tag2domain can be configured to work with an existing database.
As described in a previous section tag2domain can be linked to an existing database table that contains the entities that are to be tagged.
To set up the tag2domain tables first generate a config file:
# copy the example configuration
cp examples/db/db.config.sh.example db.config.sh
# open db.config.sh in your favourite text editor and set the parameters
Now generate the tag2domain tables:
# Load the configuration into your shell environment
source db.config.sh
# Create the core tables
bash db/db_master_script.sh
You can now check your database: there should be a new schema that contains three tables: tags, taxonomy and taxonomy_tag_val.
To generate an intersection table, first specify the entity table the intersection table will refer to and the column that contains the IDs to be referred to:
export TAG2DOMAIN_INTXN_TABLE_NAME=<YOUR INTERSECTION TABLE NAME>
export TAG2DOMAIN_ENTITY_TABLE=<YOUR ENTITY TABLE NAME>
export TAG2DOMAIN_ENTITY_ID_COLUMN=<YOUR ID COLUMN>
# TAG2DOMAIN_ENTITY_NAME_COLUMN is only required if the create_glue.sh script
# is used
export TAG2DOMAIN_ENTITY_NAME_COLUMN=<YOUR NAME COLUMN>
Note, that the entity table must exist prior to creating the intersection
table. Now run the create_intersection_table.sh
script:
bash scripts/db/create_intersection_table.sh
This will generate an intersection table in the tag2domain schema.
To configure the tag2domain services we will need intersection table configurations. Such a configuration can be generated using this command:
bash scripts/db/create_intxn_table_config.sh <TAG TYPE>
<TAG TYPE>
is the tag type name that will later be used to differentiate
between different intersection tables.
If only a single intersection table is used, the glue part of the database can now be generated using this command:
bash scripts/db/create_glue.sh
This script generates a single view name v_unified_tags
that combines the
entity table with the intersection table and the SQL functions that are used by
tag2domain-api. If multiple intersection tables are used one usually has to
tailor the glue tables to the application at hand. See the
Advanced DB Configuration document for details.
This document also covers how domain filters can be defined.
To configure tag2domain three files need to created/edited:
- edit
docker-compose.external-db.yml
that defines the API and msm2tag2domain services - create a .env file.
docker/all-in-one-demo/example.env
is a good place to start. - create a file
docker/external-db/tag2domain.cfg
. A template can be found here:docker/external-db/tag2domain.cfg.example
. The intersection tables are specified in this configuration file. You can either use the scriptscipts/db/create_intxn_table_config.sh
as described in the previous section, or usescripts/db/intersection_table_config.template
as a template.
Tests are provided in the tests/
folder. Most of the tests require a running database:
# copy an example environment file
cp docker/all-in-one-demo/example.env .env
# open .env with your favourite text editor and set the database parameters
# deploy the database in a docker container
docker-compose -f docker-compose.test-db.yml up -d
# check that the container is up and running
docker-compose -f docker-compose.test-db.yml ps
Now that the database is running we can prepare the configuration for the tests and install the required packages:
# copy the example configuration file
cp tests/config/db.cfg.example tests/config/db.cfg
# open tests/config/db.cfg in your favourite text editor and configure the
# database connection
# install the required python packages
pip install -r py_tag2domain/requirements.txt
pip install -r tag2domain_api/requirements.txt
# install pytest
pip install pytest
The tests can now be run using this command:
pytest tests/
There is a global taxonomy list on github, which serves as a place for anyone to propose taxonomies and document them. This list should be constantly growing by community contributions (should this project kick off). The important aspect for us is that every taxonomy is described as machine readable machine-tag format. Hence, we can include other taxonomies into our DB rather easily.
This project was partially funded by the CEF framework