Implementation of an LLM-based website classifier, as described in Thomas Daniels's master's thesis.


WebCat requires Python 3.10 or above. Create a new virtual environment to isolate dependencies from other projects:

python -m venv .venv
source .venv/bin/activate

Install the necessary dependencies: pip install -r requirements.txt. For tests and typechecking, also install the dependencies from requirements-dev.txt.

Part of the model input consists of the domain name split into words, for which wordsegment is used. In the wordsegment/ directory, you still need to create the files unigrams.txt and bigrams.txt. For English, the unigrams.txt and bigrams.txt files from the original library can be used. For other languages, they could be derived from datasets such as Web 1T 5-gram.
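For a language without ready-made files, the counts could be derived from any tokenized corpus. The following is a minimal sketch, not part of WebCat: the function names are hypothetical, and the tab-separated "text<TAB>count" layout is an assumption based on the files shipped with the original wordsegment library.

```python
from collections import Counter
from pathlib import Path


def count_ngrams(sentences):
    """Count unigrams and bigrams over an iterable of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        tokens = [t.lower() for t in tokens]
        unigrams.update(tokens)
        # Adjacent token pairs form the bigrams.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def write_counts(unigrams, bigrams, directory):
    """Write count files, most frequent first (assumed format: text<TAB>count)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    with open(directory / "unigrams.txt", "w", encoding="utf-8") as f:
        for word, n in unigrams.most_common():
            f.write(f"{word}\t{n}\n")
    with open(directory / "bigrams.txt", "w", encoding="utf-8") as f:
        for (first, second), n in bigrams.most_common():
            f.write(f"{first} {second}\t{n}\n")
```

Calling write_counts(*count_ngrams(corpus), "wordsegment") would then place both files in the expected directory, where corpus is any iterable of token lists.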

Setup container

It is possible to run WebCat inside a container, which starts a webserver. To access this webserver, the relevant port needs to be published. It may also be a good idea to mount the folder that contains the generated Parquet files.

You need to install the dependencies from requirements-webserver.txt to use the webserver container.


# Build the container
docker build -t webcat-server -f Containerfile .

# Start the container
# To pass all GPUs into the container, add "--gpus all" to the options below
docker run --rm \
    -v ./data/models:/data/models \
    -v ./data/hdf5_files:/data/hdf5_files \
    -v ./data/parquet_files_in:/data/parquet_files_in \
    -v ./data/parquet_files_out:/data/parquet_files_out \
    -p 8000:8000 \
    webcat-server

The webserver includes integrated documentation, reachable at /docs or /redoc.

It also exposes a metrics endpoint at /metrics.
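To check from Python that the container is up, the endpoints can be probed with the standard library. This is only a sketch: the helper names are hypothetical, and the base URL assumes the container runs locally with port 8000 published as in the command above.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

# Assumption: the container runs on this machine with port 8000 published.
BASE_URL = "http://localhost:8000"


def endpoint_url(path, base_url=BASE_URL):
    """Build the full URL for an endpoint path such as /docs or /metrics."""
    return urljoin(base_url + "/", path.lstrip("/"))


def is_reachable(path, base_url=BASE_URL, timeout=5.0):
    """Return True if a GET request to the endpoint answers with HTTP 200."""
    try:
        with urlopen(endpoint_url(path, base_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection errors as well as HTTP error responses.
        return False
```

For example, is_reachable("/metrics") should return True once the server has started.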

Obtaining or constructing a dataset

Set the following environment variables, or fill the .env file:


Datasets can be loaded from Mercator, using a table (or multiple) that has at least a visit_id (UUID) and, for training/testing data, a label (text) column. Training data can be downloaded and saved like this:

python train OUT_X OUT_Y TABLE1 [TABLE2 ...]

Where OUT_X and OUT_Y are the paths for the output files, saved in Parquet format. TABLE1 is just the name of the table.

Similarly for testing data (if a website has multiple labels from different annotators, use one row per label; visit_id is not expected to be unique):

python test OUT_X OUT_Y TABLE1 [TABLE2 ...]
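Because the test table holds one row per label, rows that share a visit_id presumably end up as one list of labels per website. The sketch below illustrates that grouping on plain tuples; the function names are hypothetical, and the code is not WebCat's own fetching logic.

```python
from collections import defaultdict


def group_labels(rows):
    """Collapse (visit_id, label) rows into {visit_id: [label, ...]}."""
    grouped = defaultdict(list)
    for visit_id, label in rows:
        grouped[visit_id].append(label)
    return dict(grouped)


def is_unanimous(labels):
    """True if all annotators chose the same label for a website."""
    return len(set(labels)) == 1
```

A website labelled "shop" by two annotators would be unanimous, while one labelled "shop" and "blog" would count as controversial.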

If there are labels in this test set that should be ignored, edit FETCH_TEST_SET_IGNORE_LABELS in

Unlabelled data, which can be used for predictions, can be downloaded like this:

python predict OUT_X TABLE1 [TABLE2 ...]

Without Mercator

The format of x values is the same for training, testing, and prediction. The format of y values differs between training and testing.

For the x values, construct a Parquet file with the following columns, with one row per sample:

  • visit_id - a unique textual identifier for a web page snapshot
  • domain_name - domain name of the website (including the TLD)
  • body_text - the document text (title text and body text concatenated - just the text, no HTML)
  • meta_text - the meta description of the web page, empty string if nonexistent
  • external_hosts - a list of external domain names that the web page links to
  • Each numerical feature defined in as NUMERICAL_FEATURES (edit this config variable as desired)

For the y values for training, construct a Parquet file with one column, named label, containing a string with the label of the web page. The rows must be in the same order as the x values.

For the y values for testing, construct a Parquet file with one column, named labels, containing a list of strings with the labels of the web page. This is to support multiple ground truth labels from different human annotators, and a distinction between unanimous and controversial websites is made during evaluation. Even if you have only one label per web page, this column must contain a list. The rows must be in the same order as the x values.
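The files described above could be produced with pandas, for example. This is a minimal sketch under some assumptions: the helper names are hypothetical, pandas and a Parquet engine such as pyarrow must be installed, and the names of the numerical features have to match your own NUMERICAL_FEATURES setting.

```python
# Required x columns and the Python type each value is expected to have.
REQUIRED_X_COLUMNS = {
    "visit_id": str,
    "domain_name": str,
    "body_text": str,
    "meta_text": str,
    "external_hosts": list,
}


def validate_x_row(row, numerical_features=()):
    """Raise ValueError if a row lacks a required column or has a wrong type."""
    for column, expected_type in REQUIRED_X_COLUMNS.items():
        if column not in row:
            raise ValueError(f"missing column: {column}")
        if not isinstance(row[column], expected_type):
            raise ValueError(f"{column} must be of type {expected_type.__name__}")
    for feature in numerical_features:
        if not isinstance(row.get(feature), (int, float)):
            raise ValueError(f"numerical feature missing or not numeric: {feature}")


def write_dataset(x_rows, labels, x_path, y_path, testing=False, numerical_features=()):
    """Write x and y Parquet files with rows in the same order."""
    import pandas as pd  # assumption: pandas and a Parquet engine are installed

    if len(x_rows) != len(labels):
        raise ValueError("x and y must have the same number of rows")
    for row in x_rows:
        validate_x_row(row, numerical_features)
    pd.DataFrame(x_rows).to_parquet(x_path, index=False)
    # Training uses one string per row ("label"); testing uses a list of
    # strings per row ("labels"), even for a single annotator.
    column = "labels" if testing else "label"
    pd.DataFrame({column: labels}).to_parquet(y_path, index=False)
```

For a test set, pass labels as a list of lists, such as [["shop"], ["blog", "shop"]].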


Before training or predicting, the dataset still needs to be preprocessed to obtain inputs that can be fed directly into the model. The inputs are the dataset files from the previous step (Parquet files), and the output is an HDF5 file. The preprocessing step for predictions also requires a trained model.

For training:

python train IN_X IN_Y OUT

To specify the training/validation split, use --split FRACTION. The default is 0.15.
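As an illustration of what such a fraction means, the sketch below holds out a random share of the samples for validation. It assumes that FRACTION is the validation share and that the split is a plain random shuffle; the function name is hypothetical and WebCat's actual splitting logic may differ.

```python
import random


def split_indices(n_samples, validation_fraction=0.15, seed=None):
    """Return (training_indices, validation_indices) after a random shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]
```

With 1000 samples and the default fraction, 150 samples would be held out for validation and 850 used for training.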

For predictions:

python predict IN_X MODEL OUT 


After preprocessing, the training process can be executed with this command, outputting a model file:

python train INPUTS OUT

INPUTS is the path of the preprocessed data (the output of the previous step). OUT is the path where the model should be saved.

Optional arguments:

  • --batch-size (default 24)
  • --epochs (default 4)
  • --learning-rate (default 2e-5)
  • --seed (accepts an integer or the string random, default random)

If your dataset only has two distinct labels, the model is automatically trained as a binary model instead of a multiclass model (which means that AUC-PR instead of F1-score is used during validation, and the prediction output will also include the confidence value between 0 and 1). The two labels are ordered lexicographically and the first one is taken as the negative label and the second one as positive. That is not configurable, but it works out fine if label pairs such as "No"/"Yes", "False"/"True", or "Negative"/"Positive" are chosen.
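The ordering rule can be checked for a label pair before training. This sketch uses Python's default string ordering, which compares code points (so uppercase letters sort before lowercase ones); that WebCat sorts in exactly the same way is an assumption, and the function name is hypothetical.

```python
def binary_label_roles(labels):
    """Return (negative, positive) for a dataset with two distinct labels."""
    distinct = sorted(set(labels))
    if len(distinct) != 2:
        raise ValueError("expected exactly two distinct labels")
    negative, positive = distinct
    return negative, positive
```

For example, the pair "spam"/"ham" would make "ham" the negative label and "spam" the positive one, which may or may not be what you intend.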


python predict DATA MODEL OUT

DATA is the HDF5 file from the preprocessing step, MODEL the path to the trained model, and OUT is the path where the predictions should be stored (as a Parquet file).

If the output should include the entropies of the predictions, use the --entropies flag. This is useful to find the most uncertain websites for active learning.
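For reference, the entropy of a prediction measures how evenly the probability is spread over the classes. The sketch below computes the Shannon entropy with the natural logarithm; the logarithm base used by WebCat is not stated here, so treat that choice as an assumption.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probabilities if p > 0)
```

A confident prediction such as [0.97, 0.01, 0.01, 0.01] has a low entropy, while a uniform one such as [0.25, 0.25, 0.25, 0.25] has the maximum entropy for four classes.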

A class distribution of the generated predictions can be printed using:

python distribution PREDICTIONS
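Conceptually, a class distribution is a count per predicted label together with its share of the total. A minimal sketch on a plain list of labels follows; the function name is hypothetical and the output format of the actual command may differ.

```python
from collections import Counter


def class_distribution(predictions):
    """Map each label to (count, share of total), most frequent first."""
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}
```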


Testing the model takes place in two steps:

  • Generating predictions on a test set, as explained in the previous section.
  • Comparing those predictions to the real labels.

For a multiclass model, this comparison can be done using


This prints an overview of the performance of the model. TRUE_Y is the path to the Parquet file containing the true labels. MODEL1_PRED is the result of predict. It's possible to compare the results of two different models by passing the predictions of a second model as well, for the same set of true labels.

This script gives a detailed breakdown of the performance per class. Pass the --no-details flag to only return a summary.

A confusion matrix can be plotted using:


For a binary model, use instead:


A precision-recall curve can be plotted using:

python TRUE_Y MODEL_PRED --plot


The predictions of three models can be combined (by majority vote) like this:

python combine OUT P1 P2 P3

Note: This assumes that the order of the websites is the same in all three files! (If you used predict three times on the same input data, this will be the case.)

If you only want to keep predictions that have a majority (so, delete those where all three models predicted a different class), pass the --delete-if-no-majority flag. This is useful for the self-distillation process, which can be performed using the following steps:

  • Train 3 models from the manually labelled training set, each with a different seed (or all three randomly seeded).
  • Use these models to generate predictions on an unseen set of websites, of the same size as the training set.
  • Combine the resulting 3 sets of predictions using combine with --delete-if-no-majority.
  • Use those predictions as additional training data (push them to a Mercator table and re-do the training process starting from "Obtaining or constructing a dataset").
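The majority vote and the effect of --delete-if-no-majority can be sketched on plain lists of labels as follows. The function name is hypothetical. What the actual command does when all three models disagree and the flag is not passed is not specified here; this sketch keeps the first model's prediction in that case, purely as an assumption.

```python
from collections import Counter


def combine(p1, p2, p3, delete_if_no_majority=False):
    """Combine three aligned prediction lists into (index, label) pairs."""
    if not (len(p1) == len(p2) == len(p3)):
        raise ValueError("all three prediction lists must have the same length")
    combined = []
    for index, votes in enumerate(zip(p1, p2, p3)):
        label, count = Counter(votes).most_common(1)[0]
        if count >= 2:
            # At least two of the three models agree.
            combined.append((index, label))
        elif not delete_if_no_majority:
            # Assumption: fall back to the first model when all three disagree.
            combined.append((index, votes[0]))
    return combined
```

The index is kept in the output so that rows dropped for lack of a majority can still be matched back to their websites.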

Active learning

  • Generate the entropies of predictions on a large unseen dataset with predict --entropies
  • Sort the results with python sort IN OUT [--take N], so the most uncertain ones are at the top of the list
  • Manually label those visits
  • Re-do the training process with the extra labels
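The sorting step amounts to ranking predictions by descending entropy and keeping the top N. A minimal sketch on a list of dictionaries follows; the function name and the key name "entropy" are hypothetical and may not match the column names in the actual prediction files.

```python
def most_uncertain(rows, take=None):
    """Sort rows by descending entropy and optionally keep the first `take`."""
    ranked = sorted(rows, key=lambda row: row["entropy"], reverse=True)
    return ranked if take is None else ranked[:take]
```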

Using a different model (architecture)

To use a different model than the default XLM-RoBERTa BASE model, take these steps:

  • Edit the PRETRAINED_MODEL variable in
  • Change the necessary class names in to use the correct implementation from the transformers library. This is not needed if you want to use xlm-roberta-large instead of xlm-roberta-base but it is necessary if you use a different architecture.


