BLISS is a dataset for testing the generalization capabilities of artificial neural network models for language induction. The benchmark score reflects how well a model generalizes given how little data it was trained on.
This repository contains the datasets and data generation scripts for training and testing a model on BLISS.
For the full method and specs, see the paper Benchmarking Neural Network Generalization for Grammar Induction.

The benchmark covers the following formal languages:
- aⁿbⁿ
- aⁿbⁿcⁿ
- aⁿbⁿcⁿdⁿ
- aⁿbᵐcⁿ⁺ᵐ
- Dyck-1
- Dyck-2
Please use the following citation if you use the datasets in your work:
```bibtex
@inproceedings{Lan_Chemla_Katzir_2023,
  title={Benchmarking Neural Network Generalization for Grammar Induction},
  author={Lan, Nur and Chemla, Emmanuel and Katzir, Roni},
  booktitle={Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)},
  pages={131--140},
  year={2023}
}
```
Following Gers & Schmidhuber (2001), all sequences start and end with the symbol `#`. This makes it possible to test for strict acceptance/rejection.

All files contain strings surrounded with `#` on both sides. Inputs and targets need to be trimmed accordingly.
Example:

| aⁿbⁿ | |
| --- | --- |
| Input string | `#aaabbb` |
| Target string | `aaabbb#` |
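As a minimal illustration (not part of the repository's code), one way to derive the input and target strings from a `#`-delimited line read from a dataset file:

```python
# Minimal sketch: derive input/target strings from a '#'-delimited line.
line = "#aaabbb#"          # as stored in the dataset files
input_string = line[:-1]   # "#aaabbb" - drop the trailing '#'
target_string = line[1:]   # "aaabbb#" - drop the leading '#'
```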
All datasets are provided with boolean mask tensors for testing model outputs:
- Deterministic step masks - some languages have deterministic phases where a model's accuracy can be tested. For example, aⁿbⁿ sequences become deterministic after seeing the first `b`. A good model will not assign any probability to `a` after seeing the first `b`.
- Valid symbol masks - languages like Dyck don't have any deterministic parts (a new parenthesis can always be opened), but the set of valid symbols at each time step is limited. For example, for a Dyck-1 sequence, after seeing `#((`, a good model must not assign any probability to the end-of-sequence symbol.
| aⁿbⁿ | |
| --- | --- |
| String example | `aaabbb` |
| Input sequence | `[#,a,a,a,b,b,b]` |
| Target sequence | `[a,a,a,b,b,b,#]` |
| Vocabulary | `{"#": 0, "a": 1, "b": 2}` |
| Deterministic steps mask (boolean) | `[0,0,0,0,1,1,1]` |
| Deterministic step mask shape | `(batch_size, sequence_length)` |
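For illustration, here is a sketch (the variable names and `probs` values below are hypothetical, not part of the benchmark code) of how the deterministic step mask can be used to score a model's next-symbol predictions only at deterministic positions:

```python
import numpy as np

# Hypothetical next-symbol probabilities for one aⁿbⁿ sequence,
# shape (batch_size, sequence_length, vocabulary_size) with vocab {#: 0, a: 1, b: 2}.
probs = np.array([[[0.1, 0.8, 0.1],   # after '#'
                   [0.0, 0.5, 0.5],
                   [0.0, 0.5, 0.5],
                   [0.0, 0.4, 0.6],
                   [0.0, 0.0, 1.0],   # deterministic: must be 'b'
                   [0.0, 0.0, 1.0],   # deterministic: must be 'b'
                   [1.0, 0.0, 0.0]]]) # deterministic: must be '#'

det_mask = np.array([[0, 0, 0, 0, 1, 1, 1]], dtype=bool)  # from test_deterministic_mask.npz
targets = np.array([[1, 1, 1, 2, 2, 2, 0]])               # target symbol indices

# Accuracy restricted to deterministic steps: the argmax must equal the target there.
correct = probs.argmax(-1) == targets
print(correct[det_mask].mean())  # 1.0 for this toy example
```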
| Dyck-1 | |
| --- | --- |
| String example | `(())()` |
| Input sequence | `[#,(,(,),),(,)]` |
| Target sequence | `[(,(,),),(,),#]` |
| Vocabulary | `{"#": 0, "(": 1, ")": 2}` |
| Valid symbols mask (boolean) | `[[1,1,0], [0,1,1], [0,1,1], [0,1,1], [1,1,0], [0,1,1], [1,1,0]]` |
| Valid symbol mask shape | `(batch_size, sequence_length, vocabulary_size)` |
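Similarly, a sketch (hypothetical variable names, not the official evaluation code) of how the valid symbol mask can be used to measure how much probability a model leaks onto invalid symbols:

```python
import numpy as np

# Valid-symbol mask as in the table above, shape
# (batch_size, sequence_length, vocabulary_size) with vocab {#: 0, (: 1, ): 2}.
valid_mask = np.array([[[1, 1, 0],
                        [0, 1, 1],
                        [0, 1, 1],
                        [0, 1, 1],
                        [1, 1, 0],
                        [0, 1, 1],
                        [1, 1, 0]]], dtype=bool)

probs = np.full(valid_mask.shape, 1 / 3)  # placeholder for model output probabilities

# Probability mass assigned to invalid symbols at each step; a good model keeps this near 0.
invalid_mass = np.where(valid_mask, 0.0, probs).sum(axis=-1)
print(invalid_mass)  # shape (batch_size, sequence_length)
```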
Each folder in `datasets` has the following structure:

`<language_name>`
- `train_<batch_size>_p_<prior>_seed_<seed>.txt.zip` – train set of size `batch_size`, sampled using probability `prior` and the random `seed`.
- `test.txt.zip` – first 15,000 strings of the language, sorted by length. aⁿbᵐcⁿ⁺ᵐ is sorted by `n+m` values. Dyck languages are sorted by length + lexicographically.
- `preview.txt` – first 10 strings of the language.
- `test_deterministic_mask.npz` – boolean mask for deterministic time steps, for relevant languages (all but the Dyck languages). Shape: `(batch_size, sequence_length)`.
- `test_valid_next_symbols.npz` – boolean mask for valid next symbols, for the Dyck languages. Shape: `(batch_size, sequence_length, vocabulary_size)`.
Load the `npz` mask files using: `np.load(filename)["data"]`
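For example (the path below is illustrative; actual folder names follow the structure described above):

```python
import numpy as np

# Illustrative path to one of the mask files.
det_mask = np.load("datasets/an_bn/test_deterministic_mask.npz")["data"]
print(det_mask.shape)  # (batch_size, sequence_length)
```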
🚨 The password to all zip files is `1234`. Why?
To generate new training data using a different seed, prior, or batch size, run:

```bash
python generate_dataset.py --lang [language-name] --seed [seed] --prior [prior]
```

Example:

```bash
python generate_dataset.py --lang an_bn --seed 100 --prior 0.3
```
To prevent test set contamination by large language models that train on crawled data and then evaluate on it, all dataset files except previews are zipped and password-protected. The password to all zip files is `1234`.
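For example, the zipped text files can be read directly with Python's standard `zipfile` module (the path below is illustrative):

```python
import zipfile

# Illustrative path; the password for all archives is "1234".
with zipfile.ZipFile("datasets/an_bn/test.txt.zip") as archive:
    name = archive.namelist()[0]
    text = archive.read(name, pwd=b"1234").decode("utf-8")

print(text.splitlines()[:3])  # first few '#'-delimited strings
```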
Each dataset folder contains a `preview.txt` file for easy inspection of the data.
- Python ≥ 3.5
Quick setup:

```bash
pip install -r requirements.txt
```