This folder contains the data that forms the basis of the tests in the paper. We consider three different types of data: synthetic data, semi_natural data and natural data. The natural data differs per test and is extracted directly from natural corpora (e.g. it consists of natural sentences containing a particular idiom under consideration). The other two data types are templated; you can find this data in their respective folders. The vocabulary folder contains the vocabulary items that we used to vary the templates.
For our synthetic test data, we have taken inspiration from literature on probing hierarchical structure in language models: we consider the synthetic data generated by Lakretz et al. (2019), which contains a large number of sentences with a fixed syntactic structure and diverse lexical material. We extend the set of templates in the dataset and the vocabulary used, resulting in the following ten templates:
# | Template | Example sentence |
---|---|---|
1 | The Npeople Vtransitive the Nslelite | The poet criticises the king |
2 | The Npeople Adv Vtransitive the Nslelite | The victim carefully observes the queen |
3 | The Npeople P the Nslvehicle Vtransitive the Nslelite | The athlete near the bike observes the leader |
4 | The Npeople and the Npeople Vpltransitive the Nslelite | The poet and the child understand the mayor |
5 | The Nslquantity of Nplpeople P the Nslvehicle Vsltransitive the Nslelite | The group of friends beside the bike forgets the queen |
6 | The Npeople Vtransitive that the Npeople Vplintransitive | The farmer sees that the lawyers cry |
7 | The Npeople Adv Vtransitive that the Npeople Vplintransitive | The mother probably thinks that the fathers scream |
8 | The Npeople Vtransitive that the Nplpeople Vplintransitive Adv | The mother thinks that the fathers scream carefully |
9 | The Npeople that Vintransitive Vtransitive the Nslelite | The poets that sleep understand the queen |
10 | The Npeople that Vtransitive Pro Vsltransitive the Nslelite | The mother that criticises him recognises the queen |
For each of the templates, we generated 3000 sentences.
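As an illustration of how a template can be turned into sentences, here is a minimal sketch, not the script used for the paper. It assumes that every slot name in a template (e.g. `Npeople`) is a key in a vocabulary dictionary; the word lists below are hypothetical and only borrow a few words from the example sentences, whereas the actual items are in the vocabulary folder. The function names `fill_template` and `generate` are also made up for this example.

```python
import random

# Hypothetical vocabulary, borrowed from the example sentences in the table;
# the real word lists live in the vocabulary folder.
VOCAB = {
    "Npeople": ["poet", "victim", "athlete"],
    "Adv": ["carefully", "probably"],
    "Vtransitive": ["criticises", "observes", "understands"],
    "Nslelite": ["king", "queen", "leader"],
}


def fill_template(template, vocab, rng):
    """Replace every slot name in the template with a random vocabulary item."""
    words = []
    for token in template.split():
        if token in vocab:
            words.append(rng.choice(vocab[token]))
        else:
            # Fixed material such as "The" or "the" is copied unchanged.
            words.append(token)
    return " ".join(words)


def generate(template, vocab, n, seed=0):
    """Generate n sentences for one template with a seeded random generator."""
    rng = random.Random(seed)
    return [fill_template(template, vocab, rng) for _ in range(n)]


sentences = generate("The Npeople Adv Vtransitive the Nslelite", VOCAB, 5)
for sentence in sentences:
    print(sentence)
```

With a vocabulary this small the sampled sentences will repeat; the sketch makes no attempt to deduplicate or to handle number agreement between slots.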
In the synthetic data, we have full control over the sentence structure and lexical items, but the sentences are shorter (9 tokens vs. 16 in OPUS) and simpler than is typical for NMT data. To obtain more complex yet plausible test sentences, we employ a data-driven approach: to generate semi-natural data, we use the tree substitution grammar Double DOP (Van Cranenburgh et al., 2016) to obtain noun and verb phrases whose structures frequently occur in OPUS.
To generate the data, we use the following process:
- Sample 100k English OPUS sentences.
- Generate a treebank using the disco-dop library and the discodop parser en ptb command. We used the library’s --fmt bracket option to turn off discontinuous parsing, which the library was originally developed for.
- Compute tree fragments from the resulting treebank (discodop fragments). These tree fragments are the building blocks of a Tree-Substitution Grammar.
- We assume the most frequent fragments to be common syntactic structures in English. To construct complex test sentences, we collect the 100 most frequent fragments containing at least 15 non-terminal nodes for NPs and VPs.
- Select three VP and five NP fragments to be used in our final semi-natural templates. These structures are selected through qualitative analysis for their diversity.
- Extract sentences matching the eight fragments (discodop treesearch).
- Create semi-natural sentences by varying one lexical item and varying the matching NPs and VPs retrieved in the previous step.
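The fragment-selection step in the list above can be sketched as follows. This is an illustrative approximation rather than the code used for the paper: it assumes the fragments are already available as pairs of a bracketed string and a frequency, it counts every opening bracket as one non-terminal node (so POS tags are included), and it takes the top fragments over NPs and VPs together. The function names and the toy fragments are hypothetical.

```python
def root_label(fragment):
    """Return the label of the root node of a bracketed fragment."""
    return fragment.lstrip("(").split()[0].rstrip(")")


def count_nonterminals(fragment):
    """Count non-terminal nodes.

    Assumption: every opening bracket introduces exactly one non-terminal
    node, POS tags included.
    """
    return fragment.count("(")


def select_fragments(fragment_counts, labels=("NP", "VP"), min_nodes=15, top_k=100):
    """Keep the top_k most frequent NP/VP fragments with enough nodes."""
    candidates = [
        (fragment, freq)
        for fragment, freq in fragment_counts
        if root_label(fragment) in labels
        and count_nonterminals(fragment) >= min_nodes
    ]
    # Most frequent fragments first.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]


# Toy input with made-up frequencies; thresholds are lowered so that the
# small fragments below can pass the filter.
toy = [
    ("(NP (DT the) (NN dog))", 50),
    ("(NP (NP ) (PP (IN ) (NP )))", 30),
    ("(VP (VB ) (NP (NP ) (PP (IN ) (NP ))))", 20),
    ("(S (NP ) (VP ))", 90),
]
selected = select_fragments(toy, min_nodes=4, top_k=2)
print(selected)
```

The final choice of three VP and five NP fragments was made by qualitative analysis, so no automatic filter of this kind reproduces it.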
We then embed the extracted NPs and VPs in ten synthetic templates, resulting in the following ten semi-natural templates:
# | Template | Example sentence |
---|---|---|
1 | The Npeople (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) | The woman wants to use the Internet as a means of communication. |
2 | The Npeople (VP (VBP ) (VP (VBG ) (S (VP (TO ) (VP (VB ) (S (VP (TO ) (VP ))))))))) | The men are gonna have to move off-camera. |
3 | The Npeople (VP (VB ) (NP (NP ) (PP (IN ) (NP ))) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))) | The doctors retain 10 % of these amounts by way of collection costs. |
4 | The Npeople reads an article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) | The friend reads an article about the development of ascites in rats with liver cirrhosis. |
5 | The Npeople reads an article about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP )))))) | The teachers read an article about the degree of progress that can be achieved by the industry. |
6 | An article about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP ))))))) is read by the Npeople. | An article about the inland transport of dangerous goods from a variety of Member States is read by the lawyer. |
7 | An article about (NP (NP ) (PP (IN ) (NP (NP ) (, ,) (SBAR (S (WHNP (WDT )) (VP )))))) , is read by the Npeople . | An article about the criterion on price stability , which was 27 % , is read by the child. |
8 | Did the Npeople hear about (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP (NP ) (PP (IN ) (NP )))))))? | Did the friend hear about an inhospitable fringe of land on the shores of the Dead Sea? |
9 | Did the Npeople hear about (NP (NP (DT ) (NN )) (PP (IN ) (NP (NP ) (SBAR (S (WHNP (WDT )) (VP ))))))? | Did the teacher hear about the march on Employment which happened here on Sunday? |
10 | Did the Npeople hear about (NP (NP ) (SBAR (S (VP (TO ) (VP (VB ) (NP (NP ) (PP (IN ) (NP )))))))) ? | Did the lawyers hear about a qualification procedure to examine the suitability of the applicants? |
As for the synthetic data, we generate 3000 samples for each template.
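To illustrate how extracted phrases and a varying lexical item combine into semi-natural samples, here is a minimal sketch under stated assumptions. The `{noun}` and `{phrase}` placeholders, the `embed` function and the short word lists are hypothetical; the two noun phrases are taken from the example sentences in the table. The sketch samples distinct combinations from the Cartesian product of nouns and phrases, which is one possible way to reach a fixed number of samples per template.

```python
import itertools
import random


def embed(template, nouns, phrases, n, seed=0):
    """Sample up to n distinct sentences from all (noun, phrase) pairs."""
    combos = list(itertools.product(nouns, phrases))
    rng = random.Random(seed)
    # Sampling without replacement keeps the generated sentences distinct.
    chosen = rng.sample(combos, min(n, len(combos)))
    return [template.format(noun=noun, phrase=phrase) for noun, phrase in chosen]


# Hypothetical inputs: a few subject nouns and two extracted NPs.
nouns = ["friend", "teacher", "lawyer"]
phrases = [
    "an inhospitable fringe of land on the shores of the Dead Sea",
    "the development of ascites in rats with liver cirrhosis",
]

sentences = embed("Did the {noun} hear about {phrase}?", nouns, phrases, n=4)
for sentence in sentences:
    print(sentence)
```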