Generating a COGS corpus
alexanderkoller edited this page Aug 20, 2022
The reimplemented COGS grammar is available here: https://github.com/coli-saar/cogs-generator-alto
The generator writes the corpus in the variable-free format introduced by Qiu et al. (2022). To generate a corpus, run the following command:
java -cp <alto.jar> de.up.ling.irtg.script.CogsCorpusGenerator [options] <grammar.irtg>
Here <alto.jar> stands for the Alto jar file, and <grammar.irtg> is the reimplemented COGS grammar. The options are as follows:
- --count <N> says that we want to generate a corpus with <N> instances
- --suppress-duplicates says that the same sentence should never be generated twice
- --previous-instances <filename> reads a previously generated corpus from <filename>; if you also choose --suppress-duplicates, the tool guarantees that you won't generate a sentence again that was already part of the old corpus.
- --pp-depth <min>-<max> restricts the PP embedding depth to a minimum of <min> and a maximum of <max>. For instance, write --pp-depth 0-2 to generate instances with PP depth at most two.
- --cp-depth <min>-<max> restricts the CP embedding depth in the same way. For instance, --cp-depth 3-12 generates instances with CP embedding depth three to twelve.
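As a rough illustration of how these options combine, the sketch below wraps the command in a small shell function. The function name cogs_generate and the paths alto.jar and grammar.irtg are placeholders chosen for this example, not names defined by the tool; only the class name and the options listed above come from this page.

```shell
# Placeholder paths (assumptions): point these at your Alto jar and the COGS grammar.
ALTO_JAR="alto.jar"
GRAMMAR="grammar.irtg"

# Hypothetical helper: runs the corpus generator with the given options,
# appending the grammar file as the final argument.
cogs_generate() {
  java -cp "$ALTO_JAR" de.up.ling.irtg.script.CogsCorpusGenerator "$@" "$GRAMMAR"
}

# Example call (uncomment to run):
# 1000 distinct instances with PP embedding depth between 0 and 2.
# cogs_generate --count 1000 --suppress-duplicates --pp-depth 0-2
```

Keeping the jar and grammar paths in variables means a change of location only has to be made in one place.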
The corpus generator prints the new instances to stdout. It prints error messages and a progress report to stderr.
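Because the instances and the diagnostics go to different streams, they can be redirected into separate files. The following sketch generates two corpora that share no sentence: the second run reads the first corpus via --previous-instances. It assumes that the file written from stdout can be passed directly to --previous-instances; the function name, file names, and the .log suffix are hypothetical choices for this example.

```shell
# Placeholder paths (assumptions): adjust to your setup.
ALTO_JAR="alto.jar"
GRAMMAR="grammar.irtg"

# Hypothetical helper: generate_disjoint_corpora <first-file> <second-file> <count>
# Writes instances (stdout) to the given files and the error messages and
# progress report (stderr) to <file>.log.
generate_disjoint_corpora() {
  first="$1"
  second="$2"
  count="$3"

  # First corpus: no sentence occurs twice within it.
  java -cp "$ALTO_JAR" de.up.ling.irtg.script.CogsCorpusGenerator \
    --count "$count" --suppress-duplicates "$GRAMMAR" \
    > "$first" 2> "$first.log" || return 1

  # Second corpus: additionally avoids every sentence of the first corpus.
  java -cp "$ALTO_JAR" de.up.ling.irtg.script.CogsCorpusGenerator \
    --count "$count" --suppress-duplicates --previous-instances "$first" "$GRAMMAR" \
    > "$second" 2> "$second.log"
}

# Example call (uncomment to run):
# generate_disjoint_corpora first.tsv second.tsv 1000
```

Stopping after a failed first run avoids producing a second corpus that was checked against an incomplete file.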