
Corpora

alexanderkoller edited this page Aug 12, 2022 · 6 revisions

This section explains how Alto represents corpora in files. There are two types of corpora: unannotated and annotated.

Unannotated Corpora

Unannotated corpora take the following form:

/// IRTG unannotated corpus file, v1.0
///
/// Alto Lab corpus #2: PTB Section 00, <= 100 characters
/// (exported on 2017-03-31 11:47:04)
///
/// interpretation i: class de.up.ling.irtg.algebra.StringAlgebra

the woman watches the woman
the man watches the man

"///" indicates a comment (you can freely choose the comment symbol -- it is the start of the first non-blank line in the file).

On the first line, the comment symbol must always be followed by "IRTG unannotated corpus file, v1.0".

Before the instances begin, the header must declare every interpretation that appears in the instances of this corpus. The declared interpretations can be any nonempty subset of the interpretations of the grammar with which you want to process the corpus. In the example, we declare an interpretation i over the string algebra.
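As a rough illustration of the header rules, the following Python sketch recovers the comment symbol, the corpus type, and the declared interpretations. It is not part of Alto: the names `parse_header`, `VERSION_LINE`, and `DECLARATION` are hypothetical, and the declaration syntax (`interpretation <name>: class <ClassName>`) is an assumption inferred from the example file.

```python
import re

# Assumed shape of the first non-blank line: <comment symbol> <version text>.
VERSION_LINE = re.compile(r"(.+?)\s*IRTG (unannotated|annotated) corpus file, v1\.0")
# Assumed declaration syntax, inferred from the example header:
#   <comment> interpretation <name>: class <fully.qualified.ClassName>
DECLARATION = re.compile(r"interpretation\s+(\S+)\s*:\s*class\s+(\S+)")


def parse_header(lines):
    """Return (comment_symbol, kind, [(name, algebra_class), ...])."""
    nonblank = [ln.strip() for ln in lines if ln.strip()]
    first = VERSION_LINE.fullmatch(nonblank[0]) if nonblank else None
    if first is None:
        raise ValueError("first non-blank line is not a corpus version line")
    comment, kind = first.group(1), first.group(2)

    declarations = []
    for ln in nonblank[1:]:
        if not ln.startswith(comment):
            break  # first instance line reached: the header is over
        match = DECLARATION.fullmatch(ln[len(comment):].strip())
        if match:  # other comment lines are free text and are skipped
            declarations.append((match.group(1), match.group(2)))
    if not declarations:
        raise ValueError("a corpus must declare at least one interpretation")
    return comment, kind, declarations
```

Applied to the example header, this sketch would report the comment symbol `///`, the type `unannotated`, and the single interpretation `i` with its algebra class.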

The rest of the file contains the instances of the corpus. Each instance consists of one line per declared interpretation, in the order in which the interpretations were declared in the corpus header. The example declares a single interpretation, so each instance is one line and the corpus has two instances.

Blank lines and lines starting with the comment symbol are ignored when Alto reads the corpus.
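The grouping of lines into instances can be sketched in Python as follows. This is a minimal illustration, not Alto's own reader: `read_instances` is a hypothetical helper, and it assumes that the comment symbol and the interpretation names are already known and that leading whitespace before a comment symbol does not matter.

```python
def read_instances(lines, comment, names):
    """Group corpus data lines into one dict per instance.

    lines   -- all lines of the corpus file, header included
    comment -- the file's comment symbol, e.g. "///"
    names   -- declared interpretation names, in declaration order
    """
    if not names:
        raise ValueError("at least one interpretation must be declared")
    data = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and comment lines carry no instance data.
        if line and not line.startswith(comment):
            data.append(line)
    n = len(names)
    if len(data) % n != 0:
        raise ValueError(f"{len(data)} data lines cannot be split into groups of {n}")
    # Consecutive groups of n lines form one instance each.
    return [dict(zip(names, data[i:i + n])) for i in range(0, len(data), n)]
```

For the example corpus, calling `read_instances(lines, "///", ["i"])` would yield two instances, each mapping `i` to one sentence.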

Annotated Corpora

An annotated corpus contains an IRTG derivation tree for each instance, in addition to the values on the interpretations. It looks as follows:

/// IRTG annotated corpus file, v1.0
///
/// Alto Lab corpus #2: PTB Section 00, <= 100 characters
/// (exported on 2017-03-31 17:32:50)
///
/// interpretation string: class de.up.ling.irtg.algebra.WideStringAlgebra
/// interpretation tree: class de.up.ling.irtg.algebra.TreeWithAritiesAlgebra

Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
S(NP-SBJ(NP(NNP(Pierre),NNP(Vinken)),','(','),ADJP(NP(CD('61'),NNS(years)),JJ(old)),','(',')),VP(MD(will),VP(VB(join),NP(DT(the),NN(board)),PP-CLR(IN(as),NP(DT(a),JJ(nonexecutive),NN(director))),NP-TMP(NNP('Nov.'),CD('29')))),'.'('.'))
r28(r10(r3(r1,r2),r4,r9(r7(r5,r6),r8),r4),r26(r11,r25(r12,r15(r13,r14),r21(r16,r20(r17,r18,r19)),r24(r22,r23))),r27)

The first line must always contain "IRTG annotated corpus file, v1.0". You declare interpretations as in the unannotated case above.

Each instance consists of one line per interpretation, followed by one line for an IRTG derivation tree that represents the "correct" analysis of the instance. The derivation tree always comes last in each instance; in the example above, each instance therefore spans three lines. Derivation trees are useful, for example, for maximum likelihood estimation with respect to a given IRTG.
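The layout of annotated instances can be sketched in Python as well. Both helpers below are hypothetical and not part of Alto. `read_annotated_instances` assumes the comment symbol and interpretation names are already known. `rule_counts` assumes that rule labels contain no parentheses, commas, or whitespace, as in the example; counting rule occurrences is shown only as one plausible first step towards frequency-based estimation, not as Alto's procedure.

```python
import re
from collections import Counter


def read_annotated_instances(lines, comment, names):
    """Return a list of (values_by_interpretation, derivation_tree) pairs."""
    data = [ln.strip() for ln in lines
            if ln.strip() and not ln.strip().startswith(comment)]
    size = len(names) + 1  # one line per interpretation, plus the derivation tree
    if len(data) % size != 0:
        raise ValueError(f"{len(data)} data lines cannot be split into groups of {size}")
    instances = []
    for i in range(0, len(data), size):
        # The derivation tree is always the last line of an instance.
        *values, derivation = data[i:i + size]
        instances.append((dict(zip(names, values)), derivation))
    return instances


def rule_counts(derivation):
    """Count how often each rule label occurs in one derivation tree line."""
    # Assumption: labels are maximal runs without '(', ')', ',' or whitespace.
    return Counter(re.findall(r"[^(),\s]+", derivation))
```

In the example instance, the derivation tree uses rule `r4` twice (once for each comma), which `rule_counts` would report alongside a count of one for every other rule.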
