Kateryna Tymoshenko edited this page Mar 14, 2017 · 4 revisions

RelTextRank is a flexible Java pipeline for converting pairs of raw texts into structured representations and enriching them with semantic information about the relations between the two pieces of text (e.g., lexical exact match). The pipeline takes a pair of texts as input and produces their structural representations with relational markup. For example, it can represent a question and an answer sentence as syntactico-semantic structures, enrich them with relational information (e.g., links between the question class, the question focus and named entities), and serialize them as training and test files for a tree kernel-based reranking framework. The pipeline generates a number of dependency and shallow chunk-based representations shown to achieve competitive results in previous work. It also enables easy evaluation of the models thanks to its cross-validation facilities.

The pipeline is based on the Apache UIMA technology, which allows for the creation of highly modular applications and analysis of a large volume of unstructured information.

Quick start

To launch the pipeline you need to run one of the System Java classes described in System.

High-level pipeline description

The overall schema of the pipeline is presented below.

[Figure: Pipeline schema]

The basic unit of input to RelTextRank is a text Texti and a list of n texts, Texti1,...,Textin, that should be classified or reranked as relevant or not relevant to Texti. For example, in a question answering setting, Texti would be a question and Textij, j=1,...,n, would be the n candidate answer passages. The output of the system is a file in the SVMLight-TK format containing the relational structures and feature vectors generated from the <Texti, (Texti1,...,Textin)> tuples.
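As a minimal sketch of this input unit, the Java snippet below holds one text together with its candidate list and expands them into (Texti, Textij) pairs. The class and method names (`PairGenerationSketch`, `TextPair`, `toPairs`) are hypothetical and chosen for illustration only; they are not taken from the RelTextRank code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: expands one <Texti, (Texti1..Textin)> tuple into pairs. */
final class PairGenerationSketch {

    /** One (Texti, Textij) pair; the class and field names are illustrative only. */
    static final class TextPair {
        final String text;
        final String candidate;

        TextPair(String text, String candidate) {
            this.text = text;
            this.candidate = candidate;
        }
    }

    /** Pairs the main text with each candidate, preserving the candidate order. */
    static List<TextPair> toPairs(String text, List<String> candidates) {
        List<TextPair> pairs = new ArrayList<>();
        for (String candidate : candidates) {
            pairs.add(new TextPair(text, candidate));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<TextPair> pairs = toPairs(
                "Who wrote Hamlet?",
                List.of("Hamlet was written by Shakespeare.", "Hamlet is set in Denmark."));
        for (TextPair pair : pairs) {
            System.out.println(pair.text + " -> " + pair.candidate);
        }
    }
}
```

In the question answering example, `text` would be the question and each `candidate` one answer passage.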

When launched, the RelTextRank System module first initializes the other modules, such as the UIMA text analysis pipeline responsible for linguistic annotation of input texts, the Experiment module responsible for generating the structural representations enriched with the relational labels, and, finally, the OutputWriter module which generates the output in the SVMLight-TK format.

At runtime, when presented with a <Texti, (Texti1,...,Textin)> tuple, RelTextRank generates (Texti,Textij) pairs and performs the following steps for each pair.

Step 1. Linguistic annotation.

First, we perform linguistic analysis of the input texts using the UIMA pipeline. This subsystem runs a pipeline of UIMA Analysis Engines (AEs), which wrap linguistic annotators (e.g., sentence splitters, tokenizers and syntactic parsers), thus converting input text pairs (Texti,Textij) into UIMA Common Analysis Structures, or CASes, (CASi,CASij). CASes contain the original texts and all the linguistic annotations produced by the AEs. The annotations that AEs produce are defined by a UIMA Type System. In addition, there is an option to persist the produced CASes, so that the annotators are not rerun when a specific document is processed again.
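The idea behind persisting CASes can be illustrated with a small cache keyed by document identifier. This is only a sketch under simplifying assumptions: a plain `String` stands in for a CAS, an in-memory map stands in for persistent storage, and the class `AnnotationCacheSketch` and its methods are hypothetical names, not part of RelTextRank or UIMA.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: reuse stored annotations instead of rerunning annotators. */
final class AnnotationCacheSketch {

    // Stand-in for persisted CASes: document id -> annotated document.
    private final Map<String, String> store = new HashMap<>();
    // Stand-in for the (expensive) chain of analysis engines.
    private final Function<String, String> annotator;
    private int annotatorRuns = 0;

    AnnotationCacheSketch(Function<String, String> annotator) {
        this.annotator = annotator;
    }

    /** Returns the stored result for docId, running the annotator only on a miss. */
    String annotate(String docId, String text) {
        String cached = store.get(docId);
        if (cached != null) {
            return cached;
        }
        annotatorRuns++;
        String annotated = annotator.apply(text);
        store.put(docId, annotated);
        return annotated;
    }

    /** Number of times the annotator was actually invoked. */
    int annotatorRuns() {
        return annotatorRuns;
    }

    public static void main(String[] args) {
        AnnotationCacheSketch cache =
                new AnnotationCacheSketch(text -> "[annotated] " + text);
        cache.annotate("question-1", "Who wrote Hamlet?");
        cache.annotate("question-1", "Who wrote Hamlet?"); // served from the store
        System.out.println("annotator runs: " + cache.annotatorRuns());
    }
}
```

Keying the store by a stable document identifier is what makes reuse possible when the same question or passage appears in several pairs.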

Step 2. Generation of structural representations and feature vectors.

The Experiment module is the core architectural component of the system. It generates the relational structures RelStructi and RelStructij from (CASi,CASij), along with the feature vector representation of the input text pair, FVi,ij. Here, the Projector module generates (RelStructi, RelStructij) and the VectorFeatureExtractor module generates FVi,ij.

See Experiment for the list of Experiment modules and the descriptions of the representations they generate.

See VectorFeatureExtractor for the list of feature extractors that can be used for FVi,ij generation.
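To make the division of labour in Step 2 concrete, here is a toy sketch in which one method plays the role of a projector and another the role of a feature extractor. Everything in it is an assumption made for illustration: the class `ExperimentSketch`, the flat `(S (TOK ...))` trees, the `REL-` prefix used to mark tokens that occur in both texts (a lexical exact match), and the single Jaccard overlap feature. The actual modules operate on CASes and produce much richer structures.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of Step 2: relational trees plus a feature vector for a pair. */
final class ExperimentSketch {

    /** Holds stand-ins for RelStructi, RelStructij and FVi,ij. */
    static final class Result {
        final String treeA;
        final String treeB;
        final double[] features;

        Result(String treeA, String treeB, double[] features) {
            this.treeA = treeA;
            this.treeB = treeB;
            this.features = features;
        }
    }

    /** Lower-cased whitespace tokens of a text, as a set. */
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    /** Projector stand-in: flat tree; tokens shared with the other text get a REL- prefix. */
    static String project(String text, String other) {
        Set<String> otherTokens = tokens(other);
        StringBuilder tree = new StringBuilder("(S");
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            String label = otherTokens.contains(token) ? "REL-TOK" : "TOK";
            tree.append(" (").append(label).append(' ').append(token).append(')');
        }
        return tree.append(')').toString();
    }

    /** Feature extractor stand-in: a single Jaccard token-overlap feature. */
    static double[] extractFeatures(String a, String b) {
        Set<String> union = tokens(a);
        union.addAll(tokens(b));
        Set<String> intersection = tokens(a);
        intersection.retainAll(tokens(b));
        return new double[] {(double) intersection.size() / union.size()};
    }

    /** Combines both stand-ins for one text pair. */
    static Result run(String a, String b) {
        return new Result(project(a, b), project(b, a), extractFeatures(a, b));
    }

    public static void main(String[] args) {
        Result r = run("who wrote hamlet", "shakespeare wrote hamlet");
        System.out.println(r.treeA);
        System.out.println(r.treeB);
        System.out.println(Arrays.toString(r.features));
    }
}
```

Keeping the tree construction and the feature extraction in separate units mirrors the description above, where the two outputs come from independent, swappable modules.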

Step 3. Generation of the output files.

Once all the pairs generated from the <Texti, (Texti1,...,Textin)> tuple have been processed, the OutputWriter module writes them into training/test files. Output strategies with examples are described in System.
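As a rough illustration of what one serialized example might look like, the sketch below formats a label, two trees and a feature vector into a single line. The layout is an assumption: it follows the commonly documented SVMLight-TK convention of introducing each tree with `|BT|`, closing the tree section with `|ET|`, and closing the vector section with `|EV|`, and it should be checked against the SVMLight-TK documentation and the output strategies described in System. The class `SvmLightTkLineSketch` and the method `formatLine` are hypothetical names.

```java
import java.util.Locale;

/** Hypothetical sketch: one simplified SVMLight-TK style line for a text pair. */
final class SvmLightTkLineSketch {

    /**
     * Builds "<label> |BT| treeA |BT| treeB |ET| 1:v1 2:v2 ... |EV|".
     * The delimiter layout is an assumption; see the lead-in above.
     */
    static String formatLine(int label, String treeA, String treeB, double[] features) {
        StringBuilder line = new StringBuilder();
        line.append(label > 0 ? "+1" : "-1");
        line.append(" |BT| ").append(treeA);
        line.append(" |BT| ").append(treeB);
        line.append(" |ET|");
        for (int i = 0; i < features.length; i++) {
            // Feature indices are 1-based in SVMLight-style sparse vectors.
            line.append(' ').append(i + 1).append(':')
                .append(String.format(Locale.ROOT, "%.4f", features[i]));
        }
        line.append(" |EV|");
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(formatLine(
                1,
                "(S (TOK who) (REL-TOK wrote) (REL-TOK hamlet))",
                "(S (TOK shakespeare) (REL-TOK wrote) (REL-TOK hamlet))",
                new double[] {0.5}));
    }
}
```

`Locale.ROOT` is used so that the decimal separator is always a dot, regardless of the machine's locale.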

References

The proposed system was used as a component in a number of research works, including: