Home
RelTextRank is a flexible Java pipeline for converting pairs of raw texts into structured representations and enriching them with semantic information about the relations between the two texts (e.g., lexical exact match). The pipeline takes a pair of texts as input and produces their structural representations with relational markup. For example, it can represent a question and answer sentence pair as syntactico-semantic structures, enrich them with relational information (e.g., links between the question class, the question focus and named entities), and serialize them as training and test files for the tree kernel-based reranking framework. The pipeline generates a number of dependency-based and shallow chunk-based representations that were shown to achieve competitive results in previous work. It also enables easy evaluation of the models thanks to its cross-validation facilities.
The pipeline is based on the Apache UIMA technology, which allows for the creation of highly modular applications and analysis of a large volume of unstructured information.
To launch the pipeline you need to run one of the System Java classes described in System.
The overall schema of the pipeline is presented below.
The basic unit of input to RelTextRank is a text Texti and a list of n texts, Texti1,...,Textin, that should be classified or reranked as relevant or not relevant to Texti. For example, in a question answering setting, Texti would be a question and Textij, j=1,...,n, would be the n candidate answer passages. The output of the system is a file in the SVMLight-TK format containing the relational structures and feature vectors generated from the <Texti, (Texti1,...,Textin)> tuples.
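To make the input unit concrete, here is a minimal sketch of how such a tuple could be held in memory and expanded into one pair per candidate, which is the granularity the pipeline works at. The class names `TextPair` and `InputTupleSketch` and the method `toPairs` are hypothetical and introduced only for illustration; they are not part of RelTextRank.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical container for one (Text_i, Text_ij) pair; not a RelTextRank class.
final class TextPair {
    final String text;         // Text_i, e.g. a question
    final String candidate;    // Text_ij, e.g. one candidate answer passage
    final int candidateIndex;  // j, counted from 1 as in the text

    TextPair(String text, String candidate, int candidateIndex) {
        this.text = text;
        this.candidate = candidate;
        this.candidateIndex = candidateIndex;
    }
}

final class InputTupleSketch {
    // Expands <Text_i, (Text_i1, ..., Text_in)> into n pairs, keeping candidate order.
    static List<TextPair> toPairs(String text, List<String> candidates) {
        List<TextPair> pairs = new ArrayList<>();
        for (int j = 0; j < candidates.size(); j++) {
            pairs.add(new TextPair(text, candidates.get(j), j + 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<TextPair> pairs = toPairs(
                "What is the capital of Italy?",
                Arrays.asList("Rome is the capital of Italy.", "Milan is in the north."));
        for (TextPair p : pairs) {
            System.out.println(p.candidateIndex + ": " + p.text + " -> " + p.candidate);
        }
    }
}
```

Keeping the candidate index alongside each pair makes it possible to group the processed pairs back under their originating text when the output is written.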
When launched, the RelTextRank System module first initializes the other modules, such as the UIMA text analysis pipeline responsible for linguistic annotation of input texts, the Experiment module responsible for generating the structural representations enriched with the relational labels, and, finally, the OutputWriter module which generates the output in the SVMLight-TK format.
At runtime, when presented with a <Texti, (Texti1,...,Textin)> tuple, RelTextRank generates (Texti,Textij) pairs and performs the following steps for each pair.
First, we perform linguistic analysis of the input texts using the UIMA pipeline. This subsystem runs a pipeline of UIMA Analysis Engines (AEs), which wrap linguistic annotators (e.g., sentence splitters, tokenizers and syntactic parsers), thus converting the input text pairs (Texti,Textij) into UIMA Common Analysis Structures, or CASes, (CASi,CASij). CASes contain the original texts and all the linguistic annotations produced by the AEs. The annotations that AEs produce are defined by a UIMA Type System. In addition, there is an option to persist the produced CASes so that the annotators are not rerun when a specific document is processed again.
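The idea of reusing persisted annotations can be sketched without any UIMA dependency. The example below is only an illustration under stated assumptions: `AnnotatedDoc` is a hypothetical stand-in for a CAS, the cache lives in memory rather than on disk, and a whitespace tokenizer stands in for the chain of Analysis Engines. It does not reflect how RelTextRank actually stores CASes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a CAS: the original text plus its annotations.
final class AnnotatedDoc {
    final String text;
    final List<String> tokens;

    AnnotatedDoc(String text, List<String> tokens) {
        this.text = text;
        this.tokens = tokens;
    }
}

final class CachingAnnotatorSketch {
    // In-memory cache keyed by document id (assumption: the real system persists to storage).
    private final Map<String, AnnotatedDoc> cache = new HashMap<>();
    private final Function<String, AnnotatedDoc> annotator;
    private int annotatorRuns = 0;

    CachingAnnotatorSketch(Function<String, AnnotatedDoc> annotator) {
        this.annotator = annotator;
    }

    // Returns the stored annotations if the document was seen before; otherwise annotates it.
    AnnotatedDoc annotate(String docId, String text) {
        AnnotatedDoc cached = cache.get(docId);
        if (cached != null) {
            return cached;
        }
        annotatorRuns++;
        AnnotatedDoc fresh = annotator.apply(text);
        cache.put(docId, fresh);
        return fresh;
    }

    int annotatorRuns() {
        return annotatorRuns;
    }

    // Toy annotator: splits on whitespace instead of running real linguistic tools.
    static AnnotatedDoc whitespaceTokenize(String text) {
        return new AnnotatedDoc(text, Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        CachingAnnotatorSketch sketch =
                new CachingAnnotatorSketch(CachingAnnotatorSketch::whitespaceTokenize);
        sketch.annotate("doc-1", "Rome is the capital of Italy.");
        sketch.annotate("doc-1", "Rome is the capital of Italy.");
        System.out.println("annotator runs: " + sketch.annotatorRuns());
    }
}
```

Keying the cache by a document id rather than by the text itself mirrors the "specific document" wording above; it assumes that ids are stable across runs.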
The Experiment module is the core architectural component of the system. It generates the relational structures RelStructi and RelStructij from (CASi,CASij), as well as the feature vector representation of the input text pair, FVi,ij. Within it, the Projector module generates (RelStructi, RelStructij) and the VectorFeatureExtractor module generates FVi,ij.
See Experiment for the list of Experiment modules and the descriptions of the representations they generate.
See VectorFeatureExtractor for the list of feature extractors that can be used for FVi,ij generation.
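The division of labour between the Projector and the VectorFeatureExtractor can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the methods `project` and `features`, the flat `(ROOT (TOK ...))` bracketing, the `REL-` label prefix for lexical exact matches, and the single Jaccard-overlap feature. The actual representations and feature extractors are the ones listed on the pages referenced above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ExperimentSketch {
    // Toy tokenizer: lower-cases and splits on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) {
                result.add(tok);
            }
        }
        return result;
    }

    // "Projector" role: builds a flat bracketed structure for `text` and marks tokens
    // that also occur in `other` with a REL- prefix (hypothetical labelling scheme).
    static String project(String text, String other) {
        Set<String> otherTokens = new HashSet<>(tokens(other));
        StringBuilder sb = new StringBuilder("(ROOT");
        for (String tok : tokens(text)) {
            String label = otherTokens.contains(tok) ? "REL-TOK" : "TOK";
            sb.append(" (").append(label).append(' ').append(tok).append(')');
        }
        return sb.append(')').toString();
    }

    // "VectorFeatureExtractor" role: a one-dimensional vector holding the Jaccard
    // overlap of the two token sets.
    static double[] features(String a, String b) {
        Set<String> shared = new HashSet<>(tokens(a));
        Set<String> all = new HashSet<>(shared);
        Set<String> tokensB = new HashSet<>(tokens(b));
        shared.retainAll(tokensB);
        all.addAll(tokensB);
        double jaccard = all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
        return new double[] { jaccard };
    }

    public static void main(String[] args) {
        String question = "What is the capital of Italy?";
        String answer = "Rome is the capital of Italy.";
        System.out.println(project(question, answer));
        System.out.println(project(answer, question));
        System.out.println(features(question, answer)[0]);
    }
}
```

Note that the structure for each text depends on the other text in the pair, which is what distinguishes a relational structure from an independent parse of each text.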
Once all the pairs generated from the <Texti, (Texti1,...,Textin)> tuple have been processed, the OutputWriter module writes them into training/test files. Output strategies with examples are described in System.
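As a rough illustration of the final step, the sketch below serializes one processed pair as a single line. It assumes the delimiter convention of SVM-Light-TK input files, in which each tree is introduced by `|BT|`, the tree section is closed by `|ET|`, and the feature vector is closed by `|EV|`. The helper `formatExample` is hypothetical, and the exact layout RelTextRank emits depends on the chosen output strategy and may differ.

```java
import java.util.Arrays;
import java.util.List;

final class SvmLightTkLineSketch {
    // Builds "<label> |BT| tree1 |BT| tree2 |ET| 1:v1 2:v2 |EV|" (assumed layout).
    static String formatExample(boolean relevant, List<String> trees, double[] features) {
        StringBuilder line = new StringBuilder(relevant ? "+1" : "-1");
        for (String tree : trees) {
            line.append(" |BT| ").append(tree);
        }
        line.append(" |ET|");
        for (int k = 0; k < features.length; k++) {
            // Feature indices are 1-based and written in increasing order.
            line.append(' ').append(k + 1).append(':').append(features[k]);
        }
        return line.append(" |EV|").toString();
    }

    public static void main(String[] args) {
        String line = formatExample(
                true,
                Arrays.asList("(ROOT (TOK what) (REL-TOK capital))", "(ROOT (TOK rome) (REL-TOK capital))"),
                new double[] { 0.5, 1.0 });
        System.out.println(line);
    }
}
```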
The proposed system was used as a component in a number of research works, including:
- K. Tymoshenko, D. Bonadiman, and A. Moschitti. 2016a. Convolutional Neural Networks vs. Convolution Kernels: Feature Engineering for Answer Sentence Reranking. In Proceedings of NAACL-HLT. ACL.
- K. Tymoshenko, D. Bonadiman, and A. Moschitti. 2016b. Learning to rank non-factoid answers: Comment selection in web forums. In Proceedings of CIKM ’16. ACM.
- K. Tymoshenko and A. Moschitti. 2015. Assessing the impact of syntactic and semantic structures for answer passages reranking. In Proceedings of CIKM. ACM.