This repository contains the DFKI Product Corpus, a dataset of 174 English web pages and social media posts annotated for product and company named entities, and the relation CompanyProvidesProduct. The goal is to make extraction of non-standard, B2B products and relations from unstructured text easier and more reliable. The corpus is also annotated for coreference chains of companies and products.
You can find the full description of the corpus here: https://www.dfki.de/web/forschung/projekte-publikationen/publikationen/publikation/9428/
The corpus is provided in two formats, AVRO and JSON, using the train/test split described in the paper. For details on the schema used for storing annotations, see below.
Changes for Version 2:
- Annotated 21 more documents, for a total of 174.
- Created a train/dev/test split.
- Annotation change: phrases that function like product proper names in certain contexts or enumerations, e.g. "[drip chamber] segment", are now marked as products, even when they are not explicitly part of a CompanyProvidesProduct relation mention.
- Annotation change: phrases clearly recognizable as a physical product, especially in market descriptions that mention specific products, e.g. "[fusion pumps]" in the "[IV equipment] market", are now marked as products.
The DFKI Product Corpus is released under CC-BY 4.0. If you use this data, please cite the accompanying paper:
A Corpus Study and Annotation Schema for Named Entity Recognition and Relation Extraction of Business Products. Saskia Schön, Veselina Mironova, Aleksandra Gabryszak and Leonhard Hennig. Proceedings of LREC, 2018.
The corpus consists of Documents which store the original text and all annotations, according to the following AVRO schema:
You can use the following Java tools to read the AVRO version of the corpus:
To read the corpus, use the following code snippet:
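import java.io.File;
import org.apache.avro.file.DataFileReader;

// Document and AvroUtils are the corpus-specific classes used throughout this README.
File inputFile = new File("train.avro");
DataFileReader<Document> reader = AvroUtils.createReader(inputFile);
while (reader.hasNext()) {
    Document doc = reader.next();
    // do something with doc
}
reader.close(); // release the underlying file handle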
Each document contains a list of ConceptMentions, which correspond to Named Entities and other typed concepts (e.g. trigger phrases):
for (ConceptMention c : doc.getConceptMentions()) {
    String nerTag = c.getType();
    String value = c.getNormalizedValue();
    // Character offsets of the mention in the document text
    int start = c.getSpan().getStart();
    int end = c.getSpan().getEnd();
    String originalText = doc.getText().substring(start, end);
    // etc ...
}
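To illustrate how the span offsets relate to the document text, here is a minimal, self-contained sketch that groups mention strings by type. The `Mention` class is a hypothetical stand-in for `ConceptMention`, not the corpus's actual class, and the type tags `company` and `product` as well as the sample sentence are assumptions made for illustration only. The end offset is treated as exclusive, as the `substring(start, end)` call above suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MentionsByType {

    // Hypothetical stand-in for ConceptMention, reduced to a type tag and span offsets.
    public static final class Mention {
        public final String type;
        public final int start; // inclusive character offset
        public final int end;   // exclusive character offset (assumed)

        public Mention(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Groups the surface strings of all mentions by their type tag, in order of first appearance.
    public static Map<String, List<String>> surfaceFormsByType(String text, List<Mention> mentions) {
        Map<String, List<String>> byType = new LinkedHashMap<>();
        for (Mention m : mentions) {
            // Skip spans that do not fit the text instead of throwing.
            if (m.start < 0 || m.start > m.end || m.end > text.length()) {
                continue;
            }
            byType.computeIfAbsent(m.type, k -> new ArrayList<>())
                  .add(text.substring(m.start, m.end));
        }
        return byType;
    }

    public static void main(String[] args) {
        // Invented sample sentence; the type tags are assumed, not taken from the corpus.
        String text = "Acme Corp sells fusion pumps.";
        List<Mention> mentions = Arrays.asList(
                new Mention("company", 0, 9),
                new Mention("product", 16, 28));
        System.out.println(surfaceFormsByType(text, mentions));
    }
}
```

With the real classes, the same loop would read the type from `c.getType()` and the offsets from `c.getSpan()`.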
Similarly, you can retrieve RelationMentions:
for (RelationMention r : doc.getRelationMentions()) {
    String relationType = r.getName();
    // Character offsets of the relation mention in the document text
    int start = r.getSpan().getStart();
    int end = r.getSpan().getEnd();
    String originalText = doc.getText().substring(start, end);
    for (RelationArgument arg : r.getRelationArguments()) {
        String role = arg.getRole();
        ConceptMention c = arg.getConceptMention();
        // ...
    }
    // ...
}
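A common use of relation mentions is to collect (company, product) pairs from CompanyProvidesProduct relations. The following sketch shows one way this could look. `Relation` and `Argument` are hypothetical stand-ins for `RelationMention` and `RelationArgument`, and the role names `company` and `product` are assumptions; check the roles actually used in the corpus before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProvidesPairs {

    // Hypothetical stand-in for RelationArgument: a role plus the argument's surface text.
    public static final class Argument {
        public final String role;
        public final String text;

        public Argument(String role, String text) {
            this.role = role;
            this.text = text;
        }
    }

    // Hypothetical stand-in for RelationMention: a relation name plus its arguments.
    public static final class Relation {
        public final String name;
        public final List<Argument> arguments;

        public Relation(String name, List<Argument> arguments) {
            this.name = name;
            this.arguments = arguments;
        }
    }

    // Returns {company, product} pairs for every CompanyProvidesProduct relation
    // that has both arguments; other relations and incomplete ones are skipped.
    public static List<String[]> companyProductPairs(List<Relation> relations) {
        List<String[]> pairs = new ArrayList<>();
        for (Relation r : relations) {
            if (!"CompanyProvidesProduct".equals(r.name)) {
                continue;
            }
            String company = null;
            String product = null;
            for (Argument a : r.arguments) {
                if ("company".equals(a.role)) {        // assumed role name
                    company = a.text;
                } else if ("product".equals(a.role)) { // assumed role name
                    product = a.text;
                }
            }
            if (company != null && product != null) {
                pairs.add(new String[] {company, product});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Invented sample data for illustration.
        List<Relation> relations = Arrays.asList(
                new Relation("CompanyProvidesProduct", Arrays.asList(
                        new Argument("company", "Acme Corp"),
                        new Argument("product", "fusion pumps"))));
        for (String[] pair : companyProductPairs(relations)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

With the real classes, the argument text would be recovered from the span of `arg.getConceptMention()`, as in the ConceptMention example.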
ConceptMentions and RelationMentions are stored at the document level, and for each sentence as well. You can access a sentence's list of RelationMentions using:
for (Sentence s : doc.getSentences()) {
    List<RelationMention> relationMentions = s.getRelationMentions();
    // ...
}