This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the relation CompanyProvidesProduct.

DFKI-NLP/product-corpus

DFKI Product Corpus

This repository contains the DFKI Product Corpus, a dataset of 174 English web pages and social media posts annotated for product and company named entities, and the relation CompanyProvidesProduct. The goal is to make extraction of non-standard, B2B products and relations from unstructured text easier and more reliable. The corpus is also annotated for coreference chains of companies and products.

You can find the full description of the corpus here: https://www.dfki.de/web/forschung/projekte-publikationen/publikationen/publikation/9428/

The corpus is provided in two formats, AVRO and JSON, using the train/test split described in the paper. For details on the schema used for storing annotations, see the Format section below.

Changes for Version 2:

  • Annotated 21 more documents, for a total of 174.
  • Created a train/dev/test split.
  • Annotation change: phrases that are used like proper name products in certain contexts or enumerations even without being explicitly used in a CompanyProvidesProduct relation mention, e.g. "[drip chamber] segment", are now marked as products.
  • Annotation change: phrases clearly recognizable as a physical product, esp. in market descriptions where specific products are mentioned, e.g. "[fusion pumps]" in the "[IV equipment] market", are now marked as products.

Use

The DFKI Product Corpus is released under the CC-BY 4.0 license. If you use this data, please cite the accompanying paper:

A Corpus Study and Annotation Schema for Named Entity Recognition and Relation Extraction of Business Products. Saskia Schön, Veselina Mironova, Aleksandra Gabryszak and Leonhard Hennig. Proceedings of LREC, 2018. (bib) (pdf)

Format

The corpus consists of Documents which store the original text and all annotations, according to the following AVRO schema:

You can use the following Java tools to read the AVRO version of the corpus:

To read the corpus, use the following code snippet:

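// File is java.io.File; DataFileReader is org.apache.avro.file.DataFileReader.
// AvroUtils and Document are provided by the Java tools mentioned above.
File inputFile = new File("train.avro");
// try-with-resources closes the reader once all documents have been read
try (DataFileReader<Document> reader = AvroUtils.createReader(inputFile)) {
    while (reader.hasNext()) {
        Document doc = reader.next();
        // do something
    }
}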

Each document contains a list of ConceptMentions, which correspond to Named Entities and other typed concepts (e.g. trigger phrases):

for (ConceptMention c : doc.getConceptMentions()) {
    String nerTag = c.getType();
    String value = c.getNormalizedValue();
    int start = c.getSpan().getStart();
    int end = c.getSpan().getEnd();
    String originalText = doc.getText().substring(start, end);
    // etc ...
}
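The offset handling above can be tried without the corpus tooling. The following is a minimal sketch, not the corpus API: `SpanSketch` and `Mention` are hypothetical stand-ins for the generated AVRO classes, the tag names `company` and `product` are placeholders, and the sentence is made up. It assumes that `start` is inclusive and `end` is exclusive, which is what the `substring(start, end)` call in the snippet above implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; the real classes
// are generated from the corpus' AVRO schema.
final class SpanSketch {

    // A typed mention with character offsets into the document text.
    static final class Mention {
        final String type;
        final int start; // inclusive (assumption)
        final int end;   // exclusive (assumption)

        Mention(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Returns the surface text of every mention with the given type.
    static List<String> surfaceForms(String text, List<Mention> mentions, String type) {
        List<String> out = new ArrayList<>();
        for (Mention m : mentions) {
            if (m.type.equals(type)) {
                out.add(text.substring(m.start, m.end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sentence and offsets, not taken from the corpus.
        String text = "Acme Corp sells infusion pumps.";
        List<Mention> mentions = Arrays.asList(
                new Mention("company", 0, 9),
                new Mention("product", 16, 30));
        System.out.println(surfaceForms(text, mentions, "product"));
    }
}
```

Filtering by type before slicing keeps the text lookup in one place, so a different offset convention would only require changing the `substring` call.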

Similarly, you can retrieve RelationMentions:

for (RelationMention r : doc.getRelationMentions()) {
    // relation-level properties are read from r, the relation mention itself
    String relationType = r.getName();
    int start = r.getSpan().getStart();
    int end = r.getSpan().getEnd();
    String originalText = doc.getText().substring(start, end);
    for (RelationArgument arg : r.getRelationArguments()) {
        String role = arg.getRole();
        ConceptMention c = arg.getConceptMention();
        // ...
    }
    // ...
}
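A common next step is to pair up the arguments of each relation by role. The sketch below shows one way to do this under stated assumptions: `RelationSketch`, `Relation` and `Argument` are hypothetical stand-ins rather than the generated AVRO classes, and the role names are passed in as parameters because the actual role labels used in the corpus are not listed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; the real classes
// are generated from the corpus' AVRO schema.
final class RelationSketch {

    // One argument of a relation: its role and the text of its mention.
    static final class Argument {
        final String role;
        final String text;

        Argument(String role, String text) {
            this.role = role;
            this.text = text;
        }
    }

    // A relation mention with a name and a list of arguments.
    static final class Relation {
        final String name;
        final List<Argument> arguments;

        Relation(String name, List<Argument> arguments) {
            this.name = name;
            this.arguments = arguments;
        }
    }

    // Returns the text of the first argument with the given role, or null if none exists.
    static String argumentText(Relation r, String role) {
        for (Argument a : r.arguments) {
            if (a.role.equals(role)) {
                return a.text;
            }
        }
        return null;
    }

    // Builds "company -> product" strings for every relation with the given name
    // that has both roles filled.
    static List<String> pairs(List<Relation> relations, String relationName,
                              String companyRole, String productRole) {
        List<String> out = new ArrayList<>();
        for (Relation r : relations) {
            if (!r.name.equals(relationName)) {
                continue;
            }
            String company = argumentText(r, companyRole);
            String product = argumentText(r, productRole);
            if (company != null && product != null) {
                out.add(company + " -> " + product);
            }
        }
        return out;
    }
}
```

With the real classes, the same loop would read the role from `arg.getRole()` and the text from the span of `arg.getConceptMention()`, as in the snippets above.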

ConceptMentions and RelationMentions are stored at the document level, and for each sentence as well. You can access a sentence's list of RelationMentions using:

for (Sentence s : doc.getSentences()) {
    List<RelationMention> relationMentions = s.getRelationMentions();
    // ...
}

Annotation Guidelines

Product Corpus Annotation Guidelines v1.0 (Feb 2018)

Coreference Annotation Guidelines v1.0 (Feb 2018)
