DFKI Product Corpus

This repository contains the DFKI Product Corpus, a dataset of 174 English web pages and social media posts annotated for product and company named entities, and the relation CompanyProvidesProduct. The goal is to make extraction of non-standard, B2B products and relations from unstructured text easier and more reliable. The corpus is also annotated for coreference chains of companies and products.

You can find the full description of the corpus here: https://www.dfki.de/web/forschung/projekte-publikationen/publikationen/publikation/9428/

The corpus is provided in two formats, AVRO and JSON, using the train/test split described in the paper (Version 2 provides a train/dev/test split, as listed in the changes below). For details on the schema used for storing annotations, see below.

Changes for Version 2:

- Annotated 21 more documents, for a total of 174.
- Created a train/dev/test split.
- Annotation change: phrases that are used like proper name products in certain contexts or enumerations even without being explicitly used in a CompanyProvidesProduct relation mention, e.g. "[drip chamber] segment", are now marked as products.
- Annotation change: phrases clearly recognizable as a physical product, esp. in market descriptions where specific products are mentioned, e.g. "[fusion pumps]" in the "[IV equipment] market", are now marked as products.

Use

The DFKI Product Corpus is released under CC-BY 4.0. If you use this data, please cite the accompanying paper:

A Corpus Study and Annotation Schema for Named Entity Recognition and Relation Extraction of Business Products. Saskia Schön, Veselina Mironova, Aleksandra Gabryszak and Leonhard Hennig. Proceedings of LREC, 2018. (bib) (pdf)

Format

The corpus consists of Documents which store the original text and all annotations, according to the following AVRO schema:

document.avsc

You can use the following Java tools to read the AVRO version of the corpus:

Corpus Reader Tools

To read the corpus, use the following code snippet:

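import java.io.File;
import org.apache.avro.file.DataFileReader;
// Document and AvroUtils are assumed to come from the corpus reader tools.

File inputFile = new File("train.avro");
DataFileReader<Document> reader = AvroUtils.createReader(inputFile);
while (reader.hasNext()) {
    Document doc = reader.next();
    // process the document here
}
reader.close(); // release the underlying file handle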

Each document contains a list of ConceptMentions, which correspond to Named Entities and other typed concepts (e.g. trigger phrases):

for (ConceptMention c : doc.getConceptMentions()) {
    String nerTag = c.getType();
    String value = c.getNormalizedValue();
    int start = c.getSpan().getStart();
    int end = c.getSpan().getEnd();
    String originalText = doc.getText().substring(start, end);
    // etc ...
}
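
The span offsets above index into the document text, with the end offset exclusive (as `substring` expects). The following is a minimal, self-contained sketch of how mentions could be tallied by type and mapped back to their surface strings. It does not use the corpus reader tools: `Mention` is a hypothetical stand-in for `ConceptMention`, and the type labels `"company"` and `"product"` are placeholders, so check the data for the tag strings actually used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MentionTally {
    // Hypothetical stand-in for ConceptMention: a type label plus
    // character offsets into the document text (end is exclusive).
    static final class Mention {
        final String type;
        final int start;
        final int end;

        Mention(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Count how many mentions carry each type label (sorted by label).
    static Map<String, Integer> countByType(List<Mention> mentions) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Mention m : mentions) {
            counts.merge(m.type, 1, Integer::sum);
        }
        return counts;
    }

    // Recover the surface strings of all mentions with the given type.
    static List<String> surfaceForms(String text, List<Mention> mentions, String type) {
        List<String> out = new ArrayList<>();
        for (Mention m : mentions) {
            if (m.type.equals(type)) {
                out.add(text.substring(m.start, m.end));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up example sentence; not taken from the corpus.
        String text = "Acme ships the RX-200 pump.";
        List<Mention> mentions = new ArrayList<>();
        mentions.add(new Mention("company", 0, 4));
        mentions.add(new Mention("product", 15, 26));

        System.out.println(countByType(mentions));
        System.out.println(surfaceForms(text, mentions, "product"));
    }
}
```

With the real reader classes, the same loops would presumably run over `doc.getConceptMentions()` and use `c.getType()` and `c.getSpan()` in place of the stand-in fields.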

Similarly, you can retrieve RelationMentions:

for (RelationMention r : doc.getRelationMentions()) {
    String relationType = r.getName();
    int start = r.getSpan().getStart();
    int end = r.getSpan().getEnd();
    String originalText = doc.getText().substring(start, end);
    for (RelationArgument arg : r.getRelationArguments()) {
        String role = arg.getRole();
        ConceptMention c = arg.getConceptMention();
        // ...
    }
    // ...
}
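
A common use of the relation arguments is to collect (company, product) pairs. The sketch below shows one way this could look, using hypothetical stand-in classes (`Arg`, `Relation`) rather than the real `RelationArgument` and `RelationMention` types so that it runs on its own. The role labels `"company"` and `"product"` and the assumption that the relation name is stored as the string `"CompanyProvidesProduct"` are guesses for illustration; check document.avsc and the data for the actual values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProviderPairs {
    // Hypothetical stand-in for RelationArgument: a role plus surface text.
    static final class Arg {
        final String role;
        final String text;

        Arg(String role, String text) {
            this.role = role;
            this.text = text;
        }
    }

    // Hypothetical stand-in for RelationMention: a name plus its arguments.
    static final class Relation {
        final String name;
        final List<Arg> args;

        Relation(String name, List<Arg> args) {
            this.name = name;
            this.args = args;
        }
    }

    // Placeholder role labels; the corpus may use different strings.
    static final String COMPANY_ROLE = "company";
    static final String PRODUCT_ROLE = "product";

    // Return the text of the first argument with the given role, or null.
    static String firstWithRole(Relation r, String role) {
        for (Arg a : r.args) {
            if (a.role.equals(role)) {
                return a.text;
            }
        }
        return null;
    }

    // Collect {company, product} pairs from CompanyProvidesProduct relations,
    // skipping other relation types and relations missing either role.
    static List<String[]> pairs(List<Relation> relations) {
        List<String[]> out = new ArrayList<>();
        for (Relation r : relations) {
            if (!"CompanyProvidesProduct".equals(r.name)) {
                continue;
            }
            String company = firstWithRole(r, COMPANY_ROLE);
            String product = firstWithRole(r, PRODUCT_ROLE);
            if (company != null && product != null) {
                out.add(new String[] {company, product});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up example data; not taken from the corpus.
        List<Relation> relations = new ArrayList<>();
        relations.add(new Relation("CompanyProvidesProduct", Arrays.asList(
                new Arg("company", "Acme"),
                new Arg("product", "RX-200 pump"))));
        for (String[] p : pairs(relations)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

Relations that lack either role are skipped rather than reported with a null, which keeps downstream code simple at the cost of silently dropping incomplete annotations.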

ConceptMentions and RelationMentions are stored at the document level, and for each sentence as well. You can access a sentence's list of RelationMentions using:

for (Sentence s : doc.getSentences()) {
    List<RelationMention> relationMentions = s.getRelationMentions();
    // ...
}

Annotation Guidelines

Product Corpus Annotation Guidelines v1.0 (Feb 2018)

Coreference Annotation Guidelines v1.0 (Feb 2018)

