SemPub2017
--- CANCELED ---: unfortunately, due to the low number of submissions, we have to cancel the 2017 edition of the Semantic Publishing Challenge. Thanks to all participants and see you next year!
… co-located with: Extended Semantic Web Conference 2017.
###Latest News
- EXTENDED DEADLINE: March 20, 2017: First paper submission (5-page document)
- Evaluation tool available at SemPubEvaluator
- Access to the Third SemWebEval Challenge proceedings will shortly be enabled, free of charge, until 17 March. Please use the link to check last year's submissions.
- February 2, 2017: Publication of tasks, rules and queries description
As in 2016, 2015 and 2014, the goal is to facilitate measuring the excellence of papers, people and scientific venues by data analysis. Instead of considering publication venues as single and independent units, we focus on their explicit and implicit connections, interlinking and evolution. We achieve this thanks to the primary data source we are using, which is highly relevant for computer science: the CEUR-WS.org workshop proceedings, which have accumulated 1,800 proceedings volumes with around 30,000 papers over 20 years and thus cover the majority of workshops in computer science. We go beyond the tasks of the 2016 challenge in two ways: (1) refining and extending the set of quality-related data to be extracted and (2) linking and exploiting existing Linked Open Data sources about authors, publications, topics, events, and communities. The best data produced in the 2017 challenge will be published at CEUR-WS.org or as a separate LOD dataset, interlinked with the official CEUR-WS.org LOD and with the whole Linked Open Data Cloud.
###Dataset
The primary dataset is the Linked Open Dataset that has been extracted from the CEUR-WS.org workshop proceedings (HTML tables of content and PDF papers) using the extraction tools that won the previous Challenges, plus the full original PDF source documents (for extracting further information). The most recent workshop proceedings metadata have explicitly been released under the CC0 open data license; for the older proceedings, CEUR-WS.org has permission to make that data accessible. In addition to the primary dataset, we use (as linking targets) existing Linked Open Datasets containing related information: Springer's recently announced computer science proceedings LOD, the brand-new LOD of OpenAIRE including all EU-funded open access publications, Springer LD, DBLP, ScholarlyData (a refactoring of the Semantic Web Dog Food), COLINDA, and further datasets available under open licenses. The evaluation dataset will comprise around 100 selected PDF full-text papers from these workshops. As in previous years, a training dataset, distinct from the evaluation dataset, will be provided together with the expected results of queries against it. Both datasets will respect the diversity of the CEUR-WS.org workshop proceedings volumes with regard to content structure and quality.
Our challenge invites submissions to one or more of three tasks, which are independent of each other but conceptually connected by taking into account increasingly more contextual information. Some tasks include sub-tasks, but participants compete in a task as a whole. They are encouraged to address all sub-tasks (even partially) to increase their chances of winning.
Further details about the organization are provided on the general rules page.
The Challenge will include three tasks:
Task 1: Extracting information from the tables in papers
Participants are required to extract information from the tables of the papers (in PDF). Extracting content from tables is a difficult task, which has been tackled by different researchers in the past. Our focus is on tables in scientific papers and on solutions for re-publishing structured data as LOD. Tables will be collected from CEUR-WS.org publications and participants will be required to identify their structure and content. The task thus requires PDF mining and data processing techniques.
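To make the pipeline concrete, here is a minimal, hypothetical sketch of a Task 1 style workflow, assuming the pdfplumber and rdflib Python libraries; the `ex:` vocabulary, the namespace URI and the file names are illustrative placeholders, not the model required by the challenge.

```python
# Hypothetical Task 1 sketch: extract tables from a PDF with pdfplumber and
# republish them as RDF with rdflib. Vocabulary and URIs are illustrative only.
import pdfplumber
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/sempub/")  # assumed, not the official vocabulary

def tables_to_rdf(pdf_path: str, paper_uri: str) -> Graph:
    g = Graph()
    g.bind("ex", EX)
    paper = URIRef(paper_uri)
    with pdfplumber.open(pdf_path) as pdf:
        for p_idx, page in enumerate(pdf.pages):
            for t_idx, table in enumerate(page.extract_tables()):
                table_uri = URIRef(f"{paper_uri}/table/{p_idx}-{t_idx}")
                g.add((table_uri, RDF.type, EX.Table))
                g.add((paper, EX.containsTable, table_uri))
                for r_idx, row in enumerate(table):
                    row_uri = URIRef(f"{table_uri}/row/{r_idx}")
                    g.add((row_uri, RDF.type, EX.Row))
                    g.add((table_uri, EX.hasRow, row_uri))
                    for c_idx, cell in enumerate(row):
                        if cell:  # skip empty cells
                            g.add((row_uri, EX[f"col{c_idx}"], Literal(cell.strip())))
    return g

if __name__ == "__main__":
    # Hypothetical input paths and paper URI
    graph = tables_to_rdf("paper.pdf", "http://ceur-ws.org/Vol-0000/paper1")
    graph.serialize("tables.ttl", format="turtle")
```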
Task 2: Extracting information from the full text of the papers
Participants are required to extract information from the textual content of the papers (in PDF). That information should describe the organization of the paper and should provide a deeper understanding of the content and the context in which it was written. In particular, the extracted information is expected to answer queries about the internal organization of sections, tables, figures and about the authors’ affiliations and research institutions. The task mainly requires PDF mining techniques and some NLP processing.
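As an illustration of the kind of processing involved, the sketch below extracts candidate numbered section headings from a paper's text, assuming pdfplumber; the regular-expression heuristic is an assumption for illustration only and is far simpler than the PDF mining and NLP pipelines participants are expected to build.

```python
# Rough Task 2 style sketch: pull plain text out of a PDF and guess numbered
# section headings with a regular expression (a naive, assumed heuristic).
import re
import pdfplumber

# Matches lines such as "3.1 Evaluation Setup"; purely illustrative.
HEADING_RE = re.compile(r"^(\d+(\.\d+)*)\.?\s+([A-Z][^\n]{2,80})$")

def extract_section_headings(pdf_path: str):
    headings = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            for line in text.splitlines():
                match = HEADING_RE.match(line.strip())
                if match:
                    headings.append((match.group(1), match.group(3).strip()))
    return headings

if __name__ == "__main__":
    for number, title in extract_section_headings("paper.pdf"):  # assumed file name
        print(number, title)
```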
Task 3: Interlinking
Participants are required to interlink the CEUR-WS.org linked dataset with relevant datasets already existing in the LOD Cloud. Task 3 can be approached as an entity interlinking/instance matching task that addresses both interlinking data produced as output of the other tasks and interlinking the CEUR-WS.org linked dataset – as produced in previous editions of this challenge – with external datasets. Moreover, as triples are generated from different sources and by different activities, tracking provenance information becomes increasingly important.
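For illustration, the sketch below generates owl:sameAs links between two RDF graphs by naive label similarity, assuming rdflib and Python's standard difflib; the label property, the similarity threshold and the file names are assumptions, and dedicated instance-matching tools would be expected to perform far better.

```python
# Naive Task 3 style sketch: match resources across two RDF graphs by fuzzy
# rdfs:label similarity and emit owl:sameAs links. Threshold and inputs are assumed.
from difflib import SequenceMatcher
from rdflib import Graph, OWL, RDFS

def labels(graph: Graph):
    # Collect (subject, lower-cased label) pairs for every rdfs:label in the graph.
    return [(s, str(o).lower()) for s, o in graph.subject_objects(RDFS.label)]

def interlink(source_path: str, target_path: str, threshold: float = 0.9) -> Graph:
    source, target = Graph().parse(source_path), Graph().parse(target_path)
    links = Graph()
    links.bind("owl", OWL)
    for s_uri, s_label in labels(source):
        for t_uri, t_label in labels(target):
            if SequenceMatcher(None, s_label, t_label).ratio() >= threshold:
                links.add((s_uri, OWL.sameAs, t_uri))
    return links

if __name__ == "__main__":
    # Hypothetical file names for the CEUR-WS.org dataset and an external dataset.
    interlink("ceur-ws.ttl", "external.ttl").serialize("links.ttl", format="turtle")
```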
###EVALUATION
In each task, the participants will be asked to refine and extend the initial CEUR-WS.org Linked Open Dataset by information extraction or link discovery, i.e. they will produce an RDF graph. To validate the RDF graphs produced, a number of queries will be specified in natural language, together with their expected results in CSV format. Participants are asked to submit both their dataset and the translation of the input natural language queries into SPARQL queries that work on that dataset. A few days before the deadline, a set of queries will be specified and used for the final evaluation. Participants are then asked to run these queries on their dataset and to submit the produced output in CSV. Precision, recall, and F-measure will be calculated by comparing each query's result set with the expected result from a manually built gold standard. Participants' overall performance in a task will be defined as the average F-measure over all queries of the task, with all queries having equal weight. For computing precision and recall, the same automated tool as in previous SemPub challenges will be used; this tool will be publicly available during the training phase. We reserve the right to disqualify participants whose dataset dumps differ from what their information extraction tools create from the source data, who are not using the core vocabulary, or whose SPARQL queries implement something different from the natural language queries given in the task definitions. The winners of each task will be awarded as last year.
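The following is a minimal sketch of the scoring scheme described above, not the official SemPubEvaluator tool; it assumes one CSV file per query, compares result sets as unordered sets of rows, and averages the per-query F-measures with equal weight.

```python
# Minimal sketch of the evaluation metric (assumed file layout, not SemPubEvaluator):
# each query's CSV output is compared with the gold-standard CSV as a set of rows.
import csv

def read_rows(path: str) -> set:
    with open(path, newline="", encoding="utf-8") as f:
        return {tuple(cell.strip() for cell in row) for row in csv.reader(f) if row}

def precision_recall_f1(submitted: set, gold: set):
    tp = len(submitted & gold)
    precision = tp / len(submitted) if submitted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def task_score(query_ids, submitted_dir: str, gold_dir: str) -> float:
    # Average F-measure over all queries, each query weighted equally.
    scores = []
    for qid in query_ids:
        submitted = read_rows(f"{submitted_dir}/{qid}.csv")
        gold = read_rows(f"{gold_dir}/{qid}.csv")
        scores.append(precision_recall_f1(submitted, gold)[2])
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    # Hypothetical query identifiers and directory names.
    print(task_score(["Q1", "Q2"], "submission", "gold"))
```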
###TARGET AUDIENCE
The Challenge is open to people from industry and academia with diverse expertise, who can participate in all tasks or focus on specific ones. Tasks 1 and 2 address an audience with a background in mapping, information extraction, information retrieval and NLP, and invite both the previous years' participants, who may refine their tools, and new teams. Task 3 additionally addresses the wider interlinking audience, without excluding other participants from the challenge at the same time. Task 3 invites new participants as well as participants from Tasks 1 and 2.
###FEEDBACK AND DISCUSSION
A discussion group is open for participants to ask questions and to receive updates about the challenge: [email protected]. Participants are invited to subscribe to this group as soon as possible and to communicate their intention to participate. They are also invited to use this channel to discuss problems in the input dataset and to suggest changes.
###HOW TO PARTICIPATE
Participants are first required to submit:
- Abstract: no more than 200 words.
- Description: It should explain the details of the automated annotation system, including why the system is innovative, how it uses Semantic Web technology, what features or functions the system provides, what design choices were made and what lessons were learned. The description should also summarize how participants have addressed the evaluation tasks. An outlook towards how the data could be consumed is appreciated but not strictly required. The description should be submitted as a 5-page document.
If accepted, the participants are invited to submit their task results. In this second phase they are required to submit:
- The Linked Open Dataset produced by their tool on the evaluation dataset (as a file or as a URL, in Turtle or RDF/XML).
- A set of SPARQL queries that work on that LOD and correspond to the natural language queries provided as input.
- The output of these SPARQL queries on the evaluation dataset (in CSV format); a minimal sketch of this last step is shown after this list.
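The sketch below illustrates that last deliverable under assumed file names: loading a submitted dataset in Turtle, running one (purely illustrative) SPARQL query over it with rdflib, and writing the result set to CSV.

```python
# Sketch of producing the CSV deliverable: run a SPARQL query over the submitted
# Turtle dataset and export the result set. File names and query are assumptions.
import csv
from rdflib import Graph

DATASET = "dataset.ttl"   # the submitted Linked Open Dataset (assumed name)
QUERY = """
    SELECT ?paper ?title WHERE {
        ?paper <http://purl.org/dc/terms/title> ?title .
    }
"""                       # illustrative query, not one of the official ones

def run_query_to_csv(dataset_path: str, query: str, out_path: str) -> None:
    graph = Graph().parse(dataset_path, format="turtle")
    results = graph.query(query)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([str(v) for v in results.vars])  # header row
        for row in results:
            writer.writerow(["" if value is None else str(value) for value in row])

if __name__ == "__main__":
    run_query_to_csv(DATASET, QUERY, "Q1.csv")  # hypothetical output file
```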
Accepted papers will be included on the Conference USB stick. After the conference, participants will be able to add data about the evaluation and to finalize the camera-ready for the final proceedings.
The final papers must not exceed 15 pages in length.
Papers must be submitted in PDF format, following the style of Springer's Lecture Notes in Computer Science (LNCS) series (http://www.springer.com/computer/lncs/lncs+authors). Submissions in semantically structured HTML, e.g. in the RASH (http://cs.unibo.it/save-sd/rash/documentation/index.html) or Dokieli (https://github.com/linkeddata/dokieli) formats, are also accepted as long as the final camera-ready version conforms to Springer's requirements (LaTeX/Word + PDF).
Further submission instructions will be published on the challenge wiki if required. All submissions should be provided via the submission system https://easychair.org/conferences/?conf=sempub17.
**NOTE:** At least one author per accepted submission will have to register for the ESWC Conference, in order to be eligible for the prizes and to include the paper in the proceedings.
###JUDGING AND PRIZES
After the first round of review, the Program Committee and the chairs will select a number of submissions conforming to the challenge requirements, which will be invited to present their work. Submissions accepted for presentation will receive constructive reviews from the Program Committee and will be included in the Springer post-proceedings of ESWC.
Six winners will be selected from those teams who participate in the challenge at ESWC. For each task we will select:
- best performing tool, awarded to the paper that achieves the highest score in the evaluation
- best paper, selected by the Program and Challenge Committee
###IMPORTANT DATES
- February 2, 2017: Publication of tasks, rules and queries description
- February 2, 2017: Publication of the training dataset
- March 7, 2017: Publication of the evaluation tool
- March 20, 2017: Paper submission (5-page document) (extended deadline)
- April 14, 2017: Notification and invitation to submit task results
- April 23, 2017: Conference camera-ready papers submission (5-page document)
- May 11, 2017: Publication of the evaluation dataset details
- May 13, 2017: Results submission
- May 30 - June 1: Challenge days
- June 30, 2017: Camera-ready paper for challenges post-proceedings (12-page document)
###PROGRAM COMMITTEE
- Aliaksandr Birukou, Springer Verlag, Heidelberg, Germany
- Maxim Kolchin, ITMO University, Saint-Petersburg, Russia
- Axel-Cyrille Ngonga Ngomo, University of Leipzig, Germany
- Ruben Verborgh, Ghent University – imec
- Vladimir Alexiev, Ontotext
- Tom De Nies, Ghent University – imec
- Francesco Osborne, Knowledge Media Institute (KMi)
We are inviting further members.