
Task 1

… of the Semantic Publishing Challenge 2017.

Motivation

Participants are required to extract information from the tables of the papers (in PDF). Extracting content from tables is a difficult task that has been tackled by many researchers in the past. Our focus is on tables in scientific papers and on solutions for re-publishing the structured data as LOD. The tables are collected from CEUR-WS.org publications, and participants are required to identify their structure and content. The task therefore requires PDF mining and data-processing techniques.
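
As a rough illustration only (participants are free to use any PDF-mining toolkit), the sketch below pulls tables out of one of the training papers with the pdfplumber library; the local file name is just an assumed download of the first paper listed under TD1.

```python
# Minimal table-extraction sketch using pdfplumber (an assumption: any
# PDF-mining library can be used; this is not the official challenge tooling).
import pdfplumber

# Assumed local copy of one training paper from TD1.
PDF_FILE = "GoerlitzAndStaab_COLD2011.pdf"

with pdfplumber.open(PDF_FILE) as pdf:
    for page_number, page in enumerate(pdf.pages, start=1):
        # extract_tables() returns a list of tables; each table is a list of
        # rows, and each row is a list of cell strings (or None).
        for table in page.extract_tables():
            print(f"Table found on page {page_number}:")
            for row in table:
                print(row)
```

The raw cell strings still need cleaning and interpretation (headers, merged cells, units) before they can be turned into structured data for the queries below.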

Data Source

Datasets can be downloaded here:

Training Dataset TD1

List of URLs for one-time download:

http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf
http://ceur-ws.org/Vol-905/MontoyaEtAl_COLD2012.pdf 
http://ceur-ws.org/Vol-912/paper7.pdf 
http://ceur-ws.org/Vol-996/papers/ldow2013-paper-07.pdf 
http://ceur-ws.org/Vol-1456/paper5.pdf
http://ceur-ws.org/Vol-1457/SSWS2015_paper4.pdf 
http://ceur-ws.org/Vol-1597/PROFILES2016_paper4.pdf 
http://ceur-ws.org/Vol-1700/paper-01.pdf
http://ceur-ws.org/Vol-1193/paper_65.pdf
http://ceur-ws.org/Vol-846/paper_20.pdf
http://ceur-ws.org/Vol-1193/paper_62.pdf
http://ceur-ws.org/Vol-1193/paper_50.pdf
http://ceur-ws.org/Vol-1350/paper-35.pdf
http://ceur-ws.org/Vol-1193/paper_6.pdf
http://ceur-ws.org/Vol-1766/oaei16_paper13.pdf
http://ceur-ws.org/Vol-666/paper2.pdf
http://ceur-ws.org/Vol-1350/paper-37.pdf
http://ceur-ws.org/Vol-1265/owled2014_submission_1.pdf
http://ceur-ws.org/Vol-858/ore2012_paper3.pdf
http://ceur-ws.org/Vol-1193/paper_58.pdf
http://ceur-ws.org/Vol-1387/paper_1.pdf
http://ceur-ws.org/Vol-1193/paper_52.pdf
http://ceur-ws.org/Vol-1766/oaei16_paper0.pdf 
http://ceur-ws.org/Vol-1545/oaei15_paper0.pdf 
http://ceur-ws.org/Vol-1317/oaei14_paper0.pdf 
http://ceur-ws.org/Vol-1111/oaei13_paper0.pdf 
http://ceur-ws.org/Vol-946/oaei12_paper0.pdf
http://ceur-ws.org/Vol-814/oaei11_paper0.pdf 
http://ceur-ws.org/Vol-689/oaei10_paper0.pdf
http://ceur-ws.org/Vol-1179/CLEF2013wn-QALD3-CabrioEt2013.pdf
http://ceur-ws.org/Vol-1179/CLEF2013wn-QALD3-HeEt2013.pdf
http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-UngerEt2014.pdf
http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-HamonEt2014.pdf
http://ceur-ws.org/Vol-1180/CLEF2014wn-QA-ParkEt2014.pdf
http://ceur-ws.org/Vol-1391/173-CR.pdf
http://ceur-ws.org/Vol-1391/164-CR.pdf
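
For convenience, the URL list above can be saved to a plain-text file (one URL per line) and fetched in one go. A minimal sketch, assuming the file is named urls.txt and using only the Python standard library:

```python
# One-time batch download of the TD1 papers.
# Assumes the URLs listed above have been saved, one per line, to urls.txt.
import os
import urllib.request

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

os.makedirs("TD1", exist_ok=True)
for url in urls:
    # Prefix the file name with the CEUR-WS volume so papers with identical
    # names (e.g. the several "oaei*_paper0.pdf" files) do not overwrite
    # each other.
    parts = url.rstrip("/").split("/")
    filename = f"{parts[-2]}_{parts[-1]}"
    urllib.request.urlretrieve(url, os.path.join("TD1", filename))
    print("downloaded", filename)
```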

Expected output on TD1

The following ZIP file contains the expected output of all queries on all papers in the training dataset: sempub17-TD1.zip

The archive contains the full list of queries (in QUERIES-LIST.csv) and the output of each of them in a separate .csv file.

For each query there is an entry in QUERIES-LIST.csv indicating the identifier of the query and the natural language description. The output of that query is contained in the corresponding .csv file, as shown below:

QueryID | Natural language description | CSV output file
Q1.1 | For each dataset D of FedBench in paper P, find the number of subjects X (X=subjects, P=http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf) | Q1.1.csv
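
The exact column layout of QUERIES-LIST.csv is not reproduced here; assuming it pairs a query identifier with its natural-language description (as in the row above) and that each query's output lives in a file named after its identifier, the archive can be inspected with a few lines of Python:

```python
# Sketch: list the queries in the TD1 archive and locate each query's output
# file. Assumes QUERIES-LIST.csv has at least two columns (query ID, natural
# language description) and that the output of query Qx.y is in Qx.y.csv.
import csv
import os

ARCHIVE_DIR = "sempub17-TD1"  # folder obtained by unzipping sempub17-TD1.zip

with open(os.path.join(ARCHIVE_DIR, "QUERIES-LIST.csv"), newline="") as f:
    for row in csv.reader(f):
        if not row:
            continue
        query_id, description = row[0], row[1]
        output_file = os.path.join(ARCHIVE_DIR, query_id + ".csv")
        print(f"{query_id}: {description}")
        print("  expected output in", output_file)
```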

Evaluation dataset ED1

The final evaluation will be performed with SemPubEvaluator on a set of papers that will be disclosed a few days before the deadline.

Queries

Participants are required to produce a dataset for answering the following queries.

  • Q1.1 (Subjects of a dataset in a paper): For each dataset D of FedBench in paper P, find the number of subjects X.
  • Q1.2 (Test cases): Which are the test cases T for OAEI in year Y?
  • Q1.3 (Common test cases of two editions): Which test cases T did all editions have in common from year Y1 to Y2?
  • Q1.4 (Persistent participants): Which participants P participated in all editions from Y1 to Y2?
  • Q1.5 (Repeated test cases): Which participants P have addressed test case T for every edition from Y1 to Y2?
  • Q1.6 (Information of datasets): How many X did test dataset D have in year Y?
  • Q1.7 (Information of systems for datasets): What was the X of system S in year Y for test dataset D?
  • Q1.8 (Systems best performance): Which system S in year Y had the best X ever?
  • Q1.9 (All tools ever mentioned in one edition of QALD): List the names of all systems that have ever been used in an experiment in an edition of QALD.
  • Q1.10 (Best precision and worst recall for Multilingual QA): For Multilingual QA in the QALD challenge, what system in what edition had the best precision, and what system in what edition had the worst recall?

These queries have to be translated into SPARQL according to the challenge's general rules and have to produce an output according to the detailed rules.
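
As an illustration only, the sketch below runs one possible SPARQL rendering of Q1.1 with rdflib and writes the result as CSV. The sempub: vocabulary and the dataset.ttl file name are invented for this example; the actual data model is up to each participant and must follow the challenge's general and detailed rules.

```python
# Hypothetical sketch: one way Q1.1 could be translated into SPARQL and its
# result exported as CSV. The sempub: vocabulary and dataset.ttl are
# assumptions, not part of the official challenge specification.
import csv
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")  # the participant-produced LOD dataset

Q1_1 = """
PREFIX sempub: <http://example.org/sempub#>
SELECT ?dataset ?subjects
WHERE {
  ?paper   sempub:url       <http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf> ;
           sempub:describes ?dataset .
  ?dataset sempub:partOf    "FedBench" ;
           sempub:subjects  ?subjects .
}
"""

with open("Q1.1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dataset", "subjects"])
    for row in g.query(Q1_1):
        writer.writerow([str(row.dataset), str(row.subjects)])
```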