SemPub16_QueriesTask2
Queries for Task 2 of the Semantic Publishing Challenge
More details and explanations will be gradually added to this page. Participants are invited to use the mailing list (https://groups.google.com/forum/#!forum/sempub-challenge) to comment, to ask questions, and to get in touch with chairs and other participants.
Participants are required to translate the input queries into SPARQL queries that can be executed against the produced LOD. The dataset can use any vocabulary, but the query result output must conform to the rules described on this page.
Some preliminary information and general rules:
- queries must produce a CSV output, according to the rules detailed below. The evaluation will be performed automatically by comparing this output (on the evaluation dataset) with the expected results.
- IRIs of workshop volumes and papers must follow this naming convention:
| type of resource | URI example |
|---|---|
| workshop volume | http://ceur-ws.org/Vol-1010/ |
| paper | http://ceur-ws.org/Vol-1099/#paper3 |
Papers have fragment IDs like paper3 in the most recently published workshop proceedings. When processing older workshop proceedings, please derive such IDs from the filenames of the papers, by removing the .pdf extension (e.g. paper3.pdf → paper3, or ldow2011-paper12.pdf → ldow2011-paper12).
- IRIs of other resources (e.g. affiliations, funding agencies) must also be within the http://ceur-ws.org/ namespace, but in a path separate from http://ceur-ws.org/Vol-NNN/ for any number NNN.
- the structure of the IRIs used in the examples is not normative and carries no hidden meaning. Participants are free to use their own IRI structure and their own organization of classes and instances.
- All data relevant for the queries and available in the input dataset must be extracted and produced as output. Although the evaluation mechanism will take minor differences into account and normalize them, participants are asked to extract as much information as possible. Further details are given below for each query.
- Since most of the queries take as input a paper (usually denoted as X), participants are required to use an unambiguous way of identifying input papers. To avoid errors, papers are identified by the URL of the PDF file, as available on http://ceur-ws.org.
- The order of output records does not matter.
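The fragment-ID derivation described in the naming-convention rule above can be sketched as follows. This is an illustrative helper under the stated convention, not required code, and the function names are hypothetical:

```python
import os

def paper_fragment_id(pdf_filename: str) -> str:
    """Derive a paper's fragment ID by stripping the .pdf extension,
    e.g. 'paper3.pdf' -> 'paper3', 'ldow2011-paper12.pdf' -> 'ldow2011-paper12'."""
    stem, _ext = os.path.splitext(pdf_filename)
    return stem

def paper_iri(volume: str, pdf_filename: str) -> str:
    """Build a paper IRI following the naming convention above,
    e.g. ('Vol-1099', 'paper3.pdf') -> 'http://ceur-ws.org/Vol-1099/#paper3'."""
    return f"http://ceur-ws.org/{volume}/#{paper_fragment_id(pdf_filename)}"
```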
We do not provide further explanations for queries whose output looks clear. If anything is unclear, or there is any other issue, please feel free to ask on the mailing list.
Query: Identify the affiliations of the authors of the paper X
The correct identification of the affiliations is tricky and would require modelling complex organizations, sub-organizations and units. A simplified approach is adopted for this task: participants are required to extract a single string for each affiliation as it appears in the header of the paper, excluding data about the location (address, city, state).
Participants are also asked to extract the full name of each author, without any further processing: author names must be extracted as they appear in the header. No normalization of middle names and initials is required.
During the evaluation process, these values will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Further notes:
- If the affiliation is composed of multiple parts (for instance, it indicates a Department of a University) all these parts must be included in the same affiliation.
- If the affiliation is described in multiple lines, all these lines must be included, apart from data about the location (according to the general rule above). Multiple lines can be collapsed into a single one, since newlines and punctuation will be stripped during the evaluation.
- In case of multiple affiliations for the same author, the query must return one line for each affiliation.
- In case of multiple authors with the same affiliation, the query must return one line for each author.
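The evaluation-time normalization described above (lowercasing, then stripping spaces, punctuation and special characters) can be sketched roughly as follows. This is an illustrative guess at the comparison, not the official evaluation code:

```python
import re

def normalize(value: str) -> str:
    """Lowercase, then drop everything that is not an ASCII letter or digit,
    mimicking the evaluation-time normalization described above (assumption:
    the official stripping rules may differ in detail, e.g. for accents)."""
    return re.sub(r"[^a-z0-9]", "", value.lower())
```

With this sketch, variants such as "Xavier  Ochoa" and "xavier-ochoa" compare equal after normalization.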
Expected output format (CSV):
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q1.1: Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<http://ceur-ws.org/affiliation/escuela-superior-politecnica-del-litoral>, "Escuela Superior Politécnica del Litoral", <http://ceur-ws.org/author/xavier-ochoa>, "Xavier Ochoa"
Query: Identify the countries of the affiliations of the authors in the paper X
Participants are required to extract data about affiliations and to identify the country where each research institution is located.
During the evaluation process, the names of the countries will be normalized to lowercase.
Further notes:
- the country names must be in English
- if the country is not explicitly mentioned in the affiliation, it should be derived from external sources
- the article 'the' in the country name is not relevant (for instance, 'The Netherlands' is considered equal to 'Netherlands')
- some acronyms are normalized: for instance 'UK', 'U.K.' and 'United Kingdom' are considered equivalent; 'USA' and 'U.S.A.' are equivalent to 'United States of America'
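The country-name normalization above can be sketched as follows. The alias table only covers the acronyms listed in the text; the helper name and the exact mapping beyond those examples are assumptions:

```python
import re

# Aliases taken from the examples above; the official list may be longer.
ALIASES = {
    "uk": "united kingdom",
    "u.k.": "united kingdom",
    "usa": "united states of america",
    "u.s.a.": "united states of america",
}

def normalize_country(name: str) -> str:
    """Lowercase, resolve known acronym variants, and drop a leading
    article 'the' (so 'The Netherlands' equals 'Netherlands')."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    return re.sub(r"^the\s+", "", key)
```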
Expected output format (CSV):
country-iri, country-fullname
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q2.3: Identify the countries of the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper3.pdf
country-iri, country-fullname
<http://ceur-ws.org/country/germany>, "Germany"
<http://ceur-ws.org/country/the-netherlands>, "The Netherlands"
Query Q2.17: Identify the countries of the affiliations of the authors of the paper http://ceur-ws.org/Vol-1500/paper1.pdf
country-iri, country-fullname
<http://ceur-ws.org/country/canada>, "Canada"
Query Q2.20: Identify the countries of the affiliations of the authors of the paper http://ceur-ws.org/Vol-1500/paper4.pdf
country-iri, country-fullname
<http://ceur-ws.org/country/canada>, "Canada"
<http://ceur-ws.org/country/united-kingdom>, "United Kingdom"
Query: Identify the supplementary material(s) for the paper X
Some scientific papers are accompanied by supplementary material that complements the content of the paper. This material is linked in the fulltext (or in footnotes or appendices) and might include: evaluation datasets, detailed evaluation reports, documentation, videos, prototype source code, etc.
Participants are required to identify these links in the paper and to extract the URL to access the supplementary material.
Important. The following data are NOT required to be extracted and included in the output:
- technical reports and extended versions of the papers
- external datasets which are mentioned in a paper but exist independently of that paper. Datasets should only be considered if they are explicitly mentioned as supplementary material
- existing software, libraries, APIs and technologies used to develop a system
To avoid confusion, the web site of the system (or model, ontology, prototype, etc.) is instead considered as supplementary material for the purposes of this task.
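A first processing step for this query could be collecting candidate URLs from the extracted fulltext; deciding which candidates are genuinely supplementary material still requires the inclusion/exclusion rules above. This is a rough heuristic sketch, not part of the official rules:

```python
import re

# Match http(s) URLs, stopping at whitespace and common delimiters.
# The pattern is a simplification and may need tuning for real PDFs.
URL_RE = re.compile(r'https?://[^\s<>")\]]+')

def candidate_urls(fulltext: str) -> list:
    """Return all candidate URLs found in the extracted text
    (fulltext, footnotes and appendices concatenated)."""
    return URL_RE.findall(fulltext)
```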
Expected output format (CSV):
material-url
<IRI>
<IRI>
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q3.15: Identify the supplementary material(s) for the paper http://ceur-ws.org/Vol-1521/paper6.pdf
material-url
"https://github.com/avijit1990"^^xsd:anyURI
"https://github.com/rishabhmisra"^^xsd:anyURI
Query Q3.26: Identify the supplementary material(s) for the paper https://trac.cs.upb.de/mechatronicuml/wiki/PaperModevva2015
material-url
"https://trac.cs.upb.de/mechatronicuml/wiki/PaperModevva2015"^^xsd:anyURI
Query: Identify the titles of the first-level sections of the paper X.
This year we would like to go deeper into the content of the papers. As a first step, participants are required to extract the titles of the first-level sections of each paper. Though nested levels would be equally interesting, we limit the analysis to the main level only.
Sections must be represented as resources in the produced dataset identified by the section-iri value.
Section titles can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in titles has to be treated as normal text.
Participants are also required to identify the number of each section, even if the sections are not numbered in the original PDF source.
The numbering has to start from 1. The representation has to use arabic numerals, even if the original paper used roman numerals or letters.
Important. The following rules apply to special sections:
- Abstracts must NOT be included in the output, unless the paper is abstract-only; in that case, the output has to indicate one section titled 'Abstract' and numbered '1'
- The Reference section must be included in the output. For uniformity, it must be numbered even if it was not numbered in the original PDF source
- Acknowledgements sections must be identified as separate sections
- for uniformity, these must be numbered even if they were not numbered in the original PDF source
- acknowledgements must be considered as separate sections even if they are just formatted as special paragraphs at the end of the paper; if instead the acknowledgements are in a footnote or in the main text of the paper, they are not relevant for the purposes of this task.
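The position-based renumbering implied by the rules above can be sketched as follows, assuming section titles have already been extracted in document order. This is an illustrative sketch covering only the Abstract rule explicitly; the helper name is an assumption:

```python
def number_sections(titles):
    """Assign arabic numbers by position, starting from 1, regardless of
    any original labels (roman numerals, letters). Abstracts are dropped
    unless the paper is abstract-only, per the rules above."""
    body = [t for t in titles if t.strip().lower() != "abstract"]
    if not body:  # abstract-only paper: one section titled 'Abstract'
        body = ["Abstract"]
    return list(enumerate(body, start=1))
```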
During the evaluation process, section titles will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Expected output format (CSV):
section-iri, section-number, section-title
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q4.1: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1518/paper1.pdf
section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1518-paper1_sec1>, "1"^^xsd:integer, "INTRODUCTION"
<http://ceur-ws.org/section/vol-1518-paper1_sec2>, "2"^^xsd:integer, "PREDICTING ACADEMIC RISK"
<http://ceur-ws.org/section/vol-1518-paper1_sec3>, "3"^^xsd:integer, "VISUALIZING UNCERTAINTY"
<http://ceur-ws.org/section/vol-1518-paper1_sec4>, "4"^^xsd:integer, "CASE-STUDY: RISK TO FAIL"
<http://ceur-ws.org/section/vol-1518-paper1_sec5>, "5"^^xsd:integer, "CONCLUSIONS AND FURTHER WORK"
<http://ceur-ws.org/section/vol-1518-paper1_sec6>, "6"^^xsd:integer, "ACKNOWLEDGMENTS"
<http://ceur-ws.org/section/vol-1518-paper1_sec7>, "7"^^xsd:integer, "REFERENCES"
Query Q4.10: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1521/paper1.pdf
section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1521-paper1_sec1>, "1"^^xsd:integer, "Abstract"
Query Q4.44: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1320/paper_22.pdf
section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1320-paper_22_sec1>, "1"^^xsd:integer, "Introduction"
<http://ceur-ws.org/section/vol-1320-paper_22_sec2>, "2"^^xsd:integer, "Backgrounds"
<http://ceur-ws.org/section/vol-1320-paper_22_sec3>, "3"^^xsd:integer, "Material and Methods"
<http://ceur-ws.org/section/vol-1320-paper_22_sec4>, "4"^^xsd:integer, "Results"
<http://ceur-ws.org/section/vol-1320-paper_22_sec5>, "5"^^xsd:integer, "Discussion"
<http://ceur-ws.org/section/vol-1320-paper_22_sec6>, "6"^^xsd:integer, "Conclusion"
<http://ceur-ws.org/section/vol-1320-paper_22_sec7>, "7"^^xsd:integer, "Authors’ contributions"
<http://ceur-ws.org/section/vol-1320-paper_22_sec8>, "8"^^xsd:integer, "Acknowledgements"
<http://ceur-ws.org/section/vol-1320-paper_22_sec9>, "9"^^xsd:integer, "Conflict of Interest"
<http://ceur-ws.org/section/vol-1320-paper_22_sec10>, "10"^^xsd:integer, "References"
Query: Identify the captions of the tables in the paper X
Participants are also required to extract information about other structural components of the papers, including tables.
As a first step, they are asked to extract the captions of the tables. These captions can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in the caption has to be treated as normal text.
Tables must be represented as resources in the produced dataset identified by the table-iri value.
Participants are also required to identify the number of each table.
Important. Caption labels, such as 'Table', 'Tab.', etc., must not be part of the number (which is an integer value).
The numbering has to start from 1. The representation has to use arabic numerals, even if the original paper used roman numerals or letters.
During the evaluation process, captions will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Expected output format (CSV):
table-iri, table-number, table-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q5.2: Identify the captions of the tables in the paper http://ceur-ws.org/Vol-1518/paper2.pdf
table-iri, table-number, table-caption
<http://ceur-ws.org/table/vol-1518-paper2_tab1>, "1"^^xsd:integer, "Top 10 candidate summary sentences for Example 1"
<http://ceur-ws.org/table/vol-1518-paper2_tab2>, "2"^^xsd:integer, "Top 10 candidate summary sentences for Example 2"
Query Q5.11: Identify the captions of the tables in the paper http://ceur-ws.org/Vol-1521/paper2.pdf
table-iri, table-number, table-caption
<http://ceur-ws.org/table/vol-1521-paper2_tab1>, "1"^^xsd:integer, "Number Of Named Entities Per Each Type In NER Data Sets"
<http://ceur-ws.org/table/vol-1521-paper2_tab2>, "2"^^xsd:integer, "CoNLL F1 Scores on Turkish Formal Data Sets"
<http://ceur-ws.org/table/vol-1521-paper2_tab3>, "3"^^xsd:integer, "Results on Turkish Informal Data Sets"
<http://ceur-ws.org/table/vol-1521-paper2_tab4>, "4"^^xsd:integer, "Results on the MSM 2013 Data Set, ConLL F1 scores"
<http://ceur-ws.org/table/vol-1521-paper2_tab5>, "5"^^xsd:integer, "Results on the Ritter Data Set, ConLL F1 scores"
<http://ceur-ws.org/table/vol-1521-paper2_tab6>, "6"^^xsd:integer, "Top-5 Neighbours wrt Turkish Word Embeddings"
Query: Identify the captions of the figures in the paper X
Participants are also required to extract information about figures included in the papers.
As a first step, they are asked to extract the captions of the figures. These captions can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in the caption has to be treated as normal text.
Important. In-line figures with no caption must not be taken into account. For the sake of simplicity, a figure composed of subfigures - with only one caption - has to be considered as one single figure (the caption describes all subfigures). Listings, pseudocode and algorithms are not relevant for the purpose of this task.
Figures must be represented as resources in the produced dataset identified by the figure-iri value.
Participants are also required to identify the number of each figure.
Important. Caption labels, such as 'Figure', 'Fig.', etc., must not be part of the number (which is an integer value). The number of each figure has to match its position in the paper.
The numbering has to start from 1. The representation has to use arabic numerals, even if the original paper used roman numerals or letters.
During the evaluation process, captions will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Expected output format (CSV):
figure-iri, figure-number, figure-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q6.8: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1518/paper8.pdf
figure-iri, figure-number, figure-caption
<http://ceur-ws.org/figure/vol-1518-paper8_fig1>, "1"^^xsd:integer, "LAK Explorer components"
<http://ceur-ws.org/figure/vol-1518-paper8_fig2>, "2"^^xsd:integer, "LAK Explorer home page"
<http://ceur-ws.org/figure/vol-1518-paper8_fig3>, "3"^^xsd:integer, "Using autocomplete"
<http://ceur-ws.org/figure/vol-1518-paper8_fig4>, "4"^^xsd:integer, "The search results page"
<http://ceur-ws.org/figure/vol-1518-paper8_fig5>, "5"^^xsd:integer, "The search results page"
<http://ceur-ws.org/figure/vol-1518-paper8_fig6>, "6"^^xsd:integer, "Browsing similar papers"
<http://ceur-ws.org/figure/vol-1518-paper8_fig7>, "7"^^xsd:integer, "Visually representing similar papers"
Query Q6.9: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1518/paper9.pdf
figure-iri, figure-number, figure-caption
<http://ceur-ws.org/figure/vol-1518-paper9_fig1>, "1"^^xsd:integer, "Diagrammatic view of the methodological steps followed in this study."
<http://ceur-ws.org/figure/vol-1518-paper9_fig2>, "2"^^xsd:integer, "Four co-occurrence matrices constructed from the term-by-document matrix."
<http://ceur-ws.org/figure/vol-1518-paper9_fig3>, "3"^^xsd:integer, "Sampling of the output from the analysis"
Query Q6.23: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1317/om2014_Tpaper1.pdf
figure-iri, figure-number, figure-caption
Query: Identify the funding agencies that funded the research presented in the paper X (or part of it).
Participants are required to extract the funding agencies explicitly mentioned in the paper. The analysis is restricted to these agencies only.
Each agency is identified by a name, an acronym, or both. All such data in the paper must be extracted. Data must be copied directly from the paper, without consulting external data sources.
Funding agencies must be represented as resources in the produced dataset identified by the funding-agency-iri value.
Note: in case of papers whose research is supported by an EU project, the EU Commission must not be included among the funding agencies. That case is covered by query Q2.8.
Punctuation, spaces, prepositions and articles in the agency name will be normalized during the evaluation process.
Further notes:
- for the sake of simplicity, if the paper mentions a project without any information about its funding agency, this must not be included
- for the same reason, the hierarchical organization of agencies is not taken into account; if a funding agency is listed as a body of another funding agency (for instance, 'National Center For Advancing Translational Sciences of the National Institutes of Health'), the full name has to be considered
- the article 'the' in the funding agency name is not relevant
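The agency-name normalization above can be sketched as follows. The text only guarantees that punctuation, spaces, prepositions and articles are normalized; the exact stop-word list used here is an assumption:

```python
import re

# Assumed stop words (articles and common prepositions); the official
# evaluation list may differ.
STOP_WORDS = {"the", "of", "for", "in"}

def normalize_agency(name: str) -> str:
    """Lowercase, replace punctuation with spaces, then drop stop words
    and join the remaining words, per the evaluation rules above."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return "".join(w for w in words if w not in STOP_WORDS)
```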
Expected output format (CSV):
funding-agency-iri, funding-agency-name, funding-agency-acronym
<IRI>,rdfs:Literal,rdfs:Literal
<IRI>,rdfs:Literal,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q7.33: Identify the funding agencies that supported the research presented in the paper http://ceur-ws.org/Vol-1006/paper5.pdf (or part of it)
funding-agency-iri, funding-agency-name, funding-agency-acronym
<http://ceur-ws.org/funding-agency/center-for-service-innovation>, "Center for Service Innovation", "CSI"
Query Q7.17: Identify the funding agencies that supported the research presented in the paper http://ceur-ws.org/Vol-1500/paper1.pdf (or part of it)
funding-agency-iri, funding-agency-name, funding-agency-acronym
<http://ceur-ws.org/funding-agency/nserc>, , "NSERC"
Query: Identify the EU project(s) that supported the research presented in the paper X (or part of it).
The analysis is restricted to projects explicitly mentioned in the paper. The name of the projects must be copied directly from the paper, without looking at external data sources.
Projects must be represented as resources in the produced dataset identified by the project-iri value.
Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.
Further notes:
- projects are identified by their name. If the paper mentions both the name and the EU agreement number, it is enough to include the name.
- if the paper only mentions the number of the project, with no information about the name, the number must be included
- the name of the project must be included without the string 'project'
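The last rule above (dropping the string 'project' from the extracted name) can be sketched as a small helper; the function name is hypothetical:

```python
import re

def clean_project_name(name: str) -> str:
    """Drop the standalone word 'project' from an extracted project name,
    per the rule above, e.g. 'OPTIQUE project' -> 'OPTIQUE'."""
    return re.sub(r"\bproject\b", "", name, flags=re.IGNORECASE).strip()
```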
Expected output format (CSV):
project-iri, project-name
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q8.3: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1518/paper3.pdf (or part of it)
project-iri, project-name
<http://ceur-ws.org/project/open-discovery-space>, "Open Discovery Space"
Query Q8.42: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1320/paper_7.pdf (or part of it)
project-iri, project-name
<http://ceur-ws.org/project/linked2safety>, "Linked2Safety"
<http://ceur-ws.org/project/geoknow>, "GeoKnow"
Query Q8.45: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1320/paper_31.pdf (or part of it)
project-iri, project-name
<http://ceur-ws.org/project/optique>, "OPTIQUE"
Query Q8.37: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1309/paper2.pdf
project-iri, project-name
<http://ceur-ws.org/project/246016>, "246016"