SemPub16_QueriesTask2
Queries for Task 2 of the Semantic Publishing Challenge
More details and explanations will be gradually added to this page. Participants are invited to use the mailing list (https://groups.google.com/forum/#!forum/sempub-challenge) to comment, to ask questions, and to get in touch with chairs and other participants.
Participants are required to translate the input queries into SPARQL queries that can be executed against the produced LOD. The dataset can use any vocabulary, but the query result output must conform to the rules described on this page.
Some preliminary information and general rules:
- queries must produce a CSV output, according to the rules detailed below. The evaluation will be performed automatically by comparing this output (on the evaluation dataset) with the expected results.
- IRIs of workshop volumes and papers must follow this naming convention:
| type of resource | URI example |
|---|---|
| workshop volume | http://ceur-ws.org/Vol-1010/ |
| paper | http://ceur-ws.org/Vol-1099/#paper3 |
Papers have fragment IDs like paper3 in the most recently published workshop proceedings. When processing older workshop proceedings, please derive such IDs from the filenames of the papers, by removing the .pdf extension (e.g. paper3.pdf → paper3, or ldow2011-paper12.pdf → ldow2011-paper12).
- IRIs of other resources (e.g. affiliations, funding agencies) must also be within the http://ceur-ws.org/ namespace, but in a path separate from http://ceur-ws.org/Vol-NNN/ for any number NNN.
- the structure of the IRIs used in the examples is not normative and carries no hidden meaning. Participants are free to use their own IRI structure and their own organization of classes and instances.
- All data relevant for the queries and available in the input dataset must be extracted and produced as output. Although the evaluation mechanism will take minor differences into account and normalize them, participants are asked to extract as much information as possible. Further details are given below for each query.
- Since most of the queries take as input a paper (usually denoted as X), participants are required to use an unambiguous way of identifying input papers. To avoid errors, papers are identified by the URL of the PDF file, as available on http://ceur-ws.org.
- The order of output records does not matter.
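The fragment-ID derivation described in the naming-convention rule above can be sketched as follows. This is an illustrative helper under the stated convention, not required code, and the function names are hypothetical:

```python
import os

def paper_fragment_id(pdf_filename: str) -> str:
    """Derive a paper's fragment ID by stripping the .pdf extension,
    e.g. 'paper3.pdf' -> 'paper3', 'ldow2011-paper12.pdf' -> 'ldow2011-paper12'."""
    stem, _ext = os.path.splitext(pdf_filename)
    return stem

def paper_iri(volume: str, pdf_filename: str) -> str:
    """Build a paper IRI following the naming convention above,
    e.g. ('Vol-1099', 'paper3.pdf') -> 'http://ceur-ws.org/Vol-1099/#paper3'."""
    return f"http://ceur-ws.org/{volume}/#{paper_fragment_id(pdf_filename)}"
```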
We do not provide further explanations for queries whose output looks clear. If anything is unclear, or there is any other issue, please feel free to ask on the mailing list.
Query: Identify the affiliations of the authors of the paper X
The correct identification of the affiliations is tricky and would require modelling complex organizations, sub-organizations and units. A simplified approach is adopted for this task: participants are required to extract a single string for each affiliation as it appears in the header of the paper, excluding data about the location (address, city, state).
Participants are also asked to extract the full name of each author, without any further processing: author names must be extracted as they appear in the header. No normalization of middle names and initials is required.
During the evaluation process, these values will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Further notes:
- If the affiliation is composed of multiple parts (for instance, it indicates a Department of a University) all these parts must be included in the same affiliation.
- If the affiliation is described in multiple lines, all these lines must be included, apart from data about the location (according to the general rule above). Multiple lines can be collapsed into a single one, since newlines and punctuation will be stripped during the evaluation.
- In case of multiple affiliations for the same author, the query must return one line for each affiliation.
- In case of multiple authors with the same affiliation, the query must return one line for each author.
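The evaluation-time normalization described above (lowercasing, then stripping spaces, punctuation and special characters) can be sketched roughly as follows. This is an illustrative guess at the comparison, not the official evaluation code:

```python
import re

def normalize(value: str) -> str:
    """Lowercase, then drop everything that is not an ASCII letter or digit,
    mimicking the evaluation-time normalization described above (assumption:
    the official stripping rules may differ in detail, e.g. for accents)."""
    return re.sub(r"[^a-z0-9]", "", value.lower())
```

With this sketch, variants such as "Xavier  Ochoa" and "xavier-ochoa" compare equal after normalization.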
Expected output format (CSV):
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
<IRI>,rdfs:Literal,<IRI>,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q1.1: Identify the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper1.pdf
affiliation-iri, affiliation-fullname, author-iri, author-fullname
<http://ceur-ws.org/affiliation/escuela-superior-politecnica-del-litoral>, "Escuela Superior Politécnica del Litoral", <http://ceur-ws.org/author/xavier-ochoa>, "Xavier Ochoa"
Query: Identify the countries of the affiliations of the authors in the paper X
Participants are required to extract data about affiliations and to identify the country where each research institution is located.
During the evaluation process, the names of the countries will be normalized to lowercase.
Further notes:
- the country names must be in English
- if the country is not explicitly mentioned in the affiliation, it should be derived from external sources
- the article 'the' in the country name is not relevant (for instance, 'The Netherlands' is considered equal to 'Netherlands')
- some acronyms are normalized: for instance 'UK', 'U.K.' and 'United Kingdom' are considered equivalent; 'USA' and 'U.S.A.' are equivalent to 'United States of America'
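The country-name normalization above can be sketched as follows. The alias table only covers the acronyms listed in the text; the helper name and the exact mapping beyond those examples are assumptions:

```python
import re

# Aliases taken from the examples above; the official list may be longer.
ALIASES = {
    "uk": "united kingdom",
    "u.k.": "united kingdom",
    "usa": "united states of america",
    "u.s.a.": "united states of america",
}

def normalize_country(name: str) -> str:
    """Lowercase, resolve known acronym variants, and drop a leading
    article 'the' (so 'The Netherlands' equals 'Netherlands')."""
    key = name.strip().lower()
    key = ALIASES.get(key, key)
    return re.sub(r"^the\s+", "", key)
```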
Expected output format (CSV):
country-iri, country-fullname
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q2.3: Identify the countries of the affiliations of the authors of the paper http://ceur-ws.org/Vol-1518/paper3.pdf
country-iri, country-fullname
<http://ceur-ws.org/country/germany>, "Germany"
<http://ceur-ws.org/country/the-netherlands>, "The Netherlands"
Query Q2.17: Identify the countries of the affiliations of the authors of the paper http://ceur-ws.org/Vol-1500/paper1.pdf
country-iri, country-fullname
<http://ceur-ws.org/country/canada>, "Canada"
Query Q2.20: Identify the countries of the affiliations of the authors of the paper http://ceur-ws.org/Vol-1500/paper4.pdf
country-iri, country-fullname
<http://ceur-ws.org/country/canada>, "Canada"
<http://ceur-ws.org/country/united-kingdom>, "United Kingdom"
Query: Identify the supplementary material(s) for the paper X
Some scientific papers are accompanied by supplementary material that complements the content of the paper. This material is linked in the fulltext (or in footnotes or appendices) and might include: evaluation datasets, detailed evaluation reports, documentation, videos, prototype source code, etc.
Participants are required to identify these links in the paper and to extract the URL to access the supplementary material.
Important. The following data are NOT required to be extracted and included in the output:
- technical reports and extended versions of the papers
- external datasets which are mentioned in a paper but exist independently of that paper. Datasets should only be considered if they are explicitly mentioned as supplementary material
- existing software, libraries, APIs and technologies used to develop a system
To avoid confusion, the web site of the system (or model, ontology, prototype, etc.) is instead considered as supplementary material for the purposes of this task.
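A first processing step for this query could be collecting candidate URLs from the extracted fulltext; deciding which candidates are genuinely supplementary material still requires the inclusion/exclusion rules above. This is a rough heuristic sketch, not part of the official rules:

```python
import re

# Match http(s) URLs, stopping at whitespace and common delimiters.
# The pattern is a simplification and may need tuning for real PDFs.
URL_RE = re.compile(r'https?://[^\s<>")\]]+')

def candidate_urls(fulltext: str) -> list:
    """Return all candidate URLs found in the extracted text
    (fulltext, footnotes and appendices concatenated)."""
    return URL_RE.findall(fulltext)
```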
Expected output format (CSV):
material-url
<IRI>
<IRI>
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q3.15: Identify the supplementary material(s) for the paper http://ceur-ws.org/Vol-1521/paper6.pdf
material-url
"https://github.com/avijit1990"^^xsd:anyURI
"https://github.com/rishabhmisra"^^xsd:anyURI
Query Q3.26: Identify the supplementary material(s) for the paper https://trac.cs.upb.de/mechatronicuml/wiki/PaperModevva2015
material-url
"https://trac.cs.upb.de/mechatronicuml/wiki/PaperModevva2015"^^xsd:anyURI
Query: Identify the titles of the first-level sections of the paper X.
This year we would like to go deeper into the content of the papers. As a first step, participants are required to extract the titles of the first-level sections of each paper. Though nested levels would be equally interesting, we limit the analysis to the main level only.
Sections must be represented as resources in the produced dataset identified by the section-iri value.
Section titles can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in titles has to be treated as normal text.
Participants are also required to identify the number of each section, even if the sections are not numbered in the original PDF source.
The numbering has to start from 1. The representation has to use arabic numerals, even if the original paper used roman numerals or letters.
Important. The following rules apply to special sections:
- Abstracts must NOT be included in the output, unless the paper is abstract-only; in that case, the output has to indicate one section titled 'Abstract' and numbered '1'
- The Reference section must be included in the output. For uniformity, it must be numbered even if it was not numbered in the original PDF source
- Acknowledgements sections must be identified as separate sections
- for uniformity, these must be numbered even if they were not numbered in the original PDF source
- acknowledgements must be considered as separate sections even if they are just formatted as special paragraphs at the end of the paper; if instead the acknowledgements are in a footnote or in the main text of the paper, they are not relevant for the purposes of this task.
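The position-based renumbering implied by the rules above can be sketched as follows, assuming section titles have already been extracted in document order. This is an illustrative sketch covering only the Abstract rule explicitly; the helper name is an assumption:

```python
def number_sections(titles):
    """Assign arabic numbers by position, starting from 1, regardless of
    any original labels (roman numerals, letters). Abstracts are dropped
    unless the paper is abstract-only, per the rules above."""
    body = [t for t in titles if t.strip().lower() != "abstract"]
    if not body:  # abstract-only paper: one section titled 'Abstract'
        body = ["Abstract"]
    return list(enumerate(body, start=1))
```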
During the evaluation process, section titles will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Expected output format (CSV):
section-iri, section-number, section-title
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q4.1: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1518/paper1.pdf
section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1518-paper1_sec1>, "1"^^xsd:integer, "INTRODUCTION"
<http://ceur-ws.org/section/vol-1518-paper1_sec2>, "2"^^xsd:integer, "PREDICTING ACADEMIC RISK"
<http://ceur-ws.org/section/vol-1518-paper1_sec3>, "3"^^xsd:integer, "VISUALIZING UNCERTAINTY"
<http://ceur-ws.org/section/vol-1518-paper1_sec4>, "4"^^xsd:integer, "CASE-STUDY: RISK TO FAIL"
<http://ceur-ws.org/section/vol-1518-paper1_sec5>, "5"^^xsd:integer, "CONCLUSIONS AND FURTHER WORK"
<http://ceur-ws.org/section/vol-1518-paper1_sec6>, "6"^^xsd:integer, "ACKNOWLEDGMENTS"
<http://ceur-ws.org/section/vol-1518-paper1_sec7>, "7"^^xsd:integer, "REFERENCES"
Query Q4.10: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1521/paper1.pdf
section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1521-paper1_sec1>, "1"^^xsd:integer, "Abstract"
Query Q4.44: Identify the first-level sections of the paper http://ceur-ws.org/Vol-1320/paper_22.pdf
section-iri, section-number, section-title
<http://ceur-ws.org/section/vol-1320-paper_22_sec1>, "1"^^xsd:integer, "Introduction"
<http://ceur-ws.org/section/vol-1320-paper_22_sec2>, "2"^^xsd:integer, "Backgrounds"
<http://ceur-ws.org/section/vol-1320-paper_22_sec3>, "3"^^xsd:integer, "Material and Methods"
<http://ceur-ws.org/section/vol-1320-paper_22_sec4>, "4"^^xsd:integer, "Results"
<http://ceur-ws.org/section/vol-1320-paper_22_sec5>, "5"^^xsd:integer, "Discussion"
<http://ceur-ws.org/section/vol-1320-paper_22_sec6>, "6"^^xsd:integer, "Conclusion"
<http://ceur-ws.org/section/vol-1320-paper_22_sec7>, "7"^^xsd:integer, "Authors’ contributions"
<http://ceur-ws.org/section/vol-1320-paper_22_sec8>, "8"^^xsd:integer, "Acknowledgements"
<http://ceur-ws.org/section/vol-1320-paper_22_sec9>, "9"^^xsd:integer, "Conflict of Interest"
<http://ceur-ws.org/section/vol-1320-paper_22_sec10>, "10"^^xsd:integer, "References"
Query: Identify the captions of the tables in the paper X
Participants are also required to extract information about other structural components of the papers, including tables.
As a first step, they are asked to extract the captions of the tables. These captions can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in the caption has to be treated as normal text.
Tables must be represented as resources in the produced dataset identified by the table-iri value.
Participants are also required to identify the number of each table.
Important. Caption labels, such as 'Table', 'Tab.', etc., must not be part of the number (which is an integer value).
The numbering has to start from 1. The representation has to use arabic numerals, even if the original paper used roman numerals or letters.
During the evaluation process, captions will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Expected output format (CSV):
table-iri, table-number, table-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q5.2: Identify the captions of the tables in the paper http://ceur-ws.org/Vol-1518/paper2.pdf
table-iri, table-number, table-caption
<http://ceur-ws.org/table/vol-1518-paper2_tab1>, "1"^^xsd:integer, "Top 10 candidate summary sentences for Example 1"
<http://ceur-ws.org/table/vol-1518-paper2_tab2>, "2"^^xsd:integer, "Top 10 candidate summary sentences for Example 2"
Query Q5.11: Identify the captions of the tables in the paper http://ceur-ws.org/Vol-1521/paper2.pdf
table-iri, table-number, table-caption
<http://ceur-ws.org/table/vol-1521-paper2_tab1>, "1"^^xsd:integer, "Number Of Named Entities Per Each Type In NER Data Sets"
<http://ceur-ws.org/table/vol-1521-paper2_tab2>, "2"^^xsd:integer, "CoNLL F1 Scores on Turkish Formal Data Sets"
<http://ceur-ws.org/table/vol-1521-paper2_tab3>, "3"^^xsd:integer, "Results on Turkish Informal Data Sets"
<http://ceur-ws.org/table/vol-1521-paper2_tab4>, "4"^^xsd:integer, "Results on the MSM 2013 Data Set, ConLL F1 scores"
<http://ceur-ws.org/table/vol-1521-paper2_tab5>, "5"^^xsd:integer, "Results on the Ritter Data Set, ConLL F1 scores"
<http://ceur-ws.org/table/vol-1521-paper2_tab6>, "6"^^xsd:integer, "Top-5 Neighbours wrt Turkish Word Embeddings"
Query: Identify the captions of the figures in the paper X
Participants are also required to extract information about figures included in the papers.
As a first step, they are asked to extract the captions of the figures. These captions can be in lowercase or uppercase (they will be normalized during the evaluation). For the sake of simplicity, subscript or superscript text in the caption has to be treated as normal text.
Important. In-line figures with no caption must not be taken into account. For the sake of simplicity, a figure composed of subfigures - with only one caption - has to be considered as one single figure (the caption describes all subfigures). Listings, pseudocode and algorithms are not relevant for the purpose of this task.
Figures must be represented as resources in the produced dataset identified by the figure-iri value.
Participants are also required to identify the number of each figure.
Important. Caption labels, such as 'Figure', 'Fig.', etc., must not be part of the number (which is an integer value). The number of each figure has to match its position in the paper.
The numbering has to start from 1. The representation has to use arabic numerals, even if the original paper used roman numerals or letters.
During the evaluation process, captions will be normalized to lowercase; spaces, punctuation and special characters will be stripped.
Expected output format (CSV):
figure-iri, figure-number, figure-caption
<IRI>,xsd:integer,rdfs:Literal
<IRI>,xsd:integer,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q6.8: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1518/paper8.pdf
figure-iri, figure-number, figure-caption
<http://ceur-ws.org/figure/vol-1518-paper8_fig1>, "1"^^xsd:integer, "LAK Explorer components"
<http://ceur-ws.org/figure/vol-1518-paper8_fig2>, "2"^^xsd:integer, "LAK Explorer home page"
<http://ceur-ws.org/figure/vol-1518-paper8_fig3>, "3"^^xsd:integer, "Using autocomplete"
<http://ceur-ws.org/figure/vol-1518-paper8_fig4>, "4"^^xsd:integer, "The search results page"
<http://ceur-ws.org/figure/vol-1518-paper8_fig5>, "5"^^xsd:integer, "The search results page"
<http://ceur-ws.org/figure/vol-1518-paper8_fig6>, "6"^^xsd:integer, "Browsing similar papers"
<http://ceur-ws.org/figure/vol-1518-paper8_fig7>, "7"^^xsd:integer, "Visually representing similar papers"
Query Q6.9: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1518/paper9.pdf
figure-iri, figure-number, figure-caption
<http://ceur-ws.org/figure/vol-1518-paper9_fig1>, "1"^^xsd:integer, "Diagrammatic view of the methodological steps followed in this study."
<http://ceur-ws.org/figure/vol-1518-paper9_fig2>, "2"^^xsd:integer, "Four co-occurrence matrices constructed from the term-by-document matrix."
<http://ceur-ws.org/figure/vol-1518-paper9_fig3>, "3"^^xsd:integer, "Sampling of the output from the analysis"
Query Q6.23: Identify the captions of the figures in the paper http://ceur-ws.org/Vol-1317/om2014_Tpaper1.pdf
figure-iri, figure-number, figure-caption
Query: Identify the funding agencies that funded the research presented in the paper X (or part of it).
Participants are required to extract the funding agencies explicitly mentioned in the paper. The analysis is restricted to these agencies only.
Each agency is identified by a name, an acronym, or both. All such data in the paper must be extracted. Data must be copied directly from the paper, without consulting external data sources.
Funding agencies must be represented as resources in the produced dataset identified by the funding-agency-iri value.
Note: in case of papers whose research is supported by an EU project, the EU Commission must not be included among the funding agencies. That case is covered by query Q2.8.
Punctuation, spaces, prepositions and articles in the agency name will be normalized during the evaluation process.
Further notes:
- for the sake of simplicity, if the paper mentions a project without any information about its funding agency, this must not be included
- for the same reason, the hierarchical organization of agencies is not taken into account; if a funding agency is listed as a body of another funding agency (for instance, 'National Center For Advancing Translational Sciences of the National Institutes of Health'), the full name has to be considered
- the article 'the' in the funding agency name is not relevant
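The agency-name normalization above can be sketched as follows. The text only guarantees that punctuation, spaces, prepositions and articles are normalized; the exact stop-word list used here is an assumption:

```python
import re

# Assumed stop words (articles and common prepositions); the official
# evaluation list may differ.
STOP_WORDS = {"the", "of", "for", "in"}

def normalize_agency(name: str) -> str:
    """Lowercase, replace punctuation with spaces, then drop stop words
    and join the remaining words, per the evaluation rules above."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return "".join(w for w in words if w not in STOP_WORDS)
```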
Expected output format (CSV):
funding-agency-iri, funding-agency-name, funding-agency-acronym
<IRI>,rdfs:Literal,rdfs:Literal
<IRI>,rdfs:Literal,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q7.33: Identify the funding agencies that supported the research presented in the paper http://ceur-ws.org/Vol-1006/paper5.pdf (or part of it)
funding-agency-iri, funding-agency-name, funding-agency-acronym
<http://ceur-ws.org/funding-agency/center-for-service-innovation>, "Center for Service Innovation", "CSI"
Query Q7.17: Identify the funding agencies that supported the research presented in the paper http://ceur-ws.org/Vol-1500/paper1.pdf (or part of it)
funding-agency-iri, funding-agency-name, funding-agency-acronym
<http://ceur-ws.org/funding-agency/nserc>, , "NSERC"
Query: Identify the EU project(s) that supported the research presented in the paper X (or part of it).
The analysis is restricted to projects explicitly mentioned in the paper. The name of the projects must be copied directly from the paper, without looking at external data sources.
Projects must be represented as resources in the produced dataset identified by the project-iri value.
Punctuation, spaces, prepositions and articles in these values will be normalized during the evaluation process.
Further notes:
- projects are identified by their name. If the paper mentions both the name and the EU agreement number, it is enough to include the name.
- if the paper only mentions the number of the project, with no information about the name, the number must be included
- the name of the project must be included without the string 'project'
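The last rule above (dropping the string 'project' from the extracted name) can be sketched as a small helper; the function name is hypothetical:

```python
import re

def clean_project_name(name: str) -> str:
    """Drop the standalone word 'project' from an extracted project name,
    per the rule above, e.g. 'OPTIQUE project' -> 'OPTIQUE'."""
    return re.sub(r"\bproject\b", "", name, flags=re.IGNORECASE).strip()
```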
Expected output format (CSV):
project-iri, project-name
<IRI>,rdfs:Literal
<IRI>,rdfs:Literal
[...]
Some examples of output are shown below, others can be found in the training dataset files.
Query Q8.3: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1518/paper3.pdf (or part of it)
project-iri, project-name
<http://ceur-ws.org/project/open-discovery-space>, "Open Discovery Space"
Query Q8.42: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1320/paper_7.pdf (or part of it)
project-iri, project-name
<http://ceur-ws.org/project/linked2safety>, "Linked2Safety"
<http://ceur-ws.org/project/geoknow>, "GeoKnow"
Query Q8.45: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1320/paper_31.pdf (or part of it)
project-iri, project-name
<http://ceur-ws.org/project/optique>, "OPTIQUE"
Query Q8.37: Identify the EU project(s) that supported the research presented in the paper http://ceur-ws.org/Vol-1309/paper2.pdf
project-iri, project-name
<http://ceur-ws.org/project/246016>, "246016"