Nested volumes in XML #317
Comments
Also addresses #291. |
Sounds like we have some consensus, here. I agree that And yes, if we eventually future-proof the IDs, we could nicely concatenate all IDs from the root down. This suggests that the paper ID here should be
<root id="P18">
  <volume id="1"> <!-- long papers -->
    <editor>...</editor>
    <paper id="001">
      ...
    </paper>
  </volume>
</root>
(or perhaps the first paper ID is |
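To make the "concatenate all IDs from the root down" idea concrete, here is a minimal Python sketch. It assumes, based only on the URLs quoted elsewhere in this thread (e.g. N19-1001), that the full ID is the root ID, a hyphen, the volume ID, and the paper ID. The helper name `full_ids` and the sample data are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data following the structure sketched above.
SNIPPET = """
<root id="P18">
  <volume id="1">
    <paper id="001"/>
    <paper id="002"/>
  </volume>
</root>
"""

def full_ids(root):
    """Yield one concatenated ID per paper, walking from the root down."""
    for volume in root.findall("volume"):
        for paper in volume.findall("paper"):
            # Assumed layout: <root id>-<volume id><paper id>
            yield f"{root.get('id')}-{volume.get('id')}{paper.get('id')}"

ids = list(full_ids(ET.fromstring(SNIPPET)))
print(ids)
```

If paper IDs end up stored without leading zeros, the padding would have to happen in code instead, and that width is not something this sketch tries to guess.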
I wrote conversion scripts and updated the schemas. Next will be the Anthology code. I may be offline this weekend, so if someone else is into it, feel free to go ahead. Otherwise I'll try to get to it early next week. |
Do you envision keeping the current three-level structure and just making the structure be reflected in the XML, or are you envisioning making the structure more flexible, for example, to allow four levels of nesting when needed? I do think it's good to choose the tags carefully, since we'll be stuck with them for a while! I actually think we should get rid of |
+1 for As to volumes, we're sometimes wrapping full proceedings (e.g. for workshops), sometimes only parts of them (e.g. the long paper category of a conference), so "volume" is kind of the best generic term already, IMO. But if we were to replace it, I'd say |
What about So it’d be I was only thinking of doing three levels for now. I don’t have a use-case in mind for four (do you?) and this wasn’t that much work, so it wouldn’t be hard to expand it later. |
Could you give an example of what you're thinking for videos? I thought that would be at the third level, not the second. A use-case for four levels was #285, where an issue of AJCL included the proceedings of ACL 1975. But it's arguable whether you'd truly want four levels. |
While we're discussing this...currently, if I understand correctly, we have
which works but feels inconsistent (and leads to some complicated code in anthologize.pl). Another case where the current setup feels inconsistent is that sometimes there is "middle matter" (like an introduction page for a special session, maybe) which is given its own pdf and bib entry (unlike front matter, which only gets its own pdf), and frequently there is back matter which does not get either its own pdf or bib entry. Would the new structure offer any better alternative? I don't have a great solution, though...the best I can think of is to have special paper ids
|
I am not sure that the single case of #285 warrants extending the format to support four levels, but maybe I am just saying I don't want to do it myself. Yes, you're right, videos would appear at level three. I haven't given it a lot of thought, but the idea was something like:
<collection>
  <proceedings>
    <talk>
      <author>...</author>
      <title>...</title>
      <video>...</video>
    </talk>
  </proceedings>
</collection>
|
This is my understanding. Do people ever cite whole proceedings? For the full format (tagging #291), I was thinking maybe we should change the delimiter to as to distinguish it (e.g., (This brings to mind another thing I'd like to add, the handbook, which has a local guide which, when done right, is personalized and entertaining and worth reading (e.g., ACL 2014). |
I think no one cites whole proceedings, but some people might use No, I don't think anyone ever cites back matter, but I can imagine someone might want to read the PDF of the back matter without downloading the PDF of the whole proceedings. If we continue to number front matter like any other paper, then perhaps |
I think this sounds reasonable. |
Recording steps here, since I've had to rebase a few times:
|
I'm picking this up again after some time, and hope to finish it this week. I'd like some feedback. In the latest iteration (incorporating thoughts from above), I am removing redundant details from papers (e.g., address, publisher, year, month), since they can just inherit this from the
<?xml version='1.0' encoding='UTF-8'?>
<collection id="N19">
<proceedings id="1">
<booktitle>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</booktitle>
<editor><first>Jill</first><last>Burstein</last></editor>
<editor><first>Christy</first><last>Doran</last></editor>
<editor><first>Thamar</first><last>Solorio</last></editor>
<publisher>Association for Computational Linguistics</publisher>
<address>Minneapolis, Minnesota</address>
<month>June</month>
<year>2019</year>
<paper id="0">
<title>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</title>
<url>N19-1</url>
</paper>
<paper id="1">
<title>Entity Recognition at First Sight: <fixed-case>I</fixed-case>mproving <fixed-case>NER</fixed-case> with Eye Movement Information</title>
<author><first>Nora</first><last>Hollenstein</last></author>
<author><first>Ce</first><last>Zhang</last></author>
<pages>1–10</pages>
<abstract>Previous research shows that eye-tracking data contains information about the lexical and syntactic properties of text, which can be used to improve natural language processing models. In this work, we leverage eye movement features from three corpora with recorded gaze information to augment a state-of-the-art neural model for named entity recognition (NER) with gaze embeddings. These corpora were manually annotated with named entity labels. Moreover, we show how gaze features, generalized on word type level, eliminate the need for recorded eye-tracking data at test time. The gaze-augmented models for NER using token-level and type-level features outperform the baselines. We present the benefits of eye-tracking features by evaluating the NER models on both individual datasets as well as in cross-domain settings.</abstract>
<url>N19-1001</url>
</paper>
<paper id="2">
Any strong objections? |
No. 1) I can't think of any reason why that information could diverge, and 2) as you said, it is easy to re-obtain it from the parent node. Did you update the schema file? If so, a few comments in there would be nice. The schema is complex enough to warrant it. |
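For what "re-obtain it from the parent node" could look like in practice, here is a rough sketch using Python's standard `xml.etree.ElementTree`. The helper `lookup` and the abbreviated sample data are assumptions for illustration, not the Anthology's actual code.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical version of the example above.
SNIPPET = """
<collection id="N19">
  <proceedings id="1">
    <publisher>Association for Computational Linguistics</publisher>
    <year>2019</year>
    <paper id="1">
      <title>Entity Recognition at First Sight</title>
    </paper>
  </proceedings>
</collection>
"""

def lookup(paper, field, parent_of):
    """Return the text of `field` from the paper, else from its parent volume."""
    node = paper.find(field)
    if node is None:
        node = parent_of[paper].find(field)
    return node.text if node is not None else None

root = ET.fromstring(SNIPPET)
# ElementTree elements have no parent pointer, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}
paper = root.find("./proceedings/paper")
print(lookup(paper, "year", parent_of))   # taken from the volume
print(lookup(paper, "title", parent_of))  # taken from the paper itself
```

The lookup only goes one level up, which matches the three-level structure discussed here; a deeper hierarchy would need a loop over ancestors.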
The schema is updated (edit: but not yet pushed) and I’ll add comments. Another variation would be to gather the volume specific information in a
This would make it a bit simpler to iterate over papers, i.e., the children of proceedings, needing only to skip over the “meta” child. |
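A small sketch of the iteration described here, assuming the metadata block is a child named `meta` (the exact tag name and the sample data are assumptions). It shows both skipping the metadata child and selecting papers directly with ElementTree's limited XPath support.

```python
import xml.etree.ElementTree as ET

# Hypothetical volume with a metadata block followed by papers.
SNIPPET = """
<proceedings id="1">
  <meta>
    <booktitle>Example Proceedings</booktitle>
    <year>2019</year>
  </meta>
  <paper id="1"><title>First paper</title></paper>
  <paper id="2"><title>Second paper</title></paper>
</proceedings>
"""

volume = ET.fromstring(SNIPPET)

# Option 1: walk all children and skip the metadata block.
by_skipping = [child.get("id") for child in volume if child.tag != "meta"]

# Option 2: select <paper> children directly by path.
by_path = [paper.get("id") for paper in volume.findall("./paper")]

print(by_skipping, by_path)
```

Option 2 keeps working if other non-paper children are added later, while option 1 would then need a longer skip list.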
Using xpath, I would use something like |
I like that the Can |
I also see I missed @mbollmann’s comment above about using
So to be explicit I think we are converging on:
|
@davidweichiang I think that only proceedings should have a If collections would also have a meta section, the semantics are not clear to me. Should the collection actually have editors etc. or are these only properties of the volumes? If the collection has editors A and B, does defining an editor in a proceeding overwrite or append? It is certainly possible to define all that, but the whole logic will be much more complicated and we save maybe a few kilobytes of data, if at all. |
All right, in the latest push I've committed the schema, conversion script, and the converted |
The current iteration looks good to me. Does the conversion script preserve the front matter entries? I think while we're converting this, it would be a good idea to implement logic that
|
It does preserve them. I’ve been adopting the convention of assigning frontmatter to be paper ID 0; per David’s suggestion above, it is the bib entry for the proceedings itself, and the PDF is just the frontmatter. We could use a special attribute which would make these semantics clearer ( |
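As an illustration of the "paper ID 0 is the frontmatter" convention, here is a minimal sketch that separates the frontmatter entry from regular papers. The function name `split_frontmatter` is made up, and the sample data is reduced to the URLs from the example earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical volume: paper 0 is the frontmatter entry.
SNIPPET = """
<proceedings id="1">
  <paper id="0"><url>N19-1</url></paper>
  <paper id="1"><url>N19-1001</url></paper>
</proceedings>
"""

def split_frontmatter(volume):
    """Return (frontmatter element or None, list of regular papers)."""
    frontmatter = None
    papers = []
    for paper in volume.findall("paper"):
        if paper.get("id") == "0":  # convention described in this thread
            frontmatter = paper
        else:
            papers.append(paper)
    return frontmatter, papers

front, papers = split_frontmatter(ET.fromstring(SNIPPET))
print(front.findtext("url"), [p.findtext("url") for p in papers])
```

A volume without an ID-0 entry, such as a journal issue that skips it, simply yields `None` for the frontmatter.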
My quibble with The questions that @akoehn raised about the semantics of
I have no idea why you'd have that, but presumably the paper year would override the meta year. For BibTeX generation, there will need to be special rules that cause |
In my opinion, a node should never have a property that an ancestor has as well. Only then do we not have to think about inheritance or similar concepts. In @mjpost's schema, only the proceedings have My mental concept is not that of a template, the information in the Interestingly (and maybe contradictory to my claim above), papers currently may also have |
My original thinking was in fact an inheritance model, but we don’t actually have a use case and I’d be in favor of Arne’s proposal here. I think the meaning of meta here is clear and not too meta. I also agree about meta applying only at the volume and not collection level. Collections in this new scheme are just logical groupings of volumes. For example we could move to putting all of NAACL in a single For bibtex:
if paper.title:
    title = paper.title
    field = 'title'
else:
    title = volume.booktitle
    field = 'booktitle'
and similar for “author”. |
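The rule above can be turned into a small runnable sketch. The `Paper` and `Volume` classes here are stand-ins invented for the example, not the Anthology's real data model, and the function name `bibtex_title` is likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Volume:
    booktitle: str

@dataclass
class Paper:
    title: Optional[str] = None

def bibtex_title(paper: Paper, volume: Volume) -> Tuple[str, str]:
    """Return (BibTeX field name, value) for an entry's title."""
    if paper.title:
        return "title", paper.title
    # An entry without its own title falls back to the volume's booktitle.
    return "booktitle", volume.booktitle

volume = Volume(booktitle="Proceedings of an Example Conference")
print(bibtex_title(Paper(title="A Regular Paper"), volume))
print(bibtex_title(Paper(), volume))
```

The same fallback shape would presumably apply to "author" versus "editor", but that mapping is not spelled out in the thread, so it is left out here.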
I took a look at the schema and understand now. I think it works to require each field to occur at a certain level, barring any unforeseen unusual situations. Should volume 1 have |
Re the front matter: Mainly I would like to get rid of the implicit logic which volumes have front matter and which don't. If you think identifying them by ID=0 is good enough then that's fine by me, as long as e.g. journals would skip that ID then. |
Yes http should have been removed. And yes paper 0 will be fixed-width formatted, but applied in code instead of xml. Front-matter will be fixed. This PR also addresses #156 by only generating links when an explicit |
If the frontmatter behaves differently (always id=0 and no author/title) why not make it its own type of entity? I see no downside to that. |
We could do that but it seemed largely redundant with paper in the schema. |
Regarding the
Therefore, it would be good to remove that field in the schema as part of this clean up. Regarding the front matter:
Having more structure encoded in the schema is a good thing imo, as it documents the intended structure and catches errors earlier (and probably in a clearer way) than in code. |
Do you imagine this in addition to
<collection>
  <volume>
    <frontmatter>
      <editor>...
      <year>...
      <address>...
      <publisher>...
    </frontmatter>
    <paper id="1">
etc. |
I think it would be ideal to move towards distinguishing between a volume's front matter and the volume itself, so I'd say that Before changing the schema to disallow papers from having editors, we should check the cases where they do. Most of them are in J74-J79, where conference proceedings, or collections of abstracts, have been published in the journal. |
Replacing it seems wrong: journals don't have front matter but still need meta information. But I think I agree that a front matter tag or attribute is preferable. Explicit is better than implicit, IMO. |
Okay, so @davidweichiang, I think the early J journals will have to be updated to be marked as frontmatter manually, since none of the |
Okay, I am ready to merge #324. Note that this may break many scripts that rely on the old format, but if you are iterating over the |
(I will wait a bit to do this, maybe till tomorrow, in case anyone has questions). |
Scripts:
- updated scripts for hierarchical format (#317)
- made version-swapping when adding revisions more robust
- minimal checking of extension type for attachments
Other changes:
- added missing abstract (closes #411)
- fixed address (closes #384)
- added correction (closes #404)
- added revision (closes #408)
- added revision to N19-1318 (closes #418)
- S19-2026 revision (closes #414)
- added missing author to S19-1906 (#408)
A summary of changes:
- Introduces a nested format (closes acl-org#317)
- URLs are stored using a relative format for internal links (closes acl-org#156), which facilitates mirroring (acl-org#295)
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes acl-org#181, closes acl-org#180), including journal frontmatter (acl-org#264) and volume PDFs (closes #31)
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31)
- It punts on C69 reformatting (closes acl-org#147)
Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (acl-org#226)
- This makes it much easier to add non-paper entries such as talks (acl-org#298), to add a volume-level "publication date" (acl-org#319), and to create an RSS feed of updates (acl-org#358)
Originally posted by @mjpost in #285 (comment)
This would address problems that have come up in:
My take on the tag names:
<volume> seems good to me for backwards-compatibility, as you can still query for <volume> and process its children like before, but I don't know if that's actually true or relevant for anything.
<proceedings> for the top-level tag as personally I like to think of the inner volumes as the "proceedings". Maybe that's a subjective thing, but as this tag doesn't actually carry a lot of semantics except grouping proceedings volumes (see?) by a common prefix, maybe we could just go with <root> or something else that's pretty neutral?