Nested volumes in XML #317

Closed
mbollmann opened this issue May 8, 2019 · 38 comments

Comments

@mbollmann
Member

Originally posted by @mjpost in #285 (comment)

I like this idea of having nested volumes, and agree we should have two different names. It is more XML-like (and clearer) to host each of the separate sub-volumes (long papers, tutorial abstracts, etc.) hierarchically underneath the top-level volume (e.g., P18). So we could do something like:

<proceedings id="P18">
  <volume id="1"> <!-- long papers -->
    <editor>...</editor>
    <paper id="1001">
      ...
    </paper>
  </volume>
</proceedings>

This would also help with storing extra-conference materials as discussed in #298.

This would address problems that have come up in:

My take on the tag names:

  1. Keeping the inner volumes under <volume> seems good to me for backwards-compatibility, as you can still query for <volume> and process its children like before, but I don't know if that's actually true or relevant for anything.
  2. I don't like <proceedings> for the top-level tag as personally I like to think of the inner volumes as the "proceedings". Maybe that's a subjective thing, but as this tag doesn't actually carry a lot of semantics except grouping proceedings volumes (see?) by a common prefix, maybe we could just go with <root> or something else that's pretty neutral?
@davidweichiang
Collaborator

Also addresses #291.

@mjpost
Member

mjpost commented May 8, 2019

Sounds like we have some consensus here. I agree that proceedings is a bad choice for the top-level name. root is a good compromise. Another option is volumes.

And yes, if we eventually future-proof the IDs, we could nicely concatenate all IDs from the root down. This suggests that the paper ID here should be 001 instead of 1001:

<root id="P18">
  <volume id="1"> <!-- long papers -->
    <editor>...</editor>
    <paper id="001">
      ...
    </paper>
  </volume>
</root>

(Or perhaps the first paper ID is 1, and decimal padding is determined with an attribute on the proceedings or volume, which would make backwards compatibility seamless if in the future we moved to IDs of the form P21-1-1.)
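
A minimal sketch of the concatenation idea, in Python. The function name full_id, the width parameter (standing in for the padding attribute suggested above), and the sep parameter are all hypothetical and not taken from the Anthology code.

```python
def full_id(collection_id: str, volume_id: str, paper_id: int,
            width: int = 3, sep: str = "") -> str:
    """Concatenate IDs from the root down.

    width: zero-padding applied to the paper number (hypothetical attribute).
    sep:   separator between volume and paper; "" reproduces legacy IDs
           such as P18-1001, while "-" gives the P21-1-1 form.
    """
    paper_part = str(paper_id).zfill(width)
    return f"{collection_id}-{volume_id}{sep}{paper_part}"


legacy = full_id("P18", "1", 1)                    # legacy-style ID
future = full_id("P21", "1", 1, width=1, sep="-")  # fully delimited ID
print(legacy, future)
```

With the padding kept out of the stored paper ID, the same XML could be rendered in either form just by changing the two formatting parameters.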

@mjpost
Member

mjpost commented May 11, 2019

root works in-file but is strange in anthology.py when referring to multiple XML files. How about collection for the top-level tag?

I wrote conversion scripts and updated the schemas. Next will be the Anthology code. I may be offline this weekend, so if someone else is into it, feel free to go ahead. Otherwise I'll try to get to it early next week.

@davidweichiang
Collaborator

Do you envision keeping the current three-level structure and just making the structure be reflected in the XML, or are you envisioning making the structure more flexible, for example, to allow four levels of nesting when needed?

I do think it's good to choose the tags carefully, since we'll be stuck with them for a while! I actually think we should get rid of <volume> altogether, to clearly distinguish from the old schema, and because there is so much ambiguity in the code about what a "volume" is. But I don't have a good alternative yet...

@mbollmann
Member Author

+1 for <collection> as the root tag.

As to volumes, we're sometimes wrapping full proceedings (e.g. for workshops), sometimes only parts of them (e.g. the long paper category of a conference), so "volume" is kind of the best generic term already, IMO. But if we were to replace it, I'd say <proceedings> would also work, even though it's not quite accurate in all circumstances.

@mjpost
Member

mjpost commented May 11, 2019

What about book to replace the current volume? It reflects the (mostly) former real-world practice, as well as the terminology that’s often used in instructions to pub and workshop chairs (sometimes called “book” chairs).

So it’d be collection → book → paper, though we could have a different second-level tag for other things like keynote videos and so on.

I was only thinking of doing three levels for now. I don’t have a use-case in mind for four (do you?) and this wasn’t that much work, so it wouldn’t be hard to expand it later.

@davidweichiang
Collaborator

Could you give an example of what you're thinking for videos? I thought that would be at the third level, not the second.

A use-case for four levels was #285, where an issue of AJCL included the proceedings of ACL 1975. But it's arguable whether you'd truly want four levels.

@davidweichiang
Collaborator

While we're discussing this...currently, if I understand correctly, we have

  • Z99-9000.pdf is the front matter
  • Z99-9000.bib is the bib entry for the whole proceedings
  • Z99-9.pdf is the whole proceedings
  • Z99-9.bib are the bib entries for all the papers in the proceedings

which works but feels inconsistent (and leads to some complicated code in anthologize.pl).

Another case where the current setup feels inconsistent is that sometimes there is "middle matter" (like an introduction page for a special session, maybe) which is given its own pdf and bib entry (unlike front matter, which only gets its own pdf), and frequently there is back matter which does not get either its own pdf or bib entry.

Would the new structure offer any better alternative? I don't have a great solution, though...the best I can think of is to have special paper ids front and back, so

  • Z99-9 would be the whole proceedings
  • Z99-9-front would be the front matter
  • Z99-9-001 would be the first paper
  • Z99-9-back would be the back matter

@mjpost
Member

mjpost commented May 14, 2019

I am not sure that the single case of #285 warrants extending the format to support four levels, but maybe I am just saying I don't want to do it myself.

Yes, you're right, videos would appear at level three. I haven't given it a lot of thought, but the idea was something like:

<collection>
    <proceedings>
        <talk>
            <author>...</author>
            <title>...</title>
            <video>...</video>
        </talk>
    </proceedings>
</collection>

@mjpost
Member

mjpost commented May 14, 2019

While we're discussing this...currently, if I understand correctly, we have

  • Z99-9000.pdf is the front matter
  • Z99-9000.bib is the bib entry for the whole proceedings
  • Z99-9.pdf is the whole proceedings
  • Z99-9.bib are the bib entries for all the papers in the proceedings

This is my understanding. Do people ever cite whole proceedings? Z99-9000.bib is the most displeasing from this list and it would make sense to let people cite the front matter itself (e.g., "ACL 2019 was the largest ACL conference, ever (cite front matter that has stats)").

For the full format (tagging #291), I was thinking maybe we should change the delimiter so as to distinguish it (e.g., Z99.9.1 or Z99.9.001). Changing the front matter bib would address your problem. We could add keywords like front and back but we could also just make front and back matter papers 0 and N. But what is in back matter, and do people ever cite it? I'm just thinking of the index.

(This brings to mind another thing I'd like to add, the handbook, which has a local guide which, when done right, is personalized and entertaining and worth reading (e.g., ACL 2014).)

@davidweichiang
Collaborator

I think no one cites whole proceedings, but some people might use crossref to point to the BibTeX entry for a proceedings.

No, I don't think anyone ever cites back matter, but I can imagine someone might want to read the PDF of the back matter without downloading the PDF of the whole proceedings.

If we continue to number front matter like any other paper, then perhaps Z99-9000.bib could become the bib for the front matter and Z99-9.bib could contain the bib entries for both the whole proceedings as well as all the individual papers? Or, Z99-9.bib for the whole proceedings and Z99-9-all.bib for the individual papers?

@mjpost
Member

mjpost commented May 15, 2019

Or, Z99-9.bib for the whole proceedings and Z99-9-all.bib for the individual papers?

I think this sounds reasonable.

@mjpost
Member

mjpost commented Jun 1, 2019

Recording steps here, since I've had to rebase a few times:

  • run bin/consolidate_urls.py to make the <url> tag explicit
  • run bin/make_hierarchical.py, which converts to a hierarchical representation and adds volume-level URL tags for volumes that exist
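
For readers who want a feel for the second step, here is a rough sketch of a flat-to-hierarchical conversion. It is not bin/make_hierarchical.py. It assumes a simplified old layout (a single <volume id="P18"> holding papers with four-digit IDs whose first digit is the volume number) and ignores URL tags and volume metadata entirely.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified old-style input (not copied from the real data).
OLD = """<volume id="P18">
  <paper id="1001"><title>A</title></paper>
  <paper id="1002"><title>B</title></paper>
  <paper id="2001"><title>C</title></paper>
</volume>"""


def make_hierarchical(old_root: ET.Element) -> ET.Element:
    """Group flat papers into <collection>/<volume>/<paper>."""
    collection = ET.Element("collection", id=old_root.get("id"))
    volumes = {}
    for paper in old_root.findall("paper"):
        old_id = paper.get("id")
        # First digit is the volume, the rest is the paper number.
        vol_id, paper_num = old_id[0], str(int(old_id[1:]))
        if vol_id not in volumes:
            volumes[vol_id] = ET.SubElement(collection, "volume", id=vol_id)
        new_paper = ET.SubElement(volumes[vol_id], "paper", id=paper_num)
        new_paper.extend(list(paper))  # carry over the child elements
    return collection


new_root = make_hierarchical(ET.fromstring(OLD))
print(ET.tostring(new_root, encoding="unicode"))
```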

@mjpost
Member

mjpost commented Jun 2, 2019

I'm picking this up again after some time, and hope to finish it this week. I'd like some feedback. In the latest iteration (incorporating thoughts from above), I am removing redundant details from papers (e.g., address, publisher, year, month), since they can just inherit this from the <proceedings> tag that contains them. For example:

<?xml version='1.0' encoding='UTF-8'?>
<collection id="N19">
  <proceedings id="1">
    <booktitle>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</booktitle>
    <editor><first>Jill</first><last>Burstein</last></editor>
    <editor><first>Christy</first><last>Doran</last></editor>
    <editor><first>Thamar</first><last>Solorio</last></editor>
    <publisher>Association for Computational Linguistics</publisher>
    <address>Minneapolis, Minnesota</address>
    <month>June</month>
    <year>2019</year>
    <paper id="0">
      <title>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</title>
      <url>N19-1</url>
    </paper>
    <paper id="1">
      <title>Entity Recognition at First Sight: <fixed-case>I</fixed-case>mproving <fixed-case>NER</fixed-case> with Eye Movement Information</title>
      <author><first>Nora</first><last>Hollenstein</last></author>
      <author><first>Ce</first><last>Zhang</last></author>
      <pages>1–10</pages>
      <abstract>Previous research shows that eye-tracking data contains information about the lexical and syntactic properties of text, which can be used to improve natural language processing models. In this work, we leverage eye movement features from three corpora with recorded gaze information to augment a state-of-the-art neural model for named entity recognition (NER) with gaze embeddings. These corpora were manually annotated with named entity labels. Moreover, we show how gaze features, generalized on word type level, eliminate the need for recorded eye-tracking data at test time. The gaze-augmented models for NER using token-level and type-level features outperform the baselines. We present the benefits of eye-tracking features by evaluating the NER models on both individual datasets as well as in cross-domain settings.</abstract>
      <url>N19-1001</url>
    </paper>
    <paper id="2">

Any strong objections?

@akoehn
Member

akoehn commented Jun 3, 2019

Any strong objections?

No. 1) I can't think of any reason why that information could diverge and 2) as you said it is easy to re-obtain it from the parent node.

Do you update the schema file? If so, a few comments in there would be nice. The schema is complex enough to warrant it.

@mjpost
Member

mjpost commented Jun 3, 2019

The schema is updated (edit: but not yet pushed) and I’ll add comments.

Another variation would be to gather the volume specific information in a meta block, e.g.,

<collection id="N19">
  <proceedings id="1">
      <meta>
        <booktitle>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</booktitle>
        <editor><first>Jill</first><last>Burstein</last></editor>
        <editor><first>Christy</first><last>Doran</last></editor>
        <editor><first>Thamar</first><last>Solorio</last></editor>
        <publisher>Association for Computational Linguistics</publisher>
        <address>Minneapolis, Minnesota</address>
        <month>June</month>
        <year>2019</year>
      </meta>
      ...

This would make it a bit simpler to iterate over papers, i.e., the children of proceedings, needing only to skip over the “meta” child.

@akoehn
Member

akoehn commented Jun 3, 2019

Using xpath, I would use something like collection[@id="N19"]//paper to obtain all papers of a conference; there would be no difference between both approaches. However, I still like the meta tag as it stores away the metadata visually.
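
To make the comparison concrete, here is a small sketch using Python's xml.etree.ElementTree, which supports a limited XPath subset. The sample document is made up; only the element names come from this thread.

```python
import xml.etree.ElementTree as ET

DOC = """<collection id="N19">
  <proceedings id="1">
    <meta><year>2019</year></meta>
    <paper id="1"><title>First</title></paper>
    <paper id="2"><title>Second</title></paper>
  </proceedings>
</collection>"""

root = ET.fromstring(DOC)

# XPath-style: every <paper> descendant, wherever the metadata lives.
via_xpath = root.findall(".//paper")

# Manual iteration: walk the children of each <proceedings>, skipping <meta>.
via_children = [
    child
    for proceedings in root.findall("proceedings")
    for child in proceedings
    if child.tag != "meta"
]

# Both approaches select the same papers in the same order.
assert [p.get("id") for p in via_xpath] == [p.get("id") for p in via_children]
```

Without the <meta> wrapper, the manual loop would instead need to filter on every metadata tag name, which is where the wrapper saves some code.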

@davidweichiang
Collaborator

I like that the <meta> tag lets the data specify what fields are inherited, instead of a hard-coded list of inherited fields. But what does <meta> mean -- would something like <inherit> be more informative?

Can <proceedings> have a <meta> child also?

@mjpost
Member

mjpost commented Jun 3, 2019

I also see I missed @mbollmann’s comment above about using <volume> for the level 2 tag instead of <proceedings>. I agree about @davidweichiang’s point about using entirely new tags, but I think <volume> is more appropriate there than <proceedings>.

<meta> on the other hand is maybe too general, but I can’t think of a better one. We could add that to <proceedings> too but I don’t have a use case.

So to be explicit I think we are converging on:

<collection>
  <volume>
    <meta>...</meta>
    <paper>..</paper>
    ...
  </volume>
</collection>

@akoehn
Member

akoehn commented Jun 3, 2019

@davidweichiang I think that only proceedings should have a meta section. I like the term meta as it is standard (-> HTML uses it, I use it in corpora) and fitting (this denotes meta data to the proceedings in contrast to the content captured by the papers).

If collections would also have a meta section, the semantics are not clear to me. Should the collection actually have editors etc. or are these only properties of the volumes? If the collection has editors A and B, does defining an editor in a proceeding overwrite or append? It is certainly possible to define all that, but the whole logic will be much more complicated and we save maybe a few kilobytes of data, if at all.

@mjpost
Member

mjpost commented Jun 3, 2019

All right, in the latest push I've committed the schema, conversion script, and the converted N19.xml.

@mbollmann
Member Author

The current iteration looks good to me.

Does the conversion script preserve the front matter entries? While we're converting, it would be a good idea to implement logic that

  1. keeps front matter entries for non-journal volumes, but maybe explicitly marks them as such (Make conference front matter show that it is front matter #62); and
  2. throws away the dummy front matter entries for journals, as they're no longer needed.

@mjpost
Member

mjpost commented Jun 4, 2019

It does preserve them. I’ve been adopting the convention of assigning frontmatter to be paper ID 0; per David’s suggestion above, it is the bib entry for the proceedings itself, and the PDF is just the frontmatter. We could use a special attribute to make these semantics clearer (<paper role="frontmatter">?) if you have ideas, but I’m not sure this is necessary.

@davidweichiang
Collaborator

My quibble with <meta> is that everything in this file is metadata; it's not clear to me that what goes inside the <meta> tag is even more meta. My suggestion: <template>?

The questions that @akoehn raised about the semantics of <meta> are good ones; they should be ironed out for <volume> and <paper> as well. For example, what should this do:

<volume>
  <meta>
    <year>1999</year>
  </meta>
  <paper>
    <year>2000</year>
  </paper>
</volume>

I have no idea why you'd have that, but presumably the paper year would override the meta year.

For BibTeX generation, there will need to be special rules that cause <paper> not to inherit <editor>, unless it has id 0, in which case it does inherit <editor> but it doesn't inherit <booktitle>...is there any way to avoid such rules?

@akoehn
Member

akoehn commented Jun 4, 2019

In my opinion, a node should never have a property that an ancestor has as well. Only then can we avoid having to think about inheritance or similar concepts. In @mjpost's schema, only the proceedings have meta information.

My mental concept is not that of a template: the information in the meta tags is meta information of the proceedings. It would also not really make sense to talk about an editor of a paper; it is always the editor of proceedings, which consist of papers.

Interestingly (and maybe contradictory to my claim above), papers currently may also have editor fields -- this is currently the only overlap and I don't see the reason for that.
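
The "no property repeated from an ancestor" rule could be checked mechanically. The sketch below is only an illustration: the function name overlapping_fields is made up, and the input is the year example from the previous comment with a paper id added so that the clash can be reported.

```python
import xml.etree.ElementTree as ET

VOLUME = """<volume>
  <meta>
    <year>1999</year>
  </meta>
  <paper id="1">
    <year>2000</year>
  </paper>
</volume>"""


def overlapping_fields(volume: ET.Element) -> list:
    """Return (paper id, tag) pairs where a paper repeats a <meta> field."""
    meta = volume.find("meta")
    meta_tags = {child.tag for child in meta} if meta is not None else set()
    clashes = []
    for paper in volume.findall("paper"):
        for child in paper:
            if child.tag in meta_tags:
                clashes.append((paper.get("id"), child.tag))
    return clashes


clashes = overlapping_fields(ET.fromstring(VOLUME))
print(clashes)
```

If a check like this always comes back empty, no override-versus-append semantics need to be defined at all.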

@mjpost
Member

mjpost commented Jun 4, 2019

My original thinking was in fact an inheritance model, but we don’t actually have a use case and I’d be in favor of Arne’s proposal here. I think the meaning of meta here is clear and not too meta. I also agree about meta applying only at the volume and not collection level. Collections in this new scheme are just logical groupings of volumes. For example we could move to putting all of NAACL in a single naacl.xml file instead of splitting out by years.

For bibtex:

  • all papers do inherit both booktitle and editor; they just don’t all export it
  • we could handle the logic David pointed to by leaving off both title and authors from the frontmatter paper; and have simple logic that just defaults to the volume metadata in their absence. Something like:
if paper.title:
    title = paper.title
    field = "title"
else:
    title = volume.booktitle
    field = "booktitle"

and similar for “author”.

@davidweichiang
Collaborator

I took a look at the schema and understand now. I think it works to require each field to occur at a certain level, barring any unforeseen unusual situations.

Should volume 1 have <url>N19-1</url> instead of http:...?
Should paper 0 have <url>N19-1000</url> instead of N19-1?

@mbollmann
Member Author

Re the front matter: Mainly I would like to get rid of the implicit logic about which volumes have front matter and which don't. If you think identifying them by ID=0 is good enough then that's fine by me, as long as e.g. journals would skip that ID then.

@mjpost
Member

mjpost commented Jun 4, 2019

Yes http should have been removed. And yes paper 0 will be fixed-width formatted, but applied in code instead of xml.

Front-matter will be fixed. This PR also addresses #156 by only generating links when an explicit <url> tag is present.

@akoehn
Member

akoehn commented Jun 4, 2019

If the frontmatter behaves differently (always id=0 and no author/title) why not make it its own type of entity? I see no downside to that.

@mjpost
Member

mjpost commented Jun 4, 2019

We could do that but it seemed largely redundant with paper in the schema.

@akoehn
Member

akoehn commented Jun 5, 2019

Regarding the editor field for papers I mentioned above, @mbollmann wrote:

The Rails app just silently ignored <editor> on regular papers. I copied the logic, but added the warning.

Therefore, it would be good to remove that field in the schema as part of this clean up.

Regarding the front matter:
It would be no additional content in the XML, but we could then enforce

  • papers to have authors
  • the front matter to not have authors
  • papers to have a positive (i.e. non-zero) ID
  • front matter to have id == 0

Having more structure encoded in the schema is a good thing imo, as it documents the intended structure and catches errors earlier (and probably in a clearer way) than in code.
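
The four constraints listed above are meant for the schema, but spelling them out in Python may make them easier to discuss. This is a sketch under assumptions: the element name frontmatter and the function check_volume are hypothetical, and the sample volume is invented to trigger two errors.

```python
import xml.etree.ElementTree as ET

VOLUME = """<volume id="1">
  <frontmatter id="0"/>
  <paper id="1"><author>A</author></paper>
  <paper id="0"><author>B</author></paper>
  <paper id="2"/>
</volume>"""


def check_volume(volume: ET.Element) -> list:
    """Collect violations of the proposed front matter / paper rules."""
    errors = []
    for fm in volume.findall("frontmatter"):
        if fm.get("id") != "0":
            errors.append("frontmatter: id must be 0")
        if fm.find("author") is not None:
            errors.append("frontmatter: must not have authors")
    for paper in volume.findall("paper"):
        pid = paper.get("id", "")
        if not (pid.isdigit() and int(pid) > 0):
            errors.append(f"paper {pid!r}: id must be a positive integer")
        if paper.find("author") is None:
            errors.append(f"paper {pid!r}: missing author")
    return errors


errors = check_volume(ET.fromstring(VOLUME))
for message in errors:
    print(message)
```

A schema-level version would reject the same two papers at validation time, before any code runs.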

@mjpost
Member

mjpost commented Jun 5, 2019

Do you imagine this in addition to <meta>? Or replacing it? I can see doing:

<collection>
  <volume>
    <frontmatter>
      <editor>...
      <year>...
      <address>...
      <publisher>...
    </frontmatter>

    <paper id="1">
      etc.

@davidweichiang
Collaborator

I think it would be ideal to move towards distinguishing between a volume's front matter and the volume itself, so I'd say that <meta> and the front matter should be separate tags. And the front matter should have its own id that is different from the whole-volume id; its id would conventionally be 0 but I hope that that doesn't need to be hard-coded anywhere.

Before changing the schema to disallow papers from having editors, we should check the cases where they do. Most of them are in J74-J79, where conference proceedings, or collections of abstracts, have been published in the journal.

@mbollmann
Member Author

Replacing it seems wrong, journals don't have front matter but still need meta information.

But I think I agree that a front matter tag or attribute is preferable. Explicit is better than implicit, IMO.

@mjpost
Member

mjpost commented Jun 5, 2019

Okay, so <meta> and <frontmatter>, which makes sense.

@davidweichiang, I think the early J journals will have to be updated to be marked as frontmatter manually, since none of the J7?-?000 files exist (they start numbering at 1).

This was referenced Jun 13, 2019
@mjpost
Member

mjpost commented Jun 20, 2019

Okay, I am ready to merge #324. Note that this may break many scripts that rely on the old format, but if you are iterating over the Anthology class, I think it will work.

@mjpost
Member

mjpost commented Jun 20, 2019

(I will wait a bit to do this, maybe till tomorrow, in case anyone has questions).

@mjpost mjpost closed this as completed in 0b4ea37 Jun 21, 2019
mjpost added a commit that referenced this issue Jun 26, 2019
Scripts:
- updated scripts for hierarchical format (#317)
- made version-swapping when adding revisions more robust
- minimal checking of extension type for attachments

Other changes:
- added missing abstract (closes #411)
- fixed address (closes #384)
- added correction (closes #404)
- added revision (closes #408)
- added revision to N19-1318 (closes #418)
- S19-2026 revision (closes #414)
- added missing author to S19-1906 (#408)
najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021
A summary of changes:

- Introduces a nested format (closes acl-org#317)
- URLs are stored using a relative format for internal links (closes acl-org#156), which facilitates mirroring (acl-org#295) 
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes acl-org#181 closes acl-org#180), including journal frontmatter (acl-org#264) and volume PDFs (closes #31) 
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31 )
- It punts on C69 reformatting (closes acl-org#147)

Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (acl-org#226)
- This makes it much easier to add non-paper entries such as talks (acl-org#298), to add a volume-level "publication date" (acl-org#319), and to create an RSS feed of updates (acl-org#358)
najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021
Scripts:
- updated scripts for hierarchical format (acl-org#317)
- made version-swapping when adding revisions more robust
- minimal checking of extension type for attachments

Other changes:
- added missing abstract (closes acl-org#411)
- fixed address (closes acl-org#384)
- added correction (closes acl-org#404)
- added revision (closes acl-org#408)
- added revision to N19-1318 (closes acl-org#418)
- S19-2026 revision (closes acl-org#414)
- added missing author to S19-1906 (acl-org#408)