Nested volumes in XML #317

mbollmann · 2019-05-08T12:50:42Z

Originally posted by @mjpost in #285 (comment)

I like this idea of having nested volumes, and agree we should have two different names. It is more XML-like (and clearer) to host each of the separate sub-volumes (long papers, tutorial abstracts, etc.) hierarchically underneath the top-level volume (e.g., P18). So we could do something like:
<proceedings id="P18">
  <volume id="1"> 
    <editor>...</editor>
    <paper id="1001">
      ...
    </paper>
  </volume>
</proceedings>
This would also help with storing extra-conference materials as discussed in #298.

This would address problems that have come up in:

Wrong revision links #265 (comment)
Clean up J79 and renumber it to J74-J78 #285 (comment)
Potentially also relevant for Adding non-paper entries #298?

My take on the tag names:

Keeping the inner volumes under <volume> seems good to me for backwards-compatibility, as you can still query for <volume> and process its children like before, but I don't know if that's actually true or relevant for anything.
I don't like <proceedings> for the top-level tag as personally I like to think of the inner volumes as the "proceedings". Maybe that's a subjective thing, but as this tag doesn't actually carry a lot of semantics except grouping proceedings volumes (see?) by a common prefix, maybe we could just go with <root> or something else that's pretty neutral?


davidweichiang · 2019-05-08T13:24:20Z

Also addresses #291.

mjpost · 2019-05-08T13:32:58Z

Sounds like we have some consensus here. I agree that proceedings is a bad choice for the top-level name. root is a good compromise. Another option is volumes.

And yes, if we eventually future-proof the IDs, we could nicely concatenate all IDs from the root down. This suggests that the paper ID here should be 001 instead of 1001:

<root id="P18">
  <volume id="1"> <!-- long papers -->
    <editor>...</editor>
    <paper id="001">
      ...
    </paper>
  </volume>
</root>

(Or perhaps the first paper ID is 1, and decimal padding is determined by an attribute on the proceedings or volume, which would make backwards compatibility seamless if in the future we moved to IDs of the form P21-1-1.)
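To make the ID concatenation idea concrete, here is a minimal sketch, not the Anthology's actual code. The function name `full_id`, the `paper_width` parameter (standing in for the padding attribute suggested above), and the `legacy` switch are all hypothetical.

```python
def full_id(collection_id: str, volume_id: int, paper_id: int,
            paper_width: int = 3, legacy: bool = True) -> str:
    """Concatenate IDs from the root down.

    legacy=True pads the paper ID and fuses it with the volume ID
    (the P18-1001 style); legacy=False joins all three levels with
    hyphens (the P21-1-1 style). paper_width is an assumed stand-in
    for a padding attribute stored on the volume.
    """
    if legacy:
        return f"{collection_id}-{volume_id}{paper_id:0{paper_width}d}"
    return f"{collection_id}-{volume_id}-{paper_id}"


if __name__ == "__main__":
    print(full_id("P18", 1, 1))                # legacy, padded form
    print(full_id("P21", 1, 1, legacy=False))  # fully delimited form
```

Storing the paper ID as a plain integer and applying the padding only when formatting is what would let both ID styles coexist.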

mjpost · 2019-05-11T03:01:27Z

root works in-file but is strange in anthology.py when referring to multiple XML files. How about collection for the top-level tag?

I wrote conversion scripts and updated the schemas. Next will be the Anthology code. I may be offline this weekend, so if someone else is into it, feel free to go ahead. Otherwise I'll try to get to it early next week.

davidweichiang · 2019-05-11T14:37:09Z

Do you envision keeping the current three-level structure and just making the structure be reflected in the XML, or are you envisioning making the structure more flexible, for example, to allow four levels of nesting when needed?

I do think it's good to choose the tags carefully, since we'll be stuck with them for a while! I actually think we should get rid of <volume> altogether, to clearly distinguish from the old schema, and because there is so much ambiguity in the code about what a "volume" is. But I don't have a good alternative yet...

mbollmann · 2019-05-11T18:19:40Z

+1 for <collection> as the root tag.

As to volumes, we're sometimes wrapping full proceedings (e.g. for workshops), sometimes only parts of them (e.g. the long paper category of a conference), so "volume" is kind of the best generic term already, IMO. But if we were to replace it, I'd say <proceedings> would also work, even though it's not quite accurate in all circumstances.

mjpost · 2019-05-11T18:41:35Z

What about book to replace the current volume? It reflects the (mostly) former real-world practice, as well as the terminology that’s often used in instructions to pub and workshop chairs (sometimes called “book” chairs).

So it’d be collection → book → paper, though we could have a different second-level tag for other things like keynote videos and so on.

I was only thinking of doing three levels for now. I don’t have a use-case in mind for four (do you?) and this wasn’t that much work, so it wouldn’t be hard to expand it later.

davidweichiang · 2019-05-11T19:57:45Z

Could you give an example of what you're thinking for videos? I thought that would be at the third level, not the second.

A use-case for four levels was #285, where an issue of AJCL included the proceedings of ACL 1975. But it's arguable whether you'd truly want four levels.

davidweichiang · 2019-05-13T14:59:06Z

While we're discussing this...currently, if I understand correctly, we have

Z99-9000.pdf is the front matter
Z99-9000.bib is the bib entry for the whole proceedings
Z99-9.pdf is the whole proceedings
Z99-9.bib are the bib entries for all the papers in the proceedings

which works but feels inconsistent (and leads to some complicated code in anthologize.pl).

Another case where the current setup feels inconsistent is that sometimes there is "middle matter" (like an introduction page for a special session, maybe) which is given its own pdf and bib entry (unlike front matter, which only gets its own pdf), and frequently there is back matter which does not get either its own pdf or bib entry.

Would the new structure offer any better alternative? I don't have a great solution, though...the best I can think of is to have special paper ids front and back, so

Z99-9 would be the whole proceedings
Z99-9-front would be the front matter
Z99-9-001 would be the first paper
Z99-9-back would be the back matter
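The proposed naming can be sketched as a small helper. This is only an illustration of the scheme listed above; the function name `proposed_name` and the three-digit default width are assumptions.

```python
from typing import Optional, Union


def proposed_name(collection: str, volume: int,
                  part: Optional[Union[int, str]] = None,
                  width: int = 3) -> str:
    """Build an identifier under the proposed scheme.

    part=None           -> the whole proceedings (Z99-9)
    part="front"/"back" -> front or back matter  (Z99-9-front)
    part=<int>          -> a numbered paper      (Z99-9-001)
    """
    base = f"{collection}-{volume}"
    if part is None:
        return base
    if isinstance(part, int):
        return f"{base}-{part:0{width}d}"
    if part in ("front", "back"):
        return f"{base}-{part}"
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    for part in (None, "front", 1, "back"):
        print(proposed_name("Z99", 9, part))
```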

mjpost · 2019-05-14T01:54:01Z

I am not sure that the single case of #285 warrants extending the format to support four levels, but maybe I am just saying I don't want to do it myself.

Yes, you're right, videos would appear at level three. I haven't given it a lot of thought, but the idea was something like:

<collection>
    <proceedings>
        <talk>
            <author>...</author>
            <title>...</title>
            <video>...</video>
        </talk>
    </proceedings>
</collection>

mjpost · 2019-05-14T02:04:28Z

While we're discussing this...currently, if I understand correctly, we have

Z99-9000.pdf is the front matter

Z99-9000.bib is the bib entry for the whole proceedings

Z99-9.pdf is the whole proceedings

Z99-9.bib are the bib entries for all the papers in the proceedings

This is my understanding. Do people ever cite whole proceedings? Z99-9000.bib is the most displeasing from this list and it would make sense to let people cite the front matter itself (e.g., "ACL 2019 was the largest ACL conference, ever (cite front matter that has stats)").

For the full format (tagging #291), I was thinking maybe we should change the delimiter so as to distinguish it (e.g., Z99.9.1 or Z99.9.001). Changing the front matter bib would address your problem. We could add keywords like front and back but we could also just make front and back matter papers 0 and N. But what is in back matter, and do people ever cite it? I'm just thinking of the index.

(This brings to mind another thing I'd like to add, the handbook, which has a local guide which, when done right, is personalized and entertaining and worth reading (e.g., ACL 2014).)

davidweichiang · 2019-05-14T02:17:33Z

I think no one cites whole proceedings, but some people might use crossref to point to the BibTeX entry for a proceedings.

No, I don't think anyone ever cites back matter, but I can imagine someone might want to read the PDF of the back matter without downloading the PDF of the whole proceedings.

If we continue to number front matter like any other paper, then perhaps Z99-9000.bib could become the bib for the front matter and Z99-9.bib could contain the bib entries for both the whole proceedings as well as all the individual papers? Or, Z99-9.bib for the whole proceedings and Z99-9-all.bib for the individual papers?

mjpost · 2019-05-15T17:52:33Z

Or, Z99-9.bib for the whole proceedings and Z99-9-all.bib for the individual papers?

I think this sounds reasonable.

mjpost · 2019-06-01T19:57:42Z

Recording steps here, since I've had to rebase a few times:

run bin/consolidate_urls.py to make the <url> tag explicit
run bin/make_hierarchical.py, which converts to a hierarchical representation and adds volume-level URL tags for volumes that exist

mjpost · 2019-06-02T19:34:35Z

I'm picking this up again after some time, and hope to finish it this week. I'd like some feedback. In the latest iteration (incorporating thoughts from above), I am removing redundant details from papers (e.g., address, publisher, year, month), since they can just inherit this from the <proceedings> tag that contains them. For example:

<?xml version='1.0' encoding='UTF-8'?>
<collection id="N19">
  <proceedings id="1">
    <booktitle>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</booktitle>
    <editor><first>Jill</first><last>Burstein</last></editor>
    <editor><first>Christy</first><last>Doran</last></editor>
    <editor><first>Thamar</first><last>Solorio</last></editor>
    <publisher>Association for Computational Linguistics</publisher>
    <address>Minneapolis, Minnesota</address>
    <month>June</month>
    <year>2019</year>
    <paper id="0">
      <title>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</title>
      <url>N19-1</url>
    </paper>
    <paper id="1">
      <title>Entity Recognition at First Sight: <fixed-case>I</fixed-case>mproving <fixed-case>NER</fixed-case> with Eye Movement Information</title>
      <author><first>Nora</first><last>Hollenstein</last></author>
      <author><first>Ce</first><last>Zhang</last></author>
      <pages>1–10</pages>
      <abstract>Previous research shows that eye-tracking data contains information about the lexical and syntactic properties of text, which can be used to improve natural language processing models. In this work, we leverage eye movement features from three corpora with recorded gaze information to augment a state-of-the-art neural model for named entity recognition (NER) with gaze embeddings. These corpora were manually annotated with named entity labels. Moreover, we show how gaze features, generalized on word type level, eliminate the need for recorded eye-tracking data at test time. The gaze-augmented models for NER using token-level and type-level features outperform the baselines. We present the benefits of eye-tracking features by evaluating the NER models on both individual datasets as well as in cross-domain settings.</abstract>
      <url>N19-1001</url>
    </paper>
    <paper id="2">

Any strong objections?
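As a rough illustration of how a reader of this format could recover the removed details, here is a sketch using Python's standard `xml.etree.ElementTree`. The sample document is a shortened, simplified version of the example above (the title is truncated and has no `<fixed-case>` markup), and the `INHERITED` tuple and function name are assumptions, not part of the Anthology code.

```python
import xml.etree.ElementTree as ET

# Shortened, simplified sample in the proposed format.
SAMPLE = """<collection id="N19">
  <proceedings id="1">
    <publisher>Association for Computational Linguistics</publisher>
    <address>Minneapolis, Minnesota</address>
    <month>June</month>
    <year>2019</year>
    <paper id="1">
      <title>Entity Recognition at First Sight</title>
      <url>N19-1001</url>
    </paper>
  </proceedings>
</collection>"""

# Assumed list of fields that papers take from their parent.
INHERITED = ("publisher", "address", "month", "year")


def papers_with_inherited_fields(xml_text: str) -> list:
    """Return one dict per paper, filled in from the enclosing proceedings."""
    root = ET.fromstring(xml_text)
    records = []
    # ElementTree elements keep no parent pointer, so walk top-down.
    for proceedings in root.findall("proceedings"):
        shared = {tag: proceedings.findtext(tag) for tag in INHERITED}
        for paper in proceedings.findall("paper"):
            record = dict(shared)
            record["id"] = paper.get("id")
            record["title"] = paper.findtext("title")
            records.append(record)
    return records


if __name__ == "__main__":
    for record in papers_with_inherited_fields(SAMPLE):
        print(record)
```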

akoehn · 2019-06-03T11:53:17Z

Any strong objections?

No. 1) I can't think of any reason why that information could diverge and 2) as you said it is easy to re-obtain it from the parent node.

Do you update the schema file? If so, a few comments in there would be nice. The schema is complex enough to warrant it.

mjpost · 2019-06-03T15:00:40Z

The schema is updated (edit: but not yet pushed) and I’ll add comments.

Another variation would be to gather the volume specific information in a meta block, e.g.,

<collection id="N19">
  <proceedings id="1">
      <meta>
        <booktitle>Proceedings of the 2019 Conference of the North <fixed-case>A</fixed-case>merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</booktitle>
        <editor><first>Jill</first><last>Burstein</last></editor>
        <editor><first>Christy</first><last>Doran</last></editor>
        <editor><first>Thamar</first><last>Solorio</last></editor>
        <publisher>Association for Computational Linguistics</publisher>
        <address>Minneapolis, Minnesota</address>
        <month>June</month>
        <year>2019</year>
      </meta>
      ...

This would make it a bit simpler to iterate over papers, i.e., the children of proceedings, needing only to skip over the “meta” child.

akoehn · 2019-06-03T15:20:27Z

Using xpath, I would use something like collection[@id="N19"]//paper to obtain all papers of a conference; there would be no difference between both approaches. However, I still like the meta tag as it stores away the meta data visually.
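A quick way to check this claim is the limited XPath support in Python's standard `xml.etree.ElementTree`. The sketch below uses two hypothetical miniature documents, one with a `<meta>` block and one without, and selects papers with a descendant query; in XPath proper, the attribute test on the collection is written `[@id="N19"]`, which is done here with a plain attribute check on the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature documents for the two layouts under discussion.
WITH_META = (
    '<collection id="N19"><proceedings id="1">'
    "<meta><year>2019</year></meta>"
    '<paper id="1"/><paper id="2"/>'
    "</proceedings></collection>"
)
WITHOUT_META = (
    '<collection id="N19"><proceedings id="1">'
    "<year>2019</year>"
    '<paper id="1"/><paper id="2"/>'
    "</proceedings></collection>"
)


def paper_ids(xml_text: str, collection_id: str) -> list:
    """Return the ids of all <paper> descendants of the named collection."""
    root = ET.fromstring(xml_text)
    if root.tag != "collection" or root.get("id") != collection_id:
        return []
    # ".//paper" matches papers at any depth, so <meta> never gets in the way.
    return [paper.get("id") for paper in root.findall(".//paper")]


if __name__ == "__main__":
    print(paper_ids(WITH_META, "N19"))
    print(paper_ids(WITHOUT_META, "N19"))
```

Both layouts yield the same list, so the `<meta>` block only matters to code that iterates over the direct children of a volume.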

davidweichiang · 2019-06-03T16:01:09Z

I like that the <meta> tag lets the data specify what fields are inherited, instead of a hard-coded list of inherited fields. But what does <meta> mean -- would something like <inherit> be more informative?

Can <proceedings> have a <meta> child also?

mjpost · 2019-06-03T17:27:08Z

I also see I missed @mbollmann’s comment above about using <volume> for the level 2 tag instead of <proceedings>. I agree with @davidweichiang’s point about using entirely new tags, but I think <volume> is more appropriate there than <proceedings>.

<meta> on the other hand is maybe too general, but I can’t think of a better one. We could add that to <proceedings> too but I don’t have a use case.

So to be explicit I think we are converging on:

<collection>
  <volume>
    <meta>...</meta>
    <paper>..</paper>
    ...
  </volume>
</collection>

akoehn · 2019-06-03T17:51:00Z

@davidweichiang I think that only proceedings should have a meta section. I like the term meta as it is standard (-> HTML uses it, I use it in corpora) and fitting (this denotes meta data to the proceedings in contrast to the content captured by the papers).

If collections would also have a meta section, the semantics are not clear to me. Should the collection actually have editors etc. or are these only properties of the volumes? If the collection has editors A and B, does defining an editor in a proceeding overwrite or append? It is certainly possible to define all that, but the whole logic will be much more complicated and we save maybe a few kilobytes of data, if at all.

mjpost · 2019-06-03T22:20:01Z

All right, in the latest push I've committed the schema, conversion script, and the converted N19.xml.

mbollmann · 2019-06-04T00:27:59Z

The current iteration looks good to me.

Does the conversion script preserve the front matter entries? While we're converting, this would be a good time to implement logic that

keeps front matter entries for non-journal volumes, but maybe explicitly marks them as such (Make conference front matter show that it is front matter #62); and
throws away the dummy front matter entries for journals, as they're no longer needed.

mjpost · 2019-06-04T12:59:56Z

It does preserve them. I’ve been adopting the convention of assigning frontmatter to be paper ID 0; per David’s suggestion above, it is the bib entry for the proceedings itself, and the PDF is just the frontmatter. We could use a special attribute which would make these semantics clearer (<paper role="frontmatter">?) if you have ideas, but I’m not sure this is necessary.

davidweichiang · 2019-06-04T16:30:07Z

My quibble with <meta> is that everything in this file is metadata; it's not clear to me that what goes inside the <meta> tag is even more meta. My suggestion: <template>?

The questions that @akoehn raised about the semantics of <meta> are good ones; they should be ironed out for <volume> and <paper> as well. For example, what should this do:

<volume>
  <meta>
    <year>1999</year>
  </meta>
  <paper>
    <year>2000</year>
  </paper>
</volume>

I have no idea why you'd have that, but presumably the paper year would override the meta year.

For BibTeX generation, there will need to be special rules that cause <paper> not to inherit <editor>, unless it has id 0, in which case it does inherit <editor> but it doesn't inherit <booktitle>...is there any way to avoid such rules?

akoehn · 2019-06-04T16:51:33Z

In my opinion, a node should never have a property that an ancestor has as well. Only then do we not have to think about inheritance or similar concepts. In @mjpost's schema, only the proceedings have meta information.

My mental model is not that of a template; the information in the meta tag is meta information about the proceedings. It would also not really make sense to talk about an editor of a paper; it is always the editor of the proceedings, which consist of papers.

Interestingly (and maybe contradictory to my claim above), papers currently may also have editor fields -- this is currently the only overlap and I don't see the reason for that.

mjpost · 2019-06-04T17:15:50Z

My original thinking was in fact an inheritance model, but we don’t actually have a use case and I’d be in favor of Arne’s proposal here. I think the meaning of meta here is clear and not too meta. I also agree about meta applying only at the volume and not collection level. Collections in this new scheme are just logical groupings of volumes. For example we could move to putting all of NAACL in a single naacl.xml file instead of splitting out by years.

For bibtex:

all papers do inherit both booktitle and editor; they just don’t all export it
we could handle the logic David pointed to by leaving off both title and authors from the frontmatter paper; and have simple logic that just defaults to the volume metadata in their absence. Something like:

if paper.title:
    title = paper.title
    field = 'title'
else:
    title = volume.booktitle
    field = 'booktitle'

and similar for “author”.
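A self-contained version of that fallback might look like the sketch below. The `Volume` and `Paper` classes are hypothetical stand-ins rather than the classes in the Anthology code, and falling back from "author" to the volume's editors is one reading of "similar for author", not something settled in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Volume:
    # Hypothetical container for volume-level metadata.
    booktitle: str
    editors: List[str] = field(default_factory=list)


@dataclass
class Paper:
    # Front matter is modeled as a paper without title or authors.
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)


def title_field(paper: Paper, volume: Volume) -> Tuple[str, str]:
    """Use the paper's own title, else fall back to the volume's booktitle."""
    if paper.title:
        return ("title", paper.title)
    return ("booktitle", volume.booktitle)


def name_field(paper: Paper, volume: Volume) -> Tuple[str, str]:
    """Use the paper's authors, else fall back to the volume's editors (assumed)."""
    if paper.authors:
        return ("author", " and ".join(paper.authors))
    return ("editor", " and ".join(volume.editors))


if __name__ == "__main__":
    volume = Volume("Proceedings of NAACL 2019", ["Jill Burstein", "Christy Doran"])
    frontmatter = Paper()
    regular = Paper("Entity Recognition at First Sight", ["Nora Hollenstein", "Ce Zhang"])
    for paper in (frontmatter, regular):
        print(title_field(paper, volume), name_field(paper, volume))
```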

davidweichiang · 2019-06-04T17:20:14Z

I took a look at the schema and understand now. I think it works to require each field to occur at a certain level, barring any unforeseen unusual situations.

Should volume 1 have <url>N19-1</url> instead of http:...?
Should paper 0 have <url>N19-1000</url> instead of N19-1?

mbollmann · 2019-06-04T17:26:51Z

Re the front matter: Mainly I would like to get rid of the implicit logic which volumes have front matter and which don't. If you think identifying them by ID=0 is good enough then that's fine by me, as long as e.g. journals would skip that ID then.

mjpost · 2019-06-04T19:30:57Z

Yes http should have been removed. And yes paper 0 will be fixed-width formatted, but applied in code instead of xml.

Front-matter will be fixed. This PR also addresses #156 by only generating links when an explicit <url> tag is present.

akoehn · 2019-06-04T20:00:19Z

If the frontmatter behaves differently (always id=0 and no author/title) why not make it its own type of entity? I see no downside to that.

mjpost · 2019-06-04T22:21:25Z

We could do that but it seemed largely redundant with paper in the schema.

akoehn · 2019-06-05T05:35:51Z

Regarding the editor field for papers I mentioned above, @mbollmann wrote:

The Rails app just silently ignored it on regular papers. I copied the logic, but added the warning.

Therefore, it would be good to remove that field in the schema as part of this clean up.

Regarding the front matter:
It would be no additional content in the XML, but we could then enforce

papers to have authors
the front matter to not have authors
papers to have a positive (i.e. non-zero) ID
front matter to have id == 0

Having more structure encoded in the schema is a good thing imo, as it documents the intended structure and catches errors earlier (and probably in a clearer way) than in code.
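For comparison, here is roughly what the same four constraints look like when checked in code instead of in the schema. This is a sketch in Python's standard `xml.etree.ElementTree`, not a replacement for schema validation; the `frontmatter` tag name and the function name `check_volume` are assumptions.

```python
import xml.etree.ElementTree as ET


def check_volume(volume: ET.Element) -> list:
    """Return a list of human-readable constraint violations for one volume."""
    problems = []
    for front in volume.findall("frontmatter"):  # assumed tag name
        if front.get("id") != "0":
            problems.append("front matter must have id 0")
        if front.find("author") is not None:
            problems.append("front matter must not have authors")
    for paper in volume.findall("paper"):
        pid = paper.get("id", "")
        if not (pid.isdecimal() and int(pid) > 0):
            problems.append(f"paper id {pid!r} must be a positive integer")
        if paper.find("author") is None:
            problems.append(f"paper {pid!r} has no author")
    return problems


if __name__ == "__main__":
    bad = ET.fromstring(
        '<volume><frontmatter id="0"><author/></frontmatter>'
        '<paper id="0"><title>x</title></paper></volume>'
    )
    for problem in check_volume(bad):
        print(problem)
```

Every rule here is one more place for code and data to drift apart, which is the argument for stating the rules once in the schema.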

mjpost · 2019-06-05T05:52:38Z

Do you imagine this in addition to <meta>? Or replacing it? I can see doing:

<collection>
  <volume>
    <frontmatter>
      <editor>...
      <year>...
      <address>...
      <publisher>...
    </frontmatter>

    <paper id="1">
      etc.

davidweichiang · 2019-06-05T11:52:48Z

I think it would be ideal to move towards distinguishing between a volume's front matter and the volume itself, so I'd say that <meta> and the front matter should be separate tags. And the front matter should have its own id that is different from the whole-volume id; its id would conventionally be 0 but I hope that that doesn't need to be hard-coded anywhere.

Before changing the schema to disallow papers from having editors, we should check the cases where they do. Most of them are in J74-J79, where conference proceedings, or collections of abstracts, have been published in the journal.

mbollmann · 2019-06-05T14:36:09Z

Replacing it seems wrong, journals don't have front matter but still need meta information.

But I think I agree that a front matter tag or attribute is preferable. Explicit is better than implicit, IMO.

mjpost · 2019-06-05T15:44:35Z

Okay, so <meta> and <frontmatter>, which makes sense.

@davidweichiang, I think the early J journals will have to be updated to be marked as frontmatter manually, since none of the J7?-?000 files exist (they start numbering at 1).

mjpost · 2019-06-20T16:57:11Z

Okay, I am ready to merge #324. Note that this may break many scripts that rely on the old format, but if you are iterating over the Anthology class, I think it will work.

mjpost · 2019-06-20T16:57:36Z

(I will wait a bit to do this, maybe till tomorrow, in case anyone has questions).

Scripts:
- updated scripts for hierarchical format (#317)
- made version-swapping when adding revisions more robust
- minimal checking of extension type for attachments

Other changes:
- added missing abstract (closes #411)
- fixed address (closes #384)
- added correction (closes #404)
- added revision (closes #408)
- added revision to N19-1318 (closes #418)
- S19-2026 revision (closes #414)
- added missing author to S19-1906 (#408)

A summary of changes:
- Introduces a nested format (closes acl-org#317)
- URLs are stored using a relative format for internal links (closes acl-org#156), which facilitates mirroring (acl-org#295)
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes acl-org#181 closes acl-org#180), including journal frontmatter (acl-org#264) and volume PDFs (closes #31)
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31)
- It punts on C69 reformatting (closes acl-org#147)

Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (acl-org#226)
- This makes it much easier to add non-paper entries such as talks (acl-org#298), to add a volume-level "publication date" (acl-org#319), and to create an RSS feed of updates (acl-org#358)


mbollmann added the enhancement label May 8, 2019

mjpost added a commit that referenced this issue May 10, 2019

started volume work, postponing till #317

6496178

mjpost mentioned this issue May 10, 2019

Nested volumes and explicit <url> tags #324

Merged

9 tasks

mjpost mentioned this issue May 14, 2019

Official publication date #319

Closed

This was referenced May 28, 2019

Announcement of updates / RSS feed #358

Closed

additions + handy script (closes #349) #366

Merged

This was referenced Jun 13, 2019

reorg of C69 #407

Closed

Anthology mirrors #295

Closed

mjpost closed this as completed in 0b4ea37 Jun 21, 2019

akoehn mentioned this issue Jun 21, 2019

schema: only positive integers for ids, year has to be a year. #419

Merged

mjpost mentioned this issue Jun 24, 2019

normalize_anth.py: find papers in nested volumes #423

Merged
