URLs in authoritative XML #156

Closed
mbollmann opened this issue Feb 28, 2019 · 21 comments

@mbollmann
Member

I already noted in #145 that revision URLs are not stored in the XML, but inferred in the Rails application. Now, I'm pretty sure that actually this is the case for all URLs.

For example, P05.xml has entries of the following format:

<url>http://www.aclweb.org/anthology/P/P05/P05-1026</url>

These URLs return a 404, and the current website actually (correctly) links to http://www.aclweb.org/anthology/P05-1026 instead.

On the other hand, other files (e.g. A00.xml) have no <url> elements at all, but still produce a correct link to the PDF on the website.

This should probably be handled in some uniform way: either always provide the correct URLs in the XML, or never provide them when the files are in the Anthology and follow the standard format. I think it's particularly problematic to have invalid URLs in there, as it's not intuitive that the current website basically ignores them.

@mbollmann
Member Author

mbollmann commented Feb 28, 2019

More info: The current treatment already produces some invalid URLs on the live website as well. For example, LREC 2010 is not hosted on ACL servers, and the individual papers link to the LREC website instead. In the XML, that link is given in <paper href=...>.

The front matter and the proceedings volume itself, however, don't have this href information, so the Rails app automatically produces http://www.aclweb.org/anthology/L10-1000 as the URL, which doesn't exist.

In other words, we need to distinguish between "default link to the Anthology" and "no link at all", which probably only works if we explicitly provide the correct URL in the XML always. Thoughts?
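
A minimal sketch of that distinction, assuming a simplified `<paper>` element: the link is taken only from what is stored in the XML, and a missing URL means "no link" rather than an inferred default. The function name `explicit_link`, the sample XML, and the choice to let `href` take priority over `<url>` are assumptions for illustration, not the Rails app's actual logic.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def explicit_link(paper: ET.Element) -> Optional[str]:
    """Return the link stored in the XML, or None if there is none.

    Assumption: an href attribute takes priority over a <url> child.
    Nothing is inferred from the paper ID.
    """
    href = paper.get("href")
    if href:
        return href
    url = paper.findtext("url")
    if url and url.strip():
        return url.strip()
    return None


# Hypothetical sample data: no link, an external link, an Anthology link.
volume = ET.fromstring(
    """
    <volume>
      <paper id="1000"><title>Front matter</title></paper>
      <paper id="1001" href="http://example.org/lrec2010/1.pdf"><title>External</title></paper>
      <paper id="1026"><title>Hosted</title><url>http://www.aclweb.org/anthology/P05-1026</url></paper>
    </volume>
    """
)
links = {p.get("id"): explicit_link(p) for p in volume.iter("paper")}
```

Here `links["1000"]` is `None`, so a site generator following this rule would render no link for the front matter instead of a nonexistent default URL.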

mbollmann pushed a commit that referenced this issue Feb 28, 2019
@davidweichiang
Collaborator

This is set by workshop/pub chairs in START. I assume that the Anthology chair tells the pub chair what the right value is, and the pub chair tells the workshop chairs what their right value is. But IMO the pub/workshop chairs shouldn't even have to think about this. They should give the Anthology a tarball in which all the filenames start with A00-0 or some placeholder like @@@-@, and the Anthology should automatically fill in the right value.

@mjpost
Member

mjpost commented Mar 1, 2019

The example link you give will work if you do

<url>http://www.aclweb.org/anthology/P/P05/P05-1026.pdf</url>

For the flat links (http://www.aclweb.org/anthology/P05-1026), an Apache rewrite rule is used to direct them to the above file. But no redirection exists for the hierarchical URLs.
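
As a rough illustration of that mapping (a Python sketch, not the actual Apache rule), the flat ID `P05-1026` can be expanded into the hierarchical file location shown above. The function name and the ID pattern (one letter, two digits, a hyphen, four digits) are assumptions based on the examples in this thread.

```python
import re

BASE = "http://www.aclweb.org/anthology"

# Assumed ID shape, e.g. P05-1026: collection letter, two-digit year, paper number.
ID_PATTERN = re.compile(r"^([A-Z])(\d{2})-(\d{4})$")


def hierarchical_pdf_url(anthology_id: str) -> str:
    """Expand a flat Anthology ID into the hierarchical PDF URL."""
    match = ID_PATTERN.match(anthology_id)
    if match is None:
        raise ValueError(f"not a flat Anthology ID: {anthology_id!r}")
    letter, year, _number = match.groups()
    return f"{BASE}/{letter}/{letter}{year}/{anthology_id}.pdf"


pdf_url = hierarchical_pdf_url("P05-1026")
```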

I agree that the URLs should be specified explicitly, and if none is present, the site will not provide a link.

As to link formats, I have come to like the flat addressing style as long as it's organized hierarchically underneath.

@davidweichiang
Collaborator

I'm working on anthologize.pl now. I can have anthologize.pl rewrite the URL with the correct value. Or, an ingestion script on the Anthology side can do it. Let me know what you think is better.

(Why does LREC 2010 have the URL in an href attribute instead of a url element? Is there a difference in meaning?)

@mjpost
Member

mjpost commented Mar 9, 2019

I think it's fine to have anthologize.pl write it out, to have things correct as early as possible. We can always overwrite it on ingest.

That's strange about LREC 2010, but note that it seems to have both an href attribute and the url element, and the href is prioritized. It's non-ACL so I think it's fine to link externally, though on the other hand, pointing to our own copy might help with search results, and LREC papers actually can be hard to find, so the Anthology is providing a service here.

@davidweichiang
Collaborator

Somewhat related question: there are other links of various kinds in the XML:

  • <url>URL</url>
  • <software>filename</software>
  • <dataset>filename</dataset>
  • <attachment type="...">filename</attachment> where the type is 'note', 'presentation', 'poster', 'attachment', or missing (the generic 'attachment' happens, I think, when the pub chair does not follow the instruction to classify the attachment as software or dataset)
  • <mrf src="latexml">filename.xhtml</mrf> (machine readable format?)
  • <video href="URL" tag="video"/>
  • <paper href="URL">...</paper>

I guess these have accumulated over time, but there isn't a lot of consistency. In particular, it seems that the various attachments are just stored as filenames, whereas the paper is stored as a full URL (which is ignored!).

My suggestion: deprecate the <url> tag (and <paper href="URL">) and replace it with a <fulltext> tag, which has several forms:

  • <fulltext href="URL"/> for an external link
  • <fulltext>X99-9999.pdf</fulltext> for an internal link
  • Maybe the <mrf> should be changed to something like <fulltext type="latexml">X99-9999.xhtml</fulltext>. But the conversion seems to be not very good anyway.
  • Maybe LaTeX source could be added as <fulltext type="latex">
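
To make the proposal concrete, here is a hedged Python sketch of how a consumer might read the two main `<fulltext>` forms. The `<fulltext>` tag is only a suggestion in this thread; the `FullText` tuple, the `parse_fulltext` name, and the default format `"pdf"` when no `type` attribute is given are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class FullText(NamedTuple):
    kind: str    # "external" (href attribute) or "internal" (filename as text)
    target: str  # the URL or the filename
    format: str  # value of the type attribute; "pdf" is an assumed default


def parse_fulltext(elem: ET.Element) -> FullText:
    """Classify a proposed <fulltext> element as external or internal."""
    fmt = elem.get("type", "pdf")
    href = elem.get("href")
    if href is not None:
        return FullText("external", href, fmt)
    filename = (elem.text or "").strip()
    if not filename:
        raise ValueError("<fulltext> has neither an href nor a filename")
    return FullText("internal", filename, fmt)


external = parse_fulltext(ET.fromstring('<fulltext href="http://example.org/paper.pdf"/>'))
internal = parse_fulltext(ET.fromstring("<fulltext>X99-9999.pdf</fulltext>"))
latexml = parse_fulltext(ET.fromstring('<fulltext type="latexml">X99-9999.xhtml</fulltext>'))
```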

@mbollmann
Member Author

FWIW, I already conflate <software>, <dataset>, and <video> with <attachment> (setting the tag name as the attachment type) during YAML conversion, so the YAML is already "cleaned up" in that regard.

I don't do anything with the <mrf> tag yet, though, it kinda escaped my attention...
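
The conflation described above could look roughly like this. It is a sketch only, not the actual conversion script: the function name, the output dictionary shape, the sample filenames, and the fallback type `"attachment"` for untyped attachments are all assumptions.

```python
import xml.etree.ElementTree as ET

# Tags folded into the attachment list, using the tag name as the type.
CONFLATED_TAGS = ("software", "dataset", "video")


def collect_attachments(paper: ET.Element) -> list:
    """Gather <attachment>, <software>, <dataset> and <video> into one list."""
    attachments = []
    for child in paper:
        if child.tag == "attachment":
            kind = child.get("type", "attachment")  # assumed fallback type
            target = (child.text or "").strip()
        elif child.tag in CONFLATED_TAGS:
            kind = child.tag
            # <video> stores its target in href; the others store a filename.
            target = child.get("href") or (child.text or "").strip()
        else:
            continue  # e.g. <mrf> is not handled here
        attachments.append({"type": kind, "target": target})
    return attachments


# Hypothetical sample entry
paper = ET.fromstring(
    """
    <paper id="1001">
      <software>X99-1001.Software.zip</software>
      <attachment type="poster">X99-1001.Poster.pdf</attachment>
      <attachment>X99-1001.Extra.zip</attachment>
      <video href="http://example.org/talk.mp4" tag="video"/>
      <mrf src="latexml">X99-1001.xhtml</mrf>
    </paper>
    """
)
attachments = collect_attachments(paper)
```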

@knmnyn
Collaborator

knmnyn commented Mar 12, 2019 via email

@davidweichiang
Collaborator

Also, TACL has <href>.

@davidweichiang
Collaborator

davidweichiang commented Mar 21, 2019

The new version of anthologize.pl I've just pushed always rewrites the <url> field to http://www.aclweb.org/anthology/Z99-9999. It prints a warning if this is different from what the original value was.

That made it easier to detect nonstandard URLs in old conferences (ebdf086), although I didn't attempt to change <paper href="..."> or <href>...</href>.

I think that might close this issue? Or should we fix <paper href="..."> and <href>...</href>?
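
For readers who don't want to dig through the Perl, the rewrite-and-warn behaviour described at the top of this comment could be sketched in Python as follows. This is not the code of anthologize.pl: the function name, the warning text, and the decision to create a missing `<url>` element are assumptions.

```python
import sys
import xml.etree.ElementTree as ET

BASE = "http://www.aclweb.org/anthology/"


def normalize_url(paper: ET.Element, anthology_id: str) -> bool:
    """Set <url> to the canonical flat URL; return True if an old value was replaced."""
    canonical = BASE + anthology_id
    url = paper.find("url")
    changed = False
    if url is None:
        # Assumption: a missing <url> is created rather than reported.
        url = ET.SubElement(paper, "url")
    else:
        old = (url.text or "").strip()
        if old != canonical:
            changed = True
            print(
                f"warning: {anthology_id}: replacing <url> {old!r} with {canonical!r}",
                file=sys.stderr,
            )
    url.text = canonical
    return changed


paper = ET.fromstring(
    "<paper><url>http://www.aclweb.org/anthology/P/P05/P05-1026</url></paper>"
)
first_pass = normalize_url(paper, "P05-1026")   # old hierarchical URL is replaced
second_pass = normalize_url(paper, "P05-1026")  # already canonical, no warning
```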

@knmnyn
Collaborator

knmnyn commented Mar 22, 2019

The original intent was to hide the indirection logic and present a canonical version to the user for both citation and reference. With respect to @davidweichiang 's last comment shouldn't it be http://www.aclweb.org/anthology/Z99-9999 not with the intermediate Z/ path?

One kind of URL (the one with the relative mapping logic) is for files hosted within the Anthology. However, some of our sister organizations host their own files (e.g., LREC used to not let the Anthology host copies of their proceedings), so absolute URLs were needed.

I would continue to favor having separate standards for paths/URLs within the Anthology hosted materials and ones that exist outside (needing absolute references).

@mbollmann
Member Author

With respect to @davidweichiang 's last comment shouldn't it be http://www.aclweb.org/anthology/Z99-9999 not with the intermediate Z/ path?

👍 +1

I would continue to favor having separate standards for paths/URLs within the Anthology hosted materials and ones that exist outside (needing absolute references).

I also mentioned somewhere that this affects the possibility of full mirroring (#28): if we always give absolute URLs, that's a problem for mirroring the full site including the PDFs, which is one real advantage of specifying only an internal filename.

However, about having <url> vs. href, I still believe internal filenames can unambiguously be distinguished from absolute URLs, so I don't see the need to handle them in different ways.

@davidweichiang
Collaborator

Sorry, the Z/ was a typo. I've normalized all the URLs to the form you wrote.

@mjpost
Member

mjpost commented Mar 22, 2019

What if we encoded (in the XML) only relative paths for Anthology URLs (e.g., just Z99-9999), and absolute URLs for non-ACL content for which we host metadata but not files (e.g., LREC)? Hugo would then prefix the base URL to all such relative addresses. That would also address the mirroring issue (#22). Would this introduce any problems?

@davidweichiang
Collaborator

I like that, but I thought @knmnyn said that it was important to have the full URL because of DOIs.

@mjpost
Member

mjpost commented Mar 22, 2019

This would be just in the XML—the site generation code would see it's a relative URL and would prepend the baseURL when building the site.
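
A small sketch of that rule, written in Python for illustration rather than as Hugo template code: values with a scheme and host are treated as external and left alone, and anything else gets the base URL prepended. The function name `resolve` and the mirror address are hypothetical.

```python
from urllib.parse import urlsplit

BASE_URL = "http://www.aclweb.org/anthology/"


def resolve(url_text: str, base_url: str = BASE_URL) -> str:
    """Return an absolute URL for a value stored in the XML."""
    value = url_text.strip()
    parts = urlsplit(value)
    if parts.scheme and parts.netloc:
        return value  # absolute: external content such as LREC
    # relative: hosted by the Anthology (or by a mirror with its own base URL)
    return base_url.rstrip("/") + "/" + value.lstrip("/")


internal = resolve("Z99-9999")
external = resolve("http://example.org/lrec2010/1.pdf")
mirrored = resolve("Z99-9999", base_url="https://mirror.example.org/anthology")
```

Because only the base URL changes, a mirror could reuse the same XML unmodified, which is the mirroring benefit mentioned above.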

@davidweichiang
Collaborator

davidweichiang commented Apr 9, 2019

#31 #209 #242

@knmnyn
Collaborator

knmnyn commented Apr 9, 2019

@davidweichiang @mjpost : I think @mjpost 's idea from

What if we encoded (in the XML) only relative paths for Anthology URLs (e.g., just Z99-9999), and absolute URLs for non-ACL content for which we host metadata but not files (e.g., LREC)? Hugo would then prefix the base URL to all such relative addresses. That would also address the mirroring issue (#22). Would this introduce any problems?

should be the default, and that @mjpost's reply to your comment is correct.

@davidweichiang
Collaborator

I'd like to request that the distinction between external links and internal links not be made by using the tags <url> and <href>, because those two words basically mean the same thing (right?). I should think that a relative vs. absolute URL would be good enough, as discussed above, but if not, then how about

  • <file> (internal) vs <url> (external)
  • <url type="internal"> vs <url type="external">

@knmnyn
Collaborator

knmnyn commented Apr 9, 2019

I agree with @davidweichiang 's suggestion, but I think the distinction should be made in the tag, rather than overloading the content to carry a logical distinction.

You'd want both internal and external to be allowed to coexist and perhaps even allow multiples if needed at some future point.

@mbollmann
Member Author

I still feel absolute vs. relative is easy enough to distinguish and doesn't need separate tags, but I would be fine with the other suggestions as well.

However, for the website, there's currently the assumption that we just have one URL, and that is used e.g. as the link target for the "PDF" button and when users click on the paper title on the paper page. If multiple URLs can occur in the XML (or internal + external), we need to figure out what the logic should be for which URL gets linked where.

@mjpost mjpost mentioned this issue May 2, 2019
@mjpost mjpost closed this as completed in 0b4ea37 Jun 21, 2019
najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021
A summary of changes:

- Introduces a nested format (closes acl-org#317)
- URLs are stored using a relative format for internal links (closes acl-org#156), which facilitates mirroring (acl-org#295) 
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes acl-org#181 closes acl-org#180), including journal frontmatter (acl-org#264) and volume PDFs (closes #31) 
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31 )
- It punts on C69 reformatting (closes acl-org#147)

Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (acl-org#226)
- This makes it much easier to add non-paper entries such as talks (acl-org#298), to add a volume-level "publication date" (acl-org#319), and to create an RSS feed of updates (acl-org#358).