URLs in authoritative XML #156

mbollmann · 2019-02-28T10:42:24Z

I already noted in #145 that revision URLs are not stored in the XML, but inferred in the Rails application. Now, I'm pretty sure that actually this is the case for all URLs.

For example, P05.xml has entries of the following format:

<url>http://www.aclweb.org/anthology/P/P05/P05-1026</url>

These URLs are 404, and the current website actually (correctly) links to http://www.aclweb.org/anthology/P05-1026 instead.

On the other hand, other files (e.g. A00.xml) have no <url> elements at all, but still produce a correct link to the PDF on the website.

This should probably be handled in some uniform way: either always provide the correct URLs in the XML, or never provide them when the files are in the Anthology and follow the standard format. I think it's particularly problematic to have invalid URLs in there, as it's not intuitive that the current website basically ignores them.


mbollmann · 2019-02-28T13:00:31Z

More info: The current treatment already produces some invalid URLs on the live website as well. For example, LREC 2010 is not hosted on ACL servers, and the individual papers link to the LREC website instead. In the XML, that link is given in <paper href=...>.

The front matter and the proceedings volume itself, however, don't have this href information, so the Rails app automatically produces http://www.aclweb.org/anthology/L10-1000 as the URL, which doesn't exist.

In other words, we need to distinguish between "default link to the Anthology" and "no link at all", which probably only works if we explicitly provide the correct URL in the XML always. Thoughts?

davidweichiang · 2019-03-01T14:42:46Z

This is set by workshop/pub chairs in START. I assume that the Anthology chair tells the pub chair what the right value is, and the pub chair tells the workshop chairs what their right value is. But IMO the pub/workshop chairs shouldn't even have to think about this. They should give the Anthology a tarball in which all the filenames start with A00-0 or some placeholder like @@@-@, and the Anthology should automatically fill in the right value.

mjpost · 2019-03-01T17:22:01Z

The example link you give will work if you do

<url>http://www.aclweb.org/anthology/P/P05/P05-1026.pdf</url>

For the flat links (http://www.aclweb.org/anthology/P05-1026), an Apache rewrite rule is used to direct them to the above file. But no redirection exists for the hierarchical URLs.
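To make the mapping concrete, here is a minimal Python sketch of what that rewrite amounts to. It assumes IDs shaped like P05-1026 (one letter, two-digit year, hyphen, four digits); the function name and the regular expression are illustrative only and are not the actual Apache rule.

```python
import re

# Assumed ID shape: one letter, two digits, hyphen, four digits (e.g. P05-1026).
ANTHOLOGY_ID = re.compile(r"^([A-Z])(\d{2})-(\d{4})$")


def flat_to_hierarchical(anthology_id):
    """Map a flat Anthology ID to the hierarchical PDF path below /anthology/."""
    match = ANTHOLOGY_ID.match(anthology_id)
    if match is None:
        raise ValueError(f"not a recognized Anthology ID: {anthology_id!r}")
    letter, year, _ = match.groups()
    # P05-1026 -> P/P05/P05-1026.pdf
    return f"{letter}/{letter}{year}/{anthology_id}.pdf"


print(flat_to_hierarchical("P05-1026"))
```

Note the `.pdf` suffix: the hierarchical form in P05.xml lacks it, which is why those URLs fail.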

I agree that the URLs should be specified explicitly, and if none is present, the site will not provide a link.

As to link formats, I have come to like the flat addressing style as long as it's organized hierarchically underneath.

davidweichiang · 2019-03-09T01:19:46Z

I'm working on anthologize.pl now. I can have anthologize.pl rewrite the URL with the correct value. Or, an ingestion script on the Anthology side can do it. Let me know what you think is better.

(Why does LREC 2010 have the URL in an href attribute instead of a url element? Is there a difference in meaning?)

mjpost · 2019-03-09T01:48:40Z

I think it's fine to have anthologize.pl write it out, to have things correct as early as possible. We can always overwrite it on ingest.

That's strange about LREC 2010, but note that it seems to have both an href attribute and the url element, and the href is prioritized. It's non-ACL so I think it's fine to link externally, though on the other hand, pointing to our own copy might help with search results, and LREC papers actually can be hard to find, so the Anthology is providing a service here.

davidweichiang · 2019-03-09T20:28:53Z

Somewhat related question: there are other links of various kinds in the XML:

<url>URL</url>
<software>filename</software>
<dataset>filename</dataset>
<attachment type="...">filename</attachment> where the type is 'note', 'presentation', 'poster', 'attachment', or missing (the generic 'attachment' happens, I think, when the pub chair does not follow the instruction to classify the attachment as software or dataset)
<mrf src="latexml">filename.xhtml</mrf> (machine readable format?)
<video href="URL" tag="video"/>
<paper href="URL">...</paper>

I guess these have accumulated over time, but there isn't a lot of consistency. In particular, it seems that the various attachments are just stored as filenames, whereas the paper is stored as a full URL (which is ignored!).

My suggestion: deprecate the <url> tag (and <paper href="URL">) and replace it with a <fulltext> tag, which has several forms:

<fulltext href="URL"/> for an external link
<fulltext>X99-9999.pdf</fulltext> for an internal link
Maybe the <mrf> should be changed to something like <fulltext type="latexml">X99-9999.xhtml</fulltext>. But the conversion seems to be not very good anyway.
Maybe LaTeX source could be added as <fulltext type="latex">

mbollmann · 2019-03-11T15:35:00Z

FWIW, I already conflate <software>, <dataset>, and <video> with <attachment> (setting the tag name as the attachment type) during YAML conversion, so the YAML is already "cleaned up" in that regard.

I don't do anything with the <mrf> tag yet, though, it kinda escaped my attention...
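For readers who want to see what this kind of conflation looks like, here is a rough Python sketch. It is not the actual conversion script: the function name, the record keys (`type`, `target`), and the sample filenames are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Tags folded into generic attachments; the tag name becomes the attachment type.
FOLDED_TAGS = ("software", "dataset", "video")


def collect_attachments(paper: ET.Element) -> List[Dict[str, str]]:
    """Return one {'type', 'target'} record per attachment-like child element."""
    attachments = []
    for child in paper:
        if child.tag == "attachment":
            # A missing type attribute falls back to the generic 'attachment'.
            kind = child.get("type", "attachment")
        elif child.tag in FOLDED_TAGS:
            kind = child.tag
        else:
            continue
        # <video> carries its link in href; the other tags carry a filename as text.
        target = child.get("href") or (child.text or "").strip()
        attachments.append({"type": kind, "target": target})
    return attachments


# Hypothetical sample entry; the filenames and the video URL are placeholders.
paper = ET.fromstring(
    '<paper id="1001">'
    "<software>X99-9999.Software.zip</software>"
    '<attachment type="poster">X99-9999.Poster.pdf</attachment>'
    '<video href="http://example.org/talk" tag="video"/>'
    "</paper>"
)
for record in collect_attachments(paper):
    print(record)
```

An `<mrf>` element would be skipped by this sketch, matching the current state described above.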

knmnyn · 2019-03-12T01:45:38Z

Hi Marcel, all: It is a great suggestion to fold in all of the tags to <attachments/> and allow the attribute field within the tag to define the type and its treatment. Ideally the MRF (machine readable format) tag would be involved in that too, but there are many potential types of tags that could be defined there, so that's why originally it was left as a separate field (to distinguish automated contributions from NLP software being run over the publications) Cheers, Min



davidweichiang · 2019-03-12T17:10:23Z

Also, TACL has <href>.

davidweichiang · 2019-03-21T21:19:46Z

The new version of anthologize.pl I've just pushed always rewrites the <url> field to http://www.aclweb.org/anthology/Z99-9999. It prints a warning if this is different from what the original value was.

That made it easier to detect nonstandard URLs in old conferences (ebdf086), although I didn't attempt to change <paper href="..."> or <href>...</href>.
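In case it helps to see the behavior spelled out, here is a small Python sketch of the rewrite-and-warn step. anthologize.pl itself is a Perl script, so this is only an illustration of the idea; the function name and the warning text are assumptions.

```python
import sys

CANONICAL_PREFIX = "http://www.aclweb.org/anthology/"


def normalize_url(anthology_id, original_url=None):
    """Return (canonical_url, changed) and warn on stderr when the original differed."""
    canonical = CANONICAL_PREFIX + anthology_id
    changed = original_url is not None and original_url != canonical
    if changed:
        print(
            f"warning: {anthology_id}: replacing {original_url} with {canonical}",
            file=sys.stderr,
        )
    return canonical, changed


# The hierarchical URL from P05.xml is flagged and replaced by the flat form.
print(normalize_url("P05-1026", "http://www.aclweb.org/anthology/P/P05/P05-1026"))
```

Entries without any original URL get the canonical one silently, so only genuine mismatches show up in the warnings.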

I think that might close this issue? Or should we fix <paper href="..."> and <href>...</href>?

knmnyn · 2019-03-22T07:06:20Z

The original intent was to hide the indirection logic and present a canonical version to the user for both citation and reference. With respect to @davidweichiang 's last comment shouldn't it be http://www.aclweb.org/anthology/Z99-9999 not with the intermediate Z/ path?

One kind of URL (with the relative mapping logic) is for files hosted within the Anthology. However, some of our sister organizations host their own files (e.g., LREC used to not let the Anthology host copies of their proceedings), and so absolute URLs were needed.

I would continue to favor having separate standards for paths/URLs within the Anthology hosted materials and ones that exist outside (needing absolute references).

mbollmann · 2019-03-22T09:35:20Z

> With respect to @davidweichiang 's last comment shouldn't it be http://www.aclweb.org/anthology/Z99-9999 not with the intermediate Z/ path?

👍 +1

> I would continue to favor having separate standards for paths/URLs within the Anthology hosted materials and ones that exist outside (needing absolute references).

I also mentioned somewhere that this will affect the possibility of full mirroring (#28): if we always give absolute URLs, that's a problem for mirroring the full site including the PDFs, and avoiding that problem is one actual advantage of specifying only an internal filename.

However, about having <url> vs. href, I still believe internal filenames can unambiguously be distinguished from absolute URLs, so I don't see the need to handle them in different ways.

davidweichiang · 2019-03-22T10:45:13Z

Sorry, the Z/ was a typo. I've normalized all the URLs to the form you wrote.

mjpost · 2019-03-22T11:04:05Z

What if we encoded (in the XML) only relative paths for Anthology URLs (e.g., just Z99-9999), and absolute URLs for non-ACL content for which we host metadata but not files (e.g., LREC)? Hugo would then prefix the base URL to all such relative addresses. That would also address the mirroring issue (#22). Would this introduce any problems?

davidweichiang · 2019-03-22T11:28:44Z

I like that, but I thought @knmnyn said that it was important to have the full URL because of DOIs.

mjpost · 2019-03-22T11:54:44Z

This would be just in the XML—the site generation code would see it's a relative URL and would prepend the baseURL when building the site.
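A minimal Python sketch of that resolution step, under stated assumptions: the base URL constant, the function name, and the example.org addresses are placeholders, and the real site generation code may well do this differently (for instance inside Hugo templates).

```python
from urllib.parse import urlparse

BASE_URL = "http://www.aclweb.org/anthology/"  # assumed base URL of the main site


def resolve_url(value, base_url=BASE_URL):
    """Pass absolute URLs through; treat anything else as an Anthology-internal path."""
    if urlparse(value).scheme in ("http", "https"):
        return value  # external content, e.g. proceedings hosted elsewhere
    return base_url.rstrip("/") + "/" + value.lstrip("/")


print(resolve_url("Z99-9999"))
print(resolve_url("http://example.org/external/paper.pdf"))
# A mirror only needs a different base URL; the XML stays unchanged.
print(resolve_url("Z99-9999", "https://mirror.example.org/anthology/"))
```

Because the XML stores only `Z99-9999`, the full URL needed elsewhere can still be produced at build time.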

davidweichiang · 2019-04-09T11:52:07Z

#31 #209 #242

knmnyn · 2019-04-09T13:25:22Z

@davidweichiang @mjpost: I think @mjpost's idea from

> What if we encoded (in the XML) only relative paths for Anthology URLs (e.g., just Z99-9999), and absolute URLs for non-ACL content for which we host metadata but not files (e.g., LREC)? Hugo would then prefix the base URL to all such relative addresses. That would also address the mirroring issue (#22). Would this introduce any problems?

should be the default, and that @mjpost's reply to your comment is correct.

davidweichiang · 2019-04-09T16:10:15Z

I'd like to request that the distinction between external links and internal links not be made by using the tags <url> and <href>, because those two words basically mean the same thing (right?). I should think that a relative vs. absolute URL would be good enough, as discussed above, but if not, then how about

<file> (internal) vs <url> (external)
<url type="internal"> vs <url type="external">

knmnyn · 2019-04-09T16:44:36Z

I agree with @davidweichiang's suggestion, but I think the distinction should be made in the tag, rather than overloading the content to carry a logical distinction.

You'd want both internal and external to be allowed to coexist and perhaps even allow multiples if needed at some future point.

mbollmann · 2019-04-14T19:58:30Z

I still feel absolute vs. relative is easy enough to distinguish and doesn't need separate tags, but I would be fine with the other suggestions as well.

However, for the website, there's currently the assumption that we just have one URL, and that is used e.g. as the link target for the "PDF" button and when users click on the paper title on the paper page. If multiple URLs can occur in the XML (or internal + external), we need to figure out what the logic should be for which URL gets linked where.

A summary of changes:

- Introduces a nested format (closes acl-org#317)
- URLs are stored using a relative format for internal links (closes acl-org#156), which facilitates mirroring (acl-org#295)
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes acl-org#181 closes acl-org#180), including journal frontmatter (acl-org#264) and volume PDFs (closes #31)
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31)
- It punts on C69 reformatting (closes acl-org#147)

Relevant, but not completed:

- Creating PDF volumes by pasting together individual papers (acl-org#226)
- This makes it much easier to add non-paper entries such as talks (acl-org#298), to add a volume-level "publication date" (acl-org#319), and to create an RSS feed of updates (acl-org#358)

mbollmann pushed a commit that referenced this issue Feb 28, 2019

Hotfix for #156

876600c

mbollmann mentioned this issue Mar 16, 2019

Front matter for journals #181

Closed

mbollmann mentioned this issue Apr 16, 2019

Broken & missing links on the server #264

Open

mjpost mentioned this issue May 2, 2019

Anthology mirrors #295

Closed

3 tasks

mjpost mentioned this issue May 9, 2019

Nested volumes and explicit <url> tags #324

Merged

9 tasks

akoehn mentioned this issue May 21, 2019

A Makefile for the anthology #348

Closed

7 tasks

mjpost mentioned this issue Jun 4, 2019

Nested volumes in XML #317

Closed

mjpost closed this as completed in 0b4ea37 Jun 21, 2019

najtin pushed a commit to ir-anthology/ir-anthology that referenced this issue Jun 9, 2021

Hotfix for acl-org#156

4dfbeab
