URLs in authoritative XML #156
Comments
More info: The current treatment already produces some invalid URLs on the live website as well. For example, LREC 2010 is not hosted on ACL servers, and the individual papers link to the LREC website instead. In the XML, that link is given in The front matter and the proceedings volume itself, however, don't have this In other words, we need to distinguish between "default link to the Anthology" and "no link at all", which probably only works if we explicitly provide the correct URL in the XML always. Thoughts? |
This is set by workshop/pub chairs in START. I assume that the Anthology chair tells the pub chair what the right value is, and the pub chair tells the workshop chairs what their right value is. But IMO the pub/workshop chairs shouldn't even have to think about this. They should give the Anthology a tarball in which all the filenames start with |
The example link you give will work if you do
For the flat links ( I agree that the URLs should be specified explicitly, and if none is present, the site will not provide a link. As to link formats, I have come to like the flat addressing style as long as it's organized hierarchically underneath. |
I'm working on anthologize.pl now. I can have anthologize.pl rewrite the URL with the correct value. Or, an ingestion script on the Anthology side can do it. Let me know what you think is better. (Why does LREC 2010 have the URL in an href attribute instead of a url element? Is there a difference in meaning?) |
I think it's fine to have anthologize.pl write it out, to have things correct as early as possible. We can always overwrite it on ingest. That's strange about LREC 2010, but note that it seems to have both an |
Somewhat related question: there are other links of various kinds in the XML:
I guess these have accumulated over time, but there isn't a lot of consistency. In particular, it seems that the various attachments are just stored as filenames, whereas the paper is stored as a full URL (which is ignored!). My suggestion: deprecate the
|
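FWIW, I already conflate <software>, <dataset>, and <video> with <attachment> (setting the tag name as the attachment type) during YAML conversion, so the YAML is already "cleaned up" in that regard. I don't do anything with the <mrf> tag yet, though; it kinda escaped my attention... |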
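A minimal sketch of the conflation discussed in this thread, where <software>, <dataset>, and <video> elements are treated as attachments whose type is the original tag name. The function name, the output shape, the sample filenames, and the `type` attribute on <attachment> are assumptions for illustration, not the actual conversion code; <mrf> is deliberately left untouched, as in the comment above.

```python
import xml.etree.ElementTree as ET

# Tags folded into attachments; the set comes from the thread,
# the output shape below is an assumption.
ATTACHMENT_TAGS = ("attachment", "software", "dataset", "video")


def collect_attachments(paper):
    """Return a list of {"type", "filename"} dicts for one <paper> element."""
    attachments = []
    for child in paper:
        if child.tag not in ATTACHMENT_TAGS:
            continue  # e.g. <title> or <mrf> are not handled here
        filename = (child.text or "").strip()
        if not filename:
            continue  # skip empty elements
        if child.tag == "attachment":
            # Hypothetical: a plain <attachment> may carry its own type attribute.
            kind = child.get("type", "attachment")
        else:
            # The legacy tag name becomes the attachment type.
            kind = child.tag
        attachments.append({"type": kind, "filename": filename})
    return attachments


# Hypothetical sample entry; the filenames are made up.
SAMPLE = """
<paper id="1001">
  <title>An Example Paper</title>
  <software>P00-1001.Software.zip</software>
  <dataset>P00-1001.Datasets.tgz</dataset>
  <attachment type="poster">P00-1001.Poster.pdf</attachment>
  <mrf>P00-1001.mrf.xml</mrf>
</paper>
"""

result = collect_attachments(ET.fromstring(SAMPLE.strip()))
for item in result:
    print(item["type"], item["filename"])
```

Keeping the type in a field rather than in the tag name means a new attachment kind would not need a new element.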
Hi Marcel, all:
It is a great suggestion to fold in all of the tags to <attachments/> and
allow the attribute field within the tag to define the type and its
treatment.
Ideally the MRF (machine readable format) tag would be involved in that
too, but there are many potential types of tags that could be defined
there, so that's why originally it was left as a separate field (to
distinguish automated contributions from NLP software being run over the
publications)
Cheers,
Min
On Mon, Mar 11, 2019 at 11:35 PM Marcel Bollmann ***@***.***> wrote:
FWIW, I already conflate <software>, <dataset>, and <video> with
<attachment> (setting the tag name as the attachment type) during YAML
conversion, so the YAML is already "cleaned up" in that regard.
I don't do anything with the <mrf> tag yet, though, it kinda escaped my
attention...
|
Also, TACL has |
The new version of anthologize.pl I've just pushed always rewrites the That made it easier to detect nonstandard URLs in old conferences (ebdf086), although I didn't attempt to change I think that might close this issue? Or should we fix |
The original intent was to hide the indirection logic and present a canonical version to the user for both citation and reference. With respect to @davidweichiang 's last comment shouldn't it be One of the URLs (with the relative mapping logic) are for files hosted within the Anthology. However, some of our sister organizations host their own files (e.g., LREC used to not let Anthology host copies of their proceedings) and so absolute URLs were needed. I would continue to favor having separate standards for paths/URLs within the Anthology hosted materials and ones that exist outside (needing absolute references). |
👍 +1
I also mentioned somewhere that it will affect the possibility of full mirroring (#28); if we give absolute URLs always, that's a problem for mirroring the full site including the PDFs, and one actual advantage of only specifying an internal filename. However, about having |
Sorry, the |
What if we encoded (in the XML) only relative paths for Anthology URLs (e.g., just |
I like that, but I thought @knmnyn said that it was important to have the full URL because of DOIs. |
This would be just in the XML—the site generation code would see it's a relative URL and would prepend the baseURL when building the site. |
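The proposal above could be sketched roughly as follows: the site generator treats a value without an http(s) scheme as Anthology-internal and prepends a base URL, passes absolute URLs through unchanged, and emits no link when the XML has no URL. The names `BASE_URL` and `resolve_url` are hypothetical, the base URL is taken from the example link elsewhere in this thread, and the external URL in the usage lines is a placeholder.

```python
from urllib.parse import urlparse

# Assumed base URL, taken from the example link in this thread.
BASE_URL = "http://www.aclweb.org/anthology/"


def resolve_url(xml_url, base_url=BASE_URL):
    """Map the content of a <url> element to the link shown on the site.

    Returns None when the XML provides no URL, so no link is generated.
    """
    if xml_url is None or not xml_url.strip():
        return None
    value = xml_url.strip()
    if urlparse(value).scheme in ("http", "https"):
        # Absolute URL: externally hosted material, used as-is.
        return value
    # Relative value: Anthology-internal, so prepend the base URL.
    return base_url.rstrip("/") + "/" + value.lstrip("/")


print(resolve_url("P05-1026"))
print(resolve_url("https://example.org/proceedings/paper.pdf"))
print(resolve_url(None))
```

Because the base URL is a parameter, a mirror could pass its own prefix without touching the XML.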
@davidweichiang @mjpost : I think @mjpost 's idea from
should be the default and that @mjpost reply on your comment is correct. |
I'd like to request that the distinction between external links and internal links not be made by using the tags
|
I agree with @davidweichiang 's suggestion, but I think the distinction should be in the tag, not overload the content to serve a logical distinction. You'd want both internal and external to be allowed to coexist and perhaps even allow multiples if needed at some future point. |
I still feel absolute vs. relative is easy enough to distinguish and doesn't need separate tags, but I would be fine with the other suggestions as well. However, for the website, there's currently the assumption that we just have one URL, and that is used e.g. as the link target for the "PDF" button and when users click on the paper title on the paper page. If multiple URLs can occur in the XML (or internal + external), we need to figure out what the logic should be for which URL gets linked where. |
A summary of changes:
- Introduces a nested format (closes acl-org#317)
- URLs are stored using a relative format for internal links (closes acl-org#156), which facilitates mirroring (acl-org#295)
- URLs are only displayed if they are found in the XML. I manually crawled to validate and create entries for PDFs for all frontmatter entries (closes acl-org#181 closes acl-org#180), including journal frontmatter (acl-org#264) and volume PDFs (closes #31)
- Added missing entries and removed ones whose PDFs were missing, including LREC 2014 (closes #31)
- It punts on C69 reformatting (closes acl-org#147)

Relevant, but not completed:
- Creating PDF volumes by pasting together individual papers (acl-org#226)
- This makes it much easier to add non-paper entries such as talks (acl-org#298), to add a volume-level "publication date" (acl-org#319), and to create an RSS feed of updates (acl-org#358)
I already noted in #145 that revision URLs are not stored in the XML, but inferred in the Rails application. Now, I'm pretty sure that actually this is the case for all URLs.
For example, P05.xml has entries of the following format:
These URLs are 404, and the current website actually (correctly) links to http://www.aclweb.org/anthology/P05-1026 instead.
On the other hand, other files (e.g. A00.xml) have no <url> elements at all, but still produce a correct link to the PDF on the website.
This should probably be handled in some uniform way; either provide the correct URLs in the XML always, or never provide them when they're in the Anthology and follow the standard format. I think it's particularly problematic to have invalid URLs in there, as it's not intuitive that the current website basically ignores them.