Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[For merging on 30.09] landing page is canonical #513

Merged
merged 12 commits into from
Oct 1, 2019
Merged

[For merging on 30.09] landing page is canonical #513

merged 12 commits into from
Oct 1, 2019

Conversation

mjpost
Copy link
Member

@mjpost mjpost commented Aug 18, 2019

This fix redirects paper pages like

https://aclweb.org/anthology/papers/P/P19/P19-1002/

to the canonical slashed variant

https://aclweb.org/anthology/P19-1002/

I hope this will stop people from accidentally sharing the nested URL forms which I find ugly and redundant. Note that this is just a small change; it's related to #480 but doesn't yet change the canonical URL. (Also note that I have tested this out and it is actually already in place).

@mjpost mjpost requested review from mbollmann and a team August 18, 2019 01:09
@mjpost
Copy link
Member Author

mjpost commented Aug 18, 2019

The latest push removes the links to nested pages in favor of the canonical variants:

  • paper page: ID/
  • BibTeX: ID.bib
  • EndNote: ID.endf

etc. I did not switch the canonical page yet. Testing shows that it works so far.

Copy link
Member Author

@mjpost mjpost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just submitted comments that should make the changes here live. Can someone from @acl-org/anthology take a look and approve?

hugo/layouts/papers/list-entry.html Show resolved Hide resolved
hugo/layouts/papers/list-entry.html Show resolved Hide resolved
{{- end -}}
{{- $bibfile := printf "/papers/%s/%s/%s.bib" (slicestr $volume_id 0 1) $volume_id .Params.anthology_id -}}
{{- if (fileExists (printf "/data-export/%s" $bibfile)) -}}
<a class="badge badge-secondary align-middle mr-1" href="{{ $bibfile | relURL }}" data-toggle="tooltip" data-placement="top" title="Export to BibTeX">
<a class="badge badge-secondary align-middle mr-1" href="{{ $paper.url }}.bib" data-toggle="tooltip" data-placement="top" title="Export to BibTeX">
bib
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bib path is now ${ANTH_ID}.bib, e.g., https://aclweb.org/anthology/P19-1002.bib, instead of the nested version.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're no longer using $bibfile, but the link is still within an if clause that checks for $bibfile.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine though, right? If the bibfile exists, we generate the appropriate link to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change this because I'm not sure how to fix it. If the bibfile is present, we generate the canonical link to it, which is handled by a redirect internally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the question is whether we still want a check for the existence of the file in the Hugo template.

If we do, then this current solution is suboptimal IMO since it requires us manually remembering and updating the actual local path to the file (in the $bibfile := ... line), in case it ever changes.

If we do not, we should just throw out the surrounding lines altogether.

@@ -28,7 +25,7 @@
{{ if eq hugo.Environment "development" }}
<span class="badge badge-light align-middle">{{ .Params.anthology_id }}</span>
{{ end }}
<strong><a class="align-middle" href="{{ .RelPermalink }}">{{ $paper.title_html | safeHTML }}</a></strong>
<strong><a class="align-middle" href="{{ $paper.url }}/">{{ $paper.title_html | safeHTML }}</a></strong>
<br />
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the canonical page (https://aclweb.org/anthology/P19-1002/) instead of the nested version. This will get rid of the double search results.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as above: URL can refer to external content.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. url now points to the page's Anthology URL; pdf points to the PDF (which can be internal or external).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of how this is now handled is that it completely breaks local browsing of the site, since links always point to the live anthology.

I see two options here:

  1. Making the behaviour dependent on Hugo's "environment" variable, using .RelPermalink when in development mode and $paper.url otherwise. I don't really like this, as diverging behaviour has the potential to hide bugs.

  2. Generating the flat structure directly with Hugo, as discussed in Should the canonical URL be the landing page? #480, instead of relying on redirects.

@@ -44,7 +44,7 @@
{{ $bibfile := printf "/papers/%s/%s/%s.bib" (slicestr $volume_id 0 1) $volume_id $anthology_id }}
<section id="main">
<h2 id="title">
<a href="{{ $paper.url }}">{{ $paper.title_html | safeHTML }}</a>
<a href="{{ $paper.url }}/">{{ $paper.title_html | safeHTML }}</a>
</h2>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paper page has a / at the end.

{{ end }}
{{ $endfile := printf "/papers/%s/%s/%s.endf" (slicestr $volume_id 0 1) $volume_id $anthology_id }}
{{ if (fileExists (printf "/data-export/%s" $endfile)) }}
<a class="btn btn-secondary btn-sm" href="{{ $endfile | relURL }}">EndNote</a>
<a class="btn btn-secondary btn-sm" href="{{ $paper.url}}.endf">EndNote</a>
{{ end }}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the canonical URL prefix instead of the deeply nested prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as with $bibfile above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do you suggest that we generate bib, endf, etc links in the YAML, and only use those if present?

Copy link
Member Author

@mjpost mjpost Sep 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I ended up doing for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, but has the same issue with local browsing–the links will always point to the live server.

hugo/static/.htaccess Outdated Show resolved Hide resolved
# The Paper metadata page (e.g., P17-1069/ -> papers/P/P17/P17-1069/index.html)
RewriteRule ^([A-Za-z])([0-9][0-9])\-([0-9][0-9][0-9][0-9])\/$ papers/$1/$1$2/$1$2-$3/ [L,NC]
# The Paper metadata page (e.g., P17-1069/ loads papers/P/P17/P17-1069/index.html)
RewriteRule ^([A-Za-z])([0-9][0-9])\-([0-9][0-9][0-9][0-9])\/$ papers/$1/$1$2/$1$2-$3/index.html [L,NC]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loads the index file when https://aclweb.org/anthology/P19-1002/ (with a slash) is requested.

@mbollmann
Copy link
Member

Besides my individual comments, I really don't like this mixing of URL logic. URLs should either be constructed in the YAML or in the template, but right now it's a mixture of both, which seems very error-prone to me. (And in fact I believe this will break things, since we have external links as well.)

@mjpost
Copy link
Member Author

mjpost commented Sep 4, 2019

Thanks @mbollmann, it sounds like this needs more work. What do you think about just pushing up #520 in the meantime? (I've actually had this live for some time).

Copy link
Member

@mbollmann mbollmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's make sure this functionality won't break external links, and ideally refactor it to the YAML if feasible.

@mjpost
Copy link
Member Author

mjpost commented Sep 7, 2019

One way to do this would be for the YAML to have the following fields:

  • url: the paper landing page
  • pdf: absolute URL to the PDF, whether internal or external
  • endf, .xml, etc for our other formats

We then just print their values. Thoughts? I think we can use this PR as the official PR for #480.

@mjpost mjpost changed the title Redirect nested paper pages to the / variant [For merging on 30.09] landing page is canonical Sep 22, 2019
@mjpost mjpost requested review from mbollmann and a team September 22, 2019 12:16
@akoehn
Copy link
Member

akoehn commented Sep 22, 2019

@mbollmann, do you have time for a review? You are more qualified, but otherwise I will have a look at it next week.

Copy link
Member

@mbollmann mbollmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor functional comments, but my main issue with this is how it breaks local browsing of the site, as detailed in the comments.

bin/anthology/papers.py Outdated Show resolved Hide resolved
{{- end -}}
{{- $bibfile := printf "/papers/%s/%s/%s.bib" (slicestr $volume_id 0 1) $volume_id .Params.anthology_id -}}
{{- if (fileExists (printf "/data-export/%s" $bibfile)) -}}
<a class="badge badge-secondary align-middle mr-1" href="{{ $bibfile | relURL }}" data-toggle="tooltip" data-placement="top" title="Export to BibTeX">
<a class="badge badge-secondary align-middle mr-1" href="{{ $paper.url }}.bib" data-toggle="tooltip" data-placement="top" title="Export to BibTeX">
bib
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the question is whether we still want a check for the existence of the file in the Hugo template.

If we do, then this current solution is suboptimal IMO since it requires us manually remembering and updating the actual local path to the file (in the $bibfile := ... line), in case it ever changes.

If we do not, we should just throw out the surrounding lines altogether.

@@ -28,7 +25,7 @@
{{ if eq hugo.Environment "development" }}
<span class="badge badge-light align-middle">{{ .Params.anthology_id }}</span>
{{ end }}
<strong><a class="align-middle" href="{{ .RelPermalink }}">{{ $paper.title_html | safeHTML }}</a></strong>
<strong><a class="align-middle" href="{{ $paper.url }}/">{{ $paper.title_html | safeHTML }}</a></strong>
<br />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside of how this is now handled is that it completely breaks local browsing of the site, since links always point to the live anthology.

I see two options here:

  1. Making the behaviour dependent on Hugo's "environment" variable, using .RelPermalink when in development mode and $paper.url otherwise. I don't really like this, as diverging behaviour has the potential to hide bugs.

  2. Generating the flat structure directly with Hugo, as discussed in Should the canonical URL be the landing page? #480, instead of relying on redirects.

hugo/layouts/papers/single.html Show resolved Hide resolved
{{ end }}
{{ $endfile := printf "/papers/%s/%s/%s.endf" (slicestr $volume_id 0 1) $volume_id $anthology_id }}
{{ if (fileExists (printf "/data-export/%s" $endfile)) }}
<a class="btn btn-secondary btn-sm" href="{{ $endfile | relURL }}">EndNote</a>
<a class="btn btn-secondary btn-sm" href="{{ $paper.url}}.endf">EndNote</a>
{{ end }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice, but has the same issue with local browsing–the links will always point to the live server.

@mjpost
Copy link
Member Author

mjpost commented Sep 22, 2019

I'm not sure I like either (1) or (2) above:

  1. It's ugly, and I agree about potentially hidden bugs; and
  2. It's ugly to put all those files in one directory, especially the top-level directory.

What about

  1. We adapt make serve to run a docker container that contains an Apache instance that makes use of our .htaccess file. This has the advantage that it would be more fully testing our setup. We'd also have to change the generated URLs to be relative to the local root, which is a good change to make anyway. It would be a bit of a pain for developers to setup, but maybe it's a reasonable amount of pain to expect. It could also help us with the mirroring situations.

@akoehn
Copy link
Member

akoehn commented Sep 25, 2019

It's ugly to put all those files in one directory, especially the top-level directory.

Yes, we then have a lot of files in one directory. But this does not slow down anything, it is "only" a personal preference to have files sorted into directories (don't mean it insulting, I share the general preference). In fact, we lose complexity and the website might actually become faster with a flat structure.

This reduced complexity might also make it easier to reason about bugs as the html files directly show the problems and no mental step through redirects is necessary.

To not have all files in a single directory, we could instead also have a sub-directory for HTML, one for bibtex, etc.

I just fear that we add more complexity (docker containers, switches on the environment) where we could reduce it. Sure, a complex system can look beautiful, but a simple system is simple.

@mbollmann
Copy link
Member

FWIW, I believe that changing the output locations of the paper pages should be as simple as adding

[permalinks]
  papers = "/:filename/"

to hugo/config.toml. The .RelPermalink variable will adjust accordingly. (Untested though.)

@mjpost
Copy link
Member Author

mjpost commented Sep 25, 2019

One worry is that our hosting provider might not have the same modern file system. mount reports them all as "virtfs". Do you think this is possible?

What if we kept the current nested links for all generated files (bib, xml, endnote), but also added the top-level RewriteRule? That way they will both work. It's really only the .pdf links and the landing page that require only one external URL.

@mjpost
Copy link
Member Author

mjpost commented Sep 26, 2019

Okay, but this wouldn't work for the landing page.

What if we:

This adds a bit more structure to the top-level and has direct links. It does mean that the generated files are accessible from two paths, but I don't think that's a problem for those kinds of files.

Second, does anyone have time to do this before October 1? I will own it since I am the one who announced the date, but maybe it's quicker for someone else.

@akoehn
Copy link
Member

akoehn commented Sep 26, 2019

One worry is that our hosting provider might not have the same modern file system.

Ext4 has been a stable file system since 2008, that is more than ten years ago. I am sure they don't host this on FAT32 or similar ...

To test whether speed is an issue, you can run this in the build dir to get a flat html structure:
mkdir anthology-flat; for i in $(find anthology -type f -name index.html); do cp $i $(echo $i | sed 's|/|-|g; s|-index.html|.html|; s|anthology-|anthology-flat/|'); done
Then you can rsync the anthology-flat directory to the server and see whether accessing a file in it is slow.

With your second approach, we have more structure but would probably have a similar slow-down if the server cannot handle thousands of entries in a single directory.

@akoehn
Copy link
Member

akoehn commented Sep 27, 2019

Second, does anyone have time to do this before October 1?

Unfortunately not me, I am lying in my bed with a fever :-(

Do whatever you want to achieve this and ignore all my comments if it makes your life easier.

@mjpost
Copy link
Member Author

mjpost commented Sep 27, 2019

Sorry to hear it, @akoehn! I hope you feel better soon.

I ran some tests on the server, and there is no problem with all the files in one directory. So let's go that route.

The only choice left is whether to put generated files in subdirectories or not (e.g., bib files under bib/). Unless I hear otherwise, I will just put them in the root directory. So the root will have, for example:

N19-1017/
    index.html
N19-1017.bib
N19-1017.xml
N19-1017.endf
...

@knmnyn
Copy link
Collaborator

knmnyn commented Sep 28, 2019 via email

@mjpost
Copy link
Member Author

mjpost commented Sep 28, 2019

Last update: I've looked into this a bit, and I don't know hugo well enough to be able to implement the single directory by Monday. I'm going to suggest that we use the current proposal (which breaks local builds) until someone has a minute to fix it according our discussion here.

@mbollmann
Copy link
Member

I've noticed a couple of instances where links don't point to the right location yet; for example, the metadata field citation_pdf_url which, I believe, is used for Google Scholar indexing, doesn't actually point to the PDF anymore!

I can try to fix those, but debugging the website without a working local build is substantially harder...

@mjpost
Copy link
Member Author

mjpost commented Sep 29, 2019

How hard is it to fix the local builds? I’m happy to have everything in one directory but I don’t know hugo well enough.

@mbollmann
Copy link
Member

For the pages it should be as simple as the config file change I've wrote about above, but I'd want to test it and make sure that all the links point to the correct places. The bibliography files shouldn't be too hard either, but the same caveat applies.

I might have some time later tonight, but no promises. What time exactly would you want to push this live?

@mjpost
Copy link
Member Author

mjpost commented Sep 29, 2019

I don't think the time is important. It would be good if we could make it available sometime on the 1st, though.

@mbollmann
Copy link
Member

I've added two commits that should do what we discussed, and appear to produce the right thing locally.

We could put bibliography files in a subdirectory as well (in that case I'd go with the /paper-id/paper-id.bib route) and have the link point to /paper-id.bib – that's a trivial change, but to make this work locally I think we'd have to introduce a flag that changes the URL accordingly.

@mbollmann mbollmann requested a review from a team September 29, 2019 19:17
@mjpost
Copy link
Member Author

mjpost commented Sep 29, 2019

Great, I'll test this out.

@mjpost
Copy link
Member Author

mjpost commented Sep 29, 2019

One note: in the Python, I changed the meaning of the "url" field to denote the page itself (the landing page, the canonical URL), whereas "PDF" links to the PDF. So on the paper landing page, "URL" should be a self link, instead of to the PDF.

image

@mbollmann
Copy link
Member

It always used to link to the PDF though, so this would be a functional change.

@mbollmann
Copy link
Member

Also consider pages with external links, like LREC publications, which also point to the PDF in the "URL" field:

https://www.aclweb.org/anthology/L14-1003/

@mjpost
Copy link
Member Author

mjpost commented Sep 29, 2019

Okay, you've convinced me. I'll merge this tomorrow night!

@mjpost mjpost merged commit 19a337f into master Oct 1, 2019
@mjpost mjpost deleted the redirect branch October 1, 2019 00:35
mjpost added a commit that referenced this pull request Oct 1, 2019
* removed redirect rules (should have been in #513)
* added nodalida to events page (closes #544)
@mbollmann mbollmann mentioned this pull request Nov 1, 2019
najtin pushed a commit to ir-anthology/ir-anthology that referenced this pull request Jun 9, 2021
* Removed nested links in favor of canonical variants
* Flatten hierarchy of paper pages, change ANTHOLOGY_URL -> _PDF
* Fix URLs that should point to PDFs
* Fix volume-level URLs that should point to PDFs
* Generate bibliography files in flat structure
najtin pushed a commit to ir-anthology/ir-anthology that referenced this pull request Jun 9, 2021
* removed redirect rules (should have been in acl-org#513)
* added nodalida to events page (closes acl-org#544)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants