"sort by year" makes papers vanish #2605

Open
2 tasks done
jeisner opened this issue Jul 2, 2023 · 15 comments · Fixed by #2611

@jeisner

jeisner commented Jul 2, 2023

Confirm that this is a bug report

  • I want to report an issue that does not concern paper or author metadata.
  • I have searched for similar existing issues first.

Problem Description

The following query gets "about 28" hits, including many paper PDFs. But when I select "Sort by Year of Publication," only 1 hit remains. Maybe this means that publication year is missing for many papers? (Should they be shown anyway, perhaps at the end of the listing?)

https://aclanthology.org/search/?q=snowdon+nun

I want to report an issue that does not concern paper or author metadata.

Ok, it does concern those things, but not for a specific paper ...

I'm using Google Chrome 114.0.5735.198 on Linux Mint.

[Screenshots: sort-relevance, sort-year]

@jeisner jeisner added the bug label Jul 2, 2023
@jeisner jeisner changed the title "sort by relevance" makes papers vanish "sort by year" makes papers vanish Jul 2, 2023
@akoehn
Member

akoehn commented Jul 2, 2023

The search is, unfortunately, not under our control but under Google's. The only way for us to fix these issues would be to move to a different search engine, for which we do not have the (personal) resources.

@mbollmann
Member

mbollmann commented Jul 2, 2023

Well, we do have control over how we configure the Google Custom Search, and the stark discrepancy in results between different sorting options could be a sign that something is misconfigured.

I took a look at the settings but can't see any obvious problem. The "sort by year" option should sort by the <meta name="citation_publication_date"> tag, which all our paper pages should have. I am puzzled why the search would filter results based on a sorting option at all; that doesn't make any sense to me. I can check whether the naming and content of the field are in line with what GCS expects.
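One way to check whether a given page actually exposes that field is to parse its HTML and collect the tag's content. The following is a minimal sketch using only Python's standard library; the sample markup is hypothetical and the real landing pages may be structured differently.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Collects the content of <meta name="citation_publication_date"> tags."""

    def __init__(self):
        super().__init__()
        self.dates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "citation_publication_date":
            self.dates.append(attr_map.get("content"))


def publication_dates(html_text):
    """Return every citation_publication_date value found in html_text."""
    parser = MetaDateParser()
    parser.feed(html_text)
    return parser.dates


# Hypothetical landing-page snippet, for illustration only.
sample_page = """
<html><head>
<title>Example paper</title>
<meta name="citation_publication_date" content="2022/5">
</head><body></body></html>
"""

print(publication_dates(sample_page))
```

A page without the tag yields an empty list, which is the situation a non-HTML resource would be in.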

That said, I do have a strong interest in replacing the Google search as well as time to do it (as evidenced by e.g. #165 (comment)), but unfortunately this isn't as simple as other changes to the site. At the moment, Matt and I are actively investigating whether it's an option to use Semantic Scholar instead.

@jeisner
Author

jeisner commented Jul 2, 2023

Replying to @akoehn (sorry, messages crossed):

Right, I know that Google is providing the service.
If this feature can't be made to work, then maybe it should be removed.
But I assume it worked when the feature was added. And Google still seems to support an API for it.
They say that the Anthology could provide them with the metadata for this in any of several ways, for example:

A meta tag of the form <meta name="pubdate" content="20100101"> can be used with a search operator of the form: &sort=metatags-pubdate.

@mbollmann
Member

aclanthology.org/search/?q=snowdon+nun

Hmm, I see now that the search results for @jeisner's query are almost exclusively PDF files, probably because the query terms hardly appear in titles or abstracts, only in the full text. So maybe the results disappear when sorting by year because PDF files, by virtue of not being HTML, don't have <meta> tags?

@jeisner
Author

jeisner commented Jul 2, 2023

Sounds right. The one that appears does have Snowdon in the abstract.

My attempt to sort by @mbollmann's field, https://aclanthology.org/search/?q=snowdon+nun&sort=metatags-citation_publication_date , still displays all of the results. But they are still sorted by relevance, I think. So maybe my &sort attempt is overridden in the back end by something that imposes the Sort by Relevance that is advertised on the page.
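For reference, a URL like the one above can be assembled programmatically. This is a minimal sketch; `build_search_url` is a hypothetical helper, and whether the search page honors the `sort` parameter at all is exactly what is in question here.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://aclanthology.org/search/"


def build_search_url(query, sort_field=None):
    """Build a search URL, optionally adding a metatags-based sort key."""
    params = {"q": query}
    if sort_field is not None:
        # Follows the "&sort=metatags-<name>" operator form quoted earlier.
        params["sort"] = f"metatags-{sort_field}"
    return f"{SEARCH_URL}?{urlencode(params)}"


url = build_search_url("snowdon nun", "citation_publication_date")
print(url)
```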

Fortunately, Google provides other means for supplying metadata info about the indexed files. This thread gives advice about how to do it for PDFs.

You might also be able to modify the PDF files themselves to add a custom metadata field, or just use the existing Created: and Modified: fields, but it seems safer for various reasons to supply the metadata from outside.

@jeisner
Author

jeisner commented Jul 2, 2023

In particular, you can specify PageMap data in the Sitemap.

@mbollmann
Member

Thanks @jeisner!

So from that document, what we could try is adding PDF files to the sitemap with meta information like this:

<url>
  <loc>https://aclanthology.org/2022.acl-long.1.pdf</loc>
  <PageMap>
    <DataObject type="metatags">
      <Attribute name="citation_publication_date" value="2022/5"/>
    </DataObject>
  </PageMap>
</url>

This should hopefully add metadata in a way that Google sees as equivalent to the <meta> tag on the landing page.
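Generating such an entry for every PDF could look roughly like the sketch below. It only mirrors the fragment above using Python's standard library; `pdf_url_entry` is a hypothetical helper, not the code from the commit, and the namespace declarations that a complete sitemap file needs are deliberately left out.

```python
import xml.etree.ElementTree as ET


def pdf_url_entry(pdf_url, publication_date):
    """Build a <url> element carrying PageMap metadata for one PDF."""
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = pdf_url
    pagemap = ET.SubElement(url, "PageMap")
    data_object = ET.SubElement(pagemap, "DataObject", {"type": "metatags"})
    ET.SubElement(
        data_object,
        "Attribute",
        {"name": "citation_publication_date", "value": publication_date},
    )
    return url


entry = pdf_url_entry("https://aclanthology.org/2022.acl-long.1.pdf", "2022/5")
ET.indent(entry)  # pretty-printing only; requires Python 3.9+
print(ET.tostring(entry, encoding="unicode"))
```

Building the element tree instead of concatenating strings means that special characters in URLs or values are escaped automatically.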

@mjpost I've tried adding this in f696824; before I make a PR, I'd suggest I build the site locally with that sitemap and try submitting the sitemap manually to Google Search to see if it works. I'm thinking I should also wait until after the ACL ingestion is complete to try this.

@mbollmann
Member

Ah, my bad. I can't submit a sitemap XML file to the Google Search console, only a URL to a sitemap file. So I guess we'd have to merge the PR first and then see if it worked...

In any case I merged in the new ACL ingestion and checked that the sitemap generates as intended, at least.

@mbollmann
Member

Re-opening this until Google has processed the sitemap and we can check results.

@mjpost
Member

mjpost commented Jul 10, 2023

Did you manually resubmit it in the Google search console?

@mbollmann
Member

Yes, I manually resubmitted the index file, and that caused at least some parts of the sitemap files to be re-read immediately, revealing a namespace error message (see #2615).

@akoehn
Member

akoehn commented Jul 10, 2023

One aspect I still dislike about this approach is that the search leads directly to the PDF and not to the canonical page, but that seems to be something we need to accept as long as we do not post-process the results or switch to a different provider. Or can the landing page be set as canonical for the PDF in the sitemap?

That being said, thanks for the pointer, @jeisner! Seems like I was just too used to the search not working as intended.

@mbollmann
Member

Sorting by "Year of Publication" now makes PDFs show up for me!

[Screenshot: searching for "snowdon nun" and sorting by year of publication]

That it doesn't show all of them might be because not all parts of the sitemap have been re-read by Google Search so far; let's wait a bit and see.

@jeisner
Author

jeisner commented Jul 11, 2023

Awesome, that's progress! Thanks @mbollmann!

In addition to only 5 of the "about 123" results showing up so far, I notice that their years are 2017, 2022, 2018, 2018, 2017, in that order, so it's not quite reverse chronological -- the first one is out of order. (Maybe there is a bug in the new sitemap data?)

@mbollmann
Member

mbollmann commented Jul 11, 2023

In addition to only 5 of the "about 123" results showing up so far, I notice that their years are 2017, 2022, 2018, 2018, 2017, in that order, so it's not quite reverse chronological -- the first one is out of order.

I wonder if the value of the field, which is of the format "YYYY/MM", isn't interpreted as intended by the sorting algorithm. I'd have to play around with a few different queries and check if I can spot any pattern...

EDIT: But I think it's best to give it a day or two to make sure that it's not just Google's database not being fully caught up with the new sitemap yet. I don't know if "sitemap was read" means that changes are reflected instantly on the search as well.
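To illustrate why an unpadded "YYYY/MM" value could sort unexpectedly if it is compared as a plain string, here is a small sketch. The date values are made up for illustration and are not the ones from the search results, `normalize_date` is a hypothetical helper, and it is an assumption (based on the "20100101" example quoted earlier) that a zero-padded YYYYMMDD value would suit the sort better. Whether this explains the ordering observed above is not known.

```python
def normalize_date(value):
    """Convert 'YYYY', 'YYYY/M', or 'YYYY/MM/DD' to a 'YYYYMMDD' string.

    A missing month or day defaults to 01, which is an assumption made here.
    """
    parts = value.split("/")
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {value!r}")
    return f"{year:04d}{month:02d}{day:02d}"


# Made-up values in the "YYYY/M" style of the sitemap fragment.
raw = ["2017/11", "2022/5", "2018/7", "2018/10", "2017/4"]

# Plain string comparison misorders unpadded months: "2018/7" > "2018/10".
print(sorted(raw, reverse=True))

# After normalization, string order and chronological order agree.
print(sorted((normalize_date(d) for d in raw), reverse=True))
```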
