"sort by year" makes papers vanish #2605
Comments
The search is, unfortunately, not under our control but under Google's. The only way for us to fix these issues would be to move to a different search engine, for which we do not have the (personal) resources.
Well, we do have control over how we configure the Google Custom Search, and the stark discrepancy in results between different sorting options could be a sign that something is misconfigured. I took a look at the settings, but can't see any obvious problem, though. The "sort by year" option should sort by the
That said, I do have a strong interest in replacing the Google search as well as time to do it (as evidenced by e.g. #165 (comment)), but unfortunately this isn't as simple as other changes to the site. At the moment, Matt and I are actively investigating whether it's an option to use Semantic Scholar instead.
Replying to @akoehn (sorry, messages crossed): Right, I know that Google is providing the service.
Hmm, I see now that the search results for @jeisner's query are almost exclusively PDF files, probably because that query hardly appears in titles/abstracts, but only in the fulltext. So maybe the results disappear when sorting by year because PDF files, by virtue of not being XML, don't have
Sounds right. The one that appears does have
My attempt to sort by @mbollmann's field, https://aclanthology.org/search/?q=snowdon+nun&sort=metatags-citation_publication_date, still displays all of the results. But they are still sorted by relevance, I think. So maybe my
Fortunately, Google provides other means for supplying metadata info about the indexed files. This thread gives advice about how to do it for PDFs. You might also be able to modify the PDF files themselves to add a custom metadata field, or just use the existing Created: and Modified: fields, but it seems safer for various reasons to supply the metadata from outside.
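For anyone who wants to compare the result lists under different sort options, the URLs quoted in this thread can be rebuilt with a small helper. This is only a sketch: `search_url` is a made-up name, and the assumption that the search page takes `q` and `sort` as query parameters is inferred from the URLs above, not from the site's code.

```python
from urllib.parse import urlencode

SEARCH_PAGE = "https://aclanthology.org/search/"


def search_url(query, sort=None):
    """Build a search-page URL; `sort` is passed through unchanged (hypothetical helper)."""
    params = {"q": query}
    if sort is not None:
        params["sort"] = sort
    # urlencode encodes spaces as "+", matching the URLs quoted in this thread.
    return f"{SEARCH_PAGE}?{urlencode(params)}"


print(search_url("snowdon nun"))
print(search_url("snowdon nun", sort="metatags-citation_publication_date"))
```

Opening both URLs and counting the hits is a quick way to see whether a given sort key drops results or merely fails to reorder them.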
In particular, you can specify PageMap data in the Sitemap.
Thanks @jeisner! So from that document, what we could try is adding PDF files to the sitemap with meta information like this:

<url>
  <loc>https://aclanthology.org/2022.acl-long.1.pdf</loc>
  <PageMap>
    <DataObject type="metatags">
      <Attribute name="citation_publication_date" value="2022/5"/>
    </DataObject>
  </PageMap>
</url>

This should hopefully add metadata in a way that Google sees as equivalent to the
@mjpost I've tried adding this in f696824; before I make a PR, I'd suggest I build the site locally with that sitemap and try submitting the sitemap manually to Google Search to see if it works. I'm thinking I should also wait until after the ACL ingestion is complete to try this.
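As a rough illustration of how such entries could be generated per PDF, here is a minimal Python sketch. It is not the code from the commit mentioned above: `pdf_sitemap_entry` is a hypothetical helper, and the `xmlns` value on `<PageMap>` is the namespace as I understand it from Google's PageMap documentation, so please verify it before relying on it.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: namespace URI taken from my reading of Google's PageMap docs.
PAGEMAP_NS = "http://www.google.com/schemas/sitemap-pagemap/1.0"


def pdf_sitemap_entry(pdf_url, year, month=None):
    """Return a sitemap <url> element with a publication-date metatag (hypothetical helper)."""
    # Same "YYYY/M" shape as in the example above; falls back to the bare year.
    date = f"{year}/{month}" if month else str(year)
    return "\n".join([
        "<url>",
        f"  <loc>{escape(pdf_url)}</loc>",
        f"  <PageMap xmlns={quoteattr(PAGEMAP_NS)}>",
        '    <DataObject type="metatags">',
        f'      <Attribute name="citation_publication_date" value={quoteattr(date)}/>',
        "    </DataObject>",
        "  </PageMap>",
        "</url>",
    ])


print(pdf_sitemap_entry("https://aclanthology.org/2022.acl-long.1.pdf", 2022, 5))
```

Escaping the URL and quoting the attribute values keeps the output well-formed even if a path ever contains `&` or quotes.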
Ah, my bad. I can't submit a sitemap XML file to the Google Search console, only a URL to a sitemap file. So I guess we'd have to merge the PR first and then see if it worked... In any case I merged in the new ACL ingestion and checked that the sitemap generates as intended, at least.
Re-opening this until Google has processed the sitemap and we can check results.
Did you manually resubmit it in the Google search console?
Yes, I manually resubmitted the index file, and that caused at least some parts of the sitemap files to be re-read immediately, revealing a namespace error message (see #2615).
One aspect I still dislike about this approach is that the search is leading directly to the PDF and not to the canonical page, but that seems to be something we need to accept as long as we do not post-process the results or switch to a different provider. Or can the canonical site be set as canonical for the PDF in the sitemap? That being said, thanks for the pointer, @jeisner! Seems like I was just too used to the search not working as intended.
Awesome, that's progress! Thanks @mbollmann! In addition to only 5 of the "about 123" results showing up so far, I notice that their years are 2017, 2022, 2018, 2018, 2017, in that order, so it's not quite reverse chronological -- the first one is out of order. (Maybe there is a bug in the new sitemap data?)
I wonder if the value of the field, which is of the format "YYYY/MM", isn't interpreted as intended by the sorting algorithm. I'd have to play around with a few different queries and check if I can spot any pattern...
EDIT: But I think it's best to give it a day or two to make sure that it's not just Google's database not being fully caught up with the new sitemap yet. I don't know if "sitemap was read" means that changes are reflected instantly on the search as well.
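To make the concern about the date format concrete, here is a small Python sketch of one way unpadded values could misbehave if they were compared as plain strings. How Google actually compares these values is not known to me, so this is purely illustrative; `sort_key` and the sample dates are made up.

```python
def sort_key(value):
    """Turn 'YYYY' or 'YYYY/M' into a zero-padded 'YYYYMM' string (hypothetical helper)."""
    year, _, month = value.partition("/")
    # A missing month is treated as 00 so that bare years still sort consistently.
    return f"{int(year):04d}{int(month or 0):02d}"


dates = ["2022/5", "2017/9", "2022/10", "2018"]

# Plain string comparison: "2022/5" sorts after "2022/10" because "5" > "1".
print(sorted(dates, reverse=True))

# Zero-padded keys restore true reverse-chronological order.
print(sorted(dates, key=sort_key, reverse=True))
```

This would not by itself explain an older year appearing ahead of a newer one, but it shows why a zero-padded or purely numeric value might be a safer thing to put in the sitemap.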
Confirm that this is a bug report
Problem Description
The following query gets "about 28" hits, including many paper PDFs. But when I select "Sort by Year of Publication," only 1 hit remains. Maybe this means that publication year is missing for many papers? (Should they be shown anyway, perhaps at the end of the listing?)
https://aclanthology.org/search/?q=snowdon+nun
Ok, it does concern those things, but not for a specific paper ...
I'm using Google Chrome 114.0.5735.198 on Linux Mint.