Search box #165
Comments
So I don't know anything about Apache Solr, but the way we currently use it seems to be tightly integrated with the Rails application (the instructions use I've also never used Google Custom Search before, but from what I gather from the documentation, it seems to be pretty powerful in its ability to customize how the search works exactly, and it should certainly be much easier to integrate into the static website. If we count as a non-profit organization (here's where I reveal my total ignorance of the legal status of the ACL—do we?), we can even get rid of advertisements in the search. So my suggestion would be to go for the Google solution now, then see if we are satisfied with it or want something tailor-made again in the future, which someone would probably have to re-build and maintain (it should at least be decoupled from the whole Rails thing, and if we kept Solr this would also be a good opportunity to upgrade from Solr 3.5 to Solr 7). If we do this, @mjpost would have to decide what Google account should be the owner of the Custom Search and create it; anyone can be made admin afterwards, and I can take care of setting it up correctly and integrating it. |
I started this process and added you as an admin. I'm looking into the non-profit stuff, if you want to take care of the site side of things. |
I can shed some light on this. Solr is so tightly integrated with the Rails application because the application was built on top of Blacklight, a template for developing Rails apps with integrated search. That said, we don't need to go through Blacklight if we want custom search: we could go directly through Solr, or even through a different search engine. I think using Google Custom Search would work as an agile solution, but I've written before about why I think an in-house search would work better in the long term. Searching for papers by author and year, for instance, is something that I find useful and that would be gone if we used Google. I have toyed with the idea of using an XML database directly, since it would allow us to search without having to keep a parallel database: we would just feed it the canonical XMLs and the schema, and the DB would work its magic. But I realize that this would require development time, and that's something we are a bit short on. At the moment I'm quite busy, as evidenced by my significantly lower participation in issues, but I could hopefully invest some time in this in two or three weeks if there is interest. In the meantime, getting Google up and running seems straightforward and would be an okay solution. |
Thanks, Martín. This sounds good to me: we roll out with a custom Google search, and also build our own solution as time permits. We can then compare them. Note that a third option is bibsearch, a tool that @davvil and I wrote that could be adapted here as a CGI app. It's quite fast and also based on a custom database. David has indicated interest in continuing to maintain that, and this could remove some redundancy. I also plan to write an Alfred plugin that allows quick search from the OS X desktop. |
Update: It seems that the customizations I was hoping to do with Google Custom Search are not possible after all. The documentation for Google CSE has a nice section on rich result snippets that demonstrates customization of search results based on structured data. However, the links explaining how to do this are either 404 or link to pages from CSE v1 which is no longer supported. All further info I could find online also refers to this old v1 search element. In conclusion, it appears that this used to be a thing but is no longer possible with the current CSE v2 (at least not with the free version), even though the docs suggest otherwise. This makes a custom-built solution more appealing again in the long term. |
I asked about this on another thread:
@mbollmann replied there:
I see. But couldn't we just parse the HTML results? (I suppose that requires maintenance if the HTML format changes, but maybe that doesn't happen too often. The other issue would be having to retrieve multiple pages of results to get the first page of filtered results.) |
Apart from the question of whether that's against the TOS of Google's Custom Search (which I'm not sure of right now), I don't see how. In contrast to the results from the JSON API, there is no metadata (such as author information) in the HTML results AFAICT. |
Presumably the page title is also the paper title, and so the start of the page title (shown in blue for each search hit) can be used to index into our database to retrieve the other metadata. On the rare occasion that the hit is consistent with multiple papers, be generous and keep it if any of those papers match the search criteria. If the hit is not consistent with any papers for some reason, be generous and keep it, or else fall back to some kind of fuzzy search (like agrep). |
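To make the "be generous" rule above concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `keep_hit`, `papers_by_title`, and `matches_criteria` are hypothetical names, the metadata is assumed to be available as a dict keyed by lowercased title, and the standard library's `difflib` stands in for a fuzzy matcher like agrep.

```python
import difflib


def keep_hit(hit_title, papers_by_title, matches_criteria, cutoff=0.8):
    """Decide whether a search hit survives the metadata filter.

    papers_by_title: hypothetical dict mapping a lowercased title to the
        list of papers carrying that title (usually one, rarely more).
    matches_criteria: hypothetical predicate on one paper's metadata,
        e.g. "published in 2019" or "written by a given author".
    """
    key = hit_title.lower()
    candidates = papers_by_title.get(key)
    if candidates is None:
        # No exact match: fall back to fuzzy matching over known titles.
        close = difflib.get_close_matches(
            key, list(papers_by_title), n=1, cutoff=cutoff
        )
        if not close:
            # The hit is not consistent with any paper: be generous, keep it.
            return True
        candidates = papers_by_title[close[0]]
    # Several papers may share the title: keep the hit if any of them matches.
    return any(matches_criteria(paper) for paper in candidates)
```

A hit would then be dropped only when it can be tied to known papers and none of them satisfies the filter. The `cutoff` value is a guess and would need tuning against real titles.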
There is no database to query anymore though, since the whole site is statically generated now. |
I see. But there is a static bibtex database on the site. Search results are necessarily dynamic: the improved search box would thus have to talk to a process that serves up results by querying Google and filtering those results. When that process started up, it could read the bib files and construct a simple in-memory index (e.g., a hash on the start of the canonicalized title). |
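A rough sketch of what such a startup index could look like, in Python. It assumes the bib files have already been parsed into dicts with a `title` field (the parsing itself is left out), and the names `canonicalize`, `build_index`, `lookup`, and the `PREFIX_LEN` value are all hypothetical choices, not anything that exists in the Anthology code.

```python
import re
import unicodedata
from collections import defaultdict

PREFIX_LEN = 40  # hypothetical: number of canonical characters used as the key


def canonicalize(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def build_index(papers):
    """Hash every paper on the start of its canonicalized title."""
    index = defaultdict(list)
    for paper in papers:
        index[canonicalize(paper["title"])[:PREFIX_LEN]].append(paper)
    return index


def lookup(index, hit_title):
    """Return all papers whose title start matches the hit's title start."""
    return index.get(canonicalize(hit_title)[:PREFIX_LEN], [])
```

One limitation of this particular keying: a hit title only matches if it is the bare paper title or agrees with it on the first `PREFIX_LEN` canonical characters, so a short paper title followed by a site suffix would be missed and would need extra handling.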
True, but do you see any advantages over a fully server-side search solution anymore then? @mjpost suggested a CGI app based on bibsearch above, for example. Once we introduce a server-side component, we might as well go all the way, no? |
I was thinking of two advantages. |
Hmm -- in the upper-right corner of the Google search results, there is a "Sort by:" dropdown that lets you choose either "Relevance" or "Year of Publication". Where did that come from? There are also tabs to search "Authors," "Events," and "Paper Metadata." I believe these are meant to restrict the search to certain kinds of pages on the site. However, it took me a few minutes to come up with that theory. I fear that these tabs might be misinterpreted as saying "please list the authors / events of all the papers you just found." That would be useful but is not what the tabs currently do. For example, a full-text search on "puns" might find that the most relevant papers are by Jo Bloggs, but she won't be on the author tab unless her author page contains the word "puns" (e.g., in a paper title). |
It's the customizations that Google Custom Search lets you do (and that I added). "Year of Publication" sorts by About your previous suggestion, I'll have to think about it a bit more. I can see the advantages of piggybacking on Google, but I'm still not sure it's not too hacky to be maintainable in the long run and/or against Google's TOS. That said, maybe just having a server-side & customized option (e.g. based on bibsearch) alongside the Google one might already give users more options? As I said, I'll have to think about it more. |
The "Paper Metadata" tab is now broken after the change to a flat directory hierarchy in #513, and I don't see how it can be restored. In Google Custom Search, we can assign labels to URLs or some very simple URL patterns, which was used to label all URLs beginning with I have discussed options to switch to a different search engine with @mjpost, but still have to prepare an overview and a suggestion. Once I do, I will open a separate issue to discuss this, but in the meantime I wanted to note this problem with GCS here. |
Just cross-referencing various search-related issues to have them in one place.
|
Leaving some notes here regarding options for building our own custom search engine, since I've been pondering about search functionality for quite a while now.

Server-side search

In 2020, I built a search engine prototype that used server-side search via Meilisearch. It has been offline and unmaintained for a while now since there was no clear path towards integrating it into the Anthology. However, I think there are some good arguments in favor of picking this idea up again:
In other words, there are now open-source, commercial-grade solutions for both backend and frontend via Meilisearch, which could make this solution a bit of a safer bet now regarding long-term stability and maintainability.

Client-side search

I've also wondered about the feasibility of purely client-side search. There are tons of libraries for that, such as Fuse.js or Lunr.js, and if you try an interactive demo/comparison and generate, say, 100000 titles to search in, it still performs blazingly fast (a few milliseconds per search on my machine). The biggest bottleneck for this kind of solution, IMO, is getting the data to the user. Let's take a paper index as an example that contains only Anthology ID, paper title, and author last names. A pre-built search index for Lunr.js comes out at around 32 MB, or 8.7 MB after gzip compression. The unindexed JSON data (which would need to be indexed client-side every time) is 14 MB, or 4 MB compressed. I'm not sure what's acceptable in terms of data volume for websites to transfer these days, but considering that this will only grow, and doesn't even include abstracts or other metadata yet, I'm a bit skeptical that this is a good way to go. On the plus side, maintaining a purely client-side solution would most definitely be easier. Anyway, if anyone's still interested in making this happen eventually, I'm happy to hear other people's thoughts as well. @mjpost @akoehn |
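For anyone who wants to re-check the transfer-size question as the Anthology grows, a small Python sketch along these lines could measure the raw and gzip-compressed size of the unindexed JSON. The function name `payload_sizes` and the record layout (ID, title, author last names) are assumptions for illustration; this is not the script that produced the figures above.

```python
import gzip
import json


def payload_sizes(records):
    """Return (raw_bytes, gzipped_bytes) for a JSON dump of the records.

    records: hypothetical list of dicts, e.g.
        {"id": ..., "title": ..., "authors": [last names]}
    """
    raw = json.dumps(
        records, ensure_ascii=False, separators=(",", ":")
    ).encode("utf-8")
    return len(raw), len(gzip.compress(raw))
```

The compact `separators` setting avoids counting whitespace that a production build would strip anyway. Measuring a pre-built Lunr.js index would work the same way once the serialized index has been exported as JSON.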
Do any of these solutions support dense retrieval? That is, embed queries and passages into a vector space, so that exact word match isn't required. I'm asking because I assume that Google Custom Search must be evolving in this direction. |
I feel a quick, site-internal metadata search is somewhat complementary to a dense retrieval system like you're describing. I'm really mainly talking about the former here, as I think the current Google Custom Search doesn't fulfill this role very well, and we should probably not build and support our own homegrown solution to the latter. Note that exact word match isn't required for most of these search solutions I'm talking about either. They typically employ some form of fast, fuzzy matching. (Although of course that doesn't handle synonymy etc., if that's what you were thinking.) I have wondered if it's an option to collaborate with Semantic Scholar somehow. They already provide an API that provides metadata about whether a paper belongs to the ACL Anthology, but last I checked they didn't support search queries that filtered based on this. |
We haven't discussed what to do with the search box when the static site goes live. Currently it's disabled. I suggest that we either
I do like our custom search, and we get some more control over the results. So I suppose (2) is semi-dependent on finding someone to maintain it. Thoughts?
CC: @mbollmann @villalbamartin