Smart autocomplete #1047

thatbudakguy · 2024-06-07T18:36:51Z

This is a brand-new feature to enhance the search experience, both from the homepage and search results page.

To try it out, visit the interactive prototype and click in the search bar to simulate typing.

The suggestions are grouped into categories:

Datasets, Maps, etc. (corresponding to values in the resource class field in solr)
Places

For the latter, the most basic implementation would be to use values from the "Spatial coverage" field/facet. However, this won't result in a spatial search and won't move the map because those values are just metadata – they aren't tied to geospatial locations. We want to do more than this, if we can.

A more complete/featureful implementation would do geocoding, so that "New york (city)" and "New york (state)" could be matched to geospatial coordinates. When you selected each option, you would go to a search with the spatial facet active, as though you had already moved the map to the coordinates/bbox of the place you selected.

This deserves some thought re: implementation. We could query some kind of web service live during the autocomplete process, or we could pre-compute a list of known locations and their coordinates to make the process faster/simpler (but perhaps less complete).
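To make the pre-computed option concrete, here is a minimal sketch of prefix lookup over a small in-memory list of places. Everything in it is an assumption for illustration: the `GAZETTEER` entries, the rough bounding boxes, and the `suggest_places` name are hypothetical and do not come from any existing service.

```python
from bisect import bisect_left

# Hypothetical pre-computed gazetteer: (label, (west, south, east, north)).
# The coordinates are rough illustrative values, not authoritative bounding boxes.
GAZETTEER = sorted(
    [
        ("New Hampshire", (-72.6, 42.7, -70.6, 45.3)),
        ("New York (city)", (-74.3, 40.5, -73.7, 40.9)),
        ("New York (state)", (-79.8, 40.5, -71.9, 45.0)),
        ("Newark", (-74.3, 40.7, -74.1, 40.8)),
    ],
    key=lambda entry: entry[0].lower(),
)
# Lowercased labels in the same order, used for binary search.
KEYS = [label.lower() for label, _ in GAZETTEER]


def suggest_places(prefix, limit=5):
    """Return up to `limit` (label, bbox) pairs whose label starts with `prefix`."""
    needle = prefix.strip().lower()
    if not needle:
        return []
    # In a sorted list, all labels sharing a prefix are contiguous.
    start = bisect_left(KEYS, needle)
    matches = []
    for label, bbox in GAZETTEER[start:]:
        if len(matches) >= limit or not label.lower().startswith(needle):
            break
        matches.append((label, bbox))
    return matches
```

The trade-off is the one noted above: lookups are fast and need no network call, but the list only knows the places someone put into it.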

Note that Stanford maintains its own geocoder service at https://sul-geocoding-web.stanford.edu/. It would be great to take advantage of this! There's more information on that page about how the service works; see in particular this page on the geocoding API. There is also a suggestion service that might be worth investigating.

Whatever we do, we probably want to take advantage of solr's built-in functionality for search suggestions.

thatbudakguy · 2024-06-07T18:44:16Z

Might be related to or synergize with #41.

thatbudakguy · 2024-06-11T19:45:44Z

re: the stanford geocoding service, @mapninja says:

Because of the way the data is licensed, we currently have 5 services, rather than a single global service. I’m trying to negotiate a global geocoder but that may not happen until next license cycle in Jan ’25. That means that you would have to figure out how to aggregate the autocomplete from all 5 services, in order to accommodate global placenames.
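One way the aggregation across the five services could look, as a sketch only: each regional service is represented by a plain callable that takes the typed text and returns a list of labels. The region names and the `aggregate_suggestions` helper are hypothetical, and how each service is actually queried is left out.

```python
from itertools import zip_longest


def aggregate_suggestions(text, services, limit=10):
    """Merge suggestions from several regional services.

    `services` maps a region name to a callable that returns a list of labels.
    Results are interleaved round-robin so no single region dominates the list,
    and duplicate labels (compared case-insensitively) are dropped.
    """
    per_service = []
    for region, fetch in services.items():
        try:
            per_service.append([(label, region) for label in fetch(text)])
        except Exception:
            # One unavailable regional service should not break autocomplete.
            per_service.append([])

    merged, seen = [], set()
    for round_items in zip_longest(*per_service):
        for item in round_items:
            if item is None:
                continue
            key = item[0].casefold()
            if key not in seen:
                seen.add(key)
                merged.append(item)
    return merged[:limit]
```

Round-robin interleaving is just one possible policy; ranking by a score would need the services to return comparable scores, which is not something stated above.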

thatbudakguy · 2024-06-13T21:12:48Z

Evidently the ArcGIS Online (cloud) version of this API does have a /world endpoint, and because the APIs are the same, we could develop against/use that while the stanford-local one is being set up.

More info at: https://developers.arcgis.com/documentation/mapping-apis-and-services/geocoding/tutorials/search-for-an-address/
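For developing against the cloud version, a request to its suggest operation could be built roughly as below. This is a sketch under assumptions: it uses the public World geocoding service URL, only the `text`, `maxSuggestions`, and `f` parameters, and a response shaped as a `suggestions` list of objects with a `text` key. Authentication and usage terms are not covered, and a Stanford-hosted service would use a different base URL.

```python
from urllib.parse import urlencode

# Public ArcGIS World Geocoding Service (assumed base URL for this sketch).
WORLD_GEOCODER = "https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer"


def arcgis_suggest_url(text, max_suggestions=5, base=WORLD_GEOCODER):
    """Build the URL for a suggest request for the text typed so far."""
    params = {"f": "json", "text": text, "maxSuggestions": max_suggestions}
    return f"{base}/suggest?{urlencode(params)}"


def suggestion_labels(payload):
    """Extract display labels from a decoded suggest response."""
    return [item["text"] for item in payload.get("suggestions", [])]
```

Each suggestion in the response also carries a key that can be passed to a follow-up geocoding request to resolve the selected place to coordinates; that second step is omitted here.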

hudajkhan · 2024-07-17T17:52:26Z

@dnoneill had a great suggestion of looking at Blacklight 8. Some references that might be useful for that:

https://github.com/projectblacklight/blacklight/blob/main/app/components/blacklight/search_bar_component.html.erb#L24
and possibly: https://github.com/projectblacklight/blacklight/blob/main/app/views/layouts/blacklight/base.html.erb#L23

The autocomplete path will default to the suggest handler in Solr. In looking at that, I am wondering if we need to do something separately on the Solr end to build the suggest index (that's separate from the regular Solr index). We should be able to call the suggest endpoint independently with a query and see which values we will get back.
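A sketch of calling the suggest endpoint on its own and reading back the terms. The core URL and the dictionary name `mySuggester` are placeholders; the real name is whatever the suggester is called in solrconfig.xml. On the question of building the suggest index: Solr suggesters are built separately from the main index, either on request with `suggest.build=true` or via build-on-commit/startup settings in the config, so it is worth checking which applies here.

```python
from urllib.parse import urlencode


def solr_suggest_url(core_url, query, dictionary="mySuggester", count=5):
    """Build a request URL for Solr's suggest handler."""
    params = {
        "suggest": "true",
        "suggest.q": query,
        "suggest.dictionary": dictionary,  # placeholder suggester name
        "suggest.count": count,
        "wt": "json",
    }
    return f"{core_url}/suggest?{urlencode(params)}"


def suggested_terms(response, dictionary, query):
    """Pull the suggested strings out of a decoded suggest response.

    The response nests suggestions under suggest -> dictionary -> query text.
    """
    entry = response.get("suggest", {}).get(dictionary, {}).get(query, {})
    return [item["term"] for item in entry.get("suggestions", [])]
```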

Apart from the Solr stuff, some questions I have are:

For the category breakout, would it be enough to change the display of the results based on the types of Solr documents returned? i.e. we get the response from the suggest endpoint, it gets funneled to be displayed. At that point, we are able to group the documents by "Datasets"/"Maps" based on the document resource class field. We then update the display (somehow?) to show those categories along with the results.
As far as I know, the suggestion algorithm will return Solr documents as results. How would a place name be returned at all? Would place names really be a separate suggest endpoint or some kind of call, looking specifically at place name values (and not at Solr documents returned by the regular textual suggestion endpoint)? Or maybe looking at what possible place name facet values we have? Or just all place names possible in general?
What does a place name coming up in the autocomplete indicate? Does it indicate that we have results that tie to that bounding box? Or does it indicate just that these are possible place names in general?
For the dynamic spatial search - when we figure out how to get place names - are we trying to match the action of clicking on a place result to a bounding box query? E.g. The user types in "New Y". Our suggestion comes back with "New York". Clicking on New York will then show results with bounding box related to New York. Does this also do a textual search or is it only a bounding box search?

hudajkhan · 2024-07-17T17:58:19Z

Also, how do we integrate lobsters? :)

hudajkhan · 2024-07-17T18:12:18Z

Another question is - when we understand what and how we want place names to display: do we need to create a new suggest endpoint/dictionary to support this lookup?

dbranchini · 2024-07-24T21:38:29Z

I'll try to answer the questions I think I can answer. It seems there's a mix of UX and technical in those questions, and we might want to have a discussion around some of these.

For the category breakout, would it be enough to change the display of the results based on the types of Solr documents returned? i.e. we get the response from the suggest endpoint, it gets funneled to be displayed. At that point, we are able to group the documents by "Datasets"/"Maps" based on the document resource class field. We then update the display (somehow?) to show those categories along with the results.

Not a UX question, correct?

As far as I know, the suggestion algorithm will return Solr documents as results. How would a place name be returned at all? Would place names really be a separate suggest endpoint or some kind of call, looking specifically at place name values (and not at Solr documents returned by the regular textual suggestion endpoint)? Or maybe looking at what possible place name facet values we have? Or just all place names possible in general?

First part of this block seems to be more technical, but there's a UX component to the second part. We could go either route - show only locations (or place names) that match our data or show all locations. Two scenarios come to mind. One - showing only locations that match our data allows users to find datasets matching the location they're interested in. Two - showing all locations supports a different use case. I'll use India and Rajasthan (a state in India). Let's imagine we have datasets matching Rajasthan, but we don't have any matching India. If someone knows they're looking for a state in India, but can't remember the exact name or maybe they're interested in a couple of different states, etc., they might start by searching for India. If we showed only locations that match our data, they'd get zero matches. So I lean toward the latter scenario. User sees India as a location match, they click it, the map zooms to India and the search results show all datasets contained within that bounding box.

What does a place name coming up in the autocomplete indicate? Does it indicate that we have results that tie to that bounding box? Or does it indicate just that these are possible place names in general?

I believe I answered that above, so now I'm wondering if I misunderstood the question above.

For the dynamic spatial search - when we figure out how to get place names - are we trying to match the action of clicking on a place result to a bounding box query? E.g. The user types in "New Y". Our suggestion comes back with "New York". Clicking on New York will then show results with bounding box related to New York. Does this also do a textual search or is it only a bounding box search?

That's a good question. I think the textual search is shown through the matches for datasets and maps, and the location search is just a location/map search.

I hope this helps get a bigger conversation started around this. I'm open to other ideas.

thatbudakguy · 2024-07-24T21:57:27Z

For the category breakout, would it be enough to change the display of the results based on the types of Solr documents returned? i.e. we get the response from the suggest endpoint, it gets funneled to be displayed. At that point, we are able to group the documents by "Datasets"/"Maps" based on the document resource class field. We then update the display (somehow?) to show those categories along with the results.

Technically speaking, this matches my understanding and seems like something we can do now, without regard to external calls or geocoding using a service. Maybe this is the MVP implementation for this ticket: just split up the autocomplete results depending on what type of thing they are.
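As a rough sketch of that MVP, splitting returned documents by resource class could look like this. The field names `gbl_resourceClass_sm` and `dct_title_s` and the class values "Datasets" and "Maps" are assumptions based on the Aardvark schema and should be checked against the index.

```python
# Assumed Aardvark field names; verify against the actual schema.
RESOURCE_CLASS_FIELD = "gbl_resourceClass_sm"
TITLE_FIELD = "dct_title_s"


def group_by_resource_class(docs, headings=("Datasets", "Maps")):
    """Group document titles under the headings shown in the dropdown.

    Documents whose resource class is not one of `headings` are left out;
    a document with several classes appears under each matching heading.
    """
    groups = {heading: [] for heading in headings}
    for doc in docs:
        for resource_class in doc.get(RESOURCE_CLASS_FIELD, []):
            if resource_class in groups:
                groups[resource_class].append(doc.get(TITLE_FIELD, doc["id"]))
    return groups
```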

As far as I know, the suggestion algorithm will return Solr documents as results. How would a place name be returned at all? Would place names really be a separate suggest endpoint or some kind of call, looking specifically at place name values (and not at Solr documents returned by the regular textual suggestion endpoint)?

This is correct. The way to get place names is via an external call to a geocoding service, as detailed in the initial description of the ticket. What we would do is merge solr's suggestion results with the external results and display them as a single list with different headings. This could be considered the more full-featured implementation of the ticket.

What does a place name coming up in the autocomplete indicate? Does it indicate that we have results that tie to that bounding box? Or does it indicate just that these are possible place names in general?

I think it would just be simpler to do the latter. It's entirely possible that the user enters "New Y...", gets a list of suggestions, and clicks on the one for "New York (city)", but then that search has zero results, because we don't hold any data that falls within the bounding box for NYC. Doing the former seems maybe possible, but hard?

For the dynamic spatial search - when we figure out how to get place names - are we trying to match the action of clicking on a place result to a bounding box query? E.g. The user types in "New Y". Our suggestion comes back with "New York". Clicking on New York will then show results with bounding box related to New York. Does this also do a textual search or is it only a bounding box search?

Yeah, this is the idea. It would not do a textual search, because:

we want the search to return any items that fall within the geographic boundary of NYC, including those that may not have "new york" anywhere in their textual data (this is the point of this feature; otherwise these results would be missed)
we don't want the search to return all things that have "new york" in their textual data/metadata, because that might result in returning things that apply to the entire state, or even things that were just published in or mention new york
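A sketch of what "bounding box only, no textual search" could translate to on the Solr side. The field name `locn_geometry` and the choice of the `IsWithin` predicate are assumptions; GeoBlacklight may build its spatial query differently, and `Intersects` would also return items that only partly overlap the place.

```python
def place_search_params(bbox, field="locn_geometry", rows=10):
    """Build Solr params for a spatial-only search.

    `bbox` is (west, south, east, north). The query matches every document
    (q=*:*) and the filter keeps only those inside the bounding box.
    """
    west, south, east, north = bbox
    if south > north:
        raise ValueError("south must not be greater than north")
    # Solr's ENVELOPE takes (minX, maxX, maxY, minY), i.e. west, east, north, south.
    envelope = f"ENVELOPE({west},{east},{north},{south})"
    return {"q": "*:*", "fq": f'{field}:"IsWithin({envelope})"', "rows": rows}
```

Because `q` is a match-all query, items without "new york" anywhere in their metadata are still returned as long as their geometry falls inside the box.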

hudajkhan · 2024-07-31T18:19:34Z

Thanks @dbranchini and @thatbudakguy for the responses!

hudajkhan · 2024-07-31T18:21:37Z

Schema and Solr config information that is relevant
https://github.com/sul-dlss/sul-solr-configs/blob/master/earthworks-aardvark-prod/schema.xml#L60
https://github.com/sul-dlss/sul-solr-configs/blob/master/earthworks-aardvark-prod/schema.xml#L126
https://github.com/sul-dlss/sul-solr-configs/blob/master/earthworks-aardvark-prod/schema.xml#L184

https://github.com/sul-dlss/sul-solr-configs/blob/master/earthworks-aardvark-prod/solrconfig.xml#L207
https://github.com/sul-dlss/sul-solr-configs/blob/master/earthworks-aardvark-prod/solrconfig.xml#L217

hudajkhan · 2024-07-31T20:43:08Z

A few questions and comments based on discussion with @edsu:

The suggest autocomplete is currently based on multiple fields like spatial, title, and subject (among others - see link above). This means that the autocomplete results will show strings that match any of those fields (e.g. "New" will result in strings that come from the spatial field, title or subject). The figma prototype shows what appear to just be titles of resources/solr documents, whereas the suggest service (as configured in Solr) will just return any string that matches from the suggest field (which currently has content from a variety of fields). <-- This is an open question, and would be good to discuss. Specifically, are the suggest results meant to be items, so clicking on one would take us to the item? Or are they supposed to be string matches or suggestions that lead to keyword queries? Or keyword queries that are further constrained by the resource class they fall under (i.e. clicking on "riverline" under "Data" will do a keyword query for "riverline" but filter by resource class "Dataset")?
We are looking into possible context filter implementation that could help us identify only datasets or maps. (What the string result shows will be another discussion). We could possibly also look at weighting (for example, weighting the title above the subject field).
@edsu will look into the geocoding service linked above.
@hudajkhan will look at context filters some more and if we can use resource class as a filter. (And then maybe weighting but we'll see).
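For reference while looking at context filters, the request side could look like the sketch below. It assumes the suggester is configured with a context field holding the resource class, and that the lookup implementation supports context filtering (in Solr this is limited to the infix-based lookups). The dictionary name is a placeholder.

```python
def suggest_params_for_class(query, resource_class, dictionary="mySuggester", count=5):
    """Build suggest params restricted to one resource class via suggest.cfq."""
    # Quote multi-word values such as "Web services" so they are treated as one term.
    context = f'"{resource_class}"' if " " in resource_class else resource_class
    return {
        "suggest": "true",
        "suggest.q": query,
        "suggest.dictionary": dictionary,  # placeholder suggester name
        "suggest.count": count,
        "suggest.cfq": context,
    }
```

With this approach, getting separate "Datasets" and "Maps" groups would likely mean one suggest request per class.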

edsu · 2024-08-01T16:53:18Z

For the Locations part of the results I'm wondering if we could do a query for what's in the input box, and facet by dct_spatial_sm which would give us some places that actually exist in the data. Then if they click on the location they will see results for records that have that place assigned to it, and the map will automatically zoom/pan to the appropriate place?
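A sketch of that facet query and of reading the result. The `facet.contains` restriction is an addition not mentioned above: it narrows the facet values to those containing the typed text, and can be dropped to get the most frequent places for the query instead.

```python
def place_facet_params(text, field="dct_spatial_sm", limit=5):
    """Build Solr params that return place facet values for the typed text."""
    return {
        "q": text,
        "rows": 0,  # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": field,
        "facet.limit": limit,
        "facet.mincount": 1,
        # Optional: keep only facet values that contain the typed text.
        "facet.contains": text,
        "facet.contains.ignoreCase": "true",
    }


def facet_places(response, field="dct_spatial_sm"):
    """Turn Solr's flat [value, count, value, count, ...] list into pairs."""
    flat = response.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    return list(zip(flat[0::2], flat[1::2]))
```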

One concern I have about using a Gazetteer of place names is we will get a list of places, but a user who clicks on one of them could very likely end up on a screen with no results, unless we have documents that match the bounding box for the place.

hudajkhan · 2024-08-01T20:41:49Z

Summarizing some of the comments from the discussion we had at standup:

For data and maps, we want the autocomplete to actually give us possible record match titles, i.e. search results with the titles of records and not strings that may match a subject, title, or place name field. Also, we may want to highlight what matched for the result (e.g. search snippets). To that end, we are leaning towards the use of Solr search instead of Solr suggestions. Using search will also give us access to facet information as well as details about the individual records, which will help us get information about resource class and types using a single query.
For search, if we want to have left-anchored search, we should look into if that is a parameter we can use with the existing Solr (I am not sure about this) or setting up a new Solr search request endpoint (this might be possible). Perfect left-anchored search may be difficult so we should at least review possibilities, even if we don't handle that kind of search immediately.
For locations, we can follow @edsu's suggestion above with respect to using the dct_spatial_sm values to see which locations actually exist in the data.
We will look into using Themes as a category as well.
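Until the Solr-side options for left-anchored search are clear, one fallback is to re-rank the returned titles in the application. This is only a sketch of that fallback, not a Solr feature; `rank_left_anchored` is a hypothetical helper and it can only reorder what the search already returned.

```python
def rank_left_anchored(query, titles):
    """Order titles so that left-anchored matches come first.

    Rank 0: the title starts with the query.
    Rank 1: some word in the title starts with the query.
    Rank 2: the query appears elsewhere inside the title.
    Rank 3: no visible match.
    The sort is stable, so Solr's relevance order is kept within each rank.
    """
    needle = query.casefold()

    def rank(title):
        text = title.casefold()
        if text.startswith(needle):
            return 0
        if any(word.startswith(needle) for word in text.split()):
            return 1
        if needle in text:
            return 2
        return 3

    return sorted(titles, key=rank)
```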

Steps:
@edsu is looking at the geocoder services. We can review how to use those. Even if we rely purely on dct_spatial_sm facet values that match a string (i.e. "New York" and "New Hampshire"), now that Aardvark does not require bounding boxes, our search results map may not always restrict itself to "New York" or "New Hampshire" based solely on search result bounding boxes. We may still need to just move the map to that location explicitly instead of relying on the search results' bounding boxes for the user's selection of "New York", for example.

@hudajkhan is looking into these areas in roughly this order: Using search results (as they currently work with Solr /select) in the autocomplete drop down (instead of suggest). How to get Solr highlights/snippets (and if those will work for us). What it takes to do left anchored search and whether that requires an entirely new request endpoint.
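For the highlights/snippets item, a sketch of the request params and of pairing each document with its snippet. The title field name `dct_title_s` is an assumption, and whether highlighting gives useful snippets on that field depends on how it is stored and analyzed.

```python
def highlight_params(text, field="dct_title_s", rows=5):
    """Build /select params that ask Solr to highlight matches in one field."""
    return {"q": text, "rows": rows, "fl": f"id,{field}", "hl": "true", "hl.fl": field}


def titles_with_snippets(response, field="dct_title_s"):
    """Pair each document id with its highlighted snippet, or its plain title.

    Solr returns highlights in a separate "highlighting" section keyed by
    document id, so they have to be joined back to the documents.
    """
    highlighting = response.get("highlighting", {})
    results = []
    for doc in response.get("response", {}).get("docs", []):
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        results.append((doc["id"], snippets[0] if snippets else doc.get(field, "")))
    return results
```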

@thatbudakguy @edsu @dbranchini please feel free to add any comments if I got any of the above wrong.

hudajkhan · 2024-08-05T22:17:06Z

@dbranchini , we still have a few questions for you. I think another meeting later will help, and I will update the ticket with some more implementation options/examples.

Here is a current screenshot of an (unstyled) view of the autocomplete results we get if we use the regular search endpoint (not the suggest endpoint).

In general, this looks ok, although it does take longer than a regular autosuggest. How this works currently is that a regular search is done against Solr with "New" which returns a list of Solr documents. The code then organizes the results into datasets and maps based on the resource class, and then looks at the dct_spatial_sm (i.e. place) facet to get the most frequent place values from the results of the "New" query.

Here are some of the issues with using search instead of suggest:

The results, as currently implemented, are not guaranteed to match any part of the query input. For example, if we type in "data", we get locations that do not have "data" in the string label, and datasets and maps that do not have "data" in the titles. This seems to be a limitation of using regular search for auto-suggest, because we suspect users usually expect to see some obvious match between the query and the suggestions. We want to confirm whether this is the functionality we want to see.

Search results take a bit longer than the suggest request end point
The suggest endpoint (i.e. not above but what we see currently in production) copies over values from the following fields: title, creator, publisher, place, schema provider, and subject to match against the query. If we figure out how to correctly get the categories we need from the suggest endpoint, would it make sense to add categories to the display? In other words - datasets, maps, creators/authors, publishers, etc. Or should we limit the fields that are copied over for suggest so we only copy those values (e.g. if we decide we only want suggestions that match titles, locations, and themes, then we make sure those are the only string values returned for search).

hudajkhan · 2024-08-05T23:00:11Z

In the meantime, here are a few possibilities to examine:

Using the suggest endpoint (i.e. the general implementation we have in production right now), can we add information to the payload field in order to be able to discern the category to which a string returned by the suggest service belongs? e.g. dataset, map, location?
For the above, can we add information in a way that only updates the solr config or schema, and doesn't require changes to the indexing workflow?
For the above, if the solution is a change to the indexing workflow, when and how should we proceed with that?
If we try to use regular search instead of suggest, should we consider setting up a new search request handler OR passing in parameters that limit which fields the search queries to limit the kinds of matches we get?
If we try using a regular search, what should we do in order to get more of a left anchored search (i.e. the left hand side of the string is what we are matching against or preferring matches for)?
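If the payload route works out, the display side of the first possibility could be as simple as the sketch below. It assumes the suggester is configured with a payload field and that the payload holds a plain category string such as "dataset", "map", or "location"; how that value gets into the index is exactly the open question above.

```python
def group_by_payload(response, dictionary, query, default="Other"):
    """Group suggested terms by the category carried in each payload."""
    entry = response.get("suggest", {}).get(dictionary, {}).get(query, {})
    groups = {}
    for item in entry.get("suggestions", []):
        # Suggestions with an empty or missing payload fall under `default`.
        category = item.get("payload") or default
        groups.setdefault(category, []).append(item["term"])
    return groups
```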

edsu · 2024-08-06T17:39:02Z

In standup today we discussed whether we might want to get smart-suggestions for things other than locations and titles, such as publishers, creators, etc.

It seemed like the consensus was to hold off for now and only add indexes that we know we need for the current design. We will need to reindex to get the new suggest indexes into place, and should be able to repeat that process if there's a need to suggest publishers, subjects, creators, etc.

hudajkhan · 2024-08-06T21:20:19Z

@edsu I have added my work done so far to this branch:
https://github.com/sul-dlss/earthworks/tree/bl8auto

This hasn't been rebased to the latest from bl8, but it should have the pieces necessary to test out the configuration. I updated the schema.xml and solrconfig.xml files for the solr configuration. The other important updates are to the blacklight suggest response model and to the suggest.html.erb file.

edsu · 2024-08-19T16:57:16Z

I've put an alternative approach, using the /select endpoint instead of /suggest, on this branch:

https://github.com/sul-dlss/earthworks/tree/bl8-auto-search

The advantage to using search is that we get back solr documents, whose id we can use to construct a direct link from the suggestion, which I think is the intended behavior for the datasets and maps links in results?
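To illustrate the direct-link idea: a sketch that turns returned documents into (title, link) pairs. The `/catalog/<id>` path follows the usual Blacklight convention but is an assumption here, as is the `dct_title_s` field name; in the app itself the link would come from the Rails route helpers.

```python
from urllib.parse import quote


def suggestion_link(doc, base_path="/catalog"):
    """Build a direct link to a record from its Solr document id."""
    return f"{base_path}/{quote(str(doc['id']), safe='')}"


def linked_suggestions(docs, title_field="dct_title_s"):
    """Return (title, link) pairs, falling back to the id when no title exists."""
    return [(doc.get(title_field, doc["id"]), suggestion_link(doc)) for doc in docs]
```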

marlo-longley · 2024-08-19T17:20:31Z

We want to schedule a group session on this.
Should this drop off of the workcycle and be applicable to other Blacklight apps?
Huda and Ed will get both solutions up and running, and then the team can review and discuss.

2 questions: 1. functionality and 2. finesse of search context

Marlo will schedule a huddle.

hudajkhan · 2024-08-22T21:34:40Z

Discussion was scheduled. Slides at
https://docs.google.com/presentation/d/1tsgnAX2_yjxsbf1G_VRSHK9e9nh3g1lgp12VX0zShFw/edit#slide=id.g2edf351186c_0_128 .

More user feedback/review required.
We will document our work so far, and look into setting up a test environment with this configuration.

thatbudakguy added the enhancement label Jun 7, 2024

thatbudakguy added this to Geo Workcycles 2024 Jun 7, 2024

thatbudakguy moved this to Ready in Geo Workcycles 2024 Jun 7, 2024

thatbudakguy mentioned this issue Jun 7, 2024

Update the home page UI #1048

Open

thatbudakguy mentioned this issue Jun 7, 2024

Update the search results page #1052

Closed

7 tasks

thatbudakguy moved this from Ready to Blocked in Geo Workcycles 2024 Jul 2, 2024

thatbudakguy moved this from Blocked to Ready in Geo Workcycles 2024 Jul 16, 2024

thatbudakguy added the question label Jul 24, 2024

hudajkhan assigned edsu and hudajkhan Jul 31, 2024

hudajkhan moved this from Ready to In Progress in Geo Workcycles 2024 Jul 31, 2024

marlo-longley mentioned this issue Aug 5, 2024

Look into shingle filter for autocomplete enhancements #259

Open

edsu mentioned this issue Aug 12, 2024

Smart autocomplete #1153

Closed

This was referenced Aug 12, 2024

Smart auto suggest #1154

Draft

Smart suggest using search #1180

Closed

hudajkhan mentioned this issue Aug 22, 2024

Document autosuggest approaches, Solr configuration and indexing requirements, and explore how to setup user feedback environment #1247

Open

hudajkhan closed this as completed Aug 22, 2024

github-project-automation bot moved this from In Progress to Done in Geo Workcycles 2024 Aug 22, 2024
