Moving to a search server: Why & how ?

Since 2010, GeoNetwork community has been discussing the move from Lucene to Solr in order to improve user search experience. Main motivations for that move was:

Improve suggestions (eg. spell check, suggest only on user records)
Facets on any fields and cross fields facet hierarchy
Scoring, boosting results
Similar documents
Highlights in response
Join query
Improve Lucene memory issues on some setup (require restart)
Reduce Lucene multilingual/search complexity

Moving from Lucene to Solr or Elasticsearch introduce a major change in the application. The search server is running aside GeoNetwork. A proxy is implemented in GeoNetwork to do search and enrich queries and responses based on user privileges.

Based on the WFS data indexing funded by Ifremer, a first codesprint was made in April/May 2016 with titellus (Francois Prunayre) and camptocamp (Patrick Valsecchi, Antoine Abt, Florent Gravin) to replace Lucene by Solr.

This codesprint focus on starting the move to Solr in order to identify main issues & risks / main benefits and draw a roadmap in order to then look for funding. This document sum-up what has been done so far and illustrate features that could be relevant for GeoNetwork.

Codesprint main targets

Analyze how to move to Solr
Investigate Solr features and illustrate the benefits
Start migration & refactoring focusing on main search service and CSW; identify features to deprecate
Illustrate with a simple search interface providing the capability to search on metadata and datasets

draft

Technical overview

New dependency:

Solr 6
Java 8 required

Removed dependency:

Lucene 4.9 dependency

GeoNetwork major changes:

Angular app use a simple HTTP Interceptor to allows basic search (the interceptor mimic q service query/response translation from/to Solr format). This is used to enable basic functionalities in current UI.
New experimental Angular UI for search (on features and metadata)
Integrate cleaning PR ie. Remove ExtJS UI, Old XSL services, Z39.50 server

Work

See branch https://github.com/geonetwork/core-geonetwork/tree/solr

Preview of improvements

First experiments:

Geographic features https://www.youtube.com/watch?v=VFiQEi0U-yc (indexing, attribute table view, facetting, heatmaps)
(Meta)data search https://www.youtube.com/watch?v=3FyugQMxaiE for search in metadata and data at the same time

Spellchecks & suggestions

Spell checking module allows to suggest related search to end users in case of typo. Suggestion module can be used to provide suggestions based on field in the index.

Example of suggestions and similar words:

draft

Examples on typos:

draft

Spell check also works on phrases:

draft

Current suggestion in GeoNetwork is based on a search and could not provide terms that are not matching results as current implementation does (see https://github.com/geonetwork/core-geonetwork/issues/1466, https://github.com/geonetwork/core-geonetwork/issues/634, https://github.com/geonetwork/core-geonetwork/issues/1003.

Find similar document

Using "MoreLikeThis" component, easily provide similar document to the one you're currently looking at (eg. other versions of the same dataset). See https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

eg. search for ortho imagery, when you retrieve an image for 2015, you also have similar images in 2009, 2012. More like this response is structured that way

draft

Boosting

Search can know boost on specific fields during search or indexing (eg. give more score for match in the title) using Solr search API.

Synonyms

Solr support synonyms configuration based on simple text file or more advanced synonym map (configurable using API). Synonyms are heavily used in the INSPIRE dashboard project (eg. INSPIRE themes & annex https://github.com/INSPIRE-MIF/daobs/blob/daobs-1.0.x/solr/solr-config/src/main/solr-cores/data/conf/_schema_analysis_synonyms_inspireannex.json, contact and territory in France https://github.com/fxprunayre/daobs/blob/geocataloguefr/solr/solr-config/src/main/solr-cores/data/conf/_schema_analysis_synonyms_geocat_producer_territory.json).

Once configured, synonyms can be used in search/facets/stats components.

It extends the use of thesaurus in GeoNetwork currently only broader/narrower relation in thesaurus is used for hierarchycal facets (https://github.com/geonetwork/core-geonetwork/wiki/201411HierarchicalFacetSupport).

Fine tuned queries

Query syntax could be used to make more flexible searches:

draft

Search and index analysis chain is also better configured and will avoid search errors like when searching on full title.

Highlighting

The Highlighter module provides the capability to highlight matching words in results eg. in abstract.

draft

Sample query: http://localhost:8984/solr/catalog_srv_shard1_replica1/select?q=map&rows=1&wt=json&indent=true&hl=true&hl.fl=resource*&hl.simple.pre=%3Cstrong%3E&hl.simple.post=%3C%2Fstrong%3E
Sample response:

document....
},
  highlighting: {
    501: {
      resourceAbstract: [
        "Use this template to describe a static <strong>map</strong> (eg. PDF or image) or an interactive <strong>map</strong> (eg. WMC)."
      ]
    }
  }

Note: Field MUST be tokenized. eg. does not work with String, should use text_general type.

Facetting

Instead of using the server config-summary.xml which defines a predefined list of facets, Solr allows to create facet on any fields. The client could easily request any facets required. For example, the WFS feature data filter computes automatically facets on all feature attributes. It computes statistics on field for numeric and date type fields and compute facet configuration on-the-fly:

draft

GeoNetwork facet only support term facet returning a list of values with a count of records. More advanced facetting could be done with Solr:

range
interval
heatmap (for geometry)
pivot

draft

Pivot can also be quite flexible using the new Solr facet API allowing multilevel facets. User could for example request:

a first level facet on resource type (eg. feature/dataset/service)
a second level facet on point of contact
a third level on conformity
... and get statistics on each pivot

eg. http://localhost:8984/solr/catalog_srv_shard1_replica1/select?indent=on&q=*:*&wt=json&rows=0&facet=true&json.facet={test:{terms:resourceType}}

eg. http://localhost:8984/solr/catalog_srv_shard1_replica1/select?indent=on&q=*:*&wt=json&rows=0&facet=true&json.facet={level1:{type:terms,field:resourceType,missing:true,facet:{tag:{type:terms,field:tag}}}}

Facet API also provide the capability to request more facet values, paging in facets, ...

Indexing related documents and data

When data is available using WFS (see https://github.com/geonetwork/core-geonetwork/wiki/WFS-Filters-based-on-WFS-indexing-with-SOLR). This work needs to be extended to also index other types of document (eg. PDF). Parser like Apache Tika can be used for this task.

Grouping/Collapsing

Those features could be relevant to grouping results (datasets/serie, features/dataset, ...). Links between document must be added in the index. eg. search can be combined on both metadata and features.

Sample query:
http://localhost:8984/solr/catalog_srv/select?indent=on&q=+docType:feature&wt=json&group=true&group.field=parent&group.limit=4
Get metadata with feature http://localhost:8984/solr/catalog_srv/select?indent=on&q={!join%20from=parent%20to=metadataIdentifier}+%2BdocType:feature&wt=json&fl=resourceTitle
Get metadata with feature about "MEDECO" http://localhost:8984/solr/catalog_srv/select?indent=on&q={!join%20from=parent%20to=metadataIdentifier}+%2BdocType:feature+%2BMEDECO&wt=json&fl=resourceTitle

grouped: {
 parent: {
  matches: 8624,
  groups: [
  {
   groupValue: "89dee307e38c972b333b152d9bd19bb2e9bb0d4d",
   doclist: {
   numFound: 49,
   start: 0,
   docs: [
   {
     id: "states.1",
     docType: "feature"

More work required:

groupValue contains the UUID (how to get label ?)
Issue on group on multivalue field
Block join (https://cwiki.apache.org/confluence/display/solr/Other+Parsers)

Issues:

Does not return info about child docs.

## Spatial searches

Spatial search has been tested for both feature and metadata indexing/searching. Indexing of millions of object was tested. Some limitations were identified and need some more testing (eg. indexing ship track over the world was quite long to index based on the index grid size).

Heatmap feature is also used in feature analysis.

Spatial searches is based on Lucene spatial and does not use GeoTools filter. So far, spatial queries looks to be working fine.

Performances

To be tested.

Misc.

Conclusion

Moving from Lucene to a search engine will bring major benefits by bringing many features implemented in search servers like Solr or Elasticsearch (including better scalability). In both cases, a proxy is placed in front in order to deal with privileges and building responses. Major tasks which will represent most of the workload is:

implementing multilingual support (by using one field per language instead of one index by language as we do now).
rework the Angular client to deal with the new format response
re-implement all search protocols (the POC focused on CSW, but GN also implement OpenSearch, OAIPMH, SRU, Atom, ...)

Technical analysis & configuration

Suggestion

Solr configuration

Sample query http://localhost:8984/solr/catalog_srv_shard1_replica1/spell?q=bosin&spellcheck=true&spellcheck.collateParam.q.op=AND

Spellcheck and suggestion configuration is made in:

solrconfig.xml: define module configuration
schema: define which fields use to build the dictionary (currently, title, tags, abstract)

Response contains a dedicated spellcheck and suggestion section:

<response>
<result name="response" numFound="0" start="0"/>
<lst name="spellcheck">
 <lst name="suggestions">
  <lst name="bosin">
  <int name="numFound">1</int>
  <int name="startOffset">0</int>
  <int name="endOffset">5</int>
  <int name="origFreq">0</int>
  <arr name="suggestion">
   <lst>
    <str name="word">basins</str>
    <int name="freq">1</int>
   </lst>
  </arr>
 </lst>
</lst>
<bool name="correctlySpelled">false</bool>
<lst name="collations">
 <lst name="collation">
  <str name="collationQuery">basins</str>
  <int name="hits">1</int>
  <lst name="misspellingsAndCorrections">
  <str name="bosin">basins</str>

More work required

Configure the dictionary updates (on commit ?)
Add from the admin the capability to rebuild the dictionary
"Also don't forget to build the spellcheck dictionary before you use it:"
URL to trigger dictionnary update http://localhost:8984/solr/catalog_srv_shard1_replica1/select?&suggest=true&suggest.dictionary=mainSuggester&suggest.buildAll=true

Client search application

The simple search application focused on drafting Angular components to easily create interface on top of Solr Search. In that work, we tried to overcome issues made in the first Angular components (eg. difficulties to have more than one search in the same app) and we started the design of components for search (eg. requestHandler, facets, results, paging, ...).

TODO: Add some more details.

Preview of limitations

TODO

Solr migration work

Search

All communications made with Solr is handled by a proxy. The proxy takes care of:

Query / Add user privileges to search filters
Response / Add extra information on metadata document eg. can edit, is selected (formerly geonet:info)
Provide access to search, spellcheck, suggestion, facet.
Provide access to search for any type of document ie. metadata or data. The client should filter what to query.

Search response format is JSON.

Solr is not required to start the application but a warning is displayed in case of error contacting the search engine.

draft

A health check tests if Solr is up & running and report status in the admin console.

Major changes:

Search / Parameters / No default set. Client needs to define all (before, search defaults on isTemplate:n)
Selection / Add q parameter to select the records matching a specific query. Not related to session last search anymore. See SelectionManager

More work required

Multilingual search / Move from one index per language to field in each language in same index
OAI-PMH
Atom service
RSS search
CSV search
Server / Response / Can we have complex JSON object in response instead of only flat structure ?
Client / Can not sort on multivalue field (eg. denominator): Create min and max field in index

CSW

GetDomain / Basic support / RangeValues is not supported
GetRecords
Config / Review mapping to solr field

More work required

Virtual CSW / Needs testing
Testing

Indexing

Indexing is still made in 2 steps:

XSL transformation to extract information from metadata record
Add information from the database.

BTW, atomic update have been implemented in order to update popularity and rating without reindexing the full document for better performance.

Integration tests

More work required:

How to setup/start Solr for running tests ?

Relation

Editor / Update field name in relation panel

Multinode support

Not taken into account during the codesprint. It sounds relevant to have one Solr collection per node and provide one searcher per node. The way bean are accessed could probably be improved in order to better use Spring bean scope.

API Changes

GetPublicMetadataAsRdf : Move from URL params to Solr query eg. /rdf.metadata.public.get?q=...
Log search
Removed: Analyze Solr log instead - all requests made using GET contains parameters.
Quid: Search Solr
Requests and Params tables removed.
Admin console / Dashboard : Removed - Use Solr facets instead and build new dashboard from that.
Search
No support of geom by id geometry:region:kantone:15
CSW
Language is defined by URL only to return DC response (no language detection).
GetRecords / Result_with_summary custom extension is removed
GetDomain / no support for range

Misc

More work required:

Homogenous date time for records in db/index/xml
Index / Add timezone. Value in index is in UTC. https://github.com/geonetwork/core-geonetwork/blob/develop/domain/src/main/java/org/fao/geonet/domain/ISODate.java
Client side: move away from Wro4j to Brunch ?

Cleaning

Merge cleaning PR in Solr branch ? https://github.com/geonetwork/core-geonetwork/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+cleaning to make Lucene removal easier.
Remove Lucene* (ie. deps, index, config)
Field constants - Move all Solr fields in one class in a module that all other modules can access https://github.com/geonetwork/core-geonetwork/blob/develop/core/src/main/java/org/fao/geonet/kernel/search/LuceneIndexField.java and https://github.com/geonetwork/core-geonetwork/blob/develop/core/src/main/java/org/fao/geonet/constants/Geonet.java#L624
Client / Drop all @json and drop this mode in favour of _content_type

If you have some comments, start a discussion, raise an issue or use one of our other communication channels to talk to us.

Welcome to the GeoNetwork project!

Project Steering Committee

How to contribute

GeoNetwork cheat sheet

GeoNetwork website (external)

GN3 public implementations

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Moving to a search server: Why & how ?

Codesprint main targets

Technical overview

Work

Preview of improvements

Spellchecks & suggestions

Find similar document

Boosting

Synonyms

Fine tuned queries

Highlighting

Facetting

Indexing related documents and data

Grouping/Collapsing

Performances

Misc.

Conclusion

Technical analysis & configuration

Suggestion

Solr configuration

More work required

Client search application

Preview of limitations

Solr migration work

Search

CSW

Indexing

Integration tests

Relation

Multinode support

API Changes

Misc

Cleaning

Clone this wiki locally