Elasticsearch migration July 2019 codesprint
Address:
- Monday: Olivier, Florent in camptocamp / Jose, Francois at the farm
- Tuesday/Wednesday: at the farm, 321, Route de la Mollière, Saint-Pierre-de-Genebroz, Chambéry, Savoie, Auvergne-Rhône-Alpes, France
- Thursday: All in camptocamp
- Friday: Olivier, Florent in camptocamp / Jose, Francois at the farm
- Florent
- Olivier
- Jose
- Francois
- Pierre ?
- Michel ?
- EEA
PR: https://github.com/geonetwork/core-geonetwork/pull/2830
- UI & search (Florent & Olivier)
- Facet / Improve CSS
- Facet / Support tree
- Facet / Support negative switch
- Facet / More values https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-size
- Facet / Config UI
- Search / Return in _source only the field required by the UI (performance)
- Search / Map / Restore bbox
- Related records
- Score https://github.com/geonetwork/core-geonetwork/commit/ff184e9402dfb181266074e89fe9f18dc6229ac9#diff-e5f71169531dd443a7dead183fbc8e52R42
- Config facet type https://github.com/geonetwork/core-geonetwork/blob/es/web-ui/src/main/resources/catalog/components/elasticsearch/EsService.js#L163
- Date histogram
- Indexing & search
- Synonyms
- Core (Jose & Francois)
- CSW (Jose)
- Paging,
- Select all,
- MEF2 export,
- PDF export,
- CSV export,
- Update field in index (instead of reindex all)
- Index related
- ... onlyMyRecord, getTitleFromIndex eg. buildMetadataStatusResponses, retrieveMetadataIndexField
- GN Harvester
- Tests https://www.elastic.co/guide/en/elasticsearch/reference/current/integration-tests.html
- Install
- Docker setup (Pierre?)
During this sprint, we explored the benefits for GeoNetwork of using Elasticsearch as its search engine. Moving to Elasticsearch will help in two ways:
- Better and more efficient/flexible searches
- Scalability.
We focused on 3 axes:
- User interface, with a focus on EEA requirements, eg. GEMET tree hierarchy classification, negative queries such as "not obsolete", better suggestions
- CSW, which is one of the main search services we need in order to make GeoNetwork 4 usable in a production context
- Wiring the whole application back together, ie. making the whole client application work again
The next section describes some of the significant search experience improvements that Elasticsearch can bring to the end user.
Facets can now be configured in the client application from the admin page (using JSON based on the Elasticsearch API). Users can configure the (ordered) list of fields, sorting, size, ...
This requires knowledge of the fields to be used, but makes the list of facets easier to configure and customize.
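To illustrate the kind of JSON involved, here is a minimal sketch that builds an ordered facet definition in the shape of an Elasticsearch `aggs` section. The helper `terms_facet` and the field names `resourceType` and `format` are assumptions made for the example; the actual GeoNetwork configuration may differ.

```python
import json


def terms_facet(field, size=10, order_by="_count", direction="desc"):
    """Return a terms aggregation for one facet (hypothetical helper)."""
    return {"terms": {"field": field, "size": size, "order": {order_by: direction}}}


# Python dicts keep insertion order, so this is also the display order.
facet_config = {
    "resourceType": terms_facet("resourceType", size=10),
    # Sort this facet alphabetically by value instead of by count.
    "format": terms_facet("format", size=5, order_by="_key", direction="asc"),
}

print(json.dumps(facet_config, indent=2))
```

Changing the order of the entries, the `size` or the `order` is then enough to change what the facet panel shows.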
GeoNetwork 3 already provides basic support for facet hierarchies when using keywords from a hierarchical thesaurus like GEMET. This functionality was reimplemented with Elasticsearch and is now much more flexible.
Facet hierarchies are now supported using 2 approaches:
- the sub-aggregation concept in Elasticsearch, which allows nested aggregations, eg. below on resource type > format
- a path hierarchy using a separator, eg. below on the GEMET thesaurus
The path hierarchy mode can also be applied at indexing time to non-thesaurus elements in order to build a classification path, eg. {resourceType}/{format|serviceType}
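The sub-aggregation approach can be sketched as a terms aggregation that carries a second terms aggregation in its `aggs` key. The field names `resourceType` and `format` are assumed here to match the "resource type > format" example; the function name is hypothetical.

```python
def nested_terms_agg(parent_field, child_field, size=10):
    """Build a terms aggregation on parent_field with a terms
    sub-aggregation on child_field (hypothetical helper)."""
    return {
        parent_field: {
            "terms": {"field": parent_field, "size": size},
            "aggs": {
                child_field: {"terms": {"field": child_field, "size": size}}
            },
        }
    }


# size 0 on the search itself: only the aggregation buckets are wanted.
body = {"size": 0, "aggs": nested_terms_agg("resourceType", "format")}
print(body)
```

Each `resourceType` bucket in the response then contains its own list of `format` buckets, which maps directly onto a two-level facet tree.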
Multiple facet choices can now be selected to make an OR query:
Selected choices are highlighted in green.
Users can also negate a choice, eg. not a service.
Negated choices are shown in red. Note that the permalink is not yet able to restore a negative query - this is something to improve.
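One way to express selected and negated facet choices is a `bool` query: a `terms` clause matches any of the selected values of a field (the OR), and `must_not` excludes the negated ones. This is a sketch under the assumption that facets map to exact-value fields; `facet_filter` and the field name `resourceType` are hypothetical.

```python
def facet_filter(selected, negated):
    """Build a bool query from facet choices.

    selected / negated: dict mapping a field name to a list of values.
    """
    bool_query = {}
    include = [{"terms": {f: v}} for f, v in selected.items() if v]
    exclude = [{"terms": {f: v}} for f, v in negated.items() if v]
    if include:
        bool_query["filter"] = include      # each clause: any of the values
    if exclude:
        bool_query["must_not"] = exclude    # none of the negated values
    return {"bool": bool_query}


query = facet_filter(
    selected={"resourceType": ["dataset", "series"]},
    negated={"resourceType": ["service"]},
)
print(query)
```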
The Elasticsearch API also provides paging in facet values. With this, users can navigate through all values.
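The terms aggregation exposes a `size` parameter but no offset, so one simple way to page through facet values is to request enough buckets to cover the current page and slice on the client. This is only a sketch of that idea, not necessarily how the user interface does it; both helper names are hypothetical.

```python
def terms_agg_for_page(field, page, page_size=10):
    """Request enough buckets to cover pages 0..page (0-based)."""
    return {field: {"terms": {"field": field, "size": (page + 1) * page_size}}}


def slice_page(buckets, page, page_size=10):
    """Keep only the buckets belonging to the requested page."""
    start = page * page_size
    return buckets[start:start + page_size]


agg = terms_agg_for_page("format", page=2, page_size=10)

# Fake response buckets, just to show the slicing.
fake_buckets = [{"key": f"value-{i}", "doc_count": 100 - i} for i in range(30)]
third_page = slice_page(fake_buckets, page=2, page_size=10)
print(agg, third_page[0]["key"])
```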
Different types of facets are provided by the Elasticsearch API. We are mainly using term facets in GeoNetwork, but other types (those also used in Kibana) can be useful for better analysis. For example, histogram aggregations make it possible to build small charts for selecting a date range.
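A date range chart of this kind could be fed by a `date_histogram` aggregation, with a `range` query applied once the user picks a period. The sketch below assumes a date field named `createDate` (hypothetical) and Elasticsearch 7.2 or later, where the parameter is called `calendar_interval`; older versions use `interval`.

```python
def date_histogram_agg(field, interval="year"):
    """One bucket per calendar interval, skipping empty buckets."""
    return {
        field: {
            "date_histogram": {
                "field": field,
                "calendar_interval": interval,  # "interval" before ES 7.2
                "min_doc_count": 1,
            }
        }
    }


def date_range_filter(field, start, end):
    """Range query for the period selected on the chart (end excluded)."""
    return {"range": {field: {"gte": start, "lt": end}}}


agg = date_histogram_agg("createDate")
selection = date_range_filter("createDate", "2015-01-01", "2019-01-01")
print(agg, selection)
```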
GeoNetwork 3 provides basic autocompletion with some bugs / limitations (eg. does not work on phrase suggestions, can show tokens of private records). Various Elasticsearch queries can be used to configure autocompletion where you define:
- Which fields to search on / and how
- Which fields to suggest
After various tests, we opted for a multi_match query on anytext returning the title:
- Supports partial word match
- Supports phrase queries
Autocompletion query can also be configured from the admin console:
By default, a multi_match on anytext and its associated ngram fields is configured in order to propose record titles based on partial word matches.
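A request body along those lines might look like the following sketch. Only `anytext` comes from the text above; the ngram sub-field name `anytext._ngram` and the title field `resourceTitle` are assumptions for illustration.

```python
def autocomplete_body(text, size=10):
    """Search anytext (plus an ngram variant) and return only titles."""
    return {
        "size": size,
        "_source": ["resourceTitle"],  # hypothetical title field
        "query": {
            "multi_match": {
                "query": text,
                # "anytext._ngram" is a hypothetical sub-field analysed
                # with ngrams so that partial words can match.
                "fields": ["anytext", "anytext._ngram"],
            }
        },
    }


body = autocomplete_body("hydro")
print(body)
```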
By default, Lucene and Elasticsearch provide a scoring mechanism based on term frequency. The scoring can be customized with custom functions. Some functions were tested and are hardcoded for now (see https://github.com/geonetwork/core-geonetwork/commit/ff184e9402dfb181266074e89fe9f18dc6229ac9#diff-e5f71169531dd443a7dead183fbc8e52R42), eg.
- Promote grid instead of vector (a somewhat artificial example)
- Score down old records (eg. older than 200 days)
- Promote records with good rating!
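The three rules above can be sketched with a `function_score` query, where each rule is a filter with a weight. This is not the hardcoded configuration from the linked commit: the field names (`spatialRepresentationType`, `changeDate`, `rating`), the weights and the rating threshold are all assumptions.

```python
def scored_query(base_query):
    """Wrap a query with three weighting rules (illustrative values)."""
    return {
        "function_score": {
            "query": base_query,
            "functions": [
                # Promote grid records.
                {"filter": {"term": {"spatialRepresentationType": "grid"}},
                 "weight": 2},
                # Score down records not changed in the last 200 days.
                {"filter": {"range": {"changeDate": {"lt": "now-200d"}}},
                 "weight": 0.5},
                # Promote records with a good rating.
                {"filter": {"range": {"rating": {"gte": 4}}},
                 "weight": 1.5},
            ],
            "score_mode": "multiply",  # combine the matching weights
            "boost_mode": "multiply",  # then multiply the query score
        }
    }


query = scored_query({"match": {"anytext": "water"}})
print(query)
```

Because each function only applies to the documents matching its filter, records matching none of the rules keep their original score.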
Based on a similarity algorithm, Elasticsearch can provide records similar to the one the user is currently viewing. The similarity can be computed on specific fields. Similar records are presented in the record view and make it easy to navigate between records on the same topics.
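Elasticsearch offers the `more_like_this` query for this kind of lookup. The sketch below assumes that this is the query used, and the index name, record identifier and field names are hypothetical.

```python
def similar_records_body(index, record_id, fields, size=5):
    """Find records similar to an already indexed record."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": list(fields),           # fields to compare on
                "like": [{"_index": index, "_id": record_id}],
                "min_term_freq": 1,               # keep rare terms of short texts
                "max_query_terms": 12,
            }
        },
    }


body = similar_records_body(
    index="records",                              # hypothetical index name
    record_id="abc-123",                          # hypothetical record id
    fields=["resourceTitle", "resourceAbstract"], # hypothetical fields
)
print(body)
```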
We now have a working search interface with a full text search field and much more flexibility in the way we manage facets. The advanced form is completely removed for now. The search service is also decorated with the privileges security filter and the portal filter.
Some bugs still need to be fixed, but a good minimum of functionality is already covered and usable.
We also managed to improve the performance of the search itself:
- Only the fields required by the client application are returned, and this can be configured on a per-search basis. Eg. only the title is returned in the response if only a list of record titles is needed. GeoNetwork 3 always returned the same search response.
- The Elasticsearch service is faster than the Lucene one we have
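Restricting the returned fields per search relies on the `_source` option of the search request. A minimal sketch, with a hypothetical helper and hypothetical field names:

```python
def with_source(body, fields=None):
    """Copy a search body, limiting _source to the given fields.

    With fields=None the body is returned unchanged, so Elasticsearch
    sends back the full document source.
    """
    result = dict(body)
    if fields is not None:
        result["_source"] = list(fields)
    return result


base = {"query": {"match_all": {}}, "size": 20}
title_list = with_source(base, ["resourceTitle"])   # list of titles only
full_view = with_source(base)                       # everything

print(title_list)
```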
We can still do more on the service looking for related records, which is slowing down the user interface in various places.
The CSW service is now operational and provides better support for OR/AND and nested combinations of filters than what we used to support in GeoNetwork 3. Spatial operators are also working.
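Nested AND/OR filter combinations translate naturally to nested `bool` queries. The sketch below uses a made-up tuple representation of a filter tree rather than real OGC Filter XML, and is not the GeoNetwork CSW implementation; field names are hypothetical.

```python
def filter_to_es(node):
    """Translate a small filter tree into an Elasticsearch query.

    node is ("and", child, ...), ("or", child, ...), ("not", child)
    or ("eq", field, value).
    """
    op = node[0]
    if op == "and":
        return {"bool": {"must": [filter_to_es(c) for c in node[1:]]}}
    if op == "or":
        return {"bool": {"should": [filter_to_es(c) for c in node[1:]],
                         "minimum_should_match": 1}}
    if op == "not":
        return {"bool": {"must_not": [filter_to_es(node[1])]}}
    if op == "eq":
        _, field, value = node
        return {"term": {field: value}}
    raise ValueError(f"unsupported operator: {op}")


# type = dataset AND (format = GeoTIFF OR format = NetCDF)
tree = ("and",
        ("eq", "resourceType", "dataset"),
        ("or", ("eq", "format", "GeoTIFF"), ("eq", "format", "NetCDF")))
es_query = filter_to_es(tree)
print(es_query)
```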
The client application functionalities have been restored and the application is now usable for a good part of the 3 modules: search/edit/admin.
Among others, we restored:
- Selection mechanism
- ZIP export
- PDF export
- CSV export
- Editing is operational, including linking records together.
We investigated the possibility of updating only one field in the index in order to improve performance. Currently GeoNetwork 3 indexes the full document when the rating changes, for example. With more recent versions of Lucene, only the rating field of a document in the index can be updated. This needs more work and could be applied to different cases, eg. rating, privileges, popularity, status, category changes.
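On the Elasticsearch side, one candidate for this is the update API with a partial `doc`, which lets the application send only the changed fields instead of rebuilding the whole index document. The sketch below only builds the request. The index name `records`, the record id and the field name are hypothetical, and the path shown is the Elasticsearch 7 form.

```python
import json


def partial_update_request(index, record_id, changes):
    """Describe an update API call that merges `changes` into a document."""
    return {
        "method": "POST",
        "path": f"/{index}/_update/{record_id}",  # ES 7 endpoint layout
        "body": json.dumps({"doc": changes}),
    }


request = partial_update_request("records", "abc-123", {"rating": 4})
print(request["method"], request["path"], request["body"])
```

The same call shape would apply to the other cases listed (privileges, popularity, status, category), each sending only its own fields.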
This codesprint was the opportunity to make significant improvements to the search application and the CSW service.
The next phase is probably to:
- Make all the core functionalities of the application work
- Editing
- Subtemplates
- Multilingual support
- Harvesting
- Testing
- Restore the unit test build
- Integration test with a running Elasticsearch instance
- Packaging
- Docker setup
- Installer build
- Documentation
A demo server is available at:
- https://apps.titellus.net/geonetwork/srv/eng/catalog.search
- Login admin/admin
Also, not really related to this task, we have been discussing improving the performance of the client applications, and a couple of ideas would need some support/funding/attention:
- Bootstrap the map application only once requested (and not on startup)
- Decrease the number of watchers. This could make the Angular application faster.
- Index a field with a separator, eg. "Hydrologie/Salinité" (could be resourceType/(serviceType|spatialRepType)/(format))
- Define field in index
    "settings": {
      "analysis": {
        "analyzer": {
          "pathAnalyzer": {
            "tokenizer": "pathTokenizer"
          }
        },
        "tokenizer": {
          "pathTokenizer": {
            "type": "path_hierarchy",
            "delimiter": "/",
            "replacement": "/",
            "skip": 0,
            "reverse": false
          }
        }
      }
    },
    ...
    "mappings": {
      "dynamic_templates": [
        {
          "stringPathType": {
            "match": "ft_*_s_tree",
            "mapping": {
              "type": "text",
              "fielddata": true,
              "analyzer": "pathAnalyzer",
              "search_analyzer": "keyword"
            }
          }
        },
- https://github.com/geonetwork/core-geonetwork/blob/master/core/src/main/java/org/fao/geonet/kernel/search/classifier/AbstractTerm.java#L73-L79
- https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html
- No native aggregation on this - check what has been done in Sextant, https://stackoverflow.com/questions/52940790/elasticsearch-aggregation-with-hierarchical-category-subcategory-limit-the-lev
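Since there is no native hierarchical aggregation, one possible workaround is to run a plain terms aggregation on the path field and rebuild the tree on the client from the bucket keys. The sketch below first emulates what the `path_hierarchy` tokenizer emits with the settings above (delimiter `/`, no skip, not reversed), then builds a tree from fake buckets. It is an illustration only, not what Sextant does; both function names and the doc counts are made up.

```python
def path_hierarchy_tokens(value, delimiter="/"):
    """Emulate path_hierarchy: one token per ancestor path."""
    parts = value.split(delimiter)
    tokens = [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [t for t in tokens if t]  # drop the empty token of a leading "/"


def buckets_to_tree(buckets, delimiter="/"):
    """Turn flat terms buckets keyed by path into a nested dict."""
    root = {}
    for bucket in buckets:
        children = root
        parts = bucket["key"].split(delimiter)
        for part in parts[:-1]:
            node = children.setdefault(part, {"count": 0, "children": {}})
            children = node["children"]
        leaf = children.setdefault(parts[-1], {"count": 0, "children": {}})
        leaf["count"] = bucket["doc_count"]
    return root


tokens = path_hierarchy_tokens("Hydrologie/Salinité")
tree = buckets_to_tree([
    {"key": "Hydrologie", "doc_count": 5},
    {"key": "Hydrologie/Salinité", "doc_count": 2},
])
print(tokens, tree)
```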
If you have some comments, start a discussion, raise an issue or use one of our other communication channels to talk to us.