Analysis of bottlenecks and performance tweaks #287
One thing that we might want to look into again is #229, which caused a substantial increase in the average query response time since the deployment of VLO 4.6.0 (28 February 2019).
There might be a lot of potential in improving Solr caching settings. In production, we could quite easily allocate 5-15 GB of RAM for caching if deemed useful. Solr config settings that we could look at:

Documentation: Solr 8.3: Query Settings in SolrConfig
A post with some useful hints: https://teaspoon-consulting.com/articles/solr-cache-tuning.html
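For concreteness, here is a hedged sketch of what tuned cache settings in `solrconfig.xml` could look like. The sizes below are placeholders to experiment with, not measured recommendations; `CaffeineCache` is the default cache implementation in recent Solr 8.x releases:

```xml
<!-- Sketch only: candidate cache settings for the <query> section of solrconfig.xml.
     Sizes and autowarm counts are illustrative starting points. -->
<query>
  <filterCache class="solr.CaffeineCache" size="1024" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.CaffeineCache" size="4096" initialSize="1024" autowarmCount="256"/>
  <!-- The document cache cannot be autowarmed (it is keyed on internal Lucene doc IDs) -->
  <documentCache class="solr.CaffeineCache" size="4096" initialSize="1024" autowarmCount="0"/>
  <queryResultWindowSize>30</queryResultWindowSize>
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
</query>
```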
In case this is informative, here is a snapshot of the various cache metrics taken from the Solr dashboard at 2020-03-17 11:33 CET. For interpretation, see the Performance Statistics Reference and this blog post. Hit ratio seems to be a good performance indicator.
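For reference, the hit ratio is simply cumulative hits divided by cumulative lookups. A minimal helper for reading it off a metrics snapshot (the field names follow the style of Solr's cache statistics; the sample numbers are made up):

```python
def hit_ratio(lookups: int, hits: int) -> float:
    """Cache hit ratio: fraction of lookups served from the cache."""
    return hits / lookups if lookups else 0.0

# Made-up sample values in the style of Solr's per-cache statistics
snapshot = {
    "queryResultCache": {"cumulative_lookups": 10_000, "cumulative_hits": 7_400},
    "documentCache": {"cumulative_lookups": 50_000, "cumulative_hits": 2_000},
}

for name, stats in snapshot.items():
    ratio = hit_ratio(stats["cumulative_lookups"], stats["cumulative_hits"])
    print(f"{name}: hit ratio {ratio:.2f}")
```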
I tried to replicate user behaviour based on a week of Solr requests on the production machine, using an Apache JMeter test plan with 2 threads and varying Solr configurations (cache implementation, cache sizes, eviction strategy). Focusing on the hit ratios and the number of cache inserts/evictions, it was no problem to replicate the good results of the query result cache and the filter cache. However, I couldn't replicate the very low hit ratio of the production document cache (where 96% of lookups are misses), even though our Solr instance just uses an LRU-based eviction approach. The frequency distribution of queried documents shows the expected long tail of documents that are fetched only rarely (35% of all documents at most 2 times during this week), but not an extreme power-law distribution. Assuming the document cache holds the most popular documents, its current size would only account for 35% of all document queries; a cache size of 4096 would increase this value to 56%. Given the rather heterogeneous document queries on the VLO, there is probably not a lot to do beyond this (and these queries seem to be fast anyway). The moderate hit ratio of the query result cache is a bit unclear. As the vast majority of VLO page views are of the main page (with or without a user selection), the four (by far) most frequent Solr queries are:
It might be possible to avoid the last query (which is redundant considering the third); however, their high frequencies should make them always be part of the 512 stored query results in our standard configuration. An increase of this cache's size (also to 4096?) might still have positive results, and it reduced the number of cache evictions in the tests. A closer look at the query times shows that around 90% of all distinct Solr queries (for the most relevant /select request handler) return after 200 ms or less on average. As might be expected, there is a clear separation between fast and slow queries based on the use of faceting (average qtime with/without faceting: 1545 ms vs. 51 ms). In fact, 86% of the 1000 slowest distinct queries involve a restriction on the languageCode facet. As there are hardly any faceted queries with a qtime of less than 1000 ms (~3%), it seems that most of these queries are cache misses. However, when evaluating Solr's cache statistics for every page view, it becomes clear that all 4 queries are counted as hits on the query result cache (except for the first page view on a cold cache). Therefore, it seems that the sub-optimal qtimes are already cache-based values. tbc.
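As an aside, the coverage estimates of the kind quoted above (a cache of size N covering 35% or 56% of all document fetches) can be computed from the request logs with a small script like this sketch. The `requests` list here is synthetic stand-in data, not the actual production log:

```python
from collections import Counter

def coverage_of_top_n(doc_requests, n):
    """Fraction of all document fetches that would hit a cache holding
    the n most frequently requested documents (idealised view: the cache
    always contains exactly the top-n documents)."""
    counts = Counter(doc_requests)
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(n))
    return top / total if total else 0.0

# Synthetic long-tail workload: doc0 is popular, the rest form the tail
requests = ["doc0"] * 50 + [f"doc{i}" for i in range(1, 51)]
print(coverage_of_top_n(requests, 1))  # the single hottest doc covers half: 0.5
```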
Thanks @teckart for this thorough report. Some thoughts/comments that came up while reading:
Could this low hit ratio be an artefact of the import process? Import will make individual queries for all documents. Or have you included this in your simulations as well?
By this, do you mean with/without production of the facet result, or with/without a facet selection? The former makes sense, but the latter would surprise me. Either way, I have to say that this difference is very large!
Strange, does this mean that facets are not cached? In any case, if facets are such a big factor when it comes to performance, I would like to see the effect of setting the
You are right about that: after several import rounds I ended up with similar hit ratios for the document cache.
All queries that I talked about are counted as hits on the queryResultCache, even those with absurdly high query times. After a myriad of test runs and configuration changes, the following minimal example using the requestHandler All: Collapsed: This fits with my previous comment insofar as the main page without a selection only uses a combination of
Hmm, so the bottleneck really is in the interaction between faceting and collapsing. Have we looked at the notes on performance in the following pages and/or @teckart do you know if they can somehow be applied to our solution?
To be honest, I'm not sure if we are already (implicitly) using the
The only remark about performance for the collapsing query parser concerns the
I also had some doubts at first, and there is not a lot of documentation about it. As far as I understand, it's a plugin that is used automatically when you use queries with the
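To illustrate how the collapsing query parser is typically invoked together with faceting (a generic sketch of a /select request; the collapse field `_signature` is a hypothetical example, not necessarily the field the VLO collapses on):

```
# Sketch: /select request combining collapsing and faceting.
# The collapse field name is hypothetical; languageCode is one of the VLO facets.
q=*:*
fq={!collapse field=_signature}
facet=true
facet.field=languageCode
rows=10
```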
For completeness: this post refers to an unofficial grouping implementation from before Solr got the feature in version 3.x (
Identify the (major) bottlenecks in the processes involved in serving the VLO to the end user. This includes both the web app (Wicket application running in Tomcat) and the Solr back end. Certain actions seem to be slower than they need to be, in particular in relation to filtering based on facet values.
Potentially, optimisations could be made with respect to:
...?