
Pagination performance does not meet requirements #352

Closed
tloubrieu-jpl opened this issue Jun 20, 2023 · 38 comments · Fixed by #397
Labels
B14.0, B14.1, bug (Something isn't working), open.v1.3.1, s.high (High severity)

Comments

@tloubrieu-jpl
Member

tloubrieu-jpl commented Jun 20, 2023

Checked for duplicates

No - I haven't checked

πŸ› Describe the bug

When I request the members of one collection, 500 products at a time, we pull on average 2 products per second.

See screenshot

πŸ•΅οΈ Expected behavior

We have a requirement for a single request to be executed in less than one second. In this example we want to pull 500 products per request (500 is just an example that gives an idea of the size of a request, typically 100 to a few hundred products).

So I expect the performance to be around 1/100 s (10 ms) per product or less.

πŸ“œ To Reproduce

Jupyter notebook to be provided.

πŸ–₯ Environment Info

No response

πŸ“š Version of Software Used

No response

🩺 Test Data / Additional context

No response

πŸ¦„ Related requirements

πŸ¦„ #16

βš™οΈ Engineering Details

No response

@tloubrieu-jpl tloubrieu-jpl added B14.0 bug Something isn't working s.high High severity labels Jun 20, 2023
@jordanpadams jordanpadams changed the title pagination is slow Pagination performance does not meet requirements Jun 20, 2023
@alexdunnjpl
Contributor

Probably resolvable via #353

@tloubrieu-jpl
Member Author

Yes, you are right @alexdunnjpl. I believe the performance is good for non-members requests, so I will re-evaluate the criticality of the bug to medium.

@tloubrieu-jpl tloubrieu-jpl added s.medium Medium level severity and removed s.high High severity labels Jun 21, 2023
@tloubrieu-jpl
Member Author

Actually I put it back to high. The query is not quick on other requests either, for example /products with pagination.

@tloubrieu-jpl tloubrieu-jpl added s.high High severity and removed s.medium Medium level severity labels Jun 21, 2023
@alexdunnjpl
Contributor

@tloubrieu-jpl do you have an example of a specific request or set of requests (and the relevant host) which are not performant? It's not immediately clear to me from the provided information that the api is the culprit as opposed to, say, the python client.

@alexdunnjpl
Contributor

@tloubrieu-jpl following up on this (see question in previous comment)

@tloubrieu-jpl
Member Author

Hi @alexdunnjpl, sorry I missed answering that earlier.

The request I was evaluating the performance of is in this notebook: https://github.com/NASA-PDS/search-api-notebook/blob/main/notebooks/ovirs/part1/explore-a-collection.ipynb, cell "Request specific properties of all the observational products of the collection".

To run that notebook, you need to clone the repository and follow the instructions in the README to deploy it locally.

@alexdunnjpl
Contributor

@tloubrieu-jpl what I'm suggesting is that, given that this is the first instance of this kind of performance problem we've seen (that I know of, at least), this probably isn't an API bug.

So unless there's a curl request to reproduce the behaviour outside of the notebook, I'd suggest closing/suspending this ticket and treating it as a bug in the notebook or one of its dependencies (which Al or I could still look into, for sure)

@al-niessner
Contributor

al-niessner commented Aug 3, 2023

@alexdunnjpl

No, this bug has come and gone a couple of times (feels like dozens). It has been a glob of things like cluster nodes slow to respond, overhead in cluster communications, and more. It was supposed to all be fixed with multi-tenancy or some word like that. The notebook request is a members request, which is hard to turn into a direct OpenSearch curl for measuring OpenSearch vs registry-api timing. In the past, it was shown that the lag was at OpenSearch and not in the registry-api's additional processing. I guess I need to make that first step in the binary search persistent.

Are you working it or do you want me to do it?

@alexdunnjpl
Contributor

@al-niessner ahh, gotcha. In that case, it seems like a probable fix via #353 to avoid the need to search anything?

Happy to work it or yield it, I think Jordan/Thomas were looking to divvy up tasks between us at breakout today, now that the cognito work is being pushed back as a priority.

@al-niessner
Contributor

@alexdunnjpl

Let's take it to the breakout. If the new ancestry field gets rid of the double lookup, then let's run in that direction. It will make finding the delays so much simpler, if nothing else.

@alexdunnjpl
Contributor

Not sure how I mistook this as being related to /members, but it isn't. Still investigating.

@alexdunnjpl
Contributor

alexdunnjpl commented Aug 10, 2023

@tloubrieu-jpl @jordanpadams the bug is in the pds python api client.

API logs reveal that the client is paging without regard to the limit value (see qparams limit and start)

2023-08-10T16:15:28.960-07:00 DEBUG 1328923 --- [nio-8080-exec-7] o.s.web.servlet.DispatcherServlet        : GET "/products?fields=orex:Spatial.orex:latitude&fields=orex:Spatial.orex:longitude&fields=ref_lid_instrument&fields=orex:Spatial.orex:target_range&fields=ops:Data_File_Info.ops:file_ref&limit=500&q=%20%28%20%28%20ref_lid_target%20eq%20%22urn%3Anasa%3Apds%3Acontext%3Atarget%3Aasteroid.101955_bennu%22%20%29%20and%20%28%20ref_lid_instrument%20eq%20%22urn%3Anasa%3Apds%3Acontext%3Ainstrument%3Aovirs.orex%22%20%29%20%29&start=28000", parameters={masked}
2023-08-10T16:15:28.960-07:00 DEBUG 1328923 --- [nio-8080-exec-7] s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped to gov.nasa.pds.api.registry.controller.SwaggerJavaTransmuter#productList(List, List, Integer, String, List, Integer)
2023-08-10T16:15:42.862-07:00 DEBUG 1328923 --- [nio-8080-exec-7] o.s.w.s.m.m.a.HttpEntityMethodProcessor  : Using 'application/json', given [application/json] and supported [application/json, */*]
2023-08-10T16:15:42.876-07:00 DEBUG 1328923 --- [nio-8080-exec-7] o.s.w.s.m.m.a.HttpEntityMethodProcessor  : Writing [class PdsProducts {<EOL>    summary: class Summary {<EOL>        q:  ( ( ref_lid_target eq "urn:nasa:pds:con (truncated)...]
2023-08-10T16:15:42.890-07:00 DEBUG 1328923 --- [nio-8080-exec-7] o.s.web.servlet.DispatcherServlet        : Completed 200 OK
2023-08-10T16:15:42.961-07:00 DEBUG 1328923 --- [nio-8080-exec-8] o.s.web.servlet.DispatcherServlet        : GET "/products?fields=orex:Spatial.orex:latitude&fields=orex:Spatial.orex:longitude&fields=ref_lid_instrument&fields=orex:Spatial.orex:target_range&fields=ops:Data_File_Info.ops:file_ref&limit=500&q=%20%28%20%28%20ref_lid_target%20eq%20%22urn%3Anasa%3Apds%3Acontext%3Atarget%3Aasteroid.101955_bennu%22%20%29%20and%20%28%20ref_lid_instrument%20eq%20%22urn%3Anasa%3Apds%3Acontext%3Ainstrument%3Aovirs.orex%22%20%29%20%29&start=28500", parameters={masked}
2023-08-10T16:15:42.962-07:00 DEBUG 1328923 --- [nio-8080-exec-8] s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped to gov.nasa.pds.api.registry.controller.SwaggerJavaTransmuter#productList(List, List, Integer, String, List, Integer)
2023-08-10T16:15:56.673-07:00 DEBUG 1328923 --- [nio-8080-exec-8] o.s.w.s.m.m.a.HttpEntityMethodProcessor  : Using 'application/json', given [application/json] and supported [application/json, */*]
2023-08-10T16:15:56.689-07:00 DEBUG 1328923 --- [nio-8080-exec-8] o.s.w.s.m.m.a.HttpEntityMethodProcessor  : Writing [class PdsProducts {<EOL>    summary: class Summary {<EOL>        q:  ( ( ref_lid_target eq "urn:nasa:pds:con (truncated)...]
2023-08-10T16:15:56.701-07:00 DEBUG 1328923 --- [nio-8080-exec-8] o.s.web.servlet.DispatcherServlet        : Completed 200 OK
2023-08-10T16:15:56.768-07:00 DEBUG 1328923 --- [nio-8080-exec-9] o.s.web.servlet.DispatcherServlet        : GET "/products?fields=orex:Spatial.orex:latitude&fields=orex:Spatial.orex:longitude&fields=ref_lid_instrument&fields=orex:Spatial.orex:target_range&fields=ops:Data_File_Info.ops:file_ref&limit=500&q=%20%28%20%28%20ref_lid_target%20eq%20%22urn%3Anasa%3Apds%3Acontext%3Atarget%3Aasteroid.101955_bennu%22%20%29%20and%20%28%20ref_lid_instrument%20eq%20%22urn%3Anasa%3Apds%3Acontext%3Ainstrument%3Aovirs.orex%22%20%29%20%29&start=29000", parameters={masked}
2023-08-10T16:15:56.769-07:00 DEBUG 1328923 --- [nio-8080-exec-9] s.w.s.m.m.a.RequestMappingHandlerMapping : Mapped to gov.nasa.pds.api.registry.controller.SwaggerJavaTransmuter#productList(List, List, Integer, String, List, Integer)
2023-08-10T16:16:11.571-07:00 DEBUG 1328923 --- [nio-8080-exec-9] o.s.w.s.m.m.a.HttpEntityMethodProcessor  : Using 'application/json', given [application/json] and supported [application/json, */*]
2023-08-10T16:16:11.586-07:00 DEBUG 1328923 --- [nio-8080-exec-9] o.s.w.s.m.m.a.HttpEntityMethodProcessor  : Writing [class PdsProducts {<EOL>    summary: class Summary {<EOL>        q:  ( ( ref_lid_target eq "urn:nasa:pds:con (truncated)...]
2023-08-10T16:16:11.599-07:00 DEBUG 1328923 --- [nio-8080-exec-9] o.s.web.servlet.DispatcherServlet        : Completed 200 OK

Who's the most familiar with the python client codebase? @tloubrieu-jpl or @jjacob7734 perhaps? If either of y'all can think of where in the client package code the page iteration occurs, that'd save me some time/effort, otherwise I'll just look for it myself. If you could point me to the repo that'd be helpful, at least - pds-api-client doesn't seem to be it.
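
For context, the start/limit contract behind those qparams corresponds roughly to the loop below. This is a hedged sketch for illustration only, not the actual client code: the base URL, the query, and the "data" key used to read products out of the JSON payload are assumptions.

```python
# Sketch of start/limit paging against the /products endpoint, mirroring the
# qparams visible in the logs above. Placeholder URL/query/response handling.
import requests

BASE_URL = "https://pds.nasa.gov/api/search/1"  # placeholder deployment URL
QUERY = '(ref_lid_target eq "urn:nasa:pds:context:target:asteroid.101955_bennu")'
PAGE_SIZE = 500  # the "limit" qparam

def iterate_products():
    start = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/products",
            params={"q": QUERY, "limit": PAGE_SIZE, "start": start},
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        products = resp.json().get("data", [])  # response key assumed for illustration
        if not products:
            break
        yield from products
        # Each request makes OpenSearch skip `start` documents server-side, so
        # later pages get progressively more expensive (the deep-paging problem).
        start += PAGE_SIZE
```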

@alexdunnjpl
Contributor

I've just tested at 5000doc/page, and found the following:

  • Query time of 1300ms irrespective of field subsetting
  • Download time of 7600ms for full documents
  • Download time of 600ms for heavily-subset documents

So for full docs, that's 27ms/doc at 50doc/page, 4.3ms/doc at 500doc/page, 1.8ms/doc at 5000doc/page
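
As a back-of-envelope check, the 5000doc/page figure follows directly from the two measured components above (the 50 and 500 doc/page figures come from analogous earlier measurements):

```python
# Rough derivation of the 1.8 ms/doc figure for full documents at 5000 doc/page:
# (query time + download time) / page size.
query_ms = 1300      # measured query time, independent of field subsetting
download_ms = 7600   # measured download time for full documents
page_size = 5000

print((query_ms + download_ms) / page_size)  # ~1.78 ms per document
```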

@tloubrieu-jpl
Member Author

Thanks @alexdunnjpl. Regarding the <1s requirement, we will need to refine which requests it applies to. The motivation behind the requirement is to have very responsive web pages when the API is used as a back-end for the web search. It might even mean that we need even faster responses for a subset of requests to provide fluid navigation.

But for notebook searches, longer responses can be acceptable.

@alexdunnjpl
Contributor

Deeper investigation yields the following, given 5000doc page size:

  • client.search() increases linearly with page "index" at about 3000ms per page
  • getHits() is a constant 300ms +/-50ms, as expected
  • API adds a constant 1500ms +/-500ms

Implementation of search_after requires the following:

  • implementation of _id sort
  • implementation of search_after qparam support (this may exist as-is, but need to confirm)
  • calling request.source().searchAfter(new Object[] {"_id"}) in RequestAndResponseContext

@tloubrieu-jpl @jordanpadams on the user side, they would need to supply a search_after qparam equal to the _id (i.e. lidvid) of the last element returned in the previous query. i.e. the python client would need/want to abstract this away from the user, and any raw requests not using this functionality would fail to benefit from the update.

The way search_after works is that the db applies a sort to the full dataset, and then uses the requested search_after value to quickly access the desired page. Presumably the query is also cached for some period of time, too.

  • this conflicts with the currently-not-implemented(?) requirement to allow the user to provide sorting instructions via the API. Sorting by _id in all cases is simplest, but will need to be dealt with in a more complex manner if user-specified sorting is ever implemented.
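
For illustration, here is what the mechanism looks like as raw OpenSearch _search calls; this is a sketch of the technique with placeholder host/index/query, not the registry-api change itself (which would go through request.source().searchAfter(...) as noted above):

```python
# Sketch of search_after paging against OpenSearch's _search REST endpoint.
# The point is the sort + search_after round trip, which avoids the deep-skip
# cost of start/limit paging. Host, index and query are placeholders.
import requests

OPENSEARCH = "https://localhost:9200"  # placeholder
INDEX = "registry"                     # placeholder

def scan(query, page_size=500):
    search_after = None
    while True:
        body = {
            "size": page_size,
            "query": query,
            # A total order is mandatory; _id is what's proposed above (a document
            # field such as lidvid would serve the same purpose).
            "sort": [{"_id": "asc"}],
        }
        if search_after is not None:
            body["search_after"] = search_after  # sort values of the last hit seen
        hits = requests.post(f"{OPENSEARCH}/{INDEX}/_search", json=body).json()["hits"]["hits"]
        if not hits:
            return
        yield from hits
        search_after = hits[-1]["sort"]  # cursor for the next page

# e.g.: for hit in scan({"match_all": {}}): ...
```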

Thoughts on all that @jordanpadams @tloubrieu-jpl ?

@tloubrieu-jpl
Member Author

Hi @alexdunnjpl ,

I started a spreadsheet to evaluate how best to implement pagination in the API, considering what is intuitive for our users and what is easiest and most performant to implement. See https://docs.google.com/spreadsheets/d/1NWE6xWT_OOGb6VIPclbx4r7JtyTy5cWJazqkVi1TjpM/edit?usp=sharing

@alexdunnjpl
Contributor

alexdunnjpl commented Aug 28, 2023

Regarding PIT:

https://opensearch.org/docs/latest/search-plugins/point-in-time/

PIT does not require sorting because (I assume) the opensearch-internal row ID functions as a de-facto sort in that use-case.


Quick resource note
https://stackoverflow.com/questions/68127892/how-does-search-after-work-in-elastic-search

@tloubrieu-jpl
Member Author

tloubrieu-jpl commented Aug 28, 2023

After discussing with @alexdunnjpl, we decided to implement the pagination with the URL parameters:

  • search_after
  • page_size

To be checked: whether we use _ or - in the names.

It is similar to what CMR proposes (see https://cmr.earthdata.nasa.gov/search/site/docs/search/api.html#search-after).

This is also the option preferred by OpenSearch (see https://opensearch.org/docs/latest/search-plugins/searching-data/paginate/#scroll-search), especially combined with Point In Time (to come later).

For the implementation we will always sort the results by lidvid (see https://stackoverflow.com/a/68535811/22141093), as the lowest-priority sorting property (for when we implement user-specified sorting).

@alexdunnjpl
Contributor

Alex next steps:

  • Benchmark effect of sorting by lidvid vs unsorted
  • If necessary, benchmark search-after paging in python or Postman

@jordanpadams
Member

Status: @alexdunnjpl in work, probably not going to make the sprint

@alexdunnjpl
Contributor

Status: basic functionality almost complete.
Outstanding:

  • incorporation into Pagination/PaginationLidvidBuilder

Outstanding design questions:

  • Should sort qparam be removed? (Yes, as-is, as sort is a mandatory side-effect of searchAfter)
  • Should the exposed interface be ?search-after=key1=val1,key2=val2 or ?sort=key1,key2&search-after=val1,val2? The former may be clearer to users, but the latter is more representative of what's being done, allows specification of multiple sort keys without search-after, and may allow a cleaner implementation of asc/desc specification (both forms are illustrated in the sketch after this list).
  • First-page behaviour is kinda terrible. It's necessary to specify sort so that the correct first page is populated. Currently, this looks like ?search-after=somekey=. Need to check whether quotes are used in Spring to wrap null strings... but either way the whole thing is relatively fiddly and users will need to read a documentation snippet to have any hope of accessing multi-page data.
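
To make the two candidate forms (and the first-page case) concrete, a quick illustration using a made-up lidvid as the cursor value:

```python
# Illustration of the two candidate query-string interfaces discussed above.
# The lidvid below is a placeholder, not a real product.
from urllib.parse import urlencode

cursor = "urn:nasa:pds:example:collection:product::1.0"  # placeholder lidvid

# Form 1: search-after carries the sort keys and cursor values together
form_1 = "?" + urlencode({"search-after": f"lidvid={cursor}"})

# Form 2: sort keys and cursor values split across two qparams
form_2 = "?" + urlencode({"sort": "lidvid", "search-after": cursor})

# First page under form 2: sort given, no search-after yet
first_page = "?" + urlencode({"sort": "lidvid"})

print(form_1)
print(form_2)
print(first_page)
```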

@tloubrieu-jpl
Member Author

tloubrieu-jpl commented Nov 7, 2023

We prefer the option: ?sort=key1,key2&search-after=val1,val2

In the documentation the advised sorting parameter should be harvest_time.

More research is needed to see:

  • @tloubrieu-jpl @alexdunnjpl how to handle default sort by looking at other implementations of search-after (e.g. CMR...)
  • @alexdunnjpl will look at the relevancy (score) sort in opensearch to confirm it works with search-after
  • we need to consider adding a next-key attribute in the summary of the response that the user can use to get their next page (see alternate implementations)

@alexdunnjpl
Contributor

alexdunnjpl commented Nov 9, 2023

Initial benchmarks against prod using local API with page size 5000 suggest that custom sort does not slow down query time in any perceptible manner (strangely, but we can look at that later).

Without sort: 32 sec (84 MB transferred). With harvest-time sort: 27 sec (64 MB transferred). With harvest-time sort and minimal properties returned (fields=ops%3AHarvest_Info%2Fops%3Aharvest_date_time): 13 MB transferred. N.B. this conflicts with previous benchmarks, which were much faster.

Without culling properties, that's 5-7ms/product, and ~9hrs to page 5M products.

Pages of 100 run ~750ms (7.5ms/product), which actually isn't much less efficient than large pages; that's a good thing.

@alexdunnjpl
Contributor

alexdunnjpl commented Nov 9, 2023

To-do:

  • integrate Pagination/PaginationLidvidBuilder
  • re-run the test from the Jupyter notebook to confirm the expected speed-up and that response time no longer increases linearly with page index

@alexdunnjpl
Contributor

alexdunnjpl commented Nov 9, 2023

Performance has markedly increased as a result of OpenSearch resource improvements.

New start/limit benchmark using specific query, full-span (~275k records), page_size=50. 50-record rolling average applied

(plot: page response time as a function of page index, with a 1000 ms low-pass filter applied)

(plot: the same, with a 500 ms low-pass filter)

@alexdunnjpl
Contributor

alexdunnjpl commented Nov 16, 2023

New benchmark completed, 50doc pages

start-limit: average 140 ms/page, increasing linearly to an average of 220 ms/page.
search-after: average 116 ms/page, constant, so roughly 190% of start-limit throughput (presumably comparing against the late-run start-limit rate: 220 ms / 116 ms ≈ 1.9), with the given query at least.

@alexdunnjpl
Contributor

alexdunnjpl commented Nov 16, 2023

While attempting to confirm correctness, additional issues have arisen.

Attempting to page over all 274749 hits with start-limit only appears to return 197636 unique doc ids.

Search-after behaves as expected. @jordanpadams @tloubrieu-jpl do you want me to investigate what's going on with the start-limit pagination, or is it completely moot given that we're moving to search-after pagination?

@tloubrieu-jpl
Member Author

We also expect more improvement on the initial use case from the use of the new ancestry fields in the API, see #353.

@alexdunnjpl
Contributor

Tightly coupled with #353, transitioning to that ticket for completion.

@tloubrieu-jpl
Member Author

@gxtchen, you can use the Jupyter notebook in this repository: https://github.com/NASA-PDS/search-api-notebook
Use the ovirs/part1/explore_a_collection notebook, cell "Request specific properties of all the observational products of the collection".
