Skip to content

Commit

Permalink
Auto merge of #3648 - pietroalbini:seek-crates, r=jtgeibel
Browse files Browse the repository at this point in the history
Implement seek-based pagination for filterless sortless crates endpoint

This PR implements seek-based pagination for parts of the `/api/v1/crates` endpoint, and adds the infrastructure to parse `seek=` parameters in paginated APIs. The reason for this PR is that we have multiple crawlers scraping the whole `/api/v1/crates` endpoint every day, and offset-based pagination is not feasible after a high number of pages.

To keep the implementation simple seek-based pagination is only supported for requests matching these constraints:

* **No filters:** when the request is filtered the total results count is not the same as the total number of crates, so the current implementation to gather the number of results (`SELECT COUNT(*) FROM crates;`) does not work. This can be lifted in the future by refactoring how we gather the total number of results.
* **No sorting other than alphanumeric:** this is just not implemented yet, we'd need to change the condition used to seek to the correct position. Note that for anything that's not alphanumeric we'll need to sort by `($param, id)`, not just `$param`, to ensure the results are consistently ordered (there could be multiple crates with the same download count).
* **No search queries:** search queries complicate the SQL query quite a bit, introducing relevance sorting and hardcoding the exact match as the first result. Not worth implementing for the time being IMO.

When the endpoint is called with a request that supports seeking, no `meta.prev_page` will be provided and `meta.next_page` will include the `seek` parameter instead of the `page` parameter. `meta.total` will continue to be provided as usual.

The value of `seek` is a base64-encoded JSON value. The reason base64 is used is to signal to our API consumers the seek value is an implementation detail and they shouldn't manually create one. This is what other big providers do (for example GitHub). The reason to encode the value in JSON is to support more complex seek keys in the future: right now we're just encoding a bare number, and the JSON representation is the same as the ASCII representation.

This PR is fully backward compatible: if clients just rely on `meta.next_page` they will transparently start using seek-based pagination, but offset-based pagination still works. When the `page=` attribute is provided offset-based pagination will be forced, and `meta.next_page` will include `page` instead of `seek`. This way, clients manually creating `?page=` URLs will not break.

One unfortunate aspect of this PR is that we still have to provide `meta.total`, as that's required by the Cargo registries API. Getting the total count requires a full scan over the table, which right now in production is done using an index-only scan in ~20ms.

The PR is best reviewed commit-by-commit.
r? `@jtgeibel`
  • Loading branch information
bors committed May 27, 2021
2 parents d608b13 + 5cfdbca commit f70ae1e
Show file tree
Hide file tree
Showing 7 changed files with 419 additions and 106 deletions.
2 changes: 1 addition & 1 deletion src/controllers/category.rs
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ pub fn index(req: &mut dyn RequestExt) -> EndpointResult {
// FIXME: There are 69 categories, 47 top level. This isn't going to
// grow by an OoM. We need a limit for /summary, but we don't need
// to paginate this.
let options = PaginationOptions::new(req)?;
let options = PaginationOptions::builder().gather(req)?;
let offset = options.offset().unwrap_or_default();
let sort = query.get("sort").map_or("alpha", String::as_str);

Expand Down
Loading

0 comments on commit f70ae1e

Please sign in to comment.