Collection caching #114

Closed

joepio opened this issue Mar 17, 2021 · 3 comments
Labels: lib rust atomic-lib (rust), performance (Speed improvements)

Comments


joepio commented Mar 17, 2021

Collections are currently dynamic resources, which means that they are fully calculated when a user sends a request. That works fine, but it comes at a performance cost, since the DB must be queried.

How to cache this? How does this interact with the get_extended_resource function? How to invalidate the cache? Let's discuss some considerations.

Collections can be sorted and filtered by adding query params. These of course change the dynamic properties such as members and total_count, so each combination of query params should be cached separately.

Since all changes should be done using Commits, we can perform cache invalidation while handling Commits. How does the Commit handler know which resources should be invalidated? For example, let's say I remove the firstName property with the john value from some person Resource. The person previously appeared in the collection of people named john, but that collection should now be invalidated.

Invalidation approaches

Invalidate when any attribute of a resource changes

When a Collection iterates over its members, it adds the subject of the collection (including query params) to a K/V incomingLinks store, where each K is a member subject and each V is an array of subjects that link to it. When a Commit is applied to resource X, the handler looks up X in incomingLinks and invalidates every subject in its V. This will invalidate many collections that could very well produce exactly the same members when re-run.
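A minimal sketch of that incomingLinks idea, using an in-memory map instead of a sled tree; the names IncomingLinks, register and invalidate are made up for illustration and are not part of atomic-lib:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical incoming-links index. K is a member subject, V is the set of
/// collection subjects (including query params) that referenced it.
#[derive(Default)]
struct IncomingLinks {
    map: HashMap<String, HashSet<String>>,
}

impl IncomingLinks {
    /// Called while a Collection iterates over its members.
    fn register(&mut self, member_subject: &str, collection_subject: &str) {
        self.map
            .entry(member_subject.to_string())
            .or_default()
            .insert(collection_subject.to_string());
    }

    /// Called while handling a Commit on `changed_subject`: returns every
    /// collection subject whose cached result should be invalidated.
    fn invalidate(&mut self, changed_subject: &str) -> Vec<String> {
        self.map
            .remove(changed_subject)
            .map(|collections| collections.into_iter().collect())
            .unwrap_or_default()
    }
}
```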

Use TPF index / cache

#14

If we build an index for all values, most of the expensive part is solved. That just leaves sorting, which is still expensive.

joepio added the lib rust atomic-lib (rust) and performance labels on Mar 19, 2021

joepio commented Aug 10, 2021

Now that a Value Index #14 has been implemented, this issue has become a lot less relevant. Only when collections are very large might it make sense to cache things like filtering and sorting.


joepio commented Nov 13, 2021

I'm noticing that on my laptop, some collections (such as commits) are becoming a bit slower (200 ms for 1400 entries). Makes sense, as all items are iterated and sorted on the fly.

That should happen at index time, not at runtime.

But how do we cache this, exactly?

Store cached versions of the Collection resource

Seems easy to implement, but...

  • How do you know when these should be invalidated? How does a commit handler know which collections to update? Should we keep a list of all the existing collections, and for each commit check which collections should be invalidated? What if a resource is modified: should we check both the new version and the old one?
  • How do we deal with authorization? Do we cache only public collections? Should we check every single Resource every time, each time we want to build a collection? Do we cache the entire list of Collections? Should we introduce a new Class that requires members to be children, so we don't have to do authorization checks for every resource?
  • How do we deal with sorting the members? Should we keep track of how every single Collection has been sorted, and keep a sorted list of Subjects for each?

Have a special index for collections

  • When an atom is added to (or removed from!) the store, run all known collections on that atom. Every collection that matches should be invalidated (see the sketch below).
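A rough sketch of that matching step, with made-up stand-in types (Atom, WatchedCollection) rather than the real atomic-lib structs; the filter is a simplified TPF-style property/value pair:

```rust
/// Stand-in types for illustration; the real atomic-lib structs differ.
struct Atom {
    subject: String,
    property: String,
    value: String,
}

/// A known collection: its subject (incl. query params) plus its TPF-style filters.
struct WatchedCollection {
    subject: String,
    property_filter: Option<String>,
    value_filter: Option<String>,
}

impl WatchedCollection {
    /// An added or removed Atom matches if it satisfies both filters.
    fn matches(&self, atom: &Atom) -> bool {
        self.property_filter.as_deref().map_or(true, |p| p == atom.property)
            && self.value_filter.as_deref().map_or(true, |v| v == atom.value)
    }
}

/// Every collection that matches the changed Atom should be invalidated.
fn collections_to_invalidate<'a>(known: &'a [WatchedCollection], atom: &Atom) -> Vec<&'a str> {
    known
        .iter()
        .filter(|collection| collection.matches(atom))
        .map(|collection| collection.subject.as_str())
        .collect()
}
```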

Current implementation idea

  • Add a new sled::Tree called collections_cache (a KV store) that stores an item for each queried collection.
  • The K is the subject of the collection, including query parameters.
  • The V (CollectionCached) stores 1) details about the query (TPF combo + sorting), 2) a Vec of member subjects, and 3) an invalidated boolean (see the sketch after this list).
  • When a Commit is applied, the CommitMonitor iterates over every single item in this tree. It performs the TPF query from that CollectionCached and checks every single Atom of the new and old resource. If there is a hit, invalidated is set to true.
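A sketch of what the CollectionCached value and the collections_cache tree could look like; the field names, and the use of serde/bincode for (de)serialization, are assumptions for illustration, not the actual atomic-lib implementation:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical value type for the collections_cache tree.
#[derive(Serialize, Deserialize)]
struct CollectionCached {
    // 1) details about the query: TPF combo + sorting
    property_filter: Option<String>,
    value_filter: Option<String>,
    sort_by: Option<String>,
    // 2) the member subjects
    members: Vec<String>,
    // 3) set to true by the CommitMonitor when a matching Atom changes
    invalidated: bool,
}

/// Store a cached collection under its subject (including query params).
fn store_cached(
    collections_cache: &sled::Tree,
    collection_subject: &str,
    cached: &CollectionCached,
) -> Result<(), Box<dyn std::error::Error>> {
    let bytes = bincode::serialize(cached)?;
    collections_cache.insert(collection_subject.as_bytes(), bytes)?;
    Ok(())
}
```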

Limitations

  • The most problematic collection, /commits, will be invalidated on every commit. This means that it does not benefit from all of this logic.
  • We could move some collection-calculation logic from query-time to index-time. We could store both the subject and the sorted value, which means that we could sort this list without having to fetch every single Resource and check that value again.


joepio commented Jan 15, 2022

I've learned a bit more about how to optimise using sled, so let's talk about a different implementation idea:

  • Add a new sled::Tree called collections_cache (a KV store) that stores an item each time a collection is queried.
  • The K in collections_cache is the collection URL (including query params!) + the Value being sorted on. This means we always sort by value and have no indexing cost for sorting. Perhaps we should limit this value to a relatively small number of characters. If a Collection is not sorted, it is sorted by subject; in that case, use the subject as the sortable value.
  • The V should be a Vec<String> of all the subjects in the collection that share this sorted value.
  • We use scan_prefix(collection_url) to construct the list of subjects (see the sketch after this list). We can start iterating at the end or at the start, depending on the request of the user.
  • Since we use scan in this manner, we could allow users to pass a start_at parameter: a value from which the lexicographic sorting starts. This can be useful, for example, for constructing a list of most recent items or rendering an infinite scroll list.
  • Add a second new sled::Tree called collections_watching that lists all the currently used Collections. The K is the subject + query params, the V contains the property and value filters. These are required for testing the resources.
  • When a Commit is applied, the CommitMonitor iterates over all these watched collections - both for the old atoms (removed) and the new ones (added). For each Atom, it checks whether it matches any of these collections. If it does, it removes or adds the matching items from collections_cache.
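A sketch of that key layout and prefix scan, assuming bincode-encoded Vec<String> values and a NUL byte as separator between collection URL and sort value; none of these choices are prescribed by atomic-lib or sled:

```rust
/// Hypothetical key layout: collection URL (incl. query params), a NUL byte
/// as separator, then the (truncated) sortable value. Because sled stores
/// keys in lexicographic order, a prefix scan yields members already sorted.
fn cache_key(collection_url: &str, sort_value: &str) -> Vec<u8> {
    // Truncating the sort value keeps keys small; the limit is an assumption.
    let value: String = sort_value.chars().take(64).collect();
    format!("{}\0{}", collection_url, value).into_bytes()
}

/// Read the member subjects of one collection in sorted order. Values are
/// assumed to be bincode-encoded Vec<String>s of subjects sharing a sort value.
fn members(
    collections_cache: &sled::Tree,
    collection_url: &str,
    descending: bool,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let prefix = format!("{}\0", collection_url).into_bytes();
    let iter = collections_cache.scan_prefix(prefix);
    // scan_prefix returns a double-ended iterator, so we can start at the end
    // or at the start, depending on the requested sort direction.
    let items: Box<dyn Iterator<Item = sled::Result<(sled::IVec, sled::IVec)>>> =
        if descending {
            Box::new(iter.rev())
        } else {
            Box::new(iter)
        };
    let mut subjects = Vec::new();
    for item in items {
        let (_key, value) = item?;
        let shared: Vec<String> = bincode::deserialize(&value)?;
        subjects.extend(shared);
    }
    Ok(subjects)
}
```

A start_at parameter could then be implemented by scanning with tree.range(start_key..) instead of scan_prefix, since the keys are already ordered by the sortable value.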

joepio added a commit that referenced this issue Jan 15, 2022
joepio mentioned this issue Jan 15, 2022
joepio added a commit that referenced this issue Jan 16, 2022
joepio added a commit that referenced this issue Jan 25, 2022
* #114 WIP collection cache

* #114 WIP collection cache working but slow

* #114 try different approach

* WIP tests passing, but sorting not working

* WIP

* Sorting one way works...

* Fix sorting

* mostly working

* Move db tests to file

* Move some utility functions

* Cleanup

* authorization tests

* Add authorization tests, get them green

* Cache invalidation test passing

* Add test for deleting, fix temp path gitignore

* Refactor commit opts

* Fix query index

* Change TPF, fix test

* Tests passing

* Improve sorting

* Bump to v0.31.0
joepio closed this as completed Jan 25, 2022