Collection caching #114

Closed

joepio opened this issue Mar 17, 2021 · 3 comments
Labels: lib rust atomic-lib (rust), performance (Speed improvements)

Comments


joepio commented Mar 17, 2021

Collections are currently dynamic resources, which means that they are fully calculated when a user sends a request. That works fine, but it comes at a performance cost, since the DB must be queried.

How to cache this? How does this interact with the get_extended_resource function? How to invalidate the cache? Let's discuss some considerations.

Collections can be sorted and filtered by adding query params. These of course change the dynamic properties such as members and total_count, so each combination of query params should be cached separately.

Since all changes should be done using Commits, we can perform cache invalidation while handling Commits. How does the Commit handler know which resources should be invalidated? For example, let's say I remove the firstName property with the john value from some person Resource. The person previously appeared in the collection of people named john, but that collection should now be invalidated.

Invalidation approaches

Invalidate when any attribute of a resource changes

When a Collection iterates over its members, it adds the subject of the collection (including query params) to a K/V incomingLinks store, where each K is a member subject and each V is an array of subjects that link to it. When a Commit is applied to resource X, the handler looks up X in incomingLinks and invalidates every subject in its V. This will invalidate many collections that could very well produce exactly the same members when re-run.
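A minimal sketch of that incomingLinks idea, using an in-memory map instead of a sled tree; the names IncomingLinks, register and invalidate are made up for illustration and are not part of atomic-lib:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical incoming-links index. K is a member subject, V is the set of
/// collection subjects (including query params) that referenced it.
#[derive(Default)]
struct IncomingLinks {
    map: HashMap<String, HashSet<String>>,
}

impl IncomingLinks {
    /// Called while a Collection iterates over its members.
    fn register(&mut self, member_subject: &str, collection_subject: &str) {
        self.map
            .entry(member_subject.to_string())
            .or_default()
            .insert(collection_subject.to_string());
    }

    /// Called while handling a Commit on `changed_subject`: returns every
    /// collection subject whose cached result should be invalidated.
    fn invalidate(&mut self, changed_subject: &str) -> Vec<String> {
        self.map
            .remove(changed_subject)
            .map(|collections| collections.into_iter().collect())
            .unwrap_or_default()
    }
}
```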

Use TPF index / cache

#14

If we build an index for all values, most of the expensive part is solved. That just leaves sorting, which is still expensive.

joepio added the lib rust atomic-lib (rust) and performance labels on Mar 19, 2021

joepio commented Aug 10, 2021

Now that a Value Index #14 has been implemented, this issue has become a lot less relevant. Only when collections are very large might it make sense to cache things like filtering and sorting.


joepio commented Nov 13, 2021

I'm noticing that on my laptop, some collections (such as commits) are becoming a bit slower (200 ms for 1400 entries). Makes sense, as all items are iterated and sorted on the fly.

That should happen at index time, not at runtime.

But how do we cache this, exactly?

Store cached versions of the Collection resource

Seems easy to implement, but...

  • How do you know when these should be invalidated? How does a commit handler know which collections to update? Should we keep a list of all the existing collections, and for each commit check which collections should be invalidated? What if a resource is modified: should we check both the new version and the old one?
  • How do we deal with authorization? Do we cache only public collections? Should we check every single Resource every time, each time we want to build a collection? Do we cache the entire list of Collections? Should we introduce a new Class that requires members to be children, so we don't have to do authorization checks for every resource?
  • How do we deal with sorting the members? Should we keep track of how every single Collection has been sorted, and keep a sorted list of Subjects for each?

Have a special index for collections

  • When an atom is added to (or removed from!) the store, run all known collections on that atom. Every collection that matches should be invalidated (see the sketch below).
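A rough sketch of that matching step, with made-up stand-in types (Atom, WatchedCollection) rather than the real atomic-lib structs; the filter is a simplified TPF-style property/value pair:

```rust
/// Stand-in types for illustration; the real atomic-lib structs differ.
struct Atom {
    subject: String,
    property: String,
    value: String,
}

/// A known collection: its subject (incl. query params) plus its TPF-style filters.
struct WatchedCollection {
    subject: String,
    property_filter: Option<String>,
    value_filter: Option<String>,
}

impl WatchedCollection {
    /// An added or removed Atom matches if it satisfies both filters.
    fn matches(&self, atom: &Atom) -> bool {
        self.property_filter.as_deref().map_or(true, |p| p == atom.property)
            && self.value_filter.as_deref().map_or(true, |v| v == atom.value)
    }
}

/// Every collection that matches the changed Atom should be invalidated.
fn collections_to_invalidate<'a>(known: &'a [WatchedCollection], atom: &Atom) -> Vec<&'a str> {
    known
        .iter()
        .filter(|collection| collection.matches(atom))
        .map(|collection| collection.subject.as_str())
        .collect()
}
```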

Current implementation idea

  • Add a new sled::Tree called collections_cache (a KV store) that stores an item for each queried collection.
  • The K is the subject of the collection, including query parameters.
  • The V (CollectionCached) stores 1) details about the query (TPF combo + sorting), 2) a Vec of member subjects, and 3) an invalidated boolean (see the sketch after this list).
  • When a Commit is applied, the CommitMonitor iterates over every single item in this tree. It performs the TPF query from that CollectionCached and checks every single Atom of the new and old resource. If there is a hit, invalidated is set to true.
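A sketch of what the CollectionCached value and the collections_cache tree could look like; the field names, and the use of serde/bincode for (de)serialization, are assumptions for illustration, not the actual atomic-lib implementation:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical value type for the collections_cache tree.
#[derive(Serialize, Deserialize)]
struct CollectionCached {
    // 1) details about the query: TPF combo + sorting
    property_filter: Option<String>,
    value_filter: Option<String>,
    sort_by: Option<String>,
    // 2) the member subjects
    members: Vec<String>,
    // 3) set to true by the CommitMonitor when a matching Atom changes
    invalidated: bool,
}

/// Store a cached collection under its subject (including query params).
fn store_cached(
    collections_cache: &sled::Tree,
    collection_subject: &str,
    cached: &CollectionCached,
) -> Result<(), Box<dyn std::error::Error>> {
    let bytes = bincode::serialize(cached)?;
    collections_cache.insert(collection_subject.as_bytes(), bytes)?;
    Ok(())
}
```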

Limitations

  • The most problematic collection, /commits, will be invalidated on every commit. This means that it does not benefit from all of this logic.
  • We could move some collection-calculation logic from query-time to index-time. We could store both the subject and the sorted value, which means that we could sort this list without having to fetch every single Resource and check that value again.


joepio commented Jan 15, 2022

I've learned a bit more about how to optimise using sled, so let's talk about a different implementation idea:

  • Add a new sled::Tree called collections_cache (a KV store) that stores an item each time a collection is queried.
  • The K in collections_cache is the collection URL (including query params!) + the Value being sorted on. This means we always sort by value and have no indexing cost for sorting. Perhaps we should limit this value to a relatively small number of characters. If a Collection is not sorted, it is sorted by subject; in that case, use the subject as the sortable value.
  • The V should be a Vec<String> of all the subjects in the collection that share this sorted value.
  • We use scan_prefix(collection_url) to construct the list of subjects (see the sketch after this list). We can start iterating at the end or at the start, depending on the request of the user.
  • Since we use scan in this manner, we could allow users to pass a start_at parameter: a value from which the lexicographic sorting starts. This can be useful, for example, for constructing a list of most recent items or rendering an infinite scroll list.
  • Add a second new sled::Tree called collections_watching that lists all the currently used Collections. The K is the subject + query params, the V contains the property and value filters. These are required for testing the resources.
  • When a Commit is applied, the CommitMonitor iterates over all these watched collections - both for the old atoms (removed) and the new ones (added). For each Atom, it checks whether it matches any of these collections. If it does, it removes or adds the matching items from collections_cache.
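A sketch of that key layout and prefix scan, assuming bincode-encoded Vec<String> values and a NUL byte as separator between collection URL and sort value; none of these choices are prescribed by atomic-lib or sled:

```rust
/// Hypothetical key layout: collection URL (incl. query params), a NUL byte
/// as separator, then the (truncated) sortable value. Because sled stores
/// keys in lexicographic order, a prefix scan yields members already sorted.
fn cache_key(collection_url: &str, sort_value: &str) -> Vec<u8> {
    // Truncating the sort value keeps keys small; the limit is an assumption.
    let value: String = sort_value.chars().take(64).collect();
    format!("{}\0{}", collection_url, value).into_bytes()
}

/// Read the member subjects of one collection in sorted order. Values are
/// assumed to be bincode-encoded Vec<String>s of subjects sharing a sort value.
fn members(
    collections_cache: &sled::Tree,
    collection_url: &str,
    descending: bool,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
    let prefix = format!("{}\0", collection_url).into_bytes();
    let iter = collections_cache.scan_prefix(prefix);
    // scan_prefix returns a double-ended iterator, so we can start at the end
    // or at the start, depending on the requested sort direction.
    let items: Box<dyn Iterator<Item = sled::Result<(sled::IVec, sled::IVec)>>> =
        if descending {
            Box::new(iter.rev())
        } else {
            Box::new(iter)
        };
    let mut subjects = Vec::new();
    for item in items {
        let (_key, value) = item?;
        let shared: Vec<String> = bincode::deserialize(&value)?;
        subjects.extend(shared);
    }
    Ok(subjects)
}
```

A start_at parameter could then be implemented by scanning with tree.range(start_key..) instead of scan_prefix, since the keys are already ordered by the sortable value.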

joepio added a commit that referenced this issue Jan 15, 2022
joepio mentioned this issue Jan 15, 2022
joepio added a commit that referenced this issue Jan 16, 2022
joepio added a commit that referenced this issue Jan 25, 2022
* #114 WIP collection cache

* #114 WIP collection cache working but slow

* #114 try different approach

* WIP tests passing, but sorting not working

* WIP

* Sorting one way works...

* Fix sorting

* mostly working

* Move db tests to file

* Move some utility functions

* Cleanup

* authorization tests

* Add authorization tests, get them green

* Cache invalidation test passing

* Add test for deleting, fix temp path gitignore

* Refactor commit opts

* Fix query index

* Change TPF, fix test

* Tests passing

* Improve sorting

* Bump to v0.31.0
joepio closed this as completed Jan 25, 2022