
Replace ds.MapDatastore with proper in memory datastore #40

Closed
jsimnz opened this issue Nov 15, 2021 · 12 comments · Fixed by #56

@jsimnz (Member) commented Nov 15, 2021

Currently we use MapDatastore from github.com/ipfs/go-datastore as our default "in-memory" datastore, which is basically just a map[string]interface{} wrapped in the Datastore interface.

The main problem with this is the lack of in-order iteration and prefix queries: Go's maps have no defined ordering, so ranging over the key-value pairs visits them in an unspecified order.

Our query system is designed on the assumption that our underlying storage engine supports efficient in-order range/prefix scans over a collection of key-value pairs.
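For illustration, a minimal standalone sketch (hypothetical keys, not Defra code) of why a plain Go map can't back ordered prefix scans directly - every ordered read has to collect and sort the matching keys first:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

func main() {
	// Ranging over a plain map yields keys in an unspecified order,
	// which can differ from run to run.
	kv := map[string][]byte{
		"/db/data/doc2": []byte("b"),
		"/db/data/doc1": []byte("a"),
		"/db/block/x":   []byte("c"),
	}
	for k := range kv {
		fmt.Println("unordered:", k)
	}

	// An in-order prefix scan over a map means collecting and sorting
	// the matching keys on every query.
	var keys []string
	for k := range kv {
		if strings.HasPrefix(k, "/db/data/") {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println("ordered:", k, string(kv[k]))
	}
}
```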

A few options are available here:

@jsimnz added the area/datastore, feature, and perf labels on Nov 15, 2021
@jsimnz (Member, Author) commented Nov 16, 2021

Blocked by #44

@AndrewSisley (Contributor)

I have a few questions on this:

  1. How much time do we want to spend on this (including longer-term)?
  2. What kind of data-volumes should we target? i.e. a few KB, or a few TB?
  3. How generic do we want to keep this? We could gain quite a bit by taking advantage of the fact that we know the structure of our data.
  4. How important is the sorting of the key components (e.g. dockey or cids - not the full ds.Key), and where exactly do we use this? I can't think of a place off the top of my head, and we can make serious gains if we only need to care about determinism (again, within the context of a key component).
  5. Do you know if there is a significant imbalance in the performance constraints? Should we target memory efficiency over CPU time, or just keep things balanced/sensible on both sides?

@AndrewSisley (Contributor)

Would also be super handy to have some performance tests before pushing forward with this. Marking as blocked by #30

@jsimnz (Member, Author) commented Nov 19, 2021

> Would also be super handy to have some performance tests before pushing forward with this. Marking as blocked by #30

I agree it's going to be important to track the perf of the implementation, but I wouldn't call it a full blocker of this feature. The great thing about our datastore interface is the freedom to use whichever engine we need with no code changes, so we can develop the in-memory version without changing any structures.

As long as we build the in-mem engine using decent patterns, it's always going to be faster than the disk-backed persistent storage via BadgerDB.

@jsimnz (Member, Author) commented Nov 19, 2021

> I have a few questions on this:
>
>   1. How much time do we want to spend on this (including longer-term)?
>   2. What kind of data-volumes should we target? i.e. a few KB, or a few TB?
>   3. How generic do we want to keep this? We could gain quite a bit by taking advantage of the fact that we know the structure of our data.
>   4. How important is the sorting of the key components (e.g. dockey or cids - not the full ds.Key), and where exactly do we use this? I can't think of a place off the top of my head, and we can make serious gains if we only need to care about determinism (again, within the context of a key component).
>   5. Do you know if there is a significant imbalance in the performance constraints? Should we target memory efficiency over CPU time, or just keep things balanced/sensible on both sides?
  1. It's hard to tell at this point; the primary goal is to have a "good" version of the in-mem store compared to the current implementation, which is just a glorified map[string]interface{}. This may become a longer-term goal, but the flexibility of the storage engine interface, and the fact that this is ephemeral storage, mean there are very few backwards compatibility guarantees we need specifically for this storage engine.

  2. We should mostly expect to consume the majority of the available memory on the device we're running on. In a datacenter deployment, this can reach hundreds of GB to TBs.

  3. I think generic is more important here, since the same storage engine is used for the 4 different stores of Defra - the Datastore, Headstore, Blockstore, and Systemstore - as well as for handling various kinds of indexes (primary and secondary) on the datastore. Trying to optimize for all of those stores might be too much.

  4. I think the only goal we should optimize for (in terms of access patterns) is efficient in-order key iteration (at the byte level), as that will allow us to build up the rest of the interface.

  5. I don't have a good sense at the moment regarding the balance of CPU vs I/O.

I think starting off with an existing BTree implementation should get us far enough for now. BuntDB is honestly a fairly good starting point as well; it's a full-featured storage engine already built on their BTree implementation.
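As a rough sketch of the idea (not Defra code - it assumes github.com/google/btree purely for illustration, and a real version would still need to sit behind the go-datastore Datastore interface and add transactions), an ordered in-memory store with byte-level in-order prefix scans could look like:

```go
package memstore

import (
	"strings"

	"github.com/google/btree"
)

// entry is a single key/value pair, ordered by raw key bytes.
type entry struct {
	key   string
	value []byte
}

func (a entry) Less(b btree.Item) bool { return a.key < b.(entry).key }

// Store keeps entries in a BTree so iteration comes back in key order.
type Store struct {
	tree *btree.BTree
}

func NewStore() *Store { return &Store{tree: btree.New(32)} }

func (s *Store) Put(key string, value []byte) {
	s.tree.ReplaceOrInsert(entry{key: key, value: value})
}

// PrefixScan visits all entries whose key starts with prefix, in ascending
// key order, stopping early if fn returns false.
func (s *Store) PrefixScan(prefix string, fn func(key string, value []byte) bool) {
	s.tree.AscendGreaterOrEqual(entry{key: prefix}, func(i btree.Item) bool {
		e := i.(entry)
		if !strings.HasPrefix(e.key, prefix) {
			return false // walked past the prefix range; stop iterating
		}
		return fn(e.key, e.value)
	})
}
```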

@jsimnz (Member, Author) commented Nov 19, 2021

The other design goal is transaction support; if we do things ourselves, we have to implement it from scratch. BuntDB already supports transactions.
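For context only (not an endorsement - see the discussion below), BuntDB's transactional API looks roughly like this; the :memory: path and the Update/View shape follow its README, but treat the details as an assumption rather than vetted usage:

```go
package main

import (
	"fmt"

	"github.com/tidwall/buntdb"
)

func main() {
	// Open a purely in-memory database.
	db, err := buntdb.Open(":memory:")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Read/write transaction: all writes commit or roll back together.
	err = db.Update(func(tx *buntdb.Tx) error {
		_, _, err := tx.Set("/db/data/doc1", "hello", nil)
		return err
	})
	if err != nil {
		panic(err)
	}

	// Read-only transaction.
	err = db.View(func(tx *buntdb.Tx) error {
		val, err := tx.Get("/db/data/doc1")
		if err != nil {
			return err
		}
		fmt.Println(val)
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```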

@jsimnz jsimnz closed this as completed Nov 19, 2021
@jsimnz jsimnz reopened this Nov 19, 2021
@AndrewSisley (Contributor)

Been having quite a long think about this:

RE: custom vs super generic/existing solution:

Pros (for custom):

  • We can absolutely smash the benchmarks of a generic solution: we really don't need it to be sorted (just grouped), and most of the key is pretty much static (only growing on schema change)
  • Much easier to (even) further optimize access to key areas
  • Super easy to incrementally improve it (at least to start off with, of course); there are lots of natural points for us to split the work over
  • Code will likely be much simpler than learning a new datastore package

Cons:

  • You seem to not be fussed about testing performance here. I can maybe live with that for something like Badger in the short term, but there is no way I would merge a custom data structure this important without solid tests - so there is some additional dev cost here (vs Badger; if you really, really want BuntDB, I would want them there too)
  • Will of course need to spend extra time building and maintaining it
  • Perhaps a greater risk of bugs vs something like Badger, as this will be something of a side project for us, although our code should be able to be much simpler (depending on how much fun we want to have / how fast we want it to be)

RE: Which existing package to use:

  • Really not keen on BuntDB: they have an open issue where they managed to corrupt a user's database (a comment suggests it's fixed, but the issue is still open), and "There can only be one read/write transaction running at a time" (across the entire database), which is quite a hefty bottleneck in my opinion. It also seems to be a one-man-band/hobby project, and I would suggest it would be unwise to push it as a solution to our users (or for our automated tests)
  • Badger has an InMemory option - why not use that? It would probably require us to update the version though, which would be a file-format breaking change
  • If not Badger, I would suggest we keep looking for something other than BuntDB

Also - semi-related, I think we can probably restructure the code in db.go to break the code-dependency on any given implementation - moving the badger code out to a new package, and likewise for the IM stuff. I might do this first (#50).

Your thoughts?

@jsimnz (Member, Author) commented Nov 22, 2021

Love the input!

Even with these pros/cons for custom, I'm still leaning towards a generic implementation, such that we can still wrap the given storage with the Datastore package interface. I think the freedom to switch and experiment with various datastores is really important at this moment.

Additionally, some of the various deployment targets will certainly need their own specific datastore implementations - e.g. compiling the DB via WASM to run in a browser will need an adapted IndexedDB wrapper behind the Datastore interface, etc.

As for BadgerDB, I don't see where you can run it in an "in-memory" mode without resorting to various shortcuts/hacks w.r.t. the Sync/Merge/Flush options and a virtual-memory-backed filesystem. Unless I'm missing something here :/

As for BuntDB, not sure how I missed the single write-transaction limitation... certainly a blocker.

I'm more than happy for us to dedicate whatever time is necessary, and even a long-term maintenance window is fine.

@AndrewSisley (Contributor) commented Nov 22, 2021

Ah, I was for sure going to keep the ipfs interfaces and all the Wrap stuff with a custom implementation - losing that would be a big loss, even if keeping it means a bit of wasted CPU / in-store code complication.

According to the docs, BadgerDB has an in-memory flag here: https://dgraph.io/docs/badger/get-started/#in-memory-mode-diskless-mode - I think it must be in v2 or higher, as the current version we use doesn't seem to have it (we are on v1; there is a v3 now). There is a file-format change between v1 and v2 though, so you/product would need to give the green light, as any early users could be affected by that. We could try to use two different (major) versions of the package, as it looks like Go supports that, but that could cause confusion and it might be better to find another package in the short term (I assume we'll update Badger at some point anyway?).
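For reference, a minimal sketch of the in-memory mode described in those docs (available in Badger v2+; the v3 import path here is an assumption about which version we'd land on):

```go
package main

import (
	"fmt"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// InMemory mode keeps everything in RAM and never touches disk;
	// the directory argument must be left empty.
	db, err := badger.Open(badger.DefaultOptions("").WithInMemory(true))
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Writes go through the same transactional API as the on-disk mode.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set([]byte("key"), []byte("value"))
	})
	if err != nil {
		panic(err)
	}

	// Read-only transaction.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get([]byte("key"))
		if err != nil {
			return err
		}
		return item.Value(func(val []byte) error {
			fmt.Println(string(val))
			return nil
		})
	})
	if err != nil {
		panic(err)
	}
}
```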

Will have a look for others in a bit.

@jsimnz (Member, Author) commented Nov 22, 2021

That's weird - I went looking for an in-memory mode for Badger but couldn't for the life of me find one, lol. Let me double-check how it's implemented internally, but other than that it looks like a good (at least short-term) solution.

As for the required upgrade, that's fine at the moment as a breaking change. I'd rather upgrade now and not have to do it later, as more partners start building more involved applications.

@jsimnz (Member, Author) commented Nov 23, 2021

See #52

@jsimnz (Member, Author) commented Dec 2, 2021

Fixed in #56

@jsimnz jsimnz closed this as completed Dec 2, 2021