Invalid argument: LMDBStore.getBinaryFast #164
I could probably make a good guess at which line of code it is coming from in LMDB (https://github.com/DoctorEvidence/lmdb-js/blob/master/dependencies/lmdb/libraries/liblmdb/mdb.c#L7738), but there are a few assertions that can trigger that, so I'm not sure which is the actual cause or how it occurred. Do you know if this has ever occurred before, or is it new to a more recent version of lmdb-js/parcel? What I can do is add some code to expand the details of these error messages, so the errors are a little more informative.
I think I've first seen this on CI sometime in the last ~2 weeks. And no user appears to have actually run into this so far.
Ok, I published a v2.3.7 with more detailed error messages when there are failures in gets (or more likely in renewing the transaction for a get), if you want to see if that helps provide more info.
It looks like it failed again, but I am not sure how to access the error/stack info that you posted from the build results?
The failing benchmark job is unrelated, so all tests passed. But I'll report back if I see this error again in the future.
Thank you for the update @mischnic, this is certainly an error/code-branch I have never seen happen in LMDB before! I'm curious, based on the test name and the fact that it triggered the error both times: does this test involve actually moving the location of the LMDB file/directory, possibly while it is in use? I have a couple of ideas for how this error could possibly be induced. One is that somehow the file mutexes/locks become ineffective in locking (maybe due to some file manipulation). The other idea is a little more concrete: I believe that if you open an LMDB database in one thread, read from it, then open the same LMDB database in another thread but using a different path string (even just relative vs absolute to the same location), this will generate a new LMDB environment, and then if you close that database, it will clear any reader slots with the current process id (which is the same among all threads), which would indeed create a bad slot for an active reader in another thread. In lmdb-js I try to avoid creating duplicate LMDB environments (since LMDB explicitly forbids that as dangerous for this reason), but perhaps not trying hard enough.
It's definitely moved. There are no explicit calls to the "old" lmdb db (at the old location) after the move (unless something wasn't flushed of course). So I think we should be calling |
(I did actually briefly look at the code, but didn't try to figure out what ncp really did :) )
You could see if it changes anything, but, at least in theory, it shouldn't be necessary. lmdb-js tries pretty hard, setting up process.on('exit') listeners and cleanup hooks, to ensure data is really flushed before a process exits. But like you said, it's somewhat dependent on how violently the process was exited (it should catch/flush on process.exit, uncaught errors, and the event queue finishing, but not on segfaults, for example).
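A minimal sketch of that exit-hook pattern, with a stand-in `flushSync` (hypothetical; lmdb-js's real flush logic is internal) registered for the cases that can be caught:

```javascript
// Stand-in for the store's real flush routine (hypothetical).
let flushed = false;
function flushSync() {
  flushed = true; // e.g. commit/sync any pending batched writes
}

// Runs on normal exit and on explicit process.exit() calls...
process.on('exit', flushSync);
// ...and turns an uncaught error into an orderly, flushed exit.
process.on('uncaughtException', (err) => {
  console.error(err);
  process.exit(1); // still triggers the 'exit' handler above
});
```

Segfaults or `kill -9` bypass both hooks, which matches the caveat about how violently the process exits.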
Ok, these updates are published in v2.3.8. |
I don't know how helpful these could be, but we started seeing them after we (gatsby) bumped versions. Common "shape" of those errors:
@pieh A few questions about this: Are you thinking this is the same issue (reported in the same place) as #153, just with better error messaging? You don't think this is a new regression with 2.3.10, right (just expanded messaging for the error)? And with these tests, are you creating child processes (or is that specific to the second report)? Are there any worker threads involved? I believe the root cause in this ticket originally was multiple LMDB database/env instances being created for the same database file; do you think that is possible in this situation? (The fix that was applied for this ticket tried to ensure that existing databases are detected based on the dev/inode of the file; I don't know if maybe that is unreliable on some OSes.) Is it possible there are multiple versions of lmdb-js in this test? And I assume you don't have any way of reproducing this locally at this point, do you?
Honestly, I have no idea. It's possible; it's not clear to me whether. #153 has reports from our full e2e runs, while the ones I linked are unit tests, so they are possibly easier to figure out, with fewer parts generally involved in unit tests. I decided to add a comment here instead of #153 because I saw
Too early for me to say if we see errors at an increased rate now, so I can't speculate whether there is a "new regression" yet. I mostly wanted to show the "expanded messaging" in the hope that maybe you can give me some clues on what to debug from this point, to get more information that is actually usable for you, or for us to change our implementation if this is likely because we set things up wrong.
Some tests do create a child process, but not all of the failed ones do - in particular, in both linked runs the failing test is "worker can access node created in main process", which creates a child process with. On the other hand, the example of failing tests without a child process involved results in an error mentioning
We don't use worker threads ourselves, but I will check whether there is anything hidden in deps (for example, I remember
Good question - the problem might be our unit test setup. I will poke around and try to see if running our tests results in multiple. We also have our own wrapper to prevent multiple open dbs, so I will look into that as well.
Those unit tests shouldn't use different
We don't have a reliable way to reproduce it; it is flaky, which leads me to believe there are timing conditions required to reproduce it that I don't understand yet.
So to summarize, my understanding of potential causes of bad reader slots:
Unsure if this is useful, but this is another trace I got:
I think this is a potentially useful clue. It is actually challenging to conceive of how this type of "out of range" error could occur; I have never seen it before, and the dbi range never decrements, so it doesn't immediately seem like it could be reduced to a range of 9 after previously being increased to 11+. However, I think this actually could be possible as a race condition of simultaneously opening databases from different threads using read-only transactions (which is used for opening dbs as of v2.1, for #100). And it appears this error log does show two interleaving errors (due to either two threads or processes). I believe that gatsby does not use threads (just child processes); however, I do believe that jest does use multiple threads by default (https://github.com/facebook/jest/blob/main/packages/jest-config/src/getMaxWorkers.ts#L28). And if this is thread-related, I think it makes sense, as I understand these errors only occur in gatsby's tests (and not actual production usage). I believe I can address this particular "out of range" error, assuming my theory is correct. This doesn't address the bad reader slot error, but it is perhaps a good clue, as multiple threads certainly provide a more plausible situation for multiple database instances to be opened (although I'm still not sure exactly how).
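Independent of any fix inside lmdb-js, one way an application can rule out the simultaneous-open race within a single thread is to funnel all database opens through one promise chain. This is a sketch (the `serializedOpen` helper is hypothetical), and it only serializes opens within one thread, not across worker threads:

```javascript
// Funnel racy operations (like opening databases) through one
// promise chain so no two opens ever run concurrently in this thread.
let openLock = Promise.resolve();

function serializedOpen(openFn) {
  const result = openLock.then(() => openFn());
  // Keep the chain alive even if one open fails.
  openLock = result.catch(() => {});
  return result;
}
```

Across worker threads this is not enough; there, doing all opens on the main thread (or behind a lock shared between threads) would be needed.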
…ondition, unless in readOnly mode, #164
Published version v2.4.0 with changes to address the dbi "out of range" error.
I can still reproduce this error with
I haven't gotten a small reproduction yet and I can only reproduce it on GCP Kubernetes.
You aren't by any chance using any
No, we initiate multiple caches per plugin:
We also have another database
I still don't have any ideas for how this could occur. I think this error would be the result of using an old read transaction to access a database that was opened after the read transaction had begun or been renewed, but in lmdb-js read txns are reset after opening any database. However, I have added some more detailed information to the error messages in case that reveals anything about the state that I didn't expect. Let me know if you need a publish for that. Also, just to confirm a couple of assumptions: if I understand correctly, you are using (multiple) worker threads, but not multiple processes? And I don't know if this is testable without threads, just in case that could be related?
If you can publish a canary, I'm happy to test it; or if you have a guide on how to compile it myself, that's helpful too.
It's a safe, minor patch and version numbers are cheap, so I went ahead and published this as v2.4.3.
And just FYI, as far as how to directly use the source/compile: I think you would set a package version override and directly use a commit/branch, like
…an break the entire locking scheme, parcel-bundler/parcel#8165, #153, #164
Still getting an error with 1.4.5 - I disabled all child_procs from spawning, so it's a single process.
I have been able to create a test case that reproduces this error. For me this occurs when there are a large number of async writes or async transactions running while opening a database (with the same root database). Does this sound like it could possibly be similar to your situation?
Yeah, it could definitely be the case. We can open the cache earlier; right now we open it on first write.
@kriszyp can I try your latest commit, or isn't that the full fix yet?
Sorry, I realize I have been slow with this. I felt this change was a bit more than a patch release, since it involves some changes to the ordering of sync and async transactions, and I was working on putting together a v2.5 release for it that includes some other work. Consequently, I don't think the commit will work on its own, but I will try to build/publish a v2.5-beta this morning, if it builds ok, and you can try that.
No need to apologize :) We're very grateful for your work here! 🥇 |
I guess I updated the #153 ticket but not this one; the updates in v2.5.x will hopefully address this.
2.6.8@data-v1 (when it works with worker_threads)
I have tried to install a version greater than 2.6.9, but it's compiling with an error:
I cannot use v2, because the old database is too large. But... how can I migrate it to v2?
The referenced commit should fix the compilation for lmdb v1. With that you should be able to write a script that reads with one version of LMDB and then writes the data using the newer version of LMDB, to migrate the data. Let me know if you need a publish.
Occasionally, we get this error on CI (and if it does occur, it's always on the same test)
in this case, on this PR: parcel-bundler/parcel#7995
Any idea where this might be coming from? The parameter that Parcel passes to store.get(..) should always be a string.