[upstream] Eventcore bug with collection during allocation #1367
This is your clue: the preimages are not sent fast enough, but the faucet is making transactions as fast as possible, so it will lead to this scenario. The real issue here is that we need to implement slashing (#1076) and also do something about missing preimages. Maybe this is also related to the freezing issue. Have you checked whether any other nodes have crashed (not just eu-002) when a freeze happens? We need to inspect the logs of all the nodes.
That makes sense. The nodes create many blocks in a short time due to a large pool of UTXOs. This leads to the faucet falling behind and trying to catch up. All the nodes are crashing, so it might be related to freezing.
More specifically: when a validator is slashed because of a missing preimage, we will also have to exclude that validator's preimage when generating the sum random number: https://github.com/bpfkorea/agora/blob/06b3d433c9105dc77e1d992baf017b368315f96c/source/agora/consensus/EnrollmentManager.d#L550 But we won't have this detection until #1076 is implemented.

Another possible issue here is that the validators are not sending their preimages fast enough. I suspect this can happen around the time a validator is about to re-enroll. It only re-enrolls one block before expiration: https://github.com/bpfkorea/agora/blob/06b3d433c9105dc77e1d992baf017b368315f96c/source/agora/node/Validator.d#L453 Normally one block should be enough time (it's 10 minutes), however in our test system the … So I suggest the following: …
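As an illustration of the exclusion idea above, here is a minimal sketch under assumed types (Hash, ValidatorPreimage, and combine are placeholders, not Agora's EnrollmentManager API): derive the random seed only from validators that actually revealed their pre-image.

import std.algorithm : filter, fold;

struct Hash { ubyte[64] data; }

// Stand-in for however the pre-images are actually combined
Hash combine (Hash a, Hash b)
{
    Hash r;
    foreach (i, ref b_; r.data)
        b_ = cast(ubyte)(a.data[i] ^ b.data[i]);
    return r;
}

struct ValidatorPreimage
{
    Hash preimage;
    bool revealed;  // false => missing pre-image, candidate for slashing
}

// Derive the random seed only from validators that revealed their pre-image
Hash getRandomSeed (ValidatorPreimage[] validators)
{
    return validators
        .filter!(v => v.revealed)
        .fold!((acc, v) => combine(acc, v.preimage))(Hash.init);
}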
This will be a little harder to implement. What could be done is to get the local clock time (on your PC) with Ansible and then change the configs' … I suspect this will be difficult to do. Then there is the other alternative: change Agora so it doesn't try to catch up and generate too many blocks at once. Unfortunately there are still many design decisions that haven't been resolved, so this kind of fix probably isn't future-proof.
On the other hand, making the nodes generate blocks too fast can become a problem in real-world scenarios, for example after prolonged network outages. So maybe we really should change Agora to not generate blocks too fast.
I think if we upgrade to the latest build the situation may improve, since more than 8 transactions will be allowed in a block. However, we may run into a different problem where the block size is too big.
@kchulj already upgraded; the problem persists.
I've tested with the latest version and it seems like this is the issue. There's always a problem with the pre-image at …
Well, it depends what we want to do. Do we implement a workaround until #1076 is implemented? You could, for example, make the faucet poll for the preimages until you get a preimage with the right distance (using …).
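A minimal sketch of that workaround on the Faucet side, with hypothetical names (ValidatorClient, getPreimage, and PreImageInfo are assumptions, not Agora's actual client API): poll until the validator has revealed a pre-image of the required distance before sending more transactions.

import core.thread : Thread;
import core.time : seconds;

struct PreImageInfo { ulong distance; }

// Hypothetical client interface; the real validator API may differ
interface ValidatorClient { PreImageInfo getPreimage (); }

// Poll until the validator has revealed a pre-image of at least
// `expectedDistance`, or give up after `maxTries` attempts
bool waitForPreimage (ValidatorClient client, ulong expectedDistance, uint maxTries = 30)
{
    foreach (_; 0 .. maxTries)
    {
        if (client.getPreimage().distance >= expectedDistance)
            return true;
        Thread.sleep(2.seconds);
    }
    return false;
}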
Perhaps Agora could wait at least a few seconds after externalising a block before starting the next one. That way the nodes should have enough time to reveal their pre-images. We already have this wait for pre-images in the unit tests and it is not realistic. Even if #1076 is implemented, we need to give fair time for the nodes to reveal.
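A minimal sketch of that idea, with names that are assumptions rather than Agora's actual nominator code: track when the last block was externalised and refuse to nominate again until a minimum gap has passed, giving validators time to reveal their pre-images.

import core.time : Duration, MonoTime, seconds;

struct Nominator
{
    Duration minBlockGap = 5.seconds;  // assumed value, not an Agora constant
    MonoTime lastExternalized;

    // Record the time whenever a block is externalised
    void onExternalized () { this.lastExternalized = MonoTime.currTime; }

    // Only start nominating the next block once the minimum gap has passed
    bool canNominate ()
    {
        return MonoTime.currTime - this.lastExternalized >= this.minBlockGap;
    }
}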
Both of the ideas sound good. The timing definitely needs improvement, because I'm observing that the "catch-up gap" is not getting smaller over time. So should the waiting be on the …
Another option is to let the catch-up happen with a single transaction from the Faucet. We do this in the CI system test now. Agora should then externalise the blocks at the correct interval from then on, and you can start sending many transactions. You just need to wait for the first block to be created.
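A rough sketch of that sequencing from the Faucet's point of view (FaucetClient and its methods are hypothetical placeholders, not the real Faucet API): send a single transaction, wait for the first block, then switch to full load.

// Hypothetical Faucet-side helpers
interface FaucetClient
{
    void sendTransactions (size_t count);
    void waitForBlockHeight (ulong height);
}

// Let catch-up finish with a single transaction, then resume full load
void warmUpThenFlood (FaucetClient faucet)
{
    faucet.sendTransactions(1);     // one tx while the node catches up
    faucet.waitForBlockHeight(1);   // wait for the first block to be created
    foreach (_; 0 .. 100)
        faucet.sendTransactions(8); // then start sending many transactions
}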
I tried some workarounds and it seems like this is simple and feasible.
Based on running the test several times, the …
We finally got the backtrace: https://gist.github.com/AndrejMitrovic/4bbbdc7e66dbade63f4fdd02570b0427

The crash can be reduced to a test case like the following:
import core.memory;

size_t[size_t] glob;

class Foo
{
    size_t count;

    this (size_t entries) @safe
    {
        this.count = entries;
        foreach (idx; 0 .. entries)
            glob[idx] = idx;
    }

    ~this () @safe
    {
        // Removing entries mutates the GC-managed AA; when this destructor
        // runs during a collection, that is what triggers the error
        foreach (idx; 0 .. this.count)
            glob.remove(idx);
    }
}

void main ()
{
    bar();
    GC.collect(); // Needs to happen from a GC collection
}

void bar () @safe
{
    Foo f = new Foo(16);
}

I guess we need to patch …
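Until that fix is available, one possible mitigation for code shaped like the test case above is to skip GC-managed cleanup when the destructor is being run by the GC itself. This is only a sketch, and it assumes a druntime recent enough to provide GC.inFinalizer:

import core.memory : GC;

size_t[size_t] glob;

class Foo
{
    size_t count;

    this (size_t entries) @safe
    {
        this.count = entries;
        foreach (idx; 0 .. entries)
            glob[idx] = idx;
    }

    ~this () @safe
    {
        // Mutating the GC-managed AA during a collection triggers the
        // InvalidMemoryOperationError, so only clean up when the destructor
        // runs deterministically (e.g. via destroy())
        if (!GC.inFinalizer())
        {
            foreach (idx; 0 .. this.count)
                glob.remove(idx);
        }
    }
}

void main ()
{
    auto f = new Foo(16);
    destroy(f);    // deterministic destruction: cleanup runs normally
    GC.collect();  // a finalizer-driven destructor would skip the cleanup
}

With this guard, deterministic destruction via destroy() still cleans up the AA, while finalizer-driven destruction simply leaves the entries to be removed elsewhere.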
This uses our own branch which includes a bugfix for bosagora#1367
Resolved on our side with #1397. I'm renaming this into a tracking issue for when eventcore is fixed.
The crash is an Invalid Memory Operation exception. It happened at Height 51.

Closing since upgrading the compiler will fix this.