Memory usage spike induced crash #10371
Comments
Can you please run with
It should actually be
@SimonHausdorf Are you sure there isn't some other process sending e.g. SIGTERM to the parity process?
I'm glad I'm not the only one: #10364
I'm seeing this behavior as well. I put in some logging to trace it and will report back if I find anything new. As a result, my nodes are pretty much unusable because they shut down or crash after a short period of time. I'm seeing this on both 2.2.10-stable and 2.3.3-beta.
Yes, I am running parity in screen, if that makes any difference. Nothing else is running on this machine besides Parity. Ubuntu 18.10. Maybe we can collect some information on how you are all running parity and on which systems @c0deright @lampshade9909
@SimonHausdorf I'm running on AWS with Ubuntu Server 18.04 LTS (HVM), SSD Volume Type - ami-0bdf93799014acdc4. Nothing else is running on this instance, only parity. c5.xlarge instance with 200 GB of disk space.
So also Ubuntu 18.x. We have a kovan server running on 16.04 without any problems, but it is on 2.3.0-beta. I will set up two new nodes with Ubuntu 16.04 and 18.10 and let them run overnight to see if one of them crashes.
I don't run parity 2.2.10 and 2.3.3 anymore because of #10361 (I have downgraded to 2.2.9 and 2.3.2). My setup is an AWS EC2 t3.medium instance, EBS gp2 volume, Ubuntu 16.04 LTS, with parity started and stopped via a self-written systemd service file:
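A minimal sketch of that kind of unit (user name, paths and options here are placeholders, not the actual file) would look something like:

```ini
[Unit]
Description=Parity Ethereum client
Wants=network-online.target
After=network-online.target

[Service]
# "foobar" and the config path are placeholders; the binary is the symlink
# described below, so versions can be swapped without touching the unit.
User=foobar
ExecStart=/home/foobar/parity-binary --config /home/foobar/config.toml
Restart=on-failure
# Give parity time to finish its "Finishing work, please wait..." shutdown.
KillSignal=SIGTERM
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target
```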
/home/foobar/parity-binary is a symlink to the actual binary so I can switch versions easily.
@SimonHausdorf It may be important to note that I'm making a very significant number of jsonrpc calls to my nodes. Calls like: "eth_getBalance", "eth_call", "eth_blockNumber", "eth_getTransactionReceipt", "eth_getTransactionByHash", etc.
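For reference, each of those is a plain JSON-RPC 2.0 call over HTTP; a single eth_getBalance request against the node's HTTP endpoint (default port 8545 assumed, the address is just a placeholder) looks roughly like:

```sh
curl -s -X POST http://127.0.0.1:8545 \
  -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_getBalance","params":["0x0000000000000000000000000000000000000000","latest"],"id":1}'
```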
Same for us, we have a lot of the RPC calls you mentioned.
The two new nodes are still running; our main node crashed 2 times during the night.
@SimonHausdorf Mine still crashes consistently, approximately 50-100 minutes after starting. Here's the configuration I'm using:
On a side note unrelated to this issue: why do you @lampshade9909 and @SimonHausdorf run parity with all the parameters on the command line or from screen, instead of letting the OS/systemd/whatever handle starting and stopping parity like every other daemon (sshd, cron, syslog, ...)? And why don't you put all the config in config.toml and just run the 'parity' command? I'm just curious.
I am running it as a service now so it restarts automatically after a crash. As to the config, just laziness :)
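For anyone wanting to make that switch, a minimal config.toml along those lines might look roughly like this (section and key names follow Parity's documented layout as far as I know; values are purely illustrative):

```toml
[parity]
chain = "foundation"
base_path = "/home/foobar/.local/share/io.parity.ethereum"

[rpc]
interface = "local"
port = 8545
apis = ["eth", "net", "web3"]

[network]
min_peers = 25
max_peers = 50
```

With that in place, `parity --config /path/to/config.toml` (or just `parity`, if the file sits in the default config location) should be enough.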
Just appearing here as well: I just tested that when I try to rerun my application that sends transactions I get
@Pzixel that doesn't appear related to the issue at hand, but just to make sure, can you find out which rpc method is returning the invalid response?
@joshua-mir can't say right now since I restarted the app, but it should be sendrawtransaction, getreceipt or getcode. However, it may be unrelated, because it works when I restart the client. Maybe some keepalive option or something else prevents reconnecting when the server is restarted.
This is just the habit I've gotten myself into. Certainly not the ideal way to do it. So you have your parity node being restarted once it shuts down? I need to figure out how to do that; I just haven't had the need, since my parity nodes were pretty reliable and didn't shut down/crash much up until this last update. Edit: I installed immortal and will use that to keep the parity node alive. Maybe I'll move to the config file next ;-)
@lampshade9909 that's fine, but it seems that clients may not be OK with reconnecting to a restarted server, see my comment above.
I am now seeing this shutdown issue:
```
2019-02-20 17:03:30 UTC 22/75 peers 303 MiB chain 564 MiB db 17 KiB queue 72 KiB sync RPC: 0 conn, 0 req/s, 18610 µs
2019-02-20 17:03:30 UTC 21/75 peers 303 MiB chain 564 MiB db 26 KiB queue 72 KiB sync RPC: 0 conn, 0 req/s, 18610 µs
2019-02-20 17:05:25 UTC Syncing #7527099 0xac8b…a5e3 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 4 Qed #7527104 20/75 peers 303 MiB chain 564 MiB db 36 KiB queue 72 KiB sync RPC: 0 conn, 0 req/s, 18610 µs
2019-02-20 17:05:25 UTC 20/75 peers 303 MiB chain 564 MiB db 0 bytes queue 72 KiB sync RPC: 0 conn, 0 req/s, 18610 µs
2019-02-20 17:05:27 UTC 19/75 peers 303 MiB chain 564 MiB db 20 KiB queue 72 KiB sync RPC: 0 conn, 2 req/s, 18610 µs
2019-02-20 17:08:07 UTC 2/75 peers 303 MiB chain 564 MiB db 0 bytes queue 72 KiB sync RPC: 0 conn, 2 req/s, 198442 µs
2019-02-20 17:12:11 UTC Finishing work, please wait...
2019-02-20 17:13:12 UTC Shutdown is taking longer than expected.
```
As well as a memory leak:
```
[1082346.387016] Out of memory: Kill process 64976 (parity) score 335 or sacrifice child
```
Parity 2.2.10 upgraded from 2.2.9.
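That "Out of memory: Kill process" line comes from the kernel's OOM killer; a quick way to check whether a parity shutdown was actually a kernel kill rather than parity exiting on its own (commands assume a typical Linux box) is:

```sh
# Look for OOM-killer activity in the kernel log
dmesg -T | grep -iE "oom|out of memory"
# Same thing via the journal on systemd systems
journalctl -k | grep -iE "oom|out of memory"
```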
Just a +1 here, having the same issue running in AWS EC2 with the standard Ubuntu 18.04 AMI using 2.2.10. At some point it seems to basically break the entire networking of the machine until a restart. We have never had this problem before, and it seems all our parity nodes are exhibiting the same behavior.
I also have to add that this affected all of my Parity daemons running at the same time. This is on Ubuntu 18.04.1.
@Serpent6877 did your nodes happen to be doing the same activity, or were they on the same machine? I.e., did this happen simultaneously at the same block height, or did some spike in resource usage affect them all at the same time?
@dvdplm This last run is without any cache settings on any of the three Parity instances. But I was saying that I had only set a 10GB cache previously and it appeared to be using well over 20GB for each Parity. I am wondering if this isn't RPC related. I am averaging 36 RPC calls a second over a minute. Here are the RPC calls I am performing: eth_getWork
I could possibly run the nightly build on a single block. ETC might be best. It seems to die the quickest.
I see, that makes sense, thank you.
Excellent, that is useful. I'll try to reproduce it on my end.
@Serpent6877 Your config above says you have
Also, may I ask how you are submitting the HTTP requests to the parity node? Is it a single thread/process or several? From localhost or over the network? Finally, it would be useful to know if you have a rough idea of the mix of RPC calls: are some more common than others?
@dvdplm It seems to be consuming 1GB a day between the two Parity instances running. The ELLA Parity seems to have flatlined right now. I can set it to just eth. I did not realize that was for the RPC calls. I just looked and it appears I am also doing these additional calls: net_peerCount
This is a modified form of Sammy's Open Ethereum Pool found here: https://github.com/sammy007/open-ethereum-pool It is written in Go and does these calls: rpcResp, err := r.doPost(r.Url, "eth_getWork", []string{})
I changed my config to use "eth", "net". Will see how that goes. I will also try and get you more info on the RPC calls. A majority of it will be the eth_getWork/eth_submitWork commands.
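Not the pool's actual code, but a self-contained sketch of what such a doPost-style JSON-RPC call boils down to (the endpoint URL is a placeholder):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// rpcRequest is the JSON-RPC 2.0 envelope the pool sends for each call.
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
	ID      int           `json:"id"`
}

// doPost sends one JSON-RPC request over HTTP and decodes the reply.
func doPost(url, method string, params []interface{}) (map[string]interface{}, error) {
	body, err := json.Marshal(rpcRequest{JSONRPC: "2.0", Method: method, Params: params, ID: 1})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// eth_getWork takes no parameters; a pool polls this at a high rate.
	reply, err := doPost("http://127.0.0.1:8545", "eth_getWork", []interface{}{})
	if err != nil {
		fmt.Println("rpc error:", err)
		return
	}
	fmt.Println(reply["result"])
}
```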
@Serpent6877 any updates on your end? On my side I've tried to hit my node (compiled from current master) with a mix of your RPC calls at ~300 req/s. Memory consumption is very stable here, but I'm running macOS so it's not really indicative. FWIW we have a new version of
@dvdplm I have not made the changes to the apis as of yet. There was a small memory usage increase over the last several days, as seen here: But it has tapered off slightly today. I think I will need a few more days to see if it continues to increase or has finally leveled out. I will try and capture more info on the types of API calls being performed.
Ok, thanks. No rush ofc. These things are really tricky to pin down. We merged #11151 today, which contains some good changes to thread usage. When I load-tested that build today I saw better performance and much diminished variance in request times. But I could not reproduce the memory leak on my system with the current master, so I'm not sure if it fixes your issue. If you have the means to check out master, build parity and try it under your workload, that'd be ideal of course.
@dvdplm Sorry for the lack of updates. I have had a server fan issue that caused a reboot, and thus I'm waiting to collect more data. I do notice that it does seem to be more stable without the cache settings. I'll let it run for a few days, then restart with the cache to see if that makes a difference.
Excellent, thank you for the update.
This matches exactly what I'm seeing using Parity 2.5.9 on my NanoPC T4. Stable connection, good number of peers. After 8 hours, memory usage is through the roof, peer count is low and it continually loses sync. Restart and the problem goes away. I've deleted the cache setting as mentioned earlier - will monitor and let you know.
@Serpent6877 @sverzijl any news on your nodes?
@dvdplm Since removing the cache setting and restarting, things seem pretty stable to me. I have not had a chance to re-enable the cache and test that, mainly due to not wanting to lose any further miners. I am waiting for the next beta to test anything further.
@Serpent6877 ok, thank you for the report!
I'm syncing a fresh new node under the same conditions I was having issues with before. I have to do a non-warp full sync of my node to be able to test under the conditions I had before, but when that's finished, I'll confirm I'm running parity 2.5.10 (or later, with #11151), test again, and see if I run into the same issues. My guess is it will probably be a couple of weeks before I'm able to properly attempt to reproduce this, but I'll update when I do.
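(For context, a non-warp full sync here just means starting the node with warp sync disabled so every block is downloaded and verified; roughly something like the following, with the flags being illustrative rather than my exact command:)

```sh
parity --chain foundation --no-warp --jsonrpc-apis eth,net,web3
```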
@cheeseandcereal thank you for keeping us in the loop here and for your patience.
Yesterday, because of the upcoming Istanbul Ethereum hard fork (block 9069000, targeted for the 4th of December), I upgraded my production boxes from v2.2.9 to v2.6.5-beta. After several months of running the v2.6.x branch on my test node, I finally made the switch on prod after increasing RAM to at least 8GB. So far no issues. IO dropped on machines previously running with 4GB of RAM, so that's good. I even kept
Didn't see any signs of a memory leak for months.
So I upgraded to 2.6.6-beta from my 2.6.3-beta and the memory seems okay. However, a whole new issue has come about that is even worse: Parity simply freezes up completely. Stops logging. Unresponsive to RPC. Nada. I have to do a kill -9 just to stop it. Then when restarted it has to sync to catch up, so it was definitely in a completely locked state. I am at a loss on how to proceed since I need this for Istanbul.
I commented out the server_threads and processing_threads to test that out. Any help appreciated.
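For context, those two knobs control JSON-RPC threading and, as far as I know, live in the [rpc] section of the toml config; roughly this is what was toggled (the numbers are placeholders, not my actual values):

```toml
[rpc]
# Commented out while chasing the freeze described above
#server_threads = 4
#processing_threads = 4
```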
Nothing obvious comes to mind. A few things to try:
Any news from you @Serpent6877? I just had the exact same thing happen to me with v2.6.6: At
Had to kill parity. After restarting v2.6.6, parity did take some time to sync block #9123828 (see below) but then went on. Just now I upgraded to 2.6.7 and commented out my log
My archive node got stuck syncing the same block on v2.6.6, which is a problem I haven't had for many versions.
Sorry for being unclear. Here are the events:
Note how it did nothing between 3 AM and 3 PM. It then continued to stall on the same block until I manually restarted it, and then it would sync.
Ok, not the same block number but likely the same issue.
@c0deright @dvdplm I actually just had a lockup last night. I have just been testing different config changes to see what is causing it. So far no luck. Last block before lock was 9123621.
Here is my current config during the lockup:
Parity 2.2.6-beta:
Just had this issue, which then put Parity into a loop trying to sync block #9189356. Just upgraded to 2.6.8-beta for the hard fork.
I've been testing different toml configurations, same problem. Buff/cache is slowly eating all the memory space after each parity restart, since version 2.6.8-beta.
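For what it's worth, it can help to separate the kernel's reclaimable page cache from parity's own resident memory when watching this (a quick check, assuming a typical Linux setup):

```sh
free -h                            # the "buff/cache" column is reclaimable kernel page cache
ps -o pid,rss,vsz,comm -C parity   # RSS of the parity process itself is the more telling number
```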
Hi, I think the same problem remains in v2.7.2-stable... I'm running a PoA chain on an AWS server with 4GB RAM and some validators crash repeatedly. I'm running it with docker-compose, and after crashing I can see it exit with code 137 (Out Of Memory). I also checked the kernel logs and here is what I see:
Some comments above think it is because of RPC calls, and I support that... I'm going to explain why: I've been experiencing these crashes since November (versions 2.5.x, 2.6.x and 2.7.2). In November I was testing a block explorer which builds a database on its own (with constant RPC calls) and I forgot I didn't stop it! After some months breaking my head trying to figure out why my validators were crashing, I read this issue and remembered this test block explorer... I'll comment if they stop crashing, but I'm almost sure this was the reason, because my validators were crashing even when no real users were using the blockchain... I don't know if this issue is still under review or not...
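As a side note for anyone else on docker-compose: exit code 137 together with an OOM kill can be confirmed straight from Docker (the container name below is a placeholder):

```sh
docker inspect -f 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' my_validator
docker stats --no-stream   # current memory usage per running container
```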
Closing the issue due to its stale state |
The logs before parity shuts down
This happens randomly.