Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e3: amoy: mem leak when No block headers to write in this log period #10734

Closed
AskAlexSharov opened this issue Jun 13, 2024 · 7 comments
Closed
Assignees
Labels

Comments

@AskAlexSharov
Copy link
Collaborator

AskAlexSharov commented Jun 13, 2024

[INFO] [06-12|05:04:54.090] [2/15 Headers] No block headers to write in this log period number=8140100 blk/second=0.000 alloc=110.0GB sys=124.1GB invalidHeaders=0 rejectedBadHeaders=1844
[INFO] [06-12|05:04:54.090] Req/resp stats                           req=4331 reqMin=8139909 reqMax=8173939 skel=29 skelMin=8140099 skelMax=8177155 resp=11908 respMin=8139910 respMax=8173942 dups=1398907 alloc=110.0GB sys=124.1GB

alloc=110.0GB

last logs:

[DBUG] [06-12|05:05:12.861] [2/15 Headers] Requested header          from=8140291 length=192
[WARN] [06-12|05:05:12.864] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[WARN] [06-12|05:05:12.869] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[WARN] [06-12|05:05:12.870] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[WARN] [06-12|05:05:12.873] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[WARN] [06-12|05:05:12.874] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[WARN] [06-12|05:05:12.901] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[WARN] [06-12|05:05:12.907] [downloader] Rejected header marked as bad hash=0x0134869cbf3e179d6d3503d91fe5f76c3b298a4d82ae1a591ca8f525240228d7 height=8140101
[DBUG] [06-12|05:05:12.915] [2/15 Headers] Requested header          from=8140292 length=192
[DBUG] [06-12|05:05:12.918] slow peer or too many requests, dropping its old requests name=erigon/v2.59.3-088fd...
[DBUG] [06-12|05:05:12.974] [2/15 Headers] Requested header          from=8155538 length=192
[DBUG] [06-12|05:05:13.566] [2/15 Headers] Requested skeleton        from=8140099 length=192
[DBUG] [06-12|05:05:14.278] [bor.heimdall] Got new milestone         start=8173890 end=8173912
[DBUG] [06-12|05:05:14.544] [bor] current head block behind incoming block=milestone head=8140100 end block=8173912
[DBUG] [06-12|05:05:18.061] [2/15 Headers] Requested skeleton        from=8140099 length=192
[DBUG] [06-12|05:05:22.150] [2/15 Headers] Requested header          from=8140292 length=192
[WARN] [06-12|05:05:25.469] [bor.heimdall] an error while fetching   path=/milestone/lastNoAck queryParams= attempt=1 err="Get \"https://heimdall-api-amoy.polygon.technology/milestone/lastNoAck\": context deadline exceeded"
[DBUG] [06-12|05:05:36.445] [bor.heimdall] request canceled          reason="context deadline exceeded" path=/milestone/lastNoAck queryParams= attempt=1
[WARN] [06-12|05:05:40.199] [bor.heimdall] Failed to fetch latest no-ack milestone err="context deadline exceeded"
[WARN] [06-12|05:05:40.551] [bor.heimdall] an error while fetching   path=/milestone/latest queryParams= attempt=1 err="Get \"https://heimdall-api-amoy.polygon.technology/milestone/latest\": context deadline exceeded"
[WARN] [06-12|05:05:43.345] [bor] unable to handle no-ack-milestone service err="failed to fetch latest milestone"
[DBUG] [06-12|05:05:50.722] [p2p] Discovery table                    protocol=67 version=v4 len=197 live=182 unsol=258 ips=284 db=0 reval=513 invalid IP in response record: LAN address from WAN host=4 RPC timeout=29 unsolicited reply=25 unknown node=68 expired=126
[DBUG] [06-12|05:05:49.965] [p2p] Server                             protocol=68 peers=7 trusted=0 inbound=0 write tcp 51.210.214.106:30303->84.247.174.229:57032: write: broken pipe=1 write tcp 51.210.214.106:30303->38.242.206.150:37512: write: broken pipe=1 write tcp 51.210.214.106:30303->5.182.33.115:47960: write: broken pipe=1 write tcp 51.210.214.106:30303->152.53.1.39:43570: write: broken pipe=1 write tcp 51.210.214.106:30303->198.24.170.214:49964: write: broken pipe=1 EOF=2580 write tcp 51.210.214.106:30303->74.201.193.158:31566: write: broken pipe=1 write tcp 51.210.214.106:30303->152.53.1.39:34850: write: broken pipe=1 write tcp 51.210.214.106:30303->72.251.237.4:50144: write: broken pipe=1 write tcp 51.210.214.106:30303->65.109.235.111:47292: write: broken pipe=1 write tcp 51.210.214.106:30303->74.201.193.229:50182: write: broken pipe=1 useless peer=1 write tcp 51.210.214.106:30303->74.201.193.244:43114: write: broken pipe=1 write tcp 51.210.214.106:30303->131.153.169.78:49396: write: broken pipe=1 write tcp 51.210.214.106:30303->143.198.84.113:48460: write: broken pipe=1 write tcp 51.210.214.106:30303->136.243.102.211:53328: write: broken pipe=1 write tcp 51.210.214.106:30303->194.233.75.136:41556: write: broken pipe=1 write tcp 51.210.214.106:30303->65.21.57.251:45136: write: broken pipe=1 write tcp 51.210.214.106:30303->51.138.200.10:3137: write: broken pipe=1 write tcp 51.210.214.106:30303->54.214.33.132:9407: write: broken pipe=1 write tcp 51.210.214.106:30303->122.202.14.37:4302: write: broken pipe=1 write tcp 51.210.214.106:30303->120.133.26.20:24570: write: broken pipe=1 write tcp 51.210.214.106:30303->144.91.87.17:49222: write: broken pipe=1 ecies: invalid message=54 write tcp 51.210.214.106:30303->194.163.163.11:55204: write: broken pipe=1 write tcp 51.210.214.106:30303->142.215.209.8:31410: write: broken pipe=1 write tcp 51.210.214.106:30303->31.220.99.133:56978: write: broken pipe=1 write tcp 51.210.214.106:30303->194.233.75.139:55926: write: broken pipe=1 write tcp 51.210.214.106:30303->194.163.160.188:59010: write: broken pipe=1 write tcp 51.210.214.106:30303->72.251.237.13:42320: write: broken pipe=1 write tcp 51.210.214.106:30303->72.251.237.18:40936: write: broken pipe=1 write tcp 51.210.214.106:30303->128.199.42.64:43276: write: broken pipe=1 write tcp 51.210.214.106:30303->91.210.101.51:29316: write: broken pipe=1 too many peers=610 closed by remote=21271 i/o timeout=2026
[DBUG] [06-12|05:05:50.722] [p2p] Dial scheduler                     protocol=67 peers=0/33 tried=31461 static=0 connect: connection refused=70 i/o timeout=338 connect: no route to host=3 connect: connection reset by peer=21
[DBUG] [06-12|05:05:51.602] [p2p] Discovery table                    protocol=68 version=v4 len=160 live=157 unsol=79 ips=233 db=0 reval=511 RPC timeout=12 expired=120 unsolicited reply=12 unknown node=23
[DBUG] [06-12|05:05:51.631] [p2p] Server                             protocol=67 peers=0 trusted=0 inbound=0 i/o timeout=1886 write tcp 51.210.214.106:30304->95.216.159.246:39678: write: broken pipe=1 write tcp 51.210.214.106:30304->194.163.160.191:43482: write: broken pipe=1 write tcp 51.210.214.106:30304->117.27.228.180:50038: write: broken pipe=1 closed by remote=23794 ecies: invalid message=46 write tcp 51.210.214.106:30304->138.201.249.97:52522: write: broken pipe=1 write tcp 51.210.214.106:30304->46.4.40.44:57286: write: broken pipe=1 write tcp 51.210.214.106:30304->62.171.163.167:41694: write: broken pipe=1 EOF=3509 write tcp 51.210.214.106:30304->65.21.244.80:32802: write: broken pipe=1 write tcp 51.210.214.106:30304->23.88.99.44:57748: write: broken pipe=1 write tcp 51.210.214.106:30304->38.242.234.2:53274: write: broken pipe=1 write tcp 51.210.214.106:30304->209.126.84.5:55992: write: broken pipe=1 write tcp 51.210.214.106:30304->65.109.224.35:42574: write: broken pipe=1 too many peers=385 already connected=31 write tcp 51.210.214.106:30304->173.212.236.16:60638: write: broken pipe=1 write tcp 51.210.214.106:30304->124.220.166.178:63268: write: broken pipe=1 write tcp 51.210.214.106:30304->207.246.127.166:38518: write: broken pipe=1
[DBUG] [06-12|05:05:51.548] [p2p] Dial scheduler                     protocol=68 peers=7/33 tried=29009 static=0 i/o timeout=844 connect: connection refused=83 connect: no route to host=7 connect: connection reset by peer=14
[WARN] [06-12|05:05:55.745] [bor.heimdall] an error while fetching   path=/milestone/lastNoAck queryParams= attempt=1 err="Get \"https://heimdall-api-amoy.polygon.technology/milestone/lastNoAck\": context deadline exceeded"
[WARN] [06-12|05:05:59.577] [bor.heimdall] an error while fetching   path=/milestone/latest queryParams= attempt=2 err="Get \"https://heimdall-api-amoy.polygon.technology/milestone/latest\": context deadline exceeded"
[DBUG] [06-12|05:06:42.340] [bor.heimdall] request canceled          reason="context deadline exceeded" path=/milestone/lastNoAck queryParams= attempt=1
[WARN] [06-12|05:06:47.950] [bor.heimdall] an error while fetching   path=/checkpoints/latest queryParams= attempt=1 err="Get \"https://heimdall-api-amoy.polygon.technology/checkpoints/latest\": context deadline exceeded"
[WARN] [06-12|05:06:50.191] [bor.heimdall] Failed to fetch latest no-ack milestone err="context deadline exceeded"
[DBUG] [06-12|05:06:50.765] [p2p] Server                             protocol=68 peers=7 trusted=0 inbound=0 i/o timeout=2031 write tcp 51.210.214.106:30303->128.199.42.64:43276: write: broken pipe=1 write tcp 51.210.214.106:30303->91.210.101.51:29316: write: broken pipe=1 too many peers=610 closed by remote=21271 write tcp 51.210.214.106:30303->152.53.1.39:34850: write: broken pipe=1 write tcp 51.210.214.106:30303->84.247.174.229:57032: write: broken pipe=1 write tcp 51.210.214.106:30303->38.242.206.150:37512: write: broken pipe=1 write tcp 51.210.214.106:30303->5.182.33.115:47960: write: broken pipe=1 write tcp 51.210.214.106:30303->152.53.1.39:43570: write: broken pipe=1 write tcp 51.210.214.106:30303->198.24.170.214:49964: write: broken pipe=1 EOF=2580 write tcp 51.210.214.106:30303->74.201.193.158:31566: write: broken pipe=1 write tcp 51.210.214.106:30303->72.251.237.4:50144: write: broken pipe=1 write tcp 51.210.214.106:30303->65.109.235.111:47292: write: broken pipe=1 write tcp 51.210.214.106:30303->131.153.169.78:49396: write: broken pipe=1 write tcp 51.210.214.106:30303->74.201.193.229:50182: write: broken pipe=1 useless peer=1 write tcp 51.210.214.106:30303->74.201.193.244:43114: write: broken pipe=1 write tcp 51.210.214.106:30303->65.21.57.251:45136: write: broken pipe=1 write tcp 51.210.214.106:30303->143.198.84.113:48460: write: broken pipe=1 write tcp 51.210.214.106:30303->136.243.102.211:53328: write: broken pipe=1 write tcp 51.210.214.106:30303->194.233.75.136:41556: write: broken pipe=1 write tcp 51.210.214.106:30303->51.138.200.10:3137: write: broken pipe=1 write tcp 51.210.214.106:30303->142.215.209.8:31410: write: broken pipe=1 write tcp 51.210.214.106:30303->54.214.33.132:9407: write: broken pipe=1 write tcp 51.210.214.106:30303->122.202.14.37:4302: write: broken pipe=1 write tcp 51.210.214.106:30303->120.133.26.20:24570: write: broken pipe=1 write tcp 51.210.214.106:30303->144.91.87.17:49222: write: broken pipe=1 ecies: invalid message=55 write tcp 51.210.214.106:30303->194.163.163.11:55204: write: broken pipe=1 write tcp 51.210.214.106:30303->194.163.160.188:59010: write: broken pipe=1 write tcp 51.210.214.106:30303->31.220.99.133:56978: write: broken pipe=1 write tcp 51.210.214.106:30303->194.233.75.139:55926: write: broken pipe=1 write tcp 51.210.214.106:30303->72.251.237.13:42320: write: broken pipe=1 write tcp 51.210.214.106:30303->72.251.237.18:40936: write: broken pipe=1
[DBUG] [06-12|05:06:52.317] [p2p] Server                             protocol=67 peers=0 trusted=0 inbound=0 i/o timeout=1889 write tcp 51.210.214.106:30304->95.216.159.246:39678: write: broken pipe=1 write tcp 51.210.214.106:30304->194.163.160.191:43482: write: broken pipe=1 write tcp 51.210.214.106:30304->117.27.228.180:50038: write: broken pipe=1 closed by remote=23794 ecies: invalid message=46 write tcp 51.210.214.106:30304->138.201.249.97:52522: write: broken pipe=1 write tcp 51.210.214.106:30304->46.4.40.44:57286: write: broken pipe=1 write tcp 51.210.214.106:30304->62.171.163.167:41694: write: broken pipe=1 EOF=3509 write tcp 51.210.214.106:30304->65.21.244.80:32802: write: broken pipe=1 write tcp 51.210.214.106:30304->23.88.99.44:57748: write: broken pipe=1 write tcp 51.210.214.106:30304->38.242.234.2:53274: write: broken pipe=1 write tcp 51.210.214.106:30304->209.126.84.5:55992: write: broken pipe=1 write tcp 51.210.214.106:30304->65.109.224.35:42574: write: broken pipe=1 too many peers=385 already connected=31 write tcp 51.210.214.106:30304->173.212.236.16:60638: write: broken pipe=1 write tcp 51.210.214.106:30304->124.220.166.178:63268: write: broken pipe=1 write tcp 51.210.214.106:30304->207.246.127.166:38518: write: broken pipe=1
[DBUG] [06-12|05:06:52.234] [p2p] Dial scheduler                     protocol=68 peers=7/33 tried=29009 static=0 i/o timeout=844 connect: connection refused=83 connect: no route to host=7 connect: connection reset by peer=14
[DBUG] [06-12|05:06:53.776] [p2p] Dial scheduler                     protocol=67 peers=0/33 tried=31461 static=0 connect: connection refused=70 i/o timeout=338 connect: no route to host=3 connect: connection reset by peer=21
[INFO] [06-12|05:06:53.786] [p2p] GoodPeers                          eth68=7
[DBUG] [06-12|05:06:54.324] [p2p] Discovery table                    protocol=67 version=v4 len=197 live=181 unsol=258 ips=282 db=0 reval=515 invalid IP in response record: LAN address from WAN host=4 RPC timeout=31 unsolicited reply=25 unknown node=68 expired=136
[WARN] [06-12|05:07:12.086] [bor.heimdall] an error while fetching   path=/milestone/latest queryParams= attempt=3 err="Get \"https://heimdall-api-amoy.polygon.technology/milestone/latest\": context deadline exceeded"
@AskAlexSharov AskAlexSharov added this to the 3.0.0-alpha1 milestone Jun 13, 2024
@AskAlexSharov
Copy link
Collaborator Author

Screenshot 2024-06-13 at 17 10 48

@VBulikov VBulikov added the imp2 Medium importance label Jun 28, 2024
@AskAlexSharov AskAlexSharov added polygon imp1 High importance and removed imp2 Medium importance labels Jul 24, 2024
@AskAlexSharov AskAlexSharov removed this from the 3.0.0-beta1 milestone Aug 1, 2024
taratorio added a commit that referenced this issue Aug 9, 2024
relates to:
#10734
#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable
wait until we reach 45% or more alloc in stage_headers when
"noProgressCounter >= 5" or "Rejected header marked as bad"
taratorio added a commit that referenced this issue Aug 9, 2024
relates to:
#10734
#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable
wait until we reach 45% or more alloc in stage_headers when
"noProgressCounter >= 5" or "Rejected header marked as bad"
@taratorio
Copy link
Member

taratorio commented Aug 9, 2024

@AskAlexSharov the screenshot only shows 4 GB - I imagine it was not taken at the right time?

I added #11549 for that.
If you have some time can you restart your Erigon instance with latest code on main and with export SAVE_HEAP_PROFILE=true env var please?
I'll run an Amoy instance now too and let it run in the hopes I can reproduce.

taratorio added a commit that referenced this issue Aug 12, 2024
…11551)

cherry-pick 2a98f6a for E2

relates to:
#10734
#11387

restart Erigon with `SAVE_HEAP_PROFILE = true` env variable wait until
we reach 45% or more alloc in stage_headers when "noProgressCounter >=
5" or "Rejected header marked as bad"
@AskAlexSharov
Copy link
Collaborator Author

Saw it on Amoy snapshotter today. here is what in logs if grep -v this warning:

[WARN] [08-24|01:07:09.774] Sanitizing invalid bor miner gas price   provided=1000000000 updated=25000000000
[WARN] [08-24|01:07:09.774] Sanitizing invalid bor min fee cap       provided=1 updated=25000000000.000
[WARN] [08-24|01:09:08.880] [7/9 Execution] Execution failed         block=10954598 txNum=36126721 hash=0xbec0aa1033826a45a1e5faba3319445f2ac85849d131f4c0a169c0dfa9baabb6 err="invalid block, txnIdx=1, receiptHash mismatch: 2e8eabdeaf20429d2175676ba2e6400f19a5701a0646c9b62422f4ebff09412b != 51478baf47ac21f86ec338e2e187b1b989787c19e0caea2469b964a54a48fbcf, headerNum=10954598, bec0aa1033826a45a1e5faba3319445f2ac85849d131f4c0a169c0dfa9baabb6"
[EROR] [08-24|01:09:08.908] Staged Sync                              err="bad block unwinding"

So, the reason is: unwind happened and after unwind - node can't function correctly. So, it bug of unwind.

Recent logs:

[WARN] [08-24|02:27:02.283] [downloader] InsertHeader: Rejected header marked as bad hash=0xbec0aa1033826a45a1e5faba3319445f2ac85849d131f4c0a169c0dfa9baabb6 height=10954598
[WARN] [08-24|02:27:02.428] [downloader] InsertHeader: Rejected header marked as bad hash=0xbec0aa1033826a45a1e5faba3319445f2ac85849d131f4c0a169c0dfa9baabb6 height=10954598
[INFO] [08-24|02:28:52.479] [p2p] GoodPeers                          eth68=7 eth66=1 eth67=4
[INFO] [08-24|02:32:47.710] [p2p] GoodPeers                          eth68=3 eth67=4 eth66=1
[INFO] [08-24|02:35:03.373] [p2p] GoodPeers                          eth68=2 eth67=2 eth66=1
[WARN] [08-24|02:37:14.898] [p2p] No GoodPeers
[INFO] [08-24|02:38:03.573] [2/9 Headers] No block headers to write in this log period number=10954597 blk/second=0.000 alloc=115.1GB sys=123.5GB invalidHeaders=0 rejectedBadHeaders=2546
[INFO] [08-24|02:38:49.178] [mem] memory stats                       Rss=124.0GB Size=0B Pss=124.0GB SharedClean=2.4MB SharedDirty=0B PrivateClean=24.0KB PrivateDirty=124.0GB Referenced=124.0GB Anonymous=124.0GB Swap=0B alloc=115.1GB sys=123.5GB
[INFO] [08-24|02:39:39.324] Req/resp stats                           req=5311 reqMin=10954406 reqMax=11094481 skel=25 skelMin=10954596 skelMax=10991652 resp=17034 respMin=10954407 respMax=11094490 dups=2210572 alloc=115.1GB sys=123.5GB
[WARN] [08-24|02:40:16.299] [p2p] No GoodPeers
[WARN] [08-24|02:43:14.321] [p2p] No GoodPeers

@taratorio
Copy link
Member

taratorio commented Aug 26, 2024

@AskAlexSharov are you running the snapshotter with SAVE_HEAP_PROFILE=true?
Do you see any logs [Experiment] heap profile threshold check and [Experiment] saving heap profile as near OOM?

@AskAlexSharov
Copy link
Collaborator Author

AskAlexSharov commented Aug 27, 2024

SAVE_HEAP_PROFILE=true - no. but i can add.

@taratorio
Copy link
Member

yes please! it will be good to get the heap profile at the right time

AskAlexSharov pushed a commit that referenced this issue Oct 22, 2024
…12404)

relates to #11387,
#11473,
#10734

tried to simulate the OOM using
#11799

What I found was infinitely growing alloc of headers when receiving new
header messages in sentry's `blockHeaders66` handler (check screenshot
below).

It looks like this is happening because in the case of a bad child
header: we delete it from the `links` map, however its parent link still
holds a reference to it so the deleted link & header never get gc-ed.
Furthermore if new similar bad hashes arrive after deletion they get
appended to their parent header's link and the children of that link can
grow indefinitely
([here](https://github.com/erigontech/erigon/blob/main/turbo/stages/headerdownload/header_algos.go#L1085-L1086)).
Ie confirmed with debug logs (note link at 13450415 has 140124
children):
```
DBUG[10-21|18:18:05.003] [downloader] InsertHeader: Rejected header parent info hash=0xb488d67deaf4103880fa972fd72a7a9be552e3bc653f124f1ad9cb45f36bcd07 height=13450415 children=140124
```

<br/>

The solution for this is to remove the bad link from its parent child
list
[here](https://github.com/erigontech/erigon/blob/main/turbo/stages/headerdownload/header_algos.go#L544)
so that 1) it gets gc-ed and 2) the children list does not grow
indefinitely.

<br/>


![oom-heap-profile2](https://github.com/user-attachments/assets/518fa658-c199-48b6-aa2d-110673264144)
@taratorio
Copy link
Member

please re-open if you see again, should be fixed by #12404

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants