
polkadot 0.9.40-a2b62fb872b Validator Issues / ram and peers issues / lost sync / chilled #679

Closed
SimonKraus opened this issue Mar 25, 2023 · 63 comments

Comments

@SimonKraus

The latest medium-priority release caused problems on many validator instances.
A lot of validators have been chilled since updating.

I'll try to sum up the problems reported by various operators, for visibility and to get decision-making going.

  • Peering gone wild
    Even waiting nodes that don't participate in consensus see a rapid increase in peers to 700+, ignoring the --in-peers and --out-peers settings.

  • Number of TCP connections
    (TCP connections graph)

  • RAM consumption
    Consumed RAM clearly increased with this release, and spikes in RAM usage caused the node to crash.
    (RAM usage graph)

  • Stops syncing
    There are reports of nodes that simply stop syncing the chain without any identifiable cause.
    Some operators noticed problems on one machine but had none on a second, identically built machine.

  • UnboundedChannelPersistentlyLarge
    message = Channel mpsc_import_notification_stream on node localhost:9616 contains more than 200 items for more than 5 minutes. Node might be frozen.

  • Mar 24 06:13:04 worker02 polkadot[429783]: 2023-03-24 06:13:04 (offchain call) Error submitting a transaction to the pool: Transaction pool error: Too low priority (18446744073709551615 > 18446744073709551615)

  • Mar 24 06:13:04 worker02 polkadot[429783]: 2023-03-24 06:13:04 Ran out of free WASM instances

  • Mar 24 06:13:09 worker02 polkadot[429783]: 2023-03-24 06:13:09 Got a bad assignment from peer hash=0x65d2558f1c946ba12867777cfdcbb0182e944e511c2d3a0b3de9b1fa9d79b6b8 peer_id=PeerId("12D3KooWBy4Mhc9j4GJxZfKjettapQbsGr8z5wVNHXPgrYnnU4RH") error=Unknown block: 0x65d2558f1c946ba12867777cfdcbb0182e944e511c2d3a0b3de9b1fa9d79b6b8

Many of the 1KV validators have downgraded because of these problems, which made them ineligible for program participation.

@dcolley
Contributor

dcolley commented Mar 25, 2023

I have seen this happen a number of times since upgrading. The block sync rate just starts lagging.

  • dot-dev machine is running release 0.9.40 from ubuntu repo

@riusricardo

@eskimor @bkchr could you please take a look whenever you have some time?

@bkchr
Member

bkchr commented Mar 25, 2023

Peering gone wild
Even waiting nodes that don't participate in consensus see a rapid increase of peers to 700+ ignoring the --in-peers and --out-peers

Number of TCP connections

The number of TCP connections is not related to in-peers or out-peers. I assume the graph shown is from a validator that just entered the active set? Every validator is connected to every other validator.

There was also a bug with the number of reported peers in the logs. This was already fixed.

@bkchr
Member

bkchr commented Mar 25, 2023

UnboundedChannelPersistentlyLarge
message = Channel mpsc_import_notification_stream on node localhost:9616 contains more than 200 items for more than 5 minutes. Node might be frozen.

Please provide more logs around this issue.

Stops syncing
There are reports of nodes just stopping to sync the chain without any identifiable problem.
Some operators noticed problems on one machine but had none on the second identical build.

We need at least sync=debug logs to get some insight.

@bkchr
Member

bkchr commented Mar 25, 2023

Mar 24 06:13:04 worker02 polkadot[429783]: 2023-03-24 06:13:04 (offchain call) Error submitting a transaction to the pool: Transaction pool error: Too low priority (18446744073709551615 > 18446744073709551615)

This comes from the offchain worker, maybe the staking transaction with the result of the next election. It generally isn't an issue.

Mar 24 06:13:04 worker02 polkadot[429783]: 2023-03-24 06:13:04 Ran out of free WASM instances

Requires more logs, but could indicate that the node is too slow.

@bkchr
Member

bkchr commented Mar 25, 2023

Ram consumption
Consumed Ram clearly increased with this release and spikes in Ram usage caused the node to crash

Do we have logs for this node?

@altonen
Contributor

altonen commented Mar 26, 2023

@dcolley could you provide logs with -lsync=trace,sub-libp2p=trace?

@bkchr bkchr removed the C7-high label Mar 26, 2023
@bLd75

bLd75 commented Mar 26, 2023

Here are some trace logs of a "low specs" server running the latest version.
Starting node: https://gist.github.com/bLd75/76956cad5397a0fc541ec2ba780c4c68

At this point I wasn't able to reproduce the issue where the node stops syncing blocks with no error message; I'll keep it running and check.

@altonen
Contributor

altonen commented Mar 26, 2023

Are you able to confirm this reported behavior:

Some operators noticed problems on one machine but had none on the second identical build.

The logs you provided seem to indicate a lot of connectivity issues, which would then translate into syncing issues, as there are no peers to provide blocks. One issue in particular stands out as interesting:

2023-03-26T15:01:47+04:00 2023-03-26 11:01:47.265 TRACE tokio-runtime-worker sub-libp2p: Libp2p => Failed to reach PeerId("12D3KooWH3VwWkeFh7PwuoetACmpnpf7VE7qrXT8G2TXxZPntMuG"): Dial error: Unexpected peer ID 12D3KooWCvZamGbdLjPze7ew4frRFNCaAZVc5AZWWNKsNPmdrtcd at Dialer { address: "/ip4/195.154.69.235/tcp/30333/p2p/12D3KooWH3VwWkeFh7PwuoetACmpnpf7VE7qrXT8G2TXxZPntMuG", role_override: Dialer }.

which originates from here.

There was another syncing-related issue resolved the other day that was related to NAT configuration: paritytech/polkadot#6696 (comment). Could you, and anyone else noticing these syncing problems, first verify that the issue is not caused by connectivity problems pertaining to NAT configuration?

@paradox-tt
Contributor

I eventually got an error on one of my nodes.

Mar 26 00:01:35 ns3214960 polkadot[476585]: 2023-03-26 00:01:35 💤 Idle (40 peers), best: #17199078 (0x8a10…da7f), finalized #17199076 (0x80e1…16a3), ⬇ 2.7MiB/s ⬆ 2.5MiB/s
Mar 26 00:01:36 ns3214960 polkadot[476585]: 2023-03-26 00:01:36 ✨ Imported #17199079 (0x2e00…968e)
Mar 26 00:01:36 ns3214960 polkadot[476585]: 2023-03-26 00:01:36 ✨ Imported #17199079 (0xcfb3…9627)
Mar 26 00:01:36 ns3214960 polkadot[476585]: 2023-03-26 00:01:36 cannot spawn a worker: Os { code: 2, kind: NotFound, message: "No such file or directory" } debug_id=prepare
Mar 26 00:01:36 ns3214960 polkadot[476585]: 2023-03-26 00:01:36 failed to spawn a prepare worker: ProcessSpawn
Mar 26 00:01:41 ns3214960 systemd[1]: kusama2.service: Main process exited, code=killed, status=9/KILL
Mar 26 00:01:41 ns3214960 systemd[1]: kusama2.service: Failed with result 'signal'.
Mar 26 00:02:41 ns3214960 systemd[1]: kusama2.service: Scheduled restart job, restart counter is at 1.

@xubincc

xubincc commented Mar 26, 2023

My validator has stopped syncing, and the following log appears:

Mar 26 13:29:00 kusama-3-new bash[36605]: 2023-03-26 13:29:00 The number of unprocessed messages in channel mpsc_network_worker exceeded 100000.
Mar 26 13:29:00 kusama-3-new bash[36605]: The channel was created at:
Mar 26 13:29:00 kusama-3-new bash[36605]: 0: sc_utils::mpsc::tracing_unbounded
Mar 26 13:29:00 kusama-3-new bash[36605]: 1: sc_network::service::NetworkWorker<B,H>::new
Mar 26 13:29:00 kusama-3-new bash[36605]: 2: sc_service::builder::build_network
Mar 26 13:29:00 kusama-3-new bash[36605]: 3: polkadot_service::new_full
Mar 26 13:29:00 kusama-3-new bash[36605]: 4: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 5: sc_cli::runner::Runner::run_node_until_exit
Mar 26 13:29:00 kusama-3-new bash[36605]: 6: polkadot_cli::command::run
Mar 26 13:29:00 kusama-3-new bash[36605]: 7: polkadot::main
Mar 26 13:29:00 kusama-3-new bash[36605]: 8: std::sys_common::backtrace::__rust_begin_short_backtrace
Mar 26 13:29:00 kusama-3-new bash[36605]: 9: main
Mar 26 13:29:00 kusama-3-new bash[36605]: 10: __libc_start_main
Mar 26 13:29:00 kusama-3-new bash[36605]: 11: _start
Mar 26 13:29:00 kusama-3-new bash[36605]: Last message was sent from:
Mar 26 13:29:00 kusama-3-new bash[36605]: 0: sc_utils::mpsc::TracingUnboundedSender::unbounded_send
Mar 26 13:29:00 kusama-3-new bash[36605]: 1: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 2: polkadot_network_bridge::tx::handle_incoming_subsystem_communication::{{closure}}
Mar 26 13:29:00 kusama-3-new bash[36605]: 3: <futures_util::future::try_future::MapErr<Fut,F> as core::future::future::Future>::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 4: <core::future::from_generator::GenFuture as core::future::future::Future>::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 5: <tracing_futures::Instrumented as core::future::future::Future>::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 6: tokio::runtime::task::raw::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
Mar 26 13:29:00 kusama-3-new bash[36605]: 8: tokio::runtime::scheduler::multi_thread::worker::run
Mar 26 13:29:00 kusama-3-new bash[36605]: 9: tokio::runtime::task::raw::poll
Mar 26 13:29:00 kusama-3-new bash[36605]: 10: std::sys_common::backtrace::__rust_begin_short_backtrace
Mar 26 13:29:00 kusama-3-new bash[36605]: 11: core::ops::function::FnOnce::call_once{{vtable.shim}}
Mar 26 13:29:00 kusama-3-new bash[36605]: 12: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
Mar 26 13:29:00 kusama-3-new bash[36605]: at ./rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/alloc/src/boxed.rs:1987:9
Mar 26 13:29:00 kusama-3-new bash[36605]: 13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once
Mar 26 13:29:00 kusama-3-new bash[36605]: at ./rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/alloc/src/boxed.rs:1987:9
Mar 26 13:29:00 kusama-3-new bash[36605]: 14: std::sys::unix::thread::Thread::new::thread_start
Mar 26 13:29:00 kusama-3-new bash[36605]: at ./rustc/90743e7298aca107ddaa0c202a4d3604e29bfeb6/library/std/src/sys/unix/thread.rs:108:17
Mar 26 13:29:00 kusama-3-new bash[36605]: 15: start_thread
Mar 26 13:29:00 kusama-3-new bash[36605]: 16: clone
Mar 26 13:29:00 kusama-3-new bash[36605]:
Mar 26 13:29:02 kusama-3-new bash[36605]: 2023-03-26 13:29:02 💤 Idle (191 peers), best: #17207065 (0x94bd…a13e), finalized #17207063 (0x6d36…a06c), ⬇ 178.4kiB/s ⬆ 371.0kiB/s
Mar 26 13:29:17 kusama-3-new bash[36605]: 2023-03-26 13:29:17 💤 Idle (191 peers), best: #17207065 (0x94bd…a13e), finalized #17207063 (0x6d36…a06c), ⬇ 175.4kiB/s ⬆ 376.1kiB/s
Mar 26 13:29:22 kusama-3-new bash[36605]: 2023-03-26 13:29:22 💤 Idle (191 peers), best: #17207065 (0x94bd…a13e), finalized #17207063 (0x6d36…a06c), ⬇ 246.1kiB/s ⬆ 360.3kiB/s
Mar 26 13:29:27 kusama-3-new bash[36605]: 2023-03-26 13:29:27 💤 Idle (191 peers), best: #17207065 (0x94bd…a13e), finalized #17207063 (0x6d36…a06c), ⬇ 398.5kiB/s ⬆ 351.5kiB/s
Mar 26 13:29:29 kusama-3-new bash[36605]: 2023-03-26 13:29:29 New inbound substream to PeerId("12D3KooWKMVJza5iXuern1mRNS7csin4dcAs59YHGEuU6P6RGBeZ") exceeds inbound substream limit. Removed older substream waiting to be reused.
Mar 26 13:29:29 kusama-3-new bash[36605]: 2023-03-26 13:29:29 New inbound substream to PeerId("12D3KooWKMVJza5iXuern1mRNS7csin4dcAs59YHGEuU6P6RGBeZ") exceeds inbound substream limit. Removed older substream waiting to be reused.
Mar 26 13:29:29 kusama-3-new bash[36605]: 2023-03-26 13:29:29 New inbound substream to PeerId("12D3KooWKMVJza5iXuern1mRNS7csin4dcAs59YHGEuU6P6RGBeZ") exceeds inbound substream limit. Removed older substream waiting to be reused.
Mar 26 13:29:29 kusama-3-new bash[36605]: 2023-03-26 13:29:29 New inbound substream to PeerId("12D3KooWKMVJza5iXuern1mRNS7csin4dcAs59YHGEuU6P6RGBeZ") exceeds inbound substream limit. Removed older substream waiting to be reused.
Mar 26 13:29:29 kusama-3-new bash[36605]: 2023-03-26 13:29:29 New inbound substream to PeerId("12D3KooWKMVJza5iXuern1mRNS7csin4dcAs59YHGEuU6P6RGBeZ") exceeds inbound substream limit. Removed older substream waiting to be reused.

@altonen
Contributor

altonen commented Mar 26, 2023

@xubincc what kind of hardware are you running the validator on?

Before this release, syncing was part of sc-network, meaning it didn't use the mpsc_network_worker channel for its messages. Now that it sits above sc-network and interacts with NetworkWorker through NetworkService, it also uses mpsc_network_worker, which in your case becomes overloaded with messages.
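
To make that failure mode concrete, here is a minimal, purely illustrative Rust sketch (standard library only, not the actual sc-network code; the channel, rates and threshold are made up): several producers feed one unbounded queue whose single consumer is slower than they are, so the backlog only grows until a tracing wrapper would emit the "unprocessed messages exceeded" warning.

// Illustrative only: stand-in for many protocols feeding the single
// NetworkWorker event loop through an unbounded channel.
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<u64>();

    // Producers: syncing and other protocols submitting work to the worker.
    for p in 0..4u64 {
        let tx = tx.clone();
        thread::spawn(move || {
            for n in 0.. {
                if tx.send(p * 1_000_000 + n).is_err() {
                    break;
                }
                thread::sleep(Duration::from_micros(50));
            }
        });
    }
    drop(tx);

    // Single, slower consumer: the backlog (sent minus processed) only grows.
    for processed in 1..=10_000u64 {
        let _msg = rx.recv().unwrap();
        thread::sleep(Duration::from_millis(1));
        if processed % 1_000 == 0 {
            // A tracing channel would emit the "unprocessed messages exceeded"
            // warning here once the backlog crosses its configured threshold.
            println!("processed {processed} messages; producers are far ahead");
        }
    }
    // Producers keep running; in the real node the warning would have fired
    // long before this point and the node may appear frozen.
}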

@xubincc

xubincc commented Mar 26, 2023

@altonen 8 vCPU, 16 GB RAM and an NVMe SSD

@stakeworld
Contributor

On the night of 24 March, right after the update, I had two nodes chilled. They are on dedicated hardware (Ryzen 3600, 64 GB, NVMe drives) that has been running stable (A+ one-t performance) for a long time without problems.

What seemed to happen: it was importing blocks, then at 2:41 it stopped importing and was losing its peers (peer count dropping), until after a few minutes it picked up the import again. This kept happening on and off, and because it happened at the end of a session, I was chilled.

https://logpaste.com/ZlkEQCRj

Later the process crashed with "The number of unprocessed messages in channel mpsc_network_worker exceeded 100000".

https://logpaste.com/ep6JeECC

After this, the cycle of stopped and restored importing continued, with a lot of "(offchain call) Error submitting a transaction to the pool: Transaction pool error: Too low priority (18446744073709551615 > 18446744073709551615)" messages.

https://logpaste.com/UqBKaeZ8

Grafana gave a lot of "lagging block" errors. In the morning I reverted back and all problems went away.

Later I put two nodes on the new version again, and for now they seem to be holding up without the earlier pattern. On Element I see some reports of this kind of behaviour and some more chilled validators. Maybe it's some interaction with the mpsc_network_worker that gets overloaded, or something like that?

@dcolley
Contributor

dcolley commented Mar 26, 2023

I have 12th Gen Intel(R) Core(TM) i9-12900K, 128G RAM and NVME disks.

I'm running 2x Kusama and 1x Polkadot validators on this machine - it has been running like this for months.
One validator is in the active set right now, so I don't want to mess with it too much.

I ran polkadot benchmark machine and got this:

+----------+----------------+-------------+-------------+-------------------+
| Category | Function       | Score       | Minimum     | Result            |
+===========================================================================+
| CPU      | BLAKE2-256     | 1.29 GiBs   | 1.00 GiBs   | ✅ Pass (128.8 %) |
|----------+----------------+-------------+-------------+-------------------|
| CPU      | SR25519-Verify | 992.36 KiBs | 666.00 KiBs | ✅ Pass (149.0 %) |
|----------+----------------+-------------+-------------+-------------------|
| Memory   | Copy           | 6.96 GiBs   | 14.32 GiBs  | ❌ Fail ( 48.6 %) |
|----------+----------------+-------------+-------------+-------------------|
| Disk     | Seq Write      | 1.38 GiBs   | 450.00 MiBs | ✅ Pass (314.2 %) |
|----------+----------------+-------------+-------------+-------------------|
| Disk     | Rnd Write      | 660.11 MiBs | 200.00 MiBs | ✅ Pass (330.1 %) |
+----------+----------------+-------------+-------------+-------------------+

I suspect the RAM fail is because of the active validator

@LukeWheeldon

@dcolley What RAM do you have? DDR4? DDR5? 3200MHz? Single or dual channel?

@dcolley
Contributor

dcolley commented Mar 26, 2023

The RAM is DDR4 2133MHz.

Around 12:57 the block rate dropped; I don't see any issues reported in the logs.


@dcolley
Contributor

dcolley commented Mar 26, 2023

I have 4x of these:

Handle 0x003C, DMI type 17, 92 bytes
Memory Device
        Array Handle: 0x003B
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: Controller0-ChannelA-DIMM0
        Bank Locator: BANK 0
        Type: DDR4
        Type Detail: Synchronous
        Speed: 2133 MT/s
        Manufacturer: Corsair
        Serial Number: 00000000
        Asset Tag: 9876543210
        Part Number: CMK64GX4M2E3200C16  
        Rank: 2
        Configured Memory Speed: 2133 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.35 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: Not Specified
        Module Manufacturer ID: Bank 3, Hex 0x9E
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 32 GB
        Cache Size: None
        Logical Size: None

@LukeWheeldon

@dcolley As far as your RAM test is concerned, I would think the culprit is your RAM speed. You may want to switch to 3200MHz.

@dcolley
Contributor

dcolley commented Mar 26, 2023

So what has changed? This server has been running with the same spec for months.
If there is a new minimum RAM spec, many validators will have to upgrade.

@dcolley
Contributor

dcolley commented Mar 26, 2023

On another machine with exactly the same RAM, I get this result:

+----------+----------------+-------------+-------------+-------------------+
| Category | Function       | Score       | Minimum     | Result            |
+===========================================================================+
| CPU      | BLAKE2-256     | 1.72 GiBs   | 1.00 GiBs   | ✅ Pass (171.3 %) |
|----------+----------------+-------------+-------------+-------------------|
| CPU      | SR25519-Verify | 1.10 MiBs   | 666.00 KiBs | ✅ Pass (169.4 %) |
|----------+----------------+-------------+-------------+-------------------|
| Memory   | Copy           | 13.92 GiBs  | 14.32 GiBs  | ✅ Pass ( 97.2 %) |
|----------+----------------+-------------+-------------+-------------------|
| Disk     | Seq Write      | 407.21 MiBs | 450.00 MiBs | ✅ Pass ( 90.5 %) |
|----------+----------------+-------------+-------------+-------------------|
| Disk     | Rnd Write      | 82.72 MiBs  | 200.00 MiBs | ❌ Fail ( 41.4 %) |
+----------+----------------+-------------+-------------+-------------------+

The disk write is tested on the SSD boot volume, but we have NVMe - I'm not sure how to force the benchmark command to test the NVMe.

@stakeworld
Contributor

stakeworld commented Mar 26, 2023

Today the importing again stopped multiple times, and the process crashed with an mpsc network error while catching up on imports. Same as in my previous post: the import stops, there are only idle messages, and the peers drop, until at a certain moment there is a lot of catch-up importing and the process crashes with "The number of unprocessed messages in channel mpsc_network_worker exceeded 100000". Lots of missed para-validator points that session. I think if this happens at the end of a session, the node gets chilled. There are multiple posts about chilled nodes on Element.

Like @dcolley's, this system was running smoothly for ages before this update.

@altonen @bkchr it really looks like something is off with the sync/network functioning, maybe related to the move of syncing out of sc-network?

https://logpaste.com/bNstR1ap

@LukeWheeldon

LukeWheeldon commented Mar 26, 2023

@dcolley

To benchmark with a specific directory for IO: polkadot benchmark machine -d /home/polkadot

As for the benchmark itself, I don't believe it has changed recently. The 14.32 GiBs minimum has been there for quite a while. If two machines have substantially different benchmark results, there is probably something different between those two.

Note that this benchmark is only meant to compare against the hardware Parity has decided to use themselves (I believe). If the performance of your validator is generally A+, then it is doing just fine. Personally, I would not be comfortable with 6.96 GiBs memory speed or the lack of ECC memory, but that's just me.

edit: if you want to compare the memory performance of your two "supposedly" identical machines using a tool other than polkadot benchmark machine, you can use sysbench memory run.
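
For yet another rough cross-check of memory-copy throughput outside those tools, a few lines of Rust are enough (purely an illustrative sketch, not the methodology polkadot benchmark machine or sysbench use, so the numbers are only useful for comparing two machines against each other):

// Rough memory-copy throughput estimate; build and run with --release
// for meaningful numbers.
use std::time::Instant;

fn main() {
    const SIZE: usize = 256 * 1024 * 1024; // 256 MiB per copy
    const ROUNDS: usize = 20;

    let src = vec![1u8; SIZE];
    let mut dst = vec![0u8; SIZE];

    let start = Instant::now();
    for _ in 0..ROUNDS {
        dst.copy_from_slice(&src);
    }
    let secs = start.elapsed().as_secs_f64();

    let gib = (SIZE * ROUNDS) as f64 / (1u64 << 30) as f64;
    println!("copied {gib:.2} GiB in {secs:.2}s -> {:.2} GiB/s", gib / secs);
    assert_eq!(dst[SIZE - 1], 1); // keep dst observable so the copy isn't elided
}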

@tugytur
Contributor

tugytur commented Mar 26, 2023

We have one validator that we had synced from scratch with polkadot 0.9.40-a2b62fb872b; on that one the DB got corrupted today.

Flags:

ExecStart=/usr/local/bin/polkadot \
        --name "XXXX" \
        --validator \
        --state-pruning 1000 \
        --no-mdns \
        --no-private-ip \
        --no-hardware-benchmarks \
        --public-addr=/ip4/XXXX/tcp/30333 \
        --listen-addr=/ip4/XXXX/tcp/30333 \
        --chain=kusama \
        --database=paritydb \
        --sync=warp \
        --prometheus-port 9615 \
        --rpc-port 9933 \
        --ws-port 9944 \
        --no-telemetry \
        --keystore-path=/var/blockchain/keystore \
        --base-path=/var/blockchain/data

We first started to get a couple of the following error logs:

Failed to validate candidate para_id=Id(2000) error=InvalidCandidate(PrepareError("panic: assertion failed: pos >= self.last_offset"))
Deterministic error occurred during preparation (should have been ruled out by pre-checking phase) para_id=Id(2000) e="panic: assertion failed: pos >= self.last_offset"
Failed to validate candidate due to internal error err=ValidationFailed("panic: assertion failed: pos >= self.last_offset") candidate_hash=0x9795e8bef5474a90049d43ccfad897f07a442edd37cdc2f7c147d09d3e0944ab para_id=Id(2000) traceID=201491800243732924274093964563087661040

Shortly after, the whole DB got corrupted and the validator crashed.

Failed to validate candidate para_id=Id(2000) error=InvalidCandidate(PrepareError("panic: assertion failed: pos >= self.last_offset"))
Deterministic error occurred during preparation (should have been ruled out by pre-checking phase) para_id=Id(2000) e="panic: assertion failed: pos >= self.last_offset"
Failed to validate candidate due to internal error err=ValidationFailed("panic: assertion failed: pos >= self.last_offset") candidate_hash=0xc3fdfaa2e85e19e65533b3e77f5714c3172ead4942f51dc04abb9b814c8f5b71 para_id=Id(2000) traceID=260518193792545430448337757597691483331
Background worker error: IO Error: Invalid argument (os error 22)
GRANDPA voter error: could not complete a round on disk: Database
Essential task `grandpa-voter` failed. Shutting down service.
Error:
   0: Other: Essential task failed.

@dcolley
Contributor

dcolley commented Mar 26, 2023

We get A+/A on most performance reports, so I'm not too worried about that.
The machine running 0.9.39-1 is stable and all nodes keep up with the blocks just fine - even when all 3 validators are in the active set.

Thanks for the recommendation for another tool:

server 5 (had slow RAM on bench test):

sysbench memory  run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 104857600 (11373128.02 per second)

102400.00 MiB transferred (11106.57 MiB/sec)


General statistics:
    total time:                          9.2189s
    total number of events:              104857600

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.03
         95th percentile:                        0.00
         sum:                                 3736.93

Threads fairness:
    events (avg/stddev):           104857600.0000/0.00
    execution time (avg/stddev):   3.7369/0.00

server 6 (same memory)

sysbench memory  run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 104857600 (12857043.68 per second)

102400.00 MiB transferred (12555.71 MiB/sec)


General statistics:
    total time:                          8.1549s
    total number of events:              104857600

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.03
         95th percentile:                        0.00
         sum:                                 3294.84

Threads fairness:
    events (avg/stddev):           104857600.0000/0.00
    execution time (avg/stddev):   3.2948/0.00

@altonen
Contributor

altonen commented Mar 26, 2023

@stakeworld

There are at least three underlying issues that I can think of that contribute to this:

  1. NetworkWorker is a chokepoint in sc-network, and the effects of that have obviously been exacerbated by this release. This is a known issue and we are working on a solution that bypasses NetworkWorker and gives protocols a direct channel to libp2p. Sadly it's an epic refactoring, so it will take some time before it's out, but once it's finished it should fix the issue of the mpsc_network_worker channel getting overloaded and reduce the CPU usage of sc-network in general.
  2. Incoming substream attempts are not bounded in any way; basically we allow a remote node to try and open substreams again and again without banning it even temporarily. This means that every time a notification substream is opened, an event is emitted and it goes through the already overloaded NetworkWorker, which then manifests, for example, as these ridiculously high numbers of accepted peers. That happens because SyncingEngine rejects the connection, since it is already at its maximum (or so it falsely believes, because of another bug we are working on), and sends a message to NetworkWorker. Before this release, syncing was contained within Protocol, meaning NetworkWorker didn't get these extra messages and it worked. What we could do is temporarily chill incoming substreams if they have been rejected within the last 30 seconds or so (see the sketch at the end of this comment), but there was some discussion in another syncing-related issue that it might have unforeseeable consequences, so it's a bit of a risk.
  3. Something changed in some recent Polkadot release where previous NAT configurations stopped working and caused the peer count to be unstable, as outlined here. This ties into point 2 in the sense that if the peer connection is unstable (connecting/disconnecting frequently) but stable enough to open a substream, it can cause a kind of DoS on NetworkWorker, because the channel is constantly getting flooded with inbound substream requests.

We are working on fixing these problems, but they require some deep refactoring of the entire sc-network crate, so it will take time until they're finished. I will look tomorrow into whether we can deploy some temporary dirty fixes to alleviate these issues in the meantime. If there are issues other than the ones I've outlined above, I will need those trace logs for sync and sub-libp2p.
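
For point 2, the temporary back-off mentioned above could look roughly like this sketch (purely illustrative; this is not the actual SyncingEngine code, peer ids are plain strings here, and the 30-second window is just the number discussed above):

// Illustrative "chill" of repeatedly rejected inbound substreams.
use std::collections::HashMap;
use std::time::{Duration, Instant};

const COOLDOWN: Duration = Duration::from_secs(30);

struct InboundGate {
    recently_rejected: HashMap<String, Instant>,
}

impl InboundGate {
    fn new() -> Self {
        Self { recently_rejected: HashMap::new() }
    }

    /// Returns true if we should even consider this inbound substream,
    /// false if the peer was rejected within the cooldown window.
    fn allow_attempt(&mut self, peer: &str) -> bool {
        let still_cooling = self
            .recently_rejected
            .get(peer)
            .map(|when| when.elapsed() < COOLDOWN)
            .unwrap_or(false);
        if !still_cooling {
            self.recently_rejected.remove(peer);
        }
        !still_cooling
    }

    /// Record that the slots were full and this peer got rejected.
    fn note_rejected(&mut self, peer: &str) {
        self.recently_rejected.insert(peer.to_string(), Instant::now());
    }
}

fn main() {
    let mut gate = InboundGate::new();
    assert!(gate.allow_attempt("peer-a"));
    gate.note_rejected("peer-a");
    // An immediate retry within the cooldown is dropped without generating
    // more work for the already busy event loop.
    assert!(!gate.allow_attempt("peer-a"));
}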

@gregorst3

gregorst3 commented Mar 26, 2023

I also had problems, but only on one node, today around 2023-03-26 17:30:06.504 (CET).

Last block imported was

Mar 26 17:30:05 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:30:05.217  INFO tokio-runtime-worker substrate: 💤 Idle (1042 peers), best: #17208284 (0x30ce…886f), finalized #17208282 (0xa951…b061), ⬇ 1.8MiB/s ⬆ 2.1MiB/s
Mar 26 17:30:06 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:30:06.353  INFO tokio-runtime-worker substrate: ✨ Imported #17208285 (0x20ef…0164)

Then it went idle and peers started to decrease slowly...
For example:

Mar 26 17:30:50 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:30:50.220  INFO tokio-runtime-worker substrate: 💤 Idle (1002 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 152.4kiB/s ⬆ 180.5kiB/s
Mar 26 17:30:55 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:30:55.220  INFO tokio-runtime-worker substrate: 💤 Idle (996 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 144.2kiB/s ⬆ 110.4kiB/s
Mar 26 17:31:00 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:00.220  INFO tokio-runtime-worker substrate: 💤 Idle (994 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 82.2kiB/s ⬆ 85.0kiB/s
Mar 26 17:31:05 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:05.220  INFO tokio-runtime-worker substrate: 💤 Idle (990 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 118.9kiB/s ⬆ 92.5kiB/s
Mar 26 17:31:10 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:10.220  INFO tokio-runtime-worker substrate: 💤 Idle (985 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 116.9kiB/s ⬆ 115.2kiB/s
Mar 26 17:31:15 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:15.220  INFO tokio-runtime-worker substrate: 💤 Idle (982 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 139.9kiB/s ⬆ 129.7kiB/s
Mar 26 17:31:20 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:20.221  INFO tokio-runtime-worker substrate: 💤 Idle (979 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 167.3kiB/s ⬆ 150.5kiB/s
Mar 26 17:31:25 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:25.221  INFO tokio-runtime-worker substrate: 💤 Idle (977 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 109.0kiB/s ⬆ 123.7kiB/s
Mar 26 17:31:30 Kusama-Cosmoon polkadot[123]: 2023-03-26 17:31:30.221  INFO tokio-runtime-worker substrate: 💤 Idle (975 peers), best: #17208285 (0x20ef…0164), finalized #17208282 (0xa951…b061), ⬇ 89.6kiB/s ⬆ 98.7kiB/s

At 18:01:31.021 (CET), the node started to import the blocks missed during this timeframe...

Mar 26 18:01:31 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:31.021  INFO tokio-runtime-worker substrate: ✨ Imported #17208286 (0x5a44…0368)
Mar 26 18:01:31 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:31.065  INFO tokio-runtime-worker substrate: ✨ Imported #17208287 (0xf932…53f4)
Mar 26 18:01:31 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:31.117  INFO tokio-runtime-worker substrate: ✨ Imported #17208288 (0x21df…6a69)
Mar 26 18:01:31 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:31.172  INFO tokio-runtime-worker substrate: ✨ Imported #17208289 (0xc1e3…0432)

The log was full of "Imported" lines until 18:01:46, when it crashed:

Mar 26 18:01:46 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:46.373  INFO tokio-runtime-worker substrate: ✨ Imported #17208540 (0xc6ba…172e)
Mar 26 18:01:46 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:46.406  INFO tokio-runtime-worker substrate: ✨ Imported #17208541 (0xe954…5ff0)
Mar 26 18:01:49 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:49.200  WARN tokio-runtime-worker parachain::availability-store: err=ContextChannelClosed
Mar 26 18:01:49 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:49.200 ERROR tokio-runtime-worker overseer: Overseer exited with error err=Generated(SubsystemStalled("pvf-checker-subsystem", "signal", "polkadot_node_subsystem_types::OverseerSignal"))
Mar 26 18:01:49 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:49.201 ERROR tokio-runtime-worker sc_service::task_manager: Essential task `overseer` failed. Shutting down service.
Mar 26 18:01:49 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:49.201 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="bitfield-signing-subsystem" err=FromOrigin { origin: "bitfield-signing", source: Generated(Context("Sign>
Mar 26 18:01:49 Kusama-Cosmoon polkadot[123]: 2023-03-26 18:01:49.205 ERROR tokio-runtime-worker polkadot_overseer: subsystem exited with error subsystem="dispute-coordinator-subsystem" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(>
Mar 26 18:01:52 Kusama-Cosmoon polkadot[123]: Error:
Mar 26 18:01:52 Kusama-Cosmoon polkadot[123]:    0: Other: Essential task failed.
Mar 26 18:01:52 Kusama-Cosmoon polkadot[123]: Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Mar 26 18:01:52 Kusama-Cosmoon polkadot[123]: Run with RUST_BACKTRACE=full to include source snippets.
Mar 26 18:01:52 Kusama-Cosmoon systemd[1]: polkadot-validator.service: Main process exited, code=exited, status=1/FAILURE
Mar 26 18:01:52 Kusama-Cosmoon systemd[1]: polkadot-validator.service: Failed with result 'exit-code'.
Mar 26 18:01:52 Kusama-Cosmoon systemd[1]: polkadot-validator.service: Consumed 2d 18h 37min 5.107s CPU time.
Mar 26 18:03:52 Kusama-Cosmoon systemd[1]: polkadot-validator.service: Scheduled restart job, restart counter is at 1.
Mar 26 18:03:52 Kusama-Cosmoon systemd[1]: Stopped Polkadot Validator.


No RAM spike (normal usage is around 2.5 GB) or CPU spike was observed during this timeframe.

The issue was not observed on other nodes managed by me

@stakeworld
Contributor

stakeworld commented Mar 26, 2023

@gregorst3, that's the same pattern I observed: the node's peer count gets high, it stops importing and goes "idle", the peer count drops, and at a certain point it restarts importing a lot of blocks at once, which seems to overload some network subsystem and causes a node crash. If the above happens at the end of a session, the node gets involuntarily chilled.

Strangely, some nodes seem to suffer more than others. Maybe after the update the resource demand got bigger, and nodes that had more spare resources were able to compensate better, or something like that? But that is just guessing.

@altonen thanks for the extensive reply, that sounds logical and makes things clearer.

@LukeWheeldon

The node crashed a little while ago; here is a log file with the last 10000 entries (I can provide more if necessary):
https://we.tl/t-x8Ti3SaOFL

@altonen
Contributor

altonen commented Apr 14, 2023

These look interesting, but I don't know much about the dispute coordinator:

2023-04-13 17:47:54.765  WARN tokio-runtime-worker parachain::approval-voting: Waiting for approval signatures timed out - dead lock?
2023-04-13 17:47:54.765  WARN tokio-runtime-worker parachain::dispute-coordinator: Fetch for approval votes got cancelled, only expected during shutdown!
2023-04-13 17:47:54.765  INFO tokio-runtime-worker parachain::dispute-coordinator: New dispute initiated for candidate. candidate_hash=0xd627691b7c614a0b4950ebb883895478297a32b3dec637501dc5cfb0267e83d8 session=29568 traceID=284659422506117383075125336000894162040
2023-04-13 17:47:54.765  INFO tokio-runtime-worker parachain::dispute-coordinator: Dispute on candidate concluded with 'valid' result candidate_hash=0xd627691b7c614a0b4950ebb883895478297a32b3dec637501dc5cfb0267e83d8 session=29568 traceID=284659422506117383075125336000894162040

@eskimor @ordian or @sandreim probably know better

@ordian
Member

ordian commented Apr 14, 2023

Dispute coordinator messages are "normal". They just mean we've imported votes for a dispute. There were enough votes for the dispute to conclude. The candidate was "valid", so nothing to worry about.

WARN tokio-runtime-worker parachain::approval-voting: Waiting for approval signatures timed out - dead lock?

This warning is more concerning though. It does indeed look like there is a potential deadlock:
The approval-voting subsystem sends a message to approval-distribution over an unbounded channel, but awaits the result directly (a classic deadlock trap):
https://github.com/paritytech/polkadot/blob/6a0b32a9bd1672fdceef972684d2da7d7eadf74c/node/core/approval-voting/src/lib.rs#L1373-L1380

approval-distribution sends messages over a bounded channel to approval-voting (and also awaits the results):
https://github.com/paritytech/polkadot/blob/6a0b32a9bd1672fdceef972684d2da7d7eadf74c/node/network/approval-distribution/src/lib.rs#L796

However, the timeout is the way to break the cycle. It can only happen when the approval-voting queue is full. paritytech/polkadot#6782 should make this much less likely to happen in practice by reducing the number of messages.

The timeout could also be hit if the node is slow/under load.
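
A tiny self-contained sketch of that cycle (not the real subsystems; plain std channels, with the bounded inbox shrunk to capacity 1 so it is easy to fill) shows why the timeout is what breaks it:

// "voting" waits for a reply from "distribution" while "distribution" is
// blocked pushing into voting's bounded, already-full inbox; only the
// timeout on the waiting side breaks the cycle.
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // Bounded inbox of the "approval-voting" side (capacity 1 to fill easily).
    let (to_voting, voting_inbox) = mpsc::sync_channel::<&'static str>(1);
    // Unbounded query channel towards "approval-distribution" plus a reply channel.
    let (to_distribution, distribution_inbox) = mpsc::channel::<&'static str>();
    let (reply_tx, reply_rx) = mpsc::channel::<&'static str>();

    // Fill voting's bounded inbox so the next send into it will block.
    to_voting.send("pending work").unwrap();

    // "approval-distribution": tries to send more work to voting and blocks.
    let dist = thread::spawn(move || {
        to_voting.send("more work").unwrap(); // blocks: inbox is full
        let query = distribution_inbox.recv().unwrap();
        reply_tx.send(query).unwrap();
    });

    // "approval-voting": asks for approval signatures and waits for the reply
    // instead of draining its own inbox first.
    to_distribution.send("get approval signatures").unwrap();
    match reply_rx.recv_timeout(Duration::from_secs(2)) {
        Ok(_) => println!("got reply (no deadlock)"),
        Err(_) => println!("waiting for approval signatures timed out - dead lock?"),
    }

    // Only after the timeout does voting drain its inbox, unblocking the other side.
    let _ = voting_inbox.recv();
    let _ = dist.join();
}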

@sandreim
Contributor

@ordian is correct. This looks concerning; maybe there is a way to refactor and remove the cycle, by handling GetApprovalSignaturesForCandidate in approval-distribution without involving approval-voting.

@gle1pn1r

gle1pn1r commented May 8, 2023

We experienced the mentioned problems with nodes getting stuck / not syncing after upgrading our nodes in South America to 0.9.41, and had to roll back to 0.9.39. We're seeing the same problems with 0.9.42, and again only on nodes in SA - not in the EU. As far as I understand, this release is meant to fix these problems, correct? Unfortunately we don't have debug-level logs on the nodes to provide here, but we're running with p2p ports open and no NAT.


@altonen
Contributor

altonen commented May 8, 2023

How many nodes are you connected to?

This release included a fix for syncing which makes it kick inactive peers, releasing slots for peers that are hopefully more active. I don't know how to read the first graph, but are both the best and the finalized block getting stuck? The second image doesn't look extremely concerning unless it is happening constantly.

@gle1pn1r

gle1pn1r commented May 8, 2023

How many nodes are you connected to?

When that happened, 80 - in line with the usual count

I don't know how to read the first graph but is both best and finalized block getting stuck?

Yes, best and finalized. This generally doesn't happen but we experienced it on 0.9.41, and once we saw this on a couple of our nodes shortly after upgrading to 0.9.42, we thought that it might be the same problem and decided to roll back

@altonen
Contributor

altonen commented May 8, 2023

What is the ratio between in/out peers? The default setting is 8 out/32 in.

There is a 30-second delay between detecting that the node is stuck and kicking the peers, so if your best block is stuck for multiple minutes then there is some other problem. But I don't think that is the case, unless the node is literally unable to establish any new connections and all it has are nodes that are already full.

If I had to guess what happened, since you don't have logs: it looks like SyncingEngine detected that it was stuck, evicted all idle peers, connected to some other peers and was able to recover from the stall in ~10 minutes. It's definitely not optimal, that's for sure, and we could consider detecting whether the inbound substream is closed and evicting the peer right away, which would probably yield more linear curves. But if this is not happening to you constantly, I wouldn't necessarily worry as long as the curves look linear on an hour(s) scale.

We did get a report from our DevOps that one of our RPC nodes in Brazil had a similar issue, but that was before the release. I've been busy with other stuff, but I will try to look into this issue soon and see if I can make any sense of why it's lagging in SA.
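
For what it's worth, the eviction logic described above boils down to something like this sketch (illustrative only, with made-up types and thresholds; not the real SyncingEngine implementation): track when the best block last advanced and, once that exceeds the stall threshold, drop peers that have not sent anything useful recently.

// Illustrative stall detection and idle-peer eviction.
use std::collections::HashMap;
use std::time::{Duration, Instant};

const STALL_THRESHOLD: Duration = Duration::from_secs(30);
const PEER_IDLE_THRESHOLD: Duration = Duration::from_secs(30);

struct SyncState {
    last_best_block_update: Instant,
    last_activity: HashMap<String, Instant>, // peer id -> last useful message
}

impl SyncState {
    fn on_block_imported(&mut self, now: Instant) {
        self.last_best_block_update = now;
    }

    fn on_peer_activity(&mut self, peer: &str, now: Instant) {
        self.last_activity.insert(peer.to_string(), now);
    }

    /// Called periodically; returns peers to disconnect if syncing looks stalled.
    fn evict_if_stalled(&mut self, now: Instant) -> Vec<String> {
        if now.duration_since(self.last_best_block_update) < STALL_THRESHOLD {
            return Vec::new();
        }
        let idle: Vec<String> = self
            .last_activity
            .iter()
            .filter(|(_, last)| now.duration_since(**last) >= PEER_IDLE_THRESHOLD)
            .map(|(peer, _)| peer.clone())
            .collect();
        for peer in &idle {
            self.last_activity.remove(peer);
        }
        idle
    }
}

fn main() {
    let start = Instant::now();
    let mut state = SyncState {
        last_best_block_update: start,
        last_activity: HashMap::new(),
    };
    state.on_peer_activity("peer-a", start);
    // Pretend 31 seconds pass with no imports and no peer activity.
    let later = start + Duration::from_secs(31);
    let evicted = state.evict_if_stalled(later);
    println!("evicting idle peers: {evicted:?}");
    state.on_block_imported(later); // a new import resets the stall timer
}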

@LukeWheeldon

I keep getting UnboundedChannelPersistentlyLarge alerts on a variety of servers. When that happens I generally restart the polkadot process manually, because once the first alert fires I usually get more soon after, and this eventually leads to a stall.

Is there any progress on this issue? Thanks!

@altonen
Contributor

altonen commented Jun 14, 2023

The issue was supposed to have been addressed in 0.9.42. Which channel is getting clogged?

@LukeWheeldon

I get the messages below on a regular basis. It's much less frequent than before 0.9.42, but it still happens once in a while on most/all of my nodes.

Labels
alertname = UnboundedChannelPersistentlyLarge
chain = polkadot
entity = mpsc_import_notification_stream
instance = server:9615
job = server
network = polkadot
severity = warning
Annotations
message = Channel mpsc_import_notification_stream on node server:9615 contains more than 200 items for more than 5 minutes. Node might be frozen.

@altonen
Contributor

altonen commented Jun 15, 2023

This looks like a different bug than what this issue was originally about because the channel that's getting clogged is the block import notification stream. Could you open a new issue for this?

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
claravanstaden pushed a commit to Snowfork/polkadot-sdk that referenced this issue Dec 8, 2023
* Remove user bob from bootstrap test

* Remove unused testSubClient from bootstrap test

* Add 6-account basic channel test

* Change test name

* Move ethClient.initialize

* Get dispatchAs working for Alice only

* Fetch sudo key instead of using Alice

* Add transactions for the remaining 5 accounts

Tests are failing because only a single transaction is reflected in the
eth account.

* Match naming convention for incentivized channel

* Send multiple proofs from parachain relayer

* Replace proof leaf with bundle from event

* Tweak log for accounts & nonces

* Update recommended polkadot version to 0.9.28

* Check for nil incentivized channel commitment

* Avoid zero value proofs by setting len 0

* Fix proof struct field order

Field order matters when decoding into RawMerkleProof.

* Tweak warning log message

* REMOVEME Debug logs

* Handle non-perfect complete trees in hashSides

* Fix power of 2

XOR strikes again!

* This decodes properly, but rather use the proof

* Add TODO for bundle & leaf validation

* Loop by event instead of account

* Tweak test README

* Remove TODOs

* Fix IncentivizedChannelCommitment pointer

* Remove unused method

* Return pointer from constructor instead of value

* Refactor beefy-listener

- Extract scanForBasicChannelProofs.
- Rename to basicChannel{AccountNonces,ScanAccounts}.
- Pass digestItemHash around instead of digestItem.

* Remove extra debug logs

* Separate parameters to generateHashSides

* Newline

* Add note about account ids

* Add tests for generateHashSides

* Remove NewIncentivizedChannelCommitment

* Separate proofs from bundles

* Remove fetchOffchainData
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* fix broken message lane benchmarks

* proof-size related benchmarks

* impl Size for proof parameters

* include proof weight into weight formula

* left TODO

* fixed proof size

* WeightInfoExt::receive_messages_proof_weight

* charge for extra message bytes delivery in send_message

* removed default impl of WeightsInfoExt

* moved weight formulas to WeightInfoExt

* receive_messages_proof_outbound_lane_state_overhead is included twice in weight

* typo

* typo

* fixed TODO

* more asserts

* started wotk on message-lane documentation

* expected_extra_storage_proof_size() is actually expected in delivery confirmation tx

* update README.md

* ensure_able_to_receive_confirmation

* test rialto message lane weights

* removed TODO

* removed unnecessary trait requirements

* fixed arguments

* fix compilation

* decreased basic delivery tx weight

* fmt

* clippy

* Update modules/message-lane/src/benchmarking.rs

Co-authored-by: Hernando Castano <[email protected]>

* structs

* Update primitives/millau/src/lib.rs

Co-authored-by: Hernando Castano <[email protected]>

* removed readme.md

* removed obsolete trait bounds

* Revert "removed readme.md"

This reverts commit 50b7376a41687a94c27bf77565434be153f87ca1.

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* PreComputedSize

Co-authored-by: Hernando Castano <[email protected]>
Co-authored-by: Tomasz Drwięga <[email protected]>
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* fix broken message lane benchmarks

* proof-size related benchmarks

* impl Size for proof parameters

* include proof weight into weight formula

* left TODO

* fixed proof size

* WeightInfoExt::receive_messages_proof_weight

* charge for extra message bytes delivery in send_message

* removed default impl of WeightsInfoExt

* moved weight formulas to WeightInfoExt

* receive_messages_proof_outbound_lane_state_overhead is included twice in weight

* typo

* typo

* fixed TODO

* more asserts

* started wotk on message-lane documentation

* expected_extra_storage_proof_size() is actually expected in delivery confirmation tx

* update README.md

* ensure_able_to_receive_confirmation

* test rialto message lane weights

* removed TODO

* removed unnecessary trait requirements

* fixed arguments

* fix compilation

* decreased basic delivery tx weight

* fmt

* clippy

* Update modules/message-lane/src/benchmarking.rs

Co-authored-by: Hernando Castano <[email protected]>

* structs

* Update primitives/millau/src/lib.rs

Co-authored-by: Hernando Castano <[email protected]>

* removed readme.md

* removed obsolete trait bounds

* Revert "removed readme.md"

This reverts commit 50b7376a41687a94c27bf77565434be153f87ca1.

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* PreComputedSize

Co-authored-by: Hernando Castano <[email protected]>
Co-authored-by: Tomasz Drwięga <[email protected]>
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* fix broken message lane benchmarks

* proof-size related benchmarks

* impl Size for proof parameters

* include proof weight into weight formula

* left TODO

* fixed proof size

* WeightInfoExt::receive_messages_proof_weight

* charge for extra message bytes delivery in send_message

* removed default impl of WeightsInfoExt

* moved weight formulas to WeightInfoExt

* receive_messages_proof_outbound_lane_state_overhead is included twice in weight

* typo

* typo

* fixed TODO

* more asserts

* started wotk on message-lane documentation

* expected_extra_storage_proof_size() is actually expected in delivery confirmation tx

* update README.md

* ensure_able_to_receive_confirmation

* test rialto message lane weights

* removed TODO

* removed unnecessary trait requirements

* fixed arguments

* fix compilation

* decreased basic delivery tx weight

* fmt

* clippy

* Update modules/message-lane/src/benchmarking.rs

Co-authored-by: Hernando Castano <[email protected]>

* structs

* Update primitives/millau/src/lib.rs

Co-authored-by: Hernando Castano <[email protected]>

* removed readme.md

* removed obsolete trait bounds

* Revert "removed readme.md"

This reverts commit 50b7376a41687a94c27bf77565434be153f87ca1.

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* PreComputedSize

Co-authored-by: Hernando Castano <[email protected]>
Co-authored-by: Tomasz Drwięga <[email protected]>
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 8, 2024
* fix broken message lane benchmarks

* proof-size related benchmarks

* impl Size for proof parameters

* include proof weight into weight formula

* left TODO

* fixed proof size

* WeightInfoExt::receive_messages_proof_weight

* charge for extra message bytes delivery in send_message

* removed default impl of WeightsInfoExt

* moved weight formulas to WeightInfoExt

* receive_messages_proof_outbound_lane_state_overhead is included twice in weight

* typo

* typo

* fixed TODO

* more asserts

* started wotk on message-lane documentation

* expected_extra_storage_proof_size() is actually expected in delivery confirmation tx

* update README.md

* ensure_able_to_receive_confirmation

* test rialto message lane weights

* removed TODO

* removed unnecessary trait requirements

* fixed arguments

* fix compilation

* decreased basic delivery tx weight

* fmt

* clippy

* Update modules/message-lane/src/benchmarking.rs

Co-authored-by: Hernando Castano <[email protected]>

* structs

* Update primitives/millau/src/lib.rs

Co-authored-by: Hernando Castano <[email protected]>

* removed readme.md

* removed obsolete trait bounds

* Revert "removed readme.md"

This reverts commit 50b7376a41687a94c27bf77565434be153f87ca1.

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* PreComputedSize

Co-authored-by: Hernando Castano <[email protected]>
Co-authored-by: Tomasz Drwięga <[email protected]>
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue Apr 9, 2024
* fix broken message lane benchmarks

* proof-size related benchmarks

* impl Size for proof parameters

* include proof weight into weight formula

* left TODO

* fixed proof size

* WeightInfoExt::receive_messages_proof_weight

* charge for extra message bytes delivery in send_message

* removed default impl of WeightsInfoExt

* moved weight formulas to WeightInfoExt

* receive_messages_proof_outbound_lane_state_overhead is included twice in weight

* typo

* typo

* fixed TODO

* more asserts

* started wotk on message-lane documentation

* expected_extra_storage_proof_size() is actually expected in delivery confirmation tx

* update README.md

* ensure_able_to_receive_confirmation

* test rialto message lane weights

* removed TODO

* removed unnecessary trait requirements

* fixed arguments

* fix compilation

* decreased basic delivery tx weight

* fmt

* clippy

* Update modules/message-lane/src/benchmarking.rs

Co-authored-by: Hernando Castano <[email protected]>

* structs

* Update primitives/millau/src/lib.rs

Co-authored-by: Hernando Castano <[email protected]>

* removed readme.md

* removed obsolete trait bounds

* Revert "removed readme.md"

This reverts commit 50b7376a41687a94c27bf77565434be153f87ca1.

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* Update bin/runtime-common/src/messages.rs

Co-authored-by: Tomasz Drwięga <[email protected]>

* PreComputedSize

Co-authored-by: Hernando Castano <[email protected]>
Co-authored-by: Tomasz Drwięga <[email protected]>
serban300 pushed a commit to serban300/polkadot-sdk that referenced this issue five times on Apr 9, 2024 and twice on Apr 10, 2024 (each push carried the same commit message as above)
bkchr pushed a commit that referenced this issue Apr 10, 2024 (same commit message)
@bkchr
Member

bkchr commented Apr 22, 2024

Stale.

@bkchr bkchr closed this as completed Apr 22, 2024