
update tokio to 1.26.0 #31055

Closed

Conversation

@lijunwangs (Contributor) commented Apr 4, 2023

Problem

To support repair over QUIC, we now return the endpoint used by the streamer to the caller, so that the client can use that same endpoint to talk to the server side and the server side can send responses back over it (see the sketch below). With this change, we have found a sporadic hang in the Tokio code when a Quinn Endpoint is dropped.
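
For context, the sharing pattern looks roughly like this minimal sketch (not the actual streamer code; the function name and config handling are hypothetical, and the quinn API shown is the 0.9-era one in use at the time):

```rust
use std::net::SocketAddr;
use quinn::{Endpoint, ServerConfig};

// Hypothetical sketch: create one quinn Endpoint for the streamer (server
// side) and hand a clone back to the caller so the client side can reuse
// it. Endpoint is cheaply cloneable, so the listener task and the client
// share the same underlying UDP socket.
fn start_streamer(addr: SocketAddr, config: ServerConfig) -> std::io::Result<Endpoint> {
    let endpoint = Endpoint::server(config, addr)?;
    let listener = endpoint.clone();
    tokio::spawn(async move {
        // Accept incoming QUIC connections on the shared endpoint.
        while let Some(connecting) = listener.accept().await {
            // ... handle the repair request and respond over the same
            // endpoint the client connected from ...
            drop(connecting);
        }
    });
    // Returned to the caller; the client connects out through this same
    // endpoint, so the server side can reach it with responses.
    Ok(endpoint)
}
```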

The hung stack looks like the following:

#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00005640aa854a65 in parking_lot::raw_mutex::RawMutex::lock_slow ()
#2 0x00005640ac46d56d in tokio::io::driver::scheduled_io::ScheduledIo::wake0 ()
#3 0x00005640ac4736a2 in <tokio::io::driver::Inner as core::ops::drop::Drop>::drop ()
#4 0x00005640abfa32d5 in alloc::sync::Arc::drop_slow ()
#5 0x00005640abf9fe9b in core::ptr::drop_in_place<tokio::park::either::Either<tokio::process::imp::driver::Driver,tokio::park::thread::ParkThread>> ()
#6 0x00005640abfa22a1 in alloc::sync::Arc::drop_slow ()
#7 0x00005640abfa2a17 in alloc::sync::Arc::drop_slow ()
#8 0x00005640abfa27c2 in alloc::sync::Arc::drop_slow ()
#9 0x00005640abfa9636 in tokio::runtime::task::harness::Harness<T,S>::dealloc ()
#10 0x00005640ac44cda9 in <tokio::io::driver::registration::Registration as core::ops::drop::Drop>::drop ()
#11 0x00005640abfa59b3 in core::ptr::drop_in_place<tokio::net::udp::UdpSocket> ()
#12 0x00005640aac12e11 in alloc::sync::Arc::drop_slow ()
#13 0x00005640aac0ba66 in core::ptr::drop_in_place<quinn::endpoint::Endpoint> ()
#14 0x00005640aac13df9 in alloc::sync::Arc::drop_slow ()
#15 0x00005640aac1577d in alloc::sync::Arc::drop_slow ()
#16 0x00005640aaa0402b in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#17 0x00005640aac5ec39 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#18 0x00005640ac5b8923 in std::sys::unix::thread::Thread::new::thread_start ()
#19 0x00007fe5f23096db in start_thread (arg=0x7fda55015700) at pthread_create.c:463
#20 0x00007fe5f1c9461f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Summary of Changes

Upgrade the Tokio version to 1.26.0, which is free of this hang.

Fixes #
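
For illustration only, the change amounts to bumping the pinned version in Cargo.toml, along these lines (a sketch, not the actual diff from this PR; the feature list is assumed):

```toml
# Cargo.toml (sketch): bump the pinned tokio dependency.
[dependencies]
tokio = { version = "1.26.0", features = ["full"] }
```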

@lijunwangs
Copy link
Contributor Author

lijunwangs commented Apr 4, 2023

Ran bench-tps tests with and without the fix and found no significant difference:

tpu-client, master (aa3e0b9):

[2023-04-04T21:16:01.483767949Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 18163.48 | 348329
[2023-04-04T21:16:01.483779094Z INFO solana_bench_tps::bench]
Average max TPS: 18163.48, 0 nodes had 0 TPS
[2023-04-04T21:16:01.483782258Z INFO solana_bench_tps::bench]
Highest TPS: 18163.48 sampling period 1s max transactions: 348329 clients: 1 drop rate: 0.00
[2023-04-04T21:16:01.483786113Z INFO solana_bench_tps::bench] Average TPS: 11221.104

[2023-04-04T21:16:43.830458193Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 14041.37 | 317328
[2023-04-04T21:16:43.830469845Z INFO solana_bench_tps::bench]
Average max TPS: 14041.37, 0 nodes had 0 TPS
[2023-04-04T21:16:43.830480050Z INFO solana_bench_tps::bench]
Highest TPS: 14041.37 sampling period 1s max transactions: 317328 clients: 1 drop rate: 0.00
[2023-04-04T21:16:43.830488939Z INFO solana_bench_tps::bench] Average TPS: 10220.677

[2023-04-04T21:17:26.177213376Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2023-04-04T21:17:26.177216068Z INFO solana_bench_tps::bench]
Average max TPS: 13410.83, 0 nodes had 0 TPS
[2023-04-04T21:17:26.177242423Z INFO solana_bench_tps::bench]
Highest TPS: 13410.83 sampling period 1s max transactions: 300324 clients: 1 drop rate: 0.00
[2023-04-04T21:17:26.177250152Z INFO solana_bench_tps::bench] Average TPS: 9672.861

tpu-client, this PR:

[2023-04-04T22:58:32.731781354Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 18146.93 | 337349
[2023-04-04T22:58:32.731790092Z INFO solana_bench_tps::bench]
Average max TPS: 18146.93, 0 nodes had 0 TPS
[2023-04-04T22:58:32.731795790Z INFO solana_bench_tps::bench]
Highest TPS: 18146.93 sampling period 1s max transactions: 337349 clients: 1 drop rate: 0.01
[2023-04-04T22:58:32.731800608Z INFO solana_bench_tps::bench] Average TPS: 11076.781

[2023-04-04T22:59:15.078863967Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2023-04-04T22:59:15.078867631Z INFO solana_bench_tps::bench]
Average max TPS: 14019.28, 0 nodes had 0 TPS
[2023-04-04T22:59:15.078909265Z INFO solana_bench_tps::bench]
Highest TPS: 14019.28 sampling period 1s max transactions: 307326 clients: 1 drop rate: 0.00
[2023-04-04T22:59:15.078914817Z INFO solana_bench_tps::bench] Average TPS: 9897.063

[2023-04-04T22:59:56.585074007Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 17418.14 | 319386
[2023-04-04T22:59:56.585088481Z INFO solana_bench_tps::bench]
Average max TPS: 17418.14, 0 nodes had 0 TPS
[2023-04-04T22:59:56.585092532Z INFO solana_bench_tps::bench]
Highest TPS: 17418.14 sampling period 1s max transactions: 319386 clients: 1 drop rate: 0.01
[2023-04-04T22:59:56.585102210Z INFO solana_bench_tps::bench] Average TPS: 10569.6045

rpc-client, master:
grep "Average TPS" -B5 rpc-main.txt
[2023-04-04T21:26:44.038368753Z INFO solana_bench_tps::bench] http://35.233.177.221:8899 | 26584.41 | 353889
[2023-04-04T21:26:44.038377661Z INFO solana_bench_tps::bench]
Average max TPS: 26584.41, 0 nodes had 0 TPS
[2023-04-04T21:26:44.038388485Z INFO solana_bench_tps::bench]
Highest TPS: 26584.41 sampling period 1s max transactions: 353889 clients: 1 drop rate: 0.30
[2023-04-04T21:26:44.038393436Z INFO solana_bench_tps::bench] Average TPS: 11689.199

rpc-client, this PR:

grep "Average TPS" -B5 rpc-new-tokio.txt
[2023-04-04T23:02:10.242194465Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2023-04-04T23:02:10.242198480Z INFO solana_bench_tps::bench]
Average max TPS: 41678.18, 0 nodes had 0 TPS
[2023-04-04T23:02:10.242213099Z INFO solana_bench_tps::bench]
Highest TPS: 41678.18 sampling period 1s max transactions: 331086 clients: 1 drop rate: 0.25
[2023-04-04T23:02:10.242219541Z INFO solana_bench_tps::bench] Average TPS: 11000.657

thin-client, master:
[2023-04-04T22:32:42.511746162Z INFO solana_bench_tps::bench] 104.198.3.140:8003 | 33006.11 | 515536
[2023-04-04T22:32:42.511758258Z INFO solana_bench_tps::bench]
Average max TPS: 33006.11, 0 nodes had 0 TPS
[2023-04-04T22:32:42.511761779Z INFO solana_bench_tps::bench]
Highest TPS: 33006.11 sampling period 1s max transactions: 515536 clients: 1 drop rate: 0.02
[2023-04-04T22:32:42.511771206Z INFO solana_bench_tps::bench] Average TPS: 17100.736

[2023-04-04T22:33:24.333075275Z INFO solana_bench_tps::bench]
Average max TPS: 31231.25, 0 nodes had 0 TPS
[2023-04-04T22:33:24.333078006Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2023-04-04T22:33:24.333078949Z INFO solana_bench_tps::bench]
Highest TPS: 31231.25 sampling period 1s max transactions: 560829 clients: 1 drop rate: 0.02
[2023-04-04T22:33:24.333136506Z INFO solana_bench_tps::bench] Average TPS: 18502.898

[2023-04-04T22:34:07.167283756Z INFO solana_bench_tps::bench] 34.168.126.60:8003 | 32842.41 | 518334
[2023-04-04T22:34:07.167294338Z INFO solana_bench_tps::bench]
Average max TPS: 32842.41, 0 nodes had 0 TPS
[2023-04-04T22:34:07.167297924Z INFO solana_bench_tps::bench]
Highest TPS: 32842.41 sampling period 1s max transactions: 518334 clients: 1 drop rate: 0.00
[2023-04-04T22:34:07.167305530Z INFO solana_bench_tps::bench] Average TPS: 16705.557

thin-client, this PR:

[2023-04-04T23:05:43.878128086Z INFO solana_bench_tps::bench] 35.233.177.221:8003 | 25967.38 | 504129
[2023-04-04T23:05:43.878148764Z INFO solana_bench_tps::bench]
Average max TPS: 25967.38, 0 nodes had 0 TPS
[2023-04-04T23:05:43.878154361Z INFO solana_bench_tps::bench]
Highest TPS: 25967.38 sampling period 1s max transactions: 504129 clients: 1 drop rate: 0.03
[2023-04-04T23:05:43.878159676Z INFO solana_bench_tps::bench] Average TPS: 16507.777

[2023-04-04T23:06:26.400310458Z INFO solana_bench_tps::bench]
Average max TPS: 30490.07, 0 nodes had 0 TPS
[2023-04-04T23:06:26.400316443Z INFO solana_bench_tps::bench]
Highest TPS: 30490.07 sampling period 1s max transactions: 508329 clients: 1 drop rate: 0.00
[2023-04-04T23:06:26.400314451Z INFO solana_metrics::metrics] datapoint: bench-tps-lamport_balance balance=14373355930776120i
[2023-04-04T23:06:26.400320101Z INFO solana_bench_tps::bench] Average TPS: 16381.896

[2023-04-04T23:07:08.233045337Z INFO solana_bench_tps::bench] 34.168.126.60:8003 | 30030.56 | 515596
[2023-04-04T23:07:08.233056989Z INFO solana_bench_tps::bench]
Average max TPS: 30030.56, 0 nodes had 0 TPS
[2023-04-04T23:07:08.233061505Z INFO solana_bench_tps::bench]
Highest TPS: 30030.56 sampling period 1s max transactions: 515596 clients: 1 drop rate: 0.01
[2023-04-04T23:07:08.233066079Z INFO solana_bench_tps::bench] Average TPS: 17162.008

@CriesofCarrots (Contributor) left a comment

Please don't merge this. Tokio has been pinned because of this long-standing issue: #24644
There is a lot of motivation to update, but the issue has been a critical one for RPC operators.
We (and by "we" I mean mostly @mschneider and team) have just started actively working with RPC providers to repro, instrument, and test various newer tokio versions.

@codecov (bot) commented Apr 5, 2023

Codecov Report

Merging #31055 (8880aa5) into master (aa3e0b9) will decrease coverage by 0.1%.
The diff coverage is n/a.

@@            Coverage Diff            @@
##           master   #31055     +/-   ##
=========================================
- Coverage    81.5%    81.5%   -0.1%     
=========================================
  Files         728      728             
  Lines      205737   205737             
=========================================
- Hits       167695   167688      -7     
- Misses      38042    38049      +7     

@lijunwangs (Contributor, Author) commented
Withdrawing this PR due to the RPC concerns -- but I think we need to root-cause that issue in order to move forward, since we are many versions behind the Tokio mainline. For the deadlock I mentioned here, there is a workaround: the Endpoint has to be dropped carefully to avoid the hang. Any client reference to the same Endpoint must be dropped before the listener (server code) on it is dropped. This requires careful ordering of our services during shutdown; see the sketch below.
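
A minimal sketch of that drop ordering (the service struct and field names are hypothetical, for illustration only):

```rust
// Hypothetical sketch of the workaround: drop every client handle to the
// shared quinn::Endpoint before dropping the listener (server) side, so
// the Endpoint teardown does not hit the tokio io driver deadlock.
struct RepairService {
    client_endpoint: Option<quinn::Endpoint>, // clone used by the client
    server_endpoint: Option<quinn::Endpoint>, // clone owned by the listener
}

impl RepairService {
    fn shutdown(&mut self) {
        // 1. Drop the client's reference to the shared Endpoint first.
        self.client_endpoint.take();
        // 2. Only then drop the listener (server) side.
        self.server_endpoint.take();
    }
}
```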

@lijunwangs closed this Apr 7, 2023