
Adding an implementation of aio.c using EPOLL in Linux. #805

Merged
merged 9 commits into pharo-project:pharo-10
Jun 14, 2024

Conversation

tesonep
Copy link
Collaborator

@tesonep tesonep commented May 14, 2024

Avoiding the limit of file descriptors
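
For context on the approach: select() can only track descriptors below FD_SETSIZE (normally 1024), while epoll has no such ceiling. Below is a minimal sketch of the epoll pattern, assuming a single epoll instance per VM; the names aio_watch and aio_poll_once are illustrative and are not the actual functions in aio.c:

/* Sketch: readiness polling with epoll instead of select().
 * Illustrative only -- not the actual aio.c implementation.
 * Unlike select(), epoll is not limited to descriptors below
 * FD_SETSIZE (typically 1024). */
#include <sys/epoll.h>
#include <stdio.h>

#define MAX_EVENTS 64

/* Register a descriptor for read/write readiness notifications. */
int aio_watch(int epfd, int fd)
{
    struct epoll_event ev;
    ev.events  = EPOLLIN | EPOLLOUT;
    ev.data.fd = fd;              /* remember which fd fired */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Poll once, without blocking, and report ready descriptors. */
void aio_poll_once(int epfd)
{
    struct epoll_event events[MAX_EVENTS];
    int n = epoll_wait(epfd, events, MAX_EVENTS, 0);
    for (int i = 0; i < n; i++)
        printf("fd %d is ready (events 0x%x)\n",
               events[i].data.fd, events[i].events);
}

In this pattern the epoll instance would be created once with epoll_create1(0) and each socket registered via aio_watch as it is created, so the number of watched descriptors is bounded only by the process fd limit.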
@akgrant43
Copy link
Contributor

akgrant43 commented Jun 4, 2024

Hi Pablo,

Thanks very much for this!

I've been running a GT VM with this patch for a couple of weeks, and under light load (and a small number of sockets) it works fine. Unfortunately it is crashing the VM in what appears to be two separate situations:

  1. under load

This isn't very reproducible, but here are a couple of symptoms I've seen:

  • If another library is calling epoll_wait(), the VM will crash with a segmentation fault in that thread and call.
  • After a crash, attempting to call printCallStack() from the debugger will sometimes crash, which suggests memory corruption.
  • A separate crash where the library calling epoll_wait() is not loaded:
...
Socket count: 998
Socket count: 999
Socket count: 1000

Thread 2 "PharoVM" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7b0c6c0 (LWP 51339)]
0x0000000320001354 in ?? ()
(gdb) info threads
  Id   Target Id                                           Frame 
  1    Thread 0x7ffff7b108c0 (LWP 51336) "GlamorousToolki" 0x00007ffff7c3663d in syscall ()
   from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
* 2    Thread 0x7ffff7b0c6c0 (LWP 51339) "PharoVM"         0x0000000320001354 in ?? ()
  3    Thread 0x7ffff58476c0 (LWP 51340) "PharoVM"         0x00007ffff7bfdaf5 in clock_nanosleep@GLIBC_2.2.5 ()
   from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7b108c0 (LWP 51336))]
#0  0x00007ffff7c3663d in syscall () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7c3663d in syscall () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#1  0x000055555572be94 in std::sys::unix::futex::futex_wait () at library/std/src/sys/unix/futex.rs:62
#2  std::sys_common::thread_parking::futex::Parker::park () at library/std/src/sys_common/thread_parking/futex.rs:52
#3  std::thread::park () at library/std/src/thread/mod.rs:1070
#4  0x000055555567c9d2 in std::sync::mpmc::list::Channel<T>::recv::{{closure}}::h834c02460d2fa055 ()
#5  0x000055555567c72b in std::sync::mpmc::list::Channel<T>::recv::h1aef3010e00e4ed6 ()
#6  0x000055555567b4fd in vm_runtime::event_loop::EventLoop::run::h542c8ac55af408e0 ()
#7  0x0000555555675943 in vm_runtime::constellation::Constellation::run::hf8ae69d06de4ab09 ()
#8  0x000055555558f9e7 in vm_client_cli::main::h1f12db8a14167363 ()
#9  0x00005555555949d3 in std::sys_common::backtrace::__rust_begin_short_backtrace::h8d0b389f33313915 ()
#10 0x0000555555594a09 in std::rt::lang_start::{{closure}}::hc8dabe638680900a ()
#11 0x000055555572b827 in core::ops::function::impls::{impl#2}::call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/core/src/ops/function.rs:284
#12 std::panicking::try::do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/panicking.rs:552
#13 std::panicking::try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library/std/src/panicking.rs:516
#14 std::panic::catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library/std/src/panic.rs:142
#15 std::rt::lang_start_internal::{closure#2} () at library/std/src/rt.rs:148
#16 std::panicking::try::do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/panicking.rs:552
#17 std::panicking::try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library/std/src/panicking.rs:516
#18 std::panic::catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library/std/src/panic.rs:142
#19 std::rt::lang_start_internal () at library/std/src/rt.rs:148
#20 0x00005555555949fe in std::rt::lang_start::h594095b100ee11f9 ()
#21 0x00007ffff7b52fce in __libc_start_call_main () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#22 0x00007ffff7b53089 in __libc_start_main_impl () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#23 0x0000555555587635 in _start ()

In these cases there have been fewer than 1023 sockets open, thus avoiding the next issue...

  2. socketWritable() in SocketPluginImpl.c still calls select(), so it will fail when an fd >= 1024 is used (a sketch of a poll()-based alternative follows the backtrace below):
Socket count: 1498
Socket count: 1499
Socket count: 1500
*** stack smashing detected ***: terminated

Thread 2 "PharoVM" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7ad36c0 (LWP 14241)]
0x00007ffff7b7ed7c in __pthread_kill_implementation () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
(gdb) bt
#0  0x00007ffff7b7ed7c in __pthread_kill_implementation ()
   from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#1  0x00007ffff7b2f9c6 in raise () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#2  0x00007ffff7b188fa in abort () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#3  0x00007ffff7b19767 in __libc_message.cold () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#4  0x00007ffff7c0d7f9 in __fortify_fail () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#5  0x00007ffff7c0eaa4 in __stack_chk_fail () from /nix/store/qn3ggz5sf3hkjs2c797xf7nan3amdxmp-glibc-2.38-27/lib/libc.so.6
#6  0x00007fff9eaf1c08 in socketWritable (s=1088)
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/extracted/plugins/SocketPlugin/src/common/SocketPluginImpl.c:456
#7  0x00007fff9eaf3d83 in sqSocketSendDone (s=0x36017a1d0)
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/extracted/plugins/SocketPlugin/src/common/SocketPluginImpl.c:1227
#8  0x00007fff9eafa8f5 in primitiveSocketSendDone ()
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/extracted/plugins/SocketPlugin/src/common/SocketPlugin.c:2095
#9  0x00000003200017c8 in ?? ()
#10 0x0000010019378348 in ?? ()
#11 0x0000010000756218 in ?? ()
#12 0x00007ffff7ad2560 in ?? ()
#13 0x00007ffff7e42594 in interpret ()
    at /home/alistair/gtvm/gtoolkit-vm/target/debug/build/vm-bindings-d83ce2bf61e9bb61/out/generated/64/vm/src/cointerp.c:3030
Backtrace stopped: frame did not save the PC

(the reason the socket count gets up to 1500 is that the test program opens all the sockets, and then starts writing).
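
To illustrate why the second case aborts: FD_SET() sets a bit inside a fixed-size fd_set (FD_SETSIZE bits, normally 1024), so passing a descriptor >= 1024 to a stack-allocated fd_set writes past the buffer, which matches the stack-smashing report above. A hedged sketch of a writability check without that limit, based on poll() rather than the plugin's actual code (socket_writable_poll is a hypothetical name, not a function in SocketPluginImpl.c), could be:

/* Hypothetical sketch: check writability with poll(), which takes an
 * array of pollfd structs and has no FD_SETSIZE limit.
 * This is not the actual SocketPluginImpl.c code. */
#include <poll.h>

static int socket_writable_poll(int fd)
{
    struct pollfd pfd;
    pfd.fd      = fd;
    pfd.events  = POLLOUT;
    pfd.revents = 0;

    /* timeout of 0: return immediately, like a zero-timeout select() */
    int n = poll(&pfd, 1, 0);
    if (n < 0)
        return -1;                               /* error */
    return (n > 0 && (pfd.revents & POLLOUT)) ? 1 : 0;
}

Because poll() is handed only the descriptors of interest, a socket with fd 1088 (as in the backtrace) is handled the same way as any other.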

Once the socketWritable() issue is addressed, if you can supply me with a debug Pharo VM I can try to reproduce the crash in a vanilla Pharo VM.

The test harness I've been using is:

testharness.zip

  1. Load TCPSocketEchoTest.st. This replaces withTCPEchoServer: to keep track of all the connections and repeat the read and write cycle.
  2. Start the server:
Smalltalk vm maxExternalSemaphoresSilently: 8192.
TCPSocketEchoTest new runServer; yourself.
  3. Run s100.py 10 times simultaneously.

To reproduce the second issue, just run it 15 times simultaneously.

@tesonep
Copy link
Collaborator Author

tesonep commented Jun 6, 2024

Hi @akgrant43, I have fixed the implementation.

Cheers,
Pablo

@akgrant43
Copy link
Contributor

Hi @tesonep ,

Thanks! I'll be able to test it next week.

Cheers,
Alistair

@akgrant43
Copy link
Contributor

akgrant43 commented Jun 13, 2024

Hi Pablo,

This is looking much better:

  • I've been running it on my personal machine for the last few days.
  • The test harness that triggered the issues I reported earlier now works with 2000 connections.
  • It's worked without issue in a live AWS environment with 1800 clients connecting under a normal workload.

I'll continue to use it on my personal machine and will report if anything comes up, but from my perspective it's ready to release.

Do you have any idea of when it is likely to be released?

Thanks!
Alistair

@guillep
Copy link
Member

guillep commented Jun 13, 2024

Thanks for the feedback! Very much appreciated!

@tesonep
Copy link
Collaborator Author

tesonep commented Jun 14, 2024

Thanks so much for checking. I was waiting for your OK to start pushing the release.

@tesonep tesonep merged commit 7b2fd64 into pharo-project:pharo-10 Jun 14, 2024
1 of 2 checks passed
@akgrant43
Copy link
Contributor

Great, thanks!
