Fix sporadic failures testing with JRuby #1012

headius · 2023-11-08T00:40:25Z

A collection of fixes to make specs more robust or fix issues causing sporadic failures on JRuby.

If this spec proceeds to reset the barrier before there are any waiters, the reset will not trigger them and the spec will fail. This at least tries to ensure that the waiter is there and will be interrupted as expected.

On slower or heavily-loaded systems, such as CI, this join may need more than one second. Give it ten before giving up and raising an error.

spec/concurrent/atomic/cyclic_barrier_spec.rb

This avoids getting stuck and waiting forever if something goes awry.

headius · 2023-11-08T01:28:04Z

Existing commits up to this point fix the following:

  1) Concurrent::CyclicBarrier#number_waiting with waiting threads should be equal to the waiting threads count
     Failure/Error: expect(thread_join).not_to be_nil, thread.inspect
       #<Thread:0x40712ee9 /Users/headius/work/concurrent-ruby/spec/support/example_group_extensions.rb:32 aborting>
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-support-3.12.1/lib/rspec/support.rb:108:in `block in Support'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-support-3.12.1/lib/rspec/support.rb:117:in `notify_failure'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/fail_with.rb:35:in `fail_with'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/handler.rb:40:in `handle_failure'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/handler.rb:84:in `block in handle_matcher'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/handler.rb:27:in `with_matcher'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/handler.rb:76:in `handle_matcher'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/expectation_target.rb:78:in `not_to'
     # /Users/headius/work/jruby/lib/ruby/gems/shared/gems/rspec-expectations-3.12.3/lib/rspec/expectations/expectation_target.rb:106:in `not_to'
     # ./spec/spec_helper.rb:57:in `block in <main>'

I have at least one other intermittent failure I need to investigate:

  1) Concurrent ErlangActor on pool behaves like erlang actor asking timing out
     Failure/Error: raise NoActor.new(@Pid) if @Terminated.resolved?

     Concurrent::ErlangActor::NoActor:
       #<Concurrent::ErlangActor::Pid:0x7f58d87f terminated normally with true>
     Shared Example Group: "erlang actor" called from ./spec/concurrent/edge/erlang_actor_spec.rb:975
     # ./lib/concurrent-ruby-edge/concurrent/edge/erlang_actor.rb:706:in `ask'
     # ./lib/concurrent-ruby-edge/concurrent/edge/erlang_actor.rb:82:in `ask'
     # ./spec/concurrent/edge/erlang_actor_spec.rb:934:in `block in <main>'
     # org/jruby/RubyBasicObject.java:2614:in `instance_exec'
     # /home/runner/work/jruby/jruby/lib/ruby/gems/shared/gems/rspec-core-3.12.2/lib/rspec/core/example.rb:263:in `block in run'

headius · 2023-11-08T01:33:27Z

I discovered this PR and commits by @eregon omitting "flaky tests" on TruffleRuby some years ago. The ErlangActor failure above would be excluded by this.

#943

I can certainly add JRuby there, but it would be better to figure out why these tests are flaky and fix them.

headius · 2023-11-08T01:37:34Z

This error just showed up once on the Ruby 3.2 CI jobs for this PR:

  1) Concurrent::ReentrantReadWriteLock write lock wakes up waiting readers when the write lock is released
     Failure/Error: (1..3).each { |n| expect(threads[n].status).to eql "sleep" }

       expected: "sleep"
            got: "run"

       (compared using eql?)

headius · 2023-11-08T01:44:05Z

TR failure in https://github.com/ruby-concurrency/concurrent-ruby/actions/runs/6792582483 does not seem related to my changes.

bensheldon · 2023-11-08T02:29:20Z

TR failure in https://github.com/ruby-concurrency/concurrent-ruby/actions/runs/6792582483 does not seem related to my changes.

Agreed. Those look very racy and asserting that a thread is sleeping seems troublesome. I'd suggest rewriting those something like this:

  it "cannot be acquired when another thread holds a write lock" do
    latch   = CountDownLatch.new
    avoid_latch = CountDownLatch.new # expect this latch never to be reached
    threads = [
      in_thread { lock.acquire_write_lock; latch.count_down },
      in_thread { latch.wait; lock.acquire_write_lock; avoid_latch.count_down }
    ]
    expect { Timeout.timeout(1) { threads[0].join }}.not_to raise_error
    expect(threads[0]).to hold(lock).for_write
    expect(threads[1]).not_to hold(lock).for_write
    wait_up_to(0.2) { threads[1].status == 'sleep' }
    
    # more reliable expectations
    expect(avoid_latch.wait(0.2)).to be false
    expect(lock.try_write_lock).to be false
  end

But also could probably be in another PR---not that I am anyone to say one way or another.

headius · 2023-11-08T02:34:14Z

I'm not sure what would be causing the TR segfaults in https://github.com/ruby-concurrency/concurrent-ruby/actions/runs/6792582483 so I'l lleave that for TR folks to figure out. The other TR failure in https://github.com/ruby-concurrency/concurrent-ruby/actions/runs/6792654287/job/18466286918?pr=1012 might be another flaky test, as @bensheldon suggested above, but I'm not sure of the right fix and I have not seen it on JRuby yet.

If it's ok, I'd like to merge what we have here so JRuby can start to have some consistency in concurrent-ruby testing. If we see more sporadic failures, I'll open another issue.

eregon

Looks good

headius added 2 commits November 7, 2023 16:17

Wait for blockers to proceed

f746228

If this spec proceeds to reset the barrier before there are any waiters, the reset will not trigger them and the spec will fail. This at least tries to ensure that the waiter is there and will be interrupted as expected.

Bump join timeout up for slower system

ada6288

On slower or heavily-loaded systems, such as CI, this join may need more than one second. Give it ten before giving up and raising an error.

bensheldon reviewed Nov 8, 2023

View reviewed changes

spec/concurrent/atomic/cyclic_barrier_spec.rb Outdated Show resolved Hide resolved

Use repeat_until_success to wait for waiters

a8fa486

This avoids getting stuck and waiting forever if something goes awry.

headius mentioned this pull request Nov 8, 2023

Fix intermittent fails in CI jruby/jruby#8007

Merged

Also skip on JRuby until we can fix sporadic failures

27f7833

headius force-pushed the fix_sporadic_failures branch from a6603e4 to 27f7833 Compare November 8, 2023 01:44

headius merged commit dadc2ad into ruby-concurrency:master Nov 8, 2023
12 checks passed

headius deleted the fix_sporadic_failures branch November 8, 2023 02:59

eregon reviewed Nov 8, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix sporadic failures testing with JRuby #1012

Fix sporadic failures testing with JRuby #1012

headius commented Nov 8, 2023

headius commented Nov 8, 2023

headius commented Nov 8, 2023

headius commented Nov 8, 2023

headius commented Nov 8, 2023

bensheldon commented Nov 8, 2023

headius commented Nov 8, 2023

eregon left a comment

Fix sporadic failures testing with JRuby #1012

Fix sporadic failures testing with JRuby #1012

Conversation

headius commented Nov 8, 2023

headius commented Nov 8, 2023

headius commented Nov 8, 2023

headius commented Nov 8, 2023

headius commented Nov 8, 2023

bensheldon commented Nov 8, 2023

headius commented Nov 8, 2023

eregon left a comment

Choose a reason for hiding this comment