Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dismatch between GC_suspend_ack_sem and n_live_threads. #633

Closed
wants to merge 1 commit into from

Conversation

delguoqing
Copy link
Contributor

The cause of this problem

This is a corner case I found in Android, but this also applies for other platforms, as long as they have chances of receiving signals and therefore pausing their threads.

When the user or system want to create a bugreport, it will send a SIGQUIT to each process. Because android app is forked from zygote, they all have a Signal Catcher thread that handles SIGQUIT. The signal catcher thread will in turn send SIGRT_1 to all of the other threads to pause them and and get the ucontext, then do a stack unwind.

1.When the GC_stopping_thread have finished collecting and want to restart the world, it first change GC_stop_count to GC_stop_count + 1.

2.Just before the GC_stopping_thread have a chance to raise signals to other threads by calling GC_restart_all. Other threads might have been awaken by other signals than GC_sig_thr_restart(in my case SIGRT_1), then because of the updated value of GC_stop_count. The following loop condition will not meet.

  do {
      sigsuspend(&suspend_handler_mask);
      /* Iterate while not restarting the world or thread is suspended. */
  } while (ao_load_acquire_async(&GC_stop_count) == my_stop_count
#          ifdef GC_ENABLE_SUSPEND_THREAD
             || ((suspend_cnt & 1) != 0
                 && (word)ao_load_async(&(me -> ext_suspend_cnt))
                    == suspend_cnt)
#          endif
          );

The threads can go straight to the end of the suspend_handler and update last_stop_count.

  {
    /* If the RESTART signal loss is possible (though it should be      */
    /* less likely than losing the SUSPEND signal as we do not do       */
    /* much between the first sem_post and sigsuspend calls), more      */
    /* handshaking is provided to work around it.                       */
    sem_post(&GC_suspend_ack_sem);
    /* Set the flag that the thread has been restarted. */
    if (GC_retry_signals)
      ao_store_release_async(&(me -> last_stop_count),
                             my_stop_count | THREAD_RESTARTED);
  }
  RESTORE_CANCEL(cancel_state);
  1. The GC_stopping_thread then continues and call GC_restart_all, in order to send GC_sig_thr_restart to all other threads. However, it sees that some threads have their last_stop_count already updated and decide that it will not raise signal to it "again". So n_live_threads is not counting those threads that runs too fast in step 2.
    for (i = 0; i < THREAD_TABLE_SZ; i++) {
      for (p = GC_threads[i]; p != NULL; p = p -> tm.next) {
        if (!THREAD_EQUAL(p -> id, self)) {
          if ((p -> flags & (FINISHED | DO_BLOCKING)) != 0) continue;
#         ifdef GC_ENABLE_SUSPEND_THREAD
              if ((p -> ext_suspend_cnt & 1) != 0) continue;
#         endif
          if (GC_retry_signals
                && AO_load(&(p -> last_stop_count)) == GC_stop_count)
              continue; /* The thread has been restarted. */
          n_live_threads++;
#         ifdef DEBUG_THREADS
            GC_log_printf("Sending restart signal to %p\n", (void *)p->id);
#         endif
  1. In the end, we can have n_live_threads less than GC_suspend_ack_sem. So we will call sem_wait only n_live_threads times, thus leaving GC_suspend_ack_sem a non-zero value after this round of GC.

My fix for this problem

I suggest that we raise signal to all of the threads for the first time, if we're not actually retrying.
I added a boolean argument for GC_suspend_all/GC_restart_all to indicate if we're actually doing the retry. If not, we ignore last_stop_count and always raise signals to suspending threads.

@ivmai
Copy link
Owner

ivmai commented Aug 26, 2024

I suggest that we raise signal to all of the threads for the first time, if we're not actually retrying.

I've switched to this problem, let me think how to do handle spurious wakeups properly.

ivmai added a commit that referenced this pull request Aug 27, 2024
PR #633 (bdwgc).

When sending restart signal to all threads, there could be a situation
when some user signal (e.g. SIGQUIT) has already restarted the thread:
in this case we still need to count the thread in n_live_threads, so
that to decrement the semaphore's value proper amount of times.

* pthread_stop_world.c [!NACL] (in_resend_restart_signals): New static
variable.
* pthread_stop_world.c [!NACL] (GC_restart_all): If GC_retry_signals
and p->last_stop_count matches GC_stop_count but not
in_resend_restart_signals then proceed to n_live_threads++; add comment
and FIXME.
* pthread_stop_world.c [!NACL] (GC_start_world): Add assertion
that in_resend_restart_signals is not sent (on entrance); make
in_resend_restart_signals set while executing
resend_lost_signals_retry().

Co-authored-by: Ivan Maidanski <[email protected]>
@ivmai
Copy link
Owner

ivmai commented Aug 27, 2024

I suggest that we raise signal to all of the threads for the first time, if we're not actually retrying.

I have significantly reworked your commit (by introducing a static variable instead of an argument) but the high-level logic remains the same. Please test it.

@ivmai
Copy link
Owner

ivmai commented Aug 27, 2024

Also, I've added a FIXME because there still could be a chance of a mismatch of ack_sem vs n_live_threads when we are retrying (i.e. a signal is lost and sem wait timed out and then the thread is restarted by SIGQUIT), but the chance of this is much smaller compared to this rare case.

@ivmai ivmai closed this Aug 27, 2024
ivmai added a commit that referenced this pull request Sep 1, 2024
(a cherry-pick of commit d07d2ff from 'master')

PR #633 (bdwgc).

When sending restart signal to all threads, there could be a situation
when some user signal (e.g. SIGQUIT) has already restarted the thread:
in this case we still need to count the thread in n_live_threads, so
that to decrement the semaphore's value proper amount of times.

* pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS]
(in_resend_restart_signals): New static variable.
* pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS]
(GC_restart_all): If GC_retry_signals and p->stop_info.last_stop_count
matches GC_stop_count|THREAD_RESTARTED but not
in_resend_restart_signals then proceed to n_live_threads++; add comment
and FIXME.
* pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS]
(GC_start_world): Add assertion that in_resend_restart_signals is not
sent (on entrance); make in_resend_restart_signals set while executing
resend_lost_signals_retry().

Co-authored-by: Ivan Maidanski <[email protected]>
ivmai added a commit that referenced this pull request Sep 4, 2024
(a cherry-pick of commit 984ab90 from 'release-8_2')

PR #633 (bdwgc).

When sending restart signal to all threads, there could be a situation
when some user signal (e.g. SIGQUIT) has already restarted the thread:
in this case we still need to count the thread in n_live_threads, so
that to decrement the semaphore's value proper amount of times.

* pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS]
(in_resend_restart_signals): New static variable.
* pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS]
(GC_restart_all): If GC_retry_signals and p->stop_info.last_stop_count
matches GC_stop_count|THREAD_RESTARTED but not
in_resend_restart_signals then proceed to n_live_threads++; add comment
and FIXME.
* pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS]
(GC_start_world): Add assertion that in_resend_restart_signals is not
sent (on entrance); make in_resend_restart_signals set while executing
resend_lost_signals_retry().

Co-authored-by: Ivan Maidanski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants