Fix dismatch between GC_suspend_ack_sem and n_live_threads. #633

delguoqing · 2024-04-11T03:10:11Z

The cause of this problem

This is a corner case I found in Android, but this also applies for other platforms, as long as they have chances of receiving signals and therefore pausing their threads.

When the user or system want to create a bugreport, it will send a SIGQUIT to each process. Because android app is forked from zygote, they all have a Signal Catcher thread that handles SIGQUIT. The signal catcher thread will in turn send SIGRT_1 to all of the other threads to pause them and and get the ucontext, then do a stack unwind.

1.When the GC_stopping_thread have finished collecting and want to restart the world, it first change GC_stop_count to GC_stop_count + 1.

2.Just before the GC_stopping_thread have a chance to raise signals to other threads by calling GC_restart_all. Other threads might have been awaken by other signals than GC_sig_thr_restart(in my case SIGRT_1), then because of the updated value of GC_stop_count. The following loop condition will not meet.

  do {
      sigsuspend(&suspend_handler_mask);
      /* Iterate while not restarting the world or thread is suspended. */
  } while (ao_load_acquire_async(&GC_stop_count) == my_stop_count
#          ifdef GC_ENABLE_SUSPEND_THREAD
             || ((suspend_cnt & 1) != 0
                 && (word)ao_load_async(&(me -> ext_suspend_cnt))
                    == suspend_cnt)
#          endif
          );

The threads can go straight to the end of the suspend_handler and update last_stop_count.

  {
    /* If the RESTART signal loss is possible (though it should be      */
    /* less likely than losing the SUSPEND signal as we do not do       */
    /* much between the first sem_post and sigsuspend calls), more      */
    /* handshaking is provided to work around it.                       */
    sem_post(&GC_suspend_ack_sem);
    /* Set the flag that the thread has been restarted. */
    if (GC_retry_signals)
      ao_store_release_async(&(me -> last_stop_count),
                             my_stop_count | THREAD_RESTARTED);
  }
  RESTORE_CANCEL(cancel_state);

The GC_stopping_thread then continues and call GC_restart_all, in order to send GC_sig_thr_restart to all other threads. However, it sees that some threads have their last_stop_count already updated and decide that it will not raise signal to it "again". So n_live_threads is not counting those threads that runs too fast in step 2.

    for (i = 0; i < THREAD_TABLE_SZ; i++) {
      for (p = GC_threads[i]; p != NULL; p = p -> tm.next) {
        if (!THREAD_EQUAL(p -> id, self)) {
          if ((p -> flags & (FINISHED | DO_BLOCKING)) != 0) continue;
#         ifdef GC_ENABLE_SUSPEND_THREAD
              if ((p -> ext_suspend_cnt & 1) != 0) continue;
#         endif
          if (GC_retry_signals
                && AO_load(&(p -> last_stop_count)) == GC_stop_count)
              continue; /* The thread has been restarted. */
          n_live_threads++;
#         ifdef DEBUG_THREADS
            GC_log_printf("Sending restart signal to %p\n", (void *)p->id);
#         endif

In the end, we can have n_live_threads less than GC_suspend_ack_sem. So we will call sem_wait only n_live_threads times, thus leaving GC_suspend_ack_sem a non-zero value after this round of GC.

My fix for this problem

I suggest that we raise signal to all of the threads for the first time, if we're not actually retrying.
I added a boolean argument for GC_suspend_all/GC_restart_all to indicate if we're actually doing the retry. If not, we ignore last_stop_count and always raise signals to suspending threads.

ivmai · 2024-08-26T07:15:45Z

I suggest that we raise signal to all of the threads for the first time, if we're not actually retrying.

I've switched to this problem, let me think how to do handle spurious wakeups properly.

PR #633 (bdwgc). When sending restart signal to all threads, there could be a situation when some user signal (e.g. SIGQUIT) has already restarted the thread: in this case we still need to count the thread in n_live_threads, so that to decrement the semaphore's value proper amount of times. * pthread_stop_world.c [!NACL] (in_resend_restart_signals): New static variable. * pthread_stop_world.c [!NACL] (GC_restart_all): If GC_retry_signals and p->last_stop_count matches GC_stop_count but not in_resend_restart_signals then proceed to n_live_threads++; add comment and FIXME. * pthread_stop_world.c [!NACL] (GC_start_world): Add assertion that in_resend_restart_signals is not sent (on entrance); make in_resend_restart_signals set while executing resend_lost_signals_retry(). Co-authored-by: Ivan Maidanski <[email protected]>

ivmai · 2024-08-27T09:46:37Z

I suggest that we raise signal to all of the threads for the first time, if we're not actually retrying.

I have significantly reworked your commit (by introducing a static variable instead of an argument) but the high-level logic remains the same. Please test it.

ivmai · 2024-08-27T09:47:57Z

Also, I've added a FIXME because there still could be a chance of a mismatch of ack_sem vs n_live_threads when we are retrying (i.e. a signal is lost and sem wait timed out and then the thread is restarted by SIGQUIT), but the chance of this is much smaller compared to this rare case.

(a cherry-pick of commit d07d2ff from 'master') PR #633 (bdwgc). When sending restart signal to all threads, there could be a situation when some user signal (e.g. SIGQUIT) has already restarted the thread: in this case we still need to count the thread in n_live_threads, so that to decrement the semaphore's value proper amount of times. * pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS] (in_resend_restart_signals): New static variable. * pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS] (GC_restart_all): If GC_retry_signals and p->stop_info.last_stop_count matches GC_stop_count|THREAD_RESTARTED but not in_resend_restart_signals then proceed to n_live_threads++; add comment and FIXME. * pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS] (GC_start_world): Add assertion that in_resend_restart_signals is not sent (on entrance); make in_resend_restart_signals set while executing resend_lost_signals_retry(). Co-authored-by: Ivan Maidanski <[email protected]>

(a cherry-pick of commit 984ab90 from 'release-8_2') PR #633 (bdwgc). When sending restart signal to all threads, there could be a situation when some user signal (e.g. SIGQUIT) has already restarted the thread: in this case we still need to count the thread in n_live_threads, so that to decrement the semaphore's value proper amount of times. * pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS] (in_resend_restart_signals): New static variable. * pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS] (GC_restart_all): If GC_retry_signals and p->stop_info.last_stop_count matches GC_stop_count|THREAD_RESTARTED but not in_resend_restart_signals then proceed to n_live_threads++; add comment and FIXME. * pthread_stop_world.c [!NACL && !GC_OPENBSD_UTHREADS] (GC_start_world): Add assertion that in_resend_restart_signals is not sent (on entrance); make in_resend_restart_signals set while executing resend_lost_signals_retry(). Co-authored-by: Ivan Maidanski <[email protected]>

Fix dismatch between GC_suspend_ack_sem and n_live_threads.

d9c039e

delguoqing mentioned this pull request Apr 11, 2024

rare case of infinite loop on Android #630

Closed

ivmai closed this Aug 27, 2024

ivmai mentioned this pull request Aug 27, 2024

Fix extremely rare case of infinite send lost signals if thread restarted by SIGQUIT after restart signal is lost #657

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix dismatch between GC_suspend_ack_sem and n_live_threads. #633

Fix dismatch between GC_suspend_ack_sem and n_live_threads. #633

delguoqing commented Apr 11, 2024

ivmai commented Aug 26, 2024

ivmai commented Aug 27, 2024

ivmai commented Aug 27, 2024

Fix dismatch between GC_suspend_ack_sem and n_live_threads. #633

Fix dismatch between GC_suspend_ack_sem and n_live_threads. #633

Conversation

delguoqing commented Apr 11, 2024

The cause of this problem

My fix for this problem

ivmai commented Aug 26, 2024

ivmai commented Aug 27, 2024

ivmai commented Aug 27, 2024