
fixed terminating stale threads on trap/proc_exit #1929

Merged (1 commit, Feb 7, 2023)

Conversation

hritikgupta (Contributor) commented Feb 1, 2023:

This is to terminate suspended threads in case an atomic wait occurs with a huge or indefinite (-1) timeout, followed by a proc_exit or trap.
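The problem being fixed can be sketched with plain pthreads (this is a minimal illustration of the idea, not WAMR's actual implementation): a thread blocked with an indefinite timeout can only be released by an explicit notify, so on trap/proc_exit the runtime must wake every suspended waiter. The `WaitInfo`/`terminate_all_waiters` names below are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool terminating;   /* set on trap/proc_exit */
    int woken;          /* how many waiters were released */
} WaitInfo;

/* Block "indefinitely" (like atomic.wait with timeout -1) until notified. */
static void *
waiter(void *arg)
{
    WaitInfo *w = (WaitInfo *)arg;
    pthread_mutex_lock(&w->lock);
    while (!w->terminating)
        pthread_cond_wait(&w->cond, &w->lock);
    w->woken++;
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* What the fix does conceptually: on proc_exit/trap, wake all waiters. */
static int
terminate_all_waiters(void)
{
    WaitInfo w = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                   false, 0 };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, waiter, &w);

    pthread_mutex_lock(&w.lock);
    w.terminating = true;             /* mark shutdown */
    pthread_cond_broadcast(&w.cond);  /* wake every suspended thread */
    pthread_mutex_unlock(&w.lock);

    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return w.woken;
}
```

Without the broadcast, both waiters would block forever and the process could never exit cleanly.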

/**
 * create a copy of wait_map with no lock;
 * this allows invoking methods in the callback
 * which require acquiring a lock on the original map
 */

Collaborator replied:

I don't see any lock in the notify_stale_threads callback. Should we rather take the shared_memory_list_lock mutex at the beginning of the function and release it at the end? Otherwise the map could be modified by another thread, corrupting the traversal.

hritikgupta (Contributor Author) replied:

The wasm_runtime_atomic_notify step in the callback calls acquire_wait_info, which tries to acquire shared_memory_list_lock, so wrapping the traversal in that lock would cause a deadlock.
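The snapshot approach under discussion can be sketched as follows (a toy fixed-size "map" and hypothetical names like `notify_one`, not the bh_hash_map API): hold the lock only while copying the entries, release it, then run the callback on the snapshot so the callback is free to re-acquire the same lock without self-deadlocking.

```c
#include <pthread.h>
#include <string.h>

#define MAX_WAITERS 8

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static int wait_map[MAX_WAITERS];
static int wait_count;

/* A callback that needs the lock itself (like acquire_wait_info does). */
static int
notify_one(int addr)
{
    pthread_mutex_lock(&list_lock);
    /* ... look up and wake the waiter registered at 'addr' ... */
    pthread_mutex_unlock(&list_lock);
    return addr;
}

static int
notify_all_from_snapshot(void)
{
    int snapshot[MAX_WAITERS], n, sum = 0;

    pthread_mutex_lock(&list_lock);           /* copy under the lock */
    n = wait_count;
    memcpy(snapshot, wait_map, n * sizeof(int));
    pthread_mutex_unlock(&list_lock);         /* release before callbacks */

    for (int i = 0; i < n; i++)
        sum += notify_one(snapshot[i]);       /* no deadlock: lock is free */
    return sum;
}

static int
demo(void)
{
    wait_map[0] = 1; wait_map[1] = 2; wait_map[2] = 3;
    wait_count = 3;
    return notify_all_from_snapshot();
}
```

Traversing `wait_map` directly while holding `list_lock` would deadlock inside `notify_one`; the snapshot breaks that cycle.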

Contributor replied:

HashMap doesn't duplicate the key/value pairs, so wait_map_without_lock will point to the same elements as wait_map.

loganek (Collaborator) replied:

I think you at least want to lock while copying the hashmap. Also, what happens if a new waiter is added to the map after the copy is made?

hritikgupta (Contributor Author) replied Feb 2, 2023:

@loganek While copying, the original map already acquires its lock as it is traversed, and every insert into the copy locks the copy map, so the copy should be consistent. If you meant wrapping the whole copy with shared_memory_list_lock, that's addressed.

What could be concerning is that during the traversal of the copy, the original map could be updated. If I hold a lock on the original map during that traversal, execution gets stuck, because the callback steps in notify_stale_threads try to lock the original map (this is exactly why I had to create a copy). Entries newly added to the original map might not be handled at that point in time, but they would be handled eventually, right?

Contributor replied:

Either way, I'm still not sure it would be safe, since you're not cloning the map but copying the pointers, if I understand correctly.
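The concern here is that copying a map of pointers is a shallow copy: both maps alias the same value objects, so mutations made through one are visible through the other. A tiny sketch (hypothetical `WaitInfo2` type, one-element "maps"):

```c
#include <stdlib.h>

typedef struct { int notified; } WaitInfo2;

static int
shallow_copy_aliases(void)
{
    WaitInfo2 *orig[1], *copy[1];
    orig[0] = calloc(1, sizeof(WaitInfo2));

    copy[0] = orig[0];            /* "copying" duplicates the pointer only */
    orig[0]->notified = 1;        /* mutate through the original map */

    int seen = copy[0]->notified; /* the change is visible via the copy */
    free(orig[0]);
    return seen;
}
```

So the copied map protects the traversal structure itself, but not the pointed-to wait entries, which still need their own synchronization.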

@hritikgupta hritikgupta requested review from loganek and eloparco and removed request for loganek and eloparco February 3, 2023 02:02
hritikgupta (Contributor Author):

Closing and reopening to reinitiate the build.

@hritikgupta hritikgupta closed this Feb 3, 2023
@hritikgupta hritikgupta reopened this Feb 3, 2023
    {
        AtomicWaitAddressArgs *data = (AtomicWaitAddressArgs *)user_data;
        if (data->len > 0) {
            if (!(data->addr = wasm_runtime_realloc(data->addr,
Contributor commented:
When should this memory be freed? And why do nothing except LOG_ERROR when malloc/realloc fails?
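A related pitfall in the quoted hunk: assigning the result of realloc directly back to `data->addr` loses the old pointer when realloc fails, leaking the buffer. A hedged sketch of the usual safe pattern (hypothetical `grow_or_free` helper, plain libc rather than wasm_runtime_realloc):

```c
#include <stdlib.h>
#include <string.h>

static void *
grow_or_free(void *buf, size_t new_size)
{
    void *tmp = realloc(buf, new_size);
    if (!tmp) {
        free(buf);   /* on failure the old block is still valid: release it */
        return NULL; /* and report the error to the caller */
    }
    return tmp;      /* on success the caller adopts the new block */
}

static int
demo_grow(void)
{
    char *p = malloc(4);
    strcpy(p, "abc");
    p = grow_or_free(p, 64);          /* safe even if realloc had failed */
    int ok = (p != NULL && p[0] == 'a');
    free(p);
    return ok;
}
```

The caller still has to free the final buffer once the collected addresses have been consumed, which is the reviewer's first question.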

    memset(args, 0, sizeof(*args));
    os_mutex_lock(&shared_memory_list_lock);
    /* create list of addresses */
    bh_hash_map_traverse(wait_map, create_list_of_waiter_addresses, args);
wenyongh (Contributor) commented Feb 6, 2023:
How about traversing twice: the first time to get the total element count of wait_map, use it to allocate the memory for AtomicWaitAddressArgs *args with size == offsetof(AtomicWaitAddressArgs, addr) + sizeof(void *) * total_elem_count, and then traverse a second time without allocating any memory?

    static void
    xxx_callback(void *key, void *value, void *p_total_elem_count)
    {
        *(uint32 *)p_total_elem_count += 1;
    }

    os_mutex_lock(&shared_memory_list_lock);
    total_elem_count = 0;
    bh_hash_map_traverse(wait_map, xxx_callback, (void *)&total_elem_count);
    /* allocate memory for args */
    bh_hash_map_traverse(wait_map, ..., args); /* set each data->addr[i] */
    os_mutex_unlock(&shared_memory_list_lock);

hritikgupta (Contributor Author) replied Feb 6, 2023:

Just wondering, is there any specific advantage to traversing twice? Is it to ensure the final traversal of data->addr is safe, i.e. that we don't go out of bounds?

Collaborator replied:

The purpose is to avoid multiple reallocations: knowing the size in advance lets you make a single allocation.
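The count-then-fill idea can be demonstrated with a toy traversal (hypothetical `traverse`/`collect` helpers over an int array, standing in for bh_hash_map_traverse over the wait map): the first pass only counts, a single allocation is sized from that count, and the second pass fills the array, so realloc is never needed.

```c
#include <stdlib.h>

static void
count_cb(int value, void *user_data)
{
    (void)value;
    *(unsigned *)user_data += 1;          /* pass 1: just count */
}

static void
traverse(const int *map, unsigned n, void (*cb)(int, void *), void *ud)
{
    for (unsigned i = 0; i < n; i++)
        cb(map[i], ud);
}

static unsigned
collect(const int *map, unsigned n, int **out)
{
    unsigned count = 0;
    traverse(map, n, count_cb, &count);   /* pass 1: count entries */
    *out = malloc(count * sizeof(int));   /* single, exact allocation */
    for (unsigned i = 0; i < count; i++)  /* pass 2: fill the array */
        (*out)[i] = map[i];
    return count;
}
```

Both passes happen under the same lock in wenyongh's suggestion, so the count cannot go stale between them.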

eloparco (Contributor) commented Feb 7, 2023:

Does this PR work in AOT mode?

Also, can you try it with the sample thread_termination.c after replacing

    for (int i = 0; i < TIMEOUT_SECONDS; i++)

with a wait operation? And try with TEST_TERMINATION_IN_MAIN_THREAD set to both 0 and 1:

    #define TEST_TERMINATION_IN_MAIN_THREAD 1

Just to make sure that the base cases are covered.

@hritikgupta hritikgupta changed the base branch from dev/wasi_threads to main February 7, 2023 11:17
@wenyongh wenyongh merged commit f3c1ad4 into bytecodealliance:main Feb 7, 2023
victoryang00 pushed a commit to victoryang00/wamr-aot-gc-checkpoint-restore that referenced this pull request May 27, 2024
This is to terminate suspended threads in case an atomic wait occurs with
a huge or indefinite (-1) timeout, followed by a proc exit or trap.