Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading from Pool is not thread-safe when pool_alloctmpspace() is used #471

Open
dmach opened this issue Aug 11, 2021 · 5 comments · May be fixed by #472
Open

Reading from Pool is not thread-safe when pool_alloctmpspace() is used #471

dmach opened this issue Aug 11, 2021 · 5 comments · May be fixed by #472

Comments

@dmach
Copy link

dmach commented Aug 11, 2021

Several functions use pool_alloctmpspace() to return a string allocated in Pool's temporary area.
Such strings must be processed and discarded immediately.
Everything works fine as long as the code runs in a single thread.

We're currently writing new dnfdaemon - a DBus interface that processes incoming requests in an infinite loop and turns them into asynchronous threads. We are aware of potential concurrency issues, designing the code to avoid them as much as possible.

Our use case is: load repositories into Pool and let users run multiple queries at a time.
We believe that pool_alloctmpspace() might become an issue when used in threads and that impacts all libsolv functions that use it, incl. pool_solvable2str() for example.

Locking doesn't seem to be a good idea, because a user would have to always unlock it explicitly after the value is consumed and it also decreases performance.
I'm wondering if thread_local annotation on pool->tmpspace couldn't be used to eliminate race conditions among threads.
Maybe also returning a newly allocated string might be an option, but I don't know the performance impact.
Or do you have any better idea?

@dmach dmach linked a pull request Aug 17, 2021 that will close this issue
@ignatenkobrain
Copy link
Collaborator

I think more correct approach (which will also make it "safe" for multiple threads, assuming that you don't touch string pool, e.g. by creating new packages) is to mimic pool_solvable2str() yourself and use your temporary memory for those.

I think we could probably have 2 variations of these functions, so that caller can choose which one he wants (thread-safe or fast)…

They way current function is designed is to use as less memory as possible and to be as fast as possible.

@dmach
Copy link
Author

dmach commented Aug 20, 2021

I don't think it's as easy as you suggested.
pool->tmpspace is used across the whole code incl. many lookups.

@j-mracek
Copy link
Contributor

@mlschroe I know multi threading is not an easy topic but I would like to ask you for your opinion. We are preparing a new DNF daemon and multi threading support in libsolv is a feature that is critical for the design. Do you think whether is it possible or feasible to make libsolv thread safe? Right now we see an issue with pool->tmpspace but there can be other workflows in libsolv that can not be compatible with multi thread usage. Anyway we will try our best to resolve the problem but at beginning we need to understand how big problem is in front of us.

@mlschroe
Copy link
Member

I don't think you can easily make libsolv multi-thread safe. You can do this for the tmpspace, but there's more to it. For example provider lookup caches the result is the whatprovides hash.

If we want to go this route we need to go through all the API functions and find out if they need locking. The original design was "a pool must only be accessed by one thread at a time", so there might be some ugly corner cases...

@j-mracek
Copy link
Contributor

Thank you very much for information. OK, we will try to not use libsolv in multi-thread.

jlebon added a commit to jlebon/rpm-ostree that referenced this issue Oct 25, 2023
In general, libsolv isn't thread-safe.[[1]] There are multiple
functions which return pointers that can be invalidated by other
threads. It's hard to follow the libsolv side of this, but I think
`solvable_lookup_bin_checksum()` (called by `dnf_package_get_chksum()`)
is such a case. Because we do importing in bulk using multi-threading,
rpm-ostree triggers this bug.

This is the root cause of coreos#4565, where during the `cosa fetch` we embed
the wrong repodata chksum_repr in the cached RPM's OSTree commit. So
then `cosa build` considers it a cache miss when it sees the mismatch
and wants to redownload it.

A trivial way to reproduce coreos#4565 more easily is with:

```diff
diff --git a/src/libpriv/rpmostree-rpm-util.cxx b/src/libpriv/rpmostree-rpm-util.cxx
index 658a5fd8..ad8dc104 100644
--- a/src/libpriv/rpmostree-rpm-util.cxx
+++ b/src/libpriv/rpmostree-rpm-util.cxx
@@ -1142,6 +1142,7 @@ get_repodata_chksum_repr (DnfPackage &pkg)
     chksum_raw = dnf_package_get_hdr_chksum (&pkg, &chksum_type);
   else
     chksum_raw = dnf_package_get_chksum (&pkg, &chksum_type);
+  g_usleep (5000);
   if (chksum_raw)
     chksum = hy_chksum_str (chksum_raw, chksum_type);
```

Fix this by locking down this critical section. As mentioned in the
comment, a more correct fix would be to not call unsafe libsolv APIs in
the importer, but that'd require a larger rework. The locking added here
doesn't appear to affect performance.

I've checked that we don't do any other unsafe libsolv calls in the
importer path that could be subject to this.

Fixes: coreos#4565

[1]: openSUSE/libsolv#471
cgwalters pushed a commit to coreos/rpm-ostree that referenced this issue Oct 26, 2023
In general, libsolv isn't thread-safe.[[1]] There are multiple
functions which return pointers that can be invalidated by other
threads. It's hard to follow the libsolv side of this, but I think
`solvable_lookup_bin_checksum()` (called by `dnf_package_get_chksum()`)
is such a case. Because we do importing in bulk using multi-threading,
rpm-ostree triggers this bug.

This is the root cause of #4565, where during the `cosa fetch` we embed
the wrong repodata chksum_repr in the cached RPM's OSTree commit. So
then `cosa build` considers it a cache miss when it sees the mismatch
and wants to redownload it.

A trivial way to reproduce #4565 more easily is with:

```diff
diff --git a/src/libpriv/rpmostree-rpm-util.cxx b/src/libpriv/rpmostree-rpm-util.cxx
index 658a5fd8..ad8dc104 100644
--- a/src/libpriv/rpmostree-rpm-util.cxx
+++ b/src/libpriv/rpmostree-rpm-util.cxx
@@ -1142,6 +1142,7 @@ get_repodata_chksum_repr (DnfPackage &pkg)
     chksum_raw = dnf_package_get_hdr_chksum (&pkg, &chksum_type);
   else
     chksum_raw = dnf_package_get_chksum (&pkg, &chksum_type);
+  g_usleep (5000);
   if (chksum_raw)
     chksum = hy_chksum_str (chksum_raw, chksum_type);
```

Fix this by locking down this critical section. As mentioned in the
comment, a more correct fix would be to not call unsafe libsolv APIs in
the importer, but that'd require a larger rework. The locking added here
doesn't appear to affect performance.

I've checked that we don't do any other unsafe libsolv calls in the
importer path that could be subject to this.

Fixes: #4565

[1]: openSUSE/libsolv#471
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants