Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libpriv/rpm-util: Query package repo checksum under lock #4673

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Oct 25, 2023

In general, libsolv isn't thread-safe.[1] There are multiple functions which return pointers that can be invalidated by other threads. It's hard to follow the libsolv side of this, but I think solvable_lookup_bin_checksum() (called by dnf_package_get_chksum()) is such a case. Because we do importing in bulk using multi-threading, rpm-ostree triggers this bug.

This is the root cause of #4565, where during the cosa fetch we embed the wrong repodata chksum_repr in the cached RPM's OSTree commit. So then cosa build considers it a cache miss when it sees the mismatch and wants to redownload it.

A trivial way to reproduce #4565 more easily is with:

diff --git a/src/libpriv/rpmostree-rpm-util.cxx b/src/libpriv/rpmostree-rpm-util.cxx
index 658a5fd8..ad8dc104 100644
--- a/src/libpriv/rpmostree-rpm-util.cxx
+++ b/src/libpriv/rpmostree-rpm-util.cxx
@@ -1142,6 +1142,7 @@ get_repodata_chksum_repr (DnfPackage &pkg)
     chksum_raw = dnf_package_get_hdr_chksum (&pkg, &chksum_type);
   else
     chksum_raw = dnf_package_get_chksum (&pkg, &chksum_type);
+  g_usleep (5000);
   if (chksum_raw)
     chksum = hy_chksum_str (chksum_raw, chksum_type);

Fix this by locking down this critical section. As mentioned in the comment, a more correct fix would be to not call unsafe libsolv APIs in the importer, but that'd require a larger rework. The locking added here doesn't appear to affect performance.

I've checked that we don't do any other unsafe libsolv calls in the importer path that could be subject to this.

Fixes: #4565

In general, libsolv isn't thread-safe.[[1]] There are multiple
functions which return pointers that can be invalidated by other
threads. It's hard to follow the libsolv side of this, but I think
`solvable_lookup_bin_checksum()` (called by `dnf_package_get_chksum()`)
is such a case. Because we do importing in bulk using multi-threading,
rpm-ostree triggers this bug.

This is the root cause of coreos#4565, where during the `cosa fetch` we embed
the wrong repodata chksum_repr in the cached RPM's OSTree commit. So
then `cosa build` considers it a cache miss when it sees the mismatch
and wants to redownload it.

A trivial way to reproduce coreos#4565 more easily is with:

```diff
diff --git a/src/libpriv/rpmostree-rpm-util.cxx b/src/libpriv/rpmostree-rpm-util.cxx
index 658a5fd8..ad8dc104 100644
--- a/src/libpriv/rpmostree-rpm-util.cxx
+++ b/src/libpriv/rpmostree-rpm-util.cxx
@@ -1142,6 +1142,7 @@ get_repodata_chksum_repr (DnfPackage &pkg)
     chksum_raw = dnf_package_get_hdr_chksum (&pkg, &chksum_type);
   else
     chksum_raw = dnf_package_get_chksum (&pkg, &chksum_type);
+  g_usleep (5000);
   if (chksum_raw)
     chksum = hy_chksum_str (chksum_raw, chksum_type);
```

Fix this by locking down this critical section. As mentioned in the
comment, a more correct fix would be to not call unsafe libsolv APIs in
the importer, but that'd require a larger rework. The locking added here
doesn't appear to affect performance.

I've checked that we don't do any other unsafe libsolv calls in the
importer path that could be subject to this.

Fixes: coreos#4565

[1]: openSUSE/libsolv#471
Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome investigation!

(It's worth making the obvious statement here that this bug couldn't have happened with Rust.)

@cgwalters cgwalters enabled auto-merge (rebase) October 26, 2023 00:30
@cgwalters cgwalters merged commit 7e85d58 into coreos:main Oct 26, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

intermittent flake in separated fetch vs build phases
2 participants