-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add --enable-asan
and --enable-ubsan
switches
#12928
Conversation
Bonus points for custom kernel packages for kASAN while we're at it... |
|
I was nearly entirely joking, though I wouldn't have been sad if your reply was "oh yeah sure."; I had figured that after a cleanup period it'd probably end up with one or two daily/weekly runs with kASAN, just because it slows things down so much... I'm just glad to see someone diving into this. :) |
Well, this probably cannot be achieved in workflows as we would need to have kASan-enabled kernel. If @behlendorf agrees to revising current buildbot slaves and to adding kASan-enabled VM I'm all for giving it a shot. |
Sure, you'd need to either purely use the buildbot workflow for it or convince the Github CI instances to kexec, and I somehow suspect they would frown if they ever found out. Everyone seemed positive on using kASAN the last time it was discussed (I think there might have been a discussion in another thread too but I don't immediately see that...) |
Given the threading issues we've had over the years, KTSAN might be a handy thing to have running for catching data races. It actually shouldn't hurt performance nearly as much as something like kASAN either, and might help suss out "strangeness" as the OS threading mechanisms change under/around ZFS and through internal code churn. |
+1 to more automatic testing is more or less always my opinion until proven
unhelpful.
…On Mon, Jan 3, 2022 at 7:42 PM RageLtMan ***@***.***> wrote:
Given the threading issues we've had over the years, KTSAN
<https://github.com/google/kernel-sanitizers/tree/master/KTSAN> might be
a handy thing to have running for catching data races. It actually
shouldn't hurt performance nearly as much as something like kASAN either,
and might help suss out "strangeness" as the OS threading mechanisms change
under/around ZFS and through internal code churn.
—
Reply to this email directly, view it on GitHub
<#12928 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AABUI7JXXUBXJHFR5EOKSM3UUI7ADANCNFSM5LC4YUSQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
You are receiving this because you commented.Message ID:
***@***.***>
|
I'm afraid kASan-instrumented zfs-tests require more resources than I can afford at the moment. Sanity tests croaked when run on 12 vCPU/24 GiB VM.
|
Conveniently, I have a surfeit of RAM. I'll go build a custom kernel with kASAN and give it a go... |
We've had a kmemleak enabled builder in the past and the issue there really was performance. I'd be all for adding a kASAN builder if 1) the tests pass with it enabled, 2) it runs in a reasonable amount of time, and 3) we build packages for some distribution so it's easy to install. Starting with just running the |
e: @behlendorf, what would you think about something like a runner that triggers once a {day, week, ...} with kASAN (a/o kmemleak, though I have no experience with that one) enabled, rather than on every commit pushed to a PR? It wouldn't tell you before merging that something was on fire, or even which merged bit was, but it'd be better than nothing, and would avoid issues with "well, the runners are now forever backlogged on kASAN runs because they take an eternity"... As far as easy packaging, I'd probably start with handbuilt kernel packages for them on top of $PICK_YOUR_DISTRO (my leaning obviously being Debian at the moment, but there's not really much limiting it, other than "probably not Fedora"...), and then eventually move up to a helper that periodically cuts new baselines like whatever grabs the new FreeBSD snapshots periodically. Aww, my test system got to
in
e: the part that actually made it die in a fire appears to have been:
|
I dont know if this is true nowadays, i've not run vbox in many years in part due to this, but the visor used to be quite unsafe with IO and memory operations (not honoring sync/direct, doing weird things with memory layout, leaving a bunch of visor memory accessible to the VM, etc) - so might not be the best thing on which to test if any of that is still problematic. How much memory are you giving the poor thing, and how much do we think it needs? If this is too expensive to run in public cloud, we might be able to donate some pages and cycles in our stacks (similar concern though in that we run grsec at the phys page table layer so our MM is quite different from upstream, way more so after AUTOSLAB came into being). |
I thought I had given it 16 GB, but apparently it didn't save, and that run was with a puny 4, so no wonder it died in a fire. I have since given it 24. (As far as dangerous life choices, I've not made it crash and burn in general without enabling features that have flashing "experimental do not use" lights over them - in particular, TRIM on disk images and using host IO cache, neither of which is a default.) |
As I let the kASAN testsuite rip (and that's very loose usage of "rip"), this is what I get from just running
|
@nabijaczleweli @behlendorf
|
AFAICT (which is to say: not that far, but I looked my best), there is no actual leak there. Even replacing all the zpool_get_all_props() with zpool_props_refresh(), which does free the previous zpool_props, doesn't make the asan warning go away. Oddly, valgrind just gives me shit, somehow managing to break free() semi-entirely, it seems, and nothing gets freed. It enters both regcomp() and regfree() in libzfs_init/_fini but seemingly nothing actually gets freed. Likewise with the newargv in zpool main(). |
Caught by valgrind while investigating openzfs#12928 (comment) Signed-off-by: Ahelenia Ziemiańska <[email protected]>
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Signed-off-by: Ahelenia Ziemiańska <[email protected]>
@behlendorf
EDIT
The testbed did not collapse. I think the ARC-related problems in the test cases can be attributed to the low
Kmemleak scans and were run every 60 seconds and saved to a file,
|
Heh. Digging deeper into it, and fixing some other stuff, |
We actually still have a kmemleak enabled builder as part of the CI. It's just not enabled by default since we turned off the code coverage analysis. If you add the following line to your commit message you can request it. This would request only the
|
here's a minimal reproducer (yes its atrocious DO NOT bully me): #include <stdio.h>
void * libzfs_init(void);
void libzfs_fini(void *);
typedef int (*zfs_iter_f)(void *, void *);
void zfs_foreach_mountpoint(void *, void **, unsigned long, zfs_iter_f, void *, int);
void *zfs_open(void *, const char *, int);
void zfs_close(void *);
static int iter(void *a, void * b) {
printf("%p, %p\n", a, b);
return 0;
}
int main(int argc, char ** argv) {
void * zfs = libzfs_init();
void * h = zfs_open(zfs, "scratchpsko", 1);
zfs_foreach_mountpoint(zfs, &h, 1, iter, 0, argc - 1);
zfs_close(h);
libzfs_fini(zfs);
}
Notably, the leak volume is 7048, just like Also, under valgrind on my normal system (zfs 2.1.2-1 + some backports, bullseye), I get
which is as-expected and
Which matches the ASAN stats. Update: if you comment out tpool_create/tpool_wait/tpool_destroy in |
Codecov Report
@@ Coverage Diff @@
## master #12928 +/- ##
==========================================
+ Coverage 75.17% 76.85% +1.67%
==========================================
Files 402 403 +1
Lines 128071 129826 +1755
==========================================
+ Hits 96283 99774 +3491
+ Misses 31788 30052 -1736
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
@nabijaczleweli
Case 1
Case 2
|
They're later |=d with constants, but never reset Caught by valgrind while investigating #12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12954
--enable-asan
and --enable-ubsan
switches
@behlendorf, I think this PR is ready for a general review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM at least for CI's part.
"be specified by using a name containing a colon (:).\n")); | ||
(void) fprintf(fp, "%s", gettext("\nUser-defined properties " | ||
"can be specified by using a name containing a colon " | ||
"(:).\n")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add the "%s" format specifier to just this one fprintf()
? I see several others were updated as well, were warnings issues for these in particular?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it gives warnings for those (for me these were a distribution of "non-constant used as format" and "potentially NULL format", both are bogus in this case). OTOH, if there isn't anything to format here, it's better to make this a fputs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is GCC bug. GCC enables additional format string checks when -fsanitize=undefined
is used. Unfortunately there are false positives here and there (tested on GCC 10 and 11), e. g. raspberrypi/userland#631 (comment)
It looks incredibly lame, I did not find a better solution/workaround, unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that's unfortunate but we can live with it.
@@ -631,6 +631,8 @@ fatal(int do_perror, char *message, ...) | |||
|
|||
(void) fflush(stdout); | |||
buf = umem_alloc(FATAL_MSG_SZ, UMEM_NOFAIL); | |||
if (buf == NULL) | |||
goto out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UMEM_NOFAIL
flag ensures this can't fail. There are several other places in the ztest.c
which use this flag and don't check the return value. I'm guessing the unsan checks just flagged this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, UBSan has no way to determine that umem_alloc()
won't fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'm just surprised it only flagged this one case where there are clearly several others.
@@ -41,8 +41,7 @@ log_must display_status "$TESTPOOL" | |||
# | |||
|
|||
log_must zfs create -o dedup=on -V 2G $TESTPOOL/$TESTVOL | |||
|
|||
log_must eval "new_fs $ZVOL_DEVDIR/$TESTPOOL/$TESTVOL >/dev/null 2>&1" | |||
log_must eval "new_fs $ZVOL_DEVDIR/$TESTPOOL/$TESTVOL >/dev/null" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unrelated, but we should add a block_device_wait $ZVOL_DEVDIR/$TESTPOOL/$TESTVOL
here to make sure the block device is available before using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
block_device_wait $ZVOL_DEVDIR/$TESTPOOL/$TESTVOL
applied.
Not ignoring stderr gives us much better diagnostic. When the test case failed I had to wonder what were the reasons for file system creation failure and I ended up removing >/dev/null
. I know it might be obvious that it is due to missing block device, time saved on inserting diagnostics is time earned.
@@ -35,7 +35,7 @@ | |||
DISK1=${DISKS%% *} | |||
|
|||
log_must zpool create -f $TESTPOOL $DISK1 | |||
log_must zpool trim $TESTPOOL | |||
log_must zpool trim -r 1 "$TESTPOOL" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumably also unrelated, but address a false positive you were seeing in the test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes the trim ends before the next command checks if trim is happening.
ASan/UBSan makes really interesting behaviors bubble up :)
`configure` now accepts `--enable-asan` and `--enable-ubsan` switches which results in passing `-fsanitize=address` and `-fsanitize=undefined`, respectively, to the compiler. Those flags are enabled in GitHub workflows for ZTS and zloop. Errors reported by both instrumentations are corrected, except for: - Memory leak reporting is (temporarily) suppressed. The cost of fixing them is relatively high compared to the gains. - Checksum computing functions in `module/zcommon/zfs_fletcher*` have UBSan errors suppressed. It is completely impractical to enforce 64-byte payload alignment there due to performance impact. - There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom memory allocator is used there rendering that measure unfeasible. - Memory leaks detection has to be suppressed for `cmd/zvol_id`. `zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is incompatible with memory leaks detection. Signed-off-by: szubersk <[email protected]>
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes openzfs#12954
"be specified by using a name containing a colon (:).\n")); | ||
(void) fprintf(fp, "%s", gettext("\nUser-defined properties " | ||
"can be specified by using a name containing a colon " | ||
"(:).\n")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that's unfortunate but we can live with it.
@@ -631,6 +631,8 @@ fatal(int do_perror, char *message, ...) | |||
|
|||
(void) fflush(stdout); | |||
buf = umem_alloc(FATAL_MSG_SZ, UMEM_NOFAIL); | |||
if (buf == NULL) | |||
goto out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'm just surprised it only flagged this one case where there are clearly several others.
They're later |=d with constants, but never reset Caught by valgrind while investigating #12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes #12954
Thank you all for invaluable tips and ideas pitched in this PR! Things that could be improved/explored later on:
|
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes openzfs#12954
`configure` now accepts `--enable-asan` and `--enable-ubsan` switches which results in passing `-fsanitize=address` and `-fsanitize=undefined`, respectively, to the compiler. Those flags are enabled in GitHub workflows for ZTS and zloop. Errors reported by both instrumentations are corrected, except for: - Memory leak reporting is (temporarily) suppressed. The cost of fixing them is relatively high compared to the gains. - Checksum computing functions in `module/zcommon/zfs_fletcher*` have UBSan errors suppressed. It is completely impractical to enforce 64-byte payload alignment there due to performance impact. - There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom memory allocator is used there rendering that measure unfeasible. - Memory leaks detection has to be suppressed for `cmd/zvol_id`. `zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is incompatible with memory leaks detection. Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes openzfs#12928
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes openzfs#12954
`configure` now accepts `--enable-asan` and `--enable-ubsan` switches which results in passing `-fsanitize=address` and `-fsanitize=undefined`, respectively, to the compiler. Those flags are enabled in GitHub workflows for ZTS and zloop. Errors reported by both instrumentations are corrected, except for: - Memory leak reporting is (temporarily) suppressed. The cost of fixing them is relatively high compared to the gains. - Checksum computing functions in `module/zcommon/zfs_fletcher*` have UBSan errors suppressed. It is completely impractical to enforce 64-byte payload alignment there due to performance impact. - There's no ASan heap poisoning in `module/zstd/lib/zstd.c`. A custom memory allocator is used there rendering that measure unfeasible. - Memory leaks detection has to be suppressed for `cmd/zvol_id`. `zvol_id` is run by udev with the help of `ptrace(2)`. Tracing is incompatible with memory leaks detection. Reviewed-by: Ahelenia Ziemiańska <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: szubersk <[email protected]> Closes openzfs#12928
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes openzfs#12954
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes openzfs#12954
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Ahelenia Ziemiańska <[email protected]> Closes openzfs#12954
Motivation and Context
Description
configure
now accepts--enable-asan
and--enable-ubsan
switcheswhich results in passing
-fsanitize=address
and
-fsanitize=undefined
, respectively, to the compiler. Thoseflags are enabled in GitHub workflows for ZTS and zloop. Errors
reported by both instrumentations are corrected, except for:
Memory leak reporting is (temporarily) suppressed. The cost of
fixing them is relatively high compared to the gains.
Checksum computing functions in
module/zcommon/zfs_fletcher*
have UBSan errors suppressed. It is completely impractical
to enforce 64-byte payload alignment there due to performance
impact.
There's no ASan heap poisoning in
module/zstd/lib/zstd.c
. A custommemory allocator is used there rendering that measure
unfeasible.
Memory leaks detection has to be suppressed for
cmd/zvol_id
.zvol_id
is run by udev with the help ofptrace(2)
. Tracing isincompatible with memory leaks detection.
Close #12215
Close #12216
How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.