-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix endianness issues with zstd #12022
Conversation
I happen to run some zstd-compressed datasets on a BE machine at the moment, what can I do to work around this? Is simply changing the compression settings and recompress the data (like with send/recv) enough, or do I need to entirely recreate the affected datasets/pools? Thanks a lot for the fixes, by the way. |
@koachan the first thing will be to determine if the data written to your disk is actually wrong. If so, then we will have to somehow indicate this (bump the zstd version?), and have some backwards compat code for when reading blocks labeled 10405, to decode them properly in the cross-endian case, and make all new writes with the corrected code, use a new version indicator. We might be able to avoid this, if we are sure we are writing the data to the disk the same way in both endians, and then we just have to do the right thing to decode it when reading. |
I was going to wait until I finished testing it on all the relevant arches and pushed the updated branch to note this here, but since you noticed it... I have an updated branch that allows reading from every endianness mangling I've encountered on x86_64 and passes ZTS without issues. (...and writes the data out in the right order.) Once I've tested it on sparc64+ppc64+FBSD/ppc64, I'll push it. |
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64, and one written by current FBSD/ppc64. (I'll add one for Linux/sparc64 once the unmodified git checkout finishes building. Linux/ppc64 behaves identically to Linux/sparc64.) Signed-off-by: Rich Ercolani <[email protected]>
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64, and one written by current FBSD/ppc64. (I'll add one for Linux/sparc64 once the unmodified git checkout finishes building. Linux/ppc64 behaves identically to Linux/sparc64.) Signed-off-by: Rich Ercolani <[email protected]>
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64, and one written by current FBSD/ppc64. (I'll add one for Linux/sparc64 once the unmodified git checkout finishes building. Linux/ppc64 behaves identically to Linux/sparc64.) Signed-off-by: Rich Ercolani <[email protected]>
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64, one for Linux/sparc64, and one written by FBSD/ppc64. Signed-off-by: Rich Ercolani <[email protected]>
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64, one for Linux/sparc64, and one written by FBSD/ppc64. Signed-off-by: Rich Ercolani <[email protected]>
All right, a bunch of slow testing later (though not all of it, since sparc64 is amazingly slow...), here's the new branch. if anyone has better alternatives to what I did in the get/set functions that also passes #12030 (and, ideally, uses one consistent layout on every platform going forward), I'm all ears. edit: since someone asked me "why can't we just check whether the 00 byte is at the 'head' or the 'tail'?" |
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Signed-off-by: Rich Ercolani <[email protected]>
2718dc7
to
aa3d2a5
Compare
There, one big replacement of things with macros later, it seems to work as expected for read and write on Linux/x86_64, Linux/sparc64, and FreeBSD/ppc64, as well as all three producing data readable as expected on unmodified Linux/x86_64. I also stuck an "a" on the zstd module version. |
cmd/zdb/zdb.c
Outdated
* So we're searching 4 bytes to figure out where the version | ||
* and level bytes are. The level bytes can cover the whole range | ||
* of uint8_t, and so are not helpful. Fortunately, the version | ||
* field is going to have a leading 00 from now until version | ||
* 6.55.56 or higher with how it's represented, so we can dig | ||
* that out, and know that wherever we found it, the two bytes | ||
* "next" to it in the range are the other version bytes, in order, | ||
* and whichever byte remains is the level field. | ||
*/ | ||
uint32_t level = blob->raw_version_level; | ||
uint8_t findme = 0xff; | ||
int shift; | ||
for (shift = 0; shift < 4; shift++) { | ||
findme = BF32_GET(level, 8*shift, 8); | ||
if (findme == 0) | ||
break; | ||
} | ||
switch (shift) { | ||
case 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than duplicating this code, maybe make one function that gets both level and version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I did that and added helpers to get one of the two.
cmd/zdb/zdb.c
Outdated
case 0: | ||
version = BSWAP_32(version); | ||
version = BF32_GET(version, 8, 24); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From reverse-engineering this code, I think that overall this is decoding a 32-bit number that can look like one of:
0x L0 V0 V1 00
0x V0 V1 00 L0
0x L0 00 V1 V0
0x 00 V1 V0 L0
where 00 is a literal 0, L0 is the level, V0 is the low bits of the version and V1 is the high bits of the version.
I would appreciate seeing this explanation in the comments, as well as an explanation of how the on-disk format could come to be this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'm terribly sorry, I thought your prior request was explicitly "an explanation for why we can't use standard macros for this", which I thought was alleviated by doing so.
I have, to be clear, only seen two of these four cases in the wild (2 for LE and 3 for BE), outside of bytes I mangled with my own code modifications; I just included all the cases (and tested them, with aforementioned manually-induced-mangling) for completeness.
I'll include an explanation for the aforementioned two, but it will amount to "we wrote a representation which is not guaranteed to be consistent across platforms to disk".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we could be more specific, e.g. saying that it used a bitfield and the order of the elements of the bitfield are not specified by the C language, so it might be version-first or level-first. Although in practice you could check if any of our supported platforms actually did it in an other-than expected order, which might simplify this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've checked Linux/{aarch64,mips64el,ppc64,riscv64,sparc64,x86_64,x86} and FreeBSD/{x86_64,ppc64}, and the only difference is that BE did one case and LE did another. So I could hardcode checks for the two cases instead of all 4, but it doesn't feel like it simplifies much, since it's still going to require a nontrivial explanation for the two different bit configurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There, I replaced the existing comment with one with diagrams. Please let me know if it's not clear enough.
module/zstd/zfs_zstd.c
Outdated
static uint8_t | ||
zfs_get_hdrlevel(const zfs_zstdhdr_t *blob) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to share this code with zdb's version? Perhaps by putting it in a header file as an inline function? The same could be done for the setting functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done.
include/sys/zstd/zstd.h
Outdated
* | ||
* 32 24 16 8 0 | ||
* +-------+-------+-------+-------+ | ||
* | level | 0 | version | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you indicate the high and low bytes of the version, because I think they are swapped here vs line 121. i.e. the (2-byte) "version" here is not the same as the (2-byte) "version" on line 121.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, they are; I thought that the clarity of just saying "version" versus "v0 | v1"/"v1 | v0" was a reasonable tradeoff, but I'll go change it since you disagree.
include/sys/zstd/zstd.h
Outdated
* | ||
* 32 24 16 8 0 | ||
* +-------+-------+-------+-------+ | ||
* | version | 0 | level | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this logic depends on "version" having some bits set in both its low byte and its high byte. It looks like that's the case for the current ZSTD_VERSION_NUMBER, but could we VERIFY() that's the case before we write it out? Otherwise we might accidentally bump to version 1.05.12
, which I think would have a zero low byte, and then this logic breaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're correct, it does, though I don't think 1.5.12 would do it? If I'm understanding ZSTD_VERSION_NUMBER right, 1.5.12 turns into 1x10000 + 5x100 + 12 = 10512/0x2910 - 1.4.96 would, though, or 1.7.52, or ...
So (unless I'm overlooking something) there's two cases that can happen without the version number rising pretty dramatically far - the low byte is 0x00, or the high byte is 0x00. One effectively can't happen without the version rolling back in time to < 0.02.56 (or forward past 6.55.35, at which point I'd hope we could either drop the compat code or put it behind a non-default tunable), and the other winds up with a layout like 00 v0 00 LV or LV 00 v0 00; in either of the latter two cases, I believe the code will do the right thing no matter which order it iterates (though it should be consistent in order across platforms with this version anyway).
Assuming the above is right, I'm happy to add something to VERIFY that the case it can't handle and can happen (e.g. version >= 6.55.36) isn't true before writing it, though I think I'd prefer (and will unless told otherwise) to make it VERIFY against, say, <= 62000 so it causes breakage well before it's about to break the read code.
include/sys/zstd/zstd.h
Outdated
* header we're dealing with.) | ||
* | ||
* So now we use the BF32_SET macros to get consistent behavior (the LE | ||
* encoding, since x86 currently rules the world) across platforms, but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the code actually uses the big-endian encoding (e.g. hdr->raw_version_level = BE_32(hdr->raw_version_level);
). Assuming we plan to maintain this "figure out the byteorder" functionality forever, maybe we should continue to write it in host byteorder, so that there's some chance of exercising this searching code even with pools written by the current software?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sake of precision - I tested this code against the assumption that it writes out bytes to disk in the same order that x86_64 platforms currently do, by attempting to read the bytes written by my modified code from both {x86_64, sparc64} hosts and by unmodified {x86_64,sparc64} hosts on an unpatched copy of git master on x86_64. Both of the former and the obvious one of the latter work.
Would you prefer I refer to the encoding in question as BSWAP_32(LE), to make it clear that it's the swapped representation of LE's view of the world, not the representation that BE platforms natively have, or something else?
I'd prefer to write only one ordering going forward (though I can straightforwardly go add further test cases to the ZTS PR where the above-discussed property of 00 v0 00 LV and vice-versa are true), so that, if we reach the point where this assertion becomes false (e.g. version > 6.55.35), we can reasonably safely stop using this by default a little while before it becomes a problem.
I have no strong preference, except in the absence of any preference to just leave it as two separate things rather than flattening them. |
Works for me. @ahrens if you could take another look at this we should be able to get it wrapped up. |
1d1707d
to
a12aac6
Compare
@ahrens can you take a look at this? thanks |
a12aac6
to
f3024e4
Compare
Gentle ping? I've just rebased it against master again. It'd be nice for at least all 3 users of ZFS on sparc64 to see this land... |
A question: would a test on virtual disk image/loopback devices be enough? Also, is testing that the pool can be imported (+ not get corrupted) on various endianness enough, or do I need to run other test cases too? |
If you want to experiment with it for some reason, you can test it on a real disk or a file or whatever permutation of storage backend you like. You could even use the test case in #12030 to run through trying both the BE and LE ondisk outcomes. |
@rincebrain thanks for the reminder. This change looks good to me and I agree we should go ahead and get it merged. Can you just squash this and wrap it up with a nice commit message. |
It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Fixes: openzfs#12008 Signed-off-by: Rich Ercolani <[email protected]>
f3024e4
to
8c95296
Compare
Squashed and wrote a brief commit message. |
Thanks for the squash and comment. Merged! |
It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12008 Closes openzfs#12022
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Signed-off-by: Rich Ercolani <[email protected]>
As detailed in #12022 and #12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12030
It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12008 Closes openzfs#12022
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12030
It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12008 Closes openzfs#12022
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12030
It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12008 Closes #12022
As detailed in #12022 and #12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes #12030
It turns out that layouts of union bitfields are a pain, and the current code results in an inconsistent layout between BE and LE systems, leading to zstd-active datasets on one erroring out on the other. Switch everyone over to the LE layout, and add compatibility code to read both. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Matthew Ahrens <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12008 Closes openzfs#12022
As detailed in openzfs#12022 and openzfs#12008, it turns out the current zstd implementation is quite nonportable, and results in various configurations of ondisk header that only each platform can read. So I've added a test which contains a dataset with a file written by Linux/x86_64 and one written by FBSD/ppc64. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: John Kennedy <[email protected]> Signed-off-by: Rich Ercolani <[email protected]> Closes openzfs#12030
All right, heck it, let's start soliciting feedback.
Motivation and Context
After working around the trivial issue in #12008 (which is also in this PR; I can separate it out if need be), it was discovered that using compress=zstd on any(? at least Linux/ppc64 and Linux/sparc64) big-endian platform results in incorrect storage of the header for the record, such that any similarly incorrect system can read/write the data fine, but "correct" systems (e.g. Linux/x86_64) will EIO on reading that data.
(Fascinatingly, FBSD/ppc64 encodes it differently than Linux/{ppc64,sparc64,x86_64}, such that Linux/{ppc64,sparc64} can read its data and vice-versa, which is why I included samples of all three in the test case in #12030 )As I commented in #12030, apparently my prior Linux/sparc64 test case was generated by a WIP patchset which mangled bytes differently, since I can no longer reproduce it, so that's been removed.Description
This PR essentially contains two changes:
(It occurs to me that specifying the relevant struct definition here might help. The header is defined as:
and a troublesome thing that's done with it is:
hdr->raw_version_level = BE_32(hdr->raw_version_level);
)
We were basically relying on the encoding of the bitfield in memory and serializing that out to disk, resulting in problems for interoperability.
This PR removes the bitfield (but leaves the raw_version_level uint32_t, as among other things a convenient pointer to the right place...), and wraps all access to the formerly-bitfielded contents of that uint32_t with get/set functions that search to determine the correct encoding (in the get case) and use pointer math to lay things out the same way for everyone (in the set case).
To quote the get method's comment, for how we do that:
Question: Do we want to somehow increment the zstd module version, e.g. 1.4.5a? Because from my own experience, at least DKMS will sometimes report "hey the module version is the same I totally don't have to replace it", and that would lead to all sorts of problems..
Question: Do we want to restrict set to both values at once? Either by making the individual setters private or by making them each individually call the other's zfs_get_hdr...() and call some combination setter that sets both at once - because the code currently only sets them both (on a new hdr, no less), but if one ever got used on a hdr with the incorrect ordering without the other, we'd have lossily mangled that hdr.
How Has This Been Tested?
("pass" because there are existing failures I should go report that happen on unmodified git master...)
Types of changes
Checklist:
Signed-off-by
.