Posting checksum events fails with ENOMEM on FreeBSD #14717
Comments
I personally have found the contents of the histogram buckets useful a couple of times, for telling that it was always a certain bit being mangled, or that all the bits were wrong (e.g. the result was noise). So I'd prefer not to lose them. Could we change the interface going forward so that we chunk up the events, rather than either bloating the per-message buffer with uncertain implications or dropping fields? It seems like that would be more robust if we ever had occasion to add more data to those events again, and failing to pass too-large messages in the "old" interface would be the status quo ante anyway... |
What about removing all of the histogram stuff from the event, but adding a custom dtrace probe to get it? If checksum events are rare, then the histogram probably won't be useful anyway. And if checksum events are frequent, then one can start dtracing after noticing checksum events. |
Removing useful information because FreeBSD has a tiny buffer seems like a poor tradeoff, to me. What are you concerned about, with increasing the buffer limit? Is it a very hot path for some use cases? Is it fixed-size copies every time, so everyone gets a larger copy penalty? |
Yes, it's a fixed size, so increasing it for one event increases it for all. And it's already pretty big, compared to what most events need (usually fewer than 200 bytes). How useful is the histogram, really? You said that you noticed a stuck bit sometimes. What kind of hardware was that on? |
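To make the size constraint concrete, here is a minimal sketch of the fit check, assuming the 1016 byte limit quoted in the issue description and assuming that limit includes the terminating NUL. `EVENT_BUF_LIMIT`, `event_fits` and `bytes_over_limit` are hypothetical names for illustration, not FreeBSD or ZFS identifiers.

```c
#include <stddef.h>

/* Limit quoted in the issue description; not taken from a FreeBSD header. */
#define EVENT_BUF_LIMIT 1016

/*
 * Hypothetical helper: returns 1 if an event string of payload_len bytes
 * (plus its terminating NUL, an assumption of this sketch) fits in the
 * fixed-size buffer, 0 otherwise.
 */
static int
event_fits(size_t payload_len)
{
	return (payload_len + 1 <= EVENT_BUF_LIMIT);
}

/* Hypothetical helper: how many bytes must be trimmed for the event to fit. */
static size_t
bytes_over_limit(size_t payload_len)
{
	if (event_fits(payload_len))
		return (0);
	return (payload_len + 1 - EVENT_BUF_LIMIT);
}
```

Under these assumptions a typical 200 byte event fits easily, while the 1862 byte checksum event from the issue's test case does not.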
If the valid values only go up to 64, you could halve the length of the histogram messages by encoding each 0-63 value as a Base64 character. |
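A rough sketch of that encoding idea, assuming each bucket really does fit in the range 0-63. The function name `encode_buckets` is hypothetical, and clamping larger values to 63 is a choice made for this sketch, not something proposed in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard Base64 alphabet: index 0 is 'A', index 63 is '/'. */
static const char b64_alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Hypothetical encoder: writes one character per bucket, then a NUL.
 * out must hold at least n + 1 bytes. Values above 63 are clamped to 63
 * (an assumption of this sketch).
 */
static void
encode_buckets(const uint8_t *buckets, size_t n, char *out)
{
	for (size_t i = 0; i < n; i++) {
		uint8_t v = buckets[i] > 63 ? 63 : buckets[i];
		out[i] = b64_alphabet[v];
	}
	out[n] = '\0';
}
```

Each bucket then costs exactly one character, instead of up to two decimal digits plus any separator.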
The other "bad" idea that came to mind is, we have all these compression functions lying around... I wonder if any of them are sufficient to binpack this better, knowing a lot of them have fixed overheads... |
No, because that PR was only a partial solution. The messages are still too long. |
@rincebrain this is still a problem. Would you be open to adding an off-by-default sysctl knob, something like |
Oh, interesting, I had assumed this went in back when 14716 did. Having read the code more carefully, embarrassingly, I'm not actually sure I mind this any more, because what I was using is bad_set_bits/bad_cleared_bits, not the histogram. I'm so sorry for derailing this over my incorrect perception. |
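For readers unfamiliar with the distinction, here is an illustrative sketch of the difference between per-word bit masks and a per-bit-position histogram. It assumes the masks record which bits differ between the expected and the actual data. The function names are hypothetical and this is not the ZFS implementation.

```c
#include <stdint.h>

/* Bits that are 1 in the data read back ("bad") but 0 in the expected data. */
static uint64_t
bits_wrongly_set(uint64_t good, uint64_t bad)
{
	return (~good & bad);
}

/* Bits that are 0 in the data read back ("bad") but 1 in the expected data. */
static uint64_t
bits_wrongly_cleared(uint64_t good, uint64_t bad)
{
	return (good & ~bad);
}

/* Add one count to hist[i] for every bit position i that is set in mask. */
static void
add_to_histogram(uint64_t mask, uint32_t hist[64])
{
	for (int i = 0; i < 64; i++) {
		if (mask & (1ULL << i))
			hist[i]++;
	}
}
```

A single stuck bit shows up directly in the masks as the same bit position in every word, which is the kind of pattern described earlier in the thread. The histogram only summarizes how often each position was wrong.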
Great! I'll create a PR to remove the histogram fields. However, I now realize there's another problem too. For disks that aren't in RAIDZ arrays, ZFS adds a few other fields. They look like this |
SGTM. Hopefully if someone else is going to miss them, they'll speak up. |
The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools. Sponsored-by: Axcient Signed-off-by: Alan Somers <[email protected]>
With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes openzfs#14717 for mirrored pools. Sponsored-by: Axcient Signed-off-by: Alan Somers <[email protected]>
The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #15052
With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes openzfs#14717 for mirrored pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes openzfs#14717 Closes openzfs#15052
System information
Describe the problem you're observing
On FreeBSD, devctl events must fit within a 1016 byte string buffer. If they are too large, they will be silently dropped and never sent to userland. ZFS's ereport.fs.zfs.checksum events are far too large: 1862 bytes in my test case. So they never get sent to userland. This means that zfsd is unable to degrade a vdev that produces too many checksum errors.
Describe how to reproduce the problem
Optionally, run a dtrace script like this, which will show various ZFS events getting queued for userland, but not any ereport.fs.zfs.checksum events.
Possible solutions
PR #14716 partially fixes the problem by making the report format less verbose. However, it isn't enough. We still need to trim another ~75 bytes from the report (the exact number depends on the size of the pool and vdev names). Here are some possibilities for how to fully fix it:
- Merge bad_cleared_histogram and bad_set_histogram into a single bad_bit_histogram. That would save about 150 bytes.
- zio_err will probably always be zero. But I'm not sure which of the others can go.
Current Report format
Here's an example of a current report, including the changes from PR #14716 .