Posting checksum events fails with ENOMEM on FreeBSD #14717

asomers · 2023-04-04T23:08:24Z

System information

Type	Version/Name
Distribution Name	FreeBSD
Distribution Version	14.0-CURRENT
Kernel Version	bef81bc0aef (main as of 3-April-2023)
Architecture	amd64
OpenZFS Version	zfs-2.1.99-FreeBSD_g57cfae4a2 zfs-kmod-2.1.99-FreeBSD_g431083f75

Describe the problem you're observing

On FreeBSD, devctl events must fit within a 1016 byte string buffer. If they are too large, they will be silently dropped and never sent to userland. ZFS's ereport.fs.zfs.checksum events are far too large: 1862 bytes in my test case. So they never get sent to userland. This means that zfsd is unable to degrade a vdev that produces too many checksum errors.

Describe how to reproduce the problem

vim /etc/kyua/kyua.conf # Set test_suites.FreeBSD.disks
cd /usr/tests/sys/cddl/zfs/tests/zfsd
sudo kyua debug zfsd_test:zfsd_degrade_001_pos

Optionally, run a dtrace script like this, which will show various ZFS events get queued for userland, but not any ereport.fs.zfs.checksum events.

sudo dtrace -i 'fbt:kernel:devctl_queue:entry {printf("%s", stringof(args[0]->dei_data));}'

Possible solutions

PR #14716 partially fixes the problem by making the report format less verbose. However, it isn't enough. We still need to trim another ~75 bytes from the report (the exact number depends on the size of the pool and vdev names). Here are some possibilities for how to fully fix it:

Combine the bad_cleared_histogram and bad_set_histogram into a single bad_bit_histogram. That would save about 150 bytes.
Widen the histogram buckets from one bit wide to two bits wide.
Simply remove the histogram fields. zpool-events(8) suggests that they are intended for use with IDE and parallel SCSI, which are obsolete.
Remove several other fields. But which ones? zio_err will probably always be zero. But I'm not sure which of the others can go.
Double the size of FreeBSD's buffer, at an uncertain performance cost

Current Report format

Here's an example of a current report, including the changes from PR #14716 .

 class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62

The text was updated successfully, but these errors were encountered:

rincebrain · 2023-04-05T01:16:11Z

I personally have found the contents of the histogram buckets useful a couple of times, for telling it was always a certain bit mangled, or that all the bits were wrong (e.g. the result was noise). So I'd prefer to not lose them.

Could we change the interface moving forward so we chunk up the events, rather than either bloating the buffer per message with uncertain implications or dropping fields? It seems like it'd be more robust to if we ever had occasion to add more data to those again, and failing to pass too-large messages in the "old" interface would be status quo ante anyway...

asomers · 2023-04-05T03:19:08Z

What about removing all of the histogram stuff from the event, but adding a custom dtrace probe to get it? If checksum events are rare, then the histogram probably won't be useful anyway. And if checksum events are frequent, then one can start dtracing after noticing checksum events.

rincebrain · 2023-04-05T08:43:22Z

Removing useful information because FreeBSD has a tiny buffer seems like a poor tradeoff, to me. What are you concerned about, with increasing the buffer limit? Is it a very hot path for some use cases? Is it fixed-size copies every time, so everyone gets a larger copy penalty?

asomers · 2023-04-05T12:17:05Z

Yes, it's a fixed size, so increasing it for one event increases it for all. And it's already pretty big, compared to what most events need (usually fewer than 200 bytes). How useful is the histogram, really? You said that you noticed a stuck bit sometimes. What kind of hardware was that on?

Precidata · 2023-04-05T13:28:46Z

If valid values goes only to 64, you could divide by two the length of the histogram messages by returning each 0-63 value as a Base64 character.

rincebrain · 2023-04-05T18:59:01Z

The other "bad" idea that came to mind is, we have all these compression functions laying around...I wonder if any of them are sufficient to binpack this better, knowing a lot of them have fixed overheads...

mcmilk · 2023-04-13T09:22:11Z

@asomers - could this issue be closed - cause PR #14716 is in?

asomers · 2023-04-13T14:34:50Z

No, because that PR was only a partial solution. The messages are still too long.

asomers · 2023-07-10T22:28:34Z

@rincebrain this is still a problem. Would you be open to adding an off-by-default sysctl knob, something like vfs.zfs.checksum_event_verbose? When off, ZFS would omit the checksum histograms.

rincebrain · 2023-07-10T22:49:57Z

Oh, interesting, I had assumed this went in back when 14716 did.

Having read the code more carefully, embarassingly, I'm not actually sure I mind this any more, because what I was using is bad_set_bits/bad_cleared_bits, not the histogram.

I'm so sorry for derailing this over my incorrect perception.

asomers · 2023-07-11T16:44:38Z

Great! I'll create a PR to remove the histogram fields. However, I now realize there's another problem too. For disks that aren't in RAIDZ arrays, ZFS adds a few other fields. They look like this cksum_actual=4d458e2ca9781347e3b80ea566e095a42df54a8fcb91eb058f365662cfb23c73 cksum_algorithm=sha256 cksum_expected=e0c327ff9d388f51695151ff6f51d253ff865d718d6a514980ff67658f8d4be1. And those extra fields also cause the event to overflow the buffer. May I eliminate those, too? cksum_actual, in particular, should almost never be useful. Apart from fletcher4, a single-bit change in the input will render the checksum completely different.

rincebrain · 2023-07-11T19:24:15Z

SGTM. Hopefully if someone else is going to miss them, they'll speak up.

The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools. Sponsored-by: Axcient Signed-off-by: Alan Somers <[email protected]>

With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes openzfs#14717 for mirrored pools. Sponsored-by: Axcient Signed-off-by: Alan Somers <[email protected]>

The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools. Sponsored-by: Axcient Signed-off-by: Alan Somers <[email protected]>

With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes openzfs#14717 for mirrored pools. Sponsored-by: Axcient Signed-off-by: Alan Somers <[email protected]>

The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #15052

The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes openzfs#15052

With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes openzfs#14717 for mirrored pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes openzfs#14717 Closes openzfs#15052

The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue #14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #15052

With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes #14717 for mirrored pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes #14717 Closes #15052

The checksum histograms were intended to be used with ATA and parallel SCSI, which are obsolete. With modern storage hardware, they will almost always look like white noise; all bits will be wrong. They only serve to bloat the event. That's a particular problem on FreeBSD, where events must fit into a 1016 byte buffer. This fixes issue openzfs#14717 for RAIDZ pools, but not for mirror pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes openzfs#15052

With anything but fletcher-4, even a tiny change in the input will cause the checksum value to change completely. So knowing the actual and expected checksums doesn't provide much more information than "they don't match". The harm in sending them is simply that they bloat the event. In particular, on FreeBSD the event must fit into a 1016 byte buffer. Fixes openzfs#14717 for mirrored pools. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Rich Ercolani <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored-by: Axcient Closes openzfs#14717 Closes openzfs#15052

asomers added the Type: Defect Incorrect behavior (e.g. crash, hang) label Apr 4, 2023

asomers mentioned this issue Jul 11, 2023

Reduce bloat in ereport.fs.zfs.checksum events #15052

Closed

13 tasks

behlendorf closed this as completed in 6fd87e1 Jul 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Posting checksum events fails with ENOMEM on FreeBSD #14717

Posting checksum events fails with ENOMEM on FreeBSD #14717

asomers commented Apr 4, 2023

rincebrain commented Apr 5, 2023

asomers commented Apr 5, 2023

rincebrain commented Apr 5, 2023

asomers commented Apr 5, 2023

Precidata commented Apr 5, 2023

rincebrain commented Apr 5, 2023

mcmilk commented Apr 13, 2023

asomers commented Apr 13, 2023

asomers commented Jul 10, 2023

rincebrain commented Jul 10, 2023

asomers commented Jul 11, 2023

rincebrain commented Jul 11, 2023

Posting checksum events fails with ENOMEM on FreeBSD #14717

Posting checksum events fails with ENOMEM on FreeBSD #14717

Comments

asomers commented Apr 4, 2023

System information

Describe the problem you're observing

Describe how to reproduce the problem

Possible solutions

Current Report format

rincebrain commented Apr 5, 2023

asomers commented Apr 5, 2023

rincebrain commented Apr 5, 2023

asomers commented Apr 5, 2023

Precidata commented Apr 5, 2023

rincebrain commented Apr 5, 2023

mcmilk commented Apr 13, 2023

asomers commented Apr 13, 2023

asomers commented Jul 10, 2023

rincebrain commented Jul 10, 2023

asomers commented Jul 11, 2023

rincebrain commented Jul 11, 2023