Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
O_DIRECT reads unaligned page size filesize
Varada ([email protected]) pointed out an issue with O_DIRECT reads with the following test case: dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79 truncate -s 40382 /local_zpool/file2 zpool export local_zpool zpool import -d ~/ local_zpool dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct That led to following panic happening: [ 307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed [ 307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter() [ 307.788743] Showing stack for process 9665 [ 307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [ 307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 307.811682] Call Trace: [ 307.814131] dump_stack+0x41/0x60 [ 307.817449] spl_panic+0xd0/0xe8 [spl] [ 307.821210] ? irq_work_queue+0x9/0x20 [ 307.824961] ? wake_up_klogd.part.30+0x30/0x40 [ 307.829407] ? vprintk_emit+0x125/0x250 [ 307.833246] ? printk+0x58/0x6f [ 307.836391] spl_assert.constprop.1+0x16/0x20 [zfs] [ 307.841438] abd_fletcher_4_iter+0x6c/0x101 [zfs] [ 307.846343] ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs] [ 307.851922] abd_iterate_func+0xb1/0x170 [zfs] [ 307.856533] abd_fletcher_4_impl+0x3f/0xa0 [zfs] [ 307.861334] abd_fletcher_4_native+0x52/0x70 [zfs] [ 307.866302] ? enqueue_entity+0xf1/0x6e0 [ 307.870226] ? select_idle_sibling+0x23/0x700 [ 307.874587] ? enqueue_task_fair+0x94/0x710 [ 307.878771] ? select_task_rq_fair+0x351/0x990 [ 307.883208] zio_checksum_error_impl+0xff/0x5f0 [zfs] [ 307.888435] ? abd_fletcher_4_impl+0xa0/0xa0 [zfs] [ 307.893401] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [ 307.898203] ? __wake_up_common+0x7a/0x190 [ 307.902300] ? __switch_to_asm+0x41/0x70 [ 307.906220] ? __switch_to_asm+0x35/0x70 [ 307.910145] ? __switch_to_asm+0x41/0x70 [ 307.914061] ? __switch_to_asm+0x35/0x70 [ 307.917980] ? __switch_to_asm+0x41/0x70 [ 307.921903] ? __switch_to_asm+0x35/0x70 [ 307.925821] ? __switch_to_asm+0x35/0x70 [ 307.929739] ? __switch_to_asm+0x41/0x70 [ 307.933658] ? __switch_to_asm+0x35/0x70 [ 307.937582] zio_checksum_error+0x47/0xc0 [zfs] [ 307.942288] raidz_checksum_verify+0x3a/0x70 [zfs] [ 307.947257] vdev_raidz_io_done+0x4b/0x160 [zfs] [ 307.952049] zio_vdev_io_done+0x7f/0x200 [zfs] [ 307.956669] zio_execute+0xee/0x210 [zfs] [ 307.960855] taskq_thread+0x203/0x420 [spl] [ 307.965048] ? wake_up_q+0x70/0x70 [ 307.968455] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 307.974807] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 307.979260] kthread+0x10a/0x120 [ 307.982485] ? set_kthread_struct+0x40/0x40 [ 307.986670] ret_from_fork+0x35/0x40 The reason this was occuring was because by doing the zpool export that meant the initial read of O_DIRECT would be forced to go down to disk. In this case it was still valid as bs=1M is still page size aligned; howver, the file length was not. So when issuing the O_DIRECT read even after calling make_abd_for_dbuf() we had an extra page allocated in the original ABD along with the linear ABD attached at the end of the gang abd from make_abd_for_dbuf(). This is an issue as it is our expectations with read that the block sizes being read are page aligned. When we do our check we are only checking the request but not the actual size of data we may read such as the entire file. In order to remedy this situation, I updated zfs_read() to attempt to read as much as it can using O_DIRECT based on if the length is page-sized aligned. Any additional bytes that are requested are then read into the ARC. This still stays with our semantics that IO requests must be page sized aligned. There are a bit of draw backs here if there is only a single block being read. In this case the block will be read twice. Once using O_DIRECT and then using buffered to fill in the remaining data for the users request. However, this should not be a big issue most of the time. In the normal case a user may ask for a lot of data from a file and only the stray bytes at the end of the file will have to be read using the ARC. In order to make sure this case was completely covered, I added a new ZTS test case dio_unaligned_filesize to test this out. The main thing with that test case is the first O_DIRECT read will issue out two reads two being O_DIRECT and the third being buffered for the remaining requested bytes. As part of this commit, I also updated stride_dd to take an additional parameter of -e, which says read the entire input file and ingore the count (-c) option. We need to use stride_dd for FreeBSD as dd does not make sure the buffer is page aligned. This udpate to stride_dd allows us to use it to test out this case in dio_unaligned_filesize for both Linux and FreeBSD. While this may not be the most elegant solution, it does stick with the semanatics and still reads all the data the user requested. I am fine with revisiting this and maybe we just return a short read? Signed-off-by: Brian Atkinson <[email protected]>
- Loading branch information