From 08771955543ee62bfba7c29f08c9272453765d86 Mon Sep 17 00:00:00 2001 From: Brian Atkinson Date: Thu, 19 Jan 2023 15:19:47 -0700 Subject: [PATCH] O_DIRECT reads unaligned page size filesize Varada (varada.kari@gmail.com) pointed out an issue with O_DIRECT reads with the following test case: dd if=/dev/urandom of=/local_zpool/file2 bs=512 count=79 truncate -s 40382 /local_zpool/file2 zpool export local_zpool zpool import -d ~/ local_zpool dd if=/local_zpool/file2 of=/dev/null bs=1M iflag=direct That led to following panic happening: [ 307.769267] VERIFY(IS_P2ALIGNED(size, sizeof (uint32_t))) failed [ 307.782997] PANIC at zfs_fletcher.c:870:abd_fletcher_4_iter() [ 307.788743] Showing stack for process 9665 [ 307.792834] CPU: 47 PID: 9665 Comm: z_rd_int_5 Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 #1 [ 307.804298] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 307.811682] Call Trace: [ 307.814131] dump_stack+0x41/0x60 [ 307.817449] spl_panic+0xd0/0xe8 [spl] [ 307.821210] ? irq_work_queue+0x9/0x20 [ 307.824961] ? wake_up_klogd.part.30+0x30/0x40 [ 307.829407] ? vprintk_emit+0x125/0x250 [ 307.833246] ? printk+0x58/0x6f [ 307.836391] spl_assert.constprop.1+0x16/0x20 [zfs] [ 307.841438] abd_fletcher_4_iter+0x6c/0x101 [zfs] [ 307.846343] ? abd_fletcher_4_simd2scalar+0x83/0x83 [zfs] [ 307.851922] abd_iterate_func+0xb1/0x170 [zfs] [ 307.856533] abd_fletcher_4_impl+0x3f/0xa0 [zfs] [ 307.861334] abd_fletcher_4_native+0x52/0x70 [zfs] [ 307.866302] ? enqueue_entity+0xf1/0x6e0 [ 307.870226] ? select_idle_sibling+0x23/0x700 [ 307.874587] ? enqueue_task_fair+0x94/0x710 [ 307.878771] ? select_task_rq_fair+0x351/0x990 [ 307.883208] zio_checksum_error_impl+0xff/0x5f0 [zfs] [ 307.888435] ? abd_fletcher_4_impl+0xa0/0xa0 [zfs] [ 307.893401] ? spl_kmem_alloc_impl+0xce/0xf0 [spl] [ 307.898203] ? __wake_up_common+0x7a/0x190 [ 307.902300] ? __switch_to_asm+0x41/0x70 [ 307.906220] ? __switch_to_asm+0x35/0x70 [ 307.910145] ? __switch_to_asm+0x41/0x70 [ 307.914061] ? __switch_to_asm+0x35/0x70 [ 307.917980] ? __switch_to_asm+0x41/0x70 [ 307.921903] ? __switch_to_asm+0x35/0x70 [ 307.925821] ? __switch_to_asm+0x35/0x70 [ 307.929739] ? __switch_to_asm+0x41/0x70 [ 307.933658] ? __switch_to_asm+0x35/0x70 [ 307.937582] zio_checksum_error+0x47/0xc0 [zfs] [ 307.942288] raidz_checksum_verify+0x3a/0x70 [zfs] [ 307.947257] vdev_raidz_io_done+0x4b/0x160 [zfs] [ 307.952049] zio_vdev_io_done+0x7f/0x200 [zfs] [ 307.956669] zio_execute+0xee/0x210 [zfs] [ 307.960855] taskq_thread+0x203/0x420 [spl] [ 307.965048] ? wake_up_q+0x70/0x70 [ 307.968455] ? zio_execute_stack_check.constprop.1+0x10/0x10 [zfs] [ 307.974807] ? taskq_lowest_id+0xc0/0xc0 [spl] [ 307.979260] kthread+0x10a/0x120 [ 307.982485] ? set_kthread_struct+0x40/0x40 [ 307.986670] ret_from_fork+0x35/0x40 The reason this was occuring was because by doing the zpool export that meant the initial read of O_DIRECT would be forced to go down to disk. In this case it was still valid as bs=1M is still page size aligned; howver, the file length was not. So when issuing the O_DIRECT read even after calling make_abd_for_dbuf() we had an extra page allocated in the original ABD along with the linear ABD attached at the end of the gang abd from make_abd_for_dbuf(). This is an issue as it is our expectations with read that the block sizes being read are page aligned. When we do our check we are only checking the request but not the actual size of data we may read such as the entire file. In order to remedy this situation, I updated zfs_read() to attempt to read as much as it can using O_DIRECT based on if the length is page-sized aligned. Any additional bytes that are requested are then read into the ARC. This still stays with our semantics that IO requests must be page sized aligned. There are a bit of draw backs here if there is only a single block being read. In this case the block will be read twice. Once using O_DIRECT and then using buffered to fill in the remaining data for the users request. However, this should not be a big issue most of the time. In the normal case a user may ask for a lot of data from a file and only the stray bytes at the end of the file will have to be read using the ARC. In order to make sure this case was completely covered, I added a new ZTS test case dio_unaligned_filesize to test this out. The main thing with that test case is the first O_DIRECT read will issue out two reads two being O_DIRECT and the third being buffered for the remaining requested bytes. As part of this commit, I also updated stride_dd to take an additional parameter of -e, which says read the entire input file and ingore the count (-c) option. We need to use stride_dd for FreeBSD as dd does not make sure the buffer is page aligned. This udpate to stride_dd allows us to use it to test out this case in dio_unaligned_filesize for both Linux and FreeBSD. While this may not be the most elegant solution, it does stick with the semanatics and still reads all the data the user requested. I am fine with revisiting this and maybe we just return a short read? Signed-off-by: Brian Atkinson --- module/zfs/zfs_vnops.c | 46 +++++- tests/runfiles/common.run | 3 +- tests/zfs-tests/cmd/stride_dd.c | 152 ++++++++++++------ tests/zfs-tests/tests/Makefile.am | 1 + .../tests/functional/direct/dio.kshlib | 29 ++++ .../direct/dio_unaligned_filesize.ksh | 92 +++++++++++ .../functional/direct/dio_write_verify.ksh | 8 - 7 files changed, 268 insertions(+), 63 deletions(-) create mode 100755 tests/zfs-tests/tests/functional/direct/dio_unaligned_filesize.ksh diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c index 39d201a516f4..415341270519 100644 --- a/module/zfs/zfs_vnops.c +++ b/module/zfs/zfs_vnops.c @@ -380,14 +380,32 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr) ssize_t chunk_size = zfs_vnops_read_chunk_size; ssize_t n = MIN(zfs_uio_resid(uio), zp->z_size - zfs_uio_offset(uio)); ssize_t start_resid = n; + ssize_t dio_remaining_resid = 0; - /* - * All pages for an O_DIRECT request have already been mapped so - * there's no compelling reason to handle this uio is smaller chunks. - */ - if (uio->uio_extflg & UIO_DIRECT) + if (uio->uio_extflg & UIO_DIRECT) { + /* + * All pages for an O_DIRECT request ahve already been mapped + * so there's no compelling reason to handle this uio in + * smaller chunks. + */ chunk_size = DMU_MAX_ACCESS; + /* + * In the event that the O_DIRECT request is reading the entire + * file, it is possible file's length is not page sized + * aligned. However, lower layers expect that the Direct I/O + * request is page-aligned. In this case, as much of the file + * that can be read using Direct I/O happens and the remaining + * amount will be read through the ARC. + * + * This is still consistent with the semantics of Direct I/O in + * ZFS as at a minimum the I/O request must be page-aligned. + */ + dio_remaining_resid = n - P2ALIGN(n, PAGE_SIZE); + if (dio_remaining_resid != 0) + n -= dio_remaining_resid; + } + while (n > 0) { ssize_t nbytes = MIN(n, chunk_size - P2PHASE(zfs_uio_offset(uio), chunk_size)); @@ -427,7 +445,23 @@ zfs_read(struct znode *zp, zfs_uio_t *uio, int ioflag, cred_t *cr) n -= nbytes; } - int64_t nread = start_resid - n; + int64_t nread = start_resid; + if (error == 0 && (uio->uio_extflg & UIO_DIRECT) && + dio_remaining_resid != 0) { + /* + * Temporarily remove the UIO_DIRECT flag from the UIO so the + * remainder of the file can be read using the ARC. + */ + uio->uio_extflg &= ~UIO_DIRECT; + error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl), uio, + dio_remaining_resid); + uio->uio_extflg |= UIO_DIRECT; + + if (error != 0) + n -= dio_remaining_resid; + } + nread -= n; + dataset_kstats_update_read_kstats(&zfsvfs->z_kstat, nread); out: zfs_rangelock_exit(lr); diff --git a/tests/runfiles/common.run b/tests/runfiles/common.run index b856fe90f2d8..82ce0f70bb0b 100644 --- a/tests/runfiles/common.run +++ b/tests/runfiles/common.run @@ -697,7 +697,8 @@ tags = ['functional', 'delegate'] tests = ['dio_aligned_block', 'dio_async_always', 'dio_async_fio_ioengines', 'dio_compression', 'dio_dedup', 'dio_encryption', 'dio_grow_block', 'dio_max_recordsize', 'dio_mixed', 'dio_mmap', 'dio_property', - 'dio_random', 'dio_recordsize', 'dio_unaligned_block'] + 'dio_random', 'dio_recordsize', 'dio_unaligned_block', + 'dio_unaligned_filesize'] tags = ['functional', 'direct'] [tests/functional/exec] diff --git a/tests/zfs-tests/cmd/stride_dd.c b/tests/zfs-tests/cmd/stride_dd.c index 055228494e1d..19aab1c97f0c 100644 --- a/tests/zfs-tests/cmd/stride_dd.c +++ b/tests/zfs-tests/cmd/stride_dd.c @@ -33,6 +33,7 @@ static int if_o_direct = 0; static int of_o_direct = 0; static int skip = 0; static int skipbytes = 0; +static int entire_file = 0; static const char *execname = "stride_dd"; static void usage(void); @@ -42,10 +43,10 @@ static void usage(void) { (void) fprintf(stderr, - "usage: %s -i inputfile -o outputfile -b blocksize -c count \n" + "usage: %s -i inputfile -o outputfile -b blocksize [-c count]\n" " [-s stride] [-k seekblocks] [-K seekbytes]\n" " [-a alignment] [-d if_o_direct] [-D of_o_direct]\n" - " [-p skipblocks] [-P skipbytes]\n" + " [-p skipblocks] [-P skipbytes] [-e entire_file]\n" "\n" "Simplified version of dd that supports the stride option.\n" "A stride of n means that for each block written, n - 1 blocks\n" @@ -56,12 +57,13 @@ usage(void) " inputfile: File to read from\n" " outputfile: File to write to\n" " blocksize: Size of each block to read/write\n" - " count: Number of blocks to read/write\n" + " count: Number of blocks to read/write (Required" + " unless -e is used)\n" " stride: Read/write a block then skip (stride - 1) blocks" "\n" " seekblocks: Number of blocks to skip at start of output\n" " seekbytes: Treat seekblocks as byte count\n" - " alignment: Alignment passed to posix_memalign() (default " + " alignment: Alignment passed to posix_memalign() (default" " PAGE_SIZE)\n" " if_o_direct: Use O_DIRECT with inputfile (default no O_DIRECT)" "\n" @@ -69,7 +71,9 @@ usage(void) " O_DIRECT)\n" " skipblocks: Number of blocks to skip at start of input " " (default 0)\n" - " skipbytes: Treat skipblocks as byte count\n", + " skipbytes: Treat skipblocks as byte count\n" + " entire_file: When used the entire inputfile will be read and" + " count will be ignored\n", execname); (void) exit(1); } @@ -103,7 +107,7 @@ parse_options(int argc, char *argv[]) extern char *optarg; extern int optind, optopt; - while ((c = getopt(argc, argv, "a:b:c:dDi:o:s:k:Kp:P")) != -1) { + while ((c = getopt(argc, argv, "a:b:c:deDi:o:s:k:Kp:P")) != -1) { switch (c) { case 'a': alignment = atoi(optarg); @@ -121,6 +125,10 @@ parse_options(int argc, char *argv[]) if_o_direct = 1; break; + case 'e': + entire_file = 1; + break; + case 'D': of_o_direct = 1; break; @@ -172,25 +180,106 @@ parse_options(int argc, char *argv[]) } } - if (bsize <= 0 || count <= 0 || stride <= 0 || ifile == NULL || - ofile == NULL || seek < 0 || invalid_alignment(alignment) || - skip < 0) { + if (bsize <= 0 || stride <= 0 || ifile == NULL || ofile == NULL || + seek < 0 || invalid_alignment(alignment) || skip < 0) { + (void) fprintf(stderr, + "Required parameter(s) missing or invalid.\n"); + (void) usage(); + } + + if (count <= 0 && entire_file == 0) { (void) fprintf(stderr, "Required parameter(s) missing or invalid.\n"); (void) usage(); } } +static void +read_entire_file(int ifd, int ofd, void *buf) +{ + int c; + + do { + c = read(ifd, buf, bsize); + if (c < 0) { + perror("read"); + exit(2); + } else if (c != 0) { + c = write(ofd, buf, bsize); + if (c < 0) { + perror("write"); + exit(2); + } + + } + + if (stride > 1) { + if (lseek(ifd, (stride - 1) * bsize, SEEK_CUR) == -1) { + perror("input lseek"); + exit(2); + } + if (lseek(ofd, (stride - 1) * bsize, SEEK_CUR) == -1) { + perror("output lseek"); + exit(2); + } + } + } while (c != 0); +} + +static void +read_on_count(int ifd, int ofd, void *buf) +{ + int i; + int c; + + for (i = 0; i < count; i++) { + c = read(ifd, buf, bsize); + if (c != bsize) { + if (c < 0) { + perror("read"); + } else { + (void) fprintf(stderr, + "%s: unexpected short read, read %d " + "bytes, expected %d\n", execname, + c, bsize); + } + exit(2); + } + + c = write(ofd, buf, bsize); + if (c != bsize) { + if (c < 0) { + perror("write"); + } else { + (void) fprintf(stderr, + "%s: unexpected short write, wrote %d " + "bytes, expected %d\n", execname, + c, bsize); + } + exit(2); + } + + if (stride > 1) { + if (lseek(ifd, (stride - 1) * bsize, SEEK_CUR) == -1) { + perror("input lseek"); + exit(2); + } + if (lseek(ofd, (stride - 1) * bsize, SEEK_CUR) == -1) { + perror("output lseek"); + exit(2); + } + } + } +} + int main(int argc, char *argv[]) { - int i; int ifd; int ofd; int ifd_flags = O_RDONLY; int ofd_flags = O_WRONLY | O_CREAT; void *buf; - int c; parse_options(argc, argv); @@ -241,44 +330,11 @@ main(int argc, char *argv[]) } } - for (i = 0; i < count; i++) { - c = read(ifd, buf, bsize); - if (c != bsize) { - if (c < 0) { - perror("read"); - } else { - (void) fprintf(stderr, - "%s: unexpected short read, read %d " - "bytes, expected %d\n", execname, - c, bsize); - } - exit(2); - } - - c = write(ofd, buf, bsize); - if (c != bsize) { - if (c < 0) { - perror("write"); - } else { - (void) fprintf(stderr, - "%s: unexpected short write, wrote %d " - "bytes, expected %d\n", execname, - c, bsize); - } - exit(2); - } + if (entire_file == 1) + read_entire_file(ifd, ofd, buf); + else + read_on_count(ifd, ofd, buf); - if (stride > 1) { - if (lseek(ifd, (stride - 1) * bsize, SEEK_CUR) == -1) { - perror("input lseek"); - exit(2); - } - if (lseek(ofd, (stride - 1) * bsize, SEEK_CUR) == -1) { - perror("output lseek"); - exit(2); - } - } - } free(buf); (void) close(ofd); diff --git a/tests/zfs-tests/tests/Makefile.am b/tests/zfs-tests/tests/Makefile.am index a4eb0cfdc50b..ae9229e8ede8 100644 --- a/tests/zfs-tests/tests/Makefile.am +++ b/tests/zfs-tests/tests/Makefile.am @@ -1474,6 +1474,7 @@ nobase_dist_datadir_zfs_tests_tests_SCRIPTS += \ functional/direct/dio_random.ksh \ functional/direct/dio_recordsize.ksh \ functional/direct/dio_unaligned_block.ksh \ + functional/direct/dio_unaligned_filesize.ksh \ functional/direct/dio_write_verify.ksh \ functional/direct/dio_write_stable_pages.ksh \ functional/direct/setup.ksh \ diff --git a/tests/zfs-tests/tests/functional/direct/dio.kshlib b/tests/zfs-tests/tests/functional/direct/dio.kshlib index 8e7e045536d7..4e8f8ccdf8bb 100644 --- a/tests/zfs-tests/tests/functional/direct/dio.kshlib +++ b/tests/zfs-tests/tests/functional/direct/dio.kshlib @@ -300,3 +300,32 @@ function check_read # pool file bs count skip flags buf_rd dio_rd log_fail "Direct reads $dio_rd_actual of $dio_rd_expect" fi } + +function get_file_size +{ + typeset filename="$1" + + if is_linux; then + filesize=$(stat -c %s $filename) + else + filesize=$(stat -s $filename | awk '{print $8}' | grep -o '[0-9]\+') + fi + + echo $filesize +} + +function do_truncate_reduce +{ + typeset filename=$1 + typeset size=$2 + + filesize=$(get_file_size $filename) + eval "echo original filesize: $filesize" + if is_linux; then + truncate $filename -s $((filesize - size)) + else + truncate -s -$size $filename + fi + filesize=$(get_file_size $filename) + eval "echo new filesize after truncate: $filesize" +} diff --git a/tests/zfs-tests/tests/functional/direct/dio_unaligned_filesize.ksh b/tests/zfs-tests/tests/functional/direct/dio_unaligned_filesize.ksh new file mode 100755 index 000000000000..571767d3b1c9 --- /dev/null +++ b/tests/zfs-tests/tests/functional/direct/dio_unaligned_filesize.ksh @@ -0,0 +1,92 @@ +#!/bin/ksh -p +# +# CDDL HEADER START +# +# The contents of this file are subject to the terms of the +# Common Development and Distribution License (the "License"). +# You may not use this file except in compliance with the License. +# +# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE +# or https://opensource.org/licenses/CDDL-1.0. +# See the License for the specific language governing permissions +# and limitations under the License. +# +# When distributing Covered Code, include this CDDL HEADER in each +# file and include the License file at usr/src/OPENSOLARIS.LICENSE. +# If applicable, add the following below this CDDL HEADER, with the +# fields enclosed by brackets "[]" replaced with your own identifying +# information: Portions Copyright [yyyy] [name of copyright owner] +# +# CDDL HEADER END +# + +# +# Copyright (c) 2022 by Triad National Security, LLC. +# + +. $STF_SUITE/include/libtest.shlib +. $STF_SUITE/tests/functional/direct/dio.cfg +. $STF_SUITE/tests/functional/direct/dio.kshlib + +# +# DESCRIPTION: +# Verify Direct I/O reads can read an entire file that is not +# page-aligned in length. When a file is not page-aligned in total +# length, as much that can be read using using O_DIRECT is done so and +# the rest is read using the ARC. O_DIRECT requires page-size alignment. +# +# STRATEGY: +# 1. Write a file that is page-aligned (buffered) +# 2. Truncate the file to be 512 bytes less +# 3. Export then import the Zpool flushing out the ARC +# 4. Read back the file using O_DIRECT +# 5. Verify the file is read back with both Direct I/O and buffered I/O +# + +verify_runnable "global" + +function cleanup +{ + log_must rm -f "$filename" + log_must set recordsize=$rs $TESTPOOL/$TESTFS + check_dio_write_chksum_verify_failures $TESTPOOL "raidz" 0 +} + +log_assert "Verify Direct I/O reads can read an entire file that is not \ + page-aligned" + +log_onexit cleanup + +mntpnt=$(get_prop mountpoint $TESTPOOL/$TESTFS) + +rs=$(get_prop recordsize $TESTPOOL/$TESTFS) +log_must zfs set recordsize=128k $TESTPOOL/$TESTFS + +bs=$((128 * 1024)) # bs=recordsize (128k) +filename="$mntpnt/testfile.iso" + +log_must stride_dd -i /dev/urandom -o $filename -b $bs -c 2 +# Truncating file so the total length is no longer page-size aligned +log_must do_truncate_reduce $filename 512 + +# Exporting the Zpool to make sure all future reads happen from the ARC +log_must zpool export $TESTPOOL +log_must zpool import $TESTPOOL + +# Reading the file back using Direct I/O +prev_dio_read=$(get_iostats_stat $TESTPOOL direct_read_count) +prev_arc_read=$(get_iostats_stat $TESTPOOL arc_read_count) +log_must stride_dd -i $filename -o /dev/null -b $bs -e -d +curr_dio_read=$(get_iostats_stat $TESTPOOL direct_read_count) +curr_arc_read=$(get_iostats_stat $TESTPOOL arc_read_count) +total_dio_read=$((curr_dio_read - prev_dio_read)) +total_arc_read=$((curr_arc_read - prev_arc_read)) + +# We should see both Direct I/O reads an ARC read to read the entire file that +# is not page-size aligned +if [[ $total_dio_read -lt 2 ]] || [[ $total_arc_read -lt 1 ]]; then + log_fail "Expect 2 reads from Direct I/O and 1 from the ARC but \ + Direct I/O: $total_dio_read ARC: $total_arc_read" +fi + +log_pass "Verified Direct I/O read can read a none page-aligned length file" diff --git a/tests/zfs-tests/tests/functional/direct/dio_write_verify.ksh b/tests/zfs-tests/tests/functional/direct/dio_write_verify.ksh index 75a0b9bda28f..350f451d8534 100755 --- a/tests/zfs-tests/tests/functional/direct/dio_write_verify.ksh +++ b/tests/zfs-tests/tests/functional/direct/dio_write_verify.ksh @@ -61,14 +61,6 @@ function cleanup log_must set_tunable32 VDEV_DIRECT_WR_VERIFY_PCT 2 } -function get_file_size -{ - typeset filename="$1" - typeset filesize=$(stat -c %s $filename) - - echo $filesize -} - log_assert "Verify checksum verify works for Direct I/O writes." if is_freebsd; then