Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fixing race condition between mmap/direct IO
In commit ba30ec9 I got a little overzealous in code cleanup. While I was trying to remove all the stable page code for Linux, I misinterpreted why Brian Behlendorf originally had the try rangelock, drop page lock, and aquire range lock in zfs_fillpage(). This is still necessary even without stable pages. This has to occur to avoid a race condition between direct IO writes and pages being faulted in for mmap files. If the rangelock is not held, then a direct IO write can set db->db_data = NULL either in: 1. dmu_write_direct() -> dmu_buf_will_not_fill() -> dmu_buf_will_fill() -> dbuf_noread() -> dbuf_clear_data() 2. dmu_write_direct_done() This can cause the panic then, withtout the rangelock as dmu_read_imp() can get a NULL pointer for db->db_data when trying to do the memcpy. So this rangelock must be held in zfs_fillpage() not matter what. There are further semantics on when the rangelock should be held in zfs_fillpage(). It must only be held when doing zfs_getpage() -> zfs_fillpage(). The reason for this is mappedread() can call zfs_fillpage() if the page is not uptodate. This can occur becaue filemap_fault() will first add the pages to the inode's address_space mapping and then drop the page lock. This leaves open a window were mappedread() can be called. Since this can occur, mappedread() will hold both the page lock and the rangelock. This is perfectly valid and correct. However, it is important in this case to never grab the rangelock in zfs_fillpage(). If this happens, then a dead lock will occur. Finally it is important to note that the rangelock is first attempted to be grabbed with zfs_rangelock_tryenter(). The reason for this is the page lock must be dropped in order to grab the rangelock in this case. Otherwise there is a race between zfs_fillpage() and zfs_write() -> update_pages(). In update_pages() the rangelock is already held and it then grabs the page lock. So if the page lock is not dropped before acquiring the rangelock in zfs_fillpage() there can be a deadlock. Below is a stack trace showing the NULL pointer dereference that was occuring with the dio_mmap ZTS test case before this commit. [ 7737.430658] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 [ 7737.438486] PGD 0 P4D 0 [ 7737.441024] Oops: 0000 [openzfs#1] SMP NOPTI [ 7737.444692] CPU: 33 PID: 599346 Comm: fio Kdump: loaded Tainted: P OE --------- - - 4.18.0-408.el8.x86_64 openzfs#1 [ 7737.455721] Hardware name: GIGABYTE R272-Z32-00/MZ32-AR0-00, BIOS R21 10/08/2020 [ 7737.463106] RIP: 0010:__memcpy+0x12/0x20 [ 7737.467032] Code: ff 0f 31 48 c1 e2 20 48 09 c2 48 31 d3 e9 79 ff ff ff 90 90 90 90 90 90 0f 1f 44 00 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 <f3> 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 f3 a4 [ 7737.485770] RSP: 0000:ffffc1db829e3b60 EFLAGS: 00010246 [ 7737.490987] RAX: ffff9ef195b6f000 RBX: 0000000000001000 RCX: 0000000000000200 [ 7737.498111] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ef195b6f000 [ 7737.505235] RBP: ffff9ef195b70000 R08: ffff9eef1d1d0000 R09: ffff9eef1d1d0000 [ 7737.512358] R10: ffff9eef27968218 R11: 0000000000000000 R12: 0000000000000000 [ 7737.519481] R13: ffff9ef041adb6d8 R14: 00000000005e1000 R15: 0000000000000001 [ 7737.526607] FS: 00007f77fe2bae80(0000) GS:ffff9f0cae840000(0000) knlGS:0000000000000000 [ 7737.534683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7737.540423] CR2: 0000000000000000 CR3: 00000003387a6000 CR4: 0000000000350ee0 [ 7737.547553] Call Trace: [ 7737.550003] dmu_read_impl+0x11a/0x210 [zfs] [ 7737.554464] dmu_read+0x56/0x90 [zfs] [ 7737.558292] zfs_fillpage+0x76/0x190 [zfs] [ 7737.562584] zfs_getpage+0x4c/0x80 [zfs] [ 7737.566691] zpl_readpage_common+0x3b/0x80 [zfs] [ 7737.571485] filemap_fault+0x5d6/0xa10 [ 7737.575236] ? obj_cgroup_charge_pages+0xba/0xd0 [ 7737.579856] ? xas_load+0x8/0x80 [ 7737.583088] ? xas_find+0x173/0x1b0 [ 7737.586579] ? filemap_map_pages+0x84/0x410 [ 7737.590759] __do_fault+0x38/0xb0 [ 7737.594077] handle_pte_fault+0x559/0x870 [ 7737.598082] __handle_mm_fault+0x44f/0x6c0 [ 7737.602181] handle_mm_fault+0xc1/0x1e0 [ 7737.606019] do_user_addr_fault+0x1b5/0x440 [ 7737.610207] do_page_fault+0x37/0x130 [ 7737.613873] ? page_fault+0x8/0x30 [ 7737.617277] page_fault+0x1e/0x30 [ 7737.620589] RIP: 0033:0x7f77fbce9140 Signed-off-by: Brian Atkinson <[email protected]>
- Loading branch information