BRT: Cloning sparse files should take time proportional to allocated size #16014
Comments
Skipping holes sounds like a good idea. I guess cloning holes should already be faster than cloning data, since holes are not counted in the BRT, but maybe we could save some more work for the sync thread, without too much redesign, if we just skip cloning holes over holes. Unrelated to this approach: a couple of weeks ago I produced 7 PRs trying to optimize block cloning: #15941, #15950, #15951, #15954, #15955, #15967, #15976. I would appreciate some review.
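A minimal sketch of the hole-over-hole skip described above, using invented stand-in names (`blkptr_stub` and `clone_can_skip` are not OpenZFS code): only the (source hole, destination hole) pair can be skipped, since a hole cloned over existing data still has to punch a hole at the destination.

```c
#include <stdbool.h>
#include <stdio.h>

struct blkptr_stub { bool is_hole; };   /* stand-in for the real blkptr_t */

/* A block can be skipped entirely when the source block is a hole and
 * the destination is already a hole: no BRT entry is needed and no
 * dirty record is created, saving work for the sync thread. */
static bool
clone_can_skip(struct blkptr_stub src, struct blkptr_stub dst)
{
	return (src.is_hole && dst.is_hole);
}

int main(void)
{
	struct blkptr_stub hole = { .is_hole = true };
	struct blkptr_stub data = { .is_hole = false };
	printf("hole over hole: skip=%d\n", clone_can_skip(hole, hole)); /* 1 */
	printf("data over hole: skip=%d\n", clone_can_skip(data, hole)); /* 0 */
	return 0;
}
```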
I will take a look. At first glance this looks like a good optimization and cleanup. Do you have any getting-started docs, talks, or other resources beyond `brt.c`? (Which is pretty good already!) I'm also considering looking into what it would take to make bclone work on objects with dirty records before sync, so it would be nice to learn the internals. I am against the idea that any userspace operation's outcome should differ between dirty and non-dirty object state, or that any userspace operation should need to force a txg sync to accomplish an ordinary task.
Shouldn't this be closed after the #16007 merge? Can someone retest this? :D
No. That really was a bug fix, just as written. Some of my optimizations in the area may improve performance in this scenario, but not to the level where a huge file of holes would be cloned instantly. The cloning code is still pretty sequential; it has no special optimizations for sparse files.
Describe the feature you would like to see added to OpenZFS
Copying large sparse files with BRT should take time proportional to the number of allocated blocks rather than time proportional to the number of records in the file (i.e. apparent size).
On my system, cloning a 128G file containing a single 4k block of data takes 15 seconds; a 512G file takes about 4x longer, and a 2T file about 4x longer again. In other words, the time grows linearly with the apparent size rather than the allocated size.
This is Fedora 39 (kernel 6.6.8-200.fc39.x86_64) running commit 8f2f6cd (non-DEBUG), with a default-created pool using 128k recordsize on a single consumer NVMe vdev.
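For reference, a minimal sketch of this kind of test (the original command output did not survive here, and the report itself timed `cp`; the file names and the `copy_file_range()`-based clone below are my assumptions): create a 128G sparse file with a single 4k data block, then time the clone.

```c
/* Assumes Linux, glibc >= 2.27, and a filesystem where copy_file_range()
 * clones blocks (e.g. ZFS with block cloning enabled). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	const off_t size = 128LL << 30;		/* 128G apparent size */
	char buf[4096];
	memset(buf, 0xab, sizeof(buf));		/* one 4k block of data */

	int src = open("sparse.img", O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || ftruncate(src, size) != 0 ||
	    pwrite(src, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
		perror("setup");
		return 1;
	}

	int dst = open("clone.img", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (dst < 0) {
		perror("open clone");
		return 1;
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	off_t off_in = 0, off_out = 0, left = size;
	while (left > 0) {
		ssize_t n = copy_file_range(src, &off_in, dst, &off_out,
		    left, 0);
		if (n < 0) {
			perror("copy_file_range");
			return 1;
		}
		if (n == 0)
			break;			/* reached EOF */
		left -= n;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("clone took %.1f s\n", (t1.tv_sec - t0.tv_sec) +
	    (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}
```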
(I'm filing this as a feature request because the performance is reasonable enough, but some might argue it is a regression versus the existing behavior of copying with sparse-aware tools.)
How will this feature improve OpenZFS?
While block cloning sparse files works reasonably well, ZFS can easily create petabyte-sized sparse files, and recent `cp` can copy them efficiently with `SEEK_DATA`. Ideally, block cloning should have similar performance. This would help users with very large, sparsely allocated files such as VM images or large sparse array files. Also, users trying to reflink extremely large, empty sparse files might otherwise be surprised by linear copy performance versus the existing `cp` behavior before reflinking was added to ZFS.

Additional context
See `dnode_next_offset` for a possible implementation, though that function doesn't correctly handle traversing files with dirty blocks. It might be possible to query dbuf state for the range of in-memory dbufs to see whether the cloned range is dirty before attempting the clone, or to use something like `dnode_is_dirty` as a workaround.
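As a rough userspace analog of that traversal, here is a hedged sketch of the `SEEK_DATA`/`SEEK_HOLE` iteration that recent `cp` uses, which is essentially the access pattern a sparse-aware clone path would need from `dnode_next_offset` in-kernel. It only prints the allocated extents of a file; it is an illustration of the idea, not kernel code.

```c
#define _GNU_SOURCE		/* for SEEK_DATA/SEEK_HOLE on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	off_t end = lseek(fd, 0, SEEK_END);
	off_t data = lseek(fd, 0, SEEK_DATA);	/* first allocated run */
	while (data >= 0 && data < end) {
		off_t hole = lseek(fd, data, SEEK_HOLE); /* end of run */
		if (hole < 0)
			hole = end;
		printf("data: [%lld, %lld)\n",
		    (long long)data, (long long)hole);
		data = lseek(fd, hole, SEEK_DATA);	/* next run, or -1 */
	}
	return 0;
}
```

On a 128G file with one 4k block, this visits a single extent instead of millions of records, which is the proportionality this issue asks for.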