Pull request for 4.15 #7

Block layer has a limit on plug, ie. BLK_MAX_REQUEST_COUNT == 16, so we don't gain benefits by batching 64 bios here. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

If we're still going to wait after schedule(), we don't have to do finish_wait() to remove our %wait_queue_entry since prepare_to_wait() won't add the same %wait_queue_entry twice. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the same job. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>

Both wait_for_commit() and wait_for_writer() are checking the condition out of the mutex lock. This refactors code a bit to be lock safe. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Some static functions are needlessly forward declared. Let's remove those declarations since they add no value. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

So that perf can show the state symbol. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

These aren't used outside of volumes.c. Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Signed-off-by: Omar Sandoval <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

We didn't copy fsid to struct super_block.s_uuid so Overlay disables index feature with btrfs as the lower FS. kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off. Fix this by publishing the fsid through struct super_block.s_uuid. [ dsterba: I think that setting s_uuid is the last missing bit. Overlay needs the file handle encoding support from the lower filesystem, which is supported. Filling the whole filesystem id is correct, the subvolume id is encoded in the file handle buffer from inside btrfs_encode_fh. ] Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

After mapping block with BTRFS_MAP_WRITE, parities have been sorted to the end position, so this search can start from the first parity stripe. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> [ copied changelog as a comment ] Signed-off-by: David Sterba <[email protected]>

While we submit direct writes, if the inode is flagged with nodatasum, there's no benefit to submit asynchronously, because a) we don't have to calculate checksum across processors, b) and direct IO has started a plug, but async submit makes us queue IO on each device's scheduled IO list instead of DIO's plug list, so that IOs get much less merges in general. Let's use sync submit for nodatasum inodes. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

Src was initially part of 31ff1cd ("Btrfs: Copy into the log tree in big batches"), however 16e7549 ("Btrfs: incompatible format change to remove hole extents") changed parameters passed to copy_items which made the src variable redundant. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: Timofey Titovets <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

iterate_dir_item:found_key - introduced in 31db9f7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used. record_ref:num - ditto. This is a first pass with the low-hanging fruit. There are still quite a few unused parameters in some functions which have to abide by a callback interface. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

btrfs_changed_cb_t represents the signature of the callback being passed to btrfs_compare_trees. Currently there is only one such callback, namely changed_cb in send.c. This function doesn't really use the first 2 parameters, i.e. the roots. Since there are no other functions implementing btrfs_changed_cb_t, let's remove the unused parameters from the prototype and implementation. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

Introduced by 5a5f79b ("Btrfs: allow unaligned DIO") and never used. The buffered fallback from unaligned DIO works as expected. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: Timofey Titovets <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

…hunk Currently the code executes add_extent_mapping and if it is successful it links the new mapping, it then proceeds to unlock the extent mapping tree and check for failure and handle them. Instead, rework the code to only perform a single check if add_extent_mapping has failed and handle it, otherwise the code continues in a linear fashion. No functional changes Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

__link_block_group is called from only 2 places and at each call site the space_info being passed is the same as the space info assigned to the passed cache struct. Let's remove the redundant argument and make the function reference the space_info from the passed block_group_cache. No functional changes Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ renamed to link_block_group ] Signed-off-by: David Sterba <[email protected]>

If 'btrfs_alloc_path()' fails, we must free the resources already allocated, as done in the other error handling paths in this function. Signed-off-by: Christophe JAILLET <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>

The value of variable 'can_recover' is never used after being set, thus it should be removed, as it was never used since the first commit 68a7342 ("Btrfs: cleanup orphaned root orphan item"). Signed-off-by: Christos Gkekas <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

There are checks on fs_info in __btrfs_panic to avoid dereferencing a null fs_info, however, there is a call to btrfs_crit that may also dereference a null fs_info. Fix this by adding a check to see if fs_info is null and only print the s_id if fs_info is non-null. Detected by CoverityScan CID#401973 ("Dereference after null check") Fixes: efe120a ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix") Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
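The guard described above can be sketched in userspace C. This is only an illustration under stated assumptions: `sb_sketch`, `fs_info_sketch`, `panic_s_id()` and `format_panic()` are hypothetical stand-ins, and the message text and the "<unknown>" placeholder are made up rather than taken from the kernel.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct sb_sketch { char s_id[32]; };
struct fs_info_sketch { struct sb_sketch *sb; };

/* Pick the id to print; never dereference a NULL fs_info. */
const char *panic_s_id(const struct fs_info_sketch *fs_info)
{
	return fs_info ? fs_info->sb->s_id : "<unknown>";
}

/* Format a message the way a panic/critical path might (made-up text). */
int format_panic(char *buf, size_t len,
		 const struct fs_info_sketch *fs_info, int err)
{
	return snprintf(buf, len, "BTRFS panic (device %s): errno=%d",
			panic_s_id(fs_info), err);
}
```

Calling `format_panic(buf, sizeof(buf), NULL, -5)` takes the placeholder branch instead of faulting, which is the property the fix is after.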

Signed-off-by: Satoru Takeuchi <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

We have started plug in btrfs_write_and_wait_marked_extents() but the generated IOs actually go to device's schedule IO list where the work is done in another task, thus the started plug doesn't make any sense. And since we wait for IOs immediately after writing meta blocks, it's the same case as writing log tree, doing sync submit can merge more IOs. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Since both committing transaction and writing log-tree are doing plugging on metadata IO, we can unify to use %sync_writers to benefit both cases, instead of checking bio_flags while writing meta blocks of log-tree. We can remove this bio_flags because in order to write dirty blocks, log tree also uses btrfs_write_marked_extents(), inside which we have enabled %sync_writers, therefore, every write goes in a synchronous way, so does checksumming. Please also note that, bio_flags is applied per-context while %sync_writers is applied per-inode, so this might incur some overhead, ie. 1) while log tree is flushing its dirty blocks via btrfs_write_marked_extents(), in which %sync_writers is increased by one. 2) in the meantime, some writeback operations may happen upon btrfs's metadata inode, so these writes go synchronously, too. However, AFAICS, the overhead is not a big one while the win is that we unify the two places that need the synchronous way and remove a special hack/flag. This removes the bio_flags related stuff for writing log-tree. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

We've seen the following backtrace stack in ftrace or dmesg log,
  kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent
  kworker/u16:10-4244 [000] 241942.480956: kernel_stack: <stack trace>
  => finish_ordered_fn (ffffffffa0384475)
  => btrfs_scrubparity_helper (ffffffffa03ca577) <-----"incorrect"
  => btrfs_freespace_write_helper (ffffffffa03ca98e) <-----"correct"
  => process_one_work (ffffffff81117b2f)
  => worker_thread (ffffffff81118c2a)
  => kthread (ffffffff81121de0)
  => ret_from_fork (ffffffff81d7087a)
btrfs_freespace_write_helper is actually calling normal_worker_helper instead of btrfs_scrubparity_helper, so somehow kernel has parsed the incorrect function address while unwinding the stack, btrfs_scrubparity_helper really shouldn't be shown up. It's caused by compiler doing inline for our helper function, adding a noinline tag can fix that. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> [ use noinline_for_stack ] Signed-off-by: David Sterba <[email protected]>

Forward the correct return value -ENOMEM from btrfsic_dev_state_alloc() too. Signed-off-by: Allen Pais <[email protected]> Reviewed-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ adjust changelog ] Signed-off-by: David Sterba <[email protected]>

Don't populate the read-only array types on the stack, instead make it static const. Makes the object code smaller by nearly 60 bytes:
Before:
   text    data     bss     dec     hex filename
  90536    6552      64   97152   17b80 fs/btrfs/ioctl.o
After:
   text    data     bss     dec     hex filename
  90414    6616      64   97094   17b46 fs/btrfs/ioctl.o
Signed-off-by: Colin Ian King <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

The local bio_list may have pending bios when doing cleanup, it can end up with memory leak if they don't get freed. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

…read By analyzing the perf on btrfs send, we found that it takes a large amount of CPU time in page_cache_sync_readahead. This cost can be reduced by switching to the asynchronous variant. The overall performance gain on HDD and SSD was 9 and 15 percent, respectively, when simply sending a large file. Signed-off-by: Kuanling Huang <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Nikolay reported that generic/273 was failing currently with ENOSPC. Turns out this is because we get to the point where the outstanding reservations are greater than the pinned space on the fs. This is a mistake, previously we used the current reservation amount in may_commit_transaction, not the entire outstanding reservation amount. Fix this to find the minimum byte size needed to make progress in flushing, and pass that into may_commit_transaction. From there we can make a smarter decision on whether to commit the transaction or not. This fixes the failure in generic/273. From Nikolai, IOW: when we go to the final stage of deciding whether to do trans commit, instead of passing all the reservations from all tickets we just pass the reservation for the current ticket. Otherwise, in case all reservations exceed pinned space, then we don't commit transaction and fail prematurely. Before we passed num_bytes from flush_space, where num_bytes was the sum of all pending reservations, but now all we do is take the first ticket and commit the trans if we can satisfy that. Fixes: 957780e ("Btrfs: introduce ticketed enospc infrastructure") Cc: [email protected] # 4.8 Reported-by: Nikolay Borisov <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Tested-by: Nikolay Borisov <[email protected]> [ added Nikolai's comment ] Signed-off-by: David Sterba <[email protected]>
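The difference between "sum of all tickets" and "first ticket only" can be shown with a small userspace sketch. Everything here is an assumption made for illustration: `ticket_sketch`, `total_reserved()`, `min_bytes_needed()` and `can_commit()` are hypothetical names, and the real decision in may_commit_transaction takes more inputs than pinned space alone.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending reservation ticket. */
struct ticket_sketch { uint64_t bytes; };

/* Old behaviour: sum of every outstanding reservation. */
uint64_t total_reserved(const struct ticket_sketch *t, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += t[i].bytes;
	return sum;
}

/* New behaviour: only what the first ticket needs to make progress. */
uint64_t min_bytes_needed(const struct ticket_sketch *t, size_t n)
{
	return n ? t[0].bytes : 0;
}

/* Simplified commit decision: is committing worth it for this amount? */
int can_commit(uint64_t pinned, uint64_t needed)
{
	return pinned >= needed;
}
```

With tickets of 4096, 8192 and 16384 bytes and 8192 bytes pinned, the summed amount refuses the commit while the first-ticket amount allows it, which matches the premature ENOSPC described above.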

Added in c8b9781 ("Btrfs: Add zlib compression support") and has survived until now (since 08.10.2008). Because 'start' is checked for zero before the branch, it's safe to remove that subtraction. Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: Satoru Takeuchi <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Current check_leaf() function does a good job checking key order and item offset/size. However it only checks from slot 0 to the last but one slot, this is good but makes later expansion hard. So this refactoring iterates from slot 0 to the last slot. For key comparison, it uses a key with all 0 as initial key, so all valid keys should be larger than that. And for item size/offset checks, it compares current item end with previous item offset. For slot 0, use leaf end as a special case. This makes later item/key offset checks and item size checks easier to implement. Also, make check_leaf() return -EUCLEAN rather than -EIO to indicate error. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
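A minimal userspace sketch of the iteration described above, assuming simplified key and item layouts. `key_sketch`, `item_sketch`, `key_cmp()` and `check_leaf_sketch()` are hypothetical names, not the kernel's, and the real checker reads items out of an extent buffer instead of an array.

```c
#include <stdint.h>
#include <string.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, for platforms lacking the macro */
#endif

/* Hypothetical simplified key and item header. */
struct key_sketch { uint64_t objectid; uint8_t type; uint64_t offset; };
struct item_sketch { struct key_sketch key; uint32_t offset; uint32_t size; };

int key_cmp(const struct key_sketch *a, const struct key_sketch *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Walk every slot, 0 to the last one, comparing against the previous slot. */
int check_leaf_sketch(const struct item_sketch *items, uint32_t nritems,
		      uint32_t leaf_data_size)
{
	struct key_sketch prev_key;
	uint32_t slot;

	memset(&prev_key, 0, sizeof(prev_key));	/* all-zero initial key */
	for (slot = 0; slot < nritems; slot++) {
		const struct item_sketch *it = &items[slot];
		/* Slot 0 must end at the leaf end, others at the previous offset. */
		uint64_t expected_end = slot == 0 ? leaf_data_size
						  : items[slot - 1].offset;

		if (key_cmp(&prev_key, &it->key) >= 0)
			return -EUCLEAN;	/* keys not strictly ascending */
		if ((uint64_t)it->offset + it->size != expected_end)
			return -EUCLEAN;	/* gap or overlap in item data */
		prev_key = it->key;
	}
	return 0;
}
```

Because each slot is compared with the previous one, per-item checks can later be added inside the same loop body without special-casing the last slot.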

Function check_leaf() checks if any item pointer points outside of the leaf, but it doesn't check if the pointer overlaps with the item itself. Normally only the last item may be the victim, but adding such check is never a bad idea anyway. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Add extra checks for item with EXTENT_DATA type. This checks the following things:
0) Key offset
   All key offsets must be aligned to sectorsize. Inline extent must have 0 for key offset.
1) Item size
   Uncompressed inline file extent size must match item size. (Compressed inline file extent has no information about its on-disk size.) Regular/preallocated file extent size must be a fixed value.
2) Every member of regular file extent item
   Including alignment for bytenr and offset, possible value for compression/encryption/type.
3) Type/compression/encode must be one of the valid values.
This should be the most comprehensive and strict check in the context of btrfs_item for EXTENT_DATA. Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> [ switch to BTRFS_FILE_EXTENT_TYPES, similar to what BTRFS_COMPRESS_TYPES does ] Signed-off-by: David Sterba <[email protected]>

EXTENT_CSUM checker is a relatively easy one, only needs to check:
1) Objectid
   Fixed to BTRFS_EXTENT_CSUM_OBJECTID
2) Key offset alignment
   Must be aligned to sectorsize
3) Item size alignment
   Must be aligned to csum size
Signed-off-by: Qu Wenruo <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
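The three checks listed above fit in a few lines. This is a userspace sketch, not the kernel function: `check_csum_item_sketch()` and `SKETCH_EXTENT_CSUM_OBJECTID` are hypothetical names, and the constant's value (-10 as an unsigned 64-bit number) is an assumption that should be verified against the btrfs headers.

```c
#include <stdint.h>
#include <errno.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, for platforms lacking the macro */
#endif

/* Assumed to mirror BTRFS_EXTENT_CSUM_OBJECTID; verify against the headers. */
#define SKETCH_EXTENT_CSUM_OBJECTID ((uint64_t)-10)

int check_csum_item_sketch(uint64_t objectid, uint64_t key_offset,
			   uint32_t item_size, uint32_t sectorsize,
			   uint32_t csum_size)
{
	if (objectid != SKETCH_EXTENT_CSUM_OBJECTID)
		return -EUCLEAN;	/* 1) objectid is fixed */
	if (key_offset % sectorsize != 0)
		return -EUCLEAN;	/* 2) key offset aligned to sectorsize */
	if (item_size % csum_size != 0)
		return -EUCLEAN;	/* 3) item size aligned to csum size */
	return 0;
}
```

For example, with a 4096-byte sectorsize and 4-byte checksums, an item of size 8 at key offset 4096 passes, while an item of size 6 is rejected.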

When new device is being added to seed FS, seed FS is marked writable, but when we fail to bring in the new device, we missed to undo the writable part. This patch fixes it. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

Instead of BUG_ON, return the error to the caller, and handle the failure by aborting the transaction and going through the error path. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

btrfs_init_new_device() calls btrfs_attach_transaction() to commit sys chunks, and it should error out if it fails. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>

btrfs_update_root can fail and it aborts the transaction, the correct way to handle an aborted transaction is to explicitly end with btrfs_end_transaction. Even now the code is correct since btrfs_commit_transaction would handle an aborted transaction but this is more of an implementation detail. So let's be explicit in handling failure in btrfs_update_root. Furthermore btrfs_commit_transaction can also fail and by ignoring its return value we could have left the in-memory copy of the root item in an inconsistent state. So capture the error value which allows us to correctly revert the RO/RW flags in case of commit failure. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

If btrfs_transaction_commit fails it will proceed to call cleanup_transaction, which in turn already does btrfs_abort_transaction. So let's remove the unnecessary code duplication. Also let's be explicit about handling failure of btrfs_uuid_tree_add by calling btrfs_end_transaction. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Bool initializations should use true and false. Bool tests don't need comparisons. Signed-off-by: Thomas Meyer <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Currently struct names for sysfs are generated only based on the attribute names. This means that attribute names cannot be reused in multiple places throughout the complete btrfs sysfs hierarchy. E.g. allocation/data/total_bytes and allocation/data/single/total_bytes result in the same struct name btrfs_attr_total_bytes. A workaround for this case was made in the past by ad hoc creating an extra macro wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct name. Instead of polluting sysfs.h with such kind of extra macro definitions, and only doing so when there are collisions, use a prefix which gets inserted in the struct name, so we keep everything nicely grouped together by default. Current collections of attributes are:
* (the toplevel, empty prefix)
* allocation
* space_info
* raid
* features
Signed-off-by: Hans van Kranenburg <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
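How a prefix keeps generated struct names apart can be shown with plain token pasting. This is a sketch with made-up names: `attr_sketch` and `SKETCH_ATTR` are hypothetical and much simpler than the real sysfs macros, which also wire up show/store callbacks.

```c
/* Hypothetical, minimal attribute type. */
struct attr_sketch { const char *name; };

/*
 * The prefix becomes part of the generated identifier, so the same
 * attribute name can be reused in different collections.
 */
#define SKETCH_ATTR(_prefix, _name) \
	struct attr_sketch sketch_attr_##_prefix##_##_name = { .name = #_name }

SKETCH_ATTR(space_info, total_bytes);	/* sketch_attr_space_info_total_bytes */
SKETCH_ATTR(raid, total_bytes);		/* sketch_attr_raid_total_bytes */
SKETCH_ATTR(, label);			/* empty prefix: sketch_attr__label */
```

Both `total_bytes` attributes keep the same visible name while living in distinct objects, so no ad hoc wrapper macro is needed when a collision shows up.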

Commit a53f4f8 ("btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.") started using internal calls and we replace them with more suitable ones. Signed-off-by: Rakesh Pandit <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Remove dead assignment of num_bytes. Also, as num_bytes is only used in the will_compress block as a copy of total_in, just replace it with total_in and drop num_bytes entirely. Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

It's no doubt the comprehensive tree block checker will become larger, so moving them into their own files is quite reasonable. Signed-off-by: Qu Wenruo <[email protected]> [ wording adjustments ] Signed-off-by: David Sterba <[email protected]>

Use inline function to replace macro since we don't need stringification. (Macro still exists until all callers get updated) And add more info about the error, and replace EIO with EUCLEAN. For nr_items error, report if it's too large or too small, and output the valid value range. For node block pointer, added a new alignment checker. For key order, also output the next key to make the problem more obvious. Signed-off-by: Qu Wenruo <[email protected]> [ wording adjustments, unindented long strings ] Signed-off-by: David Sterba <[email protected]>

Enhance the output to print: 1) the reason 2) the bad value, if reason is not sufficient 3) good value (range) Signed-off-by: Qu Wenruo <[email protected]> [ wording, unindented long strings ] Signed-off-by: David Sterba <[email protected]>

Output the bad value and expected good value (or its alignment). Signed-off-by: Qu Wenruo <[email protected]> [ unindent long strings ] Signed-off-by: David Sterba <[email protected]>

Output the invalid member name and its bad value, along with its expected value range or alignment. Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]>

This was intended to congest higher layers to not send bios, but as
1) the congested bit has been taken by writeback
   Async bios come from buffered writes and DIO writes. For DIO writes, we want to submit them ASAP, while for buffered writes, writeback uses balance_dirty_pages() to throttle how much dirty pages we can have.
2) and no one is waiting for %nr_async_bios down to zero,
   Historically, it was introduced along with changes which let checksumming workload spread across different cpus. And at that time, pdflush was used instead of per-bdi flushing, perhaps pdflush did not have the necessary information for writeback to do throttling.
We can safely remove them now. Signed-off-by: Liu Bo <[email protected]> [ additional explanation from mails, removed unused variable 'limit' ] Signed-off-by: David Sterba <[email protected]>

By setting compression for a defrag task, the task will start IO at the end of defrag. After the combo of filemap_flush(), we've already made sure that dirty pages have made progress via async compress thread because the second filemap_flush() will wait for page lock, which won't be unlocked until those pages have been marked as writeback and ordered extents have been queued. And this is for per-inode defrag, it's not helpful to wait on a global %async_delalloc_pages and %nr_async_submits from fs_info. Although waiting on %nr_async_submits means that all bios are submitted down to per-device schedule IO lists, it doesn't wait for their completions, thus users still need to do fsync/sync to make sure the data is on disk. While with this change, it makes sure that pages are marked with writeback bits and will be submitted asynchronously shortly, therefore, the behavior of defrag option '-c' remains unchanged. Signed-off-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>

Now that we have the combo of flushing twice, which can make sure IO have started since the second flush will wait for page lock which won't be unlocked unless setting page writeback and queuing ordered extents, we don't need %async_submit_draining, %async_delalloc_pages and %nr_async_submits to tell whether the IO has actually started. Moreover, all the flushers in use are followed by functions that wait for ordered extents to complete, so %nr_async_submits, which tracks whether bio's async submit has made progress, doesn't really make sense. However, %async_delalloc_pages is still required by shrink_delalloc() as that function doesn't flush twice in the normal case (just issues a writeback with WB_REASON_FS_FREE_SPACE). Signed-off-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>

We now get a harmless compile-time warning on 32-bit architectures:
fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
fs/btrfs/tree-checker.c:189:70: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Werror=format=]
This changes the format string to use %zu instead of %lu for size_t. Fixes: c1f6520 ("btrfs: tree-checker: Enhance output for check_extent_data_item") Signed-off-by: Arnd Bergmann <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
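The portable way to print a size_t is the `z` length modifier, which matches the type on both 32-bit and 64-bit targets. A minimal userspace sketch follows; `format_item_size()` and the message text are hypothetical and only stand in for the checker's output.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * %zu matches size_t regardless of whether it is 'unsigned int' (typical
 * 32-bit) or 'unsigned long' (typical 64-bit); %lu is only right for the
 * latter.
 */
int format_item_size(char *buf, size_t len, size_t have, size_t expect)
{
	return snprintf(buf, len, "invalid item size, have %zu expect %zu",
			have, expect);
}
```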

Remove variables 'start' and 'end', which are set but never used. Signed-off-by: Christos Gkekas <[email protected]> Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: David Sterba <[email protected]>

add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can be used to check for the actual error that occurred in the function. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Liu Bo <[email protected]> [ minor error message adjustment ] Signed-off-by: David Sterba <[email protected]>
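The error-pointer convention this relies on can be reproduced in userspace to show why the caller gets the actual errno back. The helpers below are simplified re-implementations for illustration, not the kernel's definitions, and `device_sketch` and `add_missing_dev_sketch()` (including its `simulate_oom` parameter) are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace re-implementations of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Errors are encoded in the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical device type and allocator. */
struct device_sketch { uint64_t devid; };

struct device_sketch *add_missing_dev_sketch(uint64_t devid, int simulate_oom)
{
	struct device_sketch *dev;

	if (simulate_oom)
		return ERR_PTR(-ENOMEM);
	dev = calloc(1, sizeof(*dev));
	if (!dev)
		return ERR_PTR(-ENOMEM);
	dev->devid = devid;
	return dev;
}
```

A caller then tests `IS_ERR(dev)` and forwards `PTR_ERR(dev)` instead of collapsing every failure into a single NULL return.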

EIO is only for the IO failure to the device, avoid it. Use ENOENT as that's the closest error code describing what happened. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update changelog ] Signed-off-by: David Sterba <[email protected]>

Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>

When device is missing without the -o degraded option then it's an error, so report it as an error instead of a warning. And when -o degraded option is provided, log the missing device as warning. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: David Sterba <[email protected]> [ switch error to bool ] Signed-off-by: David Sterba <[email protected]>

We pass in a pointer in our send arg struct, this means the struct size doesn't match with 32bit user space and 64bit kernel space. Fix this by adding a compat mode and doing the appropriate conversion. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ move structure to the beginning, next to receive 32bit compat ] Signed-off-by: David Sterba <[email protected]>
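Why an embedded pointer breaks the ioctl ABI, and what a compat conversion looks like, can be sketched with two struct layouts. The structs below are hypothetical and shorter than the real send argument struct; the packed attribute is a GCC/Clang extension used here to mimic a 32-bit layout.

```c
#include <stdint.h>

/* Hypothetical native layout: the pointer is 8 bytes on a 64-bit kernel. */
struct args_native_sketch {
	int64_t fd;
	uint64_t count;
	uint64_t *sources;
	uint64_t flags;
};

/*
 * Hypothetical layout as seen by 32-bit user space: the pointer is a
 * 32-bit value, so every later field sits at a different offset.
 */
struct args_compat32_sketch {
	int64_t fd;
	uint64_t count;
	uint32_t sources;
	uint64_t flags;
} __attribute__((packed));

/* Copy field by field, widening the 32-bit user pointer. */
void args_from_compat(struct args_native_sketch *dst,
		      const struct args_compat32_sketch *src)
{
	dst->fd = src->fd;
	dst->count = src->count;
	dst->sources = (uint64_t *)(uintptr_t)src->sources;
	dst->flags = src->flags;
}
```

The compat layout is 28 bytes while the native one is 32 bytes on a typical 64-bit ABI, so the struct cannot simply be copied in as-is.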

The use of sector_t is not necessary, it's just for a warning. Switch to u64 and rename the variable and use byte units instead of 512b, ie. dropping the >> 9 shifts. The messages are adjusted as well. Reviewed-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>

We're going to remove sector_t and will use 'offset', so this patch frees the name. Reviewed-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>

The use of sector_t in the callchain of submit_extent_page is not necessary. Switch to u64 and rename the variable and use byte units instead of 512b, ie. dropping the >> 9 shifts and avoiding the con(tro)versions of sector_t. Reviewed-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>
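For reference, the unit conversion that these patches stop carrying through the call chain is a shift by 9, since a sector is 512 bytes. A small sketch with hypothetical helper names:

```c
#include <stdint.h>

/* 512-byte sectors: 1 << 9 == 512. */
#define SKETCH_SECTOR_SHIFT 9

/* Byte offset to 512b sector number (the ">> 9" being dropped). */
uint64_t bytes_to_sectors(uint64_t bytes)
{
	return bytes >> SKETCH_SECTOR_SHIFT;
}

/* 512b sector number back to a byte offset. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors << SKETCH_SECTOR_SHIFT;
}
```

Keeping byte offsets in a u64 throughout means the conversion is only needed at the point where a value is actually handed to an interface that counts in sectors.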

This adds the infrastructure for turning ref verify on and off for a mount, to be used by a later patch. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ enhance btrfs_print_mod_info to print if ref-verify is compiled in ] Signed-off-by: David Sterba <[email protected]>

We need the actual root for the ref verifier tool to work, so change these functions to pass the root around instead. This will be used in a subsequent patch. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

We were having corruption issues that were tied back to problems with the extent tree. In order to track them down I built this tool to try and find the culprit, which was pretty successful. If you compile with this tool on it will live verify every ref update that the fs makes and make sure it is consistent and valid. I've run this through with xfstests and haven't gotten any false positives. Thanks, Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update error messages, add fixup from Dan Carpenter to handle errors of read_tree_block ] Signed-off-by: David Sterba <[email protected]>

We were only doing btrfs_check_space_for_delayed_refs() if the metadata space was full, ie we couldn't allocate chunks. This assumes we'll be able to allocate chunks during transaction commit, but since nothing does a LIMIT flush during the transaction commit this won't actually happen unless we happen to run shy of actual space. We already take into account a full fs in btrfs_check_space_for_delayed_refs() so just kill this extra check to make sure we're ending the transaction when we need to. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Simplify the error handling in __btrfs_run_delayed_refs by breaking out the code used to return a head back to the delayed_refs tree for processing into a helper function. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Move the extent_op cleanup for an empty head ref to a helper function to help simplify __btrfs_run_delayed_refs. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Move this code out to a helper function to further simplify __btrfs_run_delayed_refs. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

We only use this logic if our ref isn't a ref_head, so move it up into the if (ref) case since we know that this is a normal ref and not a delayed ref head. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

We do a couple different cleanup operations on the ref head. We adjust counters, we'll free any reserved space if we didn't end up using the ref, and we clear the pending csum bytes. Move all these disparate things into cleanup_ref_head and clean up the logic in __btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner, as well as making run_one_delayed_ref() only deal with real refs and not the ref head. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

This is just excessive information in the ref_head, and makes the code complicated. It is a relic from when we had the heads and the refs in the same tree, which is no longer the case. With this removal I've cleaned up a bunch of the cruft around this old assumption as well. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

We can get this from the ref we've passed in. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

These are useful for debugging problems where we mess with trans->block_rsv to make sure we're not screwing something up. Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

At few places we could use BLK_STS_OK and BLK_STS_NOSUPP. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Satoru Taekeuchi <[email protected]> Reviewed-by: David Sterba <[email protected]> [ dropped first hunk btrfs_endio_direct_read ] Signed-off-by: David Sterba <[email protected]>

Code cleanup for better understanding: Variable needs_unlock to be called extent_locked to show state as opposed to action. Changed the type to int, to reduce code in the critical path. Signed-off-by: Goldwyn Rodrigues <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

A cleanup patch, use need_full_stripe() to replace the open code. Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Qu Wenruo <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

When one of the devices is missing, bbio_error() takes care of setting the error status. If that is the only IO still pending in the stripe, it fails to check the status of the other IO at %bbio_error before setting the error %bi_status for the %orig_bio. Fix this by checking whether %bbio->error has exceeded %bbio->max_errors.
With the reproducer below, the fdatasync error is seen intermittently:
  mount -o degraded /dev/sdc /btrfs
  dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync
  dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error
The problem is intermittent because the following conditions, which depend on timing, have to be met:
In btrfs_map_bio()
  - the RAID1 missing device has to be at %dev_nr = 1
In bbio_error()
  - before bbio_error() is called, the bio of the not-missing device at %dev_nr = 0 must have completed, so that the condition below is true:
    if (atomic_dec_and_test(&bbio->stripes_pending)) {
Signed-off-by: Anand Jain <[email protected]> Reviewed-by: Liu Bo <[email protected]> Signed-off-by: David Sterba <[email protected]>

Fix missing change from commit f8f84b2 ("btrfs: index check-integrity state hash by a dev_t"). Function btrfsic_dev_state_hashtable_lookup uses dev_t to generate the hashval when looking up a btrfsic_dev_state in the hash table. So when we add a btrfsic_dev_state into the hash table, it should also use dev_t. Reproducer of this bug: Use MOUNT_OPTIONS="-o check_int" when running xfstest; the device cannot be mounted successfully, so xfstest cannot run. Signed-off-by: Gu JinXiang <[email protected]> Reviewed-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h. Let's unify the code base to always use the symbolic constants. No functional changes. Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

Preliminary support for setting compression level for zlib, the following works:
  $ mount -o compress=zlib                 # default
  $ mount -o compress=zlib0                # same
  $ mount -o compress=zlib9                # level 9, slower sync, less data
  $ mount -o compress=zlib1                # level 1, faster sync, more data
  $ mount -o remount,compress=zlib3        # level set by remount
The compress-force option works the same as compress. The level is visible in the same format in /proc/mounts. Level set via file property does not work yet. Required patch: "btrfs: prepare for extensions in compression options" Signed-off-by: David Sterba <[email protected]>

This is bikeshedding, but it seems people are drastically more likely to understand "zlib:9" as compression level rather than an algorithm version compared to "zlib9". Based on feedback on the mailing list, the ":9" will be the only accepted syntax. The level must be a single digit. Unrecognized format will result in the default, for forward compatibility in a similar way the compression algorithm specifier was relaxed in commit a7164fa ("btrfs: prepare for extensions in compression options"). Signed-off-by: Adam Borowski <[email protected]> Reviewed-by: David Sterba <[email protected]> [ tighten the accepted format ] Signed-off-by: David Sterba <[email protected]>
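The accepted syntax described above can be sketched as a small parser. This is only an illustration in Python, not the kernel code: the function name `parse_zlib_level` is hypothetical, and using 0 to mean "fall back to the default level" is an assumption made for the sketch.

```python
# Assumption for this sketch: 0 stands for "use the default zlib level".
DEFAULT_LEVEL = 0


def parse_zlib_level(option: str) -> int:
    """Return the level from a 'zlib:N' option, or the default otherwise.

    Only a colon followed by exactly one digit 1-9 is accepted; any other
    suffix is treated as unrecognized and falls back to the default.
    """
    if not option.startswith("zlib"):
        return DEFAULT_LEVEL
    suffix = option[len("zlib"):]
    if len(suffix) == 2 and suffix[0] == ":" and suffix[1] in "123456789":
        return int(suffix[1])
    return DEFAULT_LEVEL


levels = {opt: parse_zlib_level(opt) for opt in ("zlib", "zlib:9", "zlib9", "zlib:12")}
```

Falling back to the default instead of rejecting the option is what keeps older parsers tolerant of option formats added later.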

That was only an extra check to tackle a few bugs around this area, now it's safe to remove it. Replace it by an ASSERT. Signed-off-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>

This code was first introduced in 31db9f7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive") and it was not functional, then it got slightly refactored in e938c8a ("Btrfs: code cleanups for send/receive"), alas it was still dead. So let's remove it for good! Signed-off-by: Nikolay Borisov <[email protected]> Signed-off-by: David Sterba <[email protected]>

…efs for uncompressed extents The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and offset (encoded as a single logical address) to a list of extent refs. LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping (extent ref -> extent bytenr and offset, or logical address). These are useful capabilities for programs that manipulate extents and extent references from userspace (e.g. dedup and defrag utilities). When the extents are uncompressed (and not encrypted and not other), check_extent_in_eb performs filtering of the extent refs to remove any extent refs which do not contain the same extent offset as the 'logical' parameter's extent offset. This prevents LOGICAL_INO from returning references to more than a single block. To find the set of extent references to an uncompressed extent from [a, b), userspace has to run a loop like this pseudocode: for (i = a; i < b; ++i) extent_ref_set += LOGICAL_INO(i); At each iteration of the loop (up to 32768 iterations for a 128M extent), data we are interested in is collected in the kernel, then deleted by the filter in check_extent_in_eb. When the extents are compressed (or encrypted or other), the 'logical' parameter must be an extent bytenr (the 'a' parameter in the loop). No filtering by extent offset is done (or possible?) so the result is the complete set of extent refs for the entire extent. This removes the need for the loop, since we get all the extent refs in one call. Add an 'ignore_offset' argument to iterate_inodes_from_logical, [...several levels of function call graph...], and check_extent_in_eb, so that we can disable the extent offset filtering for uncompressed extents. This flag can be set by an improved version of the LOGICAL_INO ioctl to get either behavior as desired. There is no functional change in this patch. The new flag is always false. 
Signed-off-by: Zygo Blaxell <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor coding style fixes ] Signed-off-by: David Sterba <[email protected]>

Now that check_extent_in_eb()'s extent offset filter can be turned off, we need a way to do it from userspace.
Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent offset filtering, taking the place of one of the existing reserved[] fields.
Previous versions of LOGICAL_INO neglected to check whether any of the reserved fields have non-zero values. Assigning meaning to those fields now may change the behavior of existing programs that left these fields uninitialized. The lack of a zero check also means that new programs have no way to know whether the kernel is honoring the flags field.
To avoid these problems, define a new ioctl LOGICAL_INO_V2. We can use the same argument layout as LOGICAL_INO, but shorten the reserved[] array by one element and turn it into the 'flags' field. The V2 ioctl explicitly checks that reserved fields and unsupported flag bits are zero so that userspace can negotiate future feature bits as they are defined.
Since the memory layouts of the two ioctls' arguments are compatible, there is no need for a separate function for logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search where the layout and code are quite different). A version parameter and an 'if' statement will suffice.
Now that we have a flags field in logical_ino_args, add a flag BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and pass it down the stack to iterate_inodes_from_logical.
Motivation and background, copied from the patchset cover letter:
Suppose we have a file with one extent:
    root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
    root@tester:~# sync
Split the extent by overwriting it in the middle:
    root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a
We should now have 3 extent refs to 2 extents, with one block unreachable. The extent tree looks like:
    root@tester:~# btrfs-debug-tree /dev/vdc -t 2
    [...]
            item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
                    extent refs 2 gen 29 flags DATA
                    extent data backref root 5 objectid 261 offset 0 count 2
    [...]
            item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
                    extent refs 1 gen 30 flags DATA
                    extent data backref root 5 objectid 261 offset 8192 count 1
    [...]
and the ref tree looks like:
    root@tester:~# btrfs-debug-tree /dev/vdc -t 5
    [...]
            item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
                    extent data disk byte 1103101952 nr 73728
                    extent data offset 0 nr 8192 ram 73728
                    extent compression(none)
            item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
                    extent data disk byte 1103175680 nr 4096
                    extent data offset 0 nr 4096 ram 4096
                    extent compression(none)
            item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
                    extent data disk byte 1103101952 nr 73728
                    extent data offset 12288 nr 61440 ram 73728
                    extent compression(none)
    [...]
There are two references to the same extent with different, non-overlapping byte offsets:
    [------------------72K extent at 1103101952----------------------]
    [--8K----------------|--4K unreachable----|--60K-----------------]
    ^                                         ^
    |                                         |
    [--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
                         |
                         v
                         [-----4K extent-----] at 1103175680
We want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to do it with 18 LOGICAL_INO calls:
    root@tester:~# btrfs ins log 1103101952 -P /test/
    Using LOGICAL_INO
    inode 261 offset 0 root 5
    root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
    inode 261 offset 0 root 5
    inode 261 offset 4096 root 5   <- same extent ref as offset 0
                                   (offset 8192 returns empty set, not reachable)
    inode 261 offset 12288 root 5
    inode 261 offset 16384 root 5  \
    inode 261 offset 20480 root 5  |
    inode 261 offset 24576 root 5  |
    inode 261 offset 28672 root 5  |
    inode 261 offset 32768 root 5  |
    inode 261 offset 36864 root 5  \
    inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
    inode 261 offset 45056 root 5  /  More processing required in userspace
    inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
    inode 261 offset 53248 root 5  |
    inode 261 offset 57344 root 5  |
    inode 261 offset 61440 root 5  |
    inode 261 offset 65536 root 5  |
    inode 261 offset 69632 root 5  /
In the worst case the extents are 128MB long, and we have to do 32768 iterations of the loop to find one 4K extent ref.
With the patch, we just use one call to map all refs to the extent at once:
    root@tester:~# btrfs ins log 1103101952 -P /test/
    Using LOGICAL_INO_V2
    inode 261 offset 0 root 5
    inode 261 offset 12288 root 5
The TREE_SEARCH ioctl allows userspace to retrieve the offset and extent bytenr fields easily once the root, inode and offset are known. This is sufficient information to build a complete map of the extent and all of its references. Userspace can use this information to make better choices to dedup or defrag.
Signed-off-by: Zygo Blaxell <[email protected]> Reviewed-by: Hans van Kranenburg <[email protected]> Tested-by: Hans van Kranenburg <[email protected]> [ copy background and motivation from cover letter ] Signed-off-by: David Sterba <[email protected]>
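The difference between the two behaviors can be modeled with the extent refs from the example above. The sketch below is a plain Python model, not the ioctl itself: `ExtentRef`, `logical_ino` and the `ignore_offset` keyword are hypothetical names chosen for illustration, and the data is copied from the btrfs-debug-tree output in the message.

```python
from dataclasses import dataclass

BLOCK = 4096
EXTENT_BYTENR = 1103101952
EXTENT_LEN = 73728


@dataclass(frozen=True)
class ExtentRef:
    inode: int
    file_offset: int    # key offset of the EXTENT_DATA item
    extent_offset: int  # "extent data offset" inside the extent
    length: int         # "nr" bytes covered by this ref
    root: int


# The two refs to extent 1103101952 shown in the example.
REFS = [
    ExtentRef(261, 0, 0, 8192, 5),
    ExtentRef(261, 12288, 12288, 61440, 5),
]


def logical_ino(logical, ignore_offset=False):
    """Model of the lookup: return (inode, file offset, root) tuples."""
    want = logical - EXTENT_BYTENR
    found = []
    for ref in REFS:
        if ignore_offset:
            # No filtering: every ref to the extent is reported once.
            found.append((ref.inode, ref.file_offset, ref.root))
        elif ref.extent_offset <= want < ref.extent_offset + ref.length:
            # Filtering: only refs that contain the requested block.
            found.append((ref.inode, ref.file_offset + want - ref.extent_offset, ref.root))
    return found


# Old approach: one call per 4K block of the extent (18 calls here).
v1 = [logical_ino(EXTENT_BYTENR + x * BLOCK) for x in range(EXTENT_LEN // BLOCK)]
# New approach: a single call with offset filtering disabled.
v2 = logical_ino(EXTENT_BYTENR, ignore_offset=True)
```

In this model the per-block loop returns 17 non-empty answers that userspace still has to deduplicate, while the single unfiltered call returns the two refs directly.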

Build-server workloads have hundreds of references per file after dedup. Multiply by a few snapshots and we quickly exhaust the limit of 2730 references per extent that can fit into a 64K buffer. Raise the limit to 16M to be consistent with other btrfs ioctls (e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME). To minimize surprising userspace behavior, apply this change only to the LOGICAL_INO_V2 ioctl. Signed-off-by: Zygo Blaxell <[email protected]> Reviewed-by: Hans van Kranenburg <[email protected]> Tested-by: Hans van Kranenburg <[email protected]> Signed-off-by: David Sterba <[email protected]>
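The 2730 figure can be checked with a little arithmetic. The sketch below assumes a result buffer with a 16-byte header (four 32-bit bookkeeping fields) followed by three 64-bit values per reference (inode, offset, root); that layout is an assumption made here because it reproduces the number quoted in the message, and the constant names are hypothetical.

```python
HEADER_BYTES = 4 * 4  # assumption: four u32 bookkeeping fields
REF_BYTES = 3 * 8     # assumption: inode, offset and root, each a u64


def max_refs(buffer_bytes: int) -> int:
    """Number of whole references that fit after the header."""
    return (buffer_bytes - HEADER_BYTES) // REF_BYTES


old_limit = max_refs(64 * 1024)         # 64K buffer
new_limit = max_refs(16 * 1024 * 1024)  # 16M buffer
```

Under the same assumptions, a 16M buffer holds 699050 references.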

Right now we jump through a lot of weird hoops around outstanding_extents in order to keep the extent count consistent. This is because we logically transfer the outstanding_extent count from the initial reservation through the set_delalloc_bits. This makes it pretty difficult to get a handle on how and when we need to mess with outstanding_extents.
Fix this by revamping the rules of how we deal with outstanding_extents. Now instead everybody that is holding on to a delalloc extent is required to increase the outstanding extents count for itself. This means we'll have something like this
  btrfs_delalloc_reserve_metadata   - outstanding_extents = 1
   btrfs_set_extent_delalloc        - outstanding_extents = 2
  btrfs_delalloc_release_extents    - outstanding_extents = 1
for an initial file write. Now take the append write where we extend an existing delalloc range but still under the maximum extent size
  btrfs_delalloc_reserve_metadata   - outstanding_extents = 2
    btrfs_set_extent_delalloc
      btrfs_set_bit_hook            - outstanding_extents = 3
      btrfs_merge_extent_hook       - outstanding_extents = 2
  btrfs_delalloc_release_extents    - outstanding_extents = 1
In order to make the ordered extent transition we of course must now make ordered extents carry their own outstanding_extent reservation, so for cow_file_range we end up with
  btrfs_add_ordered_extent          - outstanding_extents = 2
  clear_extent_bit                  - outstanding_extents = 1
  btrfs_remove_ordered_extent       - outstanding_extents = 0
This makes all manipulations of outstanding_extents much more explicit. Every successful call to btrfs_delalloc_reserve_metadata _must_ now be combined with btrfs_delalloc_release_extents, even in the error case, as that is the only function that actually modifies the outstanding_extents counter.
The drawback to this is now we are much more likely to have transient cases where outstanding_extents is much larger than it actually should be. This could happen before as we manipulated the delalloc bits, but now it happens basically at every write. This may put more pressure on the ENOSPC flushing code, but I think making this code simpler is worth the cost. I have another change coming to mitigate this side-effect somewhat.
I also added trace points for the counter manipulation. These were used by a bpf script I wrote to help track down leak issues.
Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
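The new rule, where every holder of a delalloc extent takes and drops its own count, can be replayed with a toy counter. This is a Python model for illustration only; `InodeCounter` and `mod` are hypothetical names, and the step labels are the function names quoted in the message.

```python
class InodeCounter:
    """Toy model of one inode's outstanding_extents counter."""

    def __init__(self):
        self.outstanding_extents = 0
        self.trace = []  # (step name, counter value after the step)

    def mod(self, who, delta):
        self.outstanding_extents += delta
        # The counter must never go negative if every holder is balanced.
        assert self.outstanding_extents >= 0
        self.trace.append((who, self.outstanding_extents))


inode = InodeCounter()

# Initial buffered write: the reservation and the delalloc range each hold one.
inode.mod("btrfs_delalloc_reserve_metadata", +1)
inode.mod("btrfs_set_extent_delalloc", +1)
inode.mod("btrfs_delalloc_release_extents", -1)

# Writeback: the ordered extent takes its own count before delalloc drops its.
inode.mod("btrfs_add_ordered_extent", +1)
inode.mod("clear_extent_bit", -1)
inode.mod("btrfs_remove_ordered_extent", -1)

values = [value for _, value in inode.trace]
```

The recorded values follow the 1, 2, 1 and 2, 1, 0 sequences from the message, and the counter returns to zero once every holder has released its count.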

This is handy for tracing problems with modifying the outstanding extents counters. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

The way we handle delalloc metadata reservations has gotten progressively more complicated over the years. There is so much cruft and weirdness around keeping the reserved count and outstanding counters consistent and handling the error cases that it's impossible to understand. Fix this by making the delalloc block rsv per-inode. This way we can calculate the actual size of the outstanding metadata reservations every time we make a change, and then reserve the delta based on that amount. This greatly simplifies the code everywhere, and makes the error handling in btrfs_delalloc_reserve_metadata far less terrifying. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

Make it more consistent, we want the inserted ref to be compared against what's already in there. This will make the order go from lowest seq -> highest seq, which will make us more likely to make forward progress if there's a seqlock currently held. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

Instead of open-coding the delayed ref comparisons, add a helper to do the comparisons generically and use that everywhere. We compare sequence numbers last for following patches. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

If we get a significant amount of delayed refs for a single block (think modifying multiple snapshots) we can end up spending an ungodly amount of time looping through all of the entries trying to see if they can be merged. This is because we only add them to a list, so we have O(2n) for every ref head. This doesn't make any sense as we likely have refs for different roots, and so they cannot be merged. Tracking in a tree will allow us to break as soon as we hit an entry that doesn't match, making our worst case O(n). With this we can also merge entries more easily. Before we had to hope that matching refs were on the ends of our list, but with the tree we can search down to exact matches and merge them at insert time. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>
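Merging at insert time in an ordered structure can be sketched as follows. This is a simplified Python model, not the kernel implementation: a sorted list with `bisect` stands in for the tree, the key layout (type, root, objectid, offset, seq, with the sequence number compared last) and the signed `mod` count are assumptions made for illustration, and `RefTree` is a hypothetical name.

```python
import bisect


class RefTree:
    """Sorted list standing in for the per-head tree of delayed refs."""

    def __init__(self):
        self.keys = []  # comparison keys, kept sorted
        self.mods = []  # signed reference count change for each key

    def insert(self, key, mod):
        # Binary search finds either an exact match or the insert position.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            # Exact match: merge instead of adding a second entry.
            self.mods[i] += mod
            if self.mods[i] == 0:
                # An add and a drop cancelled out, so remove the entry.
                del self.keys[i]
                del self.mods[i]
            return
        self.keys.insert(i, key)
        self.mods.insert(i, mod)


tree = RefTree()
tree.insert(("data", 5, 261, 0, 1), +1)
tree.insert(("data", 6, 261, 0, 1), +1)
tree.insert(("data", 5, 261, 0, 1), +1)  # merges with the first entry
tree.insert(("data", 6, 261, 0, 1), -1)  # cancels the second entry
```

A Python list still shifts elements on insert, so only the lookup is logarithmic here; a balanced tree avoids that cost as well.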

We're holding the sb_start_intwrite lock at this point, and doing async filemap_flush of the inodes will result in a deadlock if we freeze the fs during this operation. This is because we could do a btrfs_join_transaction() in the thread we are waiting on which would block at sb_start_intwrite, and thus deadlock. Using writeback_inodes_sb() sidesteps the problem by not introducing all of these extra locking dependencies. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

Since we do a delalloc reserve in btrfs_truncate_block we can deadlock with freeze. If somebody else is trying to allocate metadata for this inode and it gets stuck in start_delalloc_inodes because of freeze we will deadlock. Be safe and move this outside of a trans handle. This also has a side-effect of making sure that we're not leaving stale data behind in the other_encoding or encryption case. Not an issue now since nobody uses it, but it would be a problem in the future. Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]>

Compression heuristic itself is not a compression type, as current infrastructure provides workspaces for several compression types, it's difficult to just add heuristic workspace. Just refactor the code to support compression/heuristic workspaces with maximum code sharing and minimum changes in it. Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: David Sterba <[email protected]> [ coding style fixes ] Signed-off-by: David Sterba <[email protected]>

Add basic defines and structures for data sampling.
Added macros:
  - For future sampling algo
  - For bucket size
Heuristic workspace:
  - Add bucket for storing byte type counters
  - Add sample array for storing partial copy of input data range
  - Add counter to store current sample size to workspace
Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor coding style fixes, comments updated ] Signed-off-by: David Sterba <[email protected]>

Copy sample data from the input data range to sample buffer then calculate byte value count for that sample into bucket. Signed-off-by: Timofey Titovets <[email protected]> [ minor comment updates ] Signed-off-by: David Sterba <[email protected]>

Walk over data sample and use memcmp to detect repeated patterns, like zeros, but a bit more general. Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: David Sterba <[email protected]> [ minor coding style fixes ] Signed-off-by: David Sterba <[email protected]>
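One simple way to detect such repetition is to compare the first half of the sample with the second half, which is the kind of check a single memcmp can do. The Python sketch below assumes that approach; the function name is chosen for illustration and the exact comparison the patch performs may differ.

```python
def sample_repeated_patterns(sample: bytes) -> bool:
    """Return True if the two halves of the sample are byte-identical.

    This catches all-zero data as well as any pattern whose period
    divides half of the sample length.
    """
    half = len(sample) // 2
    return half > 0 and sample[:half] == sample[half:2 * half]


zeros = sample_repeated_patterns(bytes(4096))
pattern = sample_repeated_patterns(b"abcd" * 1024)
ramp = sample_repeated_patterns(bytes(range(256)))
```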

Calculate byte set size for data sample:
  - calculate how many unique bytes have been in the sample
  - for all bytes count > 0, check if we're still in the low count range (~25%), such data are easily compressible, otherwise further analysis is needed
Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update comments ] Signed-off-by: David Sterba <[email protected]>
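The byte set check can be sketched in a few lines of Python. The threshold of 64 is an assumption derived from "~25%" of the 256 possible byte values, and both function names are hypothetical.

```python
# Assumption: "~25%" of the 256 possible byte values.
BYTE_SET_THRESHOLD = 256 * 25 // 100


def byte_set_size(sample: bytes) -> int:
    """Count how many distinct byte values occur in the sample."""
    return len(set(sample))


def small_byte_set(sample: bytes) -> bool:
    """True when the sample uses few enough distinct bytes to look compressible."""
    return byte_set_size(sample) < BYTE_SET_THRESHOLD


text_like = small_byte_set(b"aaaabbbbccccdddd" * 64)
all_values = small_byte_set(bytes(range(256)) * 4)
```

A sample that fails this check is not declared incompressible; it only means the later, more expensive checks are needed.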

Calculate byte core set for data sample: - sort buckets' numbers in decreasing order - count how many values cover 90% of the sample If the core set size is low (<=25%), data are easily compressible. If the core set size is high (>=80%), data are not compressible. Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update comments ] Signed-off-by: David Sterba <[email protected]>
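The core set calculation can be illustrated with a short Python sketch. The 90% coverage and the 25% and 80% limits come from the message; expressing those limits as 64 and 204 out of 256 byte values, and the function names, are assumptions made for this illustration.

```python
from collections import Counter

# Assumption: limits expressed as a share of the 256 possible byte values.
CORE_SET_LOW = 256 * 25 // 100   # 64
CORE_SET_HIGH = 256 * 80 // 100  # 204


def byte_core_set_size(sample: bytes, coverage_percent: int = 90) -> int:
    """Count how many of the most frequent byte values cover the sample."""
    counts = sorted(Counter(sample).values(), reverse=True)
    needed = len(sample) * coverage_percent // 100
    covered = 0
    for n, count in enumerate(counts, start=1):
        covered += count
        if covered >= needed:
            return n
    return len(counts)


def classify(sample: bytes) -> str:
    size = byte_core_set_size(sample)
    if size <= CORE_SET_LOW:
        return "compressible"
    if size >= CORE_SET_HIGH:
        return "incompressible"
    return "undecided"


skewed = bytes(900) + bytes(range(1, 101))  # one value dominates
uniform = bytes(range(256)) * 16            # every value equally common
```

Samples that land between the two limits stay undecided and are left to a further check.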

Byte distribution check in heuristic will filter edge data cases and sometimes fail to classify input data. Let's fix that by adding Shannon entropy calculation, that will cover classification of most other data types. As Shannon entropy needs log2 with some precision to work, let's use ilog2(N) and, for increased precision, do ilog2(pow(N, 4)). Shannon entropy has been slightly changed to avoid signed numbers and division. The calculation is done directly by the formula, replacing a precalculated table or chains of if-else. The accuracy errors of ilog2 are compensated by
  @ENTROPY_LVL_ACEPTABLE 70 -> 65
  @ENTROPY_LVL_HIGH 85 -> 80
Signed-off-by: Timofey Titovets <[email protected]> Reviewed-by: David Sterba <[email protected]> [ update comments ] Signed-off-by: David Sterba <[email protected]>
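The integer-only entropy estimate can be sketched in Python as below. It follows the idea in the message (an integer log2 applied to the fourth power for extra precision, no signed values, a single division at the end), but the function names and the exact arrangement of the arithmetic are assumptions of this sketch rather than a copy of the patch.

```python
from collections import Counter


def ilog2(n: int) -> int:
    """Integer floor(log2(n)) for n >= 1."""
    return n.bit_length() - 1


def ilog2_w(n: int) -> int:
    """ilog2 of n**4, which keeps two extra bits of precision."""
    return ilog2(n ** 4)


def shannon_entropy_percent(sample: bytes) -> int:
    """Estimate entropy as a percentage of the 8 bits per byte maximum."""
    size = len(sample)
    if size == 0:
        return 0
    entropy_max = 8 * ilog2_w(2)
    size_base = ilog2_w(size)
    total = 0
    for count in Counter(sample).values():
        # count * (log2(size) - log2(count)) is never negative.
        total += count * (size_base - ilog2_w(count))
    return (total // size) * 100 // entropy_max


constant = shannon_entropy_percent(bytes(4096))
two_values = shannon_entropy_percent(bytes(2048) + b"\x01" * 2048)
uniform = shannon_entropy_percent(bytes(range(256)) * 16)
```

With the thresholds quoted in the message, a result of 65 or lower would count as acceptable for compression and 80 or higher as high entropy.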

Fix bug of commit 74d4699 ("block: replace bi_bdev with a gendisk pointer and partitions index"). bio_dev(bio) is used to find the dev state in function __btrfsic_submit_bio. But when dev_state is added to the hashtable, it is using dev_t of block_device. bio_dev(bio) returns a dev_t of part0 which is different from dev_t in block_device(bd_dev). bd_dev in block_device represents the exact partition. block_device.bd_dev = bio->bi_partno (same as block_device.bd_partno) + bio_dev(bio). When adding a dev_state into hashtable, we use the exact partition dev_t. So when looking it up, it should also use the exact partition dev_t. Reproducer of this bug: Use MOUNT_OPTIONS="-o check_int" and run btrfs/001 in fstests. Then there will be a WARNING like below. WARNING: btrfs: attempt to write superblock which references block M @29523968 (sda7 /1111654400/2) which is never written! Signed-off-by: Gu JinXiang <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

Compression code path has only flagged bios with REQ_OP_WRITE no matter where the bios come from, but it could be a sync write if fsync starts this writeback or a normal writeback write if wb kthread starts a periodic writeback. It breaks the rule that sync writes and writeback writes need to be differentiated from each other, because from the POV of block layer, all bios need to be recognized by these flags in order to do some management, e.g. throttling. This passes writeback_control to compression write path so that it can send bios with proper flags to block layer. Signed-off-by: Liu Bo <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

btrfs_rm_dev_item calls several functions under an active transaction, however it fails to abort it if an error happens. Fix this by adding explicit btrfs_abort_transaction/btrfs_end_transaction calls. Signed-off-by: Nikolay Borisov <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>

This fixes potential bio leaks in several error paths. Unfortunately the device structure freeing is opencoded in many places and I missed them when introducing the flush_bio. Most of the time, devices get freed through call_rcu(..., free_device), so at least it's not that easy to hit the leak, but it's still possible through the path that frees stale devices. Fixes: e0ae999 ("btrfs: preallocate device flush bio") Reviewed-by: Nikolay Borisov <[email protected]> Reviewed-by: Anand Jain <[email protected]> Signed-off-by: David Sterba <[email protected]>

The dev_alloc_list list could be protected by various mutexes, depending on the context. The list tracks devices that can take part in allocating new chunks, so the closest mutex is chunk_mutex. Adding a new device from inside the ADD_DEV ioctl will need device_list_mutex and registering a new device from the ioctl needs uuid_mutex. All mutexes naturally guarantee exclusivity against the same context. The device ownership can move between the contexts and the exclusivity is guaranteed by other means, eg. during the mount with the uuid_mutex. There's no RCU involved for dev_alloc_list. Signed-off-by: David Sterba <[email protected]>

If a file's DIR_ITEM key is invalid (due to memory errors) and gets written to disk, a future lookup_path can end up with a kernel panic due to BUG_ON(). This gets rid of the BUG_ON(), outputs the corrupted key and returns ENOENT if it's invalid. Signed-off-by: Liu Bo <[email protected]> Reported-by: Guillaume Bouchard <[email protected]> Signed-off-by: David Sterba <[email protected]>

Move the definition of the function btrfs_find_new_delalloc_bytes() closer to the function btrfs_dirty_pages(), because in a future commit it will be used exclusively by btrfs_dirty_pages(). This just moves the function's definition, with no functional changes at all. Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>

The patch from commit a7e3b97 ("Btrfs: fix reported number of inode blocks") introduced a regression where if we do a buffered write starting at a position equal to or greater than the file's size and then stat(2) the file before writeback is triggered, the number of used blocks does not change (unless there's a prealloc/unwritten extent). Example:
  $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
  $ du -h foobar
  0	foobar
  $ sync
  $ du -h foobar
  64K	foobar
The first version of that patch didn't have this regression and the second version, which was the one committed, was made only to address some performance regression detected by the intel test robots using fs_mark.
This fixes the regression by setting the new delalloc bit in the range, and doing it at btrfs_dirty_pages() while setting the regular delalloc bit as well, so that we set both bits at once and avoid navigating the inode's io tree twice. Doing it at btrfs_dirty_pages() is also the most meaningful place, as we should set the new delalloc bit when we set the delalloc bit, which happens only if we copied bytes into the pages at __btrfs_buffered_write().
This was making some of LTP's du tests fail, which can be quickly run using a command line like the following:
  $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt
Fixes: a7e3b97 ("Btrfs: fix reported number of inode blocks") Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]>


Commits on Oct 30, 2017

Commits on Nov 1, 2017

Commits on Nov 15, 2017