-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fast Dedup: “flat” DDT entry format #15893
Conversation
[Fast dedup stack rebased to master 3c941d1] |
FYI, added one commit, that adds a birth time field to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My previous questions are still open, but they are minor, while otherwise this part looks good to me.
[Fast dedup stack rebased to master c98295e] |
05f309f
to
b5dfe43
Compare
I think this is good to go; I'm just waiting for confirmation internally that this is all we need for the on-disk format changes. |
0cb2a35
to
062633d
Compare
No format changes, but there is a good chunk of code change to make the split between traditional and flat entries a bit easier to work with. Earlier versions of this code assumed We believe this is the final on-disk format, and likely the final code change to support it. We've got some stress testing to do but that's it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some cosmetics:
2540fe6
to
ec7acc8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not very familiar with the DDT code, but I don't see any surface level issues. Please rebase on master though to pull in the latest Fedora ZTS fixes.
596597b
to
785f991
Compare
3852876
to
2a46268
Compare
This is the supporting infrastructure for the upcoming dedup features. Traditionally, dedup objects live directly in the MOS root. While their details vary (checksum, type and class), they are all the same "kind" of thing - a store of dedup entries. The new features are more varied than that, and are better thought of as a set of related stores for the overall state of a dedup table. This adds a new feature flag, SPA_FEATURE_FAST_DEDUP. Enabling this will cause new DDTs to be created as a ZAP in the MOS root, named DDT-<checksum>. The is used as the root object for the normal type/class store objects, but will also be a place for any storage required by new features. This commit adds two new fields to ddt_t, for version and flags. These are intended to describe the structure and features of the overall dedup table, and are stored as-is in the DDT root. In this commit, flags are always zero, but the intent is that they can be used to hang optional logic or state onto for new dedup features. Version is always 1. For a "legacy" dedup table, where no DDT root directory exists, the version will be 0. ddt_configure() is expected to determine the version and flags features currently in operation based on whether or not the fast_dedup feature is enabled, and from what's available on disk. In this way, its possible to support both old and new tables. This also provides a migration path. A legacy setup can be upgraded to FDT by creating the DDT root ZAP, moving the existing objects into it, and setting version and flags appropriately. There's no support for that here, but it would be straightforward to add later and allows the possibility that newer features could be applied to existing dedup tables. Co-authored-by: Allan Jude <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Very basic coverage to make sure things appear to work, have the right format on disk, and pool upgrades and mixed table types work as expected. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
The upcoming dedup features break the long held assumption that all blocks on disk with a 'D' dedup bit will always be present in the DDT, or will have the same set of DVA allocations on disk as in the DDT. If the DDT is no longer a complete picture of all the dedup blocks that will be and should be on disk, then it does us no good to walk and prime it up front, since it won't necessarily match up with every block we'll see anyway. Instead, we rework things here to be more like the BRT checks. When we see a dedup'd block, we look it up in the DDT, consume a refcount, and for the second-or-later instances, count them as duplicates. The DDT and BRT are moved ahead of the space accounting. This will become important for the "flat" feature, which may need to count a modified version of the block. Co-authored-by: Allan Jude <[email protected]> Co-authored-by: Don Brady <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed. The variable allocation for dde_phys is to support the upcoming flat format. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc.
The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed. The variable allocation for dde_phys is to support the upcoming flat format. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes #15893
The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed. The variable allocation for dde_phys is to support the upcoming flat format. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed. The variable allocation for dde_phys is to support the upcoming flat format. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
The "flat phys" feature will use only a single phys slot for all entries, which means the old "single", "double" etc naming now makes no sense, and more importantly, means that choosing the right slot for a given block pointer will depend on how many slots are in use for a given DDT. This removes the old names, and adds accessor macros to decouple specific phys array indexes from any particular meaning. (These macros look strange in isolation, mainly in the way they take the ddt_t* as an arg but don't use it. This is mostly a separate commit to introduce the concept to the reader before the "flat phys" commit extends it). Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
The idea here is that sometimes you need the contents of an entry with no intent to modify it, and/or from a place where its difficult to get hold of its originating ddt_t to know how to interpret it. A lightweight entry contains everything you might need to "read" an entry - its key, type and phys contents - but none of the extras for modifying it or using it in a larger context. It also has the full complement of phys slots, so it can represent any kind of dedup entry without having to know the specific configuration of the table it came from. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
This slims down the in-memory entry to as small as it can be. The IO-related parts are made into a separate entry, since they're relatively rarely needed. The variable allocation for dde_phys is to support the upcoming flat format. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
Traditional dedup keeps a separate ddt_phys_t "type" for each possible count of DVAs (that is, copies=) parameter. Each of these are tracked independently of each other, and have their own set of DVAs. This leads to an (admittedly rare) situation where you can create as many as six copies of the data, by changing the copies= parameter between copying. This is both a waste of storage on disk, but also a waste of space in the stored DDT entries, since there never needs to be more than three DVAs to handle all possible values of copies=. This commit adds a new FDT feature, DDT_FLAG_FLAT. When active, only the first ddt_phys_t is used. Each time a block is written with the dedup bit set, this single phys is checked to see if it has enough DVAs to fulfill the request. If it does, the block is filled with the saved DVAs as normal. If not, an adjusted write is issued to create as many extra copies as are needed to fulfill the request, which are then saved into the entry too. Because a single phys is no longer an all-or-nothing, but can be transitioning from fewer to more DVAs, the write path now has to keep a copy of the previous "known good" DVA set so we can revert to it in case an error occurs. zio_ddt_write() has been restructured and heavily commented to make it much easier to see what's happening. Backwards compatibility is maintained simply by allocating four ddt_phys_t when the DDT_FLAG_FLAT flag is not set, and updating the phys selection macros to check the flag. In the old arrangement, each number of copies gets a whole phys, so it will always have either zero or all necessary DVAs filled, with no in-between, so the old behaviour naturally falls out of the new code. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Co-authored-by: Don Brady <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: iXsystems, Inc. Closes openzfs#15893
Motivation and Context
The on-disk and in-memory dedup entry structures are larger than they need to be. Any reduction we can make reduces memory and IO overheads for every entry, which in large dedup tables can be huge.
Description
This slims down the in-memory
ddt_entry_t
, partly by reorganizing the structure and using narrower types, and partly by moving rarely-used parts out.This then adds a new variant of the entry format. The traditional format keeps a complete set of 4x DVAs for each possible value of
copies=
(plus one for the deprecated ditto blocks feature), which makes the in-memory and on-disk entry mostly empty, which is significant wasted overhead. This adds a new “flat” format which only has a single set of DVAs, but can “extend” them if a write requests more (eg writing a block withcopies=1
, settingcopies=2
, then copying the block).How Has This Been Tested?
Types of changes
Checklist:
Signed-off-by
.