
Fix generation of non-standard "scan" in object directory. #2097

Closed
wants to merge 2 commits

Conversation

dweeezil
Contributor

@dweeezil dweeezil commented Feb 2, 2014

Commit 1421c89 expanded the size of a zbookmark_t from 24 to 25 64-bit
values which similarly expands the size of the "scan" entry in the pool's
object directory and causes the pool to become un-importable by other
OpenZFS implementations.

This commit replaces the zbookmark_t member of dsl_scan_phys_t with 4
new members representing the contents of the zbookmark_t which insulates
the physical layout from changes to that structure.

NOTE: The same could be done with the ddt_bookmark_t in dsl_scan_phys_t.

Add a compatibility shim to dsl_scan_init() to support importing broken
pools that have the larger-than-normal "scan" entry.

Fixes #2094
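
For reference, the layout change described above amounts to roughly the following (a sketch rather than the actual diff; the scn_bookmark_* member names are assumptions, and the 19 preceding 64-bit members of dsl_scan_phys_t are elided):

#include <stdint.h>

typedef struct dsl_scan_phys_tail_sketch {
        /* ...19 preceding 64-bit members (counters, ddt_bookmark_t, etc.) elided... */
        uint64_t scn_bookmark_objset;   /* was: zbookmark_t scn_bookmark */
        uint64_t scn_bookmark_object;
        uint64_t scn_bookmark_level;
        uint64_t scn_bookmark_blkid;
        uint64_t scn_flags;             /* unchanged; still the last member */
} dsl_scan_phys_tail_sketch_t;

Flattening the bookmark this way pins the on-disk "scan" entry at exactly 24 64-bit values regardless of what the in-core zbookmark_t looks like.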

@dweeezil
Contributor Author

dweeezil commented Feb 2, 2014

This patch fixes the underlying problem, but does so at the cost of a major divergence in source compatibility with other OpenZFS implementations.

The main issue remaining is the proper mechanism for fixing broken pools (pools that have been resilvered/scrubbed with the > 1421c89 code). This patch will allow such pools to be imported, and then they will be automatically "fixed" by running a scan (resilver/scrub) on them (which likely doesn't even have to be allowed to finish). An argument could be made that we should come up with a way of automatically fixing broken pools or, at least, display some sort of message when such a pool is imported. This issue is certainly worth further discussion.

dweeezil referenced this pull request in dweeezil/zfs Feb 2, 2014
Commit 1421c89 expanded the size of a zbookmark_t from 24 to 25 integers
which causes any pool that has been resilvered subsequent to the patch
to have a "scan" entry in the object directory with 25 integers and to
become un-importable by ZFS implementations expecting only 24 integers.

This patch is a work-in-progress first cut at a fix.  It does two things:
first, it creates a new structure, zbookmark_phys_t which represents the
desired on-disk format of the bookmark within the directory's "scan"
element.  Second, it relaxes the size check during import to allow an
EOVERFLOW which would otherwise cause the import to fail.

A "broken" pool can be fixed by importing and then re-resilvering
(scrub, etc.).
@dweeezil
Contributor Author

dweeezil commented Feb 2, 2014

I figured I'd add this little testing script to check whether your pool is broken:

zdb -dddd tank 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Replace "tank" with your pool name. It should print 24. Broken pools will print 25.

EDIT: I just fixed one of my important pools with this patch. Imported it, started a scrub and then cancelled the scrub. The newly-written "scan" had the proper 24 integers.

EDIT2: the grep above should really be a bit tighter so I've done that in this comment.

@dweeezil
Contributor Author

dweeezil commented Feb 3, 2014

On 02/02/2014 11:49 PM, P.SCH wrote:

I tested your zdb line with two zpools; one reported "110" and the other "113". ????

There's a chance the "grep" is finding more than it should. We want to look at the line beginning with 2 tabs followed by the string "scan = ". I've updated the grep in #2097 to reflect that. If you still get large numbers, I suppose you should remove the "wc" pipe and see what's actually being displayed by the "zdb".

@ryao
Contributor

ryao commented Feb 3, 2014

@dweeezil I like the direction that you took with this. You have had much more time to think about this over the weekend than I did and I am still playing catch-up with you in that regard. However, I have the following initial thoughts:

  1. This should be split into two patches so that the refactoring can be merged into other Open ZFS implementations to ease code sharing without requiring them to accept the compatibility code. This would also make it easy for us to remove the compatibility hack several releases from now, when we feel confident that all of the affected pools have been repaired, should maintaining it become a problem.
  2. Future changes to the ZoL code could unexpectedly break this compatibility code. We need to add some assertions along the lines of ASSERT(SCAN_PHYS_NUMINTS * sizeof (uint64_t) == sizeof (old_zbookmark_t)); to make this easy to catch (see the sketch after this list).
  3. This code has the potential to be very damaging should a pool affected by a similar mistake in another Open ZFS platform be repaired by it. We need to do as much validation as we can in the compatibility code to minimize the potential for this scenario. Whether more validation is possible is something that needs thought, but doing less validation than is possible is a concern of mine.
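
On point 2, a couple of compile-time checks along these lines would make a future size change impossible to miss (a sketch using C11 static assertions rather than whatever macro the tree actually uses; old_zbookmark_t is a hypothetical stand-in for the pre-1421c89 layout):

#include <assert.h>
#include <stdint.h>

/* Stand-in for the real macro, which is derived from sizeof (dsl_scan_phys_t). */
#define SCAN_PHYS_NUMINTS 24

/* Hypothetical pre-1421c89 bookmark layout: four 64-bit words. */
typedef struct old_zbookmark {
        uint64_t zb_objset;
        uint64_t zb_object;
        int64_t  zb_level;
        uint64_t zb_blkid;
} old_zbookmark_t;

/* The compatibility code hard-codes both of these facts; fail the build if
 * either ever changes. */
static_assert(sizeof (old_zbookmark_t) == 4 * sizeof (uint64_t),
    "on-disk bookmark is no longer four 64-bit words");
static_assert(SCAN_PHYS_NUMINTS == 24,
    "standard \"scan\" entry is no longer 24 64-bit words");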

@pyavdr Cache effects are likely responsible for your observations, rather than the code in this pull request.

@@ -62,7 +62,13 @@
uint64_t scn_errors; /* scan I/O error count */
uint64_t scn_ddt_class_max;
ddt_bookmark_t scn_ddt_bookmark;
zbookmark_t scn_bookmark;

I think it would be better to make a clear definition between the on-disk zbookmark_t and the in-core zbookmark_t. Perhaps we could go with zbookmark_phys_t and zbookmark_core_t.
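
Something along these lines, purely for illustration (the zb_* field names and the type of the member added by 1421c89 are assumptions):

#include <stdint.h>

/* Exactly what gets written to the pool. */
typedef struct zbookmark_phys {
        uint64_t zb_objset;
        uint64_t zb_object;
        int64_t  zb_level;
        uint64_t zb_blkid;
} zbookmark_phys_t;

/* In-core only; never written to disk. */
typedef struct zbookmark_core {
        zbookmark_phys_t zb_phys;   /* kept first so on-disk writers can take &zb_phys */
        const void      *zb_func;   /* the pointer added by 1421c89 */
} zbookmark_core_t;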

@ryao
Contributor

ryao commented Feb 3, 2014

This patch assumes that the only on-disk data structure that includes zbookmark_t is dsl_scan_phys_t. bptree_entry_phys_t also includes it. It is used when asynchronous destruction of datasets, snapshots or zvols is being done and there is a chance someone will import a pool that was in the middle of one of those. If that is the case, things will not go well. Anyway, I am still playing catch-up, but these are my in-progress thoughts.
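
For context, the other consumer looks roughly like this (a sketch; the placeholder block-pointer type and the bte_* field names follow the description given later in this thread, not the source tree):

#include <stdint.h>

typedef struct blkptr_sketch {          /* stand-in for the 128-byte blkptr_t */
        uint64_t blk_words[16];
} blkptr_sketch_t;

typedef struct zbookmark_sketch {       /* widened bookmark as of 1421c89 */
        uint64_t    zb_objset;
        uint64_t    zb_object;
        int64_t     zb_level;
        uint64_t    zb_blkid;
        const void *zb_func;            /* the extra pointer that changes the on-disk size */
} zbookmark_sketch_t;

typedef struct bptree_entry_sketch {
        blkptr_sketch_t    bte_bp;
        uint64_t           bte_birth_txg;
        zbookmark_sketch_t bte_bookmark; /* every queued async-destroy entry is widened too */
} bptree_entry_sketch_t;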

@aarcane

aarcane commented Feb 3, 2014

I'm going to go out on a limb and propose that, while the damaged pools can and should be rescued, this repair code shouldn't ever make it into master. Anybody affected by this problem should be encouraged to apply the patch, scrub their pool, then return to the release or master branch.

Surely, this is a reasonable accommodation for people affected by a development-only, temporary bug. It also has the advantage of solving the long-term compatibility problem.

@dweeezil
Contributor Author

dweeezil commented Feb 3, 2014

@ryao I'll be following your notes throughout the day and will give the patch a refresh once you're done looking it over. I had been meaning to check whether the zbookmark_t was used elsewhere but I didn't notice it in my initial looks because the bptree_obj member normally isn't seen and I hadn't yet searched through the code. A quick glance at the code shows that the exact same problem would occur during an import with an async destroy in progress. Ugh.

I do think I'll go back to the _phys scheme for the bookmarks because that's clearly the right way to do things, especially given that they're used in more than one other _phys structure.

@FransUrbo
Contributor

I'd like to agree with @aarcane, but I didn't want to be the first to say it :).

@ryao
Contributor

ryao commented Feb 3, 2014

I am still analyzing this, but I believe some variation of this code can safely go into HEAD. That being said, let me provide a little write-up of my present understanding:

zbookmark_t is included in two structures that are written out to disk, specifically dsl_scan_phys_t and bptree_entry_phys_t. dsl_scan_phys_t describes information about the last scrub/resilver and is validated on every pool import, which is how @AndCycle and I happened to catch this independently of one another. dsl_scan_phys_t is a singleton, which means that only one of them exists at any given time, and it only changes on each transaction group commit. To be clear, there are actually at least two identical copies for each transaction group commit, but for the purpose of understanding what went wrong and how to fix it, the number of identical copies is not relevant. bptree_entry_phys_t is a structure that records information about an in-progress asynchronous destroy of any datasets, snapshots or zvols. It is rare that a pool with something like this is imported, which is why @dweeezil's preliminary testing suggested his patch resolved the problems.

@dweeezil's work on fixing this focused on dsl_scan_phys_t. Due to how the ZFS code is structured, the dsl_scan_phys_t is stored inside a ZAP object, which has a valid checksum. Until 1421c89, all Open ZFS implementations expected dsl_scan_phys_t to be 24 uint64_t variables. Pools scrubbed by ZoL with 1421c89 now store 25 uint64_t variables there. Importing a damaged pool on a normal ZFS implementation fails because the ZAP code reports an overflow when it is asked for 24 variables and sees 25. Amazingly, the reverse is not true: the ZAP code does not complain when asked for 25 variables when there are only 24 of them. In that case, I expect that the 25th value becomes 0, which is valid; note that I still need to confirm that. That is what allowed this to avoid detection by LLNL's regression tests. My own regression tests did catch this, but unfortunately I only run them when preparing a new zfs-kmod ebuild for Gentoo, so I was completely unaware until last Monday when I was preparing a new Gentoo patch set. I plan to discuss improvements to LLNL's regression test infrastructure with @behlendorf so that this never happens again.

As for what @dweeezil did, let's take a look at dsl_scan_phys_t. The last two members of dsl_scan_phys_t are scn_bookmark (zbookmark_t) and scn_flags (uint64_t). LLNL's change added a Linux kernel pointer at the end of zbookmark_t. This means that a pointer to Linux kernel code was inserted into the structure right before a 64-bit value. On 64-bit systems this pointer is 64 bits wide, while on 32-bit systems it is 32 bits wide. @dweeezil appears not to have realized the 32-bit case was possible, so his work focused on the 64-bit case, which is likely what affects most users anyway.

@dweeezil's patch attempts to address the 64-bit case by detecting the overflow error from the zap lookup and then trying again to see if the addition of 1 value is enough to make the zap lookup happy. If it is, he copies the 25th value to the place where it was supposed to be and returns normally. As I stated earlier, I think this needs more validation, and having done some analysis since saying that, I believe that it is possible to check for a specific fingerprint to verify that this damage was caused on ZoL. Specifically, the 24th and 25th values in this case have distinctive characteristics:

  1. The 24th value is a Linux kernel function pointer. Its value should be a function in the ZFS Linux kernel module, which falls into a very specific address range. Specifically, it should be a value for which the Linux kernel function is_vmalloc_or_module_addr returns 1.
  2. The 25th value, which is what should have been the 24th one, only uses the least significant bit. Specifically, it is always either 0 or 1. This might change in future ZFS implementations, but the affected pools will only have either a 0 or a 1 here.

The combination of these things is highly distinctive. More work needs to be done, but I am increasingly confident that it is possible to uniquely identify affected pools and repair them on the fly with little risk to normal pools.
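
Sketched out, the fingerprint test could take a shape like the following (the function name is hypothetical, and it would only run after a 24-value zap_lookup of the "scan" entry failed with EOVERFLOW while a 25-value lookup succeeded; is_vmalloc_or_module_addr() is the Linux kernel helper named above):

#include <linux/mm.h>       /* is_vmalloc_or_module_addr() */
#include <linux/types.h>

static int
scan_entry_has_zol_damage(const uint64_t *vals)
{
        /* vals[23] should be a pointer into module/vmalloc space (the stray zb_func). */
        if (!is_vmalloc_or_module_addr((const void *)(uintptr_t)vals[23]))
                return (0);
        /* vals[24] holds what should have been scn_flags; only bit 0 is ever used. */
        if (vals[24] > 1)
                return (0);
        return (1);
}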

…ctory.

Commit 1421c89 expanded the size of a zbookmark_t from 24 to 25 64-bit
values which similarly expands the size of the "scan" entry in the pool's
object directory and causes the pool to become un-importable by other
OpenZFS implementations.

This commit renames "struct zbookmark" to "struct zbookmark_phys" since
it is related to an on-disk format and adds a new "struct zbookmark" that
contains the former as its first member.  The effect is that the "struct
zbookmark" items written to the object directory in both the "scan" and
"bptree_obj" entries contain only the correct subset of the bookmark.

Fixes openzfs#2094
@dweeezil
Contributor Author

dweeezil commented Feb 3, 2014

@ryao et al. I've worked up a more proper patch which converts the existing zbookmark structure to zbookmark_phys and makes a new zbookmark structure containing the _phys variant. As you might imagine, this touches an awful lot of code (but has the benefit of not needing to pack/unpack the members and requires no other weird macros). I've tested it and it properly generates a 24-integer "scan" following a scrub, and I have every reason to believe that a "bptree_obj" entry will also be correctly generated. This commit does not include any "fix-up" attempts. I'll add those in a separate commit as previously suggested. The new commit is dweeezil/zfs@89a6e7c.

@dweeezil
Contributor Author

dweeezil commented Feb 3, 2014

Dealing with the bptree_obj entries will be a pain in the butt because the structs are simply blasted into an object and I'm wondering whether it's even worth trying. It would be nice to know how much uptake our "broken" interim releases have had. It would seem the odds are very low that someone would try to import a pool with an in-progress async destroy.

@ryao Just caught your note. I'm going to re-add the ZoL workaround patch as a separate commit in a minute. All I tested was to make sure it works properly on its own and creates a proper 24-integer scan structure.

…ectory.

Add a compatibility shim to dsl_scan_init() to support importing broken
pools that have the larger-than-normal "scan" entry.
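
For reference, such a shim in dsl_scan_init() takes roughly the following shape (a sketch, not the committed diff; zap_lookup(), DMU_POOL_SCAN, DMU_POOL_DIRECTORY_OBJECT and SCAN_PHYS_NUMINTS are existing ZFS identifiers, while the copy logic shown here is assumed):

/* Inside dsl_scan_init(): dp is the dsl_pool_t, scn is the dsl_scan_t being set up. */
err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
    DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS, &scn->scn_phys);
if (err == EOVERFLOW) {
        uint64_t zaptmp[SCAN_PHYS_NUMINTS + 1];

        /* Retry with room for the stray 25th word written by 1421c89-era code. */
        err = zap_lookup(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
            DMU_POOL_SCAN, sizeof (uint64_t), SCAN_PHYS_NUMINTS + 1, zaptmp);
        if (err == 0) {
                /* Words 0-22 are intact; word 23 is the stray pointer and word 24
                 * is the real scn_flags value, so drop word 23 when copying. */
                memcpy(&scn->scn_phys, zaptmp, 23 * sizeof (uint64_t));
                ((uint64_t *)&scn->scn_phys)[23] = zaptmp[24];
        }
}
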
@ryao
Contributor

ryao commented Feb 3, 2014

@dweeezil Disregard the previous comment that I made (and subsequently deleted). I misread your statement that this patch does not include any fix-up attempts.

That being said, a cursory reading of your patch does not reveal any problems and I really like the direction that you are taking with this. I will be reading it in greater detail shortly.

@ryao
Contributor

ryao commented Feb 3, 2014

@dweeezil Sadly, I was not fast enough at deleting that comment. Anyway, I agree with your assessment that the odds of users being affected by this regression when running an in-progress destroy are low. You are also a step ahead of me at studying the effects on bptree_obj entries. I need to examine them in greater depth before I agree that we should avoid handling that case altogether.

@ryao
Contributor

ryao commented Feb 4, 2014

@dweeezil I had a brief chat with @ahrens and @csiden about this in #openzfs on Freenode. @ahrens invited @behlendorf to join us. I expect there to be more discussion about this. In particular, @ahrens suggested that we need to answer the questions why these changes were made in the first place and why this information was put into zbookmark_t instead of zio_t. I agree that these questions need to be answered. Please join us if/when it is convenient.

@ryao
Contributor

ryao commented Feb 4, 2014

@dweeezil Brian, Matthew, and I had a chat on IRC. The solution that we seem to have settled on is to move zb_func out of zbookmark_t into zio_t, implement a workaround, and rename zbookmark_t to zbookmark_phys_t. Matthew Ahrens is going to handle the rename in Illumos so the change will be quickly propagated to other Open ZFS platforms, and we will pull it into ZoL from there. In the case of your workaround, you seem to have neglected to copy the fixed data into the buffer, so it was as if this didn't exist at all, which happened to work. Brian and Matthew would like to instead kick off a scrub. I would like to try to detect the damaged bptree when a 25-integer object is detected and fault with a message in zpool status providing a link to a webpage with details, rather than trying to fix the damaged bptree. This doesn't handle the case when a scrub has never been run on a pool that has an in-progress asynchronous destroy, but it is the best idea that I have at the moment. I will be spending more time on this tomorrow.
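
For illustration, the agreed direction amounts to something like this (only zb_func, zio_t and zbookmark_phys_t come from the discussion above; the remaining names are assumptions):

#include <stdint.h>

/* The bookmark goes back to exactly four 64-bit words on disk. */
typedef struct zbookmark_phys {
        uint64_t zb_objset;
        uint64_t zb_object;
        int64_t  zb_level;
        uint64_t zb_blkid;
} zbookmark_phys_t;

/* Stand-in for the real zio_t: the pointer added by 1421c89 moves here,
 * where it is in-core only and can never leak into the pool. */
typedef struct zio_sketch {
        /* ...other in-core I/O state elided... */
        zbookmark_phys_t io_bookmark;   /* stored in its on-disk shape */
        const void      *io_trace_func; /* hypothetical new home for zb_func */
} zio_sketch_t;

With the pointer living in the in-core zio, the bookmark can be written to the object directory verbatim again.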

@behlendorf
Contributor

I'll be in IRC this evening, but the takeaway is that this was added to the bookmark because it was convenient in the code. It wasn't fully appreciated that this broke the on-disk format, and none of the automated tests caught it. If this can be moved to the zio and the bookmark restored to its previous definition, I think that is preferable.

That just leaves us with the case of automatically detecting it and fixing it. Doing that with the smallest possible change is desirable. We don't want to carry a large change which makes our code differ from the OpenZFS code. And I think we want to ensure these pools are all fixed quickly so we don't have to carry the fix for too many tags.

@lundman
Contributor

lundman commented Feb 4, 2014

It is possible some of my users have had this issue related to scrubs as well, but mostly on test pools. So we would appreciate the fix-patch as well.

@dweeezil
Contributor Author

dweeezil commented Feb 4, 2014

Uncloaking from my downtime this afternoon...

I was kinda wondering, myself, why the zb_func wasn't in zio_t, but I suppose it wasn't plainly obvious that the zbookmark_t was actually a phys struct. In any case, it sounds like the Right Thing is going to happen; namely, that the phys split will happen upstream and then we can pull it in. I presume you all discussed the fact that the same issue potentially exists with ddt_bookmark_t as well? At the very least, it (ddt_bookmark_t) ought to have some commentary in the source indicating that it's effectively a phys struct. Maybe Matthew could make ddt_bookmark_t a phys struct as well?

@ryao As to dealing with the bptree thing: I really think it may not be worth the effort. The entry in the object directory is an object number that's simply stuffed with bptree_entry_phys_t's. That struct contains a blkptr_t, a 64-bit txg and the bookmark. The code in bptree_iterate() would have to be hacked pretty badly to deal with this. Just off the top of my head, it would, at least on its first iteration, need to read something that was sizeof(bte) + sizeof(ulong_t) and then perform some sort of magic heuristic to determine whether it needed to continue reading the larger structs or to revert to the normal, smaller ones. This really doesn't sound worth it to me, particularly given the extremely low odds of this affecting anybody. I suppose one trivial hack would be to generate some sort of error if the size of the object isn't a multiple of sizeof(bte).

I guess at this point I'll leave the patch as-is. It sounds like we should wait for Matt's work to hit Illumos and then get it pulled in ASAP. The real question is how to deal with fixing broken pools.

@pyavdr
Contributor

pyavdr commented Feb 5, 2014

@dweeezil
I tried to fix one of my production pools (I have a backup) with HEAD and these patches, but I didn't have success. After a short scrub, the output of your zdb line stays at 25. I tried to go back to 0.6.1, but the pool doesn't show up for import. With 0.6.2 I can import the pool again. Any ideas?

@ryao
Contributor

ryao commented Feb 5, 2014

@pyavdr 0.6.2 should not be able to import a pool with this problem. Anyway, do not apply any of the patches here to production pools. The proper solution will go into HEAD.

@ryao ryao mentioned this pull request Feb 5, 2014
@dweeezil
Contributor Author

dweeezil commented Feb 5, 2014

Closing in preference to @ryao's #2109.

@dweeezil dweeezil closed this Feb 5, 2014