Correctly Backup/Restore Hard Links #183
Yupp. Note that it can optionally generate hardlinks when extracting if this is requested (similar to how it normally generates reflinks when it can). Hence if hardlink support is added, care needs to be taken that either hardlinks are included in the image or they are used as an optimization technique when extracting, but not both. Including hardlinks in the image is not entirely trivial. That's because casync's serialization format puts an emphasis on being "composable", i.e. that the serialization of a directory is strictly the concatenation of everything included plus metadata about itself. But that means a hardlink cannot be encoded inside each individual dir, but must be encoded in the closest enclosing directory, i.e. the common parent of each pair of hardlinks pointing to the same inode. Moreover, each occurrence of the same inode needs to be serialized with full payload, so that things remain entirely composable. This sounds wasteful, but is actually not that bad, given that the usual chunking logic should recognize this anyway, and hence dedup it again.
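To make the "closest enclosing directory" constraint concrete, here is a minimal C sketch, not casync code (the helper name is hypothetical), of finding the common parent of two paths that refer to the same inode. That directory is the innermost one whose serialization could carry the hardlink record without breaking the composability of its subdirectories:

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not casync API: return the closest enclosing
 * directory of two absolute paths, i.e. the innermost directory whose
 * serialization could carry a hardlink record covering both of them.
 * Caller frees the result. */
static char *common_parent(const char *a, const char *b) {
        size_t last_slash = 0;

        for (size_t i = 0; a[i] && b[i] && a[i] == b[i]; i++)
                if (a[i] == '/')
                        last_slash = i;

        /* Both paths are absolute, so at worst they share the root. */
        return strndup(a, last_slash == 0 ? 1 : last_slash);
}
```

For example, `common_parent("/foo/a/file1", "/foo/b/file2")` yields `/foo`: per the comment above, the hardlink information must live in /foo's serialization, while both file entries still carry their full payload.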
I can't see why this property is a hard requirement? I would simply store hardlinks like symlinks, using an extra 'HARDLINK' tag. This is easy to implement, and I cannot see any real drawbacks?
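For illustration, a minimal sketch of that simpler 'HARDLINK' tag scheme (assumed names; the printf calls stand in for a real serializer, a linear scan stands in for a proper hash table keyed on dev_t+ino_t, and bounds checks are omitted):

```c
#include <stdio.h>
#include <sys/stat.h>

/* Sketch of the proposed non-composable scheme: the first time an inode
 * with st_nlink > 1 is seen, serialize it as a regular file and remember
 * its path; every later occurrence becomes a HARDLINK record pointing at
 * that first path. */

struct seen {
        dev_t dev;
        ino_t ino;
        char path[4096];
};

static struct seen seen_table[1024];
static size_t n_seen;

void serialize_file(const char *path) {
        struct stat st;

        if (lstat(path, &st) < 0 || !S_ISREG(st.st_mode))
                return;

        if (st.st_nlink > 1) {
                for (size_t i = 0; i < n_seen; i++)
                        if (seen_table[i].dev == st.st_dev &&
                            seen_table[i].ino == st.st_ino) {
                                /* Second and later occurrences: no payload. */
                                printf("HARDLINK %s -> %s\n",
                                       path, seen_table[i].path);
                                return;
                        }

                seen_table[n_seen].dev = st.st_dev;
                seen_table[n_seen].ino = st.st_ino;
                snprintf(seen_table[n_seen].path,
                         sizeof seen_table[n_seen].path, "%s", path);
                n_seen++;
        }

        printf("FILE %s (full payload)\n", path);
}
```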
Well, the design focuses on composability and reproducibility, and that means that if you serialize a directory /foo/a, then a directory /foo/b, and then /foo as a whole, then /foo should really reference the same blocks as /foo/a and /foo/b...

I mean, if you handle hardlinks (as you propose) the algorithm would be something like this:

Serialization: For each inode we encounter with st_nlink > 1 we'd create a db entry somewhere, with a lookup key of ino_t+dev_t and a lookup value of the fs path. If the db entry is newly created you'd serialize things as a regular file; if it already exists you'd instead serialize things as your new proposed HARDLINK object.

Deserialization: whenever a HARDLINK is encountered, you just call link(), trivially.

Now, if we retain the composability the algorithm would be changed like this:

Serialization: As before, for each inode with st_nlink > 1 we create a db entry somewhere. Regardless of whether the db entry already exists or not, we'd serialize the file normally (i.e. as a regular file, no "HARDLINK" object involved here). This time the db would use ino_t+dev_t as key (as above), but the value would not be a single path but a set of paths, i.e. all paths we discovered the inode under. Then, as we continue serializing, whenever we go one recursion level up (i.e. return from one file/dir to the dir it is contained in), we check if any of the db entries have two or more paths with a prefix that matches our dir. If so, we generate a HARDLINK object with all listed paths.

Deserialization: we deserialize normally; whenever we hit a HARDLINK object we replace all listed objects with a hardlink to the first one listed.

So, yes, this is a bit more complex to implement, but I really like the concept of composability, and thankfully the chunking should remove the redundancy this introduces again. The composability is not just a theoretical nicety, btw: it also means that searching for a file inside an archive (the way the fuse logic in casync does it on each file access) is conceptually a bisection, as we search for it strictly in a constantly shrinking search window, if you will...
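The deserialization side of the composable variant really is the trivial part. A minimal sketch, assuming all listed paths were already extracted with full payload as the scheme requires (apply_hardlink_object is a hypothetical name, not casync API):

```c
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

/* When a HARDLINK object listing n paths is hit, the already-extracted
 * copies at paths[1..n-1] are replaced by hard links to paths[0]. */
int apply_hardlink_object(char *const paths[], size_t n) {
        for (size_t i = 1; i < n; i++) {
                /* Drop the duplicate payload that composability forced
                 * us to extract... */
                if (unlink(paths[i]) < 0) {
                        perror("unlink");
                        return -1;
                }

                /* ...and relink the name to the first listed path. */
                if (link(paths[0], paths[i]) < 0) {
                        perror("link");
                        return -1;
                }
        }
        return 0;
}
```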
This basically stores hardlinked content multiple times, which makes the .catar format quite inefficient if you have many hardlinks. The deduplication layer may reduce the space waste in some cases, but this is a serious drawback if you use the .catar format without the dedup layer above it...
I also stumbled upon this issue looking at casync for container images. Therefore I can understand that it's not fun viewing how the root-filesystem of a […]. I understand that what @poettering proposes is helpful for the […]. I wonder why @maurerdietmar wants to use […]. I'm looking at casync using the […].
It is important that a restore produce an exact copy of the backed-up data (especially when doing container backups). It seems the current catar implementation does not correctly restore hardlinks, and instead creates multiple independent copies.
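As a quick way to check this behaviour after a restore (plain POSIX, not part of casync): two paths name the same inode iff both st_dev and st_ino match, so a correctly restored hardlink pair should satisfy:

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Returns true iff the two paths refer to the same inode, i.e. the
 * restored files are genuine hard links rather than separate copies. */
bool same_inode(const char *a, const char *b) {
        struct stat sa, sb;

        if (lstat(a, &sa) < 0 || lstat(b, &sb) < 0)
                return false;

        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

With the behaviour described above, this returns false for pairs that were hard links in the original tree.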