Skip to content

Commit

Permalink
Switched to traversal-based compact logic
Browse files Browse the repository at this point in the history
This simplifies some of the interactions between reading and writing
inside the commit logic. Unfortunately this change didn't decrease
code size as was initially hoped, but it does offer a nice runtime
improvement for the common case and should improve debugability.

Before, the compact logic required three iterations:
1. iterate through all the ids in a directory
2. scan attrs bound to each id in the directory
3. lookup attrs in the in-progress commit

The code for this, while terse and complicated, did have some nice side
effect. The directory lookup logic could be reused for looking up in the
in-progress commit, and iterating through each id allows us to know
exactly how many ids we can fit during a compact. Giving us a O(n^3)
compact and O(n^3) split.

However, this was complicated by a few things.

First, this compact logic doesn't handle deleted attrs. To work around this,
I added a marker for the last commit (or first based on your perspective)
which would indicate if a delete should be copied over. This worked but was
a bit hacky and meant deletes weren't cleaned up on the first compact.

Second, we can't actually figure out our compacted size until we
compact. This worked ok except for the fact that splits will always have a
failed compact. This means we waste an erase which could very expensive.
It is possible to work around this by keeping our work, but with only a
single prog cache this was very tricky and also somewhat hacky.

Third, the interactions between reading and writing to the same block
were tricky and error-prone. They should mostly be working now, but
seeing this requirement go away does not make me sad.

The new compact logic fixes these issues by moving the complexity into a
general-purpose lfs_dir_traverse function which has much fewer side
effects on the system. We can even use it for dry-runs to precompute our
estimated size.

How does it work?
1. iterate through all attr in the directory
2. for each attr, scan the rest of the directory to figure out the
   attr's history, this will change the attr based on dir modifications
   and may even exit early if the attr was deleted.

The end result is a traversal function that gives us the resulting state
of each attr in only O(n^2). To make this complete, we allow a bounded
recursion into mcu-side move attrs, although this ends up being O(n^3)
unlike moves in the original solution (however moves are less common.

This gives us a nice traversal function we can use for compacts and
moves, handles deletes, and is overall simpler to reason about.

Two minor hiccups:
1. We need to handle create attrs specially, since this algorithm
   doesn't care or id order, which can cause problems since attr
   insertion are order sensitive. We can fix this by simply looking up
   each create (since there is only one per file) in order at the
   beginning of our traversal. This is oddly complimentary to the move
   logic, which also handles create attrs separately.

2. We no longer know exactly how many ids we can write to a dir during
   splits. However, since we can do a dry-run traversal, we can use that
   to simply binary search for the mid-point.

This gives us a O(n^2) compact and O(n^2 log n) split, which is a nice
minor improvement (remember n is bounded by block size).
  • Loading branch information
geky committed Dec 28, 2018
1 parent dc507a7 commit a548ce6
Show file tree
Hide file tree
Showing 2 changed files with 411 additions and 347 deletions.
Loading

0 comments on commit a548ce6

Please sign in to comment.