
optimize filestore verify #5286

Closed
wants to merge 2 commits into from

Conversation

Stebalien
Member

@Stebalien Stebalien commented Jul 26, 2018

This works by spinning up ncores * 2 workers and having them iterate over the
datastore in parallel.

  • TODO: Benchmark this. It may not be any faster (but it really should be).
    • Looks 2x faster. I'd appreciate more benchmarks.
  • TODO: Consider creating new workers on-demand. The bottleneck here could be
    reading from the datastore, not hashing.
    • Not worth it. The bottleneck really is hashing (as far as I can tell, at least).

Also:

  • Removes sorting from the filestore functions. This feels like something that belongs in the commands.
  • Deduplicates command logic.
  • Switches to commands-lib 1.0

@Stebalien Stebalien requested a review from Kubuxu as a code owner July 26, 2018 09:44
@ghost ghost assigned Stebalien Jul 26, 2018
@ghost ghost added the status/in-progress In progress label Jul 26, 2018
@Stebalien Stebalien force-pushed the feat/faster-filestore-verify branch 2 times, most recently from c81ec24 to f113ed6 Compare July 26, 2018 16:39
Also:

* Remove sorting from the filestore functions. This feels like something that
  belongs in the commands.
* Deduplicate command logic.
* Switch to commands-lib 1.0

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien Stebalien force-pushed the feat/faster-filestore-verify branch from f113ed6 to 98d204b Compare July 26, 2018 16:40
@kevina kevina self-requested a review July 26, 2018 16:51
@kevina
Contributor

kevina commented Jul 26, 2018

This is mostly my code, please give me a chance to review this before accepting. Thank you.

@kevina
Contributor

kevina commented Jul 26, 2018

Removes sorting from the filestore functions. This feels like something that belongs in the commands.

I am not sure I agree with this. What is the reason for this other than "feels like something that belongs in the commands".

@Stebalien
Member Author

I am not sure I agree with this. What is the reason for this other than "feels like something that belongs in the commands".

It's how we implement all of our other commands that sort output, as far as I know. That is, it feels like a display concern.

we might as well work ahead a bit

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien
Member Author

Simple benchmarks indicate that this is about 2x faster on my machine (it only has two real cores so it can probably only do 2 sha256 operations at a time).

@Stebalien Stebalien added the need/review Needs a review label Jul 26, 2018
Contributor

@kevina kevina left a comment

As long as the output is the same, this looks good.

@NiKiZe

NiKiZe commented Jul 26, 2018

The main reason for doing the file-sort is to access files sequentially from disk, see #3922
I think that feature should (read: must) be kept, and the help text updated to reflect that.

@kevina
Contributor

kevina commented Jul 26, 2018

The main reason for doing the file-sort is to access files sequentially from disk, see #3922
I think that feature should (read must) be kept, and the help text updated to reflect that

I forgot about that. Blocking until this is addressed.

Contributor

@kevina kevina left a comment

It is important that we keep the functionality of reading files sequentially to minimize seeking. Because of this, it feels wrong for the client to handle it. See #3922.

@Stebalien Stebalien added the status/blocked Unable to be worked further until needs are met label Jul 26, 2018
@Stebalien
Member Author

I see. I assumed that the filestore would internally take care of this (caching, etc.) but I guess that would fail for random access on large datasets.

It would be nice if we could avoid opening files multiple times but I guess we can't do that :(.

@NiKiZe

NiKiZe commented Jul 26, 2018

I think the most important thing about sequential access here is that it is done "one file at a time", rather than "one IPFS block at a time, all over the place", which is what I observed when running without file-sort.
I don't know anything about the internals, but something like:

  1. grab a block
  2. find all blocks for that file
  3. sort the blocks for that file
  4. remove all those blocks from "to be checked" list
  5. return to 1 to grab the next until the "to be checked" list is empty
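The per-file grouping proposed in the steps above could be sketched as follows. `Block` and its `FilePath`/`Offset` fields are hypothetical; as noted later in this thread, the real filestore only indexes blocks by hash, so this information would first have to be reconstructed:

```go
package main

import (
	"fmt"
	"sort"
)

// Block is a hypothetical record pairing a block key with the
// backing file and offset it came from.
type Block struct {
	Key      string
	FilePath string
	Offset   int64
}

// groupByFile consumes the "to be checked" list, buckets blocks by
// backing file (steps 1, 2, and 4), and sorts each file's blocks by
// offset (step 3) so every file can be read once, sequentially.
func groupByFile(toCheck []Block) map[string][]Block {
	byFile := make(map[string][]Block)
	for _, b := range toCheck {
		byFile[b.FilePath] = append(byFile[b.FilePath], b)
	}
	for _, blocks := range byFile {
		sort.Slice(blocks, func(i, j int) bool {
			return blocks[i].Offset < blocks[j].Offset
		})
	}
	return byFile
}

func main() {
	byFile := groupByFile([]Block{
		{"k3", "a.bin", 512}, {"k1", "a.bin", 0}, {"k2", "b.bin", 0},
	})
	fmt.Println(len(byFile["a.bin"]), byFile["a.bin"][0].Key) // prints: 2 k1
}
```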

@kevina
Contributor

kevina commented Jul 26, 2018

@NiKiZe

I think that the most important thing with sequential access here is that it is done "one file at a time"

Unfortunately, we don't store enough information to do that. We only index blocks by their hash, not by the underlying file.

@Stebalien
Member Author

Unfortunately, we don't store enough information to do that. We only index blocks by their hash, not by the underlying file.

But we can construct this by sorting. That's what I'm currently working on. However, the index will have to fit in memory, for now. We could also try adding an on-disk reverse-index but that feels expensive.

@kevina
Contributor

kevina commented Jul 26, 2018

But we can construct this by sorting.

And that is exactly what listAllFileOrder does, or am I missing something?

@Stebalien
Member Author

And that is exactly what listAllFileOrder does, or am I missing something?

The missing pieces are:

  1. Not opening each file once per chunk.
  2. Hashing in parallel.

My plan is to have one thread do a sequential read while a set of workers hash the file chunks in parallel.
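That plan might look roughly like the sketch below: a single goroutine does the sequential read (simulated here by slicing an in-memory file into chunks) while a pool of workers hashes the chunks in parallel. The chunk size, worker count, and `hashChunks` function are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// chunkSize is tiny for illustration; the filestore's real chunks
// are far larger (256KiB by default).
const chunkSize = 4

// hashChunks has one goroutine read the file sequentially while
// `workers` goroutines hash the chunks in parallel. It returns the
// number of chunks hashed.
func hashChunks(file []byte, workers int) int {
	chunks := make(chan []byte)

	// Single sequential reader: one ordered pass over the file, so
	// the disk never has to seek between chunks.
	go func() {
		for off := 0; off < len(file); off += chunkSize {
			end := off + chunkSize
			if end > len(file) {
				end = len(file)
			}
			chunks <- file[off:end]
		}
		close(chunks)
	}()

	// Hashing workers: the CPU-bound sha256 work runs in parallel.
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		hashed int
	)
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for c := range chunks {
				_ = sha256.Sum256(c)
				mu.Lock()
				hashed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return hashed
}

func main() {
	fmt.Println(hashChunks([]byte("0123456789"), 4)) // 10 bytes / 4-byte chunks -> prints 3
}
```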


However, after thinking about this, maybe we should just build the reverse index. It could also save a lot of work when calling add --nocopy on the same set of files multiple times (a common use-case, as far as I can tell).

@kevina
Contributor

kevina commented Jul 26, 2018

Not opening each file once per chunk.

We should measure whether this is really a problem. My guess is that the overhead should be negligible, but I could be wrong.

@Stebalien
Member Author

For the filestore, it may be negligible (given the 256KiB chunks); for the urlstore, I'm not so sure. My primary concern is really hashing in parallel.

@NiKiZe

NiKiZe commented Jul 27, 2018

From ipfs/notes#296 (comment)

  • with 0.4.15 running with --file-order it is done in around 56minutes (guestimating 70-120MB/s disk usage)
  • with 0.4.15 and running without --file-order I canceled after 2hours 30minutes (guestimating ~16-30MB/s disk usage)
  • running this branch, commit 23f5cd4 (--file-order used as well) done after 3hours 36minutes (25-33MB/s disk usage was seen)

So for me it is a slowdown. (My guess is that BTRFS is mainly to blame for this difference; I might be able to recreate this on EXT4 instead.)

  • Another thing (that could be seen as a regression) is that the output is now all emitted at the end (which makes sense, since the sorting is now done last), while previously we got output as blocks were done, which also served as a progress indicator. Anyone acting on piped output might be affected by this.

@Stebalien
Member Author

Closing due to bitrot (and brainrot, mine specifically).

@Stebalien Stebalien closed this Jan 24, 2019
@ghost ghost removed the status/in-progress In progress label Jan 24, 2019
@hacdias hacdias deleted the feat/faster-filestore-verify branch May 9, 2023 10:55