
optimize filestore verify #5286

Closed
wants to merge 2 commits into from

Conversation

Stebalien
Member

@Stebalien Stebalien commented Jul 26, 2018

This works by spinning up ncores * 2 workers and having them iterate over the
datastore in parallel.

  • TODO: Benchmark this. It may not be any faster (but it really should be).
    • Looks 2x faster. I'd appreciate more benchmarks.
  • TODO: Consider creating new workers on-demand. The bottleneck here could be
    reading from the datastore, not hashing.
    • Not worth it. The bottleneck really is hashing (as far as I can tell, at least).

Also:

  • Removes sorting from the filestore functions. This feels like something that belongs in the commands.
  • Deduplicates command logic.
  • Switches to commands-lib 1.0

@Stebalien Stebalien requested a review from Kubuxu as a code owner July 26, 2018 09:44
@ghost ghost assigned Stebalien Jul 26, 2018
@ghost ghost added the status/in-progress In progress label Jul 26, 2018
@Stebalien Stebalien force-pushed the feat/faster-filestore-verify branch 2 times, most recently from c81ec24 to f113ed6 Compare July 26, 2018 16:39
Also:

* Remove sorting from the filestore functions. This feels like something that
  belongs in the commands.
* Deduplicate command logic.
* Switch to commands-lib 1.0

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien Stebalien force-pushed the feat/faster-filestore-verify branch from f113ed6 to 98d204b Compare July 26, 2018 16:40
@kevina kevina self-requested a review July 26, 2018 16:51
@kevina
Contributor

kevina commented Jul 26, 2018

This is mostly my code, please give me a chance to review this before accepting. Thank you.

@kevina
Contributor

kevina commented Jul 26, 2018

Removes sorting from the filestore functions. This feels like something that belongs in the commands.

I am not sure I agree with this. What is the reason for this other than "feels like something that belongs in the commands".

@Stebalien
Member Author

I am not sure I agree with this. What is the reason for this other than "feels like something that belongs in the commands".

It's how we implement all of our other commands that sort output, as far as I know. That is, it feels like a display concern.

we might as well work ahead a bit

License: MIT
Signed-off-by: Steven Allen <[email protected]>
@Stebalien
Member Author

Simple benchmarks indicate that this is about 2x faster on my machine (it only has two real cores so it can probably only do 2 sha256 operations at a time).

@Stebalien Stebalien added the need/review Needs a review label Jul 26, 2018
Contributor

@kevina kevina left a comment

As long as the output is the same, this looks good.

@NiKiZe

NiKiZe commented Jul 26, 2018

The main reason for doing the file-sort is to access files sequentially from disk, see #3922
I think that feature should (read: must) be kept, and the help text updated to reflect that.

@kevina
Contributor

kevina commented Jul 26, 2018

The main reason for doing the file-sort is to access files sequentially from disk, see #3922
I think that feature should (read must) be kept, and the help text updated to reflect that

I forgot about that. Blocking until this is addressed.

Contributor

@kevina kevina left a comment

It is important that we keep the functionality of reading files sequentially to minimize seeking. Because of this, it feels wrong for the client to handle it. See #3922.

@Stebalien Stebalien added the status/blocked Unable to be worked further until needs are met label Jul 26, 2018
@Stebalien
Member Author

I see. I assumed that the filestore would internally take care of this (caching, etc.) but I guess that would fail for random access on large datasets.

It would be nice if we could avoid opening files multiple times but I guess we can't do that :(.

@NiKiZe

NiKiZe commented Jul 26, 2018

I think the most important thing about sequential access here is that it is done "one file at a time", rather than "one IPFS block at a time, all over the place", which is what I observed when running without file-sort.
I don't know anything about the internals, but something like:

  1. grab a block
  2. find all blocks for that file
  3. sort the blocks for that file
  4. remove all those blocks from "to be checked" list
  5. return to 1 to grab the next until the "to be checked" list is empty
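The per-file grouping proposed in the steps above could be sketched as follows. `Block` and its `FilePath`/`Offset` fields are hypothetical; as noted later in this thread, the real filestore only indexes blocks by hash, so this information would first have to be reconstructed:

```go
package main

import (
	"fmt"
	"sort"
)

// Block is a hypothetical record pairing a block key with the
// backing file and offset it came from.
type Block struct {
	Key      string
	FilePath string
	Offset   int64
}

// groupByFile consumes the "to be checked" list, buckets blocks by
// backing file (steps 1, 2, and 4), and sorts each file's blocks by
// offset (step 3) so every file can be read once, sequentially.
func groupByFile(toCheck []Block) map[string][]Block {
	byFile := make(map[string][]Block)
	for _, b := range toCheck {
		byFile[b.FilePath] = append(byFile[b.FilePath], b)
	}
	for _, blocks := range byFile {
		sort.Slice(blocks, func(i, j int) bool {
			return blocks[i].Offset < blocks[j].Offset
		})
	}
	return byFile
}

func main() {
	byFile := groupByFile([]Block{
		{"k3", "a.bin", 512}, {"k1", "a.bin", 0}, {"k2", "b.bin", 0},
	})
	fmt.Println(len(byFile["a.bin"]), byFile["a.bin"][0].Key) // prints: 2 k1
}
```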

@kevina
Contributor

kevina commented Jul 26, 2018

@NiKiZe

I think that the most important thing with sequential access here is that it is done "one file at a time"

Unfortunately, we don't store enough information to do that. We only index blocks by their hash, not by the underlying file.

@Stebalien
Member Author

Unfortunately, we don't store enough information to do that. We only index blocks by their hash, not by the underlying file.

But we can construct this by sorting. That's what I'm currently working on. However, the index will have to fit in memory, for now. We could also try adding an on-disk reverse-index but that feels expensive.

@kevina
Contributor

kevina commented Jul 26, 2018

But we can construct this by sorting.

And that is exactly what listAllFileOrder does, or am I missing something?

@Stebalien
Member Author

And that is exactly what listAllFileOrder does, or am I missing something?

The missing pieces are:

  1. Not opening each file once per chunk.
  2. Hashing in parallel.

My plan is to have one thread do a sequential read while a set of workers hash the file chunks in parallel.
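That plan might look roughly like the sketch below: a single goroutine does the sequential read (simulated here by slicing an in-memory file into chunks) while a pool of workers hashes the chunks in parallel. The chunk size, worker count, and `hashChunks` function are illustrative assumptions, not the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// chunkSize is tiny for illustration; the filestore's real chunks
// are far larger (256KiB by default).
const chunkSize = 4

// hashChunks has one goroutine read the file sequentially while
// `workers` goroutines hash the chunks in parallel. It returns the
// number of chunks hashed.
func hashChunks(file []byte, workers int) int {
	chunks := make(chan []byte)

	// Single sequential reader: one ordered pass over the file, so
	// the disk never has to seek between chunks.
	go func() {
		for off := 0; off < len(file); off += chunkSize {
			end := off + chunkSize
			if end > len(file) {
				end = len(file)
			}
			chunks <- file[off:end]
		}
		close(chunks)
	}()

	// Hashing workers: the CPU-bound sha256 work runs in parallel.
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		hashed int
	)
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			for c := range chunks {
				_ = sha256.Sum256(c)
				mu.Lock()
				hashed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return hashed
}

func main() {
	fmt.Println(hashChunks([]byte("0123456789"), 4)) // 10 bytes / 4-byte chunks -> prints 3
}
```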


However, after thinking about this, maybe we should just build the reverse index. It could also save a lot of work when calling add --nocopy on the same set of files multiple times (a common use-case, as far as I can tell).

@kevina
Contributor

kevina commented Jul 26, 2018

Not opening each file once per chunk.

We should measure whether this is really a problem. My guess is that the overhead should be negligible, but I could be wrong.

@Stebalien
Member Author

For the filestore, it may be negligible (given the 256KiB chunks); for the urlstore, I'm not so sure. My primary concern is really hashing in parallel.

@NiKiZe

NiKiZe commented Jul 27, 2018

From ipfs/notes#296 (comment)

  • with 0.4.15 running with --file-order it is done in around 56minutes (guestimating 70-120MB/s disk usage)
  • with 0.4.15 and running without --file-order I canceled after 2hours 30minutes (guestimating ~16-30MB/s disk usage)
  • running this branch, commit 23f5cd4 (--file-order used as well) done after 3hours 36minutes (25-33MB/s disk usage was seen)

So for me it is a slowdown. (My guess is that BTRFS is mainly to blame for this difference; I might be able to recreate this on EXT4 instead.)

  • Another thing (that could be seen as a regression) is that the output is now all emitted at the end (which makes sense, since the sorting is now done last), while previously we got output as blocks were done, which also served as a progress indicator. Anyone acting on piped output might be affected by this.

@Stebalien
Member Author

Closing due to bitrot (and brainrot, mine specifically).

@Stebalien Stebalien closed this Jan 24, 2019
@ghost ghost removed the status/in-progress In progress label Jan 24, 2019
@hacdias hacdias deleted the feat/faster-filestore-verify branch May 9, 2023 10:55