Make ipfs filestore verify process files sequentially (#3922)
Comments
The easiest way to implement this would be to read all the hashes, then do a full sort based on the filename and offset. It will be memory intensive, but it could be implemented as an option.
Good idea to make it optional. Might it not be easier/better to do the sorting on the datastore side, though?
Sorting on the datastore side would effectively mean sorting based on the value of the key, so there isn't any real benefit there. @whyrusleeping should I just go ahead and implement a sort?
@kevina Hrm... I think there's something here that we can do better. Given the root hash of a file stored in the filestore, we could do a graph traversal to iterate over the blocks on disk sequentially. That seems like the right way to approach this, but then we need a way to get those root hashes. This could be done by scraping all the hashes in the repo, checking whether each is a unixfs file, and then, for each unixfs file we find, checking whether it contains at least one filestore node. I'm not sure if that would be more efficient than enumerating and sorting everything in the filestore, but those are my thoughts on the matter.
Even though theoretically more efficient (and sensible, perhaps), it might create unobvious implicit dependencies. For example, no one would expect the number of non-filestore objects to affect the performance of a filestore operation.
Perhaps it would be best to implement a 'simple' sort first, write tests for it (or even a benchmark), and then see whether it can be optimized later — perhaps once there is a list of file -> hash references alongside the list of hashes of the filestore objects themselves.
@whyrusleeping yes, I considered that but rejected it. The first problem is that we don't store the roots, so finding them would be rather expensive. We can't tell from the keys alone, so we would have to read each value from the flatfs datastore to determine whether it is a unixfs file, and then check whether its children are in the filestore; until raw leaves become very common, this could be very expensive. There is also the problem @dokterbob pointed out: the performance would depend on what is stored outside the filestore. But even if we had a list of roots stored somewhere, there is a second problem: doing it the way you suggested, we could end up checking the same key twice. To avoid that we would need to keep some sort of in-memory data structure of the keys we have already visited, which negates much of the memory improvement of doing a graph traversal. Thus, in my opinion, simply scanning the entire filestore, storing the path and key in a vector or list, and then sorting that list by path is the best way to handle this with our current implementation.
Alright, you guys have convinced me. Sorting that way seems good.
@dokterbob could you try out #3938 and let me know if it works okay for you? The usage would be "ipfs filestore verify --file-order". I imagine this will only make a difference on devices with rotating disks; on SSDs it might be slower, as seeking isn't an issue there.
Thanks guys! 🥇
Version information:
go-ipfs version: 0.4.8-
Repo version: 5
System version: amd64/darwin
Golang version: go1.8
Type: Enhancement
Severity: Medium
Description:
Currently, ipfs filestore verify verifies blocks ordered by hash. This causes an immense number of unnecessary seeks which, especially on devices with rotating disks, makes verification much slower than it could be, unnecessarily stresses system resources, and causes wear. If, instead, blocks were ordered by file (and by their order therein), this would produce sequential reads, which ought to be much faster.
Having no knowledge of the underlying implementation, the questions are: what is necessary to implement this? Are there any architectural features limiting its implementation? Is it realistic to do a full sort before the verify procedure and, if not, what could the alternatives be (e.g. partial sorting, multiple sorting queues, etc.)?