Make ipfs filestore verify process files sequentially (#3922)
Comments
The easiest way to implement this would be to read all the hashes, then do a full sort based on the filename and offset. It will be memory intensive, but it could be implemented as an option.
Good idea to make it optional. Might it not be easier/better to do the sorting on the datastore side, though?
Sorting on the datastore side would effectively mean sorting based on the value of the key, so there isn't any real benefit there. @whyrusleeping should I just go ahead and implement a sort?
@kevina Hrm... I think there's something here that we can do better. Given the root hash of a file stored in the filestore, we could do a graph traversal to iterate over the blocks on disk sequentially. That seems like the right way to approach this, but then we need a way to get those root hashes. This could be done by scraping all the hashes in the repo, checking whether each is a unixfs file, and then, for each unixfs file we find, checking whether it contains at least one filestore node. I'm not sure if that would be more efficient than enumerating and sorting everything in the filestore, but those are my thoughts on the matter.
Even though theoretically more efficient (and sensible, perhaps), it might create unobvious implicit dependencies. For example, no one would expect the number of non-filestore objects to affect the performance of a filestore operation.
Perhaps it would be best to implement a 'simple' sort first, write tests for it (or even a benchmark), and then see whether it can be optimized later — perhaps once there is a list of file -> hash references alongside the list of hashes of the filestore objects themselves.
@whyrusleeping yes, I considered that but rejected it. The first problem is that we don't store the roots, so finding them would be rather expensive. We can't tell from the keys alone, so we would have to read each value from the flatfs datastore to determine whether it is a unixfs file, and then check whether its children are in the filestore; until raw leaves become very common, this could be very expensive. There is also the problem @dokterbob pointed out: the performance would depend on what is stored outside the filestore. But even if we had a list of roots stored somewhere, there is a second problem: doing it the way you suggested, we could end up checking the same key twice. To avoid that we would need to keep some sort of in-memory data structure of the keys we have already visited, which negates much of the memory improvement of doing a graph traversal. Thus, in my opinion, simply scanning the entire filestore, storing the path and key in a vector or list, and then sorting that list by path is the best way to handle this with our current implementation.
Alright, you guys have convinced me. Sorting that way seems good.
@dokterbob could you try out #3938 and let me know if it works okay for you? The usage would be "ipfs filestore verify --file-order". I imagine this will only make a difference on devices with rotating disks; on SSDs it might be slower, as seeking isn't an issue there.
Thanks guys! 🥇
Version information:
go-ipfs version: 0.4.8-
Repo version: 5
System version: amd64/darwin
Golang version: go1.8
Type: Enhancement
Severity: Medium
Description:
Currently, ipfs filestore verify verifies blocks ordered by hash. This causes an immense number of unnecessary seeks which, especially on devices with rotating disks, makes verification much slower than it could be, unnecessarily stresses system resources, and causes wear. If, instead, blocks were ordered by file (and by their order therein), this would produce sequential reads, which ought to be much faster.
Having no knowledge of the underlying implementation, the questions are: what is necessary to implement this? Are there any architectural features limiting its implementation? Is it realistic to do a full sort before the verify procedure and, if not, what could the alternatives be (e.g. partial sorting, multiple sorting queues, etc.)?