
CID mismatch with large files #1518

Open
tomohiro-n opened this issue Jul 5, 2024 · 2 comments


@tomohiro-n

I've noticed that the CID we pre-calculate for a file can differ from the CID it gets after being uploaded to your service.
I was then able to reproduce exactly the same mismatch with one of your test cases (the expected value was what we pre-calculated, the actual value was the one after upload).

Most likely, it depends on the file size. As far as we've checked, the mismatch occurs when the file is larger than roughly 1.9 MB.

The "uploads a file to the service" test in the upload-client package fails after the following change:

-    const bytes = await randomBytes(128)
+    const bytes = fs.readFileSync('/path/to/large-file')

I've confirmed that the test still passes with a 400 KB file.

@tomohiro-n changed the title from "CID mismatch for large files" to "CID mismatch with large files" on Jul 5, 2024
@tomohiro-n

An easier way to reproduce this: changing the line to const bytes = await randomBytes(2_000_000) is enough to make the test fail.

@StefanoDeVuono

StefanoDeVuono commented Aug 17, 2024

Hi! I'm not too familiar with your code base, but I took a look at this and found a couple of things:

TLDR:
Problem:
The actual code and the test code create data streams of different sizes, which then become CARs with different IDs.

Fix (PR 2532):
Use the UnixFS module's createFileEncoderStream in the test helper. The actual code and the test code then create data streams of the same size, which results in the same CAR IDs for large files.


Details:
The actual code calls CarWriter.create, while the test helper calls toCAR. For small byte sizes, like 128, the expected CID and the actual CID are the same, as desired. Moreover, the number of bytes in the expected CAR instance and in the actual CAR instance is the same:
[Screenshot: "128k before"]

If we use more bytes (2_000_000), the expected CAR instance (2000098 bytes) is smaller than the actual one (2000283 bytes).

The test helper's toCAR method does not chunk the bytes, while the UnixFS module underlying the actual code, with a max chunk size of 1024 * 1024, splits 2_000_000 bytes into three chunks. So in the real code, additional data is added to each chunk, which accounts for the larger CAR.
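
As a rough check of the sizes involved, here is a minimal sketch of the chunk arithmetic. It assumes a fixed-size chunker and uses only the 1024 * 1024 limit quoted above; leafChunkCount is a hypothetical helper, not part of the UnixFS module. By this arithmetic alone, 2_000_000 bytes needs two data chunks, so the third chunk mentioned above may be a root node that links them (an assumption, not something I have verified in the code).

```typescript
// Max chunk size quoted in the comment above (1 MiB).
const MAX_CHUNK_SIZE = 1024 * 1024;

// Hypothetical helper: how many fixed-size data chunks a payload needs.
// Assumes a plain fixed-size chunker; the real module may differ.
function leafChunkCount(
  byteLength: number,
  chunkSize: number = MAX_CHUNK_SIZE
): number {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  if (byteLength <= 0) return 0;
  return Math.ceil(byteLength / chunkSize);
}

// Sizes mentioned in this thread: 128 bytes, about 400 KB, 2_000_000 bytes.
for (const size of [128, 400_000, 2_000_000]) {
  console.log(`${size} bytes -> ${leafChunkCount(size)} data chunk(s)`);
}
```

Anything up to 1_048_576 bytes fits in a single chunk, which is consistent with the 128-byte and 400 KB cases passing.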

By using the UnixFS module's createFileEncoderStream method in the test helper to make a chunked stream before making the CAR object, the same headers get added to each chunk and the same CAR gets created (see PR 2532). The tests then pass at both 128 bytes and 2_000_000 bytes.
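
To illustrate why both sides need the same chunking, here is a toy model using plain SHA-256 from Node's crypto module. It is not the UnixFS or CAR encoding: singleBlockId and chunkedRootId are hypothetical stand-ins, and the "root" is just a hash over the chunk hashes. It only shows the general effect that an identifier derived from chunked data differs from one derived from the unchunked bytes once the data spans more than one chunk.

```typescript
import { createHash } from "node:crypto";

const sha256 = (data: Uint8Array): string =>
  createHash("sha256").update(data).digest("hex");

// Toy stand-in for an unchunked encoder: one block, identified by its hash.
function singleBlockId(bytes: Uint8Array): string {
  return sha256(bytes);
}

// Toy stand-in for a chunking encoder: hash each chunk, then derive a
// root id from the list of chunk ids.
function chunkedRootId(bytes: Uint8Array, chunkSize: number): string {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  const leafIds: string[] = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    leafIds.push(sha256(bytes.subarray(offset, offset + chunkSize)));
  }
  // A single chunk needs no root node, so its id is the leaf id itself.
  if (leafIds.length === 1) return leafIds[0];
  return sha256(new TextEncoder().encode(leafIds.join(",")));
}

const CHUNK_SIZE = 1024 * 1024;
const small = new Uint8Array(128).fill(1);
const large = new Uint8Array(2_000_000).fill(1);

// Small payload: both approaches agree, because there is only one chunk.
console.log(singleBlockId(small) === chunkedRootId(small, CHUNK_SIZE));
// Large payload: the ids differ unless both sides chunk the same way.
console.log(singleBlockId(large) === chunkedRootId(large, CHUNK_SIZE));
console.log(
  chunkedRootId(large, CHUNK_SIZE) === chunkedRootId(large, CHUNK_SIZE)
);
```

In this model, the fix described above corresponds to calling the chunking function on both the expected and the actual side.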

Hopefully, this is helpful!
