CID mismatch with large files #1518

tomohiro-n · 2024-07-05T15:29:39Z

I've noticed that the CID we pre-calculate for a file can differ from the CID reported after the file is uploaded to your service.
I was then able to reproduce the same mismatch with one of your test cases (expected: the value we pre-calculated; actual: the value after upload).

The mismatch most likely depends on file size. As far as we've checked, it appears when the file is larger than roughly 1.9 MB.

The "uploads a file to the service" test in the upload-client package fails when changed as follows.

-    const bytes = await randomBytes(128)
+    const bytes = fs.readFileSync('/path/to/large-file')

I've confirmed that the test passes with a 400 KB file.


tomohiro-n · 2024-07-08T04:29:49Z

An easier way to reproduce: changing the line to const bytes = await randomBytes(2_000_000) is enough to make the test fail.

StefanoDeVuono · 2024-08-17T04:34:53Z

Hi! I'm not too familiar with your code base, but I took a look at this and found a couple of things:

TLDR:
Problem
The actual code and the test code create differently sized data streams, which then become CARs with different IDs.

Fix (PR 1532):
Use the UnixFS module's createFileEncoderStream in the test helper. The actual code and the test code then create same-sized data streams, resulting in the same CAR IDs for large files.

Details:
The actual code calls CarWriter.create, while the test helper calls toCAR. For small inputs, such as 128 bytes, the expected CID and the actual CID match, as desired. Moreover, the expected and actual CAR instances contain the same number of bytes.

With more bytes, the expected CAR instance (2000098 bytes) is smaller than the actual one (2000283 bytes).

The test helper's toCAR method does not chunk the bytes, while the UnixFS module underlying the actual code, with a max chunk size of 1024 * 1024, splits 2_000_000 into three chunks. So, in the real code, additional data is added to each chunk.
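
As a rough check of the chunking arithmetic, here is a minimal sketch of a fixed-size chunker using the 1024 * 1024 limit quoted above. The chunkSizes helper is hypothetical and is not the UnixFS module's API; it only counts payload bytes. Under that assumption, 2_000_000 bytes yield two data chunks, so the third chunk mentioned above may be a root node that links them. That reading is an assumption, not something confirmed in this thread.

```javascript
// Hypothetical helper (not part of upload-client): returns the payload sizes
// a fixed-size chunker would emit for an input of totalBytes.
function chunkSizes(totalBytes, maxChunkSize) {
  const sizes = []
  for (let offset = 0; offset < totalBytes; offset += maxChunkSize) {
    sizes.push(Math.min(maxChunkSize, totalBytes - offset))
  }
  return sizes
}

// Max chunk size quoted in the comment above.
const MAX_CHUNK_SIZE = 1024 * 1024

const smallSizes = chunkSizes(128, MAX_CHUNK_SIZE)
const largeSizes = chunkSizes(2_000_000, MAX_CHUNK_SIZE)

console.log(smallSizes) // [ 128 ]
console.log(largeSizes) // [ 1048576, 951424 ]

// Size gap between the two CAR instances reported in this thread.
const carSizeGap = 2000283 - 2000098
console.log(carSizeGap) // 185
```

A 128-byte input stays in a single chunk, which is consistent with the small-file case producing matching CIDs.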

By using the UnixFS module's createFileEncoderStream method to build a chunked stream before creating the CAR object, the test helper adds the same headers to each chunk and produces the same CAR (see PR 1532). The tests then pass at both 128 bytes and 2_000_000 bytes.
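
To illustrate why sharing one encoder between the test helper and the real code removes the mismatch, here is a toy model. Everything in it is hypothetical: toyHash is a non-cryptographic stand-in for a real content hash, and idUnchunked and idChunked only imitate the shape of the two code paths, not the actual toCAR or createFileEncoderStream implementations.

```javascript
// Toy stand-in for a content hash (32-bit FNV-1a). Real CIDs use a
// cryptographic multihash; this is only for illustration.
function toyHash(bytes) {
  let h = 0x811c9dc5
  for (const b of bytes) {
    h = Math.imul(h ^ b, 0x01000193) >>> 0
  }
  return h.toString(16).padStart(8, '0')
}

// Toy model of an unchunked helper: one block that holds all the bytes.
function idUnchunked(bytes) {
  return `raw:${toyHash(bytes)}`
}

// Toy model of a chunking encoder: a single chunk keeps its raw ID,
// while several chunks get a root node that links the chunk IDs.
function idChunked(bytes, maxChunkSize) {
  const ids = []
  for (let offset = 0; offset < bytes.length; offset += maxChunkSize) {
    ids.push(`raw:${toyHash(bytes.subarray(offset, offset + maxChunkSize))}`)
  }
  if (ids.length === 1) return ids[0]
  return `node:${toyHash(new TextEncoder().encode(ids.join(',')))}`
}

const MAX_CHUNK = 1024 * 1024
// Deterministic test data so the comparison is repeatable.
const makeBytes = (n) => Uint8Array.from({ length: n }, (_, i) => i % 251)
const small = makeBytes(128)
const large = makeBytes(2_000_000)

// Small input: both paths see a single block, so the IDs agree.
console.log(idUnchunked(small) === idChunked(small, MAX_CHUNK)) // true
// Large input: only one path chunks, so the IDs differ.
console.log(idUnchunked(large) === idChunked(large, MAX_CHUNK)) // false
// Same encoder on both sides: the IDs agree at any size.
console.log(idChunked(large, MAX_CHUNK) === idChunked(large, MAX_CHUNK)) // true
```

The design point is that the expected and actual values should come from the same encoding path, so the comparison does not depend on input size.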

Hopefully, this is helpful!

tomohiro-n changed the title ~~CID mismatch for large files~~ CID mismatch with large files Jul 5, 2024

StefanoDeVuono mentioned this issue Aug 17, 2024

test(upload-client): allow large byte uploads #1532

Open
