Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculating rolling hashes #29903

Closed
overlookmotel opened this issue Oct 9, 2019 · 8 comments
Closed

Calculating rolling hashes #29903

overlookmotel opened this issue Oct 9, 2019 · 8 comments
Labels
crypto Issues and PRs related to the crypto subsystem. feature request Issues that request new features to be added to Node.js.

Comments

@overlookmotel
Copy link

overlookmotel commented Oct 9, 2019

Is your feature request related to a problem? Please describe.

When uploading a large file to Google Drive via Google Drive API, it is recommended to upload in chunks, so that sending can be resumed if transfer of a particular chunk fails (Google API docs).

After transfer of each chunk, the API returns the MD5 hash of all data transferred to date (i.e. of all chunks up to and including the last one). NB This is not documented, but appears in an x-range-md5 HTTP response header.

It would be useful to be able to verify that hash after transfer of each chunk, in order to know if a chunk has been corrupted in transmission. It would then be possible to transfer the chunk again.

This is not feasible to do in Node at present. You can verify the final hash after the very last chunk and make sure it matches for the entire file. That ensures data integrity. But if it doesn't match, you don't know which chunk of the file was corrupted, and have to start the upload again from the beginning (expensive when the file is 100GB!)

Describe the solution you'd like

Some way to calculate "rolling" hashes. i.e. call hash.digest() but then still be able to do further calls to hash.update() and call hash.digest() again.

Possible ways to achieve this:

  1. Keep the internal state of the hash after call to .digest(), so it can be reused.
  2. Add a hash.copy() method which clones the hash so you can call .digest() on the clone, and still retain a "live" hash which you can continue to .update().
  3. As (1) but only enable this feature if crypto.createHash() called with a reuseable option.

@sam-github raised the possibility of a .copy() method in #25857 (comment).

Here's how that would work for my use case:

// NB Simplified code
const CHUNK_SIZE = 256 * 1024; // 256 KiB

async function upload(path, size) {
  const hash = crypto.createHash('md5');

  for (const start = 0; start < size; start += CHUNK_SIZE) {
    const end = start + CHUNK_SIZE - 1;
    const stream = fs.createReadStream(path, {start, end})
      .pipe(new stream.Transform({
        transform(data, encoding, cb) {
          hash.update(data);
          this.push(data);
          cb();
        }
      });

    const md5FromApi = await transferChunkToGoogleDrive(stream, start, end);
    const md5Actual = hash.copy().digest('hex');
    if (md5FromApi !== md5Actual) {
      // Rather than throwing, we could transfer last chunk again.
      // This logic omitted to keep example short.
      throw new Error(`Transfer failed on chunk ${start}-${end}`);
    }
}

The hash.copy() call will no doubt have a performance penalty, but this is outweighed in this case by the cost of having to start an upload again from the beginning if the file is large.

Describe alternatives you've considered

An alternative is to use a JS implementation of MD5, where you can access the internal state of a hash and clone it. I suspect performance would be much worse than Node's native methods though.

@overlookmotel overlookmotel changed the title Clone/copy hashes for calculating rolling hashes Calculating rolling hashes Oct 9, 2019
@bnoordhuis bnoordhuis added crypto Issues and PRs related to the crypto subsystem. feature request Issues that request new features to be added to Node.js. labels Oct 9, 2019
@bnoordhuis
Copy link
Member

I don't think there's real opposition to a .copy() method. PR welcome.

@bnoordhuis
Copy link
Member

#29910 - decided to do it myself because it's a little less obvious than I initially thought it was. :-)

It's not super complex but neither is it super clear where to make the changes if you're not already familiar with the code base.

@overlookmotel
Copy link
Author

@bnoordhuis Amazing! Thank you.

I was going to attempt this myself, but not having written a line of C/C++ in my life, it was going to be a challenge. I started going through the code and trying to figure it out, but would have taken me weeks. So I am very relieved you've stepped in and I can remain in the comforting cocoon of Javascript!

bnoordhuis added a commit to bnoordhuis/io.js that referenced this issue Oct 15, 2019
Make it possible to clone the internal state of a Hash object
into a new Hash object, i.e., to fork the state of the object.

Fixes: nodejs#29903
@Trott Trott closed this as completed in 9f203f9 Oct 16, 2019
MylesBorins pushed a commit that referenced this issue Oct 23, 2019
Make it possible to clone the internal state of a Hash object
into a new Hash object, i.e., to fork the state of the object.

Fixes: #29903

PR-URL: #29910
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Oct 23, 2019
Make it possible to clone the internal state of a Hash object
into a new Hash object, i.e., to fork the state of the object.

Fixes: #29903

PR-URL: #29910
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
Reviewed-By: James M Snell <[email protected]>
targos pushed a commit that referenced this issue Jan 8, 2020
Make it possible to clone the internal state of a Hash object
into a new Hash object, i.e., to fork the state of the object.

Fixes: #29903

PR-URL: #29910
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@webdeb
Copy link

webdeb commented Jan 29, 2020

Hello guys, can copy be used with createHmac as well ?

In our scenario, we would like to reuse an hmac instance. Because we are using hmac inside our graphql resolvers to produce signed image urls:

const imageTypes = {
   thumb: () => signUrl("300x200", url),
   bigPanel: () => signUrl("600x300", url),
}

Right now the signer must be created each time signUrl is called.

const signUrl = (op, url) => {
     const hmac = crypto.createHmac("sha1", "secret") // would be good to reuse that part

     const base64 = hmac.update(`${op}/${imageUrl}`).digest("base64")
     const signature = base64url.fromBase64(encrypted)
     ....
}  

The problem is, that you can have a lot of different imageTypes so not only its created for each request, the hmac has to be created for each imageType the user requested.

@overlookmotel
Copy link
Author

overlookmotel commented Jan 29, 2020

@webdeb According to the docs, HMAC class doesn't have a .copy() method. Only Hash class does.

If it did, I guess you could do this:

const hmac1 = crypto.createHmac('sha1', 'secret');
const hmac2 = hmac.copy();

const hash1 = hmac.update('300x200/url').digest('base64');
const hash2 = hmac.update('600x300/url').digest('base64');

However, I'm not sure if .copy() would be any faster than .createHmac() or not.

If you want to propose an hmac.copy() method, you might want to open a new issue to discuss that.

@webdeb
Copy link

webdeb commented Jan 29, 2020

thank you @overlookmotel I haven't used the latest node version to check it, just wanted to mention the scenario. However, I don't think, that creating an issue without an actual need, makes sense. As long as it's not the ultimate bottleneck. :)

@bnoordhuis
Copy link
Member

@webdeb #25857 (comment) - tl;dr no, because node would have to retain the secret (which it doesn't and shouldn't do.)

@webdeb
Copy link

webdeb commented Jan 29, 2020

@bnoordhuis thank you for clarification. In the meanwhile I tested the latencies, for original Image Url (without signing) and a bunch of signed, I don't think that it will have any significant impact (compared to network for example..). I guess it's just a matter of microseconds. :D

When my app scales up to Hugillions of requests, I'll come back :)

BethGriggs pushed a commit that referenced this issue Feb 6, 2020
Make it possible to clone the internal state of a Hash object
into a new Hash object, i.e., to fork the state of the object.

Fixes: #29903

PR-URL: #29910
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: Sam Roberts <[email protected]>
Reviewed-By: Tobias Nießen <[email protected]>
Reviewed-By: David Carlier <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
crypto Issues and PRs related to the crypto subsystem. feature request Issues that request new features to be added to Node.js.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants