Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/hash serialization #1382

Merged

Conversation

subzey
Copy link
Contributor

@subzey subzey commented Oct 1, 2021

This PR contains a:

  • bugfix
  • new feature
  • code refactor
  • test update
  • typo fix
  • metadata update

Motivation / Use-Case

Current implementation incorrectly uses the hash buffer as an entropy source:

  const localIdentHash = hash
    .digest(hashDigest)
    .slice(0, hashDigestLength)
    .replace(/[/+]/g, "_")
    .replace(/^\d/g, "_");

Instead of tossing away unacceptable values they are replaced with other ones. This results a probability "skew" that is most noticeable for a first character. For base64 the alphabet (to pick the pseudorandom value from)

A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z a b c d e f
g h i j k l m n o p q r s t u v
w x y z 0 1 2 3 4 5 6 7 8 9 + /

becomes

A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z a b c d e f
g h i j k l m n o p q r s t u v
w x y z _ _ _ _ _ _ _ _ _ _ _ _

As you can see, the underscore has significantly more chances to be picked. It raises the collision probability for the first character from 1.56% (1 / 64) up to 4.79%!

All the subsequent characters are affected, too. They are drawn form the alphabet with two underscores:

A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z a b c d e f
g h i j k l m n o p q r s t u v
w x y z 0 1 2 3 4 5 6 7 8 9 _ _

The proposed solution is to:

  • Throw away the unacceptable leading digits instead of trying to replace them.
  • Replace /_ and +- (instead of both / and +_)

As the parts of digest are tossed away, there is a (small) chance it won't be enough to generate the string of the desired length.

To overcome this issue the string is now hashed several times until it's long enough. Each hashing tier has its unique sequential salt. For MD4 / MD5 the max string length generated this way is 68 719 476 736. That should be enough to throw away any amount of unacceptable characters.

This is essentially a PBKDF / PBKDF2 but with 1 round and indefinite length: the length grows on demand.

Most of the time only one hashing is required so this change should not significantly impact the performance.

Breaking Changes

⚠️ The localIdentHash now may contain dashes (-), including leading and trailing ones. It should be okay for classnames.

⚠️ The hashing has changed and all the classnames gets another hashes. (The snapshots were updated accordingly). This may come as a surprise for users.

Additional Info

I've made a synthetic test to show this PR actually improves something:

https://gist.github.com/subzey/3f1a0efd1634bdf0b03ba2a2e744bc3e

Both implementations are called until the first collision. This is repeated 101 times then a median value is picked. This value corresponds to a 50% collision chance.

Current implementation can generate ≈ 20200 hashes until the 50% collision chance.

The proposed implementation can generate ≈ 35600 hashes until the 50% collision chance.

The expected value for true random is ≈ 35440.

@alexander-akait
Copy link
Member

warning The localIdentHash now may contain dashes (-), including leading and trailing ones. It should be okay for classnames.

we should avoid it

warning The hashing has changed and all the classnames gets another hashes. (The snapshots were updated accordingly). This may come as a surprise for users.

it is fine

@subzey
Copy link
Contributor Author

subzey commented Oct 1, 2021

we should avoid it [the dash]

Okay, that's actually fairly easy to do - just throw away the dashes with the leading digits :D Give me a second

@subzey
Copy link
Contributor Author

subzey commented Oct 1, 2021

@alexander-akait Done in 101989a (impln) and 817368b (snaps)

@subzey
Copy link
Contributor Author

subzey commented Oct 4, 2021

Looks like the [email protected] introduced a new hash xxhash64 and that's the reason the hex hash in the filenames doesn't match.

On the other hand, xxhash64 doesn't accept the Uint32Array as input data, I'll fix that shortly in this PR.

@subzey
Copy link
Contributor Author

subzey commented Oct 4, 2021

Updated webpack in package.json to actually match webpack@latest in tests. Updated the test to contain a now-default xxhash64 digest.

Fixed defaultGetLocalIdent to work with xxhash64. And a new test that it actually works.

@alexander-akait
Copy link
Member

Can you rebase?

@subzey
Copy link
Contributor Author

subzey commented Oct 7, 2021

Sure, a moment please

@subzey
Copy link
Contributor Author

subzey commented Oct 7, 2021

Here.

I've accidentally squashed the code changes commit and the snaps update commit into a single one while rebasing. I hope that's fine.

@codecov
Copy link

codecov bot commented Oct 7, 2021

Codecov Report

Merging #1382 (725e4a5) into master (a56bd94) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1382   +/-   ##
=======================================
  Coverage   98.43%   98.43%           
=======================================
  Files          12       12           
  Lines        1084     1088    +4     
  Branches      375      375           
=======================================
+ Hits         1067     1071    +4     
  Misses         14       14           
  Partials        3        3           
Impacted Files Coverage Δ
src/utils.js 97.65% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a56bd94...725e4a5. Read the comment docs.

let localIdentHash = "";
for (let tier = 0; localIdentHash.length < hashDigestLength; tier++) {
// eslint-disable-next-line no-underscore-dangle
const hash = loaderContext._compiler.webpack.util.createHash(hashFunction);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid only for perf here, create multiple times hash is not good idea, can you explain why we need create it again and again and do not reuse existing hash?

Copy link
Contributor Author

@subzey subzey Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2nd iteration is only required if the digest length is not enough to generate a pseudorandom string long enough.

If we won't change anything between the iterations then each iteration would generate the same output over and over and the string would just cycle:

If hash(str) = "ABCDE", then it would be "ABCDEABCDEABCDEA". Even if this str is 16 chars long its "uniqueness" is only 5 chars long.


I've ran npm test:only logging the number of times this loop iterated per function call.

1375 calls only ran this loop once.

122 twice.

2 calls took three iterations

Copy link
Contributor Author

@subzey subzey Oct 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upd: My bad, jest spawns 4 processes and they were concurrently writing the same file 😓

1672 - 1 time
125 - 2 times
2 - 3 times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've investigated those two suspicious calls where the loop is iterated 3 times.

It's my own xxhash64 test from this very PR. I'm asking a hex digest of length 20 (= 80bits) from a digest that can only produce 64 bits and have a really high chance to produce a long run of leading digits.

Iteration 0:
digest is 7268299549203d4d
Leading digits are dropped, the resulting string so far is:
7268299549203 d4d

Iteration 1:
digest is 26a899717294ba28 =>
7268299549203 d4d 26a899717294ba28

Iteration 2:
digest is 45400c2599876c30. The string is trimmed at length 20:
7268299549203 d4d 26a899717294ba28 4 5400c2599876c30

src/utils.js Outdated
for (let tier = 0; localIdentHash.length < hashDigestLength; tier++) {
// eslint-disable-next-line no-underscore-dangle
const hash = loaderContext._compiler.webpack.util.createHash(hashFunction);
const { hashSalt } = options;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this under loop, because we don't change it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Will fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 725e4a5

src/utils.js Outdated
localIdentHash = (localIdentHash + hash.digest(hashDigest))
.replace(/^\d+/, "")
.replace(/\//g, "_")
.replace(/\W+/g, "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I am afraid about these changes, can you explain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop all leading digits
Then replace any / with '_'
Then drop anything except a-zA-Z0-9_ (including + and '=' of base64)

I could change the last one to /[^a-zA-Z0-9_]+/, for better readablility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 725e4a5

@alexander-akait alexander-akait merged commit c7db752 into webpack-contrib:master Oct 9, 2021
@alexander-akait
Copy link
Member

Thanks

@ThiefMaster
Copy link

ThiefMaster commented Oct 29, 2021

FFS, why is this in a patch release? Not even a minor release, a PATCH release. This kind of change deserves a major version bump since it breaks e.g. babel-plugin-react-css-modules.

@alexander-akait
Copy link
Member

@ThiefMaster it is MINOR release - v6.3.0...v6.4.0

@ThiefMaster
Copy link

Indeed, I just realized that I'm still on v5 since I didn't do major version bumps and updated the babel plugin (which did not use a major version bump for this). Sorry for ranting early.

@alexander-akait
Copy link
Member

if babel-plugin-react-css-modules want to be align with out logic they should install us in dev deps and reuse, I can't fix it here and we cannot ignore problems in logic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants