Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

avoid locale-dependent string comparisons #3

Merged
merged 1 commit into from
Jul 17, 2023

Conversation

aoakley-roku
Copy link
Contributor

It doesn't seem appropriate to use locale dependent ordering to sort the variants and extensions for normalization. The result of normalization should be consistent.

The Unicode CLDR specficiation is not specific about how these tags are sorted, it just says that they should be in "alphabetical order". They only contain a-z (lower case) and 0-9, so for most locales alphabetical order is the same as the code point order we are now using.

This commit removes a dependency on the ECMAScript internationalization API. This API is not available everywhere, particularly on embedded devices like TVs where storage space for internationalization data is limited. This is important as bcp-47-normalize is now used by dash.js which in turn is used by many HbbTV applications.

@wooorm
Copy link
Owner

wooorm commented Jul 15, 2023

Hey!

The build was failing due to a dev-dependency, I fixed that, and merged with you.
But now coverage is failing, likely due to your work?

The idea looks good, but can you fix the coverage?

And, the style is off compared to the rest of the project, e.g., semicolons and early exits. Something like:

if (left.singleton > right.singleton) {
  return 1
}

if (left.singleton < right.singleton) {
  return -1
}

return 0

@aoakley-roku aoakley-roku force-pushed the string-sorting branch 2 times, most recently from 4297991 to 6efe7a2 Compare July 17, 2023 11:11
@aoakley-roku
Copy link
Contributor Author

Thanks. I hadn't managed to get the test coverage check to run locally, but it works now with the dependency updates.

The missing bit of test coverage was for when there are multiple extensions with the same singleton. e.g. en-a-ext1-a-ext2. The specification says that this is not valid, and I think that bcp47.parse should reject these (but it doesn't at the moment).

Rather than add a test which will fail if bcp47.parse is updated to reject this case, I've removed the return 0 statement and replaced it with a comment. This doesn't change the behavior in any way, but it makes the coverage checker happy.

It doesn't seem appropriate to use locale dependent ordering to sort the
variants and extensions for normalization.  The result of normalization
should be consistent.

The Unicode CLDR specficiation is not specific about how these tags are
sorted, it just says that they should be in "alphabetical order".  They
only contain a-z (lower case) and 0-9, so for most locales alphabetical
order is the same as the code point order we are now using.

This commit removes a dependency on the ECMAScript internationalization
API.  This API is not available everywhere, particularly on embedded
devices like TVs where storage space for internationalization data is
limited.  This is important as bcp-47-normalize is now used by dash.js
which in turn is used by many HbbTV applications.

The new test case isn't strictly valid - there should not be more than
one extension with the same singleton.
@aoakley-roku
Copy link
Contributor Author

Argh, now it's failing some other check.

At this point I'm really not sure what the "right" thing to do is. Options I can see are:

  1. Include the "return 0" line and add a test which isn't really valid, like the en-a-ext1-a-ext2. I think that bcp47.parse should reject this and therefore the test should fail, but at the moment the test will pass.
  2. Not handle the equal case at, making compareSingleton always return +1 or -1. This isn't a "consistent comparator" according to ECMAScript and the sort order is therefore implementation-defined.
  3. Rewrite compareSingleton as a one liner. This silences the coverage checker, but makes the code harder to read.

The first of these options seems the least bad to me, so I've now pushed code which does that.

@wooorm
Copy link
Owner

wooorm commented Jul 17, 2023

yep, 1 is fine

@wooorm
Copy link
Owner

wooorm commented Jul 17, 2023

see also wooorm/bcp-47#23

@wooorm wooorm merged commit ea42b0c into wooorm:main Jul 17, 2023
@wooorm
Copy link
Owner

wooorm commented Jul 17, 2023

Thanks, released in 2.3.0!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants