Converting Chinese characters to Unicode in JavaScript #12

purplebamboo opened this issue Jul 11, 2017 · 0 comments

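I recently ran into a problem in a project: JavaScript's String.fromCharCode does not return the right character for Unicode code points that need more than two bytes.

A quick test: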

String.fromCharCode('0x54c8') // returns "哈" as expected

String.fromCharCode('0x20087') // should return "𠂇", but returns an invisible character (U+0087) here

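In other words, once a code point goes beyond two bytes, JS no longer handles it properly.

In fact this is not limited to fromCharCode. Most of JavaScript's string functions cannot properly handle characters that take more than two bytes.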
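To make the claim concrete, here is a minimal sketch of how a few built-in string operations behave on "𠂇" (U+20087). The variable names are only for illustration, and the character is written with escapes so the example does not depend on file encoding.

```javascript
// "𠂇" (U+20087), written as its two 16-bit code units.
var ch = '\uD840\uDC87';

// length counts 16-bit code units, not characters.
var lengthInCodeUnits = ch.length;

// charCodeAt returns one code unit at a time, i.e. half of the character.
var firstUnit = ch.charCodeAt(0);
var secondUnit = ch.charCodeAt(1);

// fromCharCode keeps only the low 16 bits of its argument.
var truncated = String.fromCharCode(0x20087).charCodeAt(0);

console.log(lengthInCodeUnits);       // 2
console.log(firstUnit.toString(16));  // d840
console.log(secondUnit.toString(16)); // dc87
console.log(truncated.toString(16));  // 87
```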

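## Unicode ranges of Chinese characters

We have always assumed that Chinese characters are two bytes wide. According to the Unicode 5.0 standard, however, the ranges for Chinese characters are as follows: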

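| Block name | First code point | Last code point | Characters |
| --- | --- | --- | --- |
| CJK Unified Ideographs | 4E00 | 9FBB | 20924 |
| CJK Unified Ideographs Extension A | 3400 | 4DB5 | 6582 |
| CJK Unified Ideographs Extension B | 20000 | 2A6D6 | 42711 |
| CJK Compatibility Ideographs | F900 | FA2D | 302 |
| CJK Compatibility Ideographs | FA30 | FA6A | 59 |
| CJK Compatibility Ideographs | FA70 | FAD9 | 106 |
| CJK Compatibility Ideographs Supplement | 2F800 | 2FA1D | 542 |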

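As you can see, CJK Unified Ideographs Extension B does not fit in two bytes (nor does the CJK Compatibility Ideographs Supplement). The "𠂇" mentioned above falls in Extension B. Most Chinese characters, of course, do fit in two bytes.

You can get a rough idea of the characters in CJK Unified Ideographs Extension B here:
http://www.chinesecj.com/code/ext-b.php

For a more complete list of Unicode ranges, see this article, although it is somewhat dated.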
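As a rough sketch, the ranges in the table can be turned into a lookup. The names `CJK_RANGES`, `isCjkCodePoint` and `needsSurrogatePair` are hypothetical helpers invented for this example, and the ranges are the Unicode 5.0 values listed above, so later additions to Unicode are not covered.

```javascript
// Inclusive ranges copied from the table above (Unicode 5.0).
var CJK_RANGES = [
  [0x4E00, 0x9FBB],   // CJK Unified Ideographs
  [0x3400, 0x4DB5],   // Extension A
  [0x20000, 0x2A6D6], // Extension B
  [0xF900, 0xFA2D],   // Compatibility Ideographs
  [0xFA30, 0xFA6A],   // Compatibility Ideographs
  [0xFA70, 0xFAD9],   // Compatibility Ideographs
  [0x2F800, 0x2FA1D]  // Compatibility Ideographs Supplement
];

// Hypothetical helper: true if the code point falls in any listed range.
function isCjkCodePoint(codePoint) {
  for (var i = 0; i < CJK_RANGES.length; i++) {
    if (codePoint >= CJK_RANGES[i][0] && codePoint <= CJK_RANGES[i][1]) {
      return true;
    }
  }
  return false;
}

// Hypothetical helper: true if the code point does not fit in 16 bits.
function needsSurrogatePair(codePoint) {
  return codePoint > 0xFFFF;
}

console.log(isCjkCodePoint(0x54C8), needsSurrogatePair(0x54C8));   // true false
console.log(isCjkCodePoint(0x20087), needsSurrogatePair(0x20087)); // true true
```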

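## JavaScript's encoding

JavaScript has this problem because it uses an encoding known as UCS-2, which can only represent a character in two bytes. A code point such as 0x20087 that does not fit in two bytes is split into two two-byte code units.

The 0x20087 above is split into 0xd840 and 0xdc87. Any function that looks at one code unit at a time sees only half of the character, and neither half is displayable on its own. (String.fromCharCode goes a step further and keeps only the low 16 bits of its argument, so 0x20087 becomes 0x0087.) The conversion formulas are: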

H = Math.floor((c-0x10000) / 0x400)+0xD800

L = (c - 0x10000) % 0x400 + 0xDC00
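As a worked example, the two formulas can be applied to 0x20087 directly. `toSurrogatePair` is a hypothetical helper name used only for this sketch, and it assumes its argument is above 0xFFFF.

```javascript
// Hypothetical helper: split a code point above 0xFFFF into [H, L]
// using the two formulas above.
function toSurrogatePair(c) {
  var offset = c - 0x10000;
  var H = Math.floor(offset / 0x400) + 0xD800;
  var L = offset % 0x400 + 0xDC00;
  return [H, L];
}

var pair = toSurrogatePair(0x20087);
console.log(pair[0].toString(16), pair[1].toString(16)); // d840 dc87

// Passing both halves to fromCharCode yields the intended character.
var rebuilt = String.fromCharCode(pair[0], pair[1]);
console.log(rebuilt === '\uD840\uDC87'); // true
```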

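For a more detailed history of JavaScript's encoding, see Ruan Yifeng's article.

## The ES6 solution

The ES6 specification addresses this problem with characters beyond two bytes and provides several methods:

String.fromCodePoint(): returns the character for a Unicode code point
String.prototype.codePointAt(): returns the code point of a character

So we can write: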

String.fromCodePoint('0x20087') // returns '𠂇'

('𠂇'.codePointAt(0)).toString(16) // returns '20087'

Browser support, however, is clearly limited.
MDN provides polyfills for both methods, as follows.

Polyfill for fromCodePoint:

if (!String.fromCodePoint) {
  (function() {
    var defineProperty = (function() {
      // IE 8 only supports `Object.defineProperty` on DOM elements
      try {
        var object = {};
        var $defineProperty = Object.defineProperty;
        var result = $defineProperty(object, object, object) && $defineProperty;
      } catch(error) {}
      return result;
    }());
    var stringFromCharCode = String.fromCharCode;
    var floor = Math.floor;
    var fromCodePoint = function() {
      var MAX_SIZE = 0x4000;
      var codeUnits = [];
      var highSurrogate;
      var lowSurrogate;
      var index = -1;
      var length = arguments.length;
      if (!length) {
        return '';
      }
      var result = '';
      while (++index < length) {
        var codePoint = Number(arguments[index]);
        if (
          !isFinite(codePoint) ||       // `NaN`, `+Infinity`, or `-Infinity`
          codePoint < 0 ||              // not a valid Unicode code point
          codePoint > 0x10FFFF ||       // not a valid Unicode code point
          floor(codePoint) != codePoint // not an integer
        ) {
          throw RangeError('Invalid code point: ' + codePoint);
        }
        if (codePoint <= 0xFFFF) { // BMP code point
          codeUnits.push(codePoint);
        } else { // Astral code point; split in surrogate halves
          // http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          codePoint -= 0x10000;
          highSurrogate = (codePoint >> 10) + 0xD800;
          lowSurrogate = (codePoint % 0x400) + 0xDC00;
          codeUnits.push(highSurrogate, lowSurrogate);
        }
        if (index + 1 == length || codeUnits.length > MAX_SIZE) {
          result += stringFromCharCode.apply(null, codeUnits);
          codeUnits.length = 0;
        }
      }
      return result;
    };
    if (defineProperty) {
      defineProperty(String, 'fromCodePoint', {
        'value': fromCodePoint,
        'configurable': true,
        'writable': true
      });
    } else {
      String.fromCodePoint = fromCodePoint;
    }
  }());
}

Polyfill for codePointAt:

if (!String.prototype.codePointAt) {
  (function() {
    'use strict'; // needed to support `apply`/`call` with `undefined`/`null`
    var codePointAt = function(position) {
      if (this == null) {
        throw TypeError();
      }
      var string = String(this);
      var size = string.length;
      // `ToInteger`
      var index = position ? Number(position) : 0;
      if (index != index) { // better `isNaN`
        index = 0;
      }
      // Account for out-of-bounds indices:
      if (index < 0 || index >= size) {
        return undefined;
      }
      // Get the first code unit
      var first = string.charCodeAt(index);
      var second;
      if ( // check if it’s the start of a surrogate pair
        first >= 0xD800 && first <= 0xDBFF && // high surrogate
        size > index + 1 // there is a next code unit
      ) {
        second = string.charCodeAt(index + 1);
        if (second >= 0xDC00 && second <= 0xDFFF) { // low surrogate
          // http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
          return (first - 0xD800) * 0x400 + second - 0xDC00 + 0x10000;
        }
      }
      return first;
    };
    if (Object.defineProperty) {
      Object.defineProperty(String.prototype, 'codePointAt', {
        'value': codePointAt,
        'configurable': true,
        'writable': true
      });
    } else {
      String.prototype.codePointAt = codePointAt;
    }
  }());
}

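With the polyfills above in place, we can convert between Unicode code points and multi-byte characters properly.

## Closing remarks

As a language designed in 10 days, JavaScript is full of pitfalls of one kind or another, which is frustrating. You always end up spending a lot of time cleaning up after it.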
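The round trip might look like the sketch below. `toCodePoints` and `fromCodePoints` are hypothetical helpers written for this example, and they assume String.fromCodePoint and String.prototype.codePointAt are available, natively or through the polyfills above.

```javascript
// Hypothetical helper: list the code points of a string as hex strings.
function toCodePoints(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var codePoint = str.codePointAt(i);
    result.push(codePoint.toString(16));
    // A code point above 0xFFFF occupies two code units; skip the second one.
    if (codePoint > 0xFFFF) {
      i++;
    }
  }
  return result;
}

// Hypothetical helper: rebuild a string from a list of hex code points.
function fromCodePoints(hexList) {
  var result = '';
  for (var i = 0; i < hexList.length; i++) {
    result += String.fromCodePoint(parseInt(hexList[i], 16));
  }
  return result;
}

var text = '\u54C8\uD840\uDC87'; // "哈𠂇"
var points = toCodePoints(text);
console.log(points.join(' '));                // 54c8 20087
console.log(fromCodePoints(points) === text); // true
```

Looping by index and skipping one extra position after a code point above 0xFFFF keeps the sketch compatible with the ES5-style polyfills, since it does not rely on string iterators.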
