
Welcome to the utfx wiki!

FAQ

  • What's wrong with using binary strings?
    There are two considerations when using binary strings. The first is that, in current JS engines, each 8bit value (UTF8 produces 1 to 4 of them per code point) requires 16bit of space in memory, because string characters are stored as 16bit units. The second is that whenever the binary string has to be post-processed (e.g. written to a buffer), the memory overhead nearly doubles until the garbage collector has disposed of the intermediate binary string.

  • What's wrong with using plain arrays?
    Arrays hold each 8bit value as a JavaScript number. The internal representation of numbers may vary between JS engines, but assuming that a JS number wraps at least a 32bit value (as long as it isn't stored as a double), this is even worse than using binary strings (please correct me if this is wrong).

  • So, what's the ideal thing to do?
    Just as when writing your own, highly use-case-specific encoder or decoder, the ideal approach is to process code points and bytes one at a time, eliminating intermediate memory overhead. With utfx this is achieved by providing sources and/or destinations as successively called functions where appropriate.

  • Wait, doesn't JavaScript already use UTF8?
    JavaScript exposes strings as UTF16. This is what String#charCodeAt returns, and it is why String.fromCodePoint and String#codePointAt have been proposed for ES6. Thus, when encoding from or decoding to a JavaScript string, the string side is always UTF16.
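
    For example, a single code point outside the Basic Multilingual Plane is exposed as two 16bit code units (a surrogate pair):

        var s = "\uD83D\uDE00"; // U+1F600
        s.length;               // 2 (two UTF16 code units)
        s.charCodeAt(0);        // 0xD83D (high surrogate)
        s.charCodeAt(1);        // 0xDE00 (low surrogate)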

Examples

Using array and string arguments* (with the usual overhead)

  • Converting a standard JavaScript string to UTF8 code points:

    var string = ...;
    var codepoints = [];
    utfx.UTF16toUTF8(
        utfx.stringSource(string),
        utfx.arrayDestination(codepoints)
    );
  • Decoding an array of UTF8 bytes to UTF8 code points:

    var bytes = [...];
    var codepoints = [];
    utfx.decodeUTF8(
        utfx.arraySource(bytes),
        utfx.arrayDestination(codepoints)
    );
  • Converting and encoding a standard JavaScript string as UTF8 bytes:

    var string = ...;
    var bytes = [];
    utfx.UTF16toUTF8Bytes(
        utfx.stringSource(string),
        utfx.arrayDestination(bytes)
    );
  • Decoding and converting an array of UTF8 bytes to a standard JavaScript string:

    var bytes = [...];
    var sd = utfx.stringDestination();
    utfx.UTF8BytesToUTF16(
        utfx.arraySource(bytes),
        sd
    );
    var string = sd();

(*) Please note that utfx.arraySource/arrayDestination/stringSource/stringDestination are not included in the embeddable library; you are encouraged to implement these on your own, tailored to your actual use case, for maximum performance. If you need any of them anyway, you probably want to use the standalone library instead.
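
For reference, a minimal sketch of what such self-implemented helpers could look like (the names and exact shapes here are illustrative assumptions, not the library's actual implementations):

    // hypothetical string source: returns the next UTF16 char code on each call,
    // or null once the input is exhausted
    function makeStringSource(s) {
        var i = 0;
        return function() {
            return i < s.length ? s.charCodeAt(i++) : null;
        };
    }

    // hypothetical array destination: pushes each received value onto the array
    function makeArrayDestination(a) {
        return function(v) {
            a.push(v);
        };
    }

These can then be passed wherever the examples above use utfx.stringSource or utfx.arrayDestination.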

Using source and destination functions

  • Converting an arbitrary input source of UTF16 characters to an arbitrary output destination of UTF8 code points:

    var string = ..., i = 0;
    utfx.UTF16toUTF8(function() {
        return i < string.length ? string.charCodeAt(i++) : null;
    }, function(cp) {
        ...
    });
  • Encoding an arbitrary input source of UTF8 code points to an arbitrary output destination of UTF8 bytes:

    var codepoints = [...], i = 0;
    utfx.encodeUTF8(function() {
        return i < codepoints.length ? codepoints[i++] : null;
    }, function(b) {
        ...
    });
  • In fact, all of the examples above can be expressed this way by replacing the source and destination helpers with custom functions, as sketched below.
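
    For instance, a sketch of the last array/string example rewritten with custom functions only (decoding an array of UTF8 bytes to a standard JavaScript string; the variable names are illustrative):

        var bytes = [...], i = 0, parts = [];
        utfx.UTF8BytesToUTF16(function() {
            return i < bytes.length ? bytes[i++] : null; // byte source
        }, function(c) {
            parts.push(String.fromCharCode(c));          // char code destination
        });
        var string = parts.join('');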

