Make the utf-8 decoder match Unicode best practice.
annevk committed Nov 16, 2012
1 parent 04d7818 commit ec77e51
Showing 2 changed files with 88 additions and 60 deletions.
74 changes: 44 additions & 30 deletions Overview.html
@@ -1047,14 +1047,10 @@ <h2 id="the-encoding"><span class="secno">8 </span>The encoding</h2>

<h3 id="utf-8"><span class="secno">8.1 </span><dfn>utf-8</dfn></h3>

<!--
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#utf-8
Unicode
-->

<p>The <dfn id="utf-8-code-point">utf-8 code point</dfn>, <dfn id="utf-8-bytes-seen">utf-8 bytes seen</dfn>,
<dfn id="utf-8-bytes-needed">utf-8 bytes needed</dfn>, and <dfn id="utf-8-lower-boundary">utf-8 lower boundary</dfn> concepts
are all initially 0.
<p>The <dfn id="utf-8-code-point">utf-8 code point</dfn>, <dfn id="utf-8-bytes-seen">utf-8 bytes seen</dfn>, and
<dfn id="utf-8-bytes-needed">utf-8 bytes needed</dfn> concepts are all initially 0. The
<dfn id="utf-8-lower-boundary">utf-8 lower boundary</dfn> is initially 0x80 and the
<dfn id="utf-8-upper-boundary">utf-8 upper boundary</dfn> is initially 0xBF.

<p>The <dfn id="utf-8-decoder">utf-8 decoder</dfn> (<a href="#decoder">decoder</a> for <a href="#utf-8">utf-8</a>) is:

@@ -1079,19 +1075,34 @@ <h3 id="utf-8"><span class="secno">8.1 </span><dfn>utf-8</dfn></h3>
<dd><p>Emit a code point whose value is <var title="">byte</var>.

<dt>0xC2 to 0xDF
<dd><p>Set <a href="#utf-8-bytes-needed">utf-8 bytes needed</a> to 1,
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0x80, and
<dd><p>Set <a href="#utf-8-bytes-needed">utf-8 bytes needed</a> to 1 and
<a href="#utf-8-code-point">utf-8 code point</a> to <var title="">byte</var> − 0xC0.

<dt>0xE0 to 0xEF
<dd><p>Set <a href="#utf-8-bytes-needed">utf-8 bytes needed</a> to 2,
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0x800, and
<a href="#utf-8-code-point">utf-8 code point</a> to <var title="">byte</var> − 0xE0.
<dd>
<ol>
<li><p>If <var title="">byte</var> is 0xE0, set
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0xA0.

<li><p>If <var title="">byte</var> is 0xED, set
<a href="#utf-8-upper-boundary">utf-8 upper boundary</a> to 0x9F.

<li><p>Set <a href="#utf-8-bytes-needed">utf-8 bytes needed</a> to 2 and
<a href="#utf-8-code-point">utf-8 code point</a> to <var title="">byte</var> − 0xE0.
</ol>

<dt>0xF0 to 0xF4
<dd><p>Set <a href="#utf-8-bytes-needed">utf-8 bytes needed</a> to 3,
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0x10000, and
<a href="#utf-8-code-point">utf-8 code point</a> to <var title="">byte</var> − 0xF0.
<dd>
<ol>
<li><p>If <var title="">byte</var> is 0xF0, set
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0x90.

<li><p>If <var title="">byte</var> is 0xF4, set
<a href="#utf-8-upper-boundary">utf-8 upper boundary</a> to 0x8F.

<li><p>Set <a href="#utf-8-bytes-needed">utf-8 bytes needed</a> to 3 and
<a href="#utf-8-code-point">utf-8 code point</a> to <var title="">byte</var> − 0xF0.
</ol>

<dt>Otherwise
<dd><p>Emit a <a href="#decoder-error">decoder error</a>.
@@ -1102,19 +1113,24 @@ <h3 id="utf-8"><span class="secno">8.1 </span><dfn>utf-8</dfn></h3>
64<sup><a href="#utf-8-bytes-needed">utf-8 bytes needed</a></sup> and continue.

<li>
<p>If <var title="">byte</var> is not in the range 0x80 to 0xBF, run these
substeps:
<p>If <var title="">byte</var> is not in the range
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to <a href="#utf-8-upper-boundary">utf-8 upper boundary</a>,
run these substeps:

<ol>
<li><p>Set <a href="#utf-8-code-point">utf-8 code point</a>,
<a href="#utf-8-bytes-needed">utf-8 bytes needed</a>, <a href="#utf-8-bytes-seen">utf-8 bytes seen</a>, and
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0.
<a href="#utf-8-bytes-needed">utf-8 bytes needed</a>, and <a href="#utf-8-bytes-seen">utf-8 bytes seen</a> to 0,
set <a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0x80, and set
<a href="#utf-8-upper-boundary">utf-8 upper boundary</a> to 0xBF.

<li><p>Decrease the <a href="#byte-pointer">byte pointer</a> by one.

<li><p>Emit a <a href="#decoder-error">decoder error</a>.
</ol>

<li><p>Set <a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0x80 and
<a href="#utf-8-upper-boundary">utf-8 upper boundary</a> to 0xBF.

<li>
<p>Increase <a href="#utf-8-bytes-seen">utf-8 bytes seen</a> by one and set
<a href="#utf-8-code-point">utf-8 code point</a> to
@@ -1124,21 +1140,19 @@ <h3 id="utf-8"><span class="secno">8.1 </span><dfn>utf-8</dfn></h3>
<li><p>If <a href="#utf-8-bytes-seen">utf-8 bytes seen</a> is not equal to
<a href="#utf-8-bytes-needed">utf-8 bytes needed</a>, continue.

<li><p>Let <var title="">code point</var> be <a href="#utf-8-code-point">utf-8 code point</a> and
<var title="">lower boundary</var> be <a href="#utf-8-lower-boundary">utf-8 lower boundary</a>.
<li><p>Let <var title="">code point</var> be <a href="#utf-8-code-point">utf-8 code point</a>.

<li><p>Set <a href="#utf-8-code-point">utf-8 code point</a>,
<a href="#utf-8-bytes-needed">utf-8 bytes needed</a>, <a href="#utf-8-bytes-seen">utf-8 bytes seen</a>, and
<a href="#utf-8-lower-boundary">utf-8 lower boundary</a> to 0.
<a href="#utf-8-bytes-needed">utf-8 bytes needed</a>, and <a href="#utf-8-bytes-seen">utf-8 bytes seen</a> to 0.

<li><p>If <var title="">code point</var> is in the range
<var title="">lower boundary</var> to 0x10FFFF and is not
in the range 0xD800 to 0xDFFF, emit a code point whose value is
<var title="">code point</var>.

<li><p>Emit a <a href="#decoder-error">decoder error</a>.
<li><p>Emit a code point whose value is <var title="">code point</var>.
</ol>

<p class="note">The constraints in the <a href="#utf-8-decoder">utf-8 decoder</a> above match
“Best Practices for Using U+FFFD” from the Unicode standard. No other
behavior is permitted per the Encoding Standard (other algorithms that
achieve the same result are obviously fine, even encouraged).


<p>The <dfn id="utf-8-encoder">utf-8 encoder</dfn> (<a href="#encoder">encoder</a> for <a href="#utf-8">utf-8</a>) is:

74 changes: 44 additions & 30 deletions Overview.src.html
@@ -1008,14 +1008,10 @@ <h2>The encoding</h2>

<h3><dfn>utf-8</dfn></h3>

<!--
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#utf-8
Unicode
-->

<p>The <dfn>utf-8 code point</dfn>, <dfn>utf-8 bytes seen</dfn>,
<dfn>utf-8 bytes needed</dfn>, and <dfn>utf-8 lower boundary</dfn> concepts
are all initially 0.
<p>The <dfn>utf-8 code point</dfn>, <dfn>utf-8 bytes seen</dfn>, and
<dfn>utf-8 bytes needed</dfn> concepts are all initially 0. The
<dfn>utf-8 lower boundary</dfn> is initially 0x80 and the
<dfn>utf-8 upper boundary</dfn> is initially 0xBF.

<p>The <dfn>utf-8 decoder</dfn> (<span>decoder</span> for <span>utf-8</span>) is:

@@ -1040,19 +1036,34 @@ <h3><dfn>utf-8</dfn></h3>
<dd><p>Emit a code point whose value is <var title>byte</var>.

<dt>0xC2 to 0xDF
<dd><p>Set <span>utf-8 bytes needed</span> to 1,
<span>utf-8 lower boundary</span> to 0x80, and
<dd><p>Set <span>utf-8 bytes needed</span> to 1 and
<span>utf-8 code point</span> to <var title>byte</var> &minus; 0xC0.

<dt>0xE0 to 0xEF
<dd><p>Set <span>utf-8 bytes needed</span> to 2,
<span>utf-8 lower boundary</span> to 0x800, and
<span>utf-8 code point</span> to <var title>byte</var> &minus; 0xE0.
<dd>
<ol>
<li><p>If <var title>byte</var> is 0xE0, set
<span>utf-8 lower boundary</span> to 0xA0.

<li><p>If <var title>byte</var> is 0xED, set
<span>utf-8 upper boundary</span> to 0x9F.

<li><p>Set <span>utf-8 bytes needed</span> to 2 and
<span>utf-8 code point</span> to <var title>byte</var> &minus; 0xE0.
</ol>

<dt>0xF0 to 0xF4
<dd><p>Set <span>utf-8 bytes needed</span> to 3,
<span>utf-8 lower boundary</span> to 0x10000, and
<span>utf-8 code point</span> to <var title>byte</var> &minus; 0xF0.
<dd>
<ol>
<li><p>If <var title>byte</var> is 0xF0, set
<span>utf-8 lower boundary</span> to 0x90.

<li><p>If <var title>byte</var> is 0xF4, set
<span>utf-8 upper boundary</span> to 0x8F.

<li><p>Set <span>utf-8 bytes needed</span> to 3 and
<span>utf-8 code point</span> to <var title>byte</var> &minus; 0xF0.
</ol>

<dt>Otherwise
<dd><p>Emit a <span>decoder error</span>.
@@ -1063,19 +1074,24 @@ <h3><dfn>utf-8</dfn></h3>
64<sup><span>utf-8 bytes needed</span></sup> and continue.

<li>
<p>If <var title>byte</var> is not in the range 0x80 to 0xBF, run these
substeps:
<p>If <var title>byte</var> is not in the range
<span>utf-8 lower boundary</span> to <span>utf-8 upper boundary</span>,
run these substeps:

<ol>
<li><p>Set <span>utf-8 code point</span>,
<span>utf-8 bytes needed</span>, <span>utf-8 bytes seen</span>, and
<span>utf-8 lower boundary</span> to 0.
<span>utf-8 bytes needed</span>, and <span>utf-8 bytes seen</span> to 0,
set <span>utf-8 lower boundary</span> to 0x80, and set
<span>utf-8 upper boundary</span> to 0xBF.

<li><p>Decrease the <span>byte pointer</span> by one.

<li><p>Emit a <span>decoder error</span>.
</ol>

<li><p>Set <span>utf-8 lower boundary</span> to 0x80 and
<span>utf-8 upper boundary</span> to 0xBF.

<li>
<p>Increase <span>utf-8 bytes seen</span> by one and set
<span>utf-8 code point</span> to
@@ -1085,21 +1101,19 @@ <h3><dfn>utf-8</dfn></h3>
<li><p>If <span>utf-8 bytes seen</span> is not equal to
<span>utf-8 bytes needed</span>, continue.

<li><p>Let <var title>code point</var> be <span>utf-8 code point</span> and
<var title>lower boundary</var> be <span>utf-8 lower boundary</span>.
<li><p>Let <var title>code point</var> be <span>utf-8 code point</span>.

<li><p>Set <span>utf-8 code point</span>,
<span>utf-8 bytes needed</span>, <span>utf-8 bytes seen</span>, and
<span>utf-8 lower boundary</span> to 0.
<span>utf-8 bytes needed</span>, and <span>utf-8 bytes seen</span> to 0.

<li><p>If <var title>code point</var> is in the range
<var title>lower boundary</var> to 0x10FFFF and is not
in the range 0xD800 to 0xDFFF, emit a code point whose value is
<var title>code point</var>.

<li><p>Emit a <span>decoder error</span>.
<li><p>Emit a code point whose value is <var title>code point</var>.
</ol>

<p class=note>The constraints in the <span>utf-8 decoder</span> above match
“Best Practices for Using U+FFFD” from the Unicode standard. No other
behavior is permitted per the Encoding Standard (other algorithms that
achieve the same result are obviously fine, even encouraged).


<p>The <dfn>utf-8 encoder</dfn> (<span>encoder</span> for <span>utf-8</span>) is:
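
The decoder steps added in this commit can be sketched in Python. This is a minimal illustration and not part of the commit. The function name `decode_utf8`, the choice to emit U+FFFD for every decoder error, and the end-of-input handling are assumptions. The accumulation arithmetic is also an assumption: the diff shows those steps only partially (the collapsed context around `64<sup>utf-8 bytes needed</sup>`), so standard UTF-8 arithmetic is used in their place.

```python
REPLACEMENT = "\uFFFD"  # assumption: every decoder error becomes U+FFFD


def decode_utf8(data: bytes) -> str:
    """Hypothetical sketch of the utf-8 decoder steps shown in the diff."""
    out = []
    code_point = 0
    bytes_seen = 0
    bytes_needed = 0
    lower, upper = 0x80, 0xBF  # utf-8 lower/upper boundary, initial values
    i = 0  # byte pointer
    while i < len(data):
        byte = data[i]
        i += 1
        if bytes_needed == 0:
            if byte <= 0x7F:
                out.append(chr(byte))
            elif 0xC2 <= byte <= 0xDF:
                bytes_needed = 1
                code_point = byte - 0xC0
            elif 0xE0 <= byte <= 0xEF:
                if byte == 0xE0:
                    lower = 0xA0  # rejects overlong three-byte forms
                if byte == 0xED:
                    upper = 0x9F  # rejects surrogates U+D800 to U+DFFF
                bytes_needed = 2
                code_point = byte - 0xE0
            elif 0xF0 <= byte <= 0xF4:
                if byte == 0xF0:
                    lower = 0x90  # rejects overlong four-byte forms
                if byte == 0xF4:
                    upper = 0x8F  # rejects values above U+10FFFF
                bytes_needed = 3
                code_point = byte - 0xF0
            else:
                out.append(REPLACEMENT)
            # Shift the lead byte's payload into position (a no-op when
            # bytes_needed is 0, because code_point is 0 in that case).
            code_point *= 64 ** bytes_needed
            continue
        if not lower <= byte <= upper:
            # Reset all state, step the byte pointer back so this byte is
            # processed again as a potential lead byte, and report an error.
            code_point = bytes_seen = bytes_needed = 0
            lower, upper = 0x80, 0xBF
            i -= 1
            out.append(REPLACEMENT)
            continue
        # Only the first continuation byte has a narrowed range.
        lower, upper = 0x80, 0xBF
        bytes_seen += 1
        code_point += (byte - 0x80) * 64 ** (bytes_needed - bytes_seen)
        if bytes_seen != bytes_needed:
            continue
        out.append(chr(code_point))
        code_point = bytes_seen = bytes_needed = 0
    if bytes_needed != 0:
        # Assumption: a sequence cut off by end of input is a single error.
        out.append(REPLACEMENT)
    return "".join(out)
```

Under these assumptions, `decode_utf8(b"\xe0\x80A")` yields two U+FFFD characters followed by `A`. The byte 0xE0 is reported as an error as soon as 0x80 falls outside the narrowed range 0xA0 to 0xBF, and 0x80 is then reprocessed and reported as an error of its own. Because the boundary check runs on the first continuation byte, the old post-hoc checks on the assembled code point (lower boundary, surrogate range, 0x10FFFF) are no longer needed.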
