Add JSON commands to AwkwardForth. #1159

jpivarski · 2021-11-18T23:33:09Z

skipws: moves the input pointer past zero or more whitespace characters (as defined by the JSON spec)
textint->: integer; sent to stack or output
textfloat->: floating-point number; sent to output only
quotedstr->: quoted string with escape sequences; unquoted, unescaped string sent to output only and the length of the unescaped string is put on the stack. A machine-wide parameter max_string_size defines a reusable buffer for the string (not sent directly to output to avoid regrow-checks with every character). Raises error if the unescaped string exceeds max_string_size.
enum string1" string2": literal string1, string2, etc. (arbitrarily many, at least one) is checked against the input, and a 0, 1, etc. is put on the stack for the string that matches; if none match, puts -1 on the stack. In JSON, all of the string1, string2, etc. would be quoted in the Forth code (except null, true, false) so that they'll match
peek: reports the next character on the stack (as an ASCII integer) but does not move the input past it. Raises error if the input is at its end.
Standard Forth CASE .. OF .. ENDOF .. ENDCASE (see https://forth-standard.org/standard/core/CASE and https://lists.gnu.org/archive/html/gforth/2010-03/msg00024.html ): the general case can be turned into an equivalent IF .. ELSE .. THEN chain (there's always an ELSE because it consumes an item from the stack in every case, even an undefined default case)
But a regular case, in which the OFs are preceded by the literal integers 0, 1, 2, ... in order (from zero up to one less than the number of OF cases), would get a special CODE_CASE_REGULAR instruction that dispatches to pseudofunctions by integer value.
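
To make the input-parsing words above more concrete, here is a rough Python model of skipws, peek, and textint->. It is only a sketch: the class name, method names, and exception type are hypothetical, acceptance of a leading minus sign is an assumption, and the real words are implemented inside the Forth machine rather than in Python.

```python
# Hypothetical Python model of three of the words described above.
# Names and error types are illustrative, not the AwkwardForth implementation.
JSON_WHITESPACE = b" \t\n\r"  # the four whitespace characters in the JSON spec


class InputModel:
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def skipws(self) -> None:
        """Advance past zero or more JSON whitespace characters."""
        while self.pos < len(self.data) and self.data[self.pos] in JSON_WHITESPACE:
            self.pos += 1

    def peek(self) -> int:
        """Return the next byte as an integer without consuming it."""
        if self.pos >= len(self.data):
            raise ValueError("peek at end of input")
        return self.data[self.pos]

    def textint(self) -> int:
        """Parse a decimal integer (leading '-' assumed allowed) and advance."""
        start = self.pos
        if self.pos < len(self.data) and self.data[self.pos] == ord("-"):
            self.pos += 1
        digits_start = self.pos
        while self.pos < len(self.data) and ord("0") <= self.data[self.pos] <= ord("9"):
            self.pos += 1
        if self.pos == digits_start:
            self.pos = start  # leave the input where it was on failure
            raise ValueError("expected an integer")
        return int(self.data[start:self.pos])
```

In this model, a caller would typically call skipws, use peek to decide what kind of token comes next, and then call the matching parsing method.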

General CASE:

CASE
test1 OF ... ENDOF           test1 OVER = IF DROP ... ELSE
test2 OF ... ENDOF           test2 OVER = IF DROP ... ELSE
testn OF ... ENDOF           testn OVER = IF DROP ... ELSE
... ( default case )         ...
ENDCASE                      DROP THEN [THEN [THEN ...]]
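
The rearrangement in the table above can be sketched as a small Python function that emits the IF/ELSE/THEN form as text. The function name and its arguments are hypothetical and only illustrate the rewriting rule; the actual compiler works on its own instruction representation.

```python
def desugar_case(branches, default=""):
    """Rewrite (test, body) pairs into the IF/ELSE/THEN form shown above.

    `branches` is a list of (test, body) strings and `default` is the
    optional catch-all body. Hypothetical helper, for illustration only.
    """
    words = []
    for test, body in branches:
        # Each OF compares against the selector and drops it on a match.
        words += [test, "OVER", "=", "IF", "DROP", body, "ELSE"]
    # The default case runs with the selector still on the stack,
    # and ENDCASE drops it afterward.
    words += [default, "DROP"]
    # One THEN closes each IF opened above.
    words += ["THEN"] * len(branches)
    return " ".join(w for w in words if w)


print(desugar_case([("0", "A"), ("1", "B")], default="C"))
# prints: 0 OVER = IF DROP A ELSE 1 OVER = IF DROP B ELSE C DROP THEN THEN
```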

Specialized CASE removes the tests (as sequential integers starting with zero, they contain no information) and jumps to the "..." corresponding to the top value of the stack, with the same DROP logic for numbered consequents and the catch-all default case.
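
As a rough illustration of the specialized form, the following Python sketch checks whether a list of tests is regular and then dispatches through a list used as a jump table. The function names are hypothetical, and the consequents are modeled as Python callables that operate on a list-based stack, which is an assumption made for the example.

```python
def is_regular(tests):
    """True if the OF tests are exactly the literals 0, 1, ..., n-1 in order."""
    return list(tests) == list(range(len(tests)))


def run_regular_case(stack, consequents, default):
    """Dispatch on the top of the stack, using `consequents` as a jump table.

    Hypothetical model: each consequent and the default are callables that
    take the stack (a Python list) and may modify it.
    """
    selector = stack[-1]
    if 0 <= selector < len(consequents):
        stack.pop()                   # DROP happens before a numbered consequent
        consequents[selector](stack)
    else:
        default(stack)                # the default still sees the selector
        stack.pop()                   # DROP happens after the default
```

No comparison against each test is needed: the selector indexes the table directly, which is why the cost is nearly independent of the number of cases.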

I'd just like to say that the placement of DROP before numbered consequents and after the catch-all default makes no sense, but it's described here and confirmed with gforth.

codecov · 2021-11-18T23:46:18Z

Codecov Report

Merging #1159 (c821111) into main (f795658) will increase coverage by 0.98%.
The diff coverage is 76.33%.

Impacted Files	Coverage Δ
src/awkward/_v2/forms/bitmaskedform.py	`75.64% <ø> (ø)`
src/awkward/_v2/forms/bytemaskedform.py	`76.05% <ø> (ø)`
src/awkward/_v2/forms/emptyform.py	`78.84% <ø> (+1.92%)`	⬆️
src/awkward/_v2/forms/indexedoptionform.py	`76.81% <ø> (ø)`
src/awkward/_v2/forms/listform.py	`75.94% <ø> (ø)`
src/awkward/_v2/forms/listoffsetform.py	`80.55% <ø> (ø)`
src/awkward/_v2/forms/recordform.py	`66.46% <ø> (ø)`
src/awkward/_v2/forms/regularform.py	`75.34% <ø> (ø)`
src/awkward/_v2/forms/unionform.py	`76.19% <ø> (ø)`
src/awkward/_v2/forms/unmaskedform.py	`74.57% <ø> (ø)`
... and 67 more


jpivarski · 2021-11-20T22:28:48Z

After some performance tests, too many to summarize, I've found that these new commands are usually a little faster (15% in most cases, a factor of 2 in some extremes) than their equivalents in np.fromstring, RapidJSON + ArrayBuilder (simple structure), and Python's json module.

They all have scaling dependencies that make sense: for instance Python's json module is a little faster than AwkwardForth (25%) for a non-nested list of booleans because it doesn't need to allocate Python objects (True and False are built-ins). np.fromstring is exactly as fast as textint->, but several times slower for floating-point numbers (textfloat-> is pay-as-you-go: decimals and exponents cost more, but not as much as in NumPy). RapidJSON also has pay-as-you-go floating point handling, and it might only be slower than the AwkwardForth commands because the test had it coupled in with ArrayBuilder (through ak._ext.fromjson).

As expected, the cost of enum scales with the number of strings to check, specifically the number of strings that come before the one that matches. That's because the current algorithm just cycles through all the strings, doing a strncmp for each. There were some hints that it depended on the lengths of those failing strings (it shouldn't, because strncmp ought to give up on the first non-matching character), but those hints didn't hold up to more extreme cases: strncmp behaves as expected. If this is ever used for JSON objects with many fields (I expect it will be), then the cycle-through-strings algorithm needs to be replaced with a trie. It might even make a difference to ensure that the string contents are allocated near each other in memory: the hint that went away might have been pointer-chasing. But if we go to that much trouble, we might as well implement a trie (compact in memory after the set of strings is fully known).
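
To illustrate the difference between the two lookup strategies, here is a minimal Python sketch of a cycle-through-strings matcher next to a trie-based one. All names are hypothetical, the trie is a plain nested dict rather than the compact layout suggested above, and neither function advances the input position (whether the real enum does so is not modeled here).

```python
def enum_linear(data, pos, strings):
    """Try each candidate in order; return the index of the first match, else -1."""
    for index, candidate in enumerate(strings):
        if data.startswith(candidate, pos):
            return index
    return -1


def build_trie(strings):
    """Nested dicts keyed by byte value; the None key marks a complete string."""
    root = {}
    for index, candidate in enumerate(strings):
        node = root
        for byte in candidate:
            node = node.setdefault(byte, {})
        node.setdefault(None, index)  # keep the first index if strings repeat
    return root


def enum_trie(data, pos, trie):
    """Walk the trie along the input; return the lowest-index matching string."""
    node = trie
    best = -1
    i = pos
    while True:
        index = node.get(None)
        if index is not None and (best == -1 or index < best):
            best = index
        if i >= len(data) or data[i] not in node:
            return best
        node = node[data[i]]
        i += 1
```

The linear version does work proportional to the number of candidates before the match, while the trie version does work proportional to the length of the matched text, regardless of how many candidates there are.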

The regular case statement (jump-table) costs 11 to 18 ns per invocation (2 cases vs 12 cases). I'd have to think about it, but it might actually be 2 instructions, which fits with the "5 ns per instruction" rule of thumb. Irregular case statements are as slow as would be expected: 12 cases is 24.6 times slower. (That probably corresponds to an instruction count, too.)

On the whole, the value of this is not that it's a faster JSON parser (it is a little bit, but not enough to get excited about) but that we can bypass ArrayBuilder's type agnosticism. For instance, for an array of booleans one level deep (ArrayBuilder only needs to step down one level of a tree for each entry), AwkwardForth is only about 2× faster than RapidJSON + ArrayBuilder:

>>> import time
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.forth import ForthMachine32
>>> data = b"[" + (br'true,false,' * 100000000)[:-1] + b"]"   # shallow!
>>> data2 = {"x": np.array([data])}
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.373785972595215
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.3505895137786865
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.420138835906982
>>> vm = ForthMachine32(r'input x output y uint8 100000000 0 do 1 x skip x enum s" false" s" true" y <- stack loop')
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
3.983327627182007
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.056304454803467
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.007813453674316

But just put these same booleans inside 10 levels of list nesting, and now ArrayBuilder has to walk down 10 levels of a tree with each boolean, checking the types of the nodes all the way down. Now AwkwardForth is about 7× faster.

>>> import time
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.forth import ForthMachine32
>>> data = b"[[[[[[[[[[" + (br'true,false,' * 100000000)[:-1] + b"]]]]]]]]]]"   # deep!
>>> data2 = {"x": np.array([data])}
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
30.746235132217407
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
31.118453979492188
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
31.27355456352234
>>> vm = ForthMachine32(r'input x output y uint8 9 x skip 100000000 0 do 1 x skip x enum s" false" s" true" y <- stack loop')
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.135768890380859
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.166326284408569
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.207798480987549
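
As a quick check on the "about 2×" and "about 7×" figures, the speedups can be computed by averaging the three runs of each measurement in the two sessions above. The variable names are only for this calculation; the numbers are copied from the timings as printed.

```python
from statistics import mean

# Timings in seconds, copied from the two sessions above.
shallow_arraybuilder = [7.373785972595215, 7.3505895137786865, 7.420138835906982]
shallow_forth = [3.983327627182007, 4.056304454803467, 4.007813453674316]
deep_arraybuilder = [30.746235132217407, 31.118453979492188, 31.27355456352234]
deep_forth = [4.135768890380859, 4.166326284408569, 4.207798480987549]

# Ratio of mean run times: how many times faster AwkwardForth was.
shallow_speedup = mean(shallow_arraybuilder) / mean(shallow_forth)
deep_speedup = mean(deep_arraybuilder) / mean(deep_forth)

print(round(shallow_speedup, 1), round(deep_speedup, 1))
# prints: 1.8 7.4
```

The AwkwardForth time barely changes between the two cases (about 4.0 s versus about 4.2 s), so nearly all of the difference comes from the ArrayBuilder side.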

List-descent is a relatively inexpensive part of ArrayBuilder—finding the right field in a record that can be out of order (JSON object keys can easily be out of order!) is a lot more costly, but harder to set up the test. A JSON-to-Awkward reader leveraging JSON schemas will benefit for reasons of avoiding ArrayBuilder more than just being a slightly faster JSON parser.

Added 'textint->' word to AwkwardForth.

4e62f14

jpivarski added 15 commits November 18, 2021 17:52

Added 'skip-ws', haven't removed whitespace-skipping from 'textint->'…

745cb8c

… yet.

MacOS segfaults?.

e4ba505

Added the 'textfloat->' word.

f1c307d

Moved most parsing code into ForthInputBuffer.

590287d

Implemented 'quotedstr->'.

dde3357

Got Unicode right in 'quotedstr->'.

11a8d72

Inserted CODE_ definitions for enum, peek, case.

954b675

Implemented 'peek'.

b22ae31

Half-implemented 'enum'. Parser is there, but not runtime.

153c426

The 'enum' built-in is done.

f6dd7cc

Almost, almost done with parsing general 'case'.

72493d0

Finished general 'case', which rearranges into 'if' statements. Speci…

22b3977

…alized 'case' for a jump-table is next.

Detect specialized case.

ecf5fe8

Decompilation for specialized case.

afaa707

Regular 'case' is done; the PR is done.

c821111

jpivarski enabled auto-merge (squash) November 20, 2021 16:53

jpivarski merged commit ea9aa7a into main Nov 20, 2021

jpivarski deleted the jpivarski/add-json-commands-to-awkwardforth branch November 20, 2021 17:35
