Add JSON commands to AwkwardForth. #1159

Merged
merged 16 commits into from
Nov 20, 2021

@jpivarski jpivarski commented Nov 18, 2021

  • skipws: moves the input pointer past zero or more whitespace characters (as defined by the JSON spec)
  • textint->: integer; sent to stack or output
  • textfloat->: floating-point number; sent to output only
  • quotedstr->: quoted string with escape sequences; unquoted, unescaped string sent to output only and the length of the unescaped string is put on the stack. A machine-wide parameter max_string_size defines a reusable buffer for the string (not sent directly to output to avoid regrow-checks with every character). Raises error if the unescaped string exceeds max_string_size.
  • enum string1" string2": literal string1, string2, etc. (arbitrarily many, at least one) is checked against the input, and a 0, 1, etc. is put on the stack for the string that matches; if none match, puts -1 on the stack. In JSON, all of the string1, string2, etc. would be quoted in the Forth code (except null, true, false) so that they'll match
  • peek: puts the next character on the stack (as an ASCII integer) but does not move the input past it. Raises an error if the input is at its end.
  • Standard Forth CASE .. OF .. ENDOF .. ENDCASE: https://forth-standard.org/standard/core/CASE and https://lists.gnu.org/archive/html/gforth/2010-03/msg00024.html The general case can be turned into equivalent IF .. THEN .. ELSE (there's always an ELSE because it consumes an item from the stack in every case, even an undefined default case)
  • But a regular case, in which the OF clauses are preceded by the literal integers 0, 1, 2, ... in order (one per OF clause), would get a special CODE_CASE_REGULAR instruction that dispatches to pseudofunctions by integer value.
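To make the intended behavior of the input-reading commands concrete, here is a minimal pure-Python model of skipws, peek, and enum. It is only a sketch of the semantics described in the list above, not the C++ implementation: the `Input` class and its method names are hypothetical, and it assumes that a successful enum match also moves the input past the matched text.

```python
JSON_WHITESPACE = b" \t\n\r"  # the four whitespace characters allowed by the JSON spec


class Input:
    """Hypothetical stand-in for an AwkwardForth input: bytes plus a read position."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0

    def skipws(self) -> None:
        # Move past zero or more JSON whitespace characters.
        while self.pos < len(self.data) and self.data[self.pos] in JSON_WHITESPACE:
            self.pos += 1

    def peek(self) -> int:
        # Report the next byte as an integer without consuming it.
        if self.pos >= len(self.data):
            raise EOFError("peek beyond end of input")
        return self.data[self.pos]

    def enum(self, *candidates: bytes) -> int:
        # Linear scan: index of the first candidate found at the read position, else -1.
        # Assumption: a match also advances the position past the matched text.
        for index, candidate in enumerate(candidates):
            if self.data.startswith(candidate, self.pos):
                self.pos += len(candidate)
                return index
        return -1


inp = Input(b'  true, "x"')
inp.skipws()
first = inp.peek()                              # ASCII code of "t"
which = inp.enum(b"null", b"false", b"true")    # index of the matching literal
```

In this model, `first` is the integer code of `t` and `which` is the position of `true` in the candidate list; a program would typically branch on either value to decide how to parse what follows.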

General CASE:

CASE
test1 OF ... ENDOF           test1 OVER = IF DROP ... ELSE
test2 OF ... ENDOF           test2 OVER = IF DROP ... ELSE
testn OF ... ENDOF           testn OVER = IF DROP ... ELSE
... ( default case )         ...
ENDCASE                      DROP THEN [THEN [THEN ...]]

Specialized CASE removes the tests (as sequential integers starting with zero, they contain no information) and jumps to the "..." corresponding to the top value of the stack, with the same DROP logic for numbered consequents and the catch-all default case.

I'd just like to say that the placement of DROP before numbered consequents and after the catch-all default makes no sense, but it's described here and confirmed with gforth.
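The two forms of CASE described above can be modelled in a few lines of Python. This is a sketch of the stack behavior only (the function names and the list-as-stack representation are assumptions for illustration), but it shows where the DROP happens in each path and why the regular form needs no comparisons.

```python
from typing import Callable, List, Optional, Tuple

Stack = List[int]
Body = Callable[[Stack], None]


def general_case(stack: Stack, branches: List[Tuple[int, Body]],
                 default: Optional[Body] = None) -> None:
    """CASE test OF ... ENDOF ... ( default ) ENDCASE as a chain of comparisons."""
    for test, body in branches:
        # test OVER = IF DROP ... ELSE
        if stack[-1] == test:
            stack.pop()       # DROP comes before a numbered consequent
            body(stack)
            return
    if default is not None:
        default(stack)        # the default code still sees the selector on the stack
    stack.pop()               # ENDCASE: DROP comes after the default code


def regular_case(stack: Stack, bodies: List[Body],
                 default: Optional[Body] = None) -> None:
    """Same behavior when the tests are 0, 1, 2, ...: index into a table instead."""
    selector = stack[-1]
    if 0 <= selector < len(bodies):
        stack.pop()
        bodies[selector](stack)   # jump straight to the consequent, no comparisons
        return
    if default is not None:
        default(stack)
    stack.pop()


bodies = [lambda s: s.append(100), lambda s: s.append(200)]
stack = [7, 1]
regular_case(stack, bodies)       # selector 1 is dropped, then 200 is pushed
```

The general form costs one comparison per OF clause that fails before the match, while the regular form does a bounds check and a table lookup regardless of how many clauses there are.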

codecov bot commented Nov 18, 2021

Codecov Report

Merging #1159 (c821111) into main (f795658) will increase coverage by 0.98%.
The diff coverage is 76.33%.

Impacted Files Coverage Δ
src/awkward/_v2/forms/bitmaskedform.py 75.64% <ø> (ø)
src/awkward/_v2/forms/bytemaskedform.py 76.05% <ø> (ø)
src/awkward/_v2/forms/emptyform.py 78.84% <ø> (+1.92%) ⬆️
src/awkward/_v2/forms/indexedoptionform.py 76.81% <ø> (ø)
src/awkward/_v2/forms/listform.py 75.94% <ø> (ø)
src/awkward/_v2/forms/listoffsetform.py 80.55% <ø> (ø)
src/awkward/_v2/forms/recordform.py 66.46% <ø> (ø)
src/awkward/_v2/forms/regularform.py 75.34% <ø> (ø)
src/awkward/_v2/forms/unionform.py 76.19% <ø> (ø)
src/awkward/_v2/forms/unmaskedform.py 74.57% <ø> (ø)
... and 67 more

@jpivarski jpivarski enabled auto-merge (squash) November 20, 2021 16:53
@jpivarski jpivarski merged commit ea9aa7a into main Nov 20, 2021
@jpivarski jpivarski deleted the jpivarski/add-json-commands-to-awkwardforth branch November 20, 2021 17:35
@jpivarski
Member Author

After some performance tests, too many to summarize, I've found that these new commands are usually a little faster (15% in most cases, a factor of 2 in some extremes) than their equivalents in np.fromstring, RapidJSON + ArrayBuilder (simple structure), and Python's json module.

They all have scaling dependencies that make sense: for instance Python's json module is a little faster than AwkwardForth (25%) for a non-nested list of booleans because it doesn't need to allocate Python objects (True and False are built-ins). np.fromstring is exactly as fast as textint->, but several times slower for floating-point numbers (textfloat-> is pay-as-you-go: decimals and exponents cost more, but not as much as in NumPy). RapidJSON also has pay-as-you-go floating point handling, and it might only be slower than the AwkwardForth commands because the test had it coupled in with ArrayBuilder (through ak._ext.fromjson).

As expected, the cost of enum scales with the number of strings to check; specifically, with the number of strings that precede the one that matches. That's because the current algorithm just cycles through all the strings, doing a strncmp for each. There were some hints that it depended on the lengths of those failing strings (it shouldn't, because strncmp ought to give up on the first non-matching character), but those hints didn't hold up in more extreme cases. strncmp behaves as expected. If this is ever used for JSON objects with many fields (I expect it will be), then the cycle-through-strings algorithm needs to be replaced with a trie. It might even make a difference to ensure that the string contents are allocated near each other in memory: the hint that went away might have been pointer-chasing. But if we go to that much trouble, we might as well implement a trie (compact in memory after the set of strings is fully known).
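As a rough illustration of the trie idea, the sketch below compares a linear scan (a stand-in for the cycle-through-strings approach, using `bytes.startswith` in place of strncmp) with a byte-wise trie walk. Everything here is hypothetical Python written for illustration, not the AwkwardForth code, and the dict-based nodes are not the compact in-memory layout mentioned above. The trie keeps the lowest index seen along the walk so that it returns the same answer as the linear scan when one candidate is a prefix of another.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.index: Optional[int] = None   # list position of the string ending here


def build_trie(candidates: List[bytes]) -> TrieNode:
    root = TrieNode()
    for index, candidate in enumerate(candidates):
        node = root
        for byte in candidate:
            node = node.children.setdefault(byte, TrieNode())
        if node.index is None:             # keep the first of any duplicates
            node.index = index
    return root


def enum_linear(data: bytes, pos: int, candidates: List[bytes]) -> int:
    # Cost grows with the number of candidates that fail before the match.
    for index, candidate in enumerate(candidates):
        if data.startswith(candidate, pos):
            return index
    return -1


def enum_trie(data: bytes, pos: int, root: TrieNode) -> int:
    # Cost grows with the length of the matched text, not the number of candidates.
    node = root
    best = -1
    while True:
        if node.index is not None and (best == -1 or node.index < best):
            best = node.index
        if pos >= len(data):
            break
        child = node.children.get(data[pos])
        if child is None:
            break
        node = child
        pos += 1
    return best


fields = [b'"x"', b'"y"', b'"xy"', b'"energy"']
root = build_trie(fields)
data = b'{"energy": 1}'
same = enum_linear(data, 1, fields) == enum_trie(data, 1, root)
```

The trie is built once, when the set of strings is known, and each lookup then touches one node per input byte.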

The regular case statement (jump-table) costs 11 to 18 ns per invocation (2 cases vs 12 cases). I'd have to think about it, but it might actually be 2 instructions, which fits with the "5 ns per instruction" rule of thumb. Irregular case statements are as slow as would be expected: 12 cases is 24.6 times slower. (That probably corresponds to an instruction count, too.)

On the whole, the value of this is not that it's a faster JSON parser (it is a little bit, but not enough to get excited about) but that we can bypass ArrayBuilder's type agnosticism. For instance, for an array of booleans one level deep (ArrayBuilder only needs to step down one level of a tree for each entry), AwkwardForth is only about 2× faster than RapidJSON + ArrayBuilder:

>>> import time
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.forth import ForthMachine32
>>> data = b"[" + (br'true,false,' * 100000000)[:-1] + b"]"   # shallow!
>>> data2 = {"x": np.array([data])}
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.373785972595215
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.3505895137786865
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.420138835906982
>>> vm = ForthMachine32(r'input x output y uint8 100000000 0 do 1 x skip x enum s" false" s" true" y <- stack loop')
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
3.983327627182007
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.056304454803467
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.007813453674316

But just put these same booleans inside 10 levels of list nesting, and now ArrayBuilder has to walk down 10 levels of a tree with each boolean, checking the types of the nodes all the way down. Now AwkwardForth is about 7× faster.

>>> import time
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.forth import ForthMachine32
>>> data = b"[[[[[[[[[[" + (br'true,false,' * 100000000)[:-1] + b"]]]]]]]]]]"   # deep!
>>> data2 = {"x": np.array([data])}
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
30.746235132217407
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
31.118453979492188
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
31.27355456352234
>>> vm = ForthMachine32(r'input x output y uint8 9 x skip 100000000 0 do 1 x skip x enum s" false" s" true" y <- stack loop')
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.135768890380859
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.166326284408569
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.207798480987549

List-descent is a relatively inexpensive part of ArrayBuilder. Finding the right field in a record whose fields can be out of order (JSON object keys can easily be out of order!) is a lot more costly, but it is harder to set up a test for that. A JSON-to-Awkward reader leveraging JSON schemas will benefit more from avoiding ArrayBuilder than from being a slightly faster JSON parser.
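For reference, the "about 2×" and "about 7×" figures can be reproduced from the timings printed in the two sessions. The snippet below only does that arithmetic on the numbers copied from above; the variable names are made up for illustration.

```python
from statistics import mean

# Wall-clock seconds copied from the sessions above (three runs each).
shallow_arraybuilder = [7.373785972595215, 7.3505895137786865, 7.420138835906982]
shallow_forth = [3.983327627182007, 4.056304454803467, 4.007813453674316]
deep_arraybuilder = [30.746235132217407, 31.118453979492188, 31.27355456352234]
deep_forth = [4.135768890380859, 4.166326284408569, 4.207798480987549]

# Speedup of the AwkwardForth run over ak._ext.fromjson at each nesting depth.
shallow_speedup = mean(shallow_arraybuilder) / mean(shallow_forth)
deep_speedup = mean(deep_arraybuilder) / mean(deep_forth)

# How much each approach slows down when going from 1 to 10 levels of nesting.
nesting_cost_arraybuilder = mean(deep_arraybuilder) / mean(shallow_arraybuilder)
nesting_cost_forth = mean(deep_forth) / mean(shallow_forth)

print(f"shallow: {shallow_speedup:.1f}x, deep: {deep_speedup:.1f}x")
```

The nesting-cost ratios make the same point from the other direction: the ArrayBuilder timings grow by roughly a factor of four with the extra nesting, while the AwkwardForth timings barely change. These ratios take the printed timings at face value and say nothing about per-value cost, which would also depend on how many values each run actually consumes.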
