Add JSON commands to AwkwardForth. #1159
…alized 'case' for a jump-table is next.
After some performance tests, too many to summarize, I've found that these new commands are usually a little faster (15% in most cases, a factor of 2 in some extremes) than their equivalents in […]

They all have scaling dependencies that make sense: for instance, Python's […]

As expected, the cost of […]

The regular […]

On the whole, the value of this is not that it's a faster JSON parser (it is, a little bit, but not enough to get excited about) but that we can bypass ArrayBuilder's type agnosticism. For instance, with an array of booleans one level deep (ArrayBuilder only needs to step down one level of a tree for each entry), AwkwardForth is only about 2× faster than RapidJSON + ArrayBuilder:

>>> import time
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.forth import ForthMachine32
>>> data = b"[" + (br'true,false,' * 100000000)[:-1] + b"]" # shallow!
>>> data2 = {"x": np.array([data])}
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.373785972595215
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.3505895137786865
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
7.420138835906982
>>> vm = ForthMachine32(r'input x output y uint8 100000000 0 do 1 x skip x enum s" false" s" true" y <- stack loop')
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
3.983327627182007
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.056304454803467
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.007813453674316

But put these same booleans inside 10 levels of list nesting, and ArrayBuilder has to walk down 10 levels of a tree for each boolean, checking the types of the nodes all the way down. Now AwkwardForth is about 7× faster.

>>> import time
>>> import numpy as np
>>> import awkward as ak
>>> from awkward.forth import ForthMachine32
>>> data = b"[[[[[[[[[[" + (br'true,false,' * 100000000)[:-1] + b"]]]]]]]]]]" # deep!
>>> data2 = {"x": np.array([data])}
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
30.746235132217407
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
31.118453979492188
>>> starttime = time.time(); tmp = ak._ext.fromjson(data); time.time() - starttime
31.27355456352234
>>> vm = ForthMachine32(r'input x output y uint8 9 x skip 100000000 0 do 1 x skip x enum s" false" s" true" y <- stack loop')
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.135768890380859
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.166326284408569
>>> starttime = time.time(); vm.run(data2); time.time() - starttime
4.207798480987549

List descent is a relatively inexpensive part of ArrayBuilder. Finding the right field in a record whose keys can be out of order (JSON object keys can easily be out of order!) is a lot more costly, but it's harder to set up a test for that. A JSON-to-Awkward reader that leverages JSON schemas will benefit more from avoiding ArrayBuilder than from being a slightly faster JSON parser.
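To make the Forth programs in the benchmarks above easier to read, here is a rough pure-Python sketch of what each loop iteration (`1 x skip x enum s" false" s" true" y <- stack`) appears to do. The function name `parse_bools` and its parameters are hypothetical, not part of Awkward Array. It assumes that `enum` advances the input past the matched string (which the following `1 x skip` over the comma implies), and it raises an error on a mismatch rather than producing -1, which is a simplification.

```python
def parse_bools(data: bytes, count: int, leading_skip: int = 0) -> bytearray:
    """Hypothetical emulation of the benchmark's Forth loop (not Awkward's API)."""
    choices = [b"false", b"true"]   # enum s" false" s" true" -> index 0 or 1
    out = bytearray(count)          # stands in for the uint8 output `y`
    pos = leading_skip              # e.g. `9 x skip` for the deeply nested input
    for i in range(count):
        pos += 1                    # `1 x skip`: step over "[" or ","
        for index, choice in enumerate(choices):
            if data.startswith(choice, pos):
                out[i] = index      # `y <- stack`: write the enum index
                pos += len(choice)  # assumed: enum moves past the match
                break
        else:
            # Simplification: the real enum is described as pushing -1 instead.
            raise ValueError(f"no enum choice matches at byte {pos}")
    return out


# Small-scale versions of the shallow and deep benchmark inputs.
shallow = b"[" + (b"true,false," * 3)[:-1] + b"]"
deep = b"[[[[[[[[[[" + (b"true,false," * 3)[:-1] + b"]]]]]]]]]]"

shallow_result = parse_bools(shallow, 6)
deep_result = parse_bools(deep, 6, leading_skip=9)
```

Note that the per-boolean work is identical for the shallow and deep inputs; only the one-time leading skip differs, which is consistent with the Forth timings barely changing between the two benchmarks.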
skipws: moves the input pointer past zero or more whitespace characters (as defined by the JSON spec).
textint->: integer; sent to stack or output.
textfloat->: floating-point number; sent to output only.
quotedstr->: quoted string with escape sequences; the unquoted, unescaped string is sent to output only, and the length of the unescaped string is put on the stack. A machine-wide parameter, max_string_size, defines a reusable buffer for the string (it is not sent directly to output, to avoid regrow-checks with every character). Raises an error if the unescaped string exceeds max_string_size.
enum string1" string2": the literal string1, string2, etc. (arbitrarily many, at least one) are checked against the input, and a 0, 1, etc. is put on the stack for the string that matches; if none match, -1 is put on the stack. In JSON, all of the string1, string2, etc. would be quoted in the Forth code (except null, true, false) so that they'll match.
peek: reports the next character on the stack (as an ASCII integer) but does not move the input past it. Raises an error if the input is at its end.
CASE .. OF .. ENDOF .. ENDCASE: https://forth-standard.org/standard/core/CASE and https://lists.gnu.org/archive/html/gforth/2010-03/msg00024.html
The general case can be turned into an equivalent IF .. THEN .. ELSE (there's always an ELSE because it consumes an item from the stack in every case, even an undefined default case). A CASE in which each OF is preceded by literal integers from zero until the number of OF cases would get a special CODE_CASE_REGULAR instruction that dispatches to pseudofunctions by integer value.

General CASE: […]

Specialized CASE removes the tests (as sequential integers starting with zero, they contain no information) and jumps to the "..." corresponding to the top value of the stack, with the same DROP logic for numbered consequents and the catch-all default case.

I'd just like to say that the placement of DROP before numbered consequents and after the catch-all default makes no sense, but it's described here and confirmed with gforth.
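The difference between the general and specialized forms, and the DROP placement described above, can be sketched in Python with a list standing in for the Forth stack. This is only an illustration under stated assumptions: the names `general_case`, `specialized_case`, and `run` are hypothetical and do not correspond to anything in the AwkwardForth implementation.

```python
def general_case(stack, branches, default):
    """Sequential tests, like CASE .. OF .. ENDOF .. ENDCASE."""
    for test, consequent in branches:
        if stack[-1] == test:
            stack.pop()        # DROP happens before a numbered consequent runs
            consequent(stack)
            return
    default(stack)             # the default still sees the selector on the stack
    stack.pop()                # DROP happens after the catch-all default


def specialized_case(stack, branches, default):
    """Jump-table form; only valid when the tests are 0, 1, 2, ..."""
    assert [test for test, _ in branches] == list(range(len(branches)))
    selector = stack[-1]
    if 0 <= selector < len(branches):
        stack.pop()                      # same DROP placement as above
        branches[selector][1](stack)     # index directly, no comparisons
    else:
        default(stack)
        stack.pop()


def run(case, selector):
    """Run one of the two forms on a fresh stack and report what happened."""
    log = []
    branches = [(i, lambda s, name=name: log.append(name))
                for i, name in enumerate(("zero", "one", "two"))]
    default = lambda s: log.append(f"default saw {s[-1]}")
    stack = [selector]
    case(stack, branches, default)
    return log, stack


# Both forms agree for in-range and out-of-range selectors.
for selector in (0, 1, 2, 7, -1):
    assert run(general_case, selector) == run(specialized_case, selector)
```

The specialized form does the same work per branch but replaces the chain of comparisons with a single bounds check and an index, which is the point of a jump table.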