-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++ refactoring: to and from json #1164
Conversation
Codecov Report
|
@jpivarski - I've moved the snapshot() generation out of the json functions. The changes are minimal, however this is causing a problem. Please, have a look at a failing It looks like that the contents order is not preserved thus a generated new vs old: < [('get', 'kitty-node13-tags'), ('get', 'kitty-node13-index'), ('get', 'kitty-node16-offsets'), ('get', 'kitty-node15-data'), ('get', 'kitty-node2-data'), ('get', 'kitty-node3-data')]
---
> [('get', 'kitty-node11-tags'), ('get', 'kitty-node11-index'), ('get', 'kitty-node14-offsets'), ('get', 'kitty-node13-data'), ('get', 'kitty-node6-data'), ('get', 'kitty-node7-data')] |
const ContentPtr | ||
const std::pair<int, const BuilderPtr> | ||
do_parse(HANDLER& handler, rj::Reader& reader, STREAM& stream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Builder knows its own length()
, so I think this extra integer is not necessary, which would simplify these signatures quite a lot. Also, we don't use bare int
in Awkward C++; we always want to know the integer size (and integers outside of arrays are always int64_t
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the integer is needed because it is an indicator of multiple JSON streams for the following cases: tests/test_0437-stream-of-many-json-files.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I saw of the code, I thought the integer was the length of the Builder. If so, Builder::length
can produce that on demand.
If it's not the length, what is it? The JSON streams, or "JSONS" or "ndjson" would produce a Builder with a length
just as the "one big JSON array" case would. Is the integer 0 for non-stream, 1 for stream, or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it counts the number of JSON streams. There is an extra wrapping of the result in a list since it's not known in advance if it's going to be one or many JSON fragments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the "json" versus "ndjson" interfaces are well separated, the number of JSON streams is something that only applies to the "ndjson" case, and then it is the ArrayBuilder::length
. So in the final form of this PR (separating those interfaces is a goal of this PR, right?), the extra integer won't be necessary.
All we need to know at the end is the length of the array that is produced (ak.from_buffers
needs it), not the lengths of any substructure within that.
The JSON representation of RecordForms is an object (curly brackets); I wish I had chosen two JSON arrays, but it has to remain backward-compatible. When that passes through Python dicts in Python < 3.6, the order may be lost. However, you're seeing errors in all the Python versions and the stage at which In I can't see what's going wrong here, but there must be an exact point in the process when it goes wrong. Are the records already out of order in the |
Yes, mode debugging is needed... I'm on it. |
…rdForm-contents' into ianna/to-from-json-refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make the separation of "json" and "ndjson" (either through a boolean argument or through two methods) be a separate PR, after this one. When that's done, the std::pair<int, const BuilderPtr>
return type can become just BuilderPtr
, as discussed in a previous comment (above, not part of this review).
The only thing I'd ask for in this PR is to avoid duplicating the NumpyBuffersContainer class. It can go in a header file (with implementation). It's more than a code-cleanliness/DRY thing (although that is important): when working on the specialized JSON parser last week, I found that Windows randomly segfaulted when two classes in different parts of the codebase had the same name. It was a hard-to-diagnose crash because it was failing in a test for the other class with this name—the mere existence of the new class crashed it (because it was existing with a bad name!).
No description provided.