Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++ refactoring: to and from json #1164

Merged
merged 17 commits into from
Nov 30, 2021
Merged

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Nov 23, 2021

No description provided.

@ianna ianna marked this pull request as draft November 23, 2021 14:31
@codecov
Copy link

codecov bot commented Nov 23, 2021

Codecov Report

Merging #1164 (279770c) into main (e4b6640) will decrease coverage by 0.03%.
The diff coverage is 87.85%.

Impacted Files Coverage Δ
...ward/_v2/operations/convert/ak_from_json_schema.py 85.54% <87.85%> (-0.51%) ⬇️

@ianna
Copy link
Collaborator Author

ianna commented Nov 23, 2021

@jpivarski - I've moved the snapshot() generation out of the json functions. The changes are minimal, however this is causing a problem. Please, have a look at a failing tests/test_0384-lazy-arrayset.py.

It looks like that the contents order is not preserved thus a generated form_key is different.
Screenshot 2021-11-23 at 18 11 36

new vs old:

< [('get', 'kitty-node13-tags'), ('get', 'kitty-node13-index'), ('get', 'kitty-node16-offsets'), ('get', 'kitty-node15-data'), ('get', 'kitty-node2-data'), ('get', 'kitty-node3-data')]
---
> [('get', 'kitty-node11-tags'), ('get', 'kitty-node11-index'), ('get', 'kitty-node14-offsets'), ('get', 'kitty-node13-data'), ('get', 'kitty-node6-data'), ('get', 'kitty-node7-data')]

Comment on lines -775 to 776
const ContentPtr
const std::pair<int, const BuilderPtr>
do_parse(HANDLER& handler, rj::Reader& reader, STREAM& stream) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Builder knows its own length(), so I think this extra integer is not necessary, which would simplify these signatures quite a lot. Also, we don't use bare int in Awkward C++; we always want to know the integer size (and integers outside of arrays are always int64_t).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the integer is needed because it is an indicator of multiple JSON streams for the following cases: tests/test_0437-stream-of-many-json-files.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I saw of the code, I thought the integer was the length of the Builder. If so, Builder::length can produce that on demand.

If it's not the length, what is it? The JSON streams, or "JSONS" or "ndjson" would produce a Builder with a length just as the "one big JSON array" case would. Is the integer 0 for non-stream, 1 for stream, or something like that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it counts the number of JSON streams. There is an extra wrapping of the result in a list since it's not known in advance if it's going to be one or many JSON fragments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the "json" versus "ndjson" interfaces are well separated, the number of JSON streams is something that only applies to the "ndjson" case, and then it is the ArrayBuilder::length. So in the final form of this PR (separating those interfaces is a goal of this PR, right?), the extra integer won't be necessary.

All we need to know at the end is the length of the array that is produced (ak.from_buffers needs it), not the lengths of any substructure within that.

@jpivarski
Copy link
Member

The JSON representation of RecordForms is an object (curly brackets); I wish I had chosen two JSON arrays, but it has to remain backward-compatible. When that passes through Python dicts in Python < 3.6, the order may be lost. However, you're seeing errors in all the Python versions and the stage at which form_keys are associated with these Forms is much earlier. This one way in which the order might not be preserved is not what's going on in the latest round of failing tests.

In test_0384-lazy-arrayset.py:test_lazy_buffers, the JSON has "listcollection" before "collection", so the ArrayBuilder must parse it first and it must be the first key in the RecordBuilder. RecordBuilder, RecordForm (when not serialized), and RecordArray all have a well-defined order for their keys and contents. Of your two screenshots, the one on the left is wrong.

I can't see what's going wrong here, but there must be an exact point in the process when it goes wrong. Are the records already out of order in the array defined on lines 25-65 (e.g. "collection" is before "listcollection" in the array.layout.recordlookup)? If so, at what point do they get out of order? They ought to all be in order in the RecordBuilder; the only code that has changed is code between the RecordBuilder and the RecordArray. What about the RecordForm that comes between those two? (As opposed to the various Forms that the later ak.from_buffers call makes on line 76.)

@ianna
Copy link
Collaborator Author

ianna commented Nov 23, 2021

Yes, mode debugging is needed... I'm on it.

src/python/io.cpp Outdated Show resolved Hide resolved
@ianna ianna marked this pull request as ready for review November 29, 2021 17:24
@ianna ianna requested a review from jpivarski November 29, 2021 17:24
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make the separation of "json" and "ndjson" (either through a boolean argument or through two methods) be a separate PR, after this one. When that's done, the std::pair<int, const BuilderPtr> return type can become just BuilderPtr, as discussed in a previous comment (above, not part of this review).

The only thing I'd ask for in this PR is to avoid duplicating the NumpyBuffersContainer class. It can go in a header file (with implementation). It's more than a code-cleanliness/DRY thing (although that is important): when working on the specialized JSON parser last week, I found that Windows randomly segfaulted when two classes in different parts of the codebase had the same name. It was a hard-to-diagnose crash because it was failing in a test for the other class with this name—the mere existence of the new class crashed it (because it was existing with a bad name!).

src/python/io.cpp Outdated Show resolved Hide resolved
@jpivarski jpivarski merged commit a95a558 into main Nov 30, 2021
@jpivarski jpivarski deleted the ianna/to-from-json-refactoring branch November 30, 2021 15:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants