Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: all AwkwardForth Forms now agree with awkward_form method output. #790

Conversation

jpivarski
Copy link
Member

Quoting #754,

If they don't, make them the same.

They are now; the differences were only minor and some of the errors were in the non-Forth implementation.

The "uproot" parameters should be removed.

They were already gone before I started.

Some of the "__record__" parameters have the C++ name; this is a good idea. We should make sure that all of them do.

They all do now.

A user-visible difference is that we're excluding all of the TObject header and numbytes_version header stuff (used to be marked with @, now gone). It's still possible to get them back, but you have to run a non-Forth read and pass non-default parameters to the awkward_forth method. I don't think we'll ever need it.

Comment on lines 360 to +373
def _awkward_json_to_array(awkward, form, array):
if form["class"] == "NumpyArray":
form = awkward.forms.from_json(json.dumps(form))
dtype = awkward.types.numpytype.primitive_to_dtype(form.primitive)
if isinstance(array, awkward.contents.EmptyArray):
form = awkward.forms.from_json(json.dumps(form))
dtype = awkward.types.numpytype.primitive_to_dtype(form.primitive)
return awkward.contents.NumpyArray(numpy.empty(0, dtype=dtype))
return awkward.contents.NumpyArray(
numpy.empty(0, dtype=dtype),
parameters=form.parameters,
)
else:
return array
return awkward.contents.NumpyArray(
numpy.asarray(array.data, dtype=dtype),
parameters=form.parameters,
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This corrects an error in the old, non-Forth method. The old method ran Python objects through ak.from_iter (which makes all Python integers int64 and floats float64), then fixes them to agree with the dtypes in the actual Form. Many node types were fixed, but not non-empty leaf nodes. Now those are fixed as well.

Comment on lines 75 to +76
header=False,
tobject_header=True,
tobject_header=False,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The header data that users generally don't want to see used to be obscured as field names starting with @, but the Forth code doesn't even read them in, so the Form (by default) should not include them, either.

Comment on lines -159 to +162
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{}}, "form_key": "node{form_keys[0]}"}}'
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{"__array__": "string"}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{"__record__": "TRefArray"}}, "form_key": "node{form_keys[0]}"}}'
forth_obj.add_form(json.loads(temp_aform))
forth_stash.add_to_header(
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int64\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int64\n"
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int32\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int32\n"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a correction in Forth: the TRefArray data are 32-bit, but the Forth converted the 32-bit numbers into 64-bit numbers. Now it doesn't.

Also, a C++ __record__ name was missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also adds a string __array__ parameter, which looks right (I don't know which key is the content and parent, but we do need a string and char array, which are now there.

f' content[{self.name!r}] = {{"class": "ListOffsetArray", "offsets": "i64", "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}',
f' content[{self.name!r}] = {{"class": "RegularArray", "size": {self.array_length}, "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regular-sized arrays were being read in as ListOffsetArrays. The Forth code must still be generating fake offsets; I didn't remove that. However, the output is now correct: we get RegularArrays.

@@ -18,6 +18,7 @@ def test_00(is_forth):
py = branch.array(interp, library="ak", entry_stop=2)
assert py[0]["0"][0] == "expskin_FluxUnisim"
# py[-1] == <STLMap {'expskin_FluxUnisim': [0.944759093019904, 1.0890682745548674, ..., 1.1035170311451232, 0.8873957186284592], ...} at 0x7fbc4c1325e0>
assert py.layout.form == interp.awkward_form(branch.file)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every test that can read data with library="ak" checks the array's final Form with awkward_form's predicted Form, and they're all exactly equal. (These are Forms without form_keys.)

@jpivarski
Copy link
Member Author

@aryan26roy, if you want to take a look at this PR, please do, though I know you're busy.

@agoose77, I've annotated the changes to explain them.

When this is merged, Uproot will have all of the next-release updates in place, except for #788, which is just a name change adjustment to follow updates in Awkward. When this is in, I'll focus on Awkward.

Copy link
Collaborator

@agoose77 agoose77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look motivated enough to me. I don't yet have the deep knowledge to make a more reasoned contribution but equally, these are relatively small tweaks in the scheme of things :)

Comment on lines -159 to +162
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{}}, "form_key": "node{form_keys[0]}"}}'
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{"__array__": "string"}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{"__record__": "TRefArray"}}, "form_key": "node{form_keys[0]}"}}'
forth_obj.add_form(json.loads(temp_aform))
forth_stash.add_to_header(
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int64\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int64\n"
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int32\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int32\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also adds a string __array__ parameter, which looks right (I don't know which key is the content and parent, but we do need a string and char array, which are now there.

@jpivarski
Copy link
Member Author

You're right that a "char" NumpyArray was missing the "string" parameter on its containing ListOffsetArray. That was another one of the corrections.

@@ -1064,7 +1064,7 @@ def class_code(
f' forth_stash.add_to_header(f"output node{{key}}-data {{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}\\n")',
' forth_stash.add_to_header(f"output node{key2}-offsets int64\\n")',
' forth_stash.add_to_init(f"0 node{key2}-offsets <- stack\\n")',
f' content[{self.name!r}] = {{"class": "ListOffsetArray", "offsets": "i64", "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}',
f' content[{self.name!r}] = {{"class": "RegularArray", "size": {self.array_length}, "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}',
f' forth_stash.add_to_pre(f"{self.array_length} dup node{{key2}}-offsets +<- stack \\n stream #!{{uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}]}}-> node{{key}}-data\\n")\n',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpivarski what I did here was hard-code the array length into the forth code and use it to read the file by putting and pulling it off the stack. I think this can be easily replaced with a similar hardcoded version without generating the offset array.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "hard coded," do you mean that you took the value of self.array_length and included that as a literal in the Forth code? (That would be fine, of course.) You don't mean that you took the length of this particular example, such that it wouldn't work with a file with a different array length, right? (Just checking, because of the way it was worded.)

We can remove the offsets-generation at any time, because that counts as an optimization—the output won't change when it gets fixed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one. It changes depending on the file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, great! (I thought so, just checking.)

And that reminds me that I should merge this now. Thanks for looking out over!

@jpivarski jpivarski merged commit 150cf6c into main Nov 30, 2022
@jpivarski jpivarski deleted the jpivarski/awkwardforth-forms-must-agree-with-awkward_form-method branch November 30, 2022 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure that the Forms produced by the AwkwardForth generation agree with the awkward_form methods.
3 participants