-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: all AwkwardForth Forms now agree with awkward_form method output. #790
fix: all AwkwardForth Forms now agree with awkward_form method output. #790
Conversation
def _awkward_json_to_array(awkward, form, array): | ||
if form["class"] == "NumpyArray": | ||
form = awkward.forms.from_json(json.dumps(form)) | ||
dtype = awkward.types.numpytype.primitive_to_dtype(form.primitive) | ||
if isinstance(array, awkward.contents.EmptyArray): | ||
form = awkward.forms.from_json(json.dumps(form)) | ||
dtype = awkward.types.numpytype.primitive_to_dtype(form.primitive) | ||
return awkward.contents.NumpyArray(numpy.empty(0, dtype=dtype)) | ||
return awkward.contents.NumpyArray( | ||
numpy.empty(0, dtype=dtype), | ||
parameters=form.parameters, | ||
) | ||
else: | ||
return array | ||
return awkward.contents.NumpyArray( | ||
numpy.asarray(array.data, dtype=dtype), | ||
parameters=form.parameters, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This corrects an error in the old, non-Forth method. The old method ran Python objects through ak.from_iter
(which makes all Python integers int64
and floats float64
), then fixes them to agree with the dtypes in the actual Form. Many node types were fixed, but not non-empty leaf nodes. Now those are fixed as well.
header=False, | ||
tobject_header=True, | ||
tobject_header=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The header data that users generally don't want to see used to be obscured as field names starting with @
, but the Forth code doesn't even read them in, so the Form (by default) should not include them, either.
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{}}, "form_key": "node{form_keys[0]}"}}' | ||
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{"__array__": "string"}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{"__record__": "TRefArray"}}, "form_key": "node{form_keys[0]}"}}' | ||
forth_obj.add_form(json.loads(temp_aform)) | ||
forth_stash.add_to_header( | ||
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int64\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int64\n" | ||
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int32\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int32\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a correction in Forth: the TRefArray data are 32-bit, but the Forth converted the 32-bit numbers into 64-bit numbers. Now it doesn't.
Also, a C++ __record__
name was missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also adds a string __array__
parameter, which looks right (I don't know which key is the content and parent, but we do need a string and char array, which are now there.
f' content[{self.name!r}] = {{"class": "ListOffsetArray", "offsets": "i64", "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}', | ||
f' content[{self.name!r}] = {{"class": "RegularArray", "size": {self.array_length}, "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regular-sized arrays were being read in as ListOffsetArrays. The Forth code must still be generating fake offsets; I didn't remove that. However, the output is now correct: we get RegularArrays.
@@ -18,6 +18,7 @@ def test_00(is_forth): | |||
py = branch.array(interp, library="ak", entry_stop=2) | |||
assert py[0]["0"][0] == "expskin_FluxUnisim" | |||
# py[-1] == <STLMap {'expskin_FluxUnisim': [0.944759093019904, 1.0890682745548674, ..., 1.1035170311451232, 0.8873957186284592], ...} at 0x7fbc4c1325e0> | |||
assert py.layout.form == interp.awkward_form(branch.file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every test that can read data with library="ak"
checks the array's final Form with awkward_form
's predicted Form, and they're all exactly equal. (These are Forms without form_keys
.)
…-awkward_form-method
@aryan26roy, if you want to take a look at this PR, please do, though I know you're busy. @agoose77, I've annotated the changes to explain them. When this is merged, Uproot will have all of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look motivated enough to me. I don't yet have the deep knowledge to make a more reasoned contribution but equally, these are relatively small tweaks in the scheme of things :)
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int64", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{}}, "form_key": "node{form_keys[0]}"}}' | ||
temp_aform = f'{{"class": "RecordArray", "contents": {{"fName": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "uint8", "inner_shape": [], "parameters": {{"__array__": "char"}}, "form_key": "node{form_keys[2]}"}}, "parameters": {{"__array__": "string"}}, "form_key": "node{form_keys[1]}"}}, "fSize": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[3]}"}}, "refs": {{"class": "ListOffsetArray", "offsets": "i64", "content": {{"class": "NumpyArray", "primitive": "int32", "inner_shape": [], "parameters": {{}}, "form_key": "node{form_keys[5]}"}}, "parameters": {{}}, "form_key": "node{form_keys[4]}"}}}}, "parameters": {{"__record__": "TRefArray"}}, "form_key": "node{form_keys[0]}"}}' | ||
forth_obj.add_form(json.loads(temp_aform)) | ||
forth_stash.add_to_header( | ||
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int64\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int64\n" | ||
f"output node{form_keys[1]}-offsets int64\noutput node{form_keys[2]}-data uint8\noutput node{form_keys[3]}-data int32\noutput node{form_keys[4]}-offsets int64\noutput node{form_keys[5]}-data int32\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also adds a string __array__
parameter, which looks right (I don't know which key is the content and parent, but we do need a string and char array, which are now there.
You're right that a |
@@ -1064,7 +1064,7 @@ def class_code( | |||
f' forth_stash.add_to_header(f"output node{{key}}-data {{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}\\n")', | |||
' forth_stash.add_to_header(f"output node{key2}-offsets int64\\n")', | |||
' forth_stash.add_to_init(f"0 node{key2}-offsets <- stack\\n")', | |||
f' content[{self.name!r}] = {{"class": "ListOffsetArray", "offsets": "i64", "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}', | |||
f' content[{self.name!r}] = {{"class": "RegularArray", "size": {self.array_length}, "content": {{ "class": "NumpyArray", "primitive": f"{{uproot._awkward_forth.convert_dtype(uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}])}}", "inner_shape": [], "parameters": {{}}, "form_key": f"node{{key}}"}}, "form_key": f"node{{key2}}"}}', | |||
f' forth_stash.add_to_pre(f"{self.array_length} dup node{{key2}}-offsets +<- stack \\n stream #!{{uproot._awkward_forth.symbol_dict[self._dtype{len(dtypes)}]}}-> node{{key}}-data\\n")\n', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpivarski what I did here was hard-code the array length into the forth code and use it to read the file by putting and pulling it off the stack. I think this can be easily replaced with a similar hardcoded version without generating the offset array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "hard coded," do you mean that you took the value of self.array_length
and included that as a literal in the Forth code? (That would be fine, of course.) You don't mean that you took the length of this particular example, such that it wouldn't work with a file with a different array length, right? (Just checking, because of the way it was worded.)
We can remove the offsets-generation at any time, because that counts as an optimization—the output won't change when it gets fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first one. It changes depending on the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, great! (I thought so, just checking.)
And that reminds me that I should merge this now. Thanks for looking out over!
Quoting #754,
They are now; the differences were only minor and some of the errors were in the non-Forth implementation.
They were already gone before I started.
They all do now.
A user-visible difference is that we're excluding all of the TObject header and numbytes_version header stuff (used to be marked with
@
, now gone). It's still possible to get them back, but you have to run a non-Forth read and pass non-default parameters to theawkward_forth
method. I don't think we'll ever need it.