-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Array __repr__
and __str__
can produce misleading text
#838
Comments
NumPy doesn't always show the whole array, either, but its cut-off is much higher: at 1000 entries or so. Because of its regular structure, NumPy can also cut the array in a mood logical place, whereas there's no guarantee that an Awkward Array is going to have a non-misleading place to cut. Independent of that, I also wanted the string representations to fit on one line—the maximum length of the line was chosen so that it wouldn't create scroll bars on GitHub or StackOverflow. That makes it possible to just paste examples without making the code examples hard to navigate. They could, like NumPy, have multiple lines that never exceed a certain width, but that garbles dicts and lists of arrays. I'm surprised there aren't more complaints about that with NumPy—it can be very hard to read. Another consideration in creating string representations is that the array might contain virtual arrays and we don't want to trigger materialization too easily. Any array elements that get printed must be materialized (usually that means reading from disk or over a network), and the virtual arrays are usually set up such that each record field is independent and there are large row chunks that are independent. Since the string representation includes both ends of the array, the first and last chunk will be read, and if there's only room for two or three fields, then only those will be read. Making the string representation wider will mean more up-front data reading. The Perhaps the right thing to do is to give you more control over the representation. The I'm going to label this a feature request. |
The problem here is that it's choosing confusing truncation points based solely on the string. It should not truncate across different level lists anymore than it should truncate the middle of a number. If truncating deep in a list is required, it should truncate each, like |
PS: The test I mentioned only catches the first example, the inside lists being merged in truncation would not be caught by that test. Edit, actually I didn't read your final paragraph very well, looks like you already mentioned this. Oh, well, now it's here twice. |
Just to add a bit to the list of unexpected behavior here--the ellipsis can even occur before any field names or data, which looks very strange to me: >>> type(a)
awkward.highlevel.Record
>>> a.tolist()
{'nMuon': 2,
'Muon_pt': [10.763696670532227, 15.736522674560547],
'Muon_eta': [1.0668272972106934, -0.563786506652832],
'Muon_phi': [-0.03427272289991379, 2.5426154136657715],
'Muon_mass': [0.10565836727619171, 0.10565836727619171],
'Muon_charge': [-1, -1]}
>>> a
<Record ... 0.106], Muon_charge: [-1, -1]} type='{"nMuon": uint32, "Muon_pt": va...'>
>>> print(a)
... Muon_phi: [-0.0343, 2.54], Muon_mass: [0.106, 0.106], Muon_charge: [-1, -1]} I would expect an ellipsis to be at the end or in the middle, but never at the beginning... |
Okay, so I'm finally addressing this, but in the context of Awkward 2.x, rather than 1.x. PR #1122 introduces high-level (Historical note: Awkward 0.x ignored screen space as well. It cut the output to prevent large arrays from inundating the terminal, but it did so at the "number of objects" level, which can use a surprisingly large or small amount of screen space. From our experience of v0, we learned that getting output that fits in a GitHub or StackOverflow preformatted block without scrolling is a big deal!) We can preserve screen space and structure by allowing for more than one ellipsis ( Here's what it looks like, using an array in v1 and v2 (both are accessible in >>> import awkward as ak
>>> import awkward._v2.highlevel
>>> from awkward._v2.tmp_for_testing import v1_to_v2
>>> import numpy as np
>>> layout = ak.from_iter([[0.0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]], highlevel=False)
>>> ak.highlevel.Array(layout)
<Array [[0, 1.1, 2.2], ... 6.6, 7.7, 8.8, 9.9]] type='5 * var * float64'>
>>> ak._v2.highlevel.Array(v1_to_v2(layout))
<Array [[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]] type='5 * var * flo...'> The v1 array has a single ellipsis, but it only hints at the structure with unbalanced brackets. The v2 array has two ellipses, and the brackets are guaranteed to be balanced, so the structure is more clear. Here's a bigger example that demonstrates the new multi-line >>> layout = ak.from_iter([
... {"x": i * 1.1, "y": np.arange(i).tolist()} for i in range(100)
... ], highlevel=False)
>>> ak.highlevel.Array(layout)
<Array [{x: 0, y: []}, ... 95, 96, 97, 98]}] type='100 * {"x": float64, "y": var...'>
>>> ak._v2.highlevel.Array(v1_to_v2(layout))
<Array [{x: 0, y: []}, {...}, ..., {x: 109, y: [...]}] type='100 * {x: float...'>
>>> ak._v2.highlevel.Array(v1_to_v2(layout)).show(type=True)
type: 100 * {x: float64, y: var * int64}
[{x: 0, y: []},
{x: 1.1, y: [0]},
{x: 2.2, y: [0, 1]},
{x: 3.3, y: [0, 1, 2]},
{x: 4.4, y: [0, 1, 2, 3]},
{x: 5.5, y: [0, 1, 2, 3, 4]},
{x: 6.6, y: [0, 1, 2, 3, 4, 5]},
{x: 7.7, y: [0, 1, 2, 3, 4, 5, 6]},
{x: 8.8, y: [0, 1, 2, 3, 4, 5, 6, 7]},
{x: 9.9, y: [0, 1, 2, 3, ..., 5, 6, 7, 8]},
...,
{x: 100, y: [0, 1, 2, 3, ..., 88, 89, 90]},
{x: 101, y: [0, 1, 2, 3, ..., 89, 90, 91]},
{x: 102, y: [0, 1, 2, 3, ..., 90, 91, 92]},
{x: 103, y: [0, 1, 2, 3, ..., 91, 92, 93]},
{x: 105, y: [0, 1, 2, 3, ..., 92, 93, 94]},
{x: 106, y: [0, 1, 2, 3, ..., 93, 94, 95]},
{x: 107, y: [0, 1, 2, 3, ..., 94, 95, 96]},
{x: 108, y: [0, 1, 2, 3, ..., 95, 96, 97]},
{x: 109, y: [0, 1, 2, 3, ..., 96, 97, 98]}] The new algorithm is guaranteed to satisfy two constraints: (1) the brackets are balanced, however deeply it manages to go, and (2) the output fits within the space budget, unless that budget is less than 5 (the length of >>> layout = ak.from_iter([[0.0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]], highlevel=False)
>>> a = ak._v2.highlevel.Array(v1_to_v2(layout))
>>> for width in range(80, -1, -1):
... print("-" * width)
... print(a.show(limit_rows=1, limit_cols=width, stream=None))
...
--------------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
---------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
--------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
-------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
-----------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
----------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
---------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
--------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
-------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
-----------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
----------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
---------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
--------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
-------------------------------
[[0, 1.1, 2.2], ..., [...]]
------------------------------
[[0, 1.1, 2.2], ..., [...]]
-----------------------------
[[0, 1.1, 2.2], ..., [...]]
----------------------------
[[0, 1.1, 2.2], ..., [...]]
---------------------------
[[0, 1.1, 2.2], ..., [...]]
--------------------------
[[0, 1.1, 2.2], ...]
-------------------------
[[0, 1.1, 2.2], ...]
------------------------
[[0, ..., 2.2], ...]
-----------------------
[[0, ..., 2.2], ...]
----------------------
[[0, ..., 2.2], ...]
---------------------
[[0, ..., 2.2], ...]
--------------------
[[0, ..., 2.2], ...]
-------------------
[[0, ...], ...]
------------------
[[0, ...], ...]
-----------------
[[0, ...], ...]
----------------
[[0, ...], ...]
---------------
[[0, ...], ...]
--------------
[[...], ...]
-------------
[[...], ...]
------------
[[...], ...]
-----------
[...]
----------
[...]
---------
[...]
--------
[...]
-------
[...]
------
[...]
-----
[...]
----
[...]
---
[...]
--
[...]
-
[...]
[...] Since this is getting merged into v2 now, I'll close this issue so that I don't forget to later. v2 isn't usable for data analysis, but when it is, this will be one of the enticements to switch over to it. (Note: the v1 → v2 differences are much smaller than the v0 → v1 differences and we won't be changing PyPI names like we did last time. The new transition will be much easier, I promise!) |
If you're interested in comparing them, the old algorithm walks over tokens from the left and from the right, adding them to the output until adding one more would exceed the budget. Tokens can be individual brackets, such as Here's the old code: The new algorithm walks from the left and from the right of each list, but only from the left for records. It walks over the tree and thus can guarantee bracket matching. Instead of having a single space budget applied to one forward iterator and one backward iterator, it has a space budget that it subdivides to children. As described in the previous comment, it divides up the budget greedily: if one list uses it up, there's nothing left for the next one. If a list has only one element, the whole budget is given to that list; if there's more than one, half of the remaining budget is given to each, which gives exponentially decreasing weight to items in the list: the first-visited gets the most, the second-visited gets less, the third-visited gets less, on down in powers of 2. When walking over a list, the order of visiting alternates between the left-to-right and the right-to-left, so lists put most of the emphasis on the first and last element. Here's the new code: Tests are done; time to merge the PR and close this issue. |
* Working on the high-level ak.Array for v2. * Set coverage target to 20% so that it 'almost never' fails (only if it drops to 0% for some technical reason). * Working on a pretty-printer that addresses #838. * Check in partial work. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Get rid of codecov annotations in GitHub diffs. * Check in partial work. * Basically functioning v2 Array and Record classes. * Complete enough to merge the PR. A lot more work is needed thereafter. Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
I think I would call at least the first case below a bug even though it doesn't actually affect any array operations.
There are certain
Array
s that have string representations that seem misleading:The comma after the second
True
makes it seem like there are more elements in that third list. I would have expected something like<Array [[True], [], [True], ... [True], []] type='6 * var * bool'>
instead.Less inaccurate but still potentially misleading are cases like:
This makes it look like
[0, 1, 2, ... 0]
is all the same list. You can tell the difference by counting the number of entries and comparing to the length of 7, but I would have expected something like<Array [[0], [], [0, 1, 2, ... ], ... [ ... 0], [0], [], []] type='7 * var * int64'>
.(Actually, it's always confused me why
__repr__
and__str__
don't just show the whole array like numpy. I guessak.to_list()
always works for that at least.)The text was updated successfully, but these errors were encountered: