Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array __repr__ and __str__ can produce misleading text #838

Closed
masonproffitt opened this issue Apr 16, 2021 · 6 comments · Fixed by #1122
Closed

Array __repr__ and __str__ can produce misleading text #838

masonproffitt opened this issue Apr 16, 2021 · 6 comments · Fixed by #1122
Labels
feature New feature or request

Comments

@masonproffitt
Copy link

I think I would call at least the first case below a bug even though it doesn't actually affect any array operations.

There are certain Arrays that have string representations that seem misleading:

>>> import awkward as ak
>>> ak.__version__
'1.2.2'
>>> ak.Array([[True], [], [True], [True], [True], []])
<Array [[True], [], [True, ... [True], []] type='6 * var * bool'>

The comma after the second True makes it seem like there are more elements in that third list. I would have expected something like <Array [[True], [], [True], ... [True], []] type='6 * var * bool'> instead.

Less inaccurate but still potentially misleading are cases like:

>>> ak.Array([[0], [], [0, 1, 2, 3], [1, 0], [0], [], []])
<Array [[0], [], [0, 1, 2, ... 0], [0], [], []] type='7 * var * int64'>

This makes it look like [0, 1, 2, ... 0] is all the same list. You can tell the difference by counting the number of entries and comparing to the length of 7, but I would have expected something like <Array [[0], [], [0, 1, 2, ... ], ... [ ... 0], [0], [], []] type='7 * var * int64'>.

(Actually, it's always confused me why __repr__ and __str__ don't just show the whole array like numpy. I guess ak.to_list() always works for that at least.)

@masonproffitt masonproffitt added the bug (unverified) The problem described would be a bug, but needs to be triaged label Apr 16, 2021
@jpivarski
Copy link
Member

NumPy doesn't always show the whole array, either, but its cut-off is much higher: at 1000 entries or so. Because of its regular structure, NumPy can also cut the array in a mood logical place, whereas there's no guarantee that an Awkward Array is going to have a non-misleading place to cut.

Independent of that, I also wanted the string representations to fit on one line—the maximum length of the line was chosen so that it wouldn't create scroll bars on GitHub or StackOverflow. That makes it possible to just paste examples without making the code examples hard to navigate. They could, like NumPy, have multiple lines that never exceed a certain width, but that garbles dicts and lists of arrays. I'm surprised there aren't more complaints about that with NumPy—it can be very hard to read.

Another consideration in creating string representations is that the array might contain virtual arrays and we don't want to trigger materialization too easily. Any array elements that get printed must be materialized (usually that means reading from disk or over a network), and the virtual arrays are usually set up such that each record field is independent and there are large row chunks that are independent. Since the string representation includes both ends of the array, the first and last chunk will be read, and if there's only room for two or three fields, then only those will be read. Making the string representation wider will mean more up-front data reading.

The tolist method will show everything (and IPython-like things will format it nicely), but for large arrays, you don't want to read everything and show everything.

Perhaps the right thing to do is to give you more control over the representation. The str and repr functions don't take arguments, but maybe there can be a global parameter (or a context manager) that affects the allowed length of the strings. Or maybe a show method that takes arguments. It would be very easy to increase the length, it would be somewhat harder to add line breaks to keep the line width under a maximum (you'd have to get into ak._util.minimally_touching_string), and I don't know how one would ensure that the "..." only appears in logically meaningful places (i.e. balances parentheses). Currently, the algorithm adds tokens until the length would exceed a maximum—to keep the parentheses balanced, you'd have to change the granular unit from tokens to syntactic groups. In many cases, there would be nothing to show because the smallest syntactic unit is too large.

I'm going to label this a feature request.

@jpivarski jpivarski added feature New feature or request and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Apr 16, 2021
@henryiii
Copy link
Member

The problem here is that it's choosing confusing truncation points based solely on the string. It should not truncate across different level lists anymore than it should truncate the middle of a number. If truncating deep in a list is required, it should truncate each, like <Array [[0], [], [0, 1, 2, ...], ..., [0], [], []] type='7 * var * int64'> for example. One "property-based" test would be the number of [ should match the number of ] in the repr for all reprs.

@henryiii
Copy link
Member

henryiii commented Apr 17, 2021

PS: The test I mentioned only catches the first example, the inside lists being merged in truncation would not be caught by that test. Edit, actually I didn't read your final paragraph very well, looks like you already mentioned this. Oh, well, now it's here twice.

@masonproffitt
Copy link
Author

Just to add a bit to the list of unexpected behavior here--the ellipsis can even occur before any field names or data, which looks very strange to me:

>>> type(a)
awkward.highlevel.Record

>>> a.tolist()
{'nMuon': 2,
 'Muon_pt': [10.763696670532227, 15.736522674560547],
 'Muon_eta': [1.0668272972106934, -0.563786506652832],
 'Muon_phi': [-0.03427272289991379, 2.5426154136657715],
 'Muon_mass': [0.10565836727619171, 0.10565836727619171],
 'Muon_charge': [-1, -1]}

>>> a
<Record ... 0.106], Muon_charge: [-1, -1]} type='{"nMuon": uint32, "Muon_pt": va...'>

>>> print(a)
... Muon_phi: [-0.0343, 2.54], Muon_mass: [0.106, 0.106], Muon_charge: [-1, -1]}

I would expect an ellipsis to be at the end or in the middle, but never at the beginning...

@jpivarski
Copy link
Member

Okay, so I'm finally addressing this, but in the context of Awkward 2.x, rather than 1.x. PR #1122 introduces high-level Array and Record for v2, and one of the first things these wrappers need is __repr__/__str__. We have enough experience with the v1 __repr__/__str__ to know that hiding structure for the sake of screen space is a problem—structure is one of the most important features.

(Historical note: Awkward 0.x ignored screen space as well. It cut the output to prevent large arrays from inundating the terminal, but it did so at the "number of objects" level, which can use a surprisingly large or small amount of screen space. From our experience of v0, we learned that getting output that fits in a GitHub or StackOverflow preformatted block without scrolling is a big deal!)

We can preserve screen space and structure by allowing for more than one ellipsis (...). The decision of where to put the ellipses (plural) is an underconstrained problem—with multiple lists taking up the slack, one could take more slack than another. However, assuming that each list is roughly the same size, we can ask a list to (greedily) try to use all the available space when there's only one of them and ask each list to try to use half of the available space when there's more than one of them, and that would have at least the first two using most of the space with an ellipsis for any remaining. Such an algorithm only wastes space if the list sizes are very unbalanced. A better algorithm, one that globally optimizes the use of space, would have to look ahead non-deterministically—it would have to be some sort of fit, which is too complicated for just printing arrays on the screen. The greedy algorithm described above is good enough.

Here's what it looks like, using an array in v1 and v2 (both are accessible in main; version 2.0 is being developed in a submodule).

>>> import awkward as ak
>>> import awkward._v2.highlevel
>>> from awkward._v2.tmp_for_testing import v1_to_v2
>>> import numpy as np

>>> layout = ak.from_iter([[0.0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]], highlevel=False)

>>> ak.highlevel.Array(layout)
<Array [[0, 1.1, 2.2], ... 6.6, 7.7, 8.8, 9.9]] type='5 * var * float64'>

>>> ak._v2.highlevel.Array(v1_to_v2(layout))
<Array [[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]] type='5 * var * flo...'>

The v1 array has a single ellipsis, but it only hints at the structure with unbalanced brackets. The v2 array has two ellipses, and the brackets are guaranteed to be balanced, so the structure is more clear.

Here's a bigger example that demonstrates the new multi-line show method:

>>> layout = ak.from_iter([
...     {"x": i * 1.1, "y": np.arange(i).tolist()} for i in range(100)
... ], highlevel=False)

>>> ak.highlevel.Array(layout)
<Array [{x: 0, y: []}, ... 95, 96, 97, 98]}] type='100 * {"x": float64, "y": var...'>

>>> ak._v2.highlevel.Array(v1_to_v2(layout))
<Array [{x: 0, y: []}, {...}, ..., {x: 109, y: [...]}] type='100 * {x: float...'>

>>> ak._v2.highlevel.Array(v1_to_v2(layout)).show(type=True)
type: 100 * {x: float64, y: var * int64}
[{x: 0, y: []},
 {x: 1.1, y: [0]},
 {x: 2.2, y: [0, 1]},
 {x: 3.3, y: [0, 1, 2]},
 {x: 4.4, y: [0, 1, 2, 3]},
 {x: 5.5, y: [0, 1, 2, 3, 4]},
 {x: 6.6, y: [0, 1, 2, 3, 4, 5]},
 {x: 7.7, y: [0, 1, 2, 3, 4, 5, 6]},
 {x: 8.8, y: [0, 1, 2, 3, 4, 5, 6, 7]},
 {x: 9.9, y: [0, 1, 2, 3, ..., 5, 6, 7, 8]},
 ...,
 {x: 100, y: [0, 1, 2, 3, ..., 88, 89, 90]},
 {x: 101, y: [0, 1, 2, 3, ..., 89, 90, 91]},
 {x: 102, y: [0, 1, 2, 3, ..., 90, 91, 92]},
 {x: 103, y: [0, 1, 2, 3, ..., 91, 92, 93]},
 {x: 105, y: [0, 1, 2, 3, ..., 92, 93, 94]},
 {x: 106, y: [0, 1, 2, 3, ..., 93, 94, 95]},
 {x: 107, y: [0, 1, 2, 3, ..., 94, 95, 96]},
 {x: 108, y: [0, 1, 2, 3, ..., 95, 96, 97]},
 {x: 109, y: [0, 1, 2, 3, ..., 96, 97, 98]}]

The new algorithm is guaranteed to satisfy two constraints: (1) the brackets are balanced, however deeply it manages to go, and (2) the output fits within the space budget, unless that budget is less than 5 (the length of "[...]"). The following demonstrates that using a limit_cols that gradually decreases from 80 to 0.

>>> layout = ak.from_iter([[0.0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]], highlevel=False)
>>> a = ak._v2.highlevel.Array(v1_to_v2(layout))
>>> for width in range(80, -1, -1):
...     print("-" * width)
...     print(a.show(limit_rows=1, limit_cols=width, stream=None))
... 
--------------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], [], [5.5], [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [5.5], [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------------
[[0, 1.1, 2.2], [3.3, 4.4], ..., [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------------
[[0, 1.1, 2.2], [3.3, ...], ..., [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
---------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
--------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
-------------------------------------------------
[[0, 1.1, 2.2], [...], ..., [6.6, 7.7, 8.8, 9.9]]
------------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, 8.8, 9.9]]
-----------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, 8.8, 9.9]]
----------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
---------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
--------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
-------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
------------------------------------------
[[0, 1.1, 2.2], ..., [6.6, 7.7, ..., 9.9]]
-----------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
----------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
---------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
--------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
-------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ..., 9.9]]
------------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
-----------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
----------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
---------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
--------------------------------
[[0, 1.1, 2.2], ..., [6.6, ...]]
-------------------------------
[[0, 1.1, 2.2], ..., [...]]
------------------------------
[[0, 1.1, 2.2], ..., [...]]
-----------------------------
[[0, 1.1, 2.2], ..., [...]]
----------------------------
[[0, 1.1, 2.2], ..., [...]]
---------------------------
[[0, 1.1, 2.2], ..., [...]]
--------------------------
[[0, 1.1, 2.2], ...]
-------------------------
[[0, 1.1, 2.2], ...]
------------------------
[[0, ..., 2.2], ...]
-----------------------
[[0, ..., 2.2], ...]
----------------------
[[0, ..., 2.2], ...]
---------------------
[[0, ..., 2.2], ...]
--------------------
[[0, ..., 2.2], ...]
-------------------
[[0, ...], ...]
------------------
[[0, ...], ...]
-----------------
[[0, ...], ...]
----------------
[[0, ...], ...]
---------------
[[0, ...], ...]
--------------
[[...], ...]
-------------
[[...], ...]
------------
[[...], ...]
-----------
[...]
----------
[...]
---------
[...]
--------
[...]
-------
[...]
------
[...]
-----
[...]
----
[...]
---
[...]
--
[...]
-
[...]

[...]

Since this is getting merged into v2 now, I'll close this issue so that I don't forget to later. v2 isn't usable for data analysis, but when it is, this will be one of the enticements to switch over to it.

(Note: the v1 → v2 differences are much smaller than the v0 → v1 differences and we won't be changing PyPI names like we did last time. The new transition will be much easier, I promise!)

@jpivarski
Copy link
Member

If you're interested in comparing them, the old algorithm walks over tokens from the left and from the right, adding them to the output until adding one more would exceed the budget. Tokens can be individual brackets, such as [ and ], individual numbers, strings, a key, etc. Thus, it can't cut in the middle of a number, but it can cut in the middle of a structure. That's what you all were complaining about.

Here's the old code:

https://github.com/scikit-hep/awkward-1.0/blob/a440328f8097d22c2ba053fd117fed543829afc0/src/awkward/_util.py#L1557-L1770

The new algorithm walks from the left and from the right of each list, but only from the left for records. It walks over the tree and thus can guarantee bracket matching. Instead of having a single space budget applied to one forward iterator and one backward iterator, it has a space budget that it subdivides to children. As described in the previous comment, it divides up the budget greedily: if one list uses it up, there's nothing left for the next one. If a list has only one element, the whole budget is given to that list; if there's more than one, half of the remaining budget is given to each, which gives exponentially decreasing weight to items in the list: the first-visited gets the most, the second-visited gets less, the third-visited gets less, on down in powers of 2. When walking over a list, the order of visiting alternates between the left-to-right and the right-to-left, so lists put most of the emphasis on the first and last element.

Here's the new code:

https://github.com/scikit-hep/awkward-1.0/blob/6eb1e63edcf7b617fca9de5d551d6083f35a77a9/src/awkward/_v2/_prettyprint.py#L12-L239

Tests are done; time to merge the PR and close this issue.

jpivarski added a commit that referenced this issue Oct 21, 2021
* Working on the high-level ak.Array for v2.

* Set coverage target to 20% so that it 'almost never' fails (only if it drops to 0% for some technical reason).

* Working on a pretty-printer that addresses #838.

* Check in partial work.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Get rid of codecov annotations in GitHub diffs.

* Check in partial work.

* Basically functioning v2 Array and Record classes.

* Complete enough to merge the PR. A lot more work is needed thereafter.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants