C++ refactoring: nbytes #1138
Conversation
@jpivarski - perhaps related to this discussion: here is a recursive version. An example of a handler:

```python
def NumpyArrayHandler(obj):
    yield obj._data
    yield obj._data.ctypes.data
    yield obj._data.shape
```

a test case:

```python
np_data = np.random.random(size=(4, 100 * 1024 * 1024 // 8 // 4))
array = ak.from_numpy(np_data, regulararray=False)
array = v1_to_v2(array.layout)
```

and a result:

```
>>> print(total_size(array, {ak._v2.contents.NumpyArray: NumpyArrayHandler}, verbose=True))
48 <class 'awkward._v2.contents.numpyarray.NumpyArray'> <NumpyArray d...
</NumpyArray>
112 <class 'numpy.ndarray'> array([[0.798... 0.46882701]])
32 <class 'int'> 140219093352448
56 <class 'tuple'> (4, 3276800)
28 <class 'int'> 4
276
```

For comparison, NumPy's `array.nbytes` gives 104857600.
force-pushed from `7b1ead1` to `21428f5`
The problem that recipe is solving is measuring the memory held by a graph of Python objects. Fortunately, we don't need to solve the problem of figuring out how much memory is held by an ak.Array object graph: the array buffers are much larger than the Python objects, so just as NumPy's `nbytes` only counts the array buffer, ours only needs to count the buffers.
In fact, in that overlaps function, two buffers can overlap without sharing the same pointer address. One might start at pointer address 0x1000000000000000 and have length 1024 while another starts at 0x1000000000000010 and has length 64. Overlaps are intervals in a one-dimensional space.
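To make the interval picture concrete, here is a small sketch (not code from this PR) of checking whether two buffers, each described by a start address and a byte length, overlap as one-dimensional intervals:

```python
def buffers_overlap(ptr1, len1, ptr2, len2):
    # Two half-open intervals [ptr, ptr + len) overlap iff each one starts
    # before the other ends; equal start addresses are not required.
    return ptr1 < ptr2 + len2 and ptr2 < ptr1 + len1

# The example from the comment above: different start addresses, but the
# second buffer lies entirely inside the first.
assert buffers_overlap(0x1000000000000000, 1024, 0x1000000000000010, 64)
```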
The `largest` dict was a half-hearted attempt to not double-count overlaps (it didn't handle all cases). The algorithm also doesn't translate from C++ to Python well, since the feature that made it work in Awkward 1.x has been lost: in Awkward 1.x, an array and a view of that array were guaranteed to have the same base pointer, `ptr_`.
```python
a = ak.Array([1, 2, 3, 4, 5])
b = a[2:]
assert a.layout.ptr == b.layout.ptr
```
In the above, `a`'s and `b`'s NumpyArray nodes would have the same `ptr_`, and `b` would have a non-zero `offset_`. That way, they'd have the same key in the `largest` dict and only the larger of the two would count. (That's the half-hearted attempt at computing unions of intervals.)
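A minimal sketch of that half-hearted scheme (illustrative, not Awkward's actual code): key buffers by their base pointer and keep only the largest extent seen for each key.

```python
# Hypothetical sketch of the `largest` dict described above: buffers are
# keyed by base pointer (ptr_), and for each key only the largest
# offset + length is kept, so views sharing a base aren't double-counted.
# (It misses partial overlaps between buffers with different base pointers.)
def total_bytes(buffers):
    largest = {}
    for ptr, offset, length in buffers:  # (base pointer, byte offset, byte length)
        largest[ptr] = max(largest.get(ptr, 0), offset + length)
    return sum(largest.values())

# A full array and a view starting 16 bytes in, sharing base pointer 0x1000:
print(total_bytes([(0x1000, 0, 40), (0x1000, 16, 24)]))  # 40, not 64
```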
In Awkward 2.x, we rely on NumPy for tracking views of buffers, and NumPy does not keep track of base pointers and offsets separately; it only keeps track of the start of the data of interest (and has a Python reference "`base`" for memory management).
```python
>>> a = np.array([1, 2, 3, 4, 5])
>>> b = a[2:]
>>> assert a.ctypes.data + 16 == b.ctypes.data  # pointers differ by 2 * sizeof(int64)
```
Furthermore, `id(self.ptr)` is not a pointer to the data; it's a pointer to the Python integer that is equal to the pointer to the data. (And `id` is only a pointer in CPython, as a non-guaranteed implementation detail.) So this algorithm isn't going to work at all in Awkward v2.
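A quick illustration of that distinction (CPython-specific, since treating `id` as an address is an implementation detail):

```python
import numpy as np

a = np.arange(5)
ptr = a.ctypes.data   # integer value of the data buffer's address
# id(ptr) is the address of the Python int object holding that value,
# not the address of the array's data buffer:
print(ptr, id(ptr))   # two unrelated addresses (in CPython)
```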
Fortunately, it doesn't need to: NumPy's `nbytes` is simple-minded. It does not represent the true number of bytes held in memory by an array, for several reasons:
- other arrays might be holding the same data, and freeing this one won't free the data
- it takes a very naive approach to data with non-trivial strides.
Consider this:
```python
>>> a = np.arange(2*3*5).reshape(2, 3, 5)
>>> a
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],
       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])
>>> a[:, ::2, ::4]
array([[[ 0,  4],
        [10, 14]],
       [[15, 19],
        [25, 29]]])
```
The slice picks out 8 of the 30 items, and it's a view that shares data with the original array. The original array has `8 * 2*3*5 == 240` bytes. The slice uses memory that goes all the way from the `0` to the `29`, inclusive, so one might think that it should also be represented as having 240 bytes; after all, the buffer it needs to hold uses that much memory. However, `nbytes` returns `8 * 2*2*2 == 64`, only counting the elements that you can see.
For an extreme case, consider this:
```python
>>> a, b = np.broadcast_arrays(123, np.zeros((10, 10)))
>>> a
array([[123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123]])
```
The array `a` looks like it has 100 elements, and `a.nbytes` returns `8 * 100 == 800`. However, this array is a trick: it's only storing a single 8-byte integer. The strides are zero, so it's showing you the same number over and over:
```python
>>> a.shape
(10, 10)
>>> a.strides
(0, 0)
```
To prove it, try changing an element:
```python
>>> a[0] = 999
<stdin>:1: DeprecationWarning: Numpy has detected that you (may be) writing to an array with
overlapping memory from np.broadcast_arrays. If this is intentional
set the WRITEABLE flag True or make a copy immediately before writing.
>>> a
array([[999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999]])
```
Okay, NumPy gives a warning now because this is surprising behavior, but my point stands. There's only one number here; the buffer used to represent this array has only 8 bytes, yet `a.nbytes` returns `800`.
Conclusion: the `nbytes` value doesn't represent how much memory an array takes; it's just a synonym for `a.size * a.itemsize`.
Therefore, Awkward Array's `nbytes` can be similarly simple. It only needs to walk recursively through the array, adding up `x.data.nbytes` for each Index, NumpyArray, and Identifier `x`. It doesn't need a dict because it's not trying to avoid double-counting. All of the care that we'd take to avoid double-counting would be in a new function, not named `nbytes`.
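The recursive walk could look something like this minimal sketch; `Node`, `buffers`, and `children` are illustrative stand-ins, not Awkward's actual layout API:

```python
import numpy as np

# Stand-in for a layout node (Index, NumpyArray, Identifier, ...): each node
# may hold some NumPy buffers directly and may have nested child nodes.
class Node:
    def __init__(self, buffers=(), children=()):
        self.buffers = list(buffers)
        self.children = list(children)

def nbytes(node):
    # Sum buffer sizes at this node, then recurse into children. There is
    # deliberately no dict: shared buffers are double-counted, just as
    # NumPy's nbytes ignores sharing.
    total = sum(buf.nbytes for buf in node.buffers)
    return total + sum(nbytes(child) for child in node.children)

index = np.zeros(5, dtype=np.int64)   # 40 bytes
data = np.zeros(10, dtype=np.int64)   # 80 bytes
layout = Node(buffers=[index], children=[Node(buffers=[data])])
print(nbytes(layout))  # 120
```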
@jpivarski - I think it's done. Please check it when you have time. Thanks!
Great! This is much simpler and correct. I only updated the docstring to reflect the fact that there's no longer any attempt to avoid double-counting overlaps.