
C++ refactoring: nbytes #1138

Merged (8 commits) on Nov 12, 2021
Conversation

@ianna (Collaborator) commented Nov 9, 2021

No description provided.

@ianna ianna marked this pull request as draft November 9, 2021 13:14
codecov bot commented Nov 9, 2021

Codecov Report

Merging #1138 (bdf2ab2) into main (0b0fb3f) will increase coverage by 3.69%.
The diff coverage is 74.74%.

Impacted Files Coverage Δ
src/awkward/_v2/behaviors/string.py 100.00% <ø> (ø)
src/awkward/_v2/forms/bitmaskedform.py 75.64% <ø> (+7.69%) ⬆️
src/awkward/_v2/forms/bytemaskedform.py 76.05% <ø> (+7.04%) ⬆️
src/awkward/_v2/forms/indexedform.py 78.66% <ø> (+15.50%) ⬆️
src/awkward/_v2/operations/convert/ak_from_cupy.py 80.00% <0.00%> (ø)
src/awkward/_v2/operations/convert/ak_from_jax.py 80.00% <0.00%> (ø)
src/awkward/_v2/operations/convert/ak_from_json.py 80.00% <0.00%> (ø)
...rc/awkward/_v2/operations/convert/ak_from_numpy.py 36.53% <ø> (ø)
src/awkward/_v2/operations/convert/ak_to_cupy.py 80.00% <0.00%> (ø)
src/awkward/_v2/operations/convert/ak_to_jax.py 80.00% <0.00%> (ø)
... and 152 more

@ianna (Collaborator, Author) commented Nov 11, 2021

@jpivarski - perhaps related to this discussion: here is a recursive version of sys.getsizeof() that is extendable with custom handlers: https://code.activestate.com/recipes/577504/

An example of a NumpyArrayHandler:

def NumpyArrayHandler(obj):
    yield obj._data
    yield obj._data.ctypes.data
    yield obj._data.shape

a test case:

np_data = np.random.random(size=(4, 100 * 1024 * 1024 // 8 // 4))
array = ak.from_numpy(np_data, regulararray=False)
array = v1_to_v2(array.layout)

and a result:

print(total_size(array, {ak._v2.contents.NumpyArray: NumpyArrayHandler}, verbose=True))
48 <class 'awkward._v2.contents.numpyarray.NumpyArray'> <NumpyArray d...
</NumpyArray>
112 <class 'numpy.ndarray'> array([[0.798... 0.46882701]])
32 <class 'int'> 140219093352448
56 <class 'tuple'> (4, 3276800)
28 <class 'int'> 4
276

NumPy's nbytes returns:

array.nbytes()
104857600

The array object graph produced with objgraph: (image: array-graph)

@ianna ianna force-pushed the ianna/nbytes branch 2 times, most recently from 7b1ead1 to 21428f5 on November 11, 2021 12:06
@jpivarski (Member) commented:

The problem that recipe is solving is that sys.getsizeof is not deep, and most Python objects link to a lot of other objects. The situation is complicated because an object might link to some large data (or a cycle) that would not be deleted when the object gets deleted.

Fortunately, we don't need to solve the problem of figuring out how much memory is held by an ak.Array object graph. The array buffers are much larger than the Python objects, so just as NumPy's nbytes ignores the objects and only reports the data buffers (which aren't even counted in sys.getsizeof, recursive or not), we can ignore all the tuples and dicts in a Content hierarchy and just add up the sizes of the buffers.

As an nbytes property, it would just call NumPy's nbytes for all the Indexes and the NumpyArray data in the hierarchy, assuming no overlap. The function that takes overlap into account (not a method, as it will have to take multiple arrays to define what counts as overlap) will be more interesting.

@jpivarski (Member) commented:

In fact, in that overlaps function, two buffers can overlap without sharing the same pointer address. One might start at pointer address 0x1000000000000000 and have length 1024 while another starts at 0x1000000000000010 and have length 64. Overlaps are intervals in a one-dimensional space.
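This interval view of overlap can be sketched directly for contiguous NumPy arrays; buffer_interval and intervals_overlap below are illustrative helpers, not part of Awkward or NumPy:

```python
import numpy as np

def buffer_interval(arr):
    # For a C-contiguous array, the buffer spans [start, start + nbytes).
    start = arr.ctypes.data
    return start, start + arr.nbytes

def intervals_overlap(x, y):
    # Two half-open intervals overlap iff each one starts before the other ends.
    lo1, hi1 = buffer_interval(x)
    lo2, hi2 = buffer_interval(y)
    return lo1 < hi2 and lo2 < hi1

a = np.arange(128, dtype=np.int64)
b = a[2:10]                       # a view starting 16 bytes into a's buffer
c = np.arange(8, dtype=np.int64)  # an unrelated allocation

print(intervals_overlap(a, b))  # True: same underlying buffer
print(intervals_overlap(a, c))  # False: separate allocations
```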

@ianna (Collaborator, Author) commented Nov 11, 2021

@jpivarski - the nbytes property is done and tested. The high-level nbytes property is enabled. Since I'll be starting on a function that takes overlaps into account in the studies, I think this PR can be merged.

@ianna ianna marked this pull request as ready for review November 11, 2021 16:50
@ianna ianna requested a review from jpivarski November 11, 2021 16:51
@ianna ianna marked this pull request as draft November 11, 2021 17:25
@jpivarski jpivarski (Member) left a comment

The largest dict was a half-hearted attempt to avoid double-counting overlaps (it didn't handle all cases). The algorithm also doesn't translate well from C++ to Python, since the feature that made it work in Awkward 1.x has been lost: in Awkward 1.x, an array and a view into that array were guaranteed to have the same base pointer, ptr_.

a = ak.Array([1, 2, 3, 4, 5])
b = a[2:]
assert a.layout.ptr == b.layout.ptr

In the above, a and b's NumpyArray nodes would have the same ptr_ and b would have a non-zero offset_. That way, they'd have the same key in the largest dict, and only the larger of the two would count. (That's the half-hearted attempt at computing unions of intervals.)

In Awkward 2.x, we rely on NumPy for tracking views of buffers, and NumPy does not keep track of base pointers and offsets separately; it only keeps track of the start of the data of interest (and has a Python reference "base" for memory management).

>>> a = np.array([1, 2, 3, 4, 5])
>>> b = a[2:]
>>> assert a.ctypes.data + 16 == b.ctypes.data   # pointers differ by 2 * sizeof(int64)

Furthermore, id(self.ptr) is not a pointer to the data, it's a pointer to the Python integer that is equal to the pointer to the data. (And id is only a pointer in CPython, as a non-guaranteed implementation detail.) So this algorithm isn't going to work at all in Awkward v2.
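Incidentally, NumPy itself can answer the aliasing question from data pointers and strides: np.shares_memory (exact) and np.may_share_memory (bounds-based) are real NumPy functions.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = a[2:]

# b is a view: its data pointer is 16 bytes past a's, but NumPy can still
# tell that both arrays alias the same buffer.
print(np.shares_memory(a, b))         # True
print(np.shares_memory(a, a.copy()))  # False: a copy owns fresh memory
```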

Fortunately, it doesn't need to: NumPy's nbytes is simple-minded. It does not represent the true number of bytes held in memory by an array, for several reasons:

  • other arrays might be holding the same data, and freeing this one won't free the data
  • it takes a very naive approach to data with non-trivial strides.

Consider this:

>>> a = np.arange(2*3*5).reshape(2, 3, 5)
>>> a
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],

       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])
>>> a[:, ::2, ::4]
array([[[ 0,  4],
        [10, 14]],

       [[15, 19],
        [25, 29]]])

The slice picks out 8 of the 30 items, and it's a view that shares data with the original array. The original array has 8 * 2*3*5 == 240 bytes. The slice uses memory that goes all the way from the 0 to the 29, inclusive, so one might think that it should also be represented as having 240 bytes—after all, the buffer it needs to hold uses that much memory. However, nbytes returns 8 * 2*2*2 == 64, only counting the elements that you can see.
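The arithmetic can be checked directly (dtype pinned to int64 so the byte counts match the text; the span computation assumes the positive strides this view has):

```python
import numpy as np

a = np.arange(2 * 3 * 5, dtype=np.int64).reshape(2, 3, 5)
view = a[:, ::2, ::4]

print(a.nbytes)     # 240 == 8 * 2*3*5
print(view.nbytes)  # 64 == 8 * 2*2*2, counting only the visible elements

# The memory the view actually pins down spans the whole original buffer:
# distance from the first visible element to just past the last one.
span = sum((n - 1) * s for n, s in zip(view.shape, view.strides)) + view.itemsize
print(span)         # 240
```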

For an extreme case, consider this:

>>> a, b = np.broadcast_arrays(123, np.zeros((10, 10)))
>>> a
array([[123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123]])

The array a looks like it has 100 elements, and a.nbytes returns 8 * 100 == 800. However, this array is a trick: it's only storing a single 8-byte integer. The strides are zero, so it's showing you the same number over and over:

>>> a.shape
(10, 10)
>>> a.strides
(0, 0)

To prove it, try changing an element:

>>> a[0] = 999
<stdin>:1: DeprecationWarning: Numpy has detected that you (may be) writing to an array with
overlapping memory from np.broadcast_arrays. If this is intentional
set the WRITEABLE flag True or make a copy immediately before writing.
>>> a
array([[999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999]])

Okay, NumPy gives a warning now because this is surprising behavior, but my point stands. There's only one number here, the buffer used to represent this array has only 8 bytes, but a.nbytes returns 800.
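The same span computation exposes the trick (dtype pinned to int64 so a.nbytes matches the 800 in the text):

```python
import numpy as np

a, b = np.broadcast_arrays(np.array(123, dtype=np.int64), np.zeros((10, 10)))

print(a.nbytes)   # 800: size * itemsize, as if 100 distinct elements existed
print(a.strides)  # (0, 0): every "element" reads the same 8 bytes

# Minimal memory actually backing the array: a single element.
span = sum((n - 1) * s for n, s in zip(a.shape, a.strides)) + a.itemsize
print(span)       # 8
```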

Conclusion: the nbytes value doesn't represent how much memory an array takes; it's just a synonym for a.size * a.itemsize.

Therefore, Awkward Array's nbytes can be similarly simple. It only needs to walk recursively through the array, adding up

x.data.nbytes

for each Index, NumpyArray, and Identifier x. It doesn't need a dict because it's not trying to avoid double-counting. All of the care that we'd take to avoid double-counting would be in a new function, not named nbytes.
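A minimal sketch of that walk, with a toy Node class standing in for the Content/Index/Identifier hierarchy (the class and its attributes are illustrative assumptions, not Awkward's actual internals):

```python
import numpy as np

class Node:
    """Toy layout node: some NumPy-backed buffers plus child nodes."""
    def __init__(self, buffers=(), children=()):
        self.buffers = list(buffers)
        self.children = list(children)

def nbytes(node):
    # Simple-minded by design: add up buffer sizes recursively, with no
    # attempt to detect overlap or avoid double-counting.
    return sum(b.nbytes for b in node.buffers) + sum(
        nbytes(child) for child in node.children
    )

index = np.arange(10, dtype=np.int64)    # 80 bytes
data = np.zeros(100, dtype=np.float64)   # 800 bytes
layout = Node(buffers=[index], children=[Node(buffers=[data])])
print(nbytes(layout))  # 880
```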

@ianna ianna marked this pull request as ready for review November 12, 2021 13:16
@ianna (Collaborator, Author) commented Nov 12, 2021

@jpivarski - I think it's done. Please check it when you have time. Thanks!

@ianna ianna requested a review from jpivarski November 12, 2021 13:18
@jpivarski jpivarski (Member) left a comment

Great! This is much simpler and correct. I only updated the docstring to reflect the fact that there's no longer any attempt to avoid double-counting overlaps.

@jpivarski jpivarski enabled auto-merge (squash) November 12, 2021 15:22
@jpivarski jpivarski merged commit ae99508 into main Nov 12, 2021
@jpivarski jpivarski deleted the ianna/nbytes branch November 12, 2021 16:04
2 participants