C++ refactoring: nbytes #1138
Conversation
@jpivarski - perhaps related to this discussion: here is a recursive version. An example of a handler:

```python
def NumpyArrayHandler(obj):
    yield obj._data
    yield obj._data.ctypes.data
    yield obj._data.shape
```

a test case:

```python
np_data = np.random.random(size=(4, 100 * 1024 * 1024 // 8 // 4))
array = ak.from_numpy(np_data, regulararray=False)
array = v1_to_v2(array.layout)
```

and a result:

```
>>> print(total_size(array, {ak._v2.contents.NumpyArray: NumpyArrayHandler}, verbose=True))
48 <class 'awkward._v2.contents.numpyarray.NumpyArray'> <NumpyArray d...
</NumpyArray>
112 <class 'numpy.ndarray'> array([[0.798... 0.46882701]])
32 <class 'int'> 140219093352448
56 <class 'tuple'> (4, 3276800)
28 <class 'int'> 4
276
```

For comparison, NumPy's `array.nbytes` gives 104857600.
force-pushed from `7b1ead1` to `21428f5`
The problem that recipe is solving is measuring the memory held by a graph of Python objects. Fortunately, we don't need to solve the problem of figuring out how much memory is held by an ak.Array object graph: the array buffers are much larger than the Python objects, so just as NumPy's `nbytes` only counts the array buffer, ours only needs to count the buffers.
In fact, in that overlaps function, two buffers can overlap without sharing the same pointer address. One might start at pointer address 0x1000000000000000 and have length 1024 while another starts at 0x1000000000000010 and has length 64. Overlaps are intervals in a one-dimensional space.
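To make the interval picture concrete, here is a small sketch (not code from this PR) of checking whether two buffers, each described by a start address and a byte length, overlap as one-dimensional intervals:

```python
def buffers_overlap(ptr1, len1, ptr2, len2):
    # Two half-open intervals [ptr, ptr + len) overlap iff each one starts
    # before the other ends; equal start addresses are not required.
    return ptr1 < ptr2 + len2 and ptr2 < ptr1 + len1

# The example from the comment above: different start addresses, but the
# second buffer lies entirely inside the first.
assert buffers_overlap(0x1000000000000000, 1024, 0x1000000000000010, 64)
```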
The `largest` dict was a half-hearted attempt to not double-count overlaps (it didn't handle all cases). The algorithm also doesn't translate from C++ to Python well, since the feature that made it work in Awkward 1.x has been lost: in Awkward 1.x, an array and a view of that array were guaranteed to have the same base pointer, `ptr_`.
```python
a = ak.Array([1, 2, 3, 4, 5])
b = a[2:]
assert a.layout.ptr == b.layout.ptr
```
In the above, `a`'s and `b`'s NumpyArray nodes would have the same `ptr_`, and `b` would have a non-zero `offset_`. That way, they'd have the same key in the `largest` dict and only the larger of the two would count. (That's the half-hearted attempt at computing unions of intervals.)
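A minimal sketch of that half-hearted scheme (illustrative, not Awkward's actual code): key buffers by their base pointer and keep only the largest extent seen for each key.

```python
# Hypothetical sketch of the `largest` dict described above: buffers are
# keyed by base pointer (ptr_), and for each key only the largest
# offset + length is kept, so views sharing a base aren't double-counted.
# (It misses partial overlaps between buffers with different base pointers.)
def total_bytes(buffers):
    largest = {}
    for ptr, offset, length in buffers:  # (base pointer, byte offset, byte length)
        largest[ptr] = max(largest.get(ptr, 0), offset + length)
    return sum(largest.values())

# A full array and a view starting 16 bytes in, sharing base pointer 0x1000:
print(total_bytes([(0x1000, 0, 40), (0x1000, 16, 24)]))  # 40, not 64
```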
In Awkward 2.x, we rely on NumPy for tracking views of buffers, and NumPy does not keep track of base pointers and offsets separately; it only keeps track of the start of the data of interest (and has a Python reference "`base`" for memory management).
```python
>>> a = np.array([1, 2, 3, 4, 5])
>>> b = a[2:]
>>> assert a.ctypes.data + 16 == b.ctypes.data  # pointers differ by 2 * sizeof(int64)
```
Furthermore, `id(self.ptr)` is not a pointer to the data; it's a pointer to the Python integer that is equal to the pointer to the data. (And `id` is only a pointer in CPython, as a non-guaranteed implementation detail.) So this algorithm isn't going to work at all in Awkward v2.
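A quick illustration of that distinction (CPython-specific, since treating `id` as an address is an implementation detail):

```python
import numpy as np

a = np.arange(5)
ptr = a.ctypes.data   # integer value of the data buffer's address
# id(ptr) is the address of the Python int object holding that value,
# not the address of the array's data buffer:
print(ptr, id(ptr))   # two unrelated addresses (in CPython)
```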
Fortunately, it doesn't need to: NumPy's `nbytes` is simple-minded. It does not represent the true number of bytes held in memory by an array, for several reasons:
- other arrays might be holding the same data, and freeing this one won't free the data
- it takes a very naive approach to data with non-trivial strides.
Consider this:
```python
>>> a = np.arange(2*3*5).reshape(2, 3, 5)
>>> a
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14]],
       [[15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])
>>> a[:, ::2, ::4]
array([[[ 0,  4],
        [10, 14]],
       [[15, 19],
        [25, 29]]])
```
The slice picks out 8 of the 30 items, and it's a view that shares data with the original array. The original array has `8 * 2*3*5 == 240` bytes. The slice uses memory that goes all the way from the `0` to the `29`, inclusive, so one might think that it should also be represented as having 240 bytes; after all, the buffer it needs to hold uses that much memory. However, `nbytes` returns `8 * 2*2*2 == 64`, only counting the elements that you can see.
For an extreme case, consider this:
```python
>>> a, b = np.broadcast_arrays(123, np.zeros((10, 10)))
>>> a
array([[123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123],
       [123, 123, 123, 123, 123, 123, 123, 123, 123, 123]])
```
The array `a` looks like it has 100 elements, and `a.nbytes` returns `8 * 100 == 800`. However, this array is a trick: it's only storing a single 8-byte integer. The strides are zero, so it's showing you the same number over and over:
```python
>>> a.shape
(10, 10)
>>> a.strides
(0, 0)
```
To prove it, try changing an element:
```python
>>> a[0] = 999
<stdin>:1: DeprecationWarning: Numpy has detected that you (may be) writing to an array with
overlapping memory from np.broadcast_arrays. If this is intentional
set the WRITEABLE flag True or make a copy immediately before writing.
>>> a
array([[999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999],
       [999, 999, 999, 999, 999, 999, 999, 999, 999, 999]])
```
Okay, NumPy gives a warning now because this is surprising behavior, but my point stands. There's only one number here; the buffer used to represent this array has only 8 bytes, yet `a.nbytes` returns `800`.
Conclusion: the `nbytes` value doesn't represent how much memory an array takes; it's just a synonym for `a.size * a.itemsize`.
Therefore, Awkward Array's `nbytes` can be similarly simple. It only needs to walk recursively through the array, adding up `x.data.nbytes` for each Index, NumpyArray, and Identifier `x`. It doesn't need a dict because it's not trying to avoid double-counting. All of the care that we'd take to avoid double-counting would be in a new function, not named `nbytes`.
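The recursive walk could look something like this minimal sketch; `Node`, `buffers`, and `children` are illustrative stand-ins, not Awkward's actual layout API:

```python
import numpy as np

# Stand-in for a layout node (Index, NumpyArray, Identifier, ...): each node
# may hold some NumPy buffers directly and may have nested child nodes.
class Node:
    def __init__(self, buffers=(), children=()):
        self.buffers = list(buffers)
        self.children = list(children)

def nbytes(node):
    # Sum buffer sizes at this node, then recurse into children. There is
    # deliberately no dict: shared buffers are double-counted, just as
    # NumPy's nbytes ignores sharing.
    total = sum(buf.nbytes for buf in node.buffers)
    return total + sum(nbytes(child) for child in node.children)

index = np.zeros(5, dtype=np.int64)   # 40 bytes
data = np.zeros(10, dtype=np.int64)   # 80 bytes
layout = Node(buffers=[index], children=[Node(buffers=[data])])
print(nbytes(layout))  # 120
```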
@jpivarski - I think it's done. Please check it when you have time. Thanks!
Great! This is much simpler and correct. I only updated the docstring to reflect the fact that there's no longer any attempt to avoid double-counting overlaps.