Fix indexing inconsistencies #758

Merged: 52 commits into master from bug/754-getitem-indexing on Jun 2, 2021

Conversation

@ClaudiaComito (Contributor) commented Apr 15, 2021

Description

Fixes a few inconsistencies in the slicing of DNDarrays:

  1. Local shapes (lshapes) after slicing, especially when a process holds no data.
  2. Number of dimensions after slicing, especially if the slice contains only one element, or if the slice is along the split axis. We cannot be 100% consistent with numpy/torch here, as we need to keep the split dimension no matter what. Slicing one single element along the split axis results in the loss of the split dimension, i.e. in a DNDarray with split=None; the rank containing the element broadcasts it to the others.

Issues resolved: #656, #754, #770

Changes proposed:

  • Local shapes are now always consistent with the global shape of the sliced DNDarray, including when local tensors are empty.
  • The "sliced" dimension is always kept when the key is a slice().
  • The "sliced" dimension is always kept if it corresponds to the split axis (N.B. different from numpy/torch).
  • Advanced indexing now works with lists, DNDarrays, torch tensors and ndarrays (see the sketch after this list).
  • Distributed advanced indexing: the valid (i.e. present on the rank) local indices of key[split] must be applied to all other dimensions as well.
  • Tests updated.
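A minimal sketch of the advanced-indexing behaviour listed above (hypothetical usage, not a test from this PR):

```python
import heat as ht
import numpy as np
import torch

x = ht.zeros((4, 3), split=0)  # distributed along axis 0

# all of these key types now select rows 0 and 2 of x
x[[0, 2]]                # Python list
x[ht.array([0, 2])]      # DNDarray
x[torch.tensor([0, 2])]  # torch tensor
x[np.array([0, 2])]      # NumPy ndarray
```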


Type of change

  • Bug fix and breaking change: getting one item along the split dimension now results in the loss of the split axis, hence in a non-distributed DNDarray (the item is broadcast to the other processes); see the sketch below.
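A minimal sketch of the breaking change (hypothetical usage; the resulting split is stated as a comment):

```python
import heat as ht

x = ht.arange(10, split=0)  # distributed along axis 0
y = x[3]                    # one single item along the split axis
# y.split is None: the owning rank broadcasts the value, and the
# result is the same non-distributed DNDarray on every process
```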

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

@codecov (bot) commented Apr 15, 2021

Codecov Report

Merging #758 (cdf3e95) into master (dcebadc) will increase coverage by 1.69%.
The diff coverage is 95.49%.


```
@@            Coverage Diff             @@
##           master     #758      +/-   ##
==========================================
+ Coverage   89.14%   90.84%   +1.69%
==========================================
  Files          64       64
  Lines        9510     8921     -589
==========================================
- Hits         8478     8104     -374
+ Misses       1032      817     -215
```

| Flag | Coverage Δ |
|------|------------|
| gpu  | ? |
| unit | 90.84% <95.49%> (-0.06%) ⬇️ |


| Impacted Files | Coverage Δ |
|----------------|------------|
| heat/core/_operations.py | 95.74% <ø> (-0.05%) ⬇️ |
| heat/core/linalg/solver.py | 69.00% <ø> (+3.28%) ⬆️ |
| heat/naive_bayes/gaussianNB.py | 93.58% <91.66%> (+21.88%) ⬆️ |
| heat/core/dndarray.py | 96.50% <95.74%> (+28.20%) ⬆️ |
| heat/core/linalg/basics.py | 92.76% <100.00%> (+27.20%) ⬆️ |
| heat/optim/dp_optimizer.py | 24.19% <0.00%> (-71.89%) ⬇️ |
| heat/core/devices.py | 86.66% <0.00%> (-11.12%) ⬇️ |
| heat/nn/data_parallel.py | 84.13% <0.00%> (-10.35%) ⬇️ |
| heat/core/communication.py | 89.72% <0.00%> (-6.98%) ⬇️ |
| heat/core/tests/test_suites/basic_test.py | 91.26% <0.00%> (-4.86%) ⬇️ |

... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ClaudiaComito linked an issue on Apr 15, 2021 that may be closed by this pull request
@ClaudiaComito changed the title from "Fix indexing inconsistencies" to "DRAFT, STILL EDITING Fix indexing inconsistencies" on Apr 15, 2021
@ClaudiaComito marked this pull request as ready for review on April 15, 2021 09:03
@ClaudiaComito (Contributor, Author)

@ben-bou I can't ask you to review, but I'll be happy to hear feedback.

@ClaudiaComito added the "bug" label (Something isn't working) on Apr 15, 2021
@Inzlinger (Collaborator)

Looks like the problems I had in #656 are fixed with this, thanks.
Do you have time to look into #703 as well? Otherwise I might take a shot at it, or try to find a workaround, as I also need it for #760.

@ben-bou (Collaborator) commented May 13, 2021

I got this after indexing a 2D dndarray with (0,0), so receiving a scalar (0d).

File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/dndarray.py", line 887, in __getitem__
    arr = self.comm.bcast(arr, root=active_rank)
  File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 630, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 631, in mpi4py.MPI.PyMPI_bcast
mpi4py.MPI.Exception: Message truncated, error stack:
PMPI_Bcast(448)...............: MPI_Bcast(buf=0x7ffc2d95e9b0, count=1, MPI_INT, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Bcast(434)...............: 
MPID_Bcast(76)................: 
MPIR_Bcast_impl(310)..........: 
MPIR_Bcast_intra_auto(223)....: 
MPIR_Bcast_intra_binomial(112): 
(unknown)(): Message truncated

@ClaudiaComito (Contributor, Author)

> I got this after indexing a 2D dndarray with (0,0), so receiving a scalar (0d). [...]

@ben-bou I can't reproduce this locally. Can you post your code? Thanks a lot!

@ClaudiaComito marked this pull request as ready for review on May 17, 2021 09:30
@ClaudiaComito linked an issue on May 17, 2021 that may be closed by this pull request
@ClaudiaComito changed the title from "DRAFT Fix indexing inconsistencies" to "Fix indexing inconsistencies" on May 17, 2021
@ben-bou (Collaborator) commented May 17, 2021

> @ben-bou I can't reproduce this locally. Can you post your code? Thanks a lot!

@ClaudiaComito So it's actually not reproducible, but rather the same problem you encountered in the tests: in a loop, the first 4 bcasts fail, after that it works. How did you figure out where those 'wrong' broadcasts were happening?

@ClaudiaComito (Contributor, Author)

> So it's actually not reproducible but rather the same problem you encountered in the tests: in a loop, the first 4 bcasts fail, after that it works. How did you figure out where those 'wrong' broadcasts were happening?

Could it be that you have a loop calling something like a[i] where i depends on rank?
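For illustration, a hypothetical sketch of that anti-pattern, assuming the single-element broadcast behaviour introduced in this PR:

```python
import heat as ht

a = ht.arange(8, split=0)  # distributed along axis 0
i = a.comm.rank            # rank-dependent index into the split axis
value = a[i]               # each rank now posts a bcast with a
                           # different root, so the collective calls
                           # mismatch and MPI reports truncated messages
```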

@ben-bou (Collaborator) commented May 17, 2021

> Could it be that you have a loop calling something like a[i] where i depends on rank?

@ClaudiaComito You changed the broadcasting in `__getitem__` from `Bcast` to `bcast`. Is there a reason for that?

I tried changing it back to Bcast, but got this error:

  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/dndarray.py", line 892, in __getitem__
    self.comm.Bcast(arr, active_rank)
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 704, in Bcast
    ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 689, in __broadcast_like
    return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 326, in as_buffer
    mpi_type, elements = cls.mpi_type_and_elements_of(obj, counts, displs)
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 281, in mpi_type_and_elements_of
    strides[0] = obj.stride()[-1]

Which I believe shouldn't happen; `mpi_type_and_elements_of` should be able to handle all kinds of tensors.

As a side note, `bcast`, as currently used in this branch, pickles the entire object. I'm unsure whether this is efficient, especially given the strides: wouldn't this bcast the entire memory footprint of the tensor to all processes, and only generate a view of a single slice afterwards?

EDIT2: the error had nothing to do with `bcast` or `__getitem__`; the error messages just made it seem that way. Thanks anyway. The questions about the strides in `mpi_type_and_elements_of` and the efficiency of `bcast` still stand, but are probably outside the context of this PR.
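For reference, a minimal sketch (not code from this PR) of why a strided view is the awkward case here, and the usual mitigation of materializing the slice before communicating it:

```python
import torch

base = torch.arange(16).reshape(4, 4)
col = base[:, 1]           # non-contiguous view into base's storage
packed = col.contiguous()  # standalone copy of just the 4 elements

# broadcasting `packed` instead of `col` guarantees that only the
# sliced data is serialized, however the view itself would be pickled
```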

@coquelin77 (Member) left a comment

Looks good to me. Only minor structural changes are needed as far as I can tell; the actual code proposed looks good.

@ClaudiaComito changed the base branch from release/1.0.x to master on June 1, 2021 09:31
@ClaudiaComito (Contributor, Author) commented Jun 2, 2021

> @ClaudiaComito You changed the broadcasting in `__getitem__` from `Bcast` to `bcast`. Is there a reason for that? [...]

@ben-bou indeed, I switched to `bcast` because `Bcast` was giving me problems; thanks for reminding me that I needed to look into this. We seem to have a bug when communicating slices of an object (#784), so for the time being I'm going to keep `bcast` and will replace it with `Bcast` later.
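For context, the standard mpi4py distinction being discussed (general mpi4py usage, not code from this PR; `comm` is assumed to be an MPI communicator):

```python
# lowercase bcast: arbitrary Python objects, serialized via pickle
obj = comm.bcast(obj, root=0)

# uppercase Bcast: buffer-protocol objects (NumPy arrays, contiguous
# tensors), sent as raw bytes without pickling; faster, but the
# receive buffer must already have the matching size and dtype
comm.Bcast(buf, root=0)
```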

@coquelin77 (Member)

rerun tests

@coquelin77 merged commit 3dc9aca into master on Jun 2, 2021
@coquelin77 deleted the bug/754-getitem-indexing branch on June 2, 2021 13:30
Labels: bug (Something isn't working)
Participants: 4