Fix indexing inconsistencies #758

Merged: 52 commits into master from bug/754-getitem-indexing on Jun 2, 2021

Conversation

@ClaudiaComito (Contributor) commented Apr 15, 2021

Description

Fixes a few inconsistencies in the slicing of DNDarrays:

  1. Local shapes (lshapes) after slicing, especially when a process holds no data.
  2. Number of dimensions after slicing, especially if the slice contains only one element, or if the slice is along the split axis. We cannot be 100% consistent with numpy/torch here, as we need to keep the split dimension no matter what. Slicing one single element along the split axis results in the loss of the split dimension, i.e. in a DNDarray with split=None; the rank containing the element broadcasts it to the others.

Issues resolved: #656, #754, #770

Changes proposed:

  • Local shapes are now always consistent with the global shape of the sliced DNDarray, including when local tensors are empty.
  • The "sliced" dimension is always kept when the key is a slice().
  • The "sliced" dimension is always kept if it corresponds to the split axis (N.B. different from numpy/torch).
  • Advanced indexing now works with lists, DNDarrays, torch tensors and ndarrays (see the sketch after this list).
  • Distributed advanced indexing: the valid (i.e. present on the rank) local indices of key[split] must be applied to all other dimensions as well.
  • Tests updated.
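A minimal sketch of the advanced-indexing behaviour listed above (hypothetical usage, not a test from this PR):

```python
import heat as ht
import numpy as np
import torch

x = ht.zeros((4, 3), split=0)  # distributed along axis 0

# all of these key types now select rows 0 and 2 of x
x[[0, 2]]                # Python list
x[ht.array([0, 2])]      # DNDarray
x[torch.tensor([0, 2])]  # torch tensor
x[np.array([0, 2])]      # NumPy ndarray
```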


Type of change

  • Bug fix and breaking change: getting one item along the split dimension now results in the loss of the split axis, hence in a non-distributed DNDarray (the item is broadcast to the other processes); see the sketch below.
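A minimal sketch of the breaking change (hypothetical usage; the resulting split is stated as a comment):

```python
import heat as ht

x = ht.arange(10, split=0)  # distributed along axis 0
y = x[3]                    # one single item along the split axis
# y.split is None: the owning rank broadcasts the value, and the
# result is the same non-distributed DNDarray on every process
```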

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

no

@codecov (bot) commented Apr 15, 2021

Codecov Report

Merging #758 (cdf3e95) into master (dcebadc) will increase coverage by 1.69%.
The diff coverage is 95.49%.


```
@@            Coverage Diff             @@
##           master     #758      +/-   ##
==========================================
+ Coverage   89.14%   90.84%   +1.69%
==========================================
  Files          64       64
  Lines        9510     8921     -589
==========================================
- Hits         8478     8104     -374
+ Misses       1032      817     -215
```

| Flag | Coverage Δ |
|------|------------|
| gpu  | ? |
| unit | 90.84% <95.49%> (-0.06%) ⬇️ |


| Impacted Files | Coverage Δ |
|----------------|------------|
| heat/core/_operations.py | 95.74% <ø> (-0.05%) ⬇️ |
| heat/core/linalg/solver.py | 69.00% <ø> (+3.28%) ⬆️ |
| heat/naive_bayes/gaussianNB.py | 93.58% <91.66%> (+21.88%) ⬆️ |
| heat/core/dndarray.py | 96.50% <95.74%> (+28.20%) ⬆️ |
| heat/core/linalg/basics.py | 92.76% <100.00%> (+27.20%) ⬆️ |
| heat/optim/dp_optimizer.py | 24.19% <0.00%> (-71.89%) ⬇️ |
| heat/core/devices.py | 86.66% <0.00%> (-11.12%) ⬇️ |
| heat/nn/data_parallel.py | 84.13% <0.00%> (-10.35%) ⬇️ |
| heat/core/communication.py | 89.72% <0.00%> (-6.98%) ⬇️ |
| heat/core/tests/test_suites/basic_test.py | 91.26% <0.00%> (-4.86%) ⬇️ |

... and 9 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ClaudiaComito linked an issue on Apr 15, 2021 that may be closed by this pull request
@ClaudiaComito changed the title from "Fix indexing inconsistencies" to "DRAFT, STILL EDITING Fix indexing inconsistencies" on Apr 15, 2021
@ClaudiaComito marked this pull request as ready for review on April 15, 2021 09:03
@ClaudiaComito (Contributor, Author)

@ben-bou I can't ask you to review, but I'll be happy to hear feedback.

@ClaudiaComito added the "bug" label (Something isn't working) on Apr 15, 2021
@Inzlinger (Collaborator)

Looks like the problems I had in #656 are fixed with this, thanks.
Do you have time to look into #703 as well? Otherwise I might take a shot at it, or try to find a workaround, as I also need it for #760.

@ben-bou (Collaborator) commented May 13, 2021

I got this after indexing a 2D dndarray with (0,0), so receiving a scalar (0d).

File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/dndarray.py", line 887, in __getitem__
    arr = self.comm.bcast(arr, root=active_rank)
  File "mpi4py/MPI/Comm.pyx", line 1257, in mpi4py.MPI.Comm.bcast
  File "mpi4py/MPI/msgpickle.pxi", line 630, in mpi4py.MPI.PyMPI_bcast
  File "mpi4py/MPI/msgpickle.pxi", line 631, in mpi4py.MPI.PyMPI_bcast
mpi4py.MPI.Exception: Message truncated, error stack:
PMPI_Bcast(448)...............: MPI_Bcast(buf=0x7ffc2d95e9b0, count=1, MPI_INT, root=0, comm=MPI_COMM_WORLD) failed
PMPI_Bcast(434)...............: 
MPID_Bcast(76)................: 
MPIR_Bcast_impl(310)..........: 
MPIR_Bcast_intra_auto(223)....: 
MPIR_Bcast_intra_binomial(112): 
(unknown)(): Message truncated

@ClaudiaComito (Contributor, Author)

> I got this after indexing a 2D dndarray with (0,0), so receiving a scalar (0d). [...]

@ben-bou I can't reproduce this locally. Can you post your code? Thanks a lot!

@ClaudiaComito marked this pull request as ready for review on May 17, 2021 09:30
@ClaudiaComito linked an issue on May 17, 2021 that may be closed by this pull request
@ClaudiaComito changed the title from "DRAFT Fix indexing inconsistencies" to "Fix indexing inconsistencies" on May 17, 2021
@ben-bou (Collaborator) commented May 17, 2021

> @ben-bou I can't reproduce this locally. Can you post your code? Thanks a lot!

@ClaudiaComito So it's actually not reproducible, but rather the same problem you encountered in the tests: in a loop, the first 4 bcasts fail, after that it works. How did you figure out where those 'wrong' broadcasts were happening?

@ClaudiaComito (Contributor, Author)

> So it's actually not reproducible but rather the same problem you encountered in the tests: in a loop, the first 4 bcasts fail, after that it works. How did you figure out where those 'wrong' broadcasts were happening?

Could it be that you have a loop calling something like a[i] where i depends on rank?
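For illustration, a hypothetical sketch of that anti-pattern, assuming the single-element broadcast behaviour introduced in this PR:

```python
import heat as ht

a = ht.arange(8, split=0)  # distributed along axis 0
i = a.comm.rank            # rank-dependent index into the split axis
value = a[i]               # each rank now posts a bcast with a
                           # different root, so the collective calls
                           # mismatch and MPI reports truncated messages
```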

@ben-bou (Collaborator) commented May 17, 2021

> Could it be that you have a loop calling something like a[i] where i depends on rank?

@ClaudiaComito You changed the broadcasting in `__getitem__` from `Bcast` to `bcast`. Is there a reason for that?

I tried changing it back to Bcast, but got this error:

  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/dndarray.py", line 892, in __getitem__
    self.comm.Bcast(arr, active_rank)
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 704, in Bcast
    ret, sbuf, rbuf, buf = self.__broadcast_like(self.handle.Bcast, buf, root)
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 689, in __broadcast_like
    return func(self.as_buffer(srbuf), root), srbuf, srbuf, buf
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 326, in as_buffer
    mpi_type, elements = cls.mpi_type_and_elements_of(obj, counts, displs)
  File "/p/project/cslts/local/juwels/HeAT/experimental_HeAT/lib/python3.8/site-packages/heat/core/communication.py", line 281, in mpi_type_and_elements_of
    strides[0] = obj.stride()[-1]

Which I believe shouldn't happen; `mpi_type_and_elements_of` should be able to handle all kinds of tensors.

As a side note, `bcast`, as currently used in this branch, pickles the entire object. I'm unsure whether this is efficient, especially given the strides: wouldn't this bcast the entire memory footprint of the tensor to all processes, and only generate a view of a single slice afterwards?

EDIT2: the error had nothing to do with `bcast` or `__getitem__`; the error messages just made it seem that way. Thanks anyway. The questions about the strides in `mpi_type_and_elements_of` and the efficiency of `bcast` still stand, but are probably outside the context of this PR.
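For reference, a minimal sketch (not code from this PR) of why a strided view is the awkward case here, and the usual mitigation of materializing the slice before communicating it:

```python
import torch

base = torch.arange(16).reshape(4, 4)
col = base[:, 1]           # non-contiguous view into base's storage
packed = col.contiguous()  # standalone copy of just the 4 elements

# broadcasting `packed` instead of `col` guarantees that only the
# sliced data is serialized, however the view itself would be pickled
```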

@coquelin77 (Member) left a comment

Looks good to me. Only minor structural changes are needed as far as I can tell; the actual code proposed looks good.

@ClaudiaComito changed the base branch from release/1.0.x to master on June 1, 2021 09:31
@ClaudiaComito (Contributor, Author) commented Jun 2, 2021

> @ClaudiaComito You changed the broadcasting in `__getitem__` from `Bcast` to `bcast`. Is there a reason for that? [...]

@ben-bou indeed, I switched to `bcast` because `Bcast` was giving me problems; thanks for reminding me that I needed to look into this. We seem to have a bug when communicating slices of an object (#784), so for the time being I'm going to keep `bcast` and will replace it with `Bcast` later.
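For context, the standard mpi4py distinction being discussed (general mpi4py usage, not code from this PR; `comm` is assumed to be an MPI communicator):

```python
# lowercase bcast: arbitrary Python objects, serialized via pickle
obj = comm.bcast(obj, root=0)

# uppercase Bcast: buffer-protocol objects (NumPy arrays, contiguous
# tensors), sent as raw bytes without pickling; faster, but the
# receive buffer must already have the matching size and dtype
comm.Bcast(buf, root=0)
```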

@coquelin77 (Member)

rerun tests

@coquelin77 merged commit 3dc9aca into master on Jun 2, 2021
@coquelin77 deleted the bug/754-getitem-indexing branch on June 2, 2021 13:30
Labels: bug (Something isn't working)
Participants: 4