
Allow vector-based indices #181

Merged
meggart merged 7 commits into JuliaIO:main on Oct 21, 2024
Conversation

ConnectedSystems
Contributor

Addresses #180 caused by lack of support for vector-based indexing.

(related issue: JuliaDataCubes/YAXArrays.jl#416)

@meggart
Collaborator

meggart commented Jul 26, 2024

The main reason this is tricky is that it is quite hard to determine what eachchunk should return for arbitrary views into a chunked dimension. Just not throwing the error does not solve the underlying problem and will give wrong results for eachchunk.

The key functionality here is the subsetchunks function. A small example: for an array dimension of size 100 with a chunk size of 10, creating a view into indices 5:25 returns the correct chunk sizes:

using DiskArrays
chunks = DiskArrays.RegularChunks(10,0,100)
subset = 5:25
DiskArrays.subsetchunks(chunks,subset)
3-element DiskArrays.RegularChunks:
 1:6
 7:16
 17:21
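The arithmetic behind this grouping can be sketched outside of Julia. Below is a small illustration in Python (not the DiskArrays implementation; `group_by_chunk` is a hypothetical helper) that splits view positions by the source chunk each index falls in, reproducing the 1:6, 7:16, 17:21 output above:

```python
def group_by_chunk(chunk_size, subset):
    """Group 1-based positions of `subset` by the source chunk each index
    falls in. A new group starts whenever consecutive subset entries land
    in different chunks of the parent dimension."""
    groups, start = [], 0
    for pos in range(1, len(subset)):
        if (subset[pos] - 1) // chunk_size != (subset[pos - 1] - 1) // chunk_size:
            groups.append(range(start + 1, pos + 1))  # 1-based inclusive, like 1:6
            start = pos
    groups.append(range(start + 1, len(subset) + 1))
    return groups

# View into indices 5:25 of a size-100 dimension chunked by 10
print([f"{g.start}:{g.stop - 1}" for g in group_by_chunk(10, list(range(5, 26)))])
# ['1:6', '7:16', '17:21']
```

The same rule applied to the vector example below ([1,2,3,15,16,17,18,91,92,93]) yields the groups 1:3, 4:7, 8:10.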

However, indexing with vectors gets a bit more tricky. It still works in many cases through subsetchunks_fallback; for example,

subset = [1,2,3,15,16,17,18,91,92,93]
DiskArrays.subsetchunks(chunks,subset)
3-element DiskArrays.IrregularChunks:
 1:3
 4:7
 8:10

returns the expected result. The main reason this was not yet exposed to users is that I did not have time to test it, so help here would be appreciated. Also, many cases still fail even though they would not have to:

subset = [1,3,2]
DiskArrays.subsetchunks(chunks,subset)

fails, although all elements come from the same chunk and one could easily return [1:3]. A bit more tricky would be cases where the view actually jumps between chunks:

subset = [1,2,15,16,3,4]

Probably this should return [1:2,3:4,5:6], but this might lead to some missed optimizations when reading the data later, because the first chunk would be touched twice...

For now it would be great if you want to work on this by either

  • adding some tests for the current implementation of subsetchunks to clearly show what works and what doesn't (also broken tests help) and/or
  • adding more functionality to subsetchunks_fallback to make it work reasonably in more cases.

@ConnectedSystems
Contributor Author

ConnectedSystems commented Jul 26, 2024

I guess I was being too optimistic to think it was going to be a simple fix.

Let me see what I can do.

@ConnectedSystems
Contributor Author

Also, many cases still fail although they would not have to:
subset = [1,3,2]
DiskArrays.subsetchunks(chunks,subset)

fails although all elements come from the same chunk and one could easily return [1:3]. A bit more tricky would be cases where the view actually jumps between chunks

This is additionally tricky because the user would expect the subset to be ordered as requested...

@meggart
Collaborator

meggart commented Jul 26, 2024

This is additionally tricky because the user would expect the subset to be ordered as requested...

This will be handled correctly by the actual getindex call; there are methods to read the data chunk by chunk and then put it into the correct place. Here we just need to deal with the "metadata" for DiskArrays, so that downstream packages like YAXArrays can make a good guess about which chunks to group the data into for large batch processing.
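The chunk-by-chunk read with a scatter into the requested output order can be sketched as follows. This is a simplified Python model of the idea, not the DiskArrays code; `gather_by_chunk` and the 0-based indexing are illustrative assumptions:

```python
import numpy as np

def gather_by_chunk(src, chunk_size, idx):
    """Read each touched chunk once, then scatter its values to their
    requested output positions, so the caller sees the order they asked for."""
    out = np.empty(len(idx), dtype=src.dtype)
    # Collect output positions per source chunk
    positions_by_chunk = {}
    for pos, i in enumerate(idx):
        positions_by_chunk.setdefault(i // chunk_size, []).append(pos)
    for c, positions in positions_by_chunk.items():
        block = src[c * chunk_size:(c + 1) * chunk_size]  # one "disk read" per chunk
        for pos in positions:
            out[pos] = block[idx[pos] - c * chunk_size]
    return out

src = np.arange(100)
print(gather_by_chunk(src, 10, [0, 1, 14, 15, 2, 3]).tolist())
# [0, 1, 14, 15, 2, 3]
```

Each chunk is pulled from storage only once, yet the result still comes back in the requested (unsorted) order.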

@ConnectedSystems
Contributor Author

ConnectedSystems commented Jul 26, 2024

I'm not claiming this is completely fixed, but any thoughts on this approach? (see latest commits)

I think it works in the intended manner based on the examples given above
(see in particular the last example, which I've added).

I'm just after a general comment on the approach; I recognize I haven't caught all cases yet.

using DiskArrays

chunks = DiskArrays.RegularChunks(10,0,100)
subset = 5:25
DiskArrays.subsetchunks(chunks,subset)
# 3-element DiskArrays.RegularChunks:
#  1:6
#  7:16
#  17:21

subset = [1,2,3,15,16,17,18,91,92,93]
DiskArrays.subsetchunks(chunks,subset)
# 3-element DiskArrays.IrregularChunks:
#  1:3
#  4:7
#  8:10

subset = [1,3,2]
DiskArrays.subsetchunks(chunks, subset)
# 1-element DiskArrays.RegularChunks:
#  1:3

subset = [1,2,15,16,3,4]
DiskArrays.subsetchunks(chunks, subset)
# 2-element DiskArrays.RegularChunks:
#  1:4
#  5:6


x = [1, 2, 3, 16, 17, 4, 5, 98]
DiskArrays.subsetchunks(chunks, x)
# 3-element DiskArrays.IrregularChunks:
#  1:5
#  6:7
#  8:8

Extending my MWE for YAXArrays.jl:

using Revise, Infiltrator
using NetCDF
using YAXArrays
using DiskArrays


axlist = (
    Dim{:v1}(range(1, 10, length=10)),
    Dim{:v2}([string("x$(i)") for i in 1:100])
)
test_arr = YAXArray(axlist, rand(10,100))
test_arr = setchunks(test_arr, (10, 10))  # setchunks returns a new array
savecube(test_arr, "test_cube.nc", driver=:netcdf, overwrite=true)

ds = open_dataset("test_cube.nc")
disk_arr = ds.layer

# These all pass...
@assert all(read(disk_arr[v1=[1, 3]]) .== test_arr[[1, 3], :].data)
@assert all(read(disk_arr[v2=At(["x1", "x2"])]) .== test_arr[:, 1:2].data)
@assert all(read(disk_arr[v2=At(["x1", "x90"])]) .== test_arr[:, [1, 90]].data)
@assert all(read(disk_arr[v2=[1,2,15,16,3,4]]) .== test_arr[:, [1,2,15,16,3,4]].data)
@assert all(read(disk_arr[v2=[1, 2, 3, 16, 17, 4, 5, 98, 97]]) .== test_arr[:, [1, 2, 3, 16, 17, 4, 5, 98, 97]].data)

@meggart
Collaborator

meggart commented Jul 26, 2024

These two look wrong to me:

subset = [1,2,15,16,3,4]
DiskArrays.subsetchunks(chunks, subset)
# 2-element DiskArrays.RegularChunks:
#  1:4
#  5:6


x = [1, 2, 3, 16, 17, 4, 5, 98]
DiskArrays.subsetchunks(chunks, x)
# 3-element DiskArrays.IrregularChunks:
#  1:5
#  6:7
#  8:8

I think the saner behavior would be to return [1:2,3:4,5:6] for the first one and [1:3,4:5,6:7,8:8] for the latter example, so that data from different chunks does not get mixed into a single chunk of the resulting view; I don't think simply sorting the inputs helps here. As I said, if you could write a few tests, even broken ones, that would be great, and then I could have another look at the implementation as well.
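That grouping rule, starting a new chunk of the view whenever consecutive indices land in different source chunks, can be sketched like so. This is a hedged Python illustration of the rule, not the actual subsetchunks_fallback code; `chunk_runs` is a hypothetical name:

```python
def chunk_runs(chunk_size, subset):
    """Split 1-based `subset` positions into runs; a new run begins whenever
    consecutive indices fall in different source chunks, so data from
    different chunks never shares a run."""
    runs, start = [], 0
    for pos in range(1, len(subset)):
        if (subset[pos] - 1) // chunk_size != (subset[pos - 1] - 1) // chunk_size:
            runs.append((start + 1, pos))
            start = pos
    runs.append((start + 1, len(subset)))
    return runs

print(chunk_runs(10, [1, 2, 15, 16, 3, 4]))         # [(1, 2), (3, 4), (5, 6)]
print(chunk_runs(10, [1, 2, 3, 16, 17, 4, 5, 98]))  # [(1, 3), (4, 5), (6, 7), (8, 8)]
print(chunk_runs(10, [1, 3, 2]))                    # [(1, 3)]
```

Note that unsorted indices within one chunk (the [1,3,2] case) still form a single run; only crossing a chunk boundary splits them.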

@ConnectedSystems
Contributor Author

ConnectedSystems commented Jul 26, 2024

So that data from different chunks does not get mixed into a single chunk of the resulting view

The thing is, I'm not too familiar with the internals here.
To me, if the chunk size is 10, then it makes sense to combine the two subsets ([1,2] and [3,4]) to avoid re-visiting the same chunk twice.

As I indicate in the MWE above (albeit with YAXArrays), the selection behaviour is as expected.

I'll try to mock up another example with PseudoDiskArray to investigate more when I have the time (following the example in the readme).
At least that way we're all on the same conceptual page.
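The tradeoff being debated here, combining same-chunk runs to avoid re-reading a chunk versus keeping runs separate for chunk-aligned downstream batching, can be made concrete with a small counter. This is an illustrative Python sketch under the assumption of one read per run and no caching; `count_chunk_reads` is a hypothetical name:

```python
def count_chunk_reads(chunk_size, subset):
    """Count disk reads if each run of consecutive same-chunk indices
    triggers one read of its chunk (no caching across runs)."""
    reads, prev_chunk = 0, None
    for i in subset:
        c = (i - 1) // chunk_size
        if c != prev_chunk:
            reads += 1
            prev_chunk = c
    return reads

print(count_chunk_reads(10, [1, 2, 15, 16, 3, 4]))          # 3: chunk 1 is read twice
print(count_chunk_reads(10, sorted([1, 2, 15, 16, 3, 4])))  # 2
```

So sorting (or combining runs) saves a read, at the cost of mixing data from different source chunks into one chunk of the resulting view.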

@ConnectedSystems
Contributor Author

@meggart I'm definitely missing some context here, but as far as I can tell the behaviour is now as I would expect.

The failing tests that appear relevant to this PR are those that check that unsorted indices cannot be used and those that check for a specific order of chunks.

Both of these are now allowed/supported with the proposed change.

Incidentally, I did mock up an example using PseudoDiskArray, but it's a bit redundant as I've committed tests which do the same thing. For completeness, I include it below:

# PseudoDiskArray is the example chunked array type from the DiskArrays readme
test_arr = rand(100,100,100)
a = PseudoDiskArray(test_arr; chunksize=(10,10,20))

idx = rand(1:100)
@assert all(a[idx, :, :] .== test_arr[idx, :, :])
@assert all(a[:, idx, :] .== test_arr[:, idx, :])
@assert all(a[:, :, idx] .== test_arr[:, :, idx])

sel = [1,3,4,5,10,11,19,50,51,96,95,94,83,82,81]
@assert all(a[:, sel, :] .== test_arr[:, sel, :])
@assert all(a[sel, :, :] .== test_arr[sel, :, :])
@assert all(a[sel, :, sel] .== test_arr[sel, :, sel])

bit_sel = rand(100) .> 0.5
@assert all(a[bit_sel, :, :] .== test_arr[bit_sel, :, :])
@assert all(a[:, bit_sel, :] .== test_arr[:, bit_sel, :])
@assert all(a[:, bit_sel, bit_sel] .== test_arr[:, bit_sel, bit_sel])

@meggart
Collaborator

meggart commented Aug 1, 2024

Thanks a lot for adding the tests and sorry for the delay, I will have a look this afternoon

@ConnectedSystems
Contributor Author

Thanks @meggart

Just wanted to say that the "correct" behaviour I expect may be a coincidental side effect. If subsetchunks is not behaving appropriately, as you say, then I'll devote some time to looking into it.

@meggart
Collaborator

meggart commented Aug 1, 2024

No need to look into it; I took the liberty of implementing the desired behavior. In this case the issue is not about correctness, but downstream application performance might be affected in strange ways. Now I have no idea why this fails on nightly but not on 1.10.

@Balinus

Balinus commented Sep 13, 2024

Hello!

Is this going to be merged soon? Tests are failing for a package that is being developed, and I'd prefer to avoid hardcoding this branch into the package.

Cheers!

@rafaqz
Collaborator

rafaqz commented Sep 13, 2024

Apparently @meggart is away and I'm finishing a PhD, so honestly not likely soon

@Balinus

Balinus commented Sep 13, 2024

Apparently @meggart is away and I'm finishing a PhD, so honestly not likely soon

Good luck with your PhD! I know how it can be stressful and time-consuming at the end.

No worries, I will look into a workaround for CI. Cheers!

@lazarusA
Contributor

It looks like the last commit was successful. Does this branch now do the right thing, @Balinus?

@ConnectedSystems
Contributor Author

FWIW this is working for me @lazarusA

@Balinus

Balinus commented Sep 27, 2024

it looks like last commit was successful. Does this branch now do the right thing? @Balinus ?

Should I test a specific commit/branch? I had tested with this branch (ConnectedSystems:fix-index-issue) and it was working. However, while it works well for local code, it is harder to specifically check out this branch when testing CI for a package, hence I prefer to wait for the merge. Cheers!

@felixcremer
Contributor

@meggart Is this blocked by anything or could we merge this?

@meggart meggart merged commit 4160df4 into JuliaIO:main Oct 21, 2024
10 checks passed
@ConnectedSystems ConnectedSystems deleted the fix-index-issue branch October 21, 2024 07:03