
Allow vector-based indices #181

Merged
meggart merged 7 commits into JuliaIO:main on Oct 21, 2024
Conversation

ConnectedSystems
Contributor

Addresses #180 caused by lack of support for vector-based indexing.

(related issue: JuliaDataCubes/YAXArrays.jl#416)

@meggart
Collaborator

meggart commented Jul 26, 2024

The main reason this is tricky is that it is quite hard to determine what eachchunk should return for arbitrary views into a chunked dimension. Just not throwing the error does not solve the underlying problem and will give wrong results for eachchunk.

The key functionality here is the subsetchunks function. A small example: for an array dimension of size 100 with a chunk size of 10, creating a view into indices 5:25 returns the correct chunk sizes:

using DiskArrays
chunks = DiskArrays.RegularChunks(10,0,100)
subset = 5:25
DiskArrays.subsetchunks(chunks,subset)
3-element DiskArrays.RegularChunks:
 1:6
 7:16
 17:21
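The arithmetic behind this grouping can be sketched outside of Julia. Below is a small illustration in Python (not the DiskArrays implementation; `group_by_chunk` is a hypothetical helper) that splits view positions by the source chunk each index falls in, reproducing the 1:6, 7:16, 17:21 output above:

```python
def group_by_chunk(chunk_size, subset):
    """Group 1-based positions of `subset` by the source chunk each index
    falls in. A new group starts whenever consecutive subset entries land
    in different chunks of the parent dimension."""
    groups, start = [], 0
    for pos in range(1, len(subset)):
        if (subset[pos] - 1) // chunk_size != (subset[pos - 1] - 1) // chunk_size:
            groups.append(range(start + 1, pos + 1))  # 1-based inclusive, like 1:6
            start = pos
    groups.append(range(start + 1, len(subset) + 1))
    return groups

# View into indices 5:25 of a size-100 dimension chunked by 10
print([f"{g.start}:{g.stop - 1}" for g in group_by_chunk(10, list(range(5, 26)))])
# ['1:6', '7:16', '17:21']
```

The same rule applied to the vector example below ([1,2,3,15,16,17,18,91,92,93]) yields the groups 1:3, 4:7, 8:10.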

However, indexing with vectors gets a bit more tricky. It still works in many cases through subsetchunks_fallback; for example,

subset = [1,2,3,15,16,17,18,91,92,93]
DiskArrays.subsetchunks(chunks,subset)
3-element DiskArrays.IrregularChunks:
 1:3
 4:7
 8:10

returns the expected result. The main reason this was not yet exposed to users is that I did not have time to test it, so help here would be appreciated. Also, many cases still fail even though they would not have to:

subset = [1,3,2]
DiskArrays.subsetchunks(chunks,subset)

fails, although all elements come from the same chunk and one could easily return [1:3]. A bit more tricky would be cases where the view actually jumps between chunks:

subset = [1,2,15,16,3,4]

Probably this should return [1:2,3:4,5:6], but this might lead to some missed optimizations when reading the data later, because the first chunk would be touched twice...

For now it would be great if you want to work on this by either

  • adding some tests for the current implementation of subsetchunks to clearly show what works and what doesn't (also broken tests help) and/or
  • adding more functionality to subsetchunks_fallback to make it work reasonably in more cases.

@ConnectedSystems
Contributor Author

ConnectedSystems commented Jul 26, 2024

I guess I was being too optimistic to think it was going to be a simple fix.

Let me see what I can do.

@ConnectedSystems
Contributor Author

Also, many cases still fail although they would not have to:
subset = [1,3,2]
DiskArrays.subsetchunks(chunks,subset)

fails although all elements come from the same chunk and one could easily return [1:3]. A bit more tricky would be cases where the view actually jumps between chunks

This is additionally tricky because the user would expect the subset to be ordered as requested...

@meggart
Collaborator

meggart commented Jul 26, 2024

This is additionally tricky because the user would expect the subset to be ordered as requested...

This will be handled correctly by the actual getindex call; there are methods to read the data chunk by chunk and then put it into the correct place. Here we just need to deal with the "metadata" for DiskArrays, so that downstream packages like YAXArrays can make a good guess about which chunks to group the data into for large batch processing.
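The chunk-by-chunk read with a scatter into the requested output order can be sketched as follows. This is a simplified Python model of the idea, not the DiskArrays code; `gather_by_chunk` and the 0-based indexing are illustrative assumptions:

```python
import numpy as np

def gather_by_chunk(src, chunk_size, idx):
    """Read each touched chunk once, then scatter its values to their
    requested output positions, so the caller sees the order they asked for."""
    out = np.empty(len(idx), dtype=src.dtype)
    # Collect output positions per source chunk
    positions_by_chunk = {}
    for pos, i in enumerate(idx):
        positions_by_chunk.setdefault(i // chunk_size, []).append(pos)
    for c, positions in positions_by_chunk.items():
        block = src[c * chunk_size:(c + 1) * chunk_size]  # one "disk read" per chunk
        for pos in positions:
            out[pos] = block[idx[pos] - c * chunk_size]
    return out

src = np.arange(100)
print(gather_by_chunk(src, 10, [0, 1, 14, 15, 2, 3]).tolist())
# [0, 1, 14, 15, 2, 3]
```

Each chunk is pulled from storage only once, yet the result still comes back in the requested (unsorted) order.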

@ConnectedSystems
Contributor Author

ConnectedSystems commented Jul 26, 2024

I'm not claiming this is completely fixed, but any thoughts on this approach? (see latest commits)

I think it works in the intended manner based on the examples given above
(see in particular the last example, which I've added).

I'm just after a general comment on the approach; I recognize I haven't caught all cases yet.

using DiskArrays

chunks = DiskArrays.RegularChunks(10,0,100)
subset = 5:25
DiskArrays.subsetchunks(chunks,subset)
# 3-element DiskArrays.RegularChunks:
#  1:6
#  7:16
#  17:21

subset = [1,2,3,15,16,17,18,91,92,93]
DiskArrays.subsetchunks(chunks,subset)
# 3-element DiskArrays.IrregularChunks:
#  1:3
#  4:7
#  8:10

subset = [1,3,2]
DiskArrays.subsetchunks(chunks, subset)
# 1-element DiskArrays.RegularChunks:
#  1:3

subset = [1,2,15,16,3,4]
DiskArrays.subsetchunks(chunks, subset)
# 2-element DiskArrays.RegularChunks:
#  1:4
#  5:6


x = [1, 2, 3, 16, 17, 4, 5, 98]
DiskArrays.subsetchunks(chunks, x)
# 3-element DiskArrays.IrregularChunks:
#  1:5
#  6:7
#  8:8

Extending my MWE for YAXArrays.jl:

using Revise, Infiltrator
using NetCDF
using YAXArrays
using DiskArrays


axlist = (
    Dim{:v1}(range(1, 10, length=10)),
    Dim{:v2}([string("x$(i)") for i in 1:100])
)
test_arr = YAXArray(axlist, rand(10,100))
test_arr = setchunks(test_arr, (10, 10))  # setchunks returns a new array
savecube(test_arr, "test_cube.nc", driver=:netcdf, overwrite=true)

ds = open_dataset("test_cube.nc")
disk_arr = ds.layer

# These all pass...
@assert all(read(disk_arr[v1=[1, 3]]) .== test_arr[[1, 3], :].data)
@assert all(read(disk_arr[v2=At(["x1", "x2"])]) .== test_arr[:, 1:2].data)
@assert all(read(disk_arr[v2=At(["x1", "x90"])]) .== test_arr[:, [1, 90]].data)
@assert all(read(disk_arr[v2=[1,2,15,16,3,4]]) .== test_arr[:, [1,2,15,16,3,4]].data)
@assert all(read(disk_arr[v2=[1, 2, 3, 16, 17, 4, 5, 98, 97]]) .== test_arr[:, [1, 2, 3, 16, 17, 4, 5, 98, 97]].data)

@meggart
Collaborator

meggart commented Jul 26, 2024

These two look wrong to me:

subset = [1,2,15,16,3,4]
DiskArrays.subsetchunks(chunks, subset)
# 2-element DiskArrays.RegularChunks:
#  1:4
#  5:6


x = [1, 2, 3, 16, 17, 4, 5, 98]
DiskArrays.subsetchunks(chunks, x)
# 3-element DiskArrays.IrregularChunks:
#  1:5
#  6:7
#  8:8

I think the saner behavior would be to return [1:2,3:4,5:6] for the first one and [1:3,4:5,6:7,8:8] for the latter example, so that data from different chunks does not get mixed into a single chunk of the resulting view; I don't think simply sorting the inputs helps here. As I said, if you could write a few tests, even broken ones, that would be great, and then I could have another look at the implementation as well.
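That grouping rule, starting a new chunk of the view whenever consecutive indices land in different source chunks, can be sketched like so. This is a hedged Python illustration of the rule, not the actual subsetchunks_fallback code; `chunk_runs` is a hypothetical name:

```python
def chunk_runs(chunk_size, subset):
    """Split 1-based `subset` positions into runs; a new run begins whenever
    consecutive indices fall in different source chunks, so data from
    different chunks never shares a run."""
    runs, start = [], 0
    for pos in range(1, len(subset)):
        if (subset[pos] - 1) // chunk_size != (subset[pos - 1] - 1) // chunk_size:
            runs.append((start + 1, pos))
            start = pos
    runs.append((start + 1, len(subset)))
    return runs

print(chunk_runs(10, [1, 2, 15, 16, 3, 4]))         # [(1, 2), (3, 4), (5, 6)]
print(chunk_runs(10, [1, 2, 3, 16, 17, 4, 5, 98]))  # [(1, 3), (4, 5), (6, 7), (8, 8)]
print(chunk_runs(10, [1, 3, 2]))                    # [(1, 3)]
```

Note that unsorted indices within one chunk (the [1,3,2] case) still form a single run; only crossing a chunk boundary splits them.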

@ConnectedSystems
Contributor Author

ConnectedSystems commented Jul 26, 2024

So that data from different chunks does not get mixed into a single chunk of the resulting view

The thing is, I'm not too familiar with the internals here.
To me, if the chunk size is 10, then it makes sense to combine the two subsets ([1,2] and [3,4]) to avoid re-visiting the same chunk twice.

As I indicate in the MWE above (albeit with YAXArrays), the selection behaviour is as expected.

I'll try to mock up another example with PseudoDiskArray to investigate more when I have the time (following the example in the readme).
At least that way we're all on the same conceptual page.
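The tradeoff being debated here, combining same-chunk runs to avoid re-reading a chunk versus keeping runs separate for chunk-aligned downstream batching, can be made concrete with a small counter. This is an illustrative Python sketch under the assumption of one read per run and no caching; `count_chunk_reads` is a hypothetical name:

```python
def count_chunk_reads(chunk_size, subset):
    """Count disk reads if each run of consecutive same-chunk indices
    triggers one read of its chunk (no caching across runs)."""
    reads, prev_chunk = 0, None
    for i in subset:
        c = (i - 1) // chunk_size
        if c != prev_chunk:
            reads += 1
            prev_chunk = c
    return reads

print(count_chunk_reads(10, [1, 2, 15, 16, 3, 4]))          # 3: chunk 1 is read twice
print(count_chunk_reads(10, sorted([1, 2, 15, 16, 3, 4])))  # 2
```

So sorting (or combining runs) saves a read, at the cost of mixing data from different source chunks into one chunk of the resulting view.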

@ConnectedSystems
Contributor Author

@meggart I'm definitely missing some context here, but as far as I can tell the behaviour is now as I would expect.

The failing tests that appear relevant to this PR are those that check that unsorted indices cannot be used and those that check for a specific order of chunks.

Both of these are now allowed/supported with the proposed change.

Incidentally, I did mock up an example using PseudoDiskArray, but it's a bit redundant as I've committed tests which do the same thing. For completeness, I include it below:

# PseudoDiskArray is the example chunked array type from the DiskArrays readme
test_arr = rand(100,100,100)
a = PseudoDiskArray(test_arr; chunksize=(10,10,20))

idx = rand(1:100)
@assert all(a[idx, :, :] .== test_arr[idx, :, :])
@assert all(a[:, idx, :] .== test_arr[:, idx, :])
@assert all(a[:, :, idx] .== test_arr[:, :, idx])

sel = [1,3,4,5,10,11,19,50,51,96,95,94,83,82,81]
@assert all(a[:, sel, :] .== test_arr[:, sel, :])
@assert all(a[sel, :, :] .== test_arr[sel, :, :])
@assert all(a[sel, :, sel] .== test_arr[sel, :, sel])

bit_sel = rand(100) .> 0.5
@assert all(a[bit_sel, :, :] .== test_arr[bit_sel, :, :])
@assert all(a[:, bit_sel, :] .== test_arr[:, bit_sel, :])
@assert all(a[:, bit_sel, bit_sel] .== test_arr[:, bit_sel, bit_sel])

@meggart
Collaborator

meggart commented Aug 1, 2024

Thanks a lot for adding the tests and sorry for the delay, I will have a look this afternoon

@ConnectedSystems
Contributor Author

Thanks @meggart

Just wanted to say that the "correct" behaviour I expect may be a coincidental side effect. If subsetchunks is not behaving appropriately, as you say, then I'll devote some time to looking into it.

@meggart
Collaborator

meggart commented Aug 1, 2024

No need to look into it; I took the liberty of implementing the desired behavior. In this case the issue is not about correctness, but downstream application performance might be affected in strange ways. Now I have no idea why this fails on nightly but not on 1.10.

@Balinus

Balinus commented Sep 13, 2024

Hello!

Is this going to be merged soon? Tests are failing for a package that is being developed, and I'd prefer to avoid hardcoding this branch into the package.

Cheers!

@rafaqz
Collaborator

rafaqz commented Sep 13, 2024

Apparently @meggart is away and I'm finishing a PhD, so honestly not likely soon

@Balinus

Balinus commented Sep 13, 2024

Apparently @meggart is away and I'm finishing a PhD, so honestly not likely soon

Good luck with your PhD! I know how it can be stressful and time-consuming at the end.

No worries, I will look into a workaround for CI. Cheers!

@lazarusA
Contributor

It looks like the last commit was successful. Does this branch now do the right thing, @Balinus?

@ConnectedSystems
Contributor Author

FWIW this is working for me @lazarusA

@Balinus

Balinus commented Sep 27, 2024

it looks like last commit was successful. Does this branch now do the right thing? @Balinus ?

Should I test a specific commit/branch? I had tested with this branch (ConnectedSystems:fix-index-issue) and it was working. However, while it works well for local code, it is harder to specifically check out this branch when testing CI for a package, hence I prefer to wait for the merge. Cheers!

@felixcremer
Contributor

@meggart Is this blocked by anything or could we merge this?

@meggart meggart merged commit 4160df4 into JuliaIO:main Oct 21, 2024
10 checks passed
@ConnectedSystems ConnectedSystems deleted the fix-index-issue branch October 21, 2024 07:03