Allow vector-based indices #181
Conversation
The main reason this is tricky is that it is quite hard to determine what the chunks of the resulting array should be. The key functionality here is `DiskArrays.subsetchunks`:

```julia
using DiskArrays
chunks = DiskArrays.RegularChunks(10, 0, 100)
subset = 5:25
DiskArrays.subsetchunks(chunks, subset)
```

However, indexing with vectors gets a bit more tricky. It already works in many cases; for example,

```julia
subset = [1, 2, 3, 15, 16, 17, 18, 91, 92, 93]
DiskArrays.subsetchunks(chunks, subset)
```

returns the expected result. The main reason this was not yet exposed to users is that I did not have time to test it, so help here would be appreciated. Also, many cases still fail although they would not have to:

```julia
subset = [1, 3, 2]
DiskArrays.subsetchunks(chunks, subset)
```

fails although all elements come from the same chunk, and one could easily return a valid chunk structure. The same holds for `subset = [1, 2, 15, 16, 3, 4]`; probably this should return chunks that follow the grouping of the indices in the original chunk structure. For now it would be great if you want to work on this.
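To illustrate the underlying computation for the contiguous-range case, here is a rough sketch. This is not the actual DiskArrays.jl code; `subset_regular_chunks` is a hypothetical helper that only handles a regular chunk grid with zero offset:

```julia
# Hypothetical sketch (not the actual DiskArrays.jl implementation):
# subsetting a regular chunk grid with a contiguous range shifts the
# chunk offset and clips the first and last output chunks.
function subset_regular_chunks(chunksize::Int, subset::UnitRange{Int})
    n = length(subset)
    out = UnitRange{Int}[]
    # The distance from first(subset) to the next source-chunk boundary
    # determines the (possibly shorter) first output chunk:
    firstlen = min(chunksize - mod(first(subset) - 1, chunksize), n)
    push!(out, 1:firstlen)
    pos = firstlen + 1
    while pos <= n
        stop = min(pos + chunksize - 1, n)
        push!(out, pos:stop)
        pos = stop + 1
    end
    return out
end

subset_regular_chunks(10, 5:25)  # [1:6, 7:16, 17:21], matching the example above
```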
I guess I was being too optimistic to think it was going to be a simple fix. Let me see what I can do.
This is additionally tricky because the user would expect the subset to be ordered as requested...
This will be handled correctly by the actual `getindex` call; there are methods to read the data chunk by chunk and then put it into the correct place. Here we just need to deal with the "metadata" for DiskArrays, so that downstream packages like YAXArrays can make a good guess about which chunks to group the data into for large batch processing.
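A minimal sketch of that chunk-by-chunk strategy (with hypothetical helper names, not the actual DiskArrays internals): group the requested output positions by the source chunk of each index, read every touched chunk once, and scatter the values into the positions the caller asked for.

```julia
# Hypothetical sketch (not the DiskArrays.jl internals): read a vector
# subset chunk by chunk and scatter values into their requested positions.
function gather_by_chunk(readchunk::Function, chunksize::Int, subset::Vector{Int})
    out = Vector{Any}(undef, length(subset))
    # Group output positions by the source chunk of each requested index:
    groups = Dict{Int,Vector{Int}}()
    for (outpos, i) in enumerate(subset)
        push!(get!(groups, (i - 1) ÷ chunksize + 1, Int[]), outpos)
    end
    for (chunkid, outpositions) in groups
        data = readchunk(chunkid)               # one read per touched chunk
        offset = (chunkid - 1) * chunksize
        for p in outpositions
            out[p] = data[subset[p] - offset]   # place value where requested
        end
    end
    return out
end

# Demo with a fake on-disk array holding 1:100 in chunks of 10:
fake_read(chunkid) = collect(((chunkid - 1) * 10 + 1):(chunkid * 10))
gather_by_chunk(fake_read, 10, [1, 3, 2])  # reads chunk 1 once, returns [1, 3, 2]
```

Even for an unsorted subset like `[1, 3, 2]`, each touched chunk is read only once and the values land in the requested order.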
I'm not claiming this is completely fixed, but any thoughts on this approach? (See latest commits.) I think it works in the intended manner based on the examples given above; I'm just after a general comment on the approach, and I recognize I haven't caught all cases yet.

```julia
using DiskArrays
chunks = DiskArrays.RegularChunks(10, 0, 100)

subset = 5:25
DiskArrays.subsetchunks(chunks, subset)
# 3-element DiskArrays.RegularChunks:
#  1:6
#  7:16
#  17:21

subset = [1, 2, 3, 15, 16, 17, 18, 91, 92, 93]
DiskArrays.subsetchunks(chunks, subset)
# 3-element DiskArrays.IrregularChunks:
#  1:3
#  4:7
#  8:10

subset = [1, 3, 2]
DiskArrays.subsetchunks(chunks, subset)
# 1-element DiskArrays.RegularChunks:
#  1:3

subset = [1, 2, 15, 16, 3, 4]
DiskArrays.subsetchunks(chunks, subset)
# 2-element DiskArrays.RegularChunks:
#  1:4
#  5:6

x = [1, 2, 3, 16, 17, 4, 5, 98]
DiskArrays.subsetchunks(chunks, x)
# 3-element DiskArrays.IrregularChunks:
#  1:5
#  6:7
#  8:8
```

Extending my MWE for YAXArrays.jl:

```julia
using Revise, Infiltrator
using NetCDF
using YAXArrays
using DiskArrays

axlist = (
    Dim{:v1}(range(1, 10, length=10)),
    Dim{:v2}([string("x$(i)") for i in 1:100])
)
test_arr = YAXArray(axlist, rand(10, 100))
setchunks(test_arr, (10, 10))
savecube(test_arr, "test_cube.nc", driver=:netcdf, overwrite=true)
ds = open_dataset("test_cube.nc")
disk_arr = ds.layer

# These all pass...
@assert all(read(disk_arr[v1=[1, 3]]) .== test_arr[[1, 3], :].data)
@assert all(read(disk_arr[v2=At(["x1", "x2"])]) .== test_arr[:, 1:2].data)
@assert all(read(disk_arr[v2=At(["x1", "x90"])]) .== test_arr[:, [1, 90]].data)
@assert all(read(disk_arr[v2=[1,2,15,16,3,4]]) .== test_arr[:, [1,2,15,16,3,4]].data)
@assert all(read(disk_arr[v2=[1, 2, 3, 16, 17, 4, 5, 98, 97]]) .== test_arr[:, [1, 2, 3, 16, 17, 4, 5, 98, 97]].data)
```
These two look wrong to me:

```julia
subset = [1, 2, 15, 16, 3, 4]
DiskArrays.subsetchunks(chunks, subset)
# 2-element DiskArrays.RegularChunks:
#  1:4
#  5:6

x = [1, 2, 3, 16, 17, 4, 5, 98]
DiskArrays.subsetchunks(chunks, x)
# 3-element DiskArrays.IrregularChunks:
#  1:5
#  6:7
#  8:8
```

I think the more sane behavior would be to return chunks that follow the grouping of the indices in the original chunk structure.
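One way to derive the chunk structure described above, splitting the requested indices wherever their source chunk changes, could be sketched as follows. `chunk_lengths_by_source` is a hypothetical helper for illustration, not part of the DiskArrays API:

```julia
# Hypothetical sketch: compute output chunk lengths by splitting the
# requested indices wherever the source chunk changes.
function chunk_lengths_by_source(chunksize::Int, subset::Vector{Int})
    lengths = Int[]
    current = 0
    lastchunk = -1
    for i in subset
        c = (i - 1) ÷ chunksize + 1   # source chunk of index i
        if c == lastchunk
            current += 1
        else
            current > 0 && push!(lengths, current)
            current = 1
            lastchunk = c
        end
    end
    current > 0 && push!(lengths, current)
    return lengths
end

chunk_lengths_by_source(10, [1, 2, 15, 16, 3, 4])  # → [2, 2, 2]
```

For `[1, 2, 15, 16, 3, 4]` this yields output chunks `1:2`, `3:4`, `5:6`, each backed by a single source chunk, rather than the `1:4`, `5:6` shown above.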
The thing is, I'm not too familiar with the internals here. As I indicated in the MWE above (albeit with YAXArrays), the selection behaviour is as expected. I'll try to mock up another example.
Force-pushed from a201164 to 4da6b79.
@meggart I'm definitely missing some context here, but as far as I can tell the behaviour is now as I would expect. The failing tests that appear relevant to this PR are those that check that no unsorted indices can be used and those that check for a specific order of chunks; both are now allowed/supported with the proposed change. Incidentally, I did mock up an example:

```julia
test_arr = rand(100, 100, 100)
a = PseudoDiskArray(test_arr; chunksize=(10, 10, 20))

idx = rand(1:100)
@assert all(a[idx, :, :] .== test_arr[idx, :, :])
@assert all(a[:, idx, :] .== test_arr[:, idx, :])
@assert all(a[:, :, idx] .== test_arr[:, :, idx])

sel = [1, 3, 4, 5, 10, 11, 19, 50, 51, 96, 95, 94, 83, 82, 81]
@assert all(a[:, sel, :] .== test_arr[:, sel, :])
@assert all(a[sel, :, :] .== test_arr[sel, :, :])
@assert all(a[sel, :, sel] .== test_arr[sel, :, sel])

bit_sel = rand(100) .> 0.5
@assert all(a[bit_sel, :, :] .== test_arr[bit_sel, :, :])
@assert all(a[:, bit_sel, :] .== test_arr[:, bit_sel, :])
@assert all(a[:, bit_sel, bit_sel] .== test_arr[:, bit_sel, bit_sel])
```
Thanks a lot for adding the tests, and sorry for the delay; I will have a look this afternoon.
Thanks @meggart. Just wanted to say that the "correct" behaviour I expect may be a coincidental side-effect.
No need to look into it, I just took the liberty of implementing the desired behavior. In this case the issue is not about correctness, but downstream application performance might be affected in strange ways. Now I have no idea why this fails on nightly but not on 1.10.
Hello! Is this going to be merged soon? Tests are failing for a package that is being developed, and I'd prefer to avoid hardcoding this branch into the package. Cheers!
Apparently @meggart is away and I'm finishing a PhD, so honestly not likely soon.
Good luck with your PhD! I know how stressful and time-consuming it can be at the end. No worries, I will look into a workaround for CI. Cheers!
It looks like the last commit was successful. Does this branch now do the right thing, @Balinus?
FWIW this is working for me, @lazarusA.
Should I test a specific commit/branch? I had tested with this branch.
@meggart Is this blocked by anything, or could we merge this?
Addresses #180, which was caused by lack of support for vector-based indexing.
(Related issue: JuliaDataCubes/YAXArrays.jl#416)