[data] Add LanceDB Datasource #44853
It might be worth just supporting passing in an already configured LanceDataset. Then you don't have to reproduce all the same options.
Yeah, either that or allowing kwargs to get passed down.
I'm not sure if pickling preserves this today (we should fix that if it doesn't), but it might also be worth exposing the storage_options parameter from lance.dataset(). That will allow users to pass down credentials and other configuration for using an object store (such as S3).
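To make the pickling concern concrete, here is a minimal sketch under the assumption that the read task stores only plain construction arguments (the class name LanceFragmentReadTask and its shape are hypothetical, not the PR's code). Re-opening the dataset inside the task body means pickling carries only the uri and storage_options dict, never a live LanceDataset handle:

```python
import pickle


class LanceFragmentReadTask:  # hypothetical name, for illustration only
    """Hold only picklable construction args, not a live LanceDataset."""

    def __init__(self, uri, storage_options=None):
        self.uri = uri
        # Credentials/config for object stores such as S3.
        self.storage_options = storage_options or {}

    def __call__(self):
        # A real task would re-open the dataset here, e.g.:
        #   import lance
        #   ds = lance.dataset(self.uri, storage_options=self.storage_options)
        # Deferring the open means only plain data crosses the pickle boundary.
        return (self.uri, self.storage_options)


task = LanceFragmentReadTask("s3://bucket/t.lance", {"aws_region": "us-west-2"})
restored = pickle.loads(pickle.dumps(task))
assert restored() == ("s3://bucket/t.lance", {"aws_region": "us-west-2"})
```

The same pattern would let a pickling round-trip test catch a regression in storage_options propagation.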
parallelism is not respected here. We should compare the value of parallelism and fragments, to make sure the number of ReadTasks is no more than parallelism.
Should be resolved - can you review?
Fragments could be 100 GB or more. Maybe something we can implement later, but IMO it would be nice to let the user set the block size they want (in terms of # of rows), and then just slice the files according to that block size. We support partial scans of files.
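The slicing described above could amount to cutting a fragment's rows into fixed-size (offset, length) ranges, one block per partial scan. A sketch (row_range_blocks is a hypothetical helper):

```python
def row_range_blocks(num_rows, block_rows):
    """Slice one fragment's rows into (offset, length) blocks of at most
    `block_rows` rows each, so the user controls block size in rows."""
    return [(start, min(block_rows, num_rows - start))
            for start in range(0, num_rows, block_rows)]


# A 10-row fragment with a 4-row block size yields three partial scans.
assert row_range_blocks(10, 4) == [(0, 4), (4, 4), (8, 2)]
```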
Agreed, "per-fragment" is probably too coarse for the long term. Right now, though, it might be tricky to know up front exactly how many batches will be generated (it will depend on the row group size of the fragment, which I'm not sure we make available).
I think we could probably add an API that, given a read size, will tell you how many batches will be generated for a fragment. Would that work?
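If per-fragment row counts were available, such an API could reduce to a ceiling division. A sketch under that assumption (count_batches is illustrative, not a proposed signature):

```python
import math


def count_batches(fragment_row_counts, batch_size):
    """Per-fragment batch counts for a given read size (ceiling division)."""
    return [math.ceil(n / batch_size) for n in fragment_row_counts]


# Three fragments of 1000, 250, and 0 rows, read 256 rows at a time.
assert count_batches([1000, 250, 0], batch_size=256) == [4, 1, 0]
```

In practice the real count may also depend on row group boundaries, which is exactly the information the comment notes may not be exposed.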
Alternatively, if Ray supported the idea of a streaming source (e.g. a ReadTask that returns a RecordBatchReader), then a per-fragment API might actually work quite well.
Yes, Ray Data supports streaming reads in batches. Should we use https://lancedb.github.io/lance/read_and_write.html#iterative-read for the per-fragment API?
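A per-fragment streaming read function along these lines might look like the following sketch. It assumes the lance Python API (lance.dataset, get_fragment, to_batches; the exact to_batches signature is an assumption) and defers the lance import into the task body, so the task itself stays cheap to pickle:

```python
def make_fragment_read_fn(uri, fragment_id, batch_size):
    """Build a generator-style read function for one Lance fragment
    (illustrative sketch; names assume the lance Python API)."""

    def read_fn():
        # Deferred import: only the worker running the task needs lance.
        import lance

        ds = lance.dataset(uri)
        fragment = ds.get_fragment(fragment_id)
        # Stream record batches instead of materializing the whole fragment.
        yield from fragment.to_batches(batch_size=batch_size)

    return read_fn


read_fn = make_fragment_read_fn("/path/to/table.lance", 0, 1024)
```

Each ReadTask would then iterate read_fn() and emit batches as they arrive, rather than holding a 100 GB fragment in memory.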
Can Ray work with datasets that do their own parallelism? For example, you could read a fragment with batch_readahead, and Lance will do multi-threading on its own. This is a bit more efficient than kicking off a bunch of "read_batch" tasks, since we only have to read/parse/decode the metadata a single time instead of multiple times.
Yes, within each Ray task (per-fragment), multi-threading is allowed.