[Data] Avoid pickling LanceFragment when creating read tasks for Lance #45392

c21 · 2024-05-16T18:28:55Z

Why are these changes needed?

Avoid pickling LanceFragment when creating read tasks for Lance, as this is expensive.
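
A minimal sketch of the idea, using only the standard library. `make_fake_fragment` and `FakeDataset` are hypothetical stand-ins, not Lance's API, and the 10 KB metadata payload is an arbitrary assumption. The sketch only illustrates that a list of integer ids serializes to far fewer bytes than a list of objects carrying state, and that a read task can resolve the ids back into fragments on its own.

```python
import pickle


def make_fake_fragment(fragment_id):
    # Hypothetical stand-in for a fragment: a plain dict with a bulky payload.
    return {"id": fragment_id, "metadata": bytes(10_000)}


class FakeDataset:
    """Hypothetical stand-in for a dataset that can look fragments up by id."""

    def __init__(self, num_fragments):
        self._fragments = {i: make_fake_fragment(i) for i in range(num_fragments)}

    def get_fragments(self):
        return list(self._fragments.values())

    def get_fragment(self, fragment_id):
        return self._fragments[fragment_id]


def read_task(ds, fragment_ids):
    # Resolve ids back into fragments inside the task instead of shipping them.
    for fragment_id in fragment_ids:
        yield ds.get_fragment(fragment_id)


ds = FakeDataset(num_fragments=100)
fragments = ds.get_fragments()
fragment_ids = [f["id"] for f in fragments]

# Bytes that would have to be serialized for a task in each design.
size_with_objects = len(pickle.dumps(fragments))
size_with_ids = len(pickle.dumps(fragment_ids))
print(size_with_objects, size_with_ids)
```

In this toy setup the id list is orders of magnitude smaller than the object list; the real saving for Lance depends on what a pickled LanceFragment actually contains.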

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Cheng Su <[email protected]>

wjones127 · 2024-05-17T02:08:14Z

python/ray/data/datasource/lance_datasource.py

-            num_rows = sum([f.count_rows() for f in fragments])
-            input_files = [
-                data_file.path() for f in fragments for data_file in f.data_files()
-            ]
-


I think you can still keep these. We've established that count_rows is not the slow part. In fact, it's even faster than get_fragments().

Also, small nit: you don't need the [] inside of sum. If you omit them you get a generator expression which bypasses the need to allocate the whole list.

num_rows = sum(f.count_rows() for f in fragments)
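
To make the nit concrete, here is a small self-contained comparison. `FakeFragment` is a hypothetical stand-in that only mimics a `count_rows()` method; it is not the Lance class.

```python
class FakeFragment:
    """Hypothetical stand-in exposing count_rows(), for illustration only."""

    def __init__(self, num_rows):
        self._num_rows = num_rows

    def count_rows(self):
        return self._num_rows


fragments = [FakeFragment(n) for n in (10, 20, 30)]

# List comprehension: builds the full list of counts first, then sums it.
total_from_list = sum([f.count_rows() for f in fragments])

# Generator expression: sum() consumes one count at a time, no temporary list.
total_from_gen = sum(f.count_rows() for f in fragments)

print(total_from_list, total_from_gen)  # 60 60
```

Both forms return the same total; the generator form just skips the intermediate list.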

Yes, updated.

wjones127 · 2024-05-17T02:11:11Z

python/ray/data/datasource/lance_datasource.py

+    for fragment_id in fragment_ids:
+        fragment = lance_ds.get_fragment(fragment_id)
        batches = fragment.to_batches(columns=columns, filter=row_filter)
        for batch in batches:
            yield pyarrow.Table.from_batches([batch])


If you wanted something that did some IO prefetching, you could instead do:

Suggested change

-    for fragment_id in fragment_ids:
-        fragment = lance_ds.get_fragment(fragment_id)
-        batches = fragment.to_batches(columns=columns, filter=row_filter)
-        for batch in batches:
-            yield pyarrow.Table.from_batches([batch])
+    fragments = [lance_ds.get_fragment(id) for id in fragment_ids]
+    scanner = lance_ds.scanner(
+        columns,
+        filter=row_filter,
+        fragments=fragments,
+    )
+    for batch in scanner.to_reader():
+        yield pyarrow.Table.from_batches([batch])
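
For readers unfamiliar with the term, the sketch below shows what IO prefetching means in general: a background thread keeps loading the next few batches while the consumer is still processing the current one. This is a generic standard-library illustration and not how Lance's scanner is implemented; `prefetch` and `fake_read` are hypothetical names, and the sleep simply simulates IO latency.

```python
import queue
import threading
import time


def prefetch(batches, depth=2):
    """Yield items from `batches`, reading roughly `depth` items ahead.

    Simplified on purpose: errors raised by the producer are not forwarded,
    and the consumer is expected to drain the stream fully.
    """
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


def fake_read(fragment_ids):
    # Hypothetical reader: each batch costs some simulated IO time.
    for fragment_id in fragment_ids:
        time.sleep(0.01)
        yield f"batch-{fragment_id}"


result = list(prefetch(fake_read([0, 1, 2])))
print(result)  # ['batch-0', 'batch-1', 'batch-2']
```

Order is preserved because a single producer feeds a FIFO queue; only the waiting time overlaps with the consumer's work.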

Cool, updated.

Avoid calling count_rows() when creating read tasks for Lance

e145a3e

Signed-off-by: Cheng Su <[email protected]>

c21 requested review from ericl, scv119, amogkam, scottjlee, bveeramani, raulchen, stephanie-wang and omatthew98 as code owners May 16, 2024 18:28

c21 assigned raulchen May 16, 2024

raulchen approved these changes May 16, 2024

c21 added the go label ("add ONLY when ready to merge, run all tests") May 16, 2024

Avoid pickle Lance fragments which is expensive

1631165

Signed-off-by: Cheng Su <[email protected]>

wjones127 reviewed May 17, 2024

wjones127 mentioned this pull request May 17, 2024

Propagate storage_options and other read parameters when pickling LanceFragment lancedb/lance#2280

Open

Address comments and add unit test

4f00e69

Signed-off-by: Cheng Su <[email protected]>

c21 changed the title ~~[Data] Avoid calling count_rows() when creating read tasks for Lance~~ [Data] Avoid pickling LanceFragment when creating read tasks for Lance May 20, 2024

c21 merged commit e2028e0 into ray-project:master May 20, 2024
6 checks passed

c21 deleted the fix-lance branch May 20, 2024 19:11
