[data] Add LanceDB Datasource #44853
Conversation
This PR adds a new datasource for Ray Data that reads from LanceDB. The datasource is a thin wrapper around the LanceDB Python client that lets users read data from LanceDB into Ray Data.

On branch anyscalebrent/lancedb_datasource. Changes to be committed:
- modified: python/ray/data/__init__.py
- modified: python/ray/data/datasource/__init__.py
- new file: python/ray/data/datasource/lancedb_datasource.py
- modified: python/ray/data/read_api.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/__init__.py modified: python/ray/data/datasource/lancedb_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/read_api.py
self.columns = columns
self.filter = filter

self.lance_ds = lance.dataset(uri)
I'm not sure if pickling preserves this today (we should fix that if it doesn't), but it might also be worth exposing the storage_options parameter from lance.dataset(). That would allow users to pass down credentials and other configuration for using an object store (such as S3).
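A minimal sketch of that shape (assuming `storage_options` is forwarded straight to `lance.dataset()`; the class layout and the `__reduce__` approach are illustrative, not the PR's actual code):

```python
import lance
from typing import Dict, Optional


class LanceDatasource:
    def __init__(
        self,
        uri: str,
        storage_options: Optional[Dict[str, str]] = None,
    ):
        self.uri = uri
        # Forwarded to Lance so object-store credentials and config
        # (e.g. S3 keys, region, endpoint) reach the underlying reader.
        self.storage_options = storage_options
        self.lance_ds = lance.dataset(uri, storage_options=storage_options)

    def __reduce__(self):
        # Rebuild the dataset handle on unpickle rather than assuming
        # the lance.dataset object itself pickles cleanly.
        return (self.__class__, (self.uri, self.storage_options))
```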
def __init__(
    self,
    uri: str,
It might be worth just supporting passing in an already configured LanceDataset. Then you don't have to reproduce all the same options.
Yeah, either that or allowing kwargs to get passed down.
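For example (a hedged sketch; the helper name is made up, and `lance.LanceDataset` is assumed as the configured-dataset type):

```python
import lance
from typing import Any, Union


def _resolve_dataset(
    uri_or_ds: Union[str, lance.LanceDataset],
    **lance_kwargs: Any,
) -> lance.LanceDataset:
    # Accept an already configured LanceDataset as-is; otherwise open
    # one from a URI, forwarding extra options to lance.dataset().
    if isinstance(uri_or_ds, lance.LanceDataset):
        return uri_or_ds
    return lance.dataset(uri_or_ds, **lance_kwargs)
```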
read_task = ReadTask(
    lambda fragment=fragment: [_read_single_fragment(fragment)],
    metadata,
)
Fragments could be 100 GB or more. Maybe something we can implement later, but IMO it would be nice to let the user set the block size they want (in terms of # of rows), and then just slice the files according to that block size. We support partial scans of files.
Agreed, "per-fragment" is probably too course for the long term. Right now though it might be tricky to know, up front, exactly how many batches will be generated (will depend on the row group size of the fragment which I'm not sure we make available)
I think we could probably add an API that, given a read size, will tell you how many batches will be generated for a fragment. Would that work?
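A rough sketch of what such an estimate could look like today (assuming `count_rows()` on a fragment, which Lance fragments do expose; the real count may still differ once row-group boundaries are taken into account):

```python
def estimate_num_batches(fragment, batch_size: int) -> int:
    # Upper-bound estimate: ceil(num_rows / batch_size). The true count
    # can differ because batches don't cross row-group boundaries.
    num_rows = fragment.count_rows()
    return -(-num_rows // batch_size)
```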
Alternatively, if Ray supported the idea of a streaming source (e.g. a ReadTask that returns a RecordBatchReader), then a per-fragment API might actually work quite well.
Yes, Ray Data supports streaming reads in batches. Should we use https://lancedb.github.io/lance/read_and_write.html#iterative-read for the per-fragment API?
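A sketch of what a streaming per-fragment read could look like (assuming the `to_batches()` iterative-read API from the linked docs accepts column projection and a filter; the helper name is illustrative):

```python
import pyarrow as pa


def _stream_fragment(fragment, columns=None, filter=None):
    # Yield one small Arrow table per batch instead of materializing
    # the whole fragment, so Ray Data can consume blocks as a stream.
    for batch in fragment.to_batches(columns=columns, filter=filter):
        yield pa.Table.from_batches([batch])
```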
Thanks for doing this. Here are a few thoughts from a Lance perspective.
    If not specified, all columns are read.
filter: The filter to apply to the dataset.
    If not specified, no filter is applied.
parallelism: Degree of parallelism to use for the Dataset
Can Ray work with datasets that do their own parallelism? For example, you could read a fragment with batch_readahead and Lance will do multi-threading on its own. This is a bit more efficient than kicking off a bunch of "read_batch" tasks, since we only have to read/parse/decode the metadata a single time instead of multiple times.
Yes, within each Ray task (per-fragment), multi-threading is allowed.
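For illustration, a hedged sketch of combining the two: one Ray task per fragment, with Lance's own read-ahead threads inside it. `batch_readahead` is the scanner option mentioned above; the wrapper function is made up, and it assumes the scanner accepts a list of fragments to scan.

```python
def _read_fragment_with_readahead(lance_ds, fragment, batch_readahead=4):
    # Inside a single Ray task, let Lance prefetch batches on its own
    # threads; metadata is read/parsed/decoded only once per fragment.
    scanner = lance_ds.scanner(
        fragments=[fragment],
        batch_readahead=batch_readahead,
    )
    return scanner.to_table()
```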
Thank you @brent-anyscale for working on this!
return table

read_tasks = []
for fragment in self.fragments:
parallelism is not respected here. We should compare the value of parallelism and fragments to make sure the number of ReadTasks is no more than parallelism.
Should be resolved - can you review?
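One way the cap could be implemented (an illustrative sketch, not necessarily what the PR does): group fragments so that at most `parallelism` ReadTasks are produced, with each task reading its group of fragments sequentially.

```python
def _group_fragments(fragments, parallelism):
    # Produce at most `parallelism` groups; each group becomes one
    # ReadTask that reads its fragments one after another.
    parallelism = min(parallelism, len(fragments))
    step = -(-len(fragments) // parallelism)  # ceiling division
    return [
        fragments[i : i + step] for i in range(0, len(fragments), step)
    ]
```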
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/__init__.py modified: python/ray/data/datasource/__init__.py renamed: python/ray/data/datasource/lancedb_datasource.py -> python/ray/data/datasource/lance_datasource.py modified: python/ray/data/read_api.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/__init__.py modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/__init__.py modified: python/ray/data/datasource/__init__.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/read_api.py Signed-off-by: Brent Bain <[email protected]>
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/read_api.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> The __init__ method of the LanceDatasource class now uses Optional instead of Union for the parameters. Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
This change updates lance_datasource to a simpler implementation of to_batches. Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Yield isn't working as expected. Changing back to return. Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py
if parallelism > len(self.fragments):
    parallelism = len(self.fragments)
    logger.warning(
        f"Reducing the parallelism to {parallelism}, as that is the "
        "number of files"
    )
This is not the behavior we want, let's remove this.
if len(fragments) <= 0:
    continue
When can this happen?
Changes to lance_datasource parallelism handling. Added initial test for lance_datasource. Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py modified: python/ray/data/read_api.py new file: python/ray/data/tests/test_lance.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/datasource/lance_datasource.py modified: python/ray/data/tests/test_lance.py
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/requirements/ml/data-test-requirements.txt
Signed-off-by: Brent Bain <[email protected]> Changes to be committed: modified: python/ray/data/BUILD
Closing in favor of #45106
Why are these changes needed?
This PR adds the capability to load a LanceDB dataset into a Ray Dataset.
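A hedged usage sketch (the entry-point name `read_lance` is an assumption based on the changes this PR makes to `read_api.py`; the exact name and signature may differ):

```python
import ray

# Load a Lance dataset into a Ray Dataset, projecting columns and
# pushing a filter down to the Lance reader.
ds = ray.data.read_lance(
    "s3://bucket/path/my_dataset.lance",  # hypothetical URI
    columns=["id", "vector"],
    filter="id > 100",
)
print(ds.schema())
```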
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.