feat: add python bindings for creating scalar indices #1592

westonpace · 2023-11-13T17:30:48Z

No description provided.

westonpace · 2023-11-13T17:31:45Z

python/src/dataset.rs

-        metric_type: Option<&str>,
+        replace: Option<bool>,


I moved replace out of kwargs and moved metric_type into kwargs since:

replace is universally applicable regardless of index type

metric_type is only applicable for vector indices
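A minimal Python sketch of that split, using a hypothetical `create_index` function rather than the actual binding in python/src/dataset.rs. The `"IVF_PQ"` type name, the `"L2"` default, and the returned dictionary are assumptions made for illustration only.

```python
from typing import Any, Dict, Optional

# Assumption: illustrative vector index type names; only "BTREE" appears in this PR.
VECTOR_INDEX_TYPES = {"IVF_PQ"}


def create_index(
    column: str,
    index_type: str,
    name: Optional[str] = None,
    replace: Optional[bool] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical stand-in that returns the options it would pass along."""
    options: Dict[str, Any] = {
        "column": column,
        "index_type": index_type.upper(),
        "name": name,
        # replace applies to every index type, so it is a named parameter.
        "replace": True if replace is None else replace,
    }
    if options["index_type"] in VECTOR_INDEX_TYPES:
        # metric_type only matters for vector indices, so it is read from kwargs.
        options["metric_type"] = kwargs.pop("metric_type", "L2")
    if kwargs:
        # Anything left over does not apply to this index type.
        raise TypeError(f"unexpected options for {index_type}: {sorted(kwargs)}")
    return options
```

With this shape, passing `metric_type` for a scalar index fails loudly instead of being silently ignored.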

westonpace · 2023-11-13T17:33:58Z

python/python/lance/dataset.py

@@ -689,6 +689,83 @@ def cleanup_old_versions(
            td_to_micros(older_than), delete_unverified
        )

+    def create_scalar_index(


I wasn't sure whether we wanted to expose scalar indices as a separate method (create_scalar_index) or a different index type. Internally we have a single method with a different index type.

I ended up opting for two different methods but I could be argued out of it. My concern is that users will not realize that the one API can do both things and they would need too much sophistication to use it correctly. I'm hoping exposing this as two different APIs reduces the cognitive load on the user.

🤷 Yeah I'm not sure either way. This approach seems fine.
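The two-method approach could look roughly like the sketch below. `DatasetSketch`, `_create_index`, and the `"IVF_PQ"` default are hypothetical names used only to show two public entry points sharing one internal method; this is not the code from the PR.

```python
from typing import List, Tuple


class DatasetSketch:
    """Hypothetical dataset that records index requests instead of building them."""

    def __init__(self) -> None:
        self.requests: List[Tuple[str, str]] = []

    def _create_index(self, column: str, index_type: str) -> None:
        # Single internal entry point, parameterized by index type.
        self.requests.append((column, index_type))

    def create_index(self, column: str, index_type: str = "IVF_PQ") -> None:
        """Vector index entry point."""
        self._create_index(column, index_type.upper())

    def create_scalar_index(self, column: str, index_type: str) -> None:
        """Scalar index entry point; rejects non-scalar index types."""
        if index_type.upper() != "BTREE":
            raise ValueError(f"unknown scalar index type: {index_type}")
        self._create_index(column, index_type.upper())
```

Each public method can then carry its own docstring and validation, which is where the reduced cognitive load would come from.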

wjones127 · 2023-11-13T20:14:58Z

python/python/lance/dataset.py

@@ -689,6 +689,83 @@ def cleanup_old_versions(
            td_to_micros(older_than), delete_unverified
        )

+    def create_scalar_index(
+        self,
+        column: Union[str, List[str]],


In Python, it wouldn't be a breaking API change to go from str to Union[str, List[str]], so since we don't yet support multiple columns why don't we just keep this as str for now? The validation logic within the function is still relevant though.

I've updated this to str.
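A sketch of the kind of validation that stays relevant with a plain `str` parameter. `validate_column` and `schema_names` are hypothetical names, not identifiers from the PR.

```python
from typing import Sequence


def validate_column(column: str, schema_names: Sequence[str]) -> str:
    """Return the column name if it is a single, known column."""
    if not isinstance(column, str):
        # Multi-column indices are not supported yet, so lists are rejected here.
        raise TypeError(f"column must be a str, got {type(column).__name__}")
    if column not in schema_names:
        raise KeyError(f"no column named {column!r}")
    return column
```

Widening the annotation to `Union[str, List[str]]` later would only relax the first check, so existing callers that pass a `str` would keep working.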

wjones127 · 2023-11-13T20:15:41Z

python/python/lance/dataset.py

+    def create_scalar_index(
+        self,
+        column: Union[str, List[str]],
+        index_type: str,


I'd generally prefer to use Literal["BTREE"] for now, if possible. This gives better auto-completion options.

Or an enum?

I ended up going with an enum. I hadn't really used enums for python before but now that python has better autocomplete I think this might be useful. Although, Literal["BTREE"] should functionally be an enum. @wjones127 any opinion on which to prefer?

I think it's more typical Python style to use strings. It avoids the need to have to import the enum type in order to pass the value, which is nice in interactive settings.

Having tried out both of them now, I think I will stick with strings. Besides the issues that Will mentioned, this will also give us consistency between the two create index APIs and between the Rust API and the Python API.
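For reference, a string-based signature can still get auto-completion through `typing.Literal`. The sketch below assumes a hypothetical `ScalarIndexType` alias and `check_index_type` helper; neither is taken from the PR.

```python
from typing import Literal, get_args

# Assumption: "BTREE" is the only scalar index type, as stated in this review.
ScalarIndexType = Literal["BTREE"]


def check_index_type(index_type: ScalarIndexType) -> str:
    """Validate at runtime, since Literal is only enforced by type checkers."""
    allowed = get_args(ScalarIndexType)
    if index_type not in allowed:
        raise ValueError(f"index_type must be one of {allowed}, got {index_type!r}")
    return index_type
```

Callers pass a plain string such as `"BTREE"`, so nothing needs to be imported in an interactive session.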

wjones127 · 2023-11-13T20:16:35Z

python/python/lance/dataset.py

+        column: Union[str, List[str]],
+        index_type: str,
+        name: Optional[str] = None,
+        replace: bool = True,


nit: I think boolean options should almost always be keyword only, for the sake of readability.

Suggested change

replace: bool = True,

*,

replace: bool = True,

Good idea. I've changed this to keyword only
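A small sketch of the keyword-only form. The function body and return value are placeholders; only the parameter list mirrors the suggestion.

```python
from typing import Any, Dict, Optional


def create_scalar_index(
    column: str,
    index_type: str,
    name: Optional[str] = None,
    *,  # everything after this marker must be passed by keyword
    replace: bool = True,
) -> Dict[str, Any]:
    """Placeholder body: return the arguments so the call shape can be inspected."""
    return {"column": column, "index_type": index_type, "name": name, "replace": replace}
```

A call such as `create_scalar_index("id", "BTREE", None, False)` now raises `TypeError`, while `create_scalar_index("id", "BTREE", replace=False)` reads unambiguously at the call site.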

wjones127 · 2023-11-13T20:19:33Z

python/python/lance/dataset.py

+            )
+
+        self._ds.create_index([column], index_type, name, replace)
+        return LanceDataset(self.uri)


This clears the session cache, which isn't ideal. It would be nice if self._ds.create_index() mutated self to update the dataset reference.

In fact, this is already handled here:

lance/python/src/dataset.rs

Line 707 in 5a510af

self.ds = Arc::new(new_self);

I changed this to return nothing (since it is already updating self.ds as you pointed out). We should probably fix the other create_index function too (although we can defer that for a different PR)
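The difference between returning a fresh dataset and updating in place can be sketched as follows. `InnerDataset`, `DatasetWrapper`, and `session_cache` are hypothetical stand-ins for the native handle and its session cache, not the real classes.

```python
from typing import Dict, Tuple


class InnerDataset:
    """Hypothetical stand-in for the native dataset handle."""

    def __init__(self, version: int = 1, indices: Tuple[Tuple[str, str], ...] = ()) -> None:
        self.version = version
        self.indices = indices


class DatasetWrapper:
    """Hypothetical Python-side wrapper holding a handle and a cache."""

    def __init__(self) -> None:
        self._ds = InnerDataset()
        self.session_cache: Dict[str, int] = {}  # stands in for the session cache

    def create_scalar_index(self, column: str, index_type: str) -> None:
        # Swap in the new dataset state in place; the wrapper and its cache survive.
        self._ds = InnerDataset(
            self._ds.version + 1,
            self._ds.indices + ((column, index_type),),
        )
        # Returning nothing signals that the existing object was updated.
        return None
```

Constructing a brand new wrapper from the URI would instead start with an empty cache, which is the cost the review comment points out.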

wjones127 · 2023-11-13T20:25:44Z

python/python/lance/dataset.py

+        """Create scalar index on column.
+


It might be worth explaining what a "scalar" column is, for the lay user. Also mention the index types (right now BTREE) and what they are able to handle (equality and range queries). That way they can know what kind of predicates could benefit from the index. Maybe also worth calling out explicitly that this speeds up ANN search and scans, and that it can be used in combination with an ANN index.

I added a lot more documentation on the function.

This is excellent. Thanks for adding this.
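To illustrate why an ordered index helps with equality and range predicates, here is a toy sketch that uses a sorted list and binary search. It is not Lance's BTREE implementation; `SortedIndexSketch` and its methods are hypothetical.

```python
import bisect
from typing import List, Sequence


class SortedIndexSketch:
    """Toy ordered index: maps scalar values to row ids via binary search."""

    def __init__(self, values: Sequence[int]) -> None:
        # Keep (value, row_id) pairs sorted by value.
        self._entries = sorted((v, i) for i, v in enumerate(values))
        self._keys = [v for v, _ in self._entries]

    def equals(self, value: int) -> List[int]:
        """Row ids where the column equals value."""
        lo = bisect.bisect_left(self._keys, value)
        hi = bisect.bisect_right(self._keys, value)
        return sorted(row for _, row in self._entries[lo:hi])

    def between(self, low: int, high: int) -> List[int]:
        """Row ids where low <= column <= high."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return sorted(row for _, row in self._entries[lo:hi])
```

Both lookups locate the matching slice without scanning every row, which is the property that lets a filter skip most of the data.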

wjones127 · 2023-11-13T20:27:05Z

python/python/lance/dataset.py

@@ -689,6 +689,83 @@ def cleanup_old_versions(
            td_to_micros(older_than), delete_unverified
        )

+    def create_scalar_index(


🤷 Yeah I'm not sure either way. This approach seems fine.

westonpace · 2023-11-16T21:53:39Z

This is ready for another look.

wjones127

Looking good. I would just say move the enum to a string and then this is good to go.

westonpace · 2023-11-17T00:03:17Z

Rebased and will merge when CI passes

westonpace commented Nov 13, 2023

wjones127 requested changes Nov 13, 2023

westonpace force-pushed the feat/scalar-index-python-bindings branch from 5a510af to 6c341b2 Compare November 14, 2023 22:20

westonpace mentioned this pull request Nov 16, 2023

chore: expose scalar index #1614

Closed

westonpace force-pushed the feat/scalar-index-python-bindings branch from 6c341b2 to 352fa48 Compare November 16, 2023 21:51

westonpace requested a review from wjones127 November 16, 2023 21:53

wjones127 reviewed Nov 16, 2023

wjones127 approved these changes Nov 16, 2023

westonpace added 4 commits November 16, 2023 16:02

add python bindings for creating scalar indices

4151a3b

Apply clippy suggestion

eb98b58

Changed the index type to an enum. Added more documentation.

aa8308a

Back to using strings instead of enum for index type

e160ad8

westonpace force-pushed the feat/scalar-index-python-bindings branch from 551342d to e160ad8 Compare November 17, 2023 00:02

westonpace merged commit 296752c into lancedb:main Nov 17, 2023
17 checks passed
