Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add python bindings for creating scalar indices #1592

Merged
merged 4 commits into from
Nov 17, 2023

Conversation

westonpace
Copy link
Contributor

No description provided.

Comment on lines -600 to +602
metric_type: Option<&str>,
replace: Option<bool>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved replace out of kwargs and moved metric_type into kwargs since:

  • replace is universally applicable regardless of index type
  • metric_type is only applicable for vector indices

@@ -689,6 +689,83 @@ def cleanup_old_versions(
td_to_micros(older_than), delete_unverified
)

def create_scalar_index(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure whether we wanted to expose scalar indices as a separate method (create_scalar_index) or a different index type. Internally we have a single method with a different index type.

I ended up opting for two different methods but I could be argued out of it. My concern is that users will not realize that the one API can do both things and they would need too much sophistication to use it correctly. I'm hoping exposing this as two different APIs reduces the cognitive load on the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 Yeah I'm not sure either way. This approach seems fine.

@@ -689,6 +689,83 @@ def cleanup_old_versions(
td_to_micros(older_than), delete_unverified
)

def create_scalar_index(
self,
column: Union[str, List[str]],
Copy link
Contributor

@wjones127 wjones127 Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, it wouldn't be a breaking API change to go from str to Union[str, List[str]], so since we don't yet support multiple columns why don't we just keep this as str for now? The validation logic within the function is still relevant though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this to str.

def create_scalar_index(
self,
column: Union[str, List[str]],
index_type: str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd generally prefer to use Literal["BTREE"] for now, if possible. This given better auto-completion options.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or an enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up going with an enum. I hadn't really used enums for python before but now that python has better autocomplete I think this might be useful. Although, Literal["BTREE"] should functionally be an enum. @wjones127 any opinion on which to prefer?

Copy link
Contributor

@wjones127 wjones127 Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more typical Python style to use strings. It avoids the need to have to import the enum type in order to pass the value, which is nice in interactive settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having tried out both of them now I think I will stick with strings. Besides the issues that Will mentioned this will also give us consistency between the two create index APIs and between the rust API and the python API.

column: Union[str, List[str]],
index_type: str,
name: Optional[str] = None,
replace: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think boolean options should almost always be keyword only, for the sake of readability.

Suggested change
replace: bool = True,
*,
replace: bool = True,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I've changed this to keyword only

)

self._ds.create_index([column], index_type, name, replace)
return LanceDataset(self.uri)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clears the session cache, which isn't ideal. It would be nice if self._ds.create_index() mutated self to update the dataset reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this is already handled here:

self.ds = Arc::new(new_self);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to return nothing (since it is already updating self.ds as you pointed out). We should probably fix the other create_index function too (although we can defer that for a different PR)

Comment on lines 699 to 760
"""Create scalar index on column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth explaining what a "scalar" column is, for the lay user. Also mention the index types (right now BTREE) and what they are able to handle (equality and range queries). That way they can know what kind of predicates could benefit from the index. Maybe also worth calling our explicitly that this speeds up ANN search and scans, and that it can be used in combination with ANN index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a lot more documentation on the function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent. Thanks for adding this.

@@ -689,6 +689,83 @@ def cleanup_old_versions(
td_to_micros(older_than), delete_unverified
)

def create_scalar_index(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 Yeah I'm not sure either way. This approach seems fine.

@westonpace
Copy link
Contributor Author

This is ready for another look.

Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. I would just say move the enum to a string and then this is good to go.

@westonpace westonpace force-pushed the feat/scalar-index-python-bindings branch from 551342d to e160ad8 Compare November 17, 2023 00:02
@westonpace
Copy link
Contributor Author

Rebased and will merge when CI passes

@westonpace westonpace merged commit 296752c into lancedb:main Nov 17, 2023
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants