feat: add python bindings for creating scalar indices #1592
metric_type: Option<&str>,
replace: Option<bool>,
I moved `replace` out of `kwargs` and moved `metric_type` into `kwargs` since:
- `replace` is universally applicable regardless of index type
- `metric_type` is only applicable for vector indices
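A minimal Python sketch of the split described above: `replace` as an explicit parameter, `metric_type` read from `kwargs` and accepted only for vector indices. The function name `create_index_sketch` and the `"IVF_PQ"` value are illustrative assumptions, not the actual binding.

```python
from typing import Any, Dict, Optional


def create_index_sketch(
    column: str,
    index_type: str,
    name: Optional[str] = None,
    replace: bool = True,
    **kwargs: Any,
) -> Dict[str, Any]:
    """Hypothetical stand-in for the binding; returns the resolved parameters."""
    params: Dict[str, Any] = {
        "column": column,
        "index_type": index_type,
        "name": name,
        "replace": replace,  # applies to every index type
    }
    # metric_type is index-specific, so it travels through kwargs.
    metric_type = kwargs.pop("metric_type", None)
    if metric_type is not None:
        if index_type == "BTREE":
            raise ValueError("metric_type only applies to vector indices")
        params["metric_type"] = metric_type
    return params
```

With this shape, `create_index_sketch("vec", "IVF_PQ", metric_type="L2")` carries the metric through, while passing `metric_type` alongside `"BTREE"` is rejected.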
@@ -689,6 +689,83 @@ def cleanup_old_versions(
td_to_micros(older_than), delete_unverified
)

def create_scalar_index(
I wasn't sure whether we wanted to expose scalar indices as a separate method (`create_scalar_index`) or a different index type. Internally we have a single method with a different index type.
I ended up opting for two different methods but I could be argued out of it. My concern is that users will not realize that the one API can do both things and they would need too much sophistication to use it correctly. I'm hoping exposing this as two different APIs reduces the cognitive load on the user.
🤷 Yeah I'm not sure either way. This approach seems fine.
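For illustration, the two-method design discussed in this thread could look roughly like the sketch below: two public entry points that both delegate to one internal method keyed on index type. `DatasetSketch`, `_create_index`, and the `"IVF_PQ"` default are hypothetical names chosen for the example, not the Lance implementation.

```python
from typing import List, Tuple


class DatasetSketch:
    """Hypothetical stand-in for a dataset wrapper; not the Lance API."""

    def __init__(self) -> None:
        # Records (column, index_type) pairs so the delegation is observable.
        self.created: List[Tuple[str, str]] = []

    def _create_index(self, column: str, index_type: str) -> None:
        # Single internal entry point that handles every index type.
        self.created.append((column, index_type))

    def create_scalar_index(self, column: str, index_type: str = "BTREE") -> None:
        # Narrow public API: only scalar index types are accepted here.
        if index_type != "BTREE":
            raise ValueError(f"unsupported scalar index type: {index_type}")
        self._create_index(column, index_type)

    def create_index(self, column: str, index_type: str = "IVF_PQ") -> None:
        # Separate public API for vector indices.
        self._create_index(column, index_type)
```

Each public method only has to document its own options, which is the reduction in cognitive load the comment is aiming for.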
python/python/lance/dataset.py
Outdated
@@ -689,6 +689,83 @@ def cleanup_old_versions(
td_to_micros(older_than), delete_unverified
)

def create_scalar_index(
self,
column: Union[str, List[str]],
In Python, it wouldn't be a breaking API change to go from `str` to `Union[str, List[str]]`, so since we don't yet support multiple columns why don't we just keep this as `str` for now? The validation logic within the function is still relevant though.
I've updated this to `str`.
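As a rough sketch of the point about validation: the parameter can be annotated as `str` today while the body still rejects anything that is not a single column name, so widening the annotation later stays non-breaking. The helper name `validate_column` is an assumption made for this example.

```python
from typing import List, Union


def validate_column(column: Union[str, List[str]]) -> str:
    """Hypothetical helper: return the single column name or raise."""
    if isinstance(column, str):
        return column
    # A one-element list is tolerated so callers written against a future
    # multi-column signature keep working for the single-column case.
    if isinstance(column, list) and len(column) == 1 and isinstance(column[0], str):
        return column[0]
    raise NotImplementedError(
        "Scalar indices currently support exactly one column"
    )
```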
python/python/lance/dataset.py
Outdated
def create_scalar_index(
self,
column: Union[str, List[str]],
index_type: str,
I'd generally prefer to use `Literal["BTREE"]` for now, if possible. This gives better auto-completion options.
Or an enum?
I ended up going with an enum. I hadn't really used enums in Python before, but now that Python has better autocomplete I think this might be useful. Although, `Literal["BTREE"]` should functionally be an enum. @wjones127 any opinion on which to prefer?
I think it's more typical Python style to use strings. It avoids the need to have to import the enum type in order to pass the value, which is nice in interactive settings.
Having tried out both of them now, I think I will stick with strings. Besides the issues that Will mentioned, this will also give us consistency between the two create index APIs and between the Rust API and the Python API.
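To make the trade-off in this thread concrete, here is a small sketch comparing the two options. `ScalarIndexType`, `ScalarIndexTypeEnum`, and `check_index_type` are hypothetical names for illustration only.

```python
from enum import Enum
from typing import Literal, get_args

# Option 1: a Literal type. Callers pass a plain string, and type checkers
# and editors can still offer "BTREE" as a completion.
ScalarIndexType = Literal["BTREE"]


# Option 2: an enum. Callers have to import it before they can pass a value.
class ScalarIndexTypeEnum(str, Enum):
    BTREE = "BTREE"


def check_index_type(index_type: ScalarIndexType) -> str:
    """Runtime check, since Literal annotations are not enforced at runtime."""
    allowed = get_args(ScalarIndexType)
    if index_type not in allowed:
        raise ValueError(f"index_type must be one of {allowed}, got {index_type!r}")
    return index_type
```

With the string form a call site reads `check_index_type("BTREE")` with no extra import, which matches the interactive-use argument made above.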
column: Union[str, List[str]],
index_type: str,
name: Optional[str] = None,
replace: bool = True,
nit: I think boolean options should almost always be keyword only, for the sake of readability.
- replace: bool = True,
+ *,
+ replace: bool = True,
Good idea. I've changed this to keyword-only.
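A short sketch of what the keyword-only change means for callers. The function below is a hypothetical stand-in with the same parameter layout as the suggestion, not the real method.

```python
from typing import Any, Dict, Optional


def create_scalar_index_sketch(
    column: str,
    index_type: str,
    name: Optional[str] = None,
    *,  # everything after this marker must be passed by keyword
    replace: bool = True,
) -> Dict[str, Any]:
    """Hypothetical stand-in; returns the arguments it received."""
    return {
        "column": column,
        "index_type": index_type,
        "name": name,
        "replace": replace,
    }
```

A call such as `create_scalar_index_sketch("id", "BTREE", None, False)` now raises `TypeError`, while `replace=False` spells out the intent at the call site.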
python/python/lance/dataset.py
Outdated
)

self._ds.create_index([column], index_type, name, replace)
return LanceDataset(self.uri)
This clears the session cache, which isn't ideal. It would be nice if `self._ds.create_index()` mutated `self` to update the dataset reference.
In fact, this is already handled here:
Line 707 in 5a510af
self.ds = Arc::new(new_self);
I changed this to return nothing (since it is already updating `self.ds` as you pointed out). We should probably fix the other `create_index` function too (although we can defer that for a different PR).
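The behaviour settled on here can be sketched in pure Python as follows: the inner handle updates its own state, and the wrapper returns `None` instead of constructing a fresh dataset object. `InnerDatasetSketch`, `DatasetSketch`, and the `session_cache` attribute are hypothetical stand-ins for the native handle and wrapper.

```python
from typing import Dict, List


class InnerDatasetSketch:
    """Hypothetical stand-in for the native dataset handle (`_ds`)."""

    def __init__(self) -> None:
        self.indices: List[str] = []

    def create_index(self, columns: List[str], index_type: str) -> None:
        # Mirrors the idea behind `self.ds = Arc::new(new_self)`: the handle
        # swaps in its updated state rather than handing back a new object.
        self.indices = self.indices + [f"{columns[0]}:{index_type}"]


class DatasetSketch:
    """Hypothetical stand-in for the Python wrapper; not the Lance API."""

    def __init__(self, uri: str) -> None:
        self.uri = uri
        self._ds = InnerDatasetSketch()
        self.session_cache: Dict[str, bytes] = {}  # illustrative only

    def create_scalar_index(self, column: str, index_type: str = "BTREE") -> None:
        # Returns nothing: the same wrapper, handle, and cache stay in use.
        self._ds.create_index([column], index_type)
```

Because no new wrapper is built from `self.uri`, anything cached on the existing object survives the call.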
python/python/lance/dataset.py
Outdated
"""Create scalar index on column.

It might be worth explaining what a "scalar" column is, for the lay user. Also mention the index types (right now BTREE) and what they are able to handle (equality and range queries). That way they can know what kind of predicates could benefit from the index. Maybe also worth calling out explicitly that this speeds up ANN search and scans, and that it can be used in combination with an ANN index.
I added a lot more documentation on the function.
This is excellent. Thanks for adding this.
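As a rough illustration of why an ordered index helps with the equality and range predicates mentioned above, the sketch below uses a sorted list and Python's `bisect` as a stand-in for a B-tree. It is not how Lance implements `BTREE`; all function names are assumptions for the example.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def build_sorted_index(values: List[int]) -> Tuple[List[int], List[int]]:
    """Return (sorted keys, matching row ids) for a scalar column."""
    pairs = sorted((value, row_id) for row_id, value in enumerate(values))
    keys = [value for value, _ in pairs]
    rows = [row_id for _, row_id in pairs]
    return keys, rows


def rows_equal(keys: List[int], rows: List[int], key: int) -> List[int]:
    # Equality predicate: binary search for the run of matching keys.
    return rows[bisect_left(keys, key):bisect_right(keys, key)]


def rows_in_range(keys: List[int], rows: List[int], low: int, high: int) -> List[int]:
    # Range predicate low <= value < high: two binary searches, no full scan.
    return rows[bisect_left(keys, low):bisect_left(keys, high)]


keys, rows = build_sorted_index([30, 10, 20, 10, 40])
print(rows_equal(keys, rows, 10))         # [1, 3]
print(rows_in_range(keys, rows, 15, 35))  # [2, 0]
```

Both lookups touch only the matching slice of the index, which is the kind of predicate the docstring can tell users will benefit.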
5a510af to 6c341b2
6c341b2 to 352fa48
This is ready for another look.
Looking good. I would just say move the enum to a string and then this is good to go.
551342d to e160ad8
Rebased and will merge when CI passes.
No description provided.