perf: use selection vector strategy to improve exact knn performance with deletions #1418

Merged
merged 3 commits into from
Oct 19, 2023

Conversation

wjones127
Contributor

@wjones127 wjones127 commented Oct 14, 2023

This PR repurposes the _rowid column's validity buffer as a selection vector. With a selection vector, we can defer the memory copies involved in applying deletions to columnar data, which are a big bottleneck in KNN queries with deletions.

For convenience, we make the _distance column nullable. The final output will never contain nulls, but intermediate values can be null.
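
A minimal sketch of the idea in plain Rust, not the code from this PR. `Batch`, `l2`, and `knn` are hypothetical names, and `Option<u64>` stands in for an Arrow array with a validity bitmap. Deleted rows keep their vector data in place, only their `_rowid` slot is null, the intermediate distance for those rows is null, and the final top-k drops them so no null reaches the output.

```rust
/// Hypothetical stand-in for a record batch: `row_ids[i] == None` plays the
/// role of a cleared validity bit on `_rowid`, i.e. row `i` is deleted.
struct Batch {
    row_ids: Vec<Option<u64>>,
    vectors: Vec<Vec<f32>>,
}

/// Squared L2 distance between two equal-length vectors.
fn l2(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exact KNN that uses the `_rowid` validity as a selection vector.
/// The vectors are never copied or compacted; deleted rows just yield a
/// null intermediate distance and are skipped when picking the top k.
fn knn(batch: &Batch, query: &[f32], k: usize) -> Vec<(u64, f32)> {
    // Intermediate column: nullable distance, null where the row is deleted.
    let distances: Vec<Option<f32>> = batch
        .row_ids
        .iter()
        .zip(&batch.vectors)
        .map(|(id, v)| id.map(|_| l2(v, query)))
        .collect();

    // Final output: only valid rows, so no nulls remain.
    let mut hits: Vec<(u64, f32)> = batch
        .row_ids
        .iter()
        .zip(&distances)
        .filter_map(|(id, d)| Some(((*id)?, (*d)?)))
        .collect();
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
    hits.truncate(k);
    hits
}
```

Calling `knn` on a batch where one `_rowid` is `None` returns neighbors drawn only from the remaining rows, while `batch.vectors` keeps its original length.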

Closes #1352

TODO:

  • switch to using _rowid validity bitmap as selection vector.

@wjones127
Contributor Author

This PR eliminates the latency cost of deletions for KNN search (red bars are before, yellow are after this PR):

[Figure: screenshot of a bar chart, KNN latency before (red) and after (yellow) this PR]

@wjones127 wjones127 marked this pull request as ready for review October 15, 2023 03:53
@wjones127 wjones127 changed the title from "perf: use selection vector strategy to import exact knn performance with deletions" to "perf: use selection vector strategy to improve exact knn performance with deletions" Oct 15, 2023
rust/lance-linalg/src/distance/dot.rs (outdated, resolved)
rust/lance/src/dataset.rs (outdated, resolved)
fn scan(
    &self,
    with_row_id: bool,
    with_make_deletions_null: bool,
Contributor

Would this have any implications for regular null support? Do we need to differentiate them?

Maybe add some docstring to discuss the considerations?

Contributor Author

This only affects the nulls/validity of the _rowid column, which I don't think we had any plans to use anyways.
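
To make the distinction concrete, here is a small sketch in plain Rust (hypothetical types and names, not the PR's implementation): deletions only ever null out `_rowid`, so ordinary nulls in user columns stay distinguishable.

```rust
/// Hypothetical two-column batch: a nullable user column plus `_rowid`.
struct Rows {
    /// User data; `None` here is an ordinary null value.
    values: Vec<Option<i32>>,
    /// `_rowid`; `None` here means "row deleted", never "unknown id".
    row_ids: Vec<Option<u64>>,
}

/// Null out `_rowid` for deleted rows. User columns are left untouched.
fn make_deletions_null(rows: &mut Rows, deleted: &[u64]) {
    for id in rows.row_ids.iter_mut() {
        if let Some(v) = *id {
            if deleted.contains(&v) {
                *id = None;
            }
        }
    }
}

/// Count ordinary nulls among the rows that are still live.
fn live_null_count(rows: &Rows) -> usize {
    rows.values
        .iter()
        .zip(&rows.row_ids)
        .filter(|(v, id)| id.is_some() && v.is_none())
        .count()
}
```

Because the deletion mask lives on `_rowid` alone, a null in `values` keeps its usual meaning whether or not the row is deleted.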

@eddyxu
Contributor

eddyxu commented Oct 16, 2023

Does this PR suggest that the main overhead is memory allocation?

@wjones127
Contributor Author

Does this PR suggest that the main overhead is memory allocation?

Yes, see the traces in the issue: #1352 (comment)

The "apply deletions" portions of the traces are where we have to do memory copies to remove the deleted rows. This PR makes that unnecessary.

The basic lesson here is that take() and concat_batches() are expensive for large blobs.
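
A rough illustration of why the copies dominate, in plain Rust rather than the Arrow API: `compact` and `select` are hypothetical helpers, with `compact` only analogous in spirit to `take()`. Compaction copies every surviving vector, so its cost grows with row width, while a selection vector costs one flag per row regardless of how large each row is.

```rust
/// Compacting approach (analogous in spirit to `take()`): build a new
/// buffer holding only surviving rows. Returns the buffer and bytes copied.
fn compact(vectors: &[Vec<f32>], deleted: &[bool]) -> (Vec<Vec<f32>>, usize) {
    let mut copied = 0;
    let mut out = Vec::new();
    for (v, &gone) in vectors.iter().zip(deleted) {
        if !gone {
            copied += v.len() * std::mem::size_of::<f32>();
            out.push(v.clone()); // every surviving vector is copied
        }
    }
    (out, copied)
}

/// Selection-vector approach: leave the vectors where they are and return
/// one validity flag per row. Work scales with row count, not row width.
fn select(deleted: &[bool]) -> Vec<bool> {
    deleted.iter().map(|&gone| !gone).collect()
}
```

With four 1024-dimensional `f32` vectors and one deletion, `compact` copies 3 * 1024 * 4 bytes, whereas `select` produces four flags and touches none of the vector data.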

@wjones127 wjones127 force-pushed the wjones127/optimize-knn-with-delete branch from 1fd2203 to fc59ffb Compare October 17, 2023 16:37
@wjones127 wjones127 requested a review from eddyxu October 18, 2023 15:43
@@ -1259,7 +1277,7 @@ mod test {
 ),
 true,
 ),
-ArrowField::new("_distance", DataType::Float32, false),
+ArrowField::new("_distance", DataType::Float32, true),
Contributor

Let's change these to DIST_COL as well?

Contributor Author

Okay. I've changed all other instances of ROW_ID and DIST_COL.

@wjones127 wjones127 force-pushed the wjones127/optimize-knn-with-delete branch from dfd6e2e to 434e286 Compare October 19, 2023 00:51
@wjones127 wjones127 merged commit 1120a47 into main Oct 19, 2023
15 checks passed
@wjones127 wjones127 deleted the wjones127/optimize-knn-with-delete branch October 19, 2023 01:06
Development

Successfully merging this pull request may close these issues.

perf: investigate KNN performance with deletions
2 participants