-
Notifications
You must be signed in to change notification settings - Fork 221
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: use selection vector strategy to improve exact knn performance with deletions #1418
Conversation
fn scan( | ||
&self, | ||
with_row_id: bool, | ||
with_make_deletions_null: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what there be an implication of this with regular null support? Do we need to differentiate them?
Maybe add some docstring to discuss the considerations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only affects the nulls/validity of the _rowid
column, which I don't think we had any plans to use anyways.
Does this PR suggest that the main overhead is memory allocation ? |
Yes, see the traces in the issue: #1352 (comment) The Basic lesson here is that |
1fd2203
to
fc59ffb
Compare
rust/lance/src/dataset/scanner.rs
Outdated
@@ -1259,7 +1277,7 @@ mod test { | |||
), | |||
true, | |||
), | |||
ArrowField::new("_distance", DataType::Float32, false), | |||
ArrowField::new("_distance", DataType::Float32, true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets change these to DIST_COL
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I've change all other instances or ROW_ID
and DIST_COL
.
test just use _rowid validity as selection vector pr feedback fix
dfd6e2e
to
434e286
Compare
This PR repurposes the
_rowid
column's validity buffer as a selection vector. By using a selection vector, we can defer the mem copies involved with applying deletions to columnar data, which is a big bottleneck in KNN queries with deletions.For convenience, we make the
_distance
column nullable. The final output will never be null, but we sometimes will have the intermediate values be null.Closes #1352
TODO:
_rowid
validity bitmap as selection vector.