-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizes live docs computes for force merge in NativeEngines990KNNVectorWriter #2135
Optimizes live docs computes for force merge in NativeEngines990KNNVectorWriter #2135
Conversation
quantization isn't needed Signed-off-by: Tejas Shah <[email protected]>
0a280c1
to
2ae6af5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add java doc on all the new added classes.
import java.util.ArrayList; | ||
import java.util.List; | ||
|
||
public final class KNNMergeVectorValues { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add java docs on top of this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is a utility, lets rename the class to reflect that. Before you added the java doc I was under impression that its some kind of vectorValues.
if (liveDocs instanceof FixedBitSet) { | ||
liveDocsInt = ((FixedBitSet) liveDocs).cardinality(); | ||
} else { | ||
liveDocsInt = computeLiveDocs(values, liveDocs); | ||
values = knnVectorsReader.getFloatVectorValues(fieldInfo.name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let add comment around this code on why we are doing it.
return count; | ||
} | ||
|
||
static class KNNVectorValuesSub<T extends DocIdSetIterator> extends DocIDMerger.Sub { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this is package private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed in KNNVectorValuesIterator
return subs; | ||
} | ||
|
||
public static List<KNNVectorValuesSub<ByteVectorValues>> mergeByteVectorValues(FieldInfo fieldInfo, MergeState mergeState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add java doc.
return getVectorValues(vectorDataType, new KNNVectorValuesIterator.FieldWriterIteratorValues<>(docIdWithFieldSet, vectors)); | ||
} | ||
|
||
public static <T> KNNVectorValues<T> getVectorValues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
java doc.
@VisibleForTesting | ||
@SuppressWarnings("unchecked") | ||
private static <T> KNNVectorValues<T> getVectorValues( | ||
static <T> KNNVectorValues<T> getVectorValues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why to make it package private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed for unit test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets try to avoid this and use one of the public function to test this function.
Please also update the changelog. |
2ae6af5
to
4fcd529
Compare
NativeEnginesKNNVectorsWriter Currently vector values are iterated irrespective of whether there are deleted docs in the segment. This makes sure they aren't Signed-off-by: Tejas Shah <[email protected]>
4fcd529
to
316a903
Compare
int count = 0; | ||
for (int index = 0; index < liveDocs.length(); index++) { | ||
count += liveDocs.get(index) ? 1 : 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will iterate over all the bits in the bitset, rather I would suggest VectorValues to iterate over the docIds and then see if the docIds is present in bitset or not. Because it can happen for a field only few docIds are present in liveDocs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That requires another call to knnVectorsReader.get[Float/Byte]VectorValues
for each segment, which is an anticipated latency driver. IO operation might be more expensive than looping through considering merges
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check my comment here: #2135 (comment) using live docs will not give you correct doc count.
ByteVectorValues values = knnVectorsReader.getByteVectorValues(fieldInfo.getName()); | ||
if (values != null) { | ||
final Bits liveDocs = mergeState.liveDocs[i]; | ||
final int liveDocsCardinality = cardinality(liveDocs, Math.toIntExact(values.cost())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for understanding on how much time we are spending on this operation its better to add some debug logs.
* @return List of KNNVectorSub | ||
* @throws IOException | ||
*/ | ||
public static KNNVectorValuesIterator.MergeFloat32VectorValuesIterator mergeFloatVectorValues( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static KNNVectorValuesIterator.MergeFloat32VectorValuesIterator mergeFloatVectorValues( | |
public static KNNVectorValuesIterator.MergeFloat32VectorValuesIterator getMergedFloatVectorValuesIterator( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this to return MergedFloatVectorValues which extends FloatVectorValues so iterator will not make sense
if (liveDocs instanceof FixedBitSet) { | ||
return ((FixedBitSet) liveDocs).cardinality(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this special case for fixedbitset and not for SparseFixedBitSet
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think its being used. I was looking at the usages of LiveDocsFormats write
read only returns FixedBits. Moreover while debugging I did not see SparseFixedBitSet usage. I can change it to instanceof BitSet
to be safe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not super confident with this and also the next loop to find the live docs. I would check with lucene community first before making any assumptions here to avoid any possible regression, since this is on the write path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
community first before making any assumptions
Can you elaborate about the concerns with BitSet usage, as in performance or functionality? cardinality is supposed to return set bits and lucene seems to have optimized it for Fixed and Sparse with bit shifting
the next loop to find the live docs
For this is the concern performance or functionality?
Asking because we should do our due diligence before reaching out to the lucene community
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shatejas I think we don't need to reach out to Lucene, I thought of the case where this bitset count will fail and you can try this out with your cluster.
Live docs tells the total number of docs which are not deleted on the segment. Now a doc may or may not contain the vector field. So consider a case where there are 3 docs in the segment.
doc1 : has vector field : doc is deleted
doc2: has no vector field : doc is not deleted
doc3: has vector field : doc is not deleted.
With the above case the bitset will give live docs as 2 but for the vector field the live docs is only 1.
Hence, using the bitset.cardinality() will never give you right number of live docs.
and this is the reason why lucene doesn't use this during segment merges during the training.
cc: @jmazanec15 and @heemin32
*/ | ||
@Deprecated | ||
public long totalLiveDocs() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we removing deprecated from here, we should also update the java docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed as there was no @deprecated javadoc tag. The convention is to add @deprecated in javadocs tag and give a reason next to it instead of combining it with existing
|
||
@Override | ||
public int advance(int docId) throws IOException { | ||
throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the exception message.
public abstract U vectorValue() throws IOException; | ||
} | ||
|
||
class MergeFloat32VectorValuesIterator extends MergeSegmentVectorValuesIterator<FloatVectorValues, float[]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we went with creating the iterator route here? rather we could have just extended the FloatVectorValues and ByteVectorValues. This would have avoided all these iterators and new strategy classes. Because FloatVectorValues extends the DISI and a lot of code could have been easily reused.
The problem with LuceneMergedFloatVector values was it was not calculating the cost()
correctly, hence we needed some wrapper classes to calculate the liveDocs aka cost correctly.
One suggestion here I would have is to go in lucene and actually fix the MergedFloat32VectorValues
to ensure that its cost() function is giving right cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't realize its a DISI, changing it
We can take up the lucene contribution as a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can take up the lucene contribution as a follow up
Please create a gh issue and ref it here.
After more experiments we figured that this live docs iteration is not the root cause, This only yields 32ms gains for 1.6M vectors adding an overhead of maintaining the merge vectors code. Decided to discard this |
Description
Currently live docs are computed with linear complexity even when it does not have deleted docs. This PR skips computing it
Testing
Related Issues
#2134
Check List
--signoff
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.