Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizes live docs computes for force merge in NativeEngines990KNNVectorWriter #2135

Closed

Conversation

shatejas
Copy link
Collaborator

Description

Currently live docs are computed with linear complexity even when it does not have deleted docs. This PR skips computing it

Testing

  • UTs and javadocs in progression. Need initial feed back

Related Issues

#2134

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

quantization isn't needed

Signed-off-by: Tejas Shah <[email protected]>
Copy link
Collaborator

@navneet1v navneet1v left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add java doc on all the new added classes.

import java.util.ArrayList;
import java.util.List;

public final class KNNMergeVectorValues {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add java docs on top of this class.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a utility, lets rename the class to reflect that. Before you added the java doc I was under impression that its some kind of vectorValues.

Comment on lines 40 to 44
if (liveDocs instanceof FixedBitSet) {
liveDocsInt = ((FixedBitSet) liveDocs).cardinality();
} else {
liveDocsInt = computeLiveDocs(values, liveDocs);
values = knnVectorsReader.getFloatVectorValues(fieldInfo.name);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let add comment around this code on why we are doing it.

return count;
}

static class KNNVectorValuesSub<T extends DocIdSetIterator> extends DocIDMerger.Sub {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is package private?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return subs;
}

public static List<KNNVectorValuesSub<ByteVectorValues>> mergeByteVectorValues(FieldInfo fieldInfo, MergeState mergeState)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add java doc.

return getVectorValues(vectorDataType, new KNNVectorValuesIterator.FieldWriterIteratorValues<>(docIdWithFieldSet, vectors));
}

public static <T> KNNVectorValues<T> getVectorValues(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java doc.

Comment on lines +111 to +104
@VisibleForTesting
@SuppressWarnings("unchecked")
private static <T> KNNVectorValues<T> getVectorValues(
static <T> KNNVectorValues<T> getVectorValues(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why to make it package private?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed for unit test

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets try to avoid this and use one of the public function to test this function.

@navneet1v
Copy link
Collaborator

Please also update the changelog.

NativeEnginesKNNVectorsWriter

Currently vector values are iterated irrespective of whether there are
deleted docs in the segment. This makes sure they aren't

Signed-off-by: Tejas Shah <[email protected]>
Comment on lines +98 to +101
int count = 0;
for (int index = 0; index < liveDocs.length(); index++) {
count += liveDocs.get(index) ? 1 : 0;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will iterate over all the bits in the bitset, rather I would suggest VectorValues to iterate over the docIds and then see if the docIds is present in bitset or not. Because it can happen for a field only few docIds are present in liveDocs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That requires another call to knnVectorsReader.get[Float/Byte]VectorValues for each segment, which is an anticipated latency driver. IO operation might be more expensive than looping through considering merges

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check my comment here: #2135 (comment) using live docs will not give you correct doc count.

ByteVectorValues values = knnVectorsReader.getByteVectorValues(fieldInfo.getName());
if (values != null) {
final Bits liveDocs = mergeState.liveDocs[i];
final int liveDocsCardinality = cardinality(liveDocs, Math.toIntExact(values.cost()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for understanding on how much time we are spending on this operation its better to add some debug logs.

* @return List of KNNVectorSub
* @throws IOException
*/
public static KNNVectorValuesIterator.MergeFloat32VectorValuesIterator mergeFloatVectorValues(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static KNNVectorValuesIterator.MergeFloat32VectorValuesIterator mergeFloatVectorValues(
public static KNNVectorValuesIterator.MergeFloat32VectorValuesIterator getMergedFloatVectorValuesIterator(

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this to return MergedFloatVectorValues which extends FloatVectorValues so iterator will not make sense

Comment on lines +94 to +96
if (liveDocs instanceof FixedBitSet) {
return ((FixedBitSet) liveDocs).cardinality();
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this special case for fixedbitset and not for SparseFixedBitSet ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think its being used. I was looking at the usages of LiveDocsFormats write

read only returns FixedBits. Moreover while debugging I did not see SparseFixedBitSet usage. I can change it to instanceof BitSet to be safe

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not super confident with this and also the next loop to find the live docs. I would check with lucene community first before making any assumptions here to avoid any possible regression, since this is on the write path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

community first before making any assumptions

Can you elaborate about the concerns with BitSet usage, as in performance or functionality? cardinality is supposed to return set bits and lucene seems to have optimized it for Fixed and Sparse with bit shifting

the next loop to find the live docs

For this is the concern performance or functionality?

Asking because we should do our due diligence before reaching out to the lucene community

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shatejas I think we don't need to reach out to Lucene, I thought of the case where this bitset count will fail and you can try this out with your cluster.

Live docs tells the total number of docs which are not deleted on the segment. Now a doc may or may not contain the vector field. So consider a case where there are 3 docs in the segment.
doc1 : has vector field : doc is deleted
doc2: has no vector field : doc is not deleted
doc3: has vector field : doc is not deleted.

With the above case the bitset will give live docs as 2 but for the vector field the live docs is only 1.

Hence, using the bitset.cardinality() will never give you right number of live docs.

and this is the reason why lucene doesn't use this during segment merges during the training.

cc: @jmazanec15 and @heemin32

Comment on lines 84 to 85
*/
@Deprecated
public long totalLiveDocs() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we removing deprecated from here, we should also update the java docs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed as there was no @deprecated javadoc tag. The convention is to add @deprecated in javadocs tag and give a reason next to it instead of combining it with existing


@Override
public int advance(int docId) throws IOException {
throw new UnsupportedOperationException();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the exception message.

public abstract U vectorValue() throws IOException;
}

class MergeFloat32VectorValuesIterator extends MergeSegmentVectorValuesIterator<FloatVectorValues, float[]> {
Copy link
Collaborator

@navneet1v navneet1v Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we went with creating the iterator route here? rather we could have just extended the FloatVectorValues and ByteVectorValues. This would have avoided all these iterators and new strategy classes. Because FloatVectorValues extends the DISI and a lot of code could have been easily reused.

Ref: https://github.com/apache/lucene/blob/branch_9x/lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java#L286

The problem with LuceneMergedFloatVector values was it was not calculating the cost() correctly, hence we needed some wrapper classes to calculate the liveDocs aka cost correctly.

One suggestion here I would have is to go in lucene and actually fix the MergedFloat32VectorValues to ensure that its cost() function is giving right cost.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize its a DISI, changing it

We can take up the lucene contribution as a follow up

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can take up the lucene contribution as a follow up

Please create a gh issue and ref it here.

@shatejas
Copy link
Collaborator Author

shatejas commented Oct 2, 2024

After more experiments we figured that this live docs iteration is not the root cause, This only yields 32ms gains for 1.6M vectors adding an overhead of maintaining the merge vectors code.

Decided to discard this

@shatejas shatejas closed this Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants