Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Nested field missing sub embedding field will cause the IndexOutOfBoundsException #909

Closed
wdongyu opened this issue Sep 18, 2024 · 4 comments · Fixed by #913
Closed
Assignees
Labels
bug Something isn't working v2.18.0

Comments

@wdongyu
Copy link
Contributor

wdongyu commented Sep 18, 2024

What is the bug?

Currently, if a nested field nestedField with two sub field textField and textFieldNotForEmbedding, and a doc like:

{
  "nestedField": [
    {
      "textFieldNotForEmbedding": "This is a text field"
    },
    {
      "textField": "This is another text field"
    }
  ]
}

applying pipeline configuration:

{
  "description": ...,
  "processors": [
    {
      "text_embedding": {
        "model_id": ...,
        "field_map": {
          "nestedField": {
              "textField": "vectorField"
          }
        }
      }
    }
  ]
}

Result:

java.lang.IndexOutOfBoundsException: Index 1 out of bounds for length 1

Expected result:

{
  "nestedField": [
    {
      "textFieldNotForEmbedding": "This is a text field"
    },
    {
      "textField": "This is another text field"
      "vectorField": [...]
    }
  ]
}

How can one reproduce the bug?

Change the mock doc in test code:

// textField --> textFieldNotForEmbedding

nestedObj1.put("textFieldNotForEmbedding", "This is a text field") 

In unit test, get the nlpResult from textEmbeddingProcessor.buildNLPResult, then set it to each field in original doc just like the embedding processor actually does:

Map<String, Object> nlpResult = textEmbeddingProcessor.buildNLPResult(
            knnMap,
            modelTensorList,
            ingestDocument.getSourceAndMetadata()
        );
nlpResult.forEach(ingestDocument::setFieldValue);

And run the unit test with:

./gradlew ':test' --tests "org.opensearch.neuralsearch.processor.TextEmbeddingProcessorTests.testBuildVectorOutput_withNestedList_successful"

What is the expected behavior?

Mentioned above.

What is your host/environment?

Operating system, version.

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

Add any other context about the problem.

@wdongyu wdongyu added bug Something isn't working untriaged labels Sep 18, 2024
@wdongyu
Copy link
Contributor Author

wdongyu commented Sep 18, 2024

When setting fields in IngestDocument, the processor should check if the textField is not empty :

// build nlp output for list of nested objects
for (Map<String, Object> nestedElement : (List<Map<String, Object>>) sourceAndMetadataMap.get(processorKey)) {
     // Adding non empty check for inputNestedMapEntry, Only fill in embedding when value is not null
     // if (inputNestedMapEntry.getValue().get(index) != null)
     nestedElement.put(inputNestedMapEntry.getKey(), results.get(indexWrapper.index++));
}

willing to open a pr if confirmed.

@vibrantvarun
Copy link
Member

@martin-gaievski I think it is a bug and needs to be fixed. Can you once confirm?

@martin-gaievski
Copy link
Member

yup, looks like it, we need a fix for it. @wdongyu is this on latest main and 2.x, or specific to a certain 2.x version?

@wdongyu
Copy link
Contributor Author

wdongyu commented Sep 20, 2024

@martin-gaievski I test on latest main, it should exist in all versions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v2.18.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants