[Poor results for RAG with Vector/Hybrid/Semantic search] #40983

Closed
securigy opened this issue Jan 3, 2024 · 13 comments
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Search Service Attention Workflow: This issue is responsible by Azure service team.

Comments

@securigy

securigy commented Jan 3, 2024

Library name and version

Azure.Search.Documents and Azure.AI.OpenAI

Query/Question

I put together a RAG system. I segmented 2 Word documents on paragraph boundaries using only full sentences and successfully ingested them with the Azure API. My search offers a choice of Vector, Hybrid (vector + text), and Semantic (vector + text + semantic reordering). The 2 documents are my resume and my friend's resume. When I ask "Who is [my name]?" I get a decent answer. However, when I ask "Who is [my friend's name]?" I get nothing, only a reply of the form "Based on provided information there is no data on...".

I tried this in each of the 3 search modes mentioned above. The code that defines my search is:

```csharp
public SearchOptions? CreateSearchOptions(int searchTypeInt, int k, ReadOnlyMemory<float> embeddings)
{
    _logger.LogInformation("CreateSearchOptions entered");

    SearchOptions? searchOptions = null;
    try
    {
        SearchType searchType = (SearchType)searchTypeInt;

        searchOptions = new SearchOptions
        {
            //Filter = filter, will be set later
            Size = k,

            // fields to retrieve, if not specified then all are retrieved if retrievable
            Select = { "SegmentText", "NamedEntities" },
        };

        if ((searchType & SearchType.Vector) == SearchType.Vector)
        {
            searchOptions.VectorSearch = new VectorSearchOptions
            {
                Queries = { new VectorizedQuery(embeddings) { KNearestNeighborsCount = k, Fields = { "SegmentTextVector" } } },
            };
        }

        if ((searchType & SearchType.Semantic) == SearchType.Semantic)
        {
            // This is going to invoke semantic ranking, if its resource is enable in Azure
            searchOptions.SemanticSearch = new SemanticSearchOptions
            {
                SemanticConfigurationName = "my-semantic-config", // Is it a new name that we give?
                QueryCaption = new QueryCaption(QueryCaptionType.Extractive),
                QueryAnswer = new QueryAnswer(QueryAnswerType.Extractive),
            };
            searchOptions.QueryType = SearchQueryType.Semantic; // Set the QueryType for Semantic Search
        }
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, ex.Message);
        _logger.LogInformation("CreateSearchOptions exiting");
        return null;
    }

    _logger.LogInformation("CreateSearchOptions exiting");

    return searchOptions;
}
```

And here is how the index is constructed:

```csharp
public SearchIndex? GetOrCreateSearchIndex(int searchTypeInt,
                                           SearchIndexClient searchIndexClient,
                                           string searchIndexName)
{
    SearchIndex? searchIndex = null;

    SearchType searchType = (SearchType)searchTypeInt;

    try
    {
        // Get index if exists
        searchIndex = searchIndexClient.GetIndex(searchIndexName);
    }
    catch (RequestFailedException ex) when (ex.Status == 404)
    {
        try
        {
            // The search index schema (local definition of index) does not exist - create one
            FieldBuilder builder = new FieldBuilder();
            searchIndex = new SearchIndex(searchIndexName, builder.Build(typeof(SegmentObj)));

            // VECTOR OPTION
            if ((searchType & SearchType.Vector) == SearchType.Vector)
            {
                //string vectorProfileFilePath = Path.Combine(InstallDir, mVectorProfileFileName);
                string jsonStr = File.ReadAllText("VectorProfile.json");

                searchIndex.VectorSearch = new VectorSearch
                {
                    Profiles =
                    {
                        // using 11.5.0-beta5 of Azure.Search.Documents, in 11.5.0 it will be different ???
                        //new VectorSearchProfile("my-default-vector-profile", "my-hnsw-config-2")
                        new VectorSearchProfile("my-vector-profile", "my-hnsw-config")
                        {
                            Name = "my-vector-profile",
                            AlgorithmConfigurationName = "my-hnsw-config"
                        }
                    },
                    Algorithms =
                    {
                        // using 11.5.0-beta5 of Azure.Search.Documents, in 11.5.0 it will be different ???
                        //new HnswAlgorithmConfiguration("my-hnsw-config-2") // using 11.5.0-beta5 of Azure.Search.Documents
                        new HnswAlgorithmConfiguration("my-hnsw-config")
                        {
                            Name = "my-hnsw-config",
                            Parameters = new HnswParameters()
                            {
                                Metric = "cosine",
                                EfSearch = 800,
                                EfConstruction = 800,
                                M = 8
                            }
                        },
                    }
                };
            }

            // SEMANTIC OPTION
            if ((searchType & SearchType.Semantic) == SearchType.Semantic)
            {
                SemanticSearch semanticSearch = new SemanticSearch
                {
                    Configurations =
                    {
                        // Looks line "my-semantic-config" is not a file name by a name I give
                        // to new semantic configuration I am creating...Not sure about it...
                        // https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-request?tabs=rest%2Crest-query
                        // The above documentation says: Set "semanticConfiguration" to a
                        // predefined semantic configuration that's embedded in your index.
                        //
                        new SemanticConfiguration("my-semantic-config", new()
                        {
                            //TitleField = new SemanticField("HotelName"),
                            ContentFields =
                            {
                                new SemanticField("SegmentText"),
                                new SemanticField("NamedEntities")
                            },
                            KeywordsFields =
                            {
                                new SemanticField("NamedEntities")
                            }
                        })
                    }
                };

                searchIndex.SemanticSearch = semanticSearch;
            }

            // Create SearchIndex in Azure
            if (searchIndex != null)
                searchIndex = searchIndexClient.CreateIndex(searchIndex);
        }
        catch (Exception ex2)
        {
            _logger.LogError(ex2, ex2.Message);
            return null;
        }
    }

    return searchIndex;
}
```

As you can see, in addition to all this I use Analytics (not shown in the displayed code) to extract named entities from every text segment, populating the NamedEntities field for every segment I ingest, and KeywordsFields accordingly. When I use

The calling code is as follows:

```csharp
searchOptions = CreateSearchOptions(searchTypeInt, k, embeddings);
if (searchOptions != null)
{
    if (!String.IsNullOrWhiteSpace(filter.Trim()))
        searchOptions.Filter = $"NamedEntities eq '{filter}'";
}

SearchResults<SegmentObj>? response = null;
if (searchOptions != null)
    response = await searchClient.SearchAsync<SegmentObj>(searchText, searchOptions);
```

Now, I get responses that are relevant, that is, valid responses, some of which contain a valid description of who the person is, professionally. But when I feed the prompt, along with the 8 results I get above from vector search, into the Completion API, I get nothing.
I use GPT-3.5-turbo-16k because, for whatever reason, GPT-4 is not available to me in Azure at this time. So my first suspicion is: is it because I do not use GPT-4? Is there any other reason you can think of? If you'd like, I can provide the prompt and the context (8 results from vector search).

Environment

Windows 11, VS2022, Azure.Search.Documents 11.5.1, Azure.AI.OpenAI 1.0.0-beta.11, Embeddings 1.0.0-beta.9,
Microsoft.AspNetCore.OpenAI 7.0.13

@github-actions github-actions bot added customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jan 3, 2024
@jsquire jsquire added Search Client This issue points to a problem in the data-plane of the library. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team and removed needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. labels Jan 3, 2024
@jsquire
Member

jsquire commented Jan 3, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@mattgotteiner
Member

Hi @securigy ,

Sorry to hear the results aren't what you are expecting. Here are some answers to the questions in your code comments.

SemanticConfigurationName = "my-semantic-config", // Is it a new name that we give?
No, this name must match an existing semantic configuration on your index: https://learn.microsoft.com/en-us/azure/search/semantic-search-overview

                            // using 11.5.0-beta5 of Azure.Search.Documents, in 11.5.0 it will be different ???
                            //new VectorSearchProfile("my-default-vector-profile", "my-hnsw-config-2")
                            new VectorSearchProfile("my-vector-profile", "my-hnsw-config")
                            {
                                 Name = "my-vector-profile",
                                 AlgorithmConfigurationName ="my-hnsw-config"
                            }

There have been backwards-incompatible changes to the NuGet package, but the same code that works on 11.5.0 should work with 11.5.0-beta5. For the list of specific changes in the 2023-11-01 API, please review https://learn.microsoft.com/en-us/azure/search/search-api-migration#upgrade-to-2023-11-01

              // Looks line "my-semantic-config" is not a file name by a name I give
                            // to new semantic configuration I am creating...Not sure about it...
                            // https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-request?tabs=rest%2Crest-query
                            // The above documentation says: Set "semanticConfiguration" to a
                            // predefined semantic configuration that's embedded in your index.
                            //
                            new SemanticConfiguration("my-semantic-config", new()
                            {

A semantic configuration is not a file name. It's a JSON object that describes how semantic ranking will work. Naming it "my-semantic-config" means you can reference this configuration in the query by that name.

Here are a couple of thoughts on how you can improve search quality:

  1. Evaluate your chunking strategy. For example, it's possible to miss relevant chunks if there is no overlap. https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents
  2. Consider using integrated vectorization instead of manually chunking https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization
  3. Experiment with removing the named entities filter. It's possible relevant chunks are being removed by the filter.
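
To make the overlap idea in item 1 concrete, here is a minimal sketch, assuming the text has already been split into whole sentences. `ChunkWithOverlap`, the word budget, and the overlap size are illustrative names and values, not part of Azure.Search.Documents, and the word count only stands in for a real token count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: packs whole sentences into chunks of roughly maxWords words
// and repeats the last overlapSentences sentences of a chunk at the start of the next.
// Word counts stand in for token counts; a chunk can exceed the budget by the overlap.
static List<string> ChunkWithOverlap(IReadOnlyList<string> sentences, int maxWords, int overlapSentences)
{
    var chunks = new List<string>();
    var current = new List<string>();
    int currentWords = 0;

    foreach (string sentence in sentences)
    {
        int words = sentence.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

        // Close the current chunk when adding this sentence would exceed the budget.
        if (current.Count > 0 && currentWords + words > maxWords)
        {
            chunks.Add(string.Join(" ", current));

            // Carry the trailing sentences over so adjacent chunks share context.
            current = current.Skip(Math.Max(0, current.Count - overlapSentences)).ToList();
            currentWords = current.Sum(s => s.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length);
        }

        current.Add(sentence);
        currentWords += words;
    }

    if (current.Count > 0)
        chunks.Add(string.Join(" ", current));

    return chunks;
}

var sentences = new[] { "A b c.", "D e f.", "G h i.", "J k l." };
foreach (string chunk in ChunkWithOverlap(sentences, maxWords: 6, overlapSentences: 1))
    Console.WriteLine(chunk);
```

With a one-sentence overlap, each chunk after the first starts with the sentence that ended the previous chunk, so a fact that straddles a boundary still appears whole in at least one chunk.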

I hope this helps,
Matt

@securigy
Author

securigy commented Jan 4, 2024

  1. This is probably irrelevant to my case. I split the Word document on paragraphs, always on sentence boundaries, and never exceed 512 tokens. I agree that overlap is generally useful when you simply chop documents at a maximum token count such as 512, because a cut can then land in the middle of a sentence.
  2. I do not know how you vectorize the documents. Are Word, PDF, Excel, etc. all vectorized the same way? If so, I differ, so I cannot use something that is not documented and is a big question mark.
  3. I don't get it... removed? I thought the filter helps bring back results that include a particular word. Anyway, I have a problem with it: how do I write a filter that returns only results containing a specific word? For example:

    searchOptions = CreateSearchOptions(searchTypeInt, k, embeddings);
    if (searchOptions != null)
    {
        if (!String.IsNullOrWhiteSpace(filter.Trim()))
            searchOptions.Filter = $"NamedEntities eq '{filter}'";
    }

When I do the above and my filter word is "Amy" I get an error:

Parameter name: $filter
Status: 400 (Bad Request)

Content:
{"error":{"code":"","message":"Invalid expression: Syntax error: character '`' is not valid at position 17 in 'NamedEntities eq `Amy`'.\r\nParameter name: $filter"}}

The NamedEntities string field is very useful. It contains comma-separated words and phrases, like names of people, addresses, location names, dates, etc.

So maybe I don't understand how the filter is supposed to work. I tried it without the single quotes and got this error:
Azure.RequestFailedException: 'Invalid expression: Could not find a property named 'Amy' on type 'search.document'.
Parameter name: $filter
Status: 400 (Bad Request)
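
Both 400 responses are consistent with how OData reads a filter: a string literal has to be wrapped in single quotes, and a bare word such as Amy is parsed as a property name. The following is a minimal sketch of quoting a value before it goes into the filter, assuming the standard OData rule that an embedded single quote is escaped by doubling it. `ODataString` and `BuildEqualsFilter` are hypothetical helper names, not SDK members.

```csharp
using System;

// Hypothetical helper: wraps a value in single quotes and doubles embedded
// single quotes, which is how OData string literals escape them.
static string ODataString(string value) => "'" + value.Replace("'", "''") + "'";

// Hypothetical helper: builds "<field> eq '<value>'" for a plain string field.
static string BuildEqualsFilter(string field, string value) => $"{field} eq {ODataString(value)}";

Console.WriteLine(BuildEqualsFilter("NamedEntities", "Amy"));        // NamedEntities eq 'Amy'
Console.WriteLine(BuildEqualsFilter("Source", "Amy's resume.docx")); // Source eq 'Amy''s resume.docx'
```

Note that `eq` compares the entire field value, so against a comma-separated string it only matches when the whole field equals the literal. The SDK's `SearchFilter.Create` is also worth checking, since it is intended to handle this quoting for interpolated values.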

@pallavit
Contributor

pallavit commented Jan 4, 2024

Assigning to @mattmsft and tagging as Service Attention

@pallavit pallavit added the Service Attention Workflow: This issue is responsible by Azure service team. label Jan 4, 2024

github-actions bot commented Jan 4, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @arv100kri @bleroy @tjacobhi.

@mattgotteiner
Member

  1. I’m glad you have a specific chunking strategy in mind.
  2. The chunking strategy for integrated vectorization is documented, and it’s customizable https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit
  3. How are you extracting named entities from your chunks? As long as you’re confident your named entity extraction strategy works then you can leave the filter in, but if it doesn’t find the most relevant named entities it might not improve search quality.
  4. What’s the type of the NamedEntities field? The syntax for filtering strings and collections of strings is different.
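
As a rough illustration of the difference mentioned in item 4, the sketch below builds both filter shapes. It assumes the field is marked filterable; the helper names are hypothetical and only produce strings that could be assigned to `SearchOptions.Filter`.

```csharp
using System;

// Hypothetical helper: OData string literal with embedded single quotes doubled.
static string Quote(string value) => "'" + value.Replace("'", "''") + "'";

// Plain string field (Edm.String): the whole field value must equal the literal.
static string StringFieldFilter(string field, string value) => $"{field} eq {Quote(value)}";

// Collection of strings (Collection(Edm.String)): true when any element equals the literal.
static string CollectionFieldFilter(string field, string value) => $"{field}/any(e: e eq {Quote(value)})";

Console.WriteLine(StringFieldFilter("NamedEntities", "Amy"));     // NamedEntities eq 'Amy'
Console.WriteLine(CollectionFieldFilter("NamedEntities", "Amy")); // NamedEntities/any(e: e eq 'Amy')
```

The collection form only applies if each named entity is stored as its own element rather than concatenated into one string.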

I hope this helps,
Matt

@securigy
Copy link
Author

securigy commented Jan 4, 2024 via email

@securigy
Author

securigy commented Jan 5, 2024

NamedEntities is a string field. The Azure Analytics query produces strings that I concatenate into one string, in which the words and phrases produced by analytics are separated by commas.
You said:
"The syntax for filtering strings and collections of strings is different."

So how would you find where the filter text is included in NamedEntities, either as an exact match or as part of a phrase?
Can you provide examples in C#?
I just tried to delete a file based on the file name that I keep in the metadata as the "Source" field, and the search produces 0 results. What is wrong with this code?

```csharp
var options = new SearchOptions
{
    Filter = $"Source eq '{fileName}'"
};

Response<SearchResults<SegmentObj>>? searchResults = null;
try
{
    searchResults = await searchClient
                    .SearchAsync<SegmentObj>("*", options)
                    .ConfigureAwait(false);
}
catch (RequestFailedException ex)
{
    _logger.LogError(ex, ex.Message);
}

List<SegmentObj> segList = new List<SegmentObj>();
if (searchResults != null)
{
    await foreach (SearchResult<SegmentObj>? result in searchResults.Value.GetResultsAsync())
    {
        try
        {
            if (result == null)
                continue;

            segList.Add(result.Document);

        }
        catch (Exception ex)
        {
            _logger.LogError(ex, ex.Message);
        }
    }

    if (segList.Count > 0)
        await searchClient.DeleteDocumentsAsync(segList);
}
```

@mattgotteiner
Member

Since NamedEntities is a string field, you might want to use the search.ismatchscoring function to issue a sub-query targeted towards just that field.

https://learn.microsoft.com/en-us/azure/search/search-query-odata-full-text-search-functions#searchismatchscoring
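
search.ismatchscoring is an OData function written inside the filter expression itself, so from C# it travels as part of the `SearchOptions.Filter` string rather than being called as a method. A minimal sketch, assuming NamedEntities is marked searchable; `BuildIsMatchScoringFilter` is a hypothetical helper, not an SDK member.

```csharp
using System;

// Hypothetical helper: builds a search.ismatchscoring(...) filter expression that
// runs a full-text sub-query against the given comma-separated field list.
static string BuildIsMatchScoringFilter(string searchText, string searchFields)
{
    // OData string literal with embedded single quotes doubled.
    static string Quote(string s) => "'" + s.Replace("'", "''") + "'";
    return $"search.ismatchscoring({Quote(searchText)}, {Quote(searchFields)})";
}

string filter = BuildIsMatchScoringFilter("Amy", "NamedEntities");
Console.WriteLine(filter); // search.ismatchscoring('Amy', 'NamedEntities')

// With the SDK types used in this thread, the string would be assigned as:
//   searchOptions.Filter = filter;
```

Because this is a full-text match rather than `eq`, it can find "Amy" as one term inside a longer comma-separated value.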

Regarding why the filter "Source eq '{fileName}'" returns no results - it's possible the file name you have specified isn't present in the index with the exact same casing or spacing. eq is an exact string match.

I hope this helps,
Matt

@pallavit
Contributor

Thank you @mattgotteiner for the assistance here.

@pallavit pallavit added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Jan 18, 2024

Hi @securigy. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

@github-actions github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Jan 18, 2024
@pallavit pallavit added this to the 2024-02 milestone Jan 19, 2024
@securigy
Author

search.ismatchscoring does not exist in C#/Azure SDK.
If I am mistaken, please point me to it.


Hi @securigy, since you haven’t asked that we /unresolve the issue, we’ll close this out. If you believe further discussion is needed, please add a comment /unresolve to reopen the issue.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 28, 2024