[Poor results for RAG with Vector/Hybrid/Semantic search] #40983

securigy · 2024-01-03T20:20:14Z

Library name and version

Azure.Search.Documents and Azure.AI.OpenAI

Query/Question

I put together a RAG system. I segmented 2 Word documents on paragraph boundaries, using only full sentences, and successfully ingested them with the Azure API. My search offers a choice of Vector, Hybrid (vector + text), and Semantic (vector + text + semantic reordering). The 2 documents are my resume and my friend's resume. When I ask "Who is [my name]?" I get a decent answer. However, when I ask "Who is [my friend's name]?" I get nothing useful, only a reply of the form "Based on provided information there is no data on...".

I tried this with each of the 3 search modes mentioned above. The code that defines my search is:

`  public SearchOptions? CreateSearchOptions(int searchTypeInt, int k, ReadOnlyMemory<float> embeddings)
   {
       _logger.LogInformation("CreateSearchOptions entered");

       SearchOptions? searchOptions = null;
       try
       {
           SearchType searchType = (SearchType)searchTypeInt;

           searchOptions = new SearchOptions
           {
               //Filter = filter, will be set later
               Size = k,

               // fields to retrieve, if not specified then all are retrieved if retrievable
               Select = { "SegmentText", "NamedEntities" },
           };

           if ((searchType & SearchType.Vector) == SearchType.Vector)
           {
               searchOptions.VectorSearch = new VectorSearchOptions
               {
                   Queries = { new VectorizedQuery(embeddings) { KNearestNeighborsCount = k, Fields = { "SegmentTextVector" } } },
               };
           }

           if ((searchType & SearchType.Semantic) == SearchType.Semantic)
           {
               // This is going to invoke semantic ranking, if its resource is enabled in Azure
               searchOptions.SemanticSearch = new SemanticSearchOptions
               {
                   SemanticConfigurationName = "my-semantic-config", // Is it a new name that we give?
                   QueryCaption = new QueryCaption(QueryCaptionType.Extractive),
                   QueryAnswer = new QueryAnswer(QueryAnswerType.Extractive),
               };
               searchOptions.QueryType = SearchQueryType.Semantic; // Set the QueryType for Semantic Search
           }
       }
       catch (Exception ex)
       {
           _logger.LogError(ex, ex.Message);
           _logger.LogInformation("CreateSearchOptions exiting");
           return null;
       }

       _logger.LogInformation("CreateSearchOptions exiting");

       return searchOptions;
   }`
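The `SearchType` enum is not shown in the post. Since the code tests it with bitmask checks, one plausible shape is a `[Flags]` enum in which Hybrid and Semantic are combinations of lower-level bits. The sketch below is an assumption, not the poster's actual definition: the member names `Text` and `SemanticRerank`, the numeric values, and the `Includes` helper are all hypothetical.

```csharp
using System;

// Hypothetical definition: the original post does not show this enum.
[Flags]
public enum SearchType
{
    None = 0,
    Text = 1,
    Vector = 2,
    SemanticRerank = 4,
    Hybrid = Text | Vector,             // vector + text
    Semantic = Hybrid | SemanticRerank  // vector + text + semantic reordering
}

public static class SearchTypeExtensions
{
    // True when every bit of 'flag' is set in 'value'; equivalent to
    // the (searchType & X) == X checks used in CreateSearchOptions.
    public static bool Includes(this SearchType value, SearchType flag) =>
        (value & flag) == flag;
}
```

With a layout like this, `SearchType.Semantic.Includes(SearchType.Vector)` is true, so a Semantic request also sets up the vector query, while `SearchType.Vector.Includes(SearchType.Semantic)` is false.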

And here is how the index is constructed:

`   public SearchIndex? GetOrCreateSearchIndex(int searchTypeInt,
                                               SearchIndexClient searchIndexClient,
                                               string searchIndexName)
    {
        SearchIndex? searchIndex = null;

        SearchType searchType = (SearchType)searchTypeInt;

        try
        {
            // Get index if exists
            searchIndex = searchIndexClient.GetIndex(searchIndexName);
        }
        catch (RequestFailedException ex) when (ex.Status == 404)
        {
            try
            {
                // The search index schema (local definition of index) does not exist - create one
                FieldBuilder builder = new FieldBuilder();
                searchIndex = new SearchIndex(searchIndexName, builder.Build(typeof(SegmentObj)));

                // VECTOR OPTION
                if ((searchType & SearchType.Vector) == SearchType.Vector)
                {
                    //string vectorProfileFilePath = Path.Combine(InstallDir, mVectorProfileFileName);
                    string jsonStr = File.ReadAllText("VectorProfile.json");

                    searchIndex.VectorSearch = new VectorSearch
                    {
                        Profiles =
                        {
                            // using 11.5.0-beta5 of Azure.Search.Documents, in 11.5.0 it will be different ???
                            //new VectorSearchProfile("my-default-vector-profile", "my-hnsw-config-2")
                            new VectorSearchProfile("my-vector-profile", "my-hnsw-config")
                            {
                                 Name = "my-vector-profile",
                                 AlgorithmConfigurationName ="my-hnsw-config"
                            }
                        },
                        Algorithms =
                        {
                            // using 11.5.0-beta5 of Azure.Search.Documents, in 11.5.0 it will be different ???
                            //new HnswAlgorithmConfiguration("my-hnsw-config-2") // using 11.5.0-beta5 of Azure.Search.Documents
                            new HnswAlgorithmConfiguration("my-hnsw-config")
                            {
                                 Name = "my-hnsw-config",
                                 Parameters = new HnswParameters()
                                 {
                                      Metric = "cosine",
                                       EfSearch = 800,
                                        EfConstruction = 800,
                                         M = 8
                                 }
                            },
                        }
                    };

                }

                // SEMANTIC OPTION
                if ((searchType & SearchType.Semantic) == SearchType.Semantic)
                {
                    SemanticSearch semanticSearch = new SemanticSearch
                    {
                        Configurations =
                        {
                            // Looks like "my-semantic-config" is not a file name but a name I give
                            // to new semantic configuration I am creating...Not sure about it...
                            // https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-request?tabs=rest%2Crest-query
                            // The above documentation says: Set "semanticConfiguration" to a
                            // predefined semantic configuration that's embedded in your index.
                            //
                            new SemanticConfiguration("my-semantic-config", new()
                            {
                                //TitleField = new SemanticField("HotelName"),
                                ContentFields =
                                {
                                    new SemanticField("SegmentText"),
                                    new SemanticField("NamedEntities")
                                },
                                KeywordsFields =
                                {
                                    new SemanticField("NamedEntities")
                                }
                            })
                        }
                    };

                    searchIndex.SemanticSearch = semanticSearch;
                }

                // Create SearchIndex in Azure
                if (searchIndex != null)
                    searchIndex = searchIndexClient.CreateIndex(searchIndex);
            }
            catch (Exception ex2)
            {
                _logger.LogError(ex2, ex2.Message);
                return null;
            }
        }

        return searchIndex;
    }

`

As you can see, in addition to all of this I use Analytics (not shown in the displayed code) to extract named entities from every text segment, and I populate the NamedEntities field, and KeywordsFields accordingly, for every segment I ingest.

The calling code is as follows:

`            searchOptions = CreateSearchOptions(searchTypeInt, k, embeddings);
            if (searchOptions != null)
            {
                if (!String.IsNullOrWhiteSpace(filter.Trim()))
                    searchOptions.Filter = $"NamedEntities eq '{filter}'";
            }

            SearchResults<SegmentObj>? response = null;
            if (searchOptions != null)
                response = await searchClient.SearchAsync<SegmentObj>(searchText, searchOptions);`

Now, I get responses that are relevant, that is, valid responses, some of which contain a valid description of who the person is professionally. But when I feed the prompt, along with the 8 results from the vector search above, into the Completion API, I get nothing.
I use GPT-3.5-turbo-16k because, for whatever reason, GPT-4 is not available to me in Azure at this time. So my first suspicion is: is it because I do not use GPT-4? Is there any other reason you can think of? If you like, I can provide the prompt and the context (the 8 results from vector search).

Environment

Windows 11, VS2022, Azure.Search.Documents 11.5.1, Azure.AI.OpenAI 1.0.0-beta.11, Embeddings 1.0.0-beta.9,
Microsoft.AspNetCore.OpenAI 7.0.13

jsquire · 2024-01-03T20:41:19Z

Thank you for your feedback. Tagging and routing to the team member best able to assist.

mattgotteiner · 2024-01-03T23:11:43Z

Hi @securigy ,

Sorry to hear the results aren't what you were expecting. Here are some answers to the questions in your code comments.

SemanticConfigurationName = "my-semantic-config", // Is it a new name that we give?
No, this name must match an existing semantic configuration on your index: https://learn.microsoft.com/en-us/azure/search/semantic-search-overview

                            // using 11.5.0-beta5 of Azure.Search.Documents, in 11.5.0 it will be different ???
                            //new VectorSearchProfile("my-default-vector-profile", "my-hnsw-config-2")
                            new VectorSearchProfile("my-vector-profile", "my-hnsw-config")
                            {
                                 Name = "my-vector-profile",
                                 AlgorithmConfigurationName ="my-hnsw-config"
                            }

There have been backwards-incompatible changes to the NuGet package, but the same code that works on 11.5.0 should work with 11.5.0-beta5. For the list of specific changes to the 2023-11-01 API, please review https://learn.microsoft.com/en-us/azure/search/search-api-migration#upgrade-to-2023-11-01

                            // Looks like "my-semantic-config" is not a file name but a name I give
                            // to new semantic configuration I am creating...Not sure about it...
                            // https://learn.microsoft.com/en-us/azure/search/semantic-how-to-query-request?tabs=rest%2Crest-query
                            // The above documentation says: Set "semanticConfiguration" to a
                            // predefined semantic configuration that's embedded in your index.
                            //
                            new SemanticConfiguration("my-semantic-config", new()
                            {

A semantic configuration is not a file name. It is a JSON object that describes how semantic ranking will work. Naming it "my-semantic-config" means you can reference this configuration in a query by that name.

Here are a few thoughts on how you can improve search quality:

1. Evaluate your chunking strategy. For example, it's possible to miss relevant chunks if there is no overlap. https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents
2. Consider using integrated vectorization instead of manually chunking. https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization
3. Experiment with removing the named entities filter. It's possible relevant chunks are being removed by the filter.
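To make the overlap suggestion concrete, here is a minimal sketch of sentence-level chunking in which each new chunk repeats the tail of the previous one. It assumes the text has already been split into sentences, and it approximates tokens by whitespace-separated words, which is not how an embedding model counts them. The `SentenceChunker` class and its parameter names are hypothetical and not part of any Azure SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SentenceChunker
{
    // Groups whole sentences into chunks of roughly maxWords words.
    // Each new chunk starts with the last overlapSentences sentences
    // of the previous chunk, so context at a boundary is not lost.
    public static List<string> Chunk(
        IReadOnlyList<string> sentences, int maxWords, int overlapSentences)
    {
        var chunks = new List<string>();
        var current = new List<string>();
        int currentWords = 0;

        foreach (string sentence in sentences)
        {
            int words = CountWords(sentence);

            // Close the current chunk when adding this sentence would exceed the budget.
            if (current.Count > 0 && currentWords + words > maxWords)
            {
                chunks.Add(string.Join(" ", current));

                // Carry the tail of the finished chunk into the next one.
                current = current
                    .Skip(Math.Max(0, current.Count - overlapSentences))
                    .ToList();
                currentWords = current.Sum(s => CountWords(s));
            }

            current.Add(sentence);
            currentWords += words;
        }

        if (current.Count > 0)
            chunks.Add(string.Join(" ", current));

        return chunks;
    }

    // Crude stand-in for a tokenizer: counts whitespace-separated words.
    private static int CountWords(string text) =>
        text.Split(new[] { ' ', '\t', '\r', '\n' },
                   StringSplitOptions.RemoveEmptyEntries).Length;
}
```

Because the carried-over sentences count toward the next chunk, a chunk can exceed `maxWords` when the overlap plus one long sentence is larger than the budget. A real implementation would need to decide how to handle that case and would use the embedding model's tokenizer for counting.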

I hope this helps,
Matt

securigy · 2024-01-04T01:48:19Z

1. This is probably irrelevant to my case. I split the Word document by paragraph, always on sentence boundaries, and never exceed 512 tokens. I agree that overlap is generally useful when you simply chop documents at a maximum token count such as 512, because a cut can then land in the middle of a sentence.
2. I do not know how you vectorize the documents. Are Word, PDF, Excel, etc. all vectorized the same way? If so, I differ... so I cannot use something that is not documented and is a big question mark.
3. I don't get it... removed? I thought the filter helps bring back results that include a particular word. Anyway, I have a problem with it: how do I write a filter that returns only results containing a specific word? For example:

searchOptions = CreateSearchOptions(searchTypeInt, k, embeddings);
if (searchOptions != null)
{
if (!String.IsNullOrWhiteSpace(filter.Trim()))
searchOptions.Filter = $"NamedEntities eq '{filter}'";
}

When I do the above and my filter word is "Amy" I get an error:

Parameter name: $filter
Status: 400 (Bad Request)

Content:
{"error":{"code":"","message":"Invalid expression: Syntax error: character '' is not valid at position 17 in 'NamedEntities eq Amy`'.\r\nParameter name: $filter"}}

The NamedEntities string field is very useful; it contains comma-separated words and phrases, such as names of people, addresses, location names, dates, etc.

So maybe I don't understand how the filter is supposed to work. I tried it without the single quotes and got this error:
Azure.RequestFailedException: 'Invalid expression: Could not find a property named 'Amy' on type 'search.document'.
Parameter name: $filter
Status: 400 (Bad Request)

pallavit · 2024-01-04T18:01:53Z

Assigning to @mattmsft and tagging as Service Attention

github-actions · 2024-01-04T18:02:25Z

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @arv100kri @bleroy @tjacobhi.

mattgotteiner · 2024-01-04T19:06:17Z

1. I’m glad you have a specific chunking strategy in mind.
2. The chunking strategy for integrated vectorization is documented, and it’s customizable: https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit
3. How are you extracting named entities from your chunks? As long as you’re confident your named entity extraction strategy works then you can leave the filter in, but if it doesn’t find the most relevant named entities it might not improve search quality.
4. What’s the type of the NamedEntities field? The syntax for filtering strings and collections of strings is different.
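As a rough illustration of the difference between the two filter syntaxes, the sketch below builds OData filter strings for a plain string field and for a collection-of-strings field. The `FilterBuilder` class is hypothetical and only produces strings. It assumes the target field is marked filterable in the index.

```csharp
public static class FilterBuilder
{
    // OData string literals are wrapped in single quotes;
    // a quote inside the value is escaped by doubling it.
    public static string Quote(string value) =>
        "'" + value.Replace("'", "''") + "'";

    // Edm.String field: 'eq' matches only when the entire field value
    // equals the literal (case-sensitive, no substring matching).
    public static string StringEquals(string field, string value) =>
        $"{field} eq {Quote(value)}";

    // Collection(Edm.String) field: matches when any element equals the literal.
    public static string CollectionContains(string field, string value) =>
        $"{field}/any(e: e eq {Quote(value)})";
}
```

For example, `FilterBuilder.StringEquals("NamedEntities", "Amy")` yields `NamedEntities eq 'Amy'`, which matches only documents whose whole NamedEntities value is exactly "Amy". A comma-separated string such as "Amy, Seattle, 2019" would not match, whereas a collection field queried with `NamedEntities/any(e: e eq 'Amy')` would. The Azure.Search.Documents package also provides a `SearchFilter.Create` helper that quotes interpolated values, which may be preferable to hand-rolled escaping.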

I hope this helps,
Matt

securigy · 2024-01-04T19:29:45Z

#2. I will take a look, but based on what I experienced chunking Word, PDF, Excel, text, and CSV files, chunking is different for every file type...

#3. I use Analytics and it does this for me. It really extracts names, addresses, titles, etc. and does a great job; I verified the results.

#4. I have a problem with filtering. I tried the suggestion of

searchOptions.Filter = "NamedEntities eq 'Amy'"
searchOptions.Filter = "NamedEntities eq '{filterText}'"
searchOptions.Filter = "NamedEntities eq {Amy}"
searchOptions.Filter = "NamedEntities eq {filterText}"

and each one either throws an exception or eliminates all of the search results. My impression is that the filter will only produce search results where the NamedEntities field contains the word Amy. Am I wrong?

Another thing I observed: when the (8) results produced are OK, some of them containing really relevant info, and I feed them into GPT-3.5-turbo-16k along with the prompt "Who is Amy", it produces 0 responses...

securigy · 2024-01-05T03:53:53Z

NamedEntities is a string field. The Azure Analytics query produces strings that I concatenate into one string, in which the words and phrases produced by Analytics are separated by commas.
You said:
"The syntax for filtering strings and collections of strings is different."

So how would you find where the filter text is included in NamedEntities, either as an exact match or as part of a phrase?
Can you provide examples in C#?
I just tried to delete a file based on the file name that I keep in the metadata as the "Source" field, and the search produces 0 results. What is wrong with this code?

`            var options = new SearchOptions
            {
                Filter = $"Source eq '{fileName}'"
            };

            Response<SearchResults<SegmentObj>>? searchResults = null;
            try
            {
                searchResults = await searchClient
                                .SearchAsync<SegmentObj>("*", options)
                                .ConfigureAwait(false);
            }
            catch (RequestFailedException ex)
            {
                _logger.LogError(ex, ex.Message);
            }

            List<SegmentObj> segList = new List<SegmentObj>();
            if (searchResults != null)
            {
                await foreach (SearchResult<SegmentObj>? result in searchResults.Value.GetResultsAsync())
                {
                    try
                    {
                        if (result == null)
                            continue;

                        segList.Add(result.Document);

                    }
                    catch (Exception ex)
                    {
                        _logger.LogError(ex, ex.Message);
                    }
                }
                
                if (segList.Count > 0)
                    await searchClient.DeleteDocumentsAsync(segList);
            }`

mattgotteiner · 2024-01-17T00:23:17Z

Since NamedEntities is a string field, you might want to use the search.ismatchscoring function to issue a sub-query targeted towards just that field.

https://learn.microsoft.com/en-us/azure/search/search-query-odata-full-text-search-functions#searchismatchscoring
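Note that `search.ismatchscoring` is an OData function written inside the filter string, not a method on the C# SDK. A minimal sketch of building such a filter is shown below. The `FullTextFilter` helper is hypothetical, and it assumes the target field is marked searchable in the index.

```csharp
public static class FullTextFilter
{
    // OData string literals use single quotes; embedded quotes are doubled.
    private static string Quote(string value) =>
        "'" + value.Replace("'", "''") + "'";

    // Builds a filter expression that runs a full-text sub-query for
    // searchText against the given comma-separated list of searchable fields.
    public static string IsMatchScoring(string searchText, string searchFields) =>
        $"search.ismatchscoring({Quote(searchText)}, {Quote(searchFields)})";
}
```

The resulting string is assigned to the existing `Filter` property, for example `searchOptions.Filter = FullTextFilter.IsMatchScoring("Amy", "NamedEntities");`, which sends `search.ismatchscoring('Amy', 'NamedEntities')` to the service. Because this is a full-text match rather than an `eq` comparison, it can match "Amy" as one term inside a longer comma-separated value.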

Regarding why the filter "Source eq '{fileName}'" returns no results - it's possible the file name you have specified isn't present in the index with the exact same casing or spacing. eq is an exact string match.

I hope this helps,
Matt

pallavit · 2024-01-18T18:06:34Z

Thank you @mattgotteiner for the assistance here.

github-actions · 2024-01-18T18:07:05Z

Hi @securigy. Thank you for opening this issue and giving us the opportunity to assist. We believe that this has been addressed. If you feel that further discussion is needed, please add a comment with the text "/unresolve" to remove the "issue-addressed" label and continue the conversation.

securigy · 2024-01-22T04:25:56Z

search.ismatchscoring does not exist in C#/Azure SDK.
If I am mistaken, please point me to it.

github-actions · 2024-01-29T04:34:14Z

Hi @securigy, since you haven’t asked that we /unresolve the issue, we’ll close this out. If you believe further discussion is needed, please add a comment /unresolve to reopen the issue.

jsquire assigned ShivangiReja Jan 3, 2024

pallavit assigned mattgotteiner Jan 4, 2024

pallavit added the Service Attention Workflow: This issue is responsible by Azure service team. label Jan 4, 2024

pallavit added the issue-addressed Workflow: The Azure SDK team believes it to be addressed and ready to close. label Jan 18, 2024

github-actions bot removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Jan 18, 2024

pallavit added this to the 2024-02 milestone Jan 19, 2024

github-actions bot closed this as completed Jan 29, 2024

github-actions bot locked and limited conversation to collaborators Apr 28, 2024
