[BUG] Additional aggregation in search request is changing results #14000

dennisoelkers · 2024-06-05T12:22:52Z

Describe the bug

We have discovered erratic behavior for aggregation results introduced in v2.14.0 (double-checked against v2.13.0, results are okay there). We have the following query (the actual queries we are using make more sense, this is the minimal query we have found to trigger the issue):

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				}
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"timestamp-min": {
			"min": {
				"field": "timestamp"
			}
		}
	}
}

The underlying data in the index are test fixtures from our integration tests (check here). The format is that each document in $.documents[*].document has an index metadata object containing the index name/document id/document type, and a data object containing the actual document.

Starting with v2.14.0, the result looks like this:

{
	"took": 102,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1000,
			"relation": "eq"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
		"timestamp-min": {
			"value": 1.664201539998E12,
			"value_as_string": "2022-09-26 14:12:19.998"
		},
		"http_method": {
			"meta": {},
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [
				{
					"key": "GET",
					"doc_count": 860,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 1720
							}
						]
					}
				},
				{
					"key": "DELETE",
					"doc_count": 52,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 104
							}
						]
					}
				},
				{
					"key": "POST",
					"doc_count": 45,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 90
							}
						]
					}
				},
				{
					"key": "PUT",
					"doc_count": 43,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 86
							}
						]
					}
				}
			]
		}
	}
}

What is noteworthy is that, for every bucket, the filter aggregation reports exactly twice as many documents as the parent bucket it was computed from:

[...]
			"buckets": [
				{
					"key": "GET",
					"doc_count": 860,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 1720 <--- more documents in leaf bucket than in parent bucket
							}
						]
					}
				},
[...]

The initial document count (e.g. 860 for the GET bucket) is correct; the one in the bucket produced by the filter aggregation is not. When adding e.g. a sum metric aggregation as a leaf, the value returned is double what is expected, so it looks like the documents are for some reason duplicated during the aggregation.
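
As a rough illustration, the mismatch can be written as a sanity check: a filters sub-bucket should never hold more documents than the parent bucket it was computed from. This is only a sketch; the class and method names are made up for this example, and the numbers are copied from the responses in this report.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterBucketInvariant {
    /**
     * Returns the keys whose sub-bucket doc_count exceeds the parent doc_count.
     * Each map value is {parentDocCount, filterBucketDocCount}.
     */
    static List<String> violations(Map<String, int[]> counts) {
        List<String> bad = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : counts.entrySet()) {
            int parent = entry.getValue()[0];
            int child = entry.getValue()[1];
            if (child > parent) {
                bad.add(entry.getKey());
            }
        }
        return bad;
    }

    /** Counts as returned by v2.14.0 in this report. */
    static Map<String, int[]> observedOn214() {
        Map<String, int[]> counts = new LinkedHashMap<>();
        counts.put("GET", new int[] {860, 1720});
        counts.put("DELETE", new int[] {52, 104});
        counts.put("POST", new int[] {45, 90});
        counts.put("PUT", new int[] {43, 86});
        return counts;
    }

    /** Counts from the expected result (sub-bucket equals parent). */
    static Map<String, int[]> expected() {
        Map<String, int[]> counts = new LinkedHashMap<>();
        counts.put("GET", new int[] {860, 860});
        counts.put("DELETE", new int[] {52, 52});
        counts.put("POST", new int[] {45, 45});
        counts.put("PUT", new int[] {43, 43});
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(violations(observedOn214())); // [GET, DELETE, POST, PUT]
        System.out.println(violations(expected()));      // []
    }
}
```

A check along these lines in an integration test would flag the v2.14.0 response while passing on the expected one.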

What is very strange is:

the exact same query returns correct results in all versions before v2.14.0
when removing the timestamp-min metric, the results are okay
when using a terms aggregation instead of a scripted terms aggregation (i.e. replacing "script": {...} with "field": "http_method"), the results are okay

Related component

Search:Aggregations

To Reproduce

Use this query:

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				}
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"timestamp-min": {
			"min": {
				"field": "timestamp"
			}
		}
	}
}

Execute query against sample data

Expected behavior

Expecting a result like this:

{
	"took": 4,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 1000,
			"relation": "eq"
		},
		"max_score": null,
		"hits": []
	},
	"aggregations": {
		"timestamp-min": {
			"value": 1.664201539998E12,
			"value_as_string": "2022-09-26 14:12:19.998"
		},
		"http_method": {
			"meta": {},
			"doc_count_error_upper_bound": 0,
			"sum_other_doc_count": 0,
			"buckets": [
				{
					"key": "GET",
					"doc_count": 860,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 860
							}
						]
					}
				},
				{
					"key": "DELETE",
					"doc_count": 52,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 52
							}
						]
					}
				},
				{
					"key": "POST",
					"doc_count": 45,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 45
							}
						]
					}
				},
				{
					"key": "PUT",
					"doc_count": 43,
					"filter_aggregation": {
						"meta": {},
						"buckets": [
							{
								"doc_count": 43
							}
						]
					}
				}
			]
		}
	}
}

Additional Details

Plugins
No plugins installed

Screenshots
-- Not applicable --

dennisoelkers · 2024-06-05T14:53:12Z

A git bisect turned up this commit to be the first introducing the issue: 795d868

dennisoelkers · 2024-06-05T14:58:35Z

@jed326: Do you have an idea why that change produces these results?

jed326 · 2024-06-05T15:07:55Z

Thanks for reporting this @dennisoelkers. Are you seeing the issue with the filter aggregation only when concurrent segment search is enabled, or also when it is disabled?

peternied · 2024-06-05T15:11:38Z

[Triage - attendees 1 2 3 4 5 6 7]
@dennisoelkers Thanks for creating this issue

dennisoelkers · 2024-06-05T15:22:12Z

@jed326: I am using default options. It looks like concurrent segment search is disabled by default?

From querying /<index>/_settings?include_defaults=true:

        "search": {
          "concurrent_segment_search": {
            "enabled": "false"
          },

jed326 · 2024-06-05T15:30:58Z

@dennisoelkers yes concurrent segment search is disabled by default. Could you quickly check if the same issue persists with concurrent search enabled?

These are both very interesting:

when removing the timestamp-min metric, the results are okay
when using a terms aggregation instead of a scripted terms aggregation (i.e. replacing "script": {...} with "field": "http_method"), the results are okay

At a glance it looks like there might be something going wrong with agg scripting (there's a known issue for composite aggs w/ scripting #12947).

What's strange is the commit 795d868 ideally shouldn't affect the non-concurrent search path at all but perhaps that is not the case.

Let me take a deeper look at this today. If you have any pointers on how to set up the backing data for reproduction that would help a lot (for example if you have a java integration test that is indexing all this data).

dennisoelkers · 2024-06-05T15:43:34Z

@jed326: I will try with CSS (concurrent segment search) enabled, but first here is some help for setting up a reproducible scenario. For bisecting, I was using the fixtures from one of our tests. To ingest them into an instance running locally, I wrote this script. It assumes that OS is running on port 9200, with no TLS and no authentication (I just used ./gradlew run to start OS). The index it writes to is graylog_0.

dennisoelkers · 2024-06-05T15:47:29Z

@jed326: Enabling CSS on the index and rerunning the query also returns results which are off.

jed326 · 2024-06-05T15:48:00Z

@dennisoelkers Thanks for checking and thanks for the reproduction aids. I will try to get to the bottom of this today!

jed326 · 2024-06-05T16:32:38Z

Just confirming: I checked out the parent commit 42f00ba and don't see the problem with concurrent search either enabled or disabled.
On commit 795d868 the same query from above shows this with concurrent search both enabled and disabled.

{
                    "key": "GET",
                    "doc_count": 860,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 927
                            }
                        ]
                    }
                },
                {
                    "key": "DELETE",
                    "doc_count": 52,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 60
                            }
                        ]
                    }
                },
                {
                    "key": "POST",
                    "doc_count": 45,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 51
                            }
                        ]
                    }
                },
                {
                    "key": "PUT",
                    "doc_count": 43,
                    "filter_aggregation": {
                        "meta": {},
                        "buckets": [
                            {
                                "doc_count": 45
                            }
                        ]
                    }
                }

Tried re-indexing and see slightly different results so it's probably segment layout dependent.

jed326 · 2024-06-05T23:12:29Z

@dennisoelkers I was able to get to the bottom of this today. In 618782d we added some logic to unwrap the MultiBucketCollector to get the saved InternalAggregation objects here:

OpenSearch/server/src/main/java/org/opensearch/search/aggregations/BucketCollectorProcessor.java

Lines 73 to 84 in ba0df74

    
    } else if (currentCollector instanceof BucketCollector) {
        ((BucketCollector) currentCollector).postCollection();
        // Perform build aggregation during post collection
        if (currentCollector instanceof Aggregator) {
            ((Aggregator) currentCollector).buildTopLevel();
        } else if (currentCollector instanceof MultiBucketCollector) {
            for (Collector innerCollector : ((MultiBucketCollector) currentCollector).getCollectors()) {
                collectors.offer(innerCollector);
            }
        }
    }

However there's a bug here where if a MultiBucketCollector is present then postCollection is going to get called twice for the collectors in the MultiBucketCollector -- once as a part of MultiBucketCollector::postCollection and then again when it's unwrapped to the individual collector.
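
To make the double invocation concrete, here is a minimal sketch of the traversal. It is a simplified model, not the actual OpenSearch code: the Collector interface, CountingCollector, MultiCollector and the callOnWrapper flag are all hypothetical names invented for this example. With the flag set to true the loop has the same shape as the snippet above; setting it to false is one conceivable way to avoid the second call and is not necessarily what the eventual fix does.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class DoublePostCollectionSketch {
    /** Hypothetical minimal collector interface, not the Lucene/OpenSearch one. */
    interface Collector {
        void postCollection();
    }

    /** Leaf collector that only counts how often postCollection() ran. */
    static final class CountingCollector implements Collector {
        int postCollectionCalls = 0;

        @Override
        public void postCollection() {
            postCollectionCalls++;
        }
    }

    /** Wrapper that forwards postCollection() to all of its children. */
    static final class MultiCollector implements Collector {
        final List<Collector> children;

        MultiCollector(List<Collector> children) {
            this.children = children;
        }

        @Override
        public void postCollection() {
            for (Collector child : children) {
                child.postCollection();
            }
        }
    }

    /** Walks the collector tree breadth-first, unwrapping MultiCollector instances. */
    static void process(Collector root, boolean callOnWrapper) {
        Queue<Collector> queue = new ArrayDeque<>();
        queue.offer(root);
        while (!queue.isEmpty()) {
            Collector current = queue.poll();
            if (current instanceof MultiCollector) {
                if (callOnWrapper) {
                    current.postCollection(); // forwards to every child once
                }
                for (Collector child : ((MultiCollector) current).children) {
                    queue.offer(child); // each child is visited again below
                }
            } else {
                current.postCollection();
            }
        }
    }

    /** Number of postCollection() calls a wrapped child receives. */
    static int callsSeenByChild(boolean callOnWrapper) {
        CountingCollector terms = new CountingCollector();
        CountingCollector min = new CountingCollector();
        List<Collector> children = List.of(terms, min);
        process(new MultiCollector(children), callOnWrapper);
        return terms.postCollectionCalls;
    }

    public static void main(String[] args) {
        System.out.println(callsSeenByChild(true));  // 2
        System.out.println(callsSeenByChild(false)); // 1
    }
}
```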

This manifests as a problem in the BestBucketsDeferringCollector used for deferred collections as finishLeaf() will subsequently get called twice and we will get 2 deferred entries for the last leaf.

OpenSearch/server/src/main/java/org/opensearch/search/aggregations/bucket/BestBucketsDeferringCollector.java

Lines 123 to 128 in ba0df74

    
    private void finishLeaf() {
        if (context != null) {
            assert docDeltasBuilder != null && bucketsBuilder != null;
            entries.add(new Entry(context, docDeltasBuilder.build(), bucketsBuilder.build()));
        }
    }
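
The effect on the counts can be sketched with a toy model of the deferring collector. This is a simplification under stated assumptions, not the real BestBucketsDeferringCollector: DeferringCollector, run and the per-segment document counts are hypothetical. In particular, the 793/67 split across two segments is made up; it is simply one split that would be consistent with the 860 and 927 counts shown earlier for the GET bucket.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredReplaySketch {
    /** Simplified, hypothetical stand-in for a deferring collector. */
    static final class DeferringCollector {
        // One buffered doc count per finished leaf (segment).
        private final List<Integer> entries = new ArrayList<>();
        private boolean leafOpen = false;
        private int docsInLeaf = 0;

        void startLeaf() {
            finishLeaf(); // flush the previous leaf, if any
            leafOpen = true;
            docsInLeaf = 0;
        }

        void collect() {
            docsInLeaf++;
        }

        // Same shape as the quoted finishLeaf(): the leaf state is not
        // cleared, so a second call appends the last leaf again.
        private void finishLeaf() {
            if (leafOpen) {
                entries.add(docsInLeaf);
            }
        }

        void postCollection() {
            finishLeaf();
        }

        /** Doc count a sub-aggregation would see when the entries are replayed. */
        int replayedDocCount() {
            int total = 0;
            for (int count : entries) {
                total += count;
            }
            return total;
        }
    }

    static int run(int[] docsPerSegment, int postCollectionCalls) {
        DeferringCollector collector = new DeferringCollector();
        for (int docs : docsPerSegment) {
            collector.startLeaf();
            for (int i = 0; i < docs; i++) {
                collector.collect();
            }
        }
        for (int i = 0; i < postCollectionCalls; i++) {
            collector.postCollection();
        }
        return collector.replayedDocCount();
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {860}, 1));     // 860
        System.out.println(run(new int[] {860}, 2));     // 1720
        System.out.println(run(new int[] {793, 67}, 2)); // 927
    }
}
```

With a single segment the last leaf is the only leaf, so every count doubles exactly. With several segments only the documents in the last segment are counted twice, which would explain why the numbers depend on the segment layout.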

To specifically address the "strange" points you shared above:

when removing the timestamp-min metric, the results are okay

The issue at hand is not specific to the min aggregation. If any additional agg is at the same level you will see this issue, because the MultiBucketCollector is used whenever there are multiple aggregations at the same level. For example, you will see the same problem in the following query:

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				}
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"other-terms": {
			"terms": {
				"field": "timestamp"
			}
		}
	}
}

when using a terms aggregation instead of a scripted terms aggregation (i.e. replacing "script": {...} with "field": "http_method"), the results are okay

The issue with the double counted last leaf is specific to when the deferring collector is used. By default the collect mode when a script is present is breadth_first, which uses the deferring collector, while the default for the regular terms agg on the http_method is depth_first. If you manually set the collect mode you can still get the correct results with the painless script like so:

{
	"size": 0,
	"aggs": {
		"http_method": {
			"terms": {
				"script": {
					"source": "doc['http_method']",
					"lang": "painless"
				},
				"collect_mode": "depth_first"
			},
			"aggs": {
				"filter_aggregation": {
					"filters": {
						"filters": [
							{
								"bool": {}
							}
						]
					}
				}
			}
		},
		"timestamp-min": {
			"min": {
				"field": "timestamp"
			}
		}
	}
}

jed326 · 2024-06-05T23:15:42Z

I am working on getting a PR out to address this bug; I think we should be able to get it into the 2.15 release. On 2.14 you can manually set collect_mode to depth_first as a workaround for now (though this might cause performance regressions on fields with very high cardinality).

dennisoelkers added bug Something isn't working untriaged labels Jun 5, 2024

github-actions bot added the Search:Aggregations label Jun 5, 2024

peternied removed the untriaged label Jun 5, 2024

jed326 self-assigned this Jun 5, 2024

sohami added the v2.15.0 Issues and PRs related to version 2.15.0 label Jun 5, 2024

jed326 mentioned this issue Jun 6, 2024

Fix double invocation of postCollection when MultiBucketCollector is present #14015

Merged

3 tasks

jed326 closed this as completed in #14015 Jun 6, 2024
