
Failing ES Forward Compatibility: Default CI Group #1 / APM API tests basic apm_8.0.0 Transaction groups main statistics when data is loaded returns the correct data for latency aggregation 99th percentile #161042

Closed
mistic opened this issue Jun 30, 2023 · 14 comments
Labels: apm:needs-test, failed-es-promotion, Team:APM (All issues that need APM UI Team support)

@mistic (Member) commented Jun 30, 2023

APM API Integration tests (basic)
x-pack/test/apm_api_integration/tests/transactions/transactions_groups_main_statistics.ts

APM API tests basic apm_8.0.0 Transaction groups main statistics when data is loaded returns the correct data for latency aggregation 99th percentile

This failure is preventing the ES 8.10 forward compatibility pipeline from proceeding.

Error: expect(received).toMatchInlineSnapshot(snapshot)

Snapshot name: `APM API tests basic apm_8.0.0 Transaction groups main statistics when data is loaded returns the correct data for latency aggregation 99th percentile 1`

Snapshot: 66836803
Received: 36904905.5499999
    at Context.<anonymous> (test/apm_api_integration/tests/transactions/transactions_groups_main_statistics.ts:153:43)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.apply (/opt/local-ssd/buildkite/builds/kb-n2-4-25fbcb5062acb83c/elastic/kibana-7-dot-17-es-8-dot-10-forward-compatibility/kibana/node_modules/@kbn/test/target_node/functional_test_runner/lib/mocha/wrap_function.js:87:16)
@elasticmachine (Contributor)

Pinging @elastic/apm-ui (Team:APM)

@mistic (Member, Author) commented Jun 30, 2023

Skipped top level suite.

7.17: 881bb7c

@sorenlouv (Member) commented Jul 2, 2023

@mistic It's interesting that this started only shortly after the fork of tdigest in Elasticsearch. One of the known impacts of that change is the new percentile algorithm. Could this be related?

mistic added a commit to mistic/kibana that referenced this issue Jul 5, 2023
chore(NA): upgrade babel to 7.21 on branch 7.17
@mistic (Member, Author) commented Jul 5, 2023

@sqren Interesting finding, I think it could very well be related.

@sorenlouv (Member)

It's a pretty big change: 66,836,803 vs 36,904,905. Should we flag it as a possible bug with the ES team or just let it slide and update the snapshots?

@achyutjhunjhunwala (Contributor) commented Aug 3, 2023

@mistic I am a bit confused here. Is this test still breaking on main? The reported failure I see is from 30th June, and currently I don't see this test failing in any of our PRs or locally.

Also, I see a comment tagging 7.17. Does this mean that the ES Forward Compatibility test on the 7.17 branch is failing?

I see the test is skipped on the 7.17 branch.
@mistic Do you by any chance know how I can run this test locally, i.e. Kibana 7.17 against ES 8.10 (or even 8.9)? I want to check whether it's something introduced in 8.10 that is causing the test to break.

[UPDATE]
I can reproduce the issue locally now.

@achyutjhunjhunwala (Contributor)

I believe @sqren's suspicion is correct here. I ran the tests locally with Kibana at 7.17 and ES once at 8.10 and once at 8.9, and both times it consistently gave the same error.

Steps performed:

  1. Check out the 7.17 branch locally (as that's required to run Kibana 7.17).
  2. Set the environment variable to use ES 8.10:
export ES_SNAPSHOT_MANIFEST="https://storage.googleapis.com/kibana-ci-es-snapshots-daily/8.10.0/manifest-latest-verified.json"
  3. Run the API tests, which use Kibana 7.17 and ES 8.10:
node scripts/functional_tests \
    --bail \
    --config x-pack/test/apm_api_integration/basic/config.ts

The test fails.

On current main, some time back I removed the archives and added synthtrace, which means it works correctly there.

Not sure how to proceed here.

The only option I see is to leave this test skipped forever on 7.17, as updating archives is not an easy task.

@achyutjhunjhunwala (Contributor) commented Aug 9, 2023

So APM Server uses HDR to compute histograms and saves them.

Two possible theories:

  1. Up until 8.8, I believe Elasticsearch's latency aggregation query would default to HDR, but now it defaults to TDigest.
  2. Up until 8.8, Elasticsearch's latency aggregation query would use a different version of TDigest, and from 8.9 the change in TDigest is causing this huge discrepancy.

After discussion with @dgieselaar it seems that theory 2 is valid here.

Query to retrieve the percentile:

"latency": {
  "percentiles": {
    "field": "transaction.duration.histogram",
    "percents": [
        99
     ]
  }
},

Only if we add hdr to it does the result come close to what it used to be previously:

"latency": {
  "percentiles": {
    "field": "transaction.duration.histogram",
    "percents": [
        99
     ],
    "hdr": {}
  }
},

There is still a delta in the numbers:

7.17        8.9 (HDR)   8.9 (TDigest)
66836803    66846716    36904905.5499999

@sorenlouv (Member)

@achyutjhunjhunwala Thanks for digging into this. Let's talk about this during today's UI sync.

@kkrik-es commented Aug 11, 2023

This is likely due to #95903, indeed. Note that TDigest was already the default for percentiles; what changed was the internal implementation, with the new default being 2x-10x faster but less accurate.

We included some details about how to fall back to the old behavior in the breaking changes documentation. Still, do we have a good idea of how much of an issue this is, in practice? How many values are processed? What are the other percentiles (50%, 75%, 90%, 99%, 100%)?
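
For reference, the fallback described in the breaking-changes notes boils down to opting back into the slower, more accurate TDigest merge. A rough sketch of what that would look like on the query above, assuming the execution_hint option added alongside the new implementation (not verified against this archive):

"latency": {
  "percentiles": {
    "field": "transaction.duration.histogram",
    "percents": [
      99
    ],
    "tdigest": {
      "execution_hint": "high_accuracy"
    }
  }
},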

@achyutjhunjhunwala (Contributor) commented Aug 16, 2023

@kkrik-es I have prepared an 8.10 stack which has this archive imported.

https://p.elstc.co/paste/EjSk4xkG#ZCJ+r2lrKmM19Vwk01yvvUGVbeRdhsw5WlmuNSzwF1Q

Here you can play with the data as well.

Regarding the numbers: on the APM side we were only calculating P99 and P95.

P99 (8.10)          P95 (8.10)   P99 (7.17)   P95 (7.17)
36917738.51999988   5639371.8    66836803     5665134

Looking at the above data, P95 looks quite close, but the P99 values differ by almost 2x.

@kkrik-es

This looks like a fairly skewed distribution:

"latency": {
  "values": {
    "90.0": 736562.2000000074,
    "95.0": 5639371.8,
    "98.0": 6988758.039999757,
    "99.0": 36917738.51999988,
    "99.9": 63853820.95200047,
    "100.0": 66846719
  }
},

It seems like the newer implementation keeps fewer data points (expected), so the error comes from interpolating between the last and second-to-last centroid values. Nothing too extreme.

@achyutjhunjhunwala (Contributor)

@gbamparop @sqren @dgieselaar I had an offline chat with @kkrik-es and this seems to be expected behaviour with the TDigest change. We need to explore what we do with this test now, as it needs to run on 7.17 as well as for forward compatibility with 8.9+.

One suggestion from @kkrik-es was to check for ranges instead of exact values (see the sketch below); in our case the delta is 2x. Alternatively, we can leave it skipped on 7.17 alone, as this test has been re-written using synthtrace for 8.x.
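
If we go the range route, the check could be as simple as the sketch below, assuming the @kbn/expect assertions used in our FTR suites; the helper name, values and tolerance are illustrative only, not the actual test code:

import expect from '@kbn/expect';

// Sketch: assert that a percentile lands inside a tolerance band around a
// reference value instead of matching an exact inline snapshot.
function expectCloseTo(actual: number, reference: number, tolerance: number) {
  expect(actual).to.be.greaterThan(reference * (1 - tolerance));
  expect(actual).to.be.lessThan(reference * (1 + tolerance));
}

// Example with the numbers from this issue: the ~36.9M TDigest value passes a
// 50% band around the 7.17/HDR value but would fail a tighter one, so picking
// the tolerance is the real decision here.
expectCloseTo(36904905.55, 66836803, 0.5);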

@achyutjhunjhunwala (Contributor)

As discussed with the team during the weekly call, we will keep these tests skipped on the 7.17 branch due to the complexity involved. Also, this failure is not linked to any bug related to Forward Compatibility.
