Adding new inferred sampler #12927

nishchay21 · 2024-03-26T17:00:12Z

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

New functionality includes testing.
- All tests pass
New functionality has been documented.
- New functionality has javadoc added
Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
Commits are signed per the DCO using --signoff
Commit changes are listed out in CHANGELOG.md file (See: Changelog)
Public documentation issue/PR created

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-26T17:09:01Z

❌ Gradle check result for e02fc95: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-26T17:22:09Z

Compatibility status:

Checks if related components are compatible with change b944c0b

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/sql.git]

github-actions · 2024-03-26T17:25:46Z

❌ Gradle check result for 78dcc6b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-27T03:37:45Z

❌ Gradle check result for 432d294: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-27T04:01:04Z

❌ Gradle check result for ad8d7f3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-27T04:58:52Z

❌ Gradle check result for 5c4128b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-27T05:58:20Z

❌ Gradle check result for 47bef14: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-27T07:25:03Z

✅ Gradle check result for 4f0586a: SUCCESS

codecov · 2024-03-27T07:27:09Z

Codecov Report

Attention: Patch coverage is 78.04878% with 27 lines in your changes are missing coverage. Please review.

Project coverage is 71.37%. Comparing base (b15cb0c) to head (b944c0b).
Report is 287 commits behind head on main.

Files	Patch %	Lines
...racing/ThreadContextBasedTracerContextStorage.java	0.00%	10 Missing ⚠️
...ava/org/opensearch/telemetry/tracing/OTelSpan.java	70.83%	0 Missing and 7 partials ⚠️
...rg/opensearch/telemetry/tracing/DefaultTracer.java	75.00%	2 Missing and 1 partial ⚠️
...ing/handler/TraceableTransportResponseHandler.java	50.00%	2 Missing and 1 partial ⚠️
...telemetry/tracing/processor/OTelSpanProcessor.java	88.88%	0 Missing and 2 partials ⚠️
...tracing/channels/TraceableTcpTransportChannel.java	0.00%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #12927      +/-   ##
============================================
- Coverage     71.42%   71.37%   -0.05%     
- Complexity    59978    60376     +398     
============================================
  Files          4985     5025      +40     
  Lines        282275   284175    +1900     
  Branches      40946    41165     +219     
============================================
+ Hits         201603   202835    +1232     
- Misses        63999    64520     +521     
- Partials      16673    16820     +147

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/DefaultSpanScope.java

CHANGELOG.md

libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java

libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/attributes/AttributeType.java

libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/noop/NoopSpan.java

plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java

...metry-otel/src/main/java/org/opensearch/telemetry/tracing/sampler/InferredActionSampler.java

plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelTracingTelemetry.java

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-28T09:46:49Z

❕ Gradle check result for 178e48b: UNSTABLE

TEST FAILURES:

      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationMultiSearchDuringQueryPhase
      1 org.opensearch.http.SearchRestCancellationIT.testAutomaticCancellationDuringQueryPhase
      1 org.opensearch.gateway.RecoveryFromGatewayIT.testShardStoreFetchMultiNodeMultiIndexesUsingBatchAction

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-28T12:34:26Z

✅ Gradle check result for 6bd52e9: SUCCESS

libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java

libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/DefaultTracer.java

libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/Span.java

libs/telemetry/src/main/java/org/opensearch/telemetry/tracing/TracerContextStorage.java

plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java

...rc/main/java/org/opensearch/telemetry/tracing/handler/TraceableTransportResponseHandler.java

libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java

Signed-off-by: Nishchay Malhotra <[email protected]> Adding new sampled attributes Adding new inferred sampler attributes Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-03-29T15:00:09Z

❌ Gradle check result for 7504c6a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-04-01T05:12:29Z

❌ Gradle check result for a1b3df1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-04-01T06:32:36Z

✅ Gradle check result for 1e1d624: SUCCESS

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-04-01T12:14:42Z

✅ Gradle check result for 8bce540: SUCCESS

Signed-off-by: Nishchay Malhotra <[email protected]>

github-actions · 2024-04-01T18:40:09Z

❕ Gradle check result for b944c0b: UNSTABLE

TEST FAILURES:

      1 org.opensearch.repositories.azure.AzureBlobStoreRepositoryTests.testIndicesDeletedFromRepository

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

nishchay21 · 2024-04-03T06:10:02Z

@reta Can you help review the same

reta · 2024-04-03T11:19:47Z

@reta Can you help review the same

I will try to find the time this week, thanks @nishchay21

reta · 2024-04-05T13:49:49Z

...metry-otel/src/main/java/org/opensearch/telemetry/tracing/sampler/InferredActionSampler.java

+                .putAll(attributes)
+                .build();
+            SamplingResult result = SamplingResult.recordAndSample();
+            return new OTelSamplingResult(result.getDecision(), customSampleAttributes);


Sorry, I don't understand what this sampler is doing: it is 100% (aka always) sampler?

Yes this is 100% sampler. We are using batch span processor itself for the final export so that will only export the span if the span is marked to be sampled. This takes care of marking all span as sampled so that the spans can be exported. With this it also adds an attribute to the span, for the processor layer to know that the span has been sampled by this span processor.

Adding further, we have created a custom delegated processor layer which act as an intermediate layer between batch span processor and Otel framework and forwards the normal span i.e. spans which are not marked by this sampler directly to batch span processor and if the span is marked by this sampler then filter out spans that are outlier or their child are outlier.

Also adding further we need to mark the spans to be sampled beforehand so that the span can record information as well. If the span is not marked for sampling then it will not record any information as well and will act as a Noop span.

reta · 2024-04-05T13:55:28Z

...lemetry-otel/src/main/java/org/opensearch/telemetry/tracing/processor/OTelSpanProcessor.java

+                SamplingAttributes.INFERRED_SAMPLER.getValue()
+            )) {
+            if (span.getAttribute(AttributeKey.booleanKey(SamplingAttributes.SAMPLED.getValue())) != null) {
+                this.delegateProcessor.onEnd(span);


How come we are not calling delegateProcessor.onEnd(span) in all branches? We don't know what delegateProcessor is actually doing, it may track spans with onStart / onEnd calls

Delegated Processor here is the BatchSpanProcessor itself. This custom processor is acting as an intermediate layer between BatchSpanProcessor and Otel framework and forwards the normal span i.e. spans which are not marked by inferred sampler directly to BatchSpanProcessor and if the span is marked by inferred sampler then filter out spans that are outlier or their children are outlier. This is done by adding a sampled attribute to the span itself so that the processor can distinguish between the outlier and non-outlier spans

We cannot do that: the onStart / onEnd pair has to be called (by OTEL spec actually). The fact that we rely on BatchSpanProcessor is risky implementation detail we should not rely upon: using LeakDetectingSpanProcessor or even our own StrictCheckSpanProcessor will fail (the fact we have no failures means we have insufficient test coverage.

You are correct @reta . However we cannot use BatchSpanProcessor as it-is because we want the processor to be attribute aware. What we can do is create a WrappedBatchSpanProcessor that will always initialize BatchSpanProcessor as the delegated processor and use the same.

Since we are not following the specs (red flag here), the only way implement this feature is to provide our own SpanProcessor that does not rely on any existing OTEL provided classes.

So as I understand the concern here is that the onStart for certain spans will be called however the onEnd will not be called which could be an issue . Is that correct? So it would be better to have our own implementation for the same?

Yes, it violates specs and leaks spans (with delegation model)

reta · 2024-04-08T16:29:35Z

plugins/telemetry-otel/src/main/java/org/opensearch/telemetry/tracing/OTelSpan.java

@@ -32,9 +38,33 @@ public OTelSpan(String spanName, Span span, org.opensearch.telemetry.tracing.Spa

    @Override
    public void endSpan() {
+        if (isSpanOutlier()) {


How that is suppose to work in case of async calls when parent is finished before the child? (example below)

try (ScopedSpan startScopedSpan = tracer.startScopedSpan(...)) { threadPool.executor(...).execute( ... ) }

So the parent span object is still maintained in the scope of child even if the parent has ended. What we plan to do here is mark the current child's parent and the chain above as one of the parent within the chain would still be alive. As suggested here as well if the parent is still in the recording phase we will go ahead and add the attribute to the parent and mark the parent to be sampled as well. If the parent is not recording then the information about this parent will be stored in event of its parent so that we don't loose on the info of not-recording spans.

Adding the closed parent event to its parent span is still pending and will add in the next revision itself.

I don't understand the terminology here (what does it mean for span chain to be alive), sorry about that , let me try to illustrate that better:

try (ScopedSpan startScopedSpan = tracer.startScopedSpan(...)) { threadPool.executor(...).execute( ... ) }

since we sample 100%, all spans are sampled

the parent span is created (tracer.startScopedSpan) , onSpanStart is called

the threadPool captures context (current span as parent) and schedules the work

the parent span is ended, onSpanEnd is called

After onSpanEnd, the parent span is immutable, no matter if we still have reference to in from Java code or not - any modification to it have no effect but produce warning.

My question is: since we not be exporting the parent span (due to absence of the expected attributes), what would happen in this case?

So here is how this would work:

If the span is an outlier span we will add the sampled attribute to the span.

Once the span is marked to be sampled we will mark its parent to be sampled as well. This could now have two possibilities

One, the parent is still recording. If that is the case we will go ahead and add the attribute to the parent as well and the parent would be sampled as well.

Second, the parent is not recording/has ended in that case we mark its grand-parent if that is alive to be sampled and record this closed span information in the alive grand-parent itself. This way we will not get the span which was closed however the information about the span would still persist in its parent which will be sampled

This is what I have suspected - we are messing up the trace in unpredictable ways (another red flag), there should be only one possibility: the trace is recorded as it was sampled or not recorded.

Yes you are correct we will miss the span. However to maintain the lineage we will still maintain the information of closed span in its parent itself.

So here is the thing folks, I my opinion, which I shared on RFC as well:

if we cannot preserve span hierarchy, this is no go (the mess would be on user to deal with)

if we cannot follow the rules of the library we are using, this is no go

OTEL cannot solve tail sampling on library level, this is documented and acknowledged:

https://opentelemetry.io/blog/2022/tail-sampling/

https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor

There are other tracing instrumentations that try to implement sampling differently, including adaptive sampling that could tune the sampling rates based on the different factors:

https://github.com/openzipkin-attic/zipkin-zookeeper

https://medium.com/jaegertracing/adaptive-sampling-in-jaeger-50f336f4334

Please consider those as a robust and reliable alternatives, that do not exhibit flaws of this solution.

I see where you're coming from regarding the potential gaps in the hierarchy, which users may need to address based on their choice of collector and visualization solutions. It's been recognized as a requirement to address such gaps in the Otel specification, and the community appears to be quite receptive to it. If you take a look at the Otel issues provided, you'll notice that even with these gaps, there are still ways to infer missing parent information through parent span events, aiding in understanding the code path and other relevant details.

open-telemetry/opentelemetry-specification#3205
open-telemetry/opentelemetry-specification#3867

In general, there are numerous benefits to this approach, particularly for customers dealing with larger clusters processing tens of thousands of requests per second and managing multiple clusters. It offers insights into problematic areas of code, and the availability of missing span information as part of the event in the parent span can be very helpful in connecting the dots.

Moreover, this won't be the default behavior and can be enabled as needed (though it's crucial to clearly document the shortcomings and benefits of such samplers). There have been requests from Otel users for deferred sampling support and reverse propagation of context, and it's hoped that these features will be available soon, at which point a switch can be made.

@Gaganjuneja I shared my opinion on the subject many times, please let's get back to this when these features are available, if other maintainers see this is a way to go , I am fine with that, but strong -1 from me moving this further as it stands today.

One last thing, I think we can solve this issue by adding a link between the span and it's grand parent to persist the relationship. Since this is a follows_from span.

peternied · 2024-04-16T13:47:03Z

@nishchay21 @Gaganjuneja @reta From the discussion that I've read on this issue I believe this pull request should be closed unmerged.

I'd recommend regrouping with a fresh RFC(s) or considerable investment in a POC with reliability tests (that directly address the questions raised here).

opensearch-trigger-bot · 2024-05-16T15:20:40Z

This PR is stalled because it has been open for 30 days with no activity.

peternied · 2024-05-16T17:43:51Z

I'm closing this PR out because it has stalled and there doesn't seem to be a good course of action to move it forward. I'd recommend regrouping with a fresh RFC(s) or considerable investment in a POC with reliability tests that directly address the questions raised here.

Adding new inferred sampler

e02fc95

Signed-off-by: Nishchay Malhotra <[email protected]>

Added header null check and changelog

78dcc6b

Signed-off-by: Nishchay Malhotra <[email protected]>

Merge branch 'main' into sampling

432d294

Signed-off-by: Nishchay Malhotra <[email protected]>

Adding spotless apply

ad8d7f3

Signed-off-by: Nishchay Malhotra <[email protected]>

Fixing test cases

5c4128b

Signed-off-by: Nishchay Malhotra <[email protected]>

Adding fix for span creation npe

47bef14

Signed-off-by: Nishchay Malhotra <[email protected]>

Fix batch span processor worker leak

4f0586a

Signed-off-by: Nishchay Malhotra <[email protected]>

Gaganjuneja reviewed Mar 27, 2024

View reviewed changes

Merge branch 'main' into sampling

178e48b

Signed-off-by: Nishchay Malhotra <[email protected]>

Adding get attribute as readable span attribute

6bd52e9

Signed-off-by: Nishchay Malhotra <[email protected]>

Gaganjuneja reviewed Mar 28, 2024

View reviewed changes

reta reviewed Mar 28, 2024

View reviewed changes

libs/core/src/main/java/org/opensearch/core/transport/TransportMessage.java Outdated Show resolved Hide resolved

Fixing comments

7504c6a

Signed-off-by: Nishchay Malhotra <[email protected]> Adding new sampled attributes Adding new inferred sampler attributes Signed-off-by: Nishchay Malhotra <[email protected]>

Fixing spotlesschecks

a1b3df1

Signed-off-by: Nishchay Malhotra <[email protected]>

nishchay21 marked this pull request as ready for review April 1, 2024 04:41

nishchay21 requested review from peternied, abbashus and adnapibar as code owners April 1, 2024 04:41

nishchay21 requested review from mch2, msfroh, nknize, owaiskazi19, Rishikesh1159, ryanbogan, sachinpkale, saratvemulapalli, shwetathareja, sohami, tlfeng and VachaShah as code owners April 1, 2024 04:41

Merge branch 'main' into sampling

1e1d624

Signed-off-by: Nishchay Malhotra <[email protected]>

Adding unit test cases

8bce540

Signed-off-by: Nishchay Malhotra <[email protected]>

Addimh more unit test cases

b944c0b

Signed-off-by: Nishchay Malhotra <[email protected]>

nishchay21 requested a review from Gaganjuneja April 2, 2024 04:42

reta reviewed Apr 5, 2024

View reviewed changes

reta reviewed Apr 8, 2024

View reviewed changes

opensearch-trigger-bot bot added the stalled Issues that have stalled label May 16, 2024

peternied closed this May 16, 2024

Adding new inferred sampler #12927

Adding new inferred sampler #12927

Conversation

nishchay21 commented Mar 26, 2024 • edited Loading

Description

Related Issues

Check List

github-actions bot commented Mar 26, 2024

github-actions bot commented Mar 26, 2024 • edited Loading

Compatibility status:

Incompatible components

Skipped components

Compatible components

github-actions bot commented Mar 26, 2024

github-actions bot commented Mar 27, 2024

github-actions bot commented Mar 27, 2024

github-actions bot commented Mar 27, 2024

github-actions bot commented Mar 27, 2024

github-actions bot commented Mar 27, 2024

codecov bot commented Mar 27, 2024 • edited Loading

Codecov Report

github-actions bot commented Mar 28, 2024

github-actions bot commented Mar 28, 2024

github-actions bot commented Mar 29, 2024

github-actions bot commented Apr 1, 2024

github-actions bot commented Apr 1, 2024

github-actions bot commented Apr 1, 2024

github-actions bot commented Apr 1, 2024

nishchay21 commented Apr 3, 2024

reta commented Apr 3, 2024

Choose a reason for hiding this comment

nishchay21 Apr 6, 2024 • edited Loading

Choose a reason for hiding this comment

reta Apr 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reta Apr 9, 2024 • edited Loading

Choose a reason for hiding this comment

reta Apr 8, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reta Apr 9, 2024 • edited Loading

Choose a reason for hiding this comment

nishchay21 Apr 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nishchay21 Apr 11, 2024 • edited Loading

Choose a reason for hiding this comment

reta Apr 11, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

reta Apr 12, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

peternied commented Apr 16, 2024

opensearch-trigger-bot bot commented May 16, 2024

peternied commented May 16, 2024

nishchay21 commented Mar 26, 2024 •

edited

Loading

github-actions bot commented Mar 26, 2024 •

edited

Loading

codecov bot commented Mar 27, 2024 •

edited

Loading

nishchay21 Apr 6, 2024 •

edited

Loading

reta Apr 5, 2024 •

edited

Loading

reta Apr 9, 2024 •

edited

Loading

reta Apr 8, 2024 •

edited

Loading

reta Apr 9, 2024 •

edited

Loading

nishchay21 Apr 9, 2024 •

edited

Loading

nishchay21 Apr 11, 2024 •

edited

Loading

reta Apr 11, 2024 •

edited

Loading

reta Apr 12, 2024 •

edited

Loading