---
title: distributed-tracing-with-opentelemetry
authors:
- "@sallyom"
- "@husky-parul"
- "@damemi"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-04-14
last-updated: 2021-08-04
status: informational
---

# Distributed Tracing with OpenTelemetry

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This document gives an overview of OpenTelemetry tracing and how it benefits a distributed system
such as OpenShift. It also provides the information necessary to configure distributed tracing in
an OpenShift deployment with a vendor-agnostic collector capable of exporting telemetry data to any
back-end analysis tool. Jaeger and Zipkin are two open-source back-ends for OpenTelemetry data; see
the list of [vendors that currently support OpenTelemetry](https://opentelemetry.io/vendors/).

OpenTelemetry tracing makes it possible to quickly detect and diagnose problems, as well as to
improve the performance of distributed systems and microservices. By definition, distributed
tracing tracks and observes service requests as they flow through a system, collecting data as
requests pass from one service to another. This makes it possible to pinpoint bugs, bottlenecks,
and other issues that affect overall system performance. Tracing tells the story of an end-to-end
request in a way that is difficult to get otherwise.

The OpenTelemetry Collector component can ingest data in OTLP format and be configured to forward
and export data to a variety of vendor-specific backends, as well as backends that can ingest data
in OTLP, Jaeger, or Zipkin formats.
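
To make this concrete, here is a minimal, hedged sketch of how a Go component might export its
spans to such a collector over OTLP/gRPC. It is illustrative only and not part of this proposal:
the endpoint `localhost:4317`, the service name, and the function name `InitTracerProvider` are
assumptions, and the exact OpenTelemetry-Go module paths vary between SDK releases.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

// InitTracerProvider wires an OTLP/gRPC exporter to a collector endpoint and
// registers the resulting TracerProvider globally. The endpoint and service
// name are illustrative; a real component would read them from configuration.
func InitTracerProvider(ctx context.Context, endpoint, service string) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "localhost:4317"
		otlptracegrpc.WithInsecure(),         // plaintext, e.g. a local collector sidecar
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter), // batch spans before export
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String(service),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
```

A caller would typically invoke this once at startup and `defer tp.Shutdown(ctx)` so buffered
spans are flushed on exit.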

## Motivation

As platforms and applications become more distributed and are built on microservices or serverless
functions, tracing provides an overall picture of system performance. This visibility reveals
service dependencies and how one component affects another, which is difficult to observe otherwise.
For example, many OpenShift bugs or issues are not contained to a single component; instead, several
teams and component owners often work together to solve issues and make system improvements.
Distributed tracing aids this by tracking events across service boundaries. Tracing can also shrink
the time it takes to diagnose issues, surfacing useful information and pinpointing problems without
adding extra debugging code. Upstream, etcd has been instrumented to export gRPC traces, and CRI-O
is adding instrumentation as well. The Kubernetes API server added an option to enable OpenTelemetry
tracing in version 1.22, a KEP is under review and work is underway to instrument the kubelet, and a
proof of concept exists for kube-scheduler. With these components instrumented, it will be possible
to view traces spanning CRI-O <-> kubelet <-> kube-apiserver <-> etcd. From there, there is much to
gain in instrumenting other components to give a complete view of the system.

### Goals

Provide an easy way for OpenShift components to add instrumentation for distributed tracing using
OpenTelemetry. Adding OpenTelemetry tracing spans requires only a few lines of code, and the more
components that add tracing, the more complete the picture becomes for anyone debugging or trying to
understand cluster performance. In addition, a vendor-agnostic OpenTelemetry Collector, with an
operator in the works on OperatorHub, can be deployed temporarily while debugging to turn on tracing
and removed when it is no longer needed. Any component that adds instrumentation should also add a
switch so that tracing is easy to enable and disable. When disabled, the instrumentation is inert;
when enabled but no backend is detected, there should be no performance hit and no trace-exporter
connection errors in component logs.
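
As a rough illustration of the on/off behavior described above (the switch and the names below are
hypothetical, not something this proposal prescribes): when tracing is disabled, a component simply
never installs an SDK `TracerProvider`, so the global provider remains the default no-op one and
span calls cost next to nothing; when enabled, instrumenting a code path really is only a few lines.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// handleRequest shows how small the instrumentation footprint is. With the
// default no-op TracerProvider (tracing switched off) these calls are
// essentially free; once an SDK provider is registered they emit real spans.
func handleRequest(ctx context.Context, containerID string) {
	tracer := otel.Tracer("my-component") // instrumentation name is illustrative
	ctx, span := tracer.Start(ctx, "CreateContainer")
	defer span.End()

	span.SetAttributes(attribute.String("container.id", containerID))

	doWork(ctx) // pass ctx on so child spans join the same trace
}

func doWork(ctx context.Context) { _ = ctx }

func main() {
	// Hypothetical switch: a real component would read this from a flag or a
	// config file and, when true, install an exporting SDK TracerProvider
	// (for example via a setup function like the sketch in the Summary).
	enableTracing := false
	if enableTracing {
		// install the exporting TracerProvider here
	}

	handleRequest(context.Background(), "abc123")
}
```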

### Non-Goals

The OpenTelemetry Collector operator for OpenShift will not be part of core OpenShift. Instead, the
operator is available on OperatorHub in the OpenShift console or can be deployed manually.

## Proposal

This is a living document that records the process of adding tracing to CRI-O. The end result will
be a merged document that serves as a guideline for anyone else who wishes to add tracing to their
components or applications.

### User Stories

* As a cluster administrator, I want to easily switch on or off OpenTelemetry tracing.
* As a cluster administrator, I want to diagnose performance issues with my component or service (CRI-O).
* As a cluster administrator, I want to inspect the service boundary between CRI-O and kubelet.
* As a component owner, I want to instrument code to enable OpenTelemetry tracing.
* As a component owner, I want to propagate OpenTelemetry data to other components.
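
The last two user stories, instrumenting code and propagating trace context across a boundary such
as kubelet to CRI-O, largely come down to registering a propagator and adding gRPC interceptors on
both sides of the call. The sketch below is illustrative only, assuming the `otelgrpc` contrib
instrumentation for Go gRPC; the helper names shown exist in the contrib module but vary between
releases, and the socket address is just an example.

```go
package main

import (
	"log"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"google.golang.org/grpc"
)

func main() {
	// Use the W3C Trace Context format so components built on other SDKs can
	// join the same trace.
	otel.SetTextMapPropagator(propagation.TraceContext{})

	// Server side (for example CRI-O's gRPC runtime service): the interceptors
	// extract incoming trace context and start server spans automatically.
	srv := grpc.NewServer(
		grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
		grpc.StreamInterceptor(otelgrpc.StreamServerInterceptor()),
	)
	_ = srv // register services and call srv.Serve(listener) in a real component

	// Client side (for example the kubelet calling CRI-O): the interceptor
	// injects the current span's context into outgoing gRPC metadata.
	conn, err := grpc.Dial("unix:///var/run/crio/crio.sock", // example address
		grpc.WithInsecure(),
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```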

### Implementation Details/Notes/Constraints [optional]

TODO: Add all steps required to add tracing to a component, and how to set up a back-end such as
Jaeger, which is currently available on OperatorHub in the OpenShift console.

### Risks and Mitigations

TODO

## Design Details

TODO

### Open Questions [optional]

TODO

### Upgrade / Downgrade Strategy

TODO

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

The idea is to find the best form of an argument for why this enhancement should _not_ be implemented.
