Signed-off-by: Sally O'Malley <[email protected]>
Signed-off-by: Parul Singh <[email protected]>
Co-authored-by: husky-parul <[email protected]>
Co-authored-by: damemi <[email protected]>
1 parent ac1c27d, commit ebdc855
Showing 1 changed file with 120 additions and 0 deletions.
enhancements/distributed-tracing/distributed-tracing.md
@@ -0,0 +1,120 @@
---
title: distributed-tracing-with-opentelemetry
authors:
- "@sallyom"
- "@husky-parul"
- "@damemi"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-04-14
last-updated: 2021-08-04
status: informational
---

# Distributed Tracing with OpenTelemetry

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria are defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This document gives an overview of OpenTelemetry tracing and how it benefits a distributed system
such as OpenShift. It also provides the information needed to configure distributed tracing in an
OpenShift deployment with a vendor-agnostic collector that can export telemetry data to a variety
of backend analysis tools. Open-source backends for OpenTelemetry data include Jaeger and Zipkin.
See the list of [vendors that currently support OpenTelemetry](https://opentelemetry.io/vendors/).

OpenTelemetry tracing helps detect and diagnose problems quickly and improve the performance
of distributed systems and microservices. Distributed tracing tracks and observes service
requests as they flow through a system by collecting data as each request passes from one
service to another. This makes it possible to pinpoint bugs, bottlenecks, and other issues
that affect overall system performance. Tracing tells the story of an end-to-end request,
which is difficult to get otherwise.

The OpenTelemetry Collector can ingest data in OTLP (OpenTelemetry Protocol) format and can be
configured to forward and export that data to a variety of vendor-specific backends, as well as
to backends that ingest data in OTLP, Jaeger, or Zipkin formats.

## Motivation

As platforms and applications become more distributed and are increasingly built on microservices or
serverless architectures, tracing provides an overall picture of system performance. This visibility
reveals service dependencies and how one component affects another, which are otherwise difficult to
observe. For example, many OpenShift bugs and issues are not confined to a single component. Instead,
several teams and component owners often work together to resolve issues and make system improvements.
Distributed tracing aids this work by tracking events across service boundaries. Tracing can also
shrink the time it takes to diagnose issues, surfacing useful information and pinpointing problems
without the need for extra code.

Upstream, etcd has been instrumented to export gRPC traces, and CRI-O is also adding instrumentation.
The Kubernetes API server added the option to enable OpenTelemetry tracing in version 1.22. A KEP is
under review and work is underway to instrument the kubelet, and a proof of concept has been created
with kube-scheduler. With these components instrumented, it will be possible to view traces across
CRI-O <-> kubelet <-> kube-apiserver <-> etcd. At this point, there is much to gain from instrumenting
other components and extending OpenTelemetry tracing to give a complete view of the system.

### Goals

Provide an easy way for OpenShift components to add instrumentation for distributed tracing using OpenTelemetry.
Adding OpenTelemetry tracing spans requires only a few lines of code. The more components that add tracing,
the more complete the picture will be for anyone who is debugging or trying to understand cluster performance.
A vendor-agnostic OpenTelemetry Collector, with an operator in the works for OperatorHub, can be deployed
temporarily while debugging and removed when it is no longer needed. Any component that adds instrumentation
should also add a switch so that tracing is easy to enable and disable. When tracing is disabled, there will
be no instrumentation. If tracing is enabled but no backend is detected, there should be no performance hit
and no trace exporter connection errors in component logs.

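The on/off switch described in the goals can be sketched in Go. This is a minimal illustration under stated assumptions, not the OpenTelemetry API: `Tracer`, `Span`, `NewTracer`, and the `ENABLE_TRACING` environment variable are hypothetical names introduced here. The idea is that instrumented code always calls the same interface, and the switch decides whether a no-op or a recording implementation sits behind it.

```go
package main

import (
	"fmt"
	"os"
)

// Span and Tracer are hypothetical, stdlib-only stand-ins used for
// illustration; they are not the OpenTelemetry API.
type Span interface {
	End()
}

type Tracer interface {
	Start(name string) Span
}

// noopTracer backs the interface when tracing is switched off:
// starting and ending a span records nothing.
type noopTracer struct{}

type noopSpan struct{}

func (noopTracer) Start(string) Span { return noopSpan{} }

func (noopSpan) End() {}

// recordingTracer keeps the names of finished spans in memory. A real
// component would hand finished spans to an exporter instead.
type recordingTracer struct {
	finished []string
}

type recordingSpan struct {
	tracer *recordingTracer
	name   string
}

func (t *recordingTracer) Start(name string) Span {
	return &recordingSpan{tracer: t, name: name}
}

func (s *recordingSpan) End() {
	s.tracer.finished = append(s.tracer.finished, s.name)
}

// NewTracer is the single switch: call sites never check whether
// tracing is enabled, they only use the Tracer they are given.
func NewTracer(enabled bool) Tracer {
	if !enabled {
		return noopTracer{}
	}
	return &recordingTracer{}
}

func main() {
	// ENABLE_TRACING is a hypothetical environment variable.
	tracer := NewTracer(os.Getenv("ENABLE_TRACING") == "true")

	span := tracer.Start("RunPodSandbox")
	span.End()

	fmt.Printf("tracer in use: %T\n", tracer)
}
```

Because call sites never inspect the switch themselves, disabling tracing leaves only no-op calls behind. OpenTelemetry's Go API takes a similar approach: when no tracer provider has been registered, the spans it hands out are no-ops.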
### Non-Goals

The OpenTelemetry Collector operator for OpenShift will not be part of core OpenShift. Instead, the operator is
available on OperatorHub in the OpenShift console or can be deployed manually.

## Proposal

This is a living document that records the work of adding tracing to CRI-O. The end
result will be a merged document that serves as a guideline for anyone else who wishes to add tracing
to their components or applications.

### User Stories

* As a cluster administrator, I want to easily switch OpenTelemetry tracing on or off.
* As a cluster administrator, I want to diagnose performance issues with my component or service (CRI-O).
* As a cluster administrator, I want to inspect the service boundary between CRI-O and kubelet.
* As a component owner, I want to instrument code to enable OpenTelemetry tracing.
* As a component owner, I want to propagate OpenTelemetry data to other components.

### Implementation Details/Notes/Constraints [optional]

TODO: Add all steps required to add tracing to a component and to set up a backend such as
Jaeger, which is currently available on OperatorHub in the OpenShift console.

### Risks and Mitigations

TODO

## Design Details

TODO

### Open Questions [optional]

TODO

### Upgrade / Downgrade Strategy

TODO

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

The idea is to find the best form of an argument for why this enhancement should _not_ be implemented.