---
title: distributed-tracing-with-opentelemetry
authors:
- "@sallyom"
- "@parulsingh"
reviewers:
- TBD
approvers:
- TBD
creation-date: 2021-04-14
last-updated: 2021-06-09
status: informational
---

# Distributed Tracing with OpenTelemetry

## Release Signoff Checklist

- [ ] Enhancement is `implementable`
- [ ] Design details are appropriately documented from clear requirements
- [ ] Test plan is defined
- [ ] Operational readiness criteria are defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This document is an overview of distributed tracing and how it benefits a distributed system
such as OpenShift. It also provides the information necessary to configure distributed tracing
in an OpenShift deployment with a vendor-agnostic collector capable of exporting telemetry data
to any back-end analysis tool. Open-source back-ends for OpenTelemetry data include Jaeger,
Prometheus, and Zipkin; see the full list of
[vendors that currently support OpenTelemetry](https://opentelemetry.io/vendors/).

Distributed tracing makes it possible to quickly detect and diagnose problems in, and improve
the performance of, distributed systems and microservices. By definition, distributed tracing
tracks and observes service requests as they flow through a system, collecting data as requests
pass from one service to another. This makes it possible to pinpoint bugs, bottlenecks, and
other issues that impact overall system performance. Simply put, tracing provides the story of
an end-to-end request that is difficult to get otherwise.

OpenTelemetry defines a wire protocol, `OTLP`, that runs over `gRPC`. `OTLP` is a
request/response protocol used to send data to an `OpenTelemetry Collector`, which can be
configured to forward and export that data to any data analysis tool that supports OpenTelemetry.
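
As an illustration, a minimal `OpenTelemetry Collector` configuration might receive `OTLP` over
`gRPC` and export to a Jaeger back-end. This is a sketch only: the endpoint values are
placeholders, and exact field names vary across Collector versions.

```yaml
# Minimal Collector pipeline: receive OTLP over gRPC, export to Jaeger.
# Endpoints are placeholders; adjust for your deployment.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  jaeger:
    endpoint: jaeger-collector.tracing.svc:14250
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [jaeger]
```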

## Motivation

As platforms and applications become more distributed and built on microservices or serverless
functions, tracing provides an overall picture of system performance. This visibility reveals
service dependencies and how one component affects another, which is difficult to observe
otherwise. For example, many OpenShift bugs or issues are not contained to a single component.
Instead, several teams and component owners often work together to solve issues and make system
improvements. Distributed tracing aids this by tracking events across service boundaries.
Furthermore, tracing can shrink the time it takes to diagnose issues, surfacing useful
information and pinpointing problems without the need for extra code. Upstream, etcd has been
instrumented to export gRPC traces. CRI-O is also adding instrumentation. There is an open PR
against kubernetes-apiserver that will add OpenTelemetry tracing, and work has been underway to
instrument the kubelet and kube-scheduler as a proof of concept. With these components
instrumented, it will be possible to view traces spanning
CRI-O <-> Kubelet <-> Kube-Scheduler <-> Kube-Apiserver <-> etcd. At that point, there is much
to gain in instrumenting other components and extending the OpenTelemetry chain to give a
complete view of the system.

### Goals

Provide an easy way for OpenShift components to add instrumentation for distributed tracing
using OpenTelemetry. Adding OpenTelemetry tracing spans requires only a few lines of code. The
more components that add tracing, the more complete the picture will be for anyone debugging or
trying to understand cluster performance. Also, a vendor-agnostic OpenTelemetry operator
available through the OpenShift OperatorHub can be deployed temporarily to turn on tracing
during debugging, and removed when no longer needed. Any component that adds instrumentation
should add a noop exporter and/or a switch to turn tracing on.
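
For example, a hypothetical Go component could gate its exporter behind a flag and fall back to
a no-op provider when tracing is off. This is a minimal sketch using the OpenTelemetry Go SDK;
the flag name and collector endpoint are illustrative, and exact import paths vary by SDK version.

```go
package main

import (
	"context"
	"flag"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// Illustrative switch: components should expose some way to turn tracing on.
var enableTracing = flag.Bool("enable-tracing", false, "export spans via OTLP")

func initTracing(ctx context.Context) func() {
	if !*enableTracing {
		// Switch is off: install a no-op provider so spans cost almost nothing.
		otel.SetTracerProvider(trace.NewNoopTracerProvider())
		return func() {}
	}
	// Switch is on: export spans to a collector over OTLP/gRPC.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"), // placeholder endpoint
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("failed to create OTLP exporter: %v", err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return func() { _ = tp.Shutdown(ctx) }
}

func main() {
	flag.Parse()
	ctx := context.Background()
	shutdown := initTracing(ctx)
	defer shutdown()

	// Adding a span around a unit of work really is just a few lines.
	ctx, span := otel.Tracer("example").Start(ctx, "do-work")
	defer span.End()
	_ = ctx // pass ctx to downstream calls so child spans nest correctly
}
```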

### Non-Goals

The OpenTelemetry operator for OpenShift will not be part of core OpenShift. Instead, the operator is available
on the OperatorHub in the OpenShift Console.

## Proposal

This is a living document that will serve as a record of adding tracing to CRI-O. The end
result will be a merged document that serves as a guideline for anyone else who wishes to add
tracing to their components or applications.

### User Stories

* As a cluster administrator, I want to diagnose performance issues with my component or service (CRI-O).
* As a cluster administrator, I want to inspect the service boundary between CRI-O and the kubelet.
* As a component owner, I want to instrument code to enable OpenTelemetry tracing.
* As a component owner, I want to propagate OpenTelemetry data to other components, as sketched below.
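
The propagation story in the last user story might look like the following sketch, which assumes
an HTTP boundary and the W3C Trace Context propagator; the function names are illustrative and
not taken from any specific component.

```go
package main

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Use the W3C Trace Context format so peers can parse our headers.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// callPeer forwards the current trace context to a downstream component
// by injecting it into the outgoing request headers.
func callPeer(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// handler extracts the upstream trace context so spans started here
// become children of the caller's span.
func handler(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_, span := otel.Tracer("example").Start(ctx, "handle-request")
	defer span.End()
}
```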

### Implementation Details/Notes/Constraints [optional]

TODO: Add all the steps required to add tracing to a component, and how to deploy a back-end
such as Jaeger, which is currently available on the OperatorHub in the OpenShift Console.
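
As a placeholder until those steps are written: with the Jaeger Operator installed from
OperatorHub, a minimal all-in-one Jaeger instance can be created with a custom resource as
small as the following (the instance name is illustrative).

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simplest
```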

### Risks and Mitigations

TODO

## Design Details

TODO

### Open Questions [optional]

TODO

### Upgrade / Downgrade Strategy

TODO

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

The idea is to find the best form of an argument why this enhancement should _not_ be implemented.
