Copy 2.16 docs to 2-edge (#1832)
Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Flynn <[email protected]>
adleong and kflynn authored Sep 9, 2024
1 parent b7c9278 commit 26acc99
Showing 42 changed files with 1,838 additions and 1,003 deletions.
27 changes: 15 additions & 12 deletions linkerd.io/content/2-edge/checks/index.html
@@ -1,18 +1,21 @@
 <!doctype html>
 <html lang="en">
-<head>
-<meta charset="UTF-8">
-<meta http-equiv="refresh" content="0; url=../tasks/troubleshooting/">
-<script type="text/javascript">
-window.onload = function() {
+
+<head>
+<meta charset="UTF-8">
+<meta http-equiv="refresh" content="0; url=../tasks/troubleshooting/">
+<script type="text/javascript">
+window.onload = function () {
 var hash = window.location.hash;
 window.location.href = window.location.origin + "/2-edge/tasks/troubleshooting/" + hash;
 }
-</script>
-<title>Linkerd Check Redirection</title>
-</head>
-<body>
-If you are not redirected automatically, follow this
-<a href='../tasks/troubleshooting/'>link</a>.
-</body>
+</script>
+<title>Linkerd Check Redirection</title>
+</head>
+
+<body>
+If you are not redirected automatically, follow this
+<a href='../tasks/troubleshooting/'>link</a>.
+</body>
+
 </html>
21 changes: 21 additions & 0 deletions linkerd.io/content/2-edge/common-errors/_index.md
@@ -0,0 +1,21 @@
+++
title = "Common Errors"
weight = 10
[sitemap]
priority = 1.0
+++

Linkerd is generally robust, but things can always go wrong! You'll find
information here about the most common things that cause people trouble.

## When in Doubt, Start With `linkerd check`

Whenever you see anything that looks unusual about your mesh, **always** start
with `linkerd check`. It runs a long series of checks for problems that have
tripped up others, verifies that your configuration is sane, and points you to
help for any problems it finds. It's hard to overstate how useful this command
is.
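
For example:

```bash
# Check the control plane and overall mesh health.
linkerd check

# Also check the data-plane proxies.
linkerd check --proxy
```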

## Common Errors

{{% sectiontoc "common-errors" %}}
18 changes: 18 additions & 0 deletions linkerd.io/content/2-edge/common-errors/failfast.md
@@ -0,0 +1,18 @@
+++
title = "Failfast"
description = "Failfast means that no endpoints are available."
+++

If Linkerd reports that a given service is in the _failfast_ state, it
means that the proxy has determined that there are no available endpoints
for that service. In this situation there's no point in the proxy even
trying to make a connection to the service - it already knows that it
can't talk to it - so it reports that the service is in failfast and
immediately returns an error instead.

The error will be either a 503 or a 504; see
[HTTP 503 and 504 Errors](../http-503-504/) for the distinction. But if you
already know from the logs that the service is in failfast, that's the
important part.

To get out of failfast, some endpoints for the service have to
become available.
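
As a quick sanity check, you can confirm that the service really has no
endpoints; a minimal sketch, assuming a service named `my-svc` in namespace
`my-ns` (both hypothetical):

```bash
# An empty ENDPOINTS column means failfast is expected behavior.
kubectl get endpoints my-svc -n my-ns

# Endpoints usually vanish because no pods are ready; check readiness too.
kubectl get pods -n my-ns
```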
11 changes: 11 additions & 0 deletions linkerd.io/content/2-edge/common-errors/http-502.md
@@ -0,0 +1,11 @@
+++
title = "HTTP 502 Errors"
description = "HTTP 502 means connection errors between proxies."
+++

The Linkerd proxy will return a 502 error for connection errors between
proxies. Unfortunately, it's fairly common to see an uptick in 502s when
meshing a workload for the first time, because the mesh surfaces connection
errors that were previously invisible!

There's actually a whole page on [debugging 502s](../../tasks/debugging-502s/).
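
If you want to see the underlying connection errors as they happen, the
proxy's logs are a good place to look; a sketch, assuming a meshed deployment
named `web` (hypothetical):

```bash
# The mesh sidecar is the linkerd-proxy container in each meshed pod.
kubectl logs deploy/web -c linkerd-proxy | grep -i "connect"
```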
27 changes: 27 additions & 0 deletions linkerd.io/content/2-edge/common-errors/http-503-504.md
@@ -0,0 +1,27 @@
+++
title = "HTTP 503 and 504 Errors"
description = "HTTP 503 and 504 mean overloaded workloads."
+++

503s and 504s show up when a Linkerd proxy is trying to send a workload more
requests than the workload can keep up with.

When the workload next to a proxy makes a request, the proxy adds it
to an internal dispatch queue. When things are going smoothly, the
request is pulled from the queue and dispatched almost immediately.
If the queue gets too long, though (which can generally happen only
if the called service is slow to respond), the proxy will go into
_load-shedding_, where any new request gets an immediate 503. The
proxy can only get _out_ of load-shedding when the queue shrinks.

Failfast also plays a role here: if the proxy puts a service into
failfast while there are requests in the dispatch queue, all the
requests in the dispatch queue get an immediate 504 before the
proxy goes into load-shedding.

To get out of failfast, some endpoints for the service have to
become available.

To get out of load-shedding, the dispatch queue has to start
emptying, which implies that the service has to get more capacity
to process requests or that the incoming request rate has to drop.
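
If you have the viz extension installed, watching per-deployment traffic while
the service is under load can confirm this picture; the `emojivoto` namespace
here is just an example:

```bash
# Success rate, request rate, and latency for each deployment.
linkerd viz stat deploy -n emojivoto
```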
35 changes: 35 additions & 0 deletions linkerd.io/content/2-edge/common-errors/protocol-detection.md
@@ -0,0 +1,35 @@
+++
title = "Protocol Detection Errors"
description = "Protocol detection errors indicate that Linkerd doesn't understand the protocol in use."
+++

Linkerd is capable of proxying all TCP traffic, including TLS connections,
WebSockets, and HTTP tunneling. In most cases where the client speaks first
when a new connection is made, Linkerd can detect the protocol in use,
allowing it to perform per-request routing and collect per-request metrics.

If your proxy logs contain messages like `protocol detection timed out after
10s`, or you're experiencing 10-second delays when establishing connections,
you're probably running into a situation where Linkerd cannot detect the
protocol.
This is most common for protocols where the server speaks first, and the
client is waiting for information from the server. It may also occur with
non-HTTP protocols for which Linkerd doesn't yet understand the wire format of
a request.

You'll need to understand exactly what the situation is to fix this:

- A server-speaks-first protocol will probably need to be configured as a
`skip` or `opaque` port, as described in the [protocol detection
documentation](../../features/protocol-detection/#configuring-protocol-detection).

- If you're seeing transient protocol detection timeouts, this is more likely
to indicate a misbehaving workload.

- If you know the protocol is client-speaks-first but you're getting
consistent protocol detection timeouts, you'll probably need to fall back on
a `skip` or `opaque` port.

Note that marking ports as `skip` or `opaque` has ramifications beyond
protocol detection timeouts; see the [protocol detection
documentation](../../features/protocol-detection/#configuring-protocol-detection)
for more information.
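
For example, MySQL is a server-speaks-first protocol, so if meshed clients
talk to a MySQL service on port 3306, you could mark that port as opaque on
the destination Service (the service name and namespace are illustrative):

```bash
# Proxy port 3306 as raw TCP, skipping protocol detection entirely.
kubectl annotate service mysql -n my-ns config.linkerd.io/opaque-ports=3306
```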
7 changes: 7 additions & 0 deletions linkerd.io/content/2-edge/features/cni.md
@@ -25,6 +25,13 @@ plugin, using _CNI chaining_. It handles only the Linkerd-specific
configuration and does not replace the need for a CNI plugin.
{{< /note >}}

{{< note >}}
If you're installing Linkerd's CNI plugin on top of Cilium, make sure to
install the latter with the option `cni.exclusive=false`, so that Cilium
doesn't take exclusive ownership of the CNI configuration directory and
other plugins can deploy their configurations there.
{{< /note >}}

## Installation

Usage of the Linkerd CNI plugin requires that the `linkerd-cni` DaemonSet be
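For example, a Helm install of Cilium honoring that option might look like
the following sketch (see the Cilium docs for the full procedure):

```bash
# Keep Cilium from claiming exclusive ownership of the CNI config
# directory, so linkerd-cni can write its configuration alongside it.
helm install cilium cilium/cilium -n kube-system --set cni.exclusive=false
```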
20 changes: 0 additions & 20 deletions linkerd.io/content/2-edge/features/ha.md
@@ -79,26 +79,6 @@ See the Kubernetes
 for more information on the admission webhook failure policy.
 {{< /note >}}
 
-## Exclude the kube-system namespace
-
-Per recommendation from the Kubernetes
-[documentation](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#avoiding-operating-on-the-kube-system-namespace),
-the proxy injector should be disabled for the `kube-system` namespace.
-
-This can be done by labeling the `kube-system` namespace with the following
-label:
-
-```bash
-kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled
-```
-
-The Kubernetes API server will not call the proxy injector during the admission
-phase of workloads in namespace with this label.
-
-If your Kubernetes cluster have built-in reconcilers that would revert any changes
-made to the `kube-system` namespace, you should loosen the proxy injector
-failure policy following these [instructions](#proxy-injector-failure-policy).
-
 ## Pod anti-affinity rules
 
 All critical control plane components are deployed with pod anti-affinity rules
7 changes: 7 additions & 0 deletions linkerd.io/content/2-edge/features/httproute.md
@@ -24,6 +24,13 @@ documentation](../../reference/httproute/#linkerd-and-gateway-api-httproutes)
for details.
{{< /note >}}

If the Gateway API CRDs already exist in your cluster, then Linkerd must be
installed with the `--set enableHttpRoutes=false` flag during the
`linkerd install --crds` step, or with the `enableHttpRoutes=false` Helm value
when installing the `linkerd-crds` Helm chart. This avoids conflicts by
instructing Linkerd not to install the Gateway API CRDs and to rely instead on
the Gateway API CRDs that already exist.

An HTTPRoute is a Kubernetes resource which attaches to a parent resource, such
as a [Service]. The HTTPRoute defines a set of rules which match HTTP requests
to that resource, based on parameters such as the request's path, method, and
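A sketch of that CLI install sequence (the Helm path would instead pass the
same value to the `linkerd-crds` chart):

```bash
# Install Linkerd's CRDs, skipping the Gateway API CRDs that already exist.
linkerd install --crds --set enableHttpRoutes=false | kubectl apply -f -

# Then install the control plane as usual.
linkerd install | kubectl apply -f -
```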
14 changes: 14 additions & 0 deletions linkerd.io/content/2-edge/features/ipv6.md
@@ -0,0 +1,14 @@
+++
title = "IPv6 Support"
description = "Linkerd is compatible with both IPv6-only and dual-stack clusters."
+++

As of version 2.16 (and edge-24.8.2), Linkerd fully supports Kubernetes
clusters configured for IPv6-only or dual-stack networking.

This is disabled by default; to enable it, set `proxy.disableIPv6=false` when
installing the control plane and, if you use it, the linkerd-cni plugin.

Enabling IPv6 support does not generally change how Linkerd operates, except in
one way: when enabled on a dual-stack cluster, Linkerd will only use the IPv6
endpoints of destinations and will not use the IPv4 endpoints.
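
Concretely, enabling IPv6 at control-plane install time might look like this
sketch (the same value applies when installing via Helm):

```bash
# Enable IPv6 support in the proxies at install time.
linkerd install --set proxy.disableIPv6=false | kubectl apply -f -
```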
2 changes: 1 addition & 1 deletion linkerd.io/content/2-edge/features/multicluster.md
@@ -15,7 +15,7 @@ topology. This multi-cluster capability is designed to provide:
3. **Support for any type of network.** Linkerd does not require any specific
network topology between clusters, and can function both with hierarchical
networks as well as when clusters [share the same flat
-network](#multi-cluster-for-flat-networks).
+network](#flat-networks).
4. **A unified model alongside in-cluster communication.** The same
observability, reliability, and security features that Linkerd provides for
in-cluster communication extend to cross-cluster communication.
Expand Down
16 changes: 16 additions & 0 deletions linkerd.io/content/2-edge/features/non-kubernetes-workloads.md
@@ -0,0 +1,16 @@
---
title: Non-Kubernetes workloads (mesh expansion)
---

Linkerd features *mesh expansion*, or the ability to add non-Kubernetes
workloads to your service mesh by deploying the Linkerd proxy to the remote
machine and connecting it back to the Linkerd control plane within the mesh.
This allows you to use Linkerd to establish communication to and from the
workload that is secure, reliable, and observable, just like communication to
and from your Kubernetes workloads.

Related content:

* [Guide: Adding non-Kubernetes workloads to your mesh]({{< relref
"../tasks/adding-non-kubernetes-workloads" >}})
* [ExternalWorkload Reference]({{< relref "../reference/external-workload" >}})
12 changes: 11 additions & 1 deletion linkerd.io/content/2-edge/features/proxy-injection.md
@@ -34,7 +34,7 @@ For each pod, two containers are injected:
Container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
that configures `iptables` to automatically forward all incoming and
outgoing TCP traffic through the proxy. (Note that this container is not
-present if the [Linkerd CNI Plugin](../cni/) has been enabled.)
+injected if the [Linkerd CNI Plugin](../cni/) has been enabled.)
1. `linkerd-proxy`, the Linkerd data plane proxy itself.

Note that simply adding the annotation to a resource with pre-existing pods
@@ -43,6 +43,16 @@ will not automatically inject those pods. You will need to update the pods
because Kubernetes does not call the webhook until it needs to update the
underlying resources.

## Exclusions

At install time, Kubernetes is configured to avoid calling Linkerd's proxy
injector for resources in the `kube-system` and `cert-manager` namespaces. This
is to prevent injection on components that are themselves required for Linkerd
to function.

The injector will not run on components in these namespaces, regardless of any
`linkerd.io/inject` annotations.

## Overriding injection

Automatic injection can be disabled for a pod or deployment for which it would
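As the text above notes, pre-existing pods are not re-injected until they are
recreated. One common pattern is to inject the annotation and roll the
workload, sketched here for a hypothetical deployment `web`:

```bash
# Add the inject annotation to the pod template and re-apply...
kubectl get deploy web -o yaml | linkerd inject - | kubectl apply -f -

# ...or, if the annotation is already in place, just recreate the pods.
kubectl rollout restart deploy web
```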
69 changes: 7 additions & 62 deletions linkerd.io/content/2-edge/features/retries-and-timeouts.md
@@ -4,26 +4,16 @@ description = "Linkerd can perform service-specific retries and timeouts."
 weight = 3
 +++
 
-Automatic retries are one the most powerful and useful mechanisms a service mesh
-has for gracefully handling partial or transient application failures. If
-implemented incorrectly retries can amplify small errors into system wide
-outages. For that reason, we made sure they were implemented in a way that would
-increase the reliability of the system while limiting the risk.
+Timeouts and automatic retries are two of the most powerful and useful
+mechanisms a service mesh has for gracefully handling partial or transient
+application failures.
 
-Timeouts work hand in hand with retries. Once requests are retried a certain
-number of times, it becomes important to limit the total amount of time a client
-waits before giving up entirely. Imagine a number of retries forcing a client
-to wait for 10 seconds.
-
-Timeouts can be configured using either the [HTTPRoute] or [ServiceProfile]
-resources. Currently, retries can only be configured using [ServiceProfile]s,
-but support for configuring retries using [HTTPRoutes] will be added in a future
-release. Creating these policy resources will cause the Linkerd proxy to perform
-the appropriate retries or timeouts when calling that service. Retries and
-timeouts are always performed on the *outbound* (client) side.
+Timeouts and retries can be configured using [HTTPRoute], GRPCRoute, or Service
+resources. Retries and timeouts are always performed on the *outbound* (client)
+side.
 
 {{< note >}}
-If working with headless services, service profiles cannot be retrieved. Linkerd
+If working with headless services, outbound policy cannot be retrieved. Linkerd
 reads service discovery information based off the target IP address, and if that
 happens to be a pod IP address then it cannot tell which service the pod belongs
 to.
@@ -34,49 +34,4 @@ These can be setup by following the guides:
 - [Configuring Retries](../../tasks/configuring-retries/)
 - [Configuring Timeouts](../../tasks/configuring-timeouts/)
 
-## How Retries Can Go Wrong
-
-Traditionally, when performing retries, you must specify a maximum number of
-retry attempts before giving up. Unfortunately, there are two major problems
-with configuring retries this way.
-
-### Choosing a maximum number of retry attempts is a guessing game
-
-You need to pick a number that’s high enough to make a difference; allowing
-more than one retry attempt is usually prudent and, if your service is less
-reliable, you’ll probably want to allow several retry attempts. On the other
-hand, allowing too many retry attempts can generate a lot of extra requests and
-extra load on the system. Performing a lot of retries can also seriously
-increase the latency of requests that need to be retried. In practice, you
-usually pick a maximum retry attempts number out of a hat (3?) and then tweak
-it through trial and error until the system behaves roughly how you want it to.
-
-### Systems configured this way are vulnerable to retry storms
-
-A [retry storm](https://twitter.github.io/finagle/guide/Glossary.html)
-begins when one service starts (for any reason) to experience a larger than
-normal failure rate. This causes its clients to retry those failed requests.
-The extra load from the retries causes the service to slow down further and
-fail more requests, triggering more retries. If each client is configured to
-retry up to 3 times, this can quadruple the number of requests being sent! To
-make matters even worse, if any of the clients’ clients are configured with
-retries, the number of retries compounds multiplicatively and can turn a small
-number of errors into a self-inflicted denial of service attack.
-
-## Retry Budgets to the Rescue
-
-To avoid the problems of retry storms and arbitrary numbers of retry attempts,
-retries are configured using retry budgets. Rather than specifying a fixed
-maximum number of retry attempts per request, Linkerd keeps track of the ratio
-between regular requests and retries and keeps this number below a configurable
-limit. For example, you may specify that you want retries to add at most 20%
-more requests. Linkerd will then retry as much as it can while maintaining that
-ratio.
-
-Configuring retries is always a trade-off between improving success rate and
-not adding too much extra load to the system. Retry budgets make that trade-off
-explicit by letting you specify exactly how much extra load your system is
-willing to accept from retries.
-
-[ServiceProfile]: ../service-profiles/
 [HTTPRoute]: ../httproute/
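
One plausible shape for the new configuration model - the annotation names
below are assumptions, and the configuration guides linked above are
authoritative - is annotating a Service directly:

```bash
# Hypothetical annotation names; consult the retries/timeouts guides.
kubectl annotate service my-svc \
  retry.linkerd.io/http=5xx \
  retry.linkerd.io/limit=2 \
  timeout.linkerd.io/request=5s
```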