Copy 2.16 docs to 2-edge (#1832)
Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Flynn <[email protected]>
adleong and kflynn authored Sep 9, 2024
1 parent b7c9278 commit 26acc99
Showing 42 changed files with 1,838 additions and 1,003 deletions.
27 changes: 15 additions & 12 deletions linkerd.io/content/2-edge/checks/index.html
@@ -1,18 +1,21 @@
 <!doctype html>
 <html lang="en">
-<head>
-<meta charset="UTF-8">
-<meta http-equiv="refresh" content="0; url=../tasks/troubleshooting/">
-<script type="text/javascript">
-window.onload = function() {
+
+<head>
+<meta charset="UTF-8">
+<meta http-equiv="refresh" content="0; url=../tasks/troubleshooting/">
+<script type="text/javascript">
+window.onload = function () {
 var hash = window.location.hash;
 window.location.href = window.location.origin + "/2-edge/tasks/troubleshooting/" + hash;
 }
-</script>
-<title>Linkerd Check Redirection</title>
-</head>
-<body>
-If you are not redirected automatically, follow this
-<a href='../tasks/troubleshooting/'>link</a>.
-</body>
+</script>
+<title>Linkerd Check Redirection</title>
+</head>
+
+<body>
+If you are not redirected automatically, follow this
+<a href='../tasks/troubleshooting/'>link</a>.
+</body>
+
 </html>
21 changes: 21 additions & 0 deletions linkerd.io/content/2-edge/common-errors/_index.md
@@ -0,0 +1,21 @@
+++
title = "Common Errors"
weight = 10
[sitemap]
priority = 1.0
+++

Linkerd is generally robust, but things can always go wrong! You'll find
information here about the most common things that cause people trouble.

## When in Doubt, Start With `linkerd check`

Whenever you see anything that looks unusual about your mesh, **always** start
with `linkerd check`. It runs a long series of checks for problems that have
tripped up others, verifies that your configuration is sane, and points you to
help for any problems it finds. It's hard to overstate how useful this command
is.
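
For example:

```bash
# Check the control plane and overall mesh health.
linkerd check

# Also check the data-plane proxies.
linkerd check --proxy
```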

## Common Errors

{{% sectiontoc "common-errors" %}}
18 changes: 18 additions & 0 deletions linkerd.io/content/2-edge/common-errors/failfast.md
@@ -0,0 +1,18 @@
+++
title = "Failfast"
description = "Failfast means that no endpoints are available."
+++

If Linkerd reports that a given service is in the _failfast_ state, it
means that the proxy has determined that there are no available endpoints
for that service. In this situation there's no point in the proxy even
trying to make a connection to the service - it already knows that it
can't talk to it - so it reports that the service is in failfast and
immediately returns an error instead.

The error will be either a 503 or a 504; see
[HTTP 503 and 504 Errors](../http-503-504/) for the distinction. But if you
already know from the logs that the service is in failfast, that's the
important part.

To get out of failfast, some endpoints for the service have to
become available.
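
As a quick sanity check, you can confirm that the service really has no
endpoints; a minimal sketch, assuming a service named `my-svc` in namespace
`my-ns` (both hypothetical):

```bash
# An empty ENDPOINTS column means failfast is expected behavior.
kubectl get endpoints my-svc -n my-ns

# Endpoints usually vanish because no pods are ready; check readiness too.
kubectl get pods -n my-ns
```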
11 changes: 11 additions & 0 deletions linkerd.io/content/2-edge/common-errors/http-502.md
@@ -0,0 +1,11 @@
+++
title = "HTTP 502 Errors"
description = "HTTP 502 means connection errors between proxies."
+++

The Linkerd proxy will return a 502 error for connection errors between
proxies. Unfortunately, it's fairly common to see an uptick in 502s when
meshing a workload for the first time, because the mesh surfaces connection
errors that were previously invisible!

There's actually a whole page on [debugging 502s](../../tasks/debugging-502s/).
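
If you want to see the underlying connection errors as they happen, the
proxy's logs are a good place to look; a sketch, assuming a meshed deployment
named `web` (hypothetical):

```bash
# The mesh sidecar is the linkerd-proxy container in each meshed pod.
kubectl logs deploy/web -c linkerd-proxy | grep -i "connect"
```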
27 changes: 27 additions & 0 deletions linkerd.io/content/2-edge/common-errors/http-503-504.md
@@ -0,0 +1,27 @@
+++
title = "HTTP 503 and 504 Errors"
description = "HTTP 503 and 504 mean overloaded workloads."
+++

503s and 504s show up when a Linkerd proxy is trying to send a workload more
requests than the workload can keep up with.

When the workload next to a proxy makes a request, the proxy adds it
to an internal dispatch queue. When things are going smoothly, the
request is pulled from the queue and dispatched almost immediately.
If the queue gets too long, though (which can generally happen only
if the called service is slow to respond), the proxy will go into
_load-shedding_, where any new request gets an immediate 503. The
proxy can only get _out_ of load-shedding when the queue shrinks.

Failfast also plays a role here: if the proxy puts a service into
failfast while there are requests in the dispatch queue, all the
requests in the dispatch queue get an immediate 504 before the
proxy goes into load-shedding.

To get out of failfast, some endpoints for the service have to
become available.

To get out of load-shedding, the dispatch queue has to start
emptying, which implies that the service has to get more capacity
to process requests or that the incoming request rate has to drop.
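
If you have the viz extension installed, watching per-deployment traffic while
the service is under load can confirm this picture; the `emojivoto` namespace
here is just an example:

```bash
# Success rate, request rate, and latency for each deployment.
linkerd viz stat deploy -n emojivoto
```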
35 changes: 35 additions & 0 deletions linkerd.io/content/2-edge/common-errors/protocol-detection.md
@@ -0,0 +1,35 @@
+++
title = "Protocol Detection Errors"
description = "Protocol detection errors indicate that Linkerd doesn't understand the protocol in use."
+++

Linkerd is capable of proxying all TCP traffic, including TLS connections,
WebSockets, and HTTP tunneling. In most cases where the client speaks first
when a new connection is made, Linkerd can detect the protocol in use,
allowing it to perform per-request routing and collect per-request metrics.

If your proxy logs contain messages like `protocol detection timed out after
10s`, or you're experiencing 10-second delays when establishing connections,
you're probably running into a situation where Linkerd cannot detect the
protocol.
This is most common for protocols where the server speaks first, and the
client is waiting for information from the server. It may also occur with
non-HTTP protocols for which Linkerd doesn't yet understand the wire format of
a request.

You'll need to understand exactly what the situation is to fix this:

- A server-speaks-first protocol will probably need to be configured as a
`skip` or `opaque` port, as described in the [protocol detection
documentation](../../features/protocol-detection/#configuring-protocol-detection).

- If you're seeing transient protocol detection timeouts, this is more likely
to indicate a misbehaving workload.

- If you know the protocol is client-speaks-first but you're getting
consistent protocol detection timeouts, you'll probably need to fall back on
a `skip` or `opaque` port.

Note that marking ports as `skip` or `opaque` has ramifications beyond
protocol detection timeouts; see the [protocol detection
documentation](../../features/protocol-detection/#configuring-protocol-detection)
for more information.
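
For example, MySQL is a server-speaks-first protocol, so if meshed clients
talk to a MySQL service on port 3306, you could mark that port as opaque on
the destination Service (the service name and namespace are illustrative):

```bash
# Proxy port 3306 as raw TCP, skipping protocol detection entirely.
kubectl annotate service mysql -n my-ns config.linkerd.io/opaque-ports=3306
```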
7 changes: 7 additions & 0 deletions linkerd.io/content/2-edge/features/cni.md
@@ -25,6 +25,13 @@ plugin, using _CNI chaining_. It handles only the Linkerd-specific
configuration and does not replace the need for a CNI plugin.
{{< /note >}}

{{< note >}}
If you're installing Linkerd's CNI plugin on top of Cilium, make sure to
install the latter with the option `cni.exclusive=false`, so that Cilium
doesn't take exclusive ownership of the CNI configuration directory and
other plugins can deploy their configurations there.
{{< /note >}}

## Installation

Usage of the Linkerd CNI plugin requires that the `linkerd-cni` DaemonSet be
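For example, a Helm install of Cilium honoring that option might look like
the following sketch (see the Cilium docs for the full procedure):

```bash
# Keep Cilium from claiming exclusive ownership of the CNI config
# directory, so linkerd-cni can write its configuration alongside it.
helm install cilium cilium/cilium -n kube-system --set cni.exclusive=false
```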
20 changes: 0 additions & 20 deletions linkerd.io/content/2-edge/features/ha.md
@@ -79,26 +79,6 @@ See the Kubernetes
 for more information on the admission webhook failure policy.
 {{< /note >}}
 
-## Exclude the kube-system namespace
-
-Per recommendation from the Kubernetes
-[documentation](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#avoiding-operating-on-the-kube-system-namespace),
-the proxy injector should be disabled for the `kube-system` namespace.
-
-This can be done by labeling the `kube-system` namespace with the following
-label:
-
-```bash
-kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled
-```
-
-The Kubernetes API server will not call the proxy injector during the admission
-phase of workloads in namespace with this label.
-
-If your Kubernetes cluster have built-in reconcilers that would revert any changes
-made to the `kube-system` namespace, you should loosen the proxy injector
-failure policy following these [instructions](#proxy-injector-failure-policy).
-
 ## Pod anti-affinity rules
 
 All critical control plane components are deployed with pod anti-affinity rules
7 changes: 7 additions & 0 deletions linkerd.io/content/2-edge/features/httproute.md
@@ -24,6 +24,13 @@ documentation](../../reference/httproute/#linkerd-and-gateway-api-httproutes)
for details.
{{< /note >}}

If the Gateway API CRDs already exist in your cluster, then Linkerd must be
installed with the `--set enableHttpRoutes=false` flag during the
`linkerd install --crds` step, or with the `enableHttpRoutes=false` Helm value
when installing the `linkerd-crds` Helm chart. This avoids conflicts by
instructing Linkerd not to install the Gateway API CRDs and to rely instead on
the Gateway API CRDs that already exist.

An HTTPRoute is a Kubernetes resource which attaches to a parent resource, such
as a [Service]. The HTTPRoute defines a set of rules which match HTTP requests
to that resource, based on parameters such as the request's path, method, and
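A sketch of that CLI install sequence (the Helm path would instead pass the
same value to the `linkerd-crds` chart):

```bash
# Install Linkerd's CRDs, skipping the Gateway API CRDs that already exist.
linkerd install --crds --set enableHttpRoutes=false | kubectl apply -f -

# Then install the control plane as usual.
linkerd install | kubectl apply -f -
```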
14 changes: 14 additions & 0 deletions linkerd.io/content/2-edge/features/ipv6.md
@@ -0,0 +1,14 @@
+++
title = "IPv6 Support"
description = "Linkerd is compatible with both IPv6-only and dual-stack clusters."
+++

As of version 2.16 (and edge-24.8.2), Linkerd fully supports Kubernetes
clusters configured for IPv6-only or dual-stack networking.

This is disabled by default; to enable it, set `proxy.disableIPv6=false` when
installing the control plane and, if you use it, the linkerd-cni plugin.

Enabling IPv6 support does not generally change how Linkerd operates, except in
one way: when enabled on a dual-stack cluster, Linkerd will only use the IPv6
endpoints of destinations and will not use the IPv4 endpoints.
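
Concretely, enabling IPv6 at control-plane install time might look like this
sketch (the same value applies when installing via Helm):

```bash
# Enable IPv6 support in the proxies at install time.
linkerd install --set proxy.disableIPv6=false | kubectl apply -f -
```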
2 changes: 1 addition & 1 deletion linkerd.io/content/2-edge/features/multicluster.md
@@ -15,7 +15,7 @@ topology. This multi-cluster capability is designed to provide:
3. **Support for any type of network.** Linkerd does not require any specific
network topology between clusters, and can function both with hierarchical
networks as well as when clusters [share the same flat
-network](#multi-cluster-for-flat-networks).
+network](#flat-networks).
4. **A unified model alongside in-cluster communication.** The same
observability, reliability, and security features that Linkerd provides for
in-cluster communication extend to cross-cluster communication.
Expand Down
16 changes: 16 additions & 0 deletions linkerd.io/content/2-edge/features/non-kubernetes-workloads.md
@@ -0,0 +1,16 @@
---
title: Non-Kubernetes workloads (mesh expansion)
---

Linkerd features *mesh expansion*, or the ability to add non-Kubernetes
workloads to your service mesh by deploying the Linkerd proxy to the remote
machine and connecting it back to the Linkerd control plane within the mesh.
This allows you to use Linkerd to establish communication to and from the
workload that is secure, reliable, and observable, just like communication to
and from your Kubernetes workloads.

Related content:

* [Guide: Adding non-Kubernetes workloads to your mesh]({{< relref
"../tasks/adding-non-kubernetes-workloads" >}})
* [ExternalWorkload Reference]({{< relref "../reference/external-workload" >}})
12 changes: 11 additions & 1 deletion linkerd.io/content/2-edge/features/proxy-injection.md
@@ -34,7 +34,7 @@ For each pod, two containers are injected:
Container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/)
that configures `iptables` to automatically forward all incoming and
outgoing TCP traffic through the proxy. (Note that this container is not
-present if the [Linkerd CNI Plugin](../cni/) has been enabled.)
+injected if the [Linkerd CNI Plugin](../cni/) has been enabled.)
1. `linkerd-proxy`, the Linkerd data plane proxy itself.

Note that simply adding the annotation to a resource with pre-existing pods
@@ -43,6 +43,16 @@ will not automatically inject those pods. You will need to update the pods
because Kubernetes does not call the webhook until it needs to update the
underlying resources.

## Exclusions

At install time, Kubernetes is configured to avoid calling Linkerd's proxy
injector for resources in the `kube-system` and `cert-manager` namespaces. This
is to prevent injection on components that are themselves required for Linkerd
to function.

The injector will not run on components in these namespaces, regardless of any
`linkerd.io/inject` annotations.

## Overriding injection

Automatic injection can be disabled for a pod or deployment for which it would
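As the text above notes, pre-existing pods are not re-injected until they are
recreated. One common pattern is to inject the annotation and roll the
workload, sketched here for a hypothetical deployment `web`:

```bash
# Add the inject annotation to the pod template and re-apply...
kubectl get deploy web -o yaml | linkerd inject - | kubectl apply -f -

# ...or, if the annotation is already in place, just recreate the pods.
kubectl rollout restart deploy web
```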
69 changes: 7 additions & 62 deletions linkerd.io/content/2-edge/features/retries-and-timeouts.md
@@ -4,26 +4,16 @@ description = "Linkerd can perform service-specific retries and timeouts."
 weight = 3
 +++
 
-Automatic retries are one the most powerful and useful mechanisms a service mesh
-has for gracefully handling partial or transient application failures. If
-implemented incorrectly retries can amplify small errors into system wide
-outages. For that reason, we made sure they were implemented in a way that would
-increase the reliability of the system while limiting the risk.
+Timeouts and automatic retries are two of the most powerful and useful
+mechanisms a service mesh has for gracefully handling partial or transient
+application failures.
 
-Timeouts work hand in hand with retries. Once requests are retried a certain
-number of times, it becomes important to limit the total amount of time a client
-waits before giving up entirely. Imagine a number of retries forcing a client
-to wait for 10 seconds.
-
-Timeouts can be configured using either the [HTTPRoute] or [ServiceProfile]
-resources. Currently, retries can only be configured using [ServiceProfile]s,
-but support for configuring retries using [HTTPRoutes] will be added in a future
-release. Creating these policy resources will cause the Linkerd proxy to perform
-the appropriate retries or timeouts when calling that service. Retries and
-timeouts are always performed on the *outbound* (client) side.
+Timeouts and retries can be configured using [HTTPRoute], GRPCRoute, or Service
+resources. Retries and timeouts are always performed on the *outbound* (client)
+side.
 
 {{< note >}}
-If working with headless services, service profiles cannot be retrieved. Linkerd
+If working with headless services, outbound policy cannot be retrieved. Linkerd
 reads service discovery information based off the target IP address, and if that
 happens to be a pod IP address then it cannot tell which service the pod belongs
 to.
@@ -34,49 +34,4 @@ These can be setup by following the guides:
 - [Configuring Retries](../../tasks/configuring-retries/)
 - [Configuring Timeouts](../../tasks/configuring-timeouts/)
 
-## How Retries Can Go Wrong
-
-Traditionally, when performing retries, you must specify a maximum number of
-retry attempts before giving up. Unfortunately, there are two major problems
-with configuring retries this way.
-
-### Choosing a maximum number of retry attempts is a guessing game
-
-You need to pick a number that’s high enough to make a difference; allowing
-more than one retry attempt is usually prudent and, if your service is less
-reliable, you’ll probably want to allow several retry attempts. On the other
-hand, allowing too many retry attempts can generate a lot of extra requests and
-extra load on the system. Performing a lot of retries can also seriously
-increase the latency of requests that need to be retried. In practice, you
-usually pick a maximum retry attempts number out of a hat (3?) and then tweak
-it through trial and error until the system behaves roughly how you want it to.
-
-### Systems configured this way are vulnerable to retry storms
-
-A [retry storm](https://twitter.github.io/finagle/guide/Glossary.html)
-begins when one service starts (for any reason) to experience a larger than
-normal failure rate. This causes its clients to retry those failed requests.
-The extra load from the retries causes the service to slow down further and
-fail more requests, triggering more retries. If each client is configured to
-retry up to 3 times, this can quadruple the number of requests being sent! To
-make matters even worse, if any of the clients’ clients are configured with
-retries, the number of retries compounds multiplicatively and can turn a small
-number of errors into a self-inflicted denial of service attack.
-
-## Retry Budgets to the Rescue
-
-To avoid the problems of retry storms and arbitrary numbers of retry attempts,
-retries are configured using retry budgets. Rather than specifying a fixed
-maximum number of retry attempts per request, Linkerd keeps track of the ratio
-between regular requests and retries and keeps this number below a configurable
-limit. For example, you may specify that you want retries to add at most 20%
-more requests. Linkerd will then retry as much as it can while maintaining that
-ratio.
-
-Configuring retries is always a trade-off between improving success rate and
-not adding too much extra load to the system. Retry budgets make that trade-off
-explicit by letting you specify exactly how much extra load your system is
-willing to accept from retries.
-
-[ServiceProfile]: ../service-profiles/
 [HTTPRoute]: ../httproute/
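
One plausible shape for the new configuration model - the annotation names
below are assumptions, and the configuration guides linked above are
authoritative - is annotating a Service directly:

```bash
# Hypothetical annotation names; consult the retries/timeouts guides.
kubectl annotate service my-svc \
  retry.linkerd.io/http=5xx \
  retry.linkerd.io/limit=2 \
  timeout.linkerd.io/request=5s
```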