diff --git a/linkerd.io/content/2-edge/checks/index.html b/linkerd.io/content/2-edge/checks/index.html index 1c52b210dd..e577f3b926 100644 --- a/linkerd.io/content/2-edge/checks/index.html +++ b/linkerd.io/content/2-edge/checks/index.html @@ -1,18 +1,21 @@ - - - - - Linkerd Check Redirection - - - If you are not redirected automatically, follow this - link. - + + Linkerd Check Redirection + + + + If you are not redirected automatically, follow this + link. + + diff --git a/linkerd.io/content/2-edge/common-errors/_index.md b/linkerd.io/content/2-edge/common-errors/_index.md new file mode 100644 index 0000000000..d635b5ef95 --- /dev/null +++ b/linkerd.io/content/2-edge/common-errors/_index.md @@ -0,0 +1,21 @@ ++++ +title = "Common Errors" +weight = 10 +[sitemap] + priority = 1.0 ++++ + +Linkerd is generally robust, but things can always go wrong! You'll find +information here about the most common things that cause people trouble. + +## When in Doubt, Start With `linkerd check` + +Whenever you see anything that looks unusual about your mesh, **always** start +with `linkerd check`. It will check a long series of things that have caused +trouble for others and make sure that your configuration is sane, and it will +point you to help for any problems it finds. It's hard to overstate how useful +this command is. + +## Common Errors + +{{% sectiontoc "common-errors" %}} diff --git a/linkerd.io/content/2-edge/common-errors/failfast.md b/linkerd.io/content/2-edge/common-errors/failfast.md new file mode 100644 index 0000000000..5cd78c354e --- /dev/null +++ b/linkerd.io/content/2-edge/common-errors/failfast.md @@ -0,0 +1,18 @@ ++++ +title = "Failfast" +description = "Failfast means that no endpoints are available." ++++ + +If Linkerd reports that a given service is in the _failfast_ state, it +means that the proxy has determined that there are no available endpoints +for that service. In this situation there's no point in the proxy trying +to actually make a connection to the service - it already knows that it +can't talk to it - so it reports that the service is in failfast and +immediately returns an error from the proxy. + +The error will be either a 503 or a 504; see below for more information, +but if you already know that the service is in failfast because you saw +it in the logs, that's the important part. + +To get out of failfast, some endpoints for the service have to +become available. diff --git a/linkerd.io/content/2-edge/common-errors/http-502.md b/linkerd.io/content/2-edge/common-errors/http-502.md new file mode 100644 index 0000000000..7205d049a1 --- /dev/null +++ b/linkerd.io/content/2-edge/common-errors/http-502.md @@ -0,0 +1,11 @@ ++++ +title = "HTTP 502 Errors" +description = "HTTP 502 means connection errors between proxies." ++++ + +The Linkerd proxy will return a 502 error for connection errors between +proxies. Unfortunately it's fairly common to see an uptick in 502s when +first meshing a workload that hasn't previously been used with a mesh, +because the mesh surfaces errors that were previously invisible! + +There's actually a whole page on [debugging 502s](../../tasks/debugging-502s/). diff --git a/linkerd.io/content/2-edge/common-errors/http-503-504.md b/linkerd.io/content/2-edge/common-errors/http-503-504.md new file mode 100644 index 0000000000..a8777413af --- /dev/null +++ b/linkerd.io/content/2-edge/common-errors/http-503-504.md @@ -0,0 +1,27 @@ ++++ +title = "HTTP 503 and 504 Errors" +description = "HTTP 503 and 504 mean overloaded workloads." 
++++ + +503s and 504s show up when a Linkerd proxy is trying to make so many +requests to a workload that it gets overwhelmed. + +When the workload next to a proxy makes a request, the proxy adds it +to an internal dispatch queue. When things are going smoothly, the +request is pulled from the queue and dispatched almost immediately. +If the queue gets too long, though (which can generally happen only +if the called service is slow to respond), the proxy will go into +_load-shedding_, where any new request gets an immediate 503. The +proxy can only get _out_ of load-shedding when the queue shrinks. + +Failfast also plays a role here: if the proxy puts a service into +failfast while there are requests in the dispatch queue, all the +requests in the dispatch queue get an immediate 504 before the +proxy goes into load-shedding. + +To get out of failfast, some endpoints for the service have to +become available. + +To get out of load-shedding, the dispatch queue has to start +emptying, which implies that the service has to get more capacity +to process requests or that the incoming request rate has to drop. diff --git a/linkerd.io/content/2-edge/common-errors/protocol-detection.md b/linkerd.io/content/2-edge/common-errors/protocol-detection.md new file mode 100644 index 0000000000..515b065515 --- /dev/null +++ b/linkerd.io/content/2-edge/common-errors/protocol-detection.md @@ -0,0 +1,35 @@ ++++ +title = "Protocol Detection Errors" +description = "Protocol detection errors indicate that Linkerd doesn't understand the protocol in use." ++++ + +Linkerd is capable of proxying all TCP traffic, including TLS connections, +WebSockets, and HTTP tunneling. In most cases where the client speaks first +when a new connection is made, Linkerd can detect the protocol in use, +allowing it to perform per-request routing and metrics. + +If your proxy logs contain messages like `protocol detection timed out after +10s`, or you're experiencing 10-second delays when establishing connections, +you're probably running a situation where Linkerd cannot detect the protocol. +This is most common for protocols where the server speaks first, and the +client is waiting for information from the server. It may also occur with +non-HTTP protocols for which Linkerd doesn't yet understand the wire format of +a request. + +You'll need to understand exactly what the situation is to fix this: + +- A server-speaks-first protocol will probably need to be configured as a + `skip` or `opaque` port, as described in the [protocol detection + documentation](../../features/protocol-detection/#configuring-protocol-detection). + +- If you're seeing transient protocol detection timeouts, this is more likely + to indicate a misbehaving workload. + +- If you know the protocol is client-speaks-first but you're getting + consistent protocol detection timeouts, you'll probably need to fall back on + a `skip` or `opaque` port. + +Note that marking ports as `skip` or `opaque` has ramifications beyond +protocol detection timeouts; see the [protocol detection +documentation](../../features/protocol-detection/#configuring-protocol-detection) +for more information. diff --git a/linkerd.io/content/2-edge/features/cni.md b/linkerd.io/content/2-edge/features/cni.md index 999e5443fb..0d314e1de9 100644 --- a/linkerd.io/content/2-edge/features/cni.md +++ b/linkerd.io/content/2-edge/features/cni.md @@ -25,6 +25,13 @@ plugin, using _CNI chaining_. It handles only the Linkerd-specific configuration and does not replace the need for a CNI plugin. 
{{< /note >}} +{{< note >}} +If you're installing Linkerd's CNI plugin on top of Cilium, make sure to install +the latter with the option `cni.exclusive=false`, so Cilium doesn't take +ownership over the CNI configurations directory, and allows other plugins to +deploy their configurations there. +{{< /note >}} + ## Installation Usage of the Linkerd CNI plugin requires that the `linkerd-cni` DaemonSet be diff --git a/linkerd.io/content/2-edge/features/ha.md b/linkerd.io/content/2-edge/features/ha.md index 20fa62f4f3..5cb9dd4116 100644 --- a/linkerd.io/content/2-edge/features/ha.md +++ b/linkerd.io/content/2-edge/features/ha.md @@ -79,26 +79,6 @@ See the Kubernetes for more information on the admission webhook failure policy. {{< /note >}} -## Exclude the kube-system namespace - -Per recommendation from the Kubernetes -[documentation](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#avoiding-operating-on-the-kube-system-namespace), -the proxy injector should be disabled for the `kube-system` namespace. - -This can be done by labeling the `kube-system` namespace with the following -label: - -```bash -kubectl label namespace kube-system config.linkerd.io/admission-webhooks=disabled -``` - -The Kubernetes API server will not call the proxy injector during the admission -phase of workloads in namespace with this label. - -If your Kubernetes cluster have built-in reconcilers that would revert any changes -made to the `kube-system` namespace, you should loosen the proxy injector -failure policy following these [instructions](#proxy-injector-failure-policy). - ## Pod anti-affinity rules All critical control plane components are deployed with pod anti-affinity rules diff --git a/linkerd.io/content/2-edge/features/httproute.md b/linkerd.io/content/2-edge/features/httproute.md index 37aa350d2c..6dd00a4b40 100644 --- a/linkerd.io/content/2-edge/features/httproute.md +++ b/linkerd.io/content/2-edge/features/httproute.md @@ -24,6 +24,13 @@ documentation](../../reference/httproute/#linkerd-and-gateway-api-httproutes) for details. {{< /note >}} +If the Gateway API CRDs already exist in your cluster, then Linkerd must be +installed with the `--set enableHttpRoutes=false` flag during the +`linkerd install --crds` step or with the `enableHttpRoutes=false` Helm value +when installing the `linkerd-crds` Helm chart. This avoid conflicts by +instructing Linkerd to not install the Gateway API CRDs and instead rely on the +Gateway CRDs which already exist. + An HTTPRoute is a Kubernetes resource which attaches to a parent resource, such as a [Service]. The HTTPRoute defines a set of rules which match HTTP requests to that resource, based on parameters such as the request's path, method, and diff --git a/linkerd.io/content/2-edge/features/ipv6.md b/linkerd.io/content/2-edge/features/ipv6.md new file mode 100644 index 0000000000..ed70051ff3 --- /dev/null +++ b/linkerd.io/content/2-edge/features/ipv6.md @@ -0,0 +1,14 @@ ++++ +title = "IPv6 Support" +description = "Linkerd is compatible with both IPv6-only and dual-stack clusters." ++++ + +As of version 2.16 (and edge-24.8.2) Linkerd fully supports Kubernetes clusters +configured for IPv6-only or dual-stack networking. + +This is disabled by default; to enable just set `proxy.disableIPv6=false` when +installing the control plane and, if you use it, the linkerd-cni plugin. 
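+
+For example, a minimal sketch of enabling this at install time with the
+Linkerd CLI (the equivalent Helm value applies if you install the control
+plane with Helm):
+
+```bash
+# Render the control plane with IPv6 support enabled and apply it.
+linkerd install --set proxy.disableIPv6=false | kubectl apply -f -
+```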
+ +Enabling IPv6 support does not generally change how Linkerd operates, except in +one way: when enabled on a dual-stack cluster, Linkerd will only use the IPv6 +endpoints of destinations and will not use the IPv4 endpoints. diff --git a/linkerd.io/content/2-edge/features/multicluster.md b/linkerd.io/content/2-edge/features/multicluster.md index fedd3a9423..2714d8983a 100644 --- a/linkerd.io/content/2-edge/features/multicluster.md +++ b/linkerd.io/content/2-edge/features/multicluster.md @@ -15,7 +15,7 @@ topology. This multi-cluster capability is designed to provide: 3. **Support for any type of network.** Linkerd does not require any specific network topology between clusters, and can function both with hierarchical networks as well as when clusters [share the same flat - network](#multi-cluster-for-flat-networks). + network](#flat-networks). 4. **A unified model alongside in-cluster communication.** The same observability, reliability, and security features that Linkerd provides for in-cluster communication extend to cross-cluster communication. diff --git a/linkerd.io/content/2-edge/features/non-kubernetes-workloads.md b/linkerd.io/content/2-edge/features/non-kubernetes-workloads.md new file mode 100644 index 0000000000..315edb880b --- /dev/null +++ b/linkerd.io/content/2-edge/features/non-kubernetes-workloads.md @@ -0,0 +1,16 @@ +--- +title: Non-Kubernetes workloads (mesh expansion) +--- + +Linkerd features *mesh expansion*, or the ability to add non-Kubernetes +workloads to your service mesh by deploying the Linkerd proxy to the remote +machine and connecting it back to the Linkerd control plane within the mesh. +This allows you to use Linkerd to establish communication to and from the +workload that is secure, reliable, and observable, just like communication to +and from your Kubernetes workloads. + +Related content: + +* [Guide: Adding non-Kubernetes workloads to your mesh]({{< relref + "../tasks/adding-non-kubernetes-workloads" >}}) +* [ExternalWorkload Reference]({{< relref "../reference/external-workload" >}}) diff --git a/linkerd.io/content/2-edge/features/proxy-injection.md b/linkerd.io/content/2-edge/features/proxy-injection.md index f15bfcb125..954f2104d2 100644 --- a/linkerd.io/content/2-edge/features/proxy-injection.md +++ b/linkerd.io/content/2-edge/features/proxy-injection.md @@ -34,7 +34,7 @@ For each pod, two containers are injected: Container](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) that configures `iptables` to automatically forward all incoming and outgoing TCP traffic through the proxy. (Note that this container is not - present if the [Linkerd CNI Plugin](../cni/) has been enabled.) + injected if the [Linkerd CNI Plugin](../cni/) has been enabled.) 1. `linkerd-proxy`, the Linkerd data plane proxy itself. Note that simply adding the annotation to a resource with pre-existing pods @@ -43,6 +43,16 @@ will not automatically inject those pods. You will need to update the pods because Kubernetes does not call the webhook until it needs to update the underlying resources. +## Exclusions + +At install time, Kubernetes is configured to avoid calling Linkerd's proxy +injector for resources in the `kube-system` and `cert-manager` namespaces. This +is to prevent injection on components that are themselves required for Linkerd +to function. + +The injector will not run on components in these namespaces, regardless of any +`linkerd.io/inject` annotations. 
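+
+For example, if you want the same behavior for another namespace, one approach
+(a sketch, assuming the injector's webhook still skips namespaces carrying the
+`config.linkerd.io/admission-webhooks=disabled` label) is to label that
+namespace directly:
+
+```bash
+# Hypothetical namespace name; substitute the namespace you want to exclude.
+kubectl label namespace my-legacy-namespace config.linkerd.io/admission-webhooks=disabled
+```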
+ ## Overriding injection Automatic injection can be disabled for a pod or deployment for which it would diff --git a/linkerd.io/content/2-edge/features/retries-and-timeouts.md b/linkerd.io/content/2-edge/features/retries-and-timeouts.md index 6978f8147c..5786e60883 100644 --- a/linkerd.io/content/2-edge/features/retries-and-timeouts.md +++ b/linkerd.io/content/2-edge/features/retries-and-timeouts.md @@ -4,26 +4,16 @@ description = "Linkerd can perform service-specific retries and timeouts." weight = 3 +++ -Automatic retries are one the most powerful and useful mechanisms a service mesh -has for gracefully handling partial or transient application failures. If -implemented incorrectly retries can amplify small errors into system wide -outages. For that reason, we made sure they were implemented in a way that would -increase the reliability of the system while limiting the risk. +Timeouts and automatic retries are two of the most powerful and useful +mechanisms a service mesh has for gracefully handling partial or transient +application failures. -Timeouts work hand in hand with retries. Once requests are retried a certain -number of times, it becomes important to limit the total amount of time a client -waits before giving up entirely. Imagine a number of retries forcing a client -to wait for 10 seconds. - -Timeouts can be configured using either the [HTTPRoute] or [ServiceProfile] -resources. Currently, retries can only be configured using [ServiceProfile]s, -but support for configuring retries using [HTTPRoutes] will be added in a future -release. Creating these policy resources will cause the Linkerd proxy to perform -the appropriate retries or timeouts when calling that service. Retries and -timeouts are always performed on the *outbound* (client) side. +Timeouts and retries can be configured using [HTTPRoute], GRPCRoute, or Service +resources. Retries and timeouts are always performed on the *outbound* (client) +side. {{< note >}} -If working with headless services, service profiles cannot be retrieved. Linkerd +If working with headless services, outbound policy cannot be retrieved. Linkerd reads service discovery information based off the target IP address, and if that happens to be a pod IP address then it cannot tell which service the pod belongs to. @@ -34,49 +24,4 @@ These can be setup by following the guides: - [Configuring Retries](../../tasks/configuring-retries/) - [Configuring Timeouts](../../tasks/configuring-timeouts/) -## How Retries Can Go Wrong - -Traditionally, when performing retries, you must specify a maximum number of -retry attempts before giving up. Unfortunately, there are two major problems -with configuring retries this way. - -### Choosing a maximum number of retry attempts is a guessing game - -You need to pick a number that’s high enough to make a difference; allowing -more than one retry attempt is usually prudent and, if your service is less -reliable, you’ll probably want to allow several retry attempts. On the other -hand, allowing too many retry attempts can generate a lot of extra requests and -extra load on the system. Performing a lot of retries can also seriously -increase the latency of requests that need to be retried. In practice, you -usually pick a maximum retry attempts number out of a hat (3?) and then tweak -it through trial and error until the system behaves roughly how you want it to. 
- -### Systems configured this way are vulnerable to retry storms - -A [retry storm](https://twitter.github.io/finagle/guide/Glossary.html) -begins when one service starts (for any reason) to experience a larger than -normal failure rate. This causes its clients to retry those failed requests. -The extra load from the retries causes the service to slow down further and -fail more requests, triggering more retries. If each client is configured to -retry up to 3 times, this can quadruple the number of requests being sent! To -make matters even worse, if any of the clients’ clients are configured with -retries, the number of retries compounds multiplicatively and can turn a small -number of errors into a self-inflicted denial of service attack. - -## Retry Budgets to the Rescue - -To avoid the problems of retry storms and arbitrary numbers of retry attempts, -retries are configured using retry budgets. Rather than specifying a fixed -maximum number of retry attempts per request, Linkerd keeps track of the ratio -between regular requests and retries and keeps this number below a configurable -limit. For example, you may specify that you want retries to add at most 20% -more requests. Linkerd will then retry as much as it can while maintaining that -ratio. - -Configuring retries is always a trade-off between improving success rate and -not adding too much extra load to the system. Retry budgets make that trade-off -explicit by letting you specify exactly how much extra load your system is -willing to accept from retries. - -[ServiceProfile]: ../service-profiles/ [HTTPRoute]: ../httproute/ diff --git a/linkerd.io/content/2-edge/features/server-policy.md b/linkerd.io/content/2-edge/features/server-policy.md index 256f0dea97..2843f8940f 100644 --- a/linkerd.io/content/2-edge/features/server-policy.md +++ b/linkerd.io/content/2-edge/features/server-policy.md @@ -46,9 +46,11 @@ policy at that point in the hierarchy. Valid default policies include: - `all-unauthenticated`: allow all requests. This is the default. - `all-authenticated`: allow requests from meshed clients only. -- `cluster-authenticated`: allow requests form meshed clients in the same +- `cluster-authenticated`: allow requests from meshed clients in the same cluster. - `deny`: deny all requests. +- `audit`: Same as `all-unauthenticated` but requests get flagged in logs and + metrics. As well as several other default policies—see the [Policy reference](../../reference/authorization-policy/) for more. @@ -128,6 +130,36 @@ be denied at the TCP level, i.e. by refusing the connection. Note that dynamically changing the policy to deny existing connections may result in an abrupt termination of those connections. +## Audit mode + +A [`Server`]'s default policy is defined in its `accessPolicy` field, which +defaults to `deny`. That means that, by default, traffic that doesn't conform to +the rules associated to that Server is denied (the same applies to `Servers` +that don't have associated rules yet). This can inadvertently prevent traffic if +you apply rules that don't account for all the possible sources/routes for your +services. + +This is why we recommend that when first setting authorization policies, you +explicitly set `accessPolicy:audit` for complex-enough services. In this mode, +if a request doesn't abide to the policy rules, it won't get blocked, but it +will generate a log entry in the proxy at the INFO level with the tag +`authz.name=audit` along with other useful information. 
Likewise, the proxy will +add entries to metrics like `request_total` with the label `authz_name=audit`. +So when you're in the process of fine-tuning a new authorization policy, you can +filter by those tags/labels in your observability stack to keep an eye on +requests which weren't caught by the policy. + +### Audit mode for default policies + +Audit mode is also supported at cluster, namespace, or workload level. To set +the whole cluster to audit mode, set `proxy.defaultInboundPolicy=audit` when +installing Linkerd; for a namespace or a workload, use the annotation +`config.linkerd.io/default-inbound-policy:audit`. For example, if you had +`config.linkerd.io/default-inbound-policy:all_authenticated` for a namespace and +no `Servers` declared, all unmeshed traffic would be denied. By using +`config.linkerd.io/default-inbound-policy:audit` instead, unmeshed traffic would +be allowed but it would be logged and surfaced in metrics as detailed above. + ## Learning more - [Authorization policy reference](../../reference/authorization-policy/) diff --git a/linkerd.io/content/2-edge/features/service-profiles.md b/linkerd.io/content/2-edge/features/service-profiles.md index 00075e8da8..e27a540d74 100644 --- a/linkerd.io/content/2-edge/features/service-profiles.md +++ b/linkerd.io/content/2-edge/features/service-profiles.md @@ -6,6 +6,12 @@ aliases = [ ] +++ +{{< note >}} +[HTTPRoutes](../httproute/) are the recommended method for getting per-route +metrics, specifying timeouts, and specifying retries. Service profiles continue +to be supported for backwards compatibility. +{{< /note >}} + A service profile is a custom Kubernetes resource ([CRD][crd]) that can provide Linkerd additional information about a service. In particular, it allows you to define a list of routes for the service. Each route uses a regular expression @@ -24,10 +30,6 @@ To get started with service profiles you can: - Look into [setting up service profiles](../../tasks/setting-up-service-profiles/) for your own services. -- Understand what is required to see - [per-route metrics](../../tasks/getting-per-route-metrics/). -- [Configure retries](../../tasks/configuring-retries/) on your own services. -- [Configure timeouts](../../tasks/configuring-timeouts/) on your own services. - Glance at the [reference](../../reference/service-profiles/) documentation. [crd]: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ diff --git a/linkerd.io/content/2-edge/features/telemetry.md b/linkerd.io/content/2-edge/features/telemetry.md index 0cf9059287..d5f8a9fef8 100644 --- a/linkerd.io/content/2-edge/features/telemetry.md +++ b/linkerd.io/content/2-edge/features/telemetry.md @@ -56,7 +56,7 @@ problems with the service). ### Traffic (Requests Per Second) This gives an overview of how much demand is placed on the service/route. As -with success rates, `linkerd viz routes --o wide` splits this metric into +with success rates, `linkerd viz routes -o wide` splits this metric into EFFECTIVE_RPS and ACTUAL_RPS, corresponding to rates after and before retries respectively. diff --git a/linkerd.io/content/2-edge/getting-started/_index.md b/linkerd.io/content/2-edge/getting-started/_index.md index 861517e7ad..3adedd5f40 100644 --- a/linkerd.io/content/2-edge/getting-started/_index.md +++ b/linkerd.io/content/2-edge/getting-started/_index.md @@ -21,13 +21,9 @@ Linkerd can do. This guide is designed to walk you through the basics of Linkerd. 
First, you'll install the *CLI* (command-line interface) onto your local machine. Using this CLI, you'll then install the *control plane* onto your Kubernetes cluster. -Finally, you'll "mesh" a application by adding Linkerd's *data plane* to it. +Finally, you'll "mesh" an application by adding Linkerd's *data plane* to it. -{{< note >}} -This page contains quick start instructions intended for non-production -installations. For production-oriented configurations, we suggest reviewing -resources in [Going to Production](/going-to-production/). -{{< /note >}} +{{< releases >}} ## Step 0: Setup @@ -52,9 +48,10 @@ Now that we have our cluster, we'll install the Linkerd CLI and use it validate that your cluster is capable of hosting Linkerd. {{< note >}} -If you're using a GKE "private cluster" or Calico CNI, there are some [extra steps -required](../reference/cluster-configuration/#private-clusters) before you can -proceed to the next step. +If you're using a GKE "private cluster", or if you're using Cilium as a CNI, +there may be some [cluster-specific +configuration](../reference/cluster-configuration/) before you can proceed to +the next step. {{< /note >}} ## Step 1: Install the CLI @@ -66,14 +63,18 @@ your Linkerd deployment. To install the CLI manually, run: ```bash -curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh +curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh ``` -Be sure to follow the instructions to add it to your path. +Be sure to follow the instructions to add it to your path: + +```bash +export PATH=$HOME/.linkerd2/bin:$PATH +``` -(Alternatively, if you use [Homebrew](https://brew.sh), you can install the CLI -with `brew install linkerd`. You can also download the CLI directly via the -[Linkerd releases page](https://github.com/linkerd/linkerd2/releases/).) +This will install the CLI for the latest _edge release_ of Linkerd. (For more +information about what edge releases are, see our [Releases and +Versions](../../releases/) page.) Once installed, verify the CLI is running correctly with: @@ -85,6 +86,10 @@ You should see the CLI version, and also `Server version: unavailable`. This is because you haven't installed the control plane on your cluster. Don't worry—we'll fix that soon enough. +Make sure that your Linkerd version and Kubernetes version are compatible by +checking Linkerd's [supported Kubernetes +versions](../reference/k8s-versions/). + ## Step 2: Validate your Kubernetes cluster Kubernetes clusters can be configured in many different ways. Before we can diff --git a/linkerd.io/content/2-edge/overview/_index.md b/linkerd.io/content/2-edge/overview/_index.md index 73265cb3ea..b8d3b6e243 100644 --- a/linkerd.io/content/2-edge/overview/_index.md +++ b/linkerd.io/content/2-edge/overview/_index.md @@ -52,13 +52,10 @@ Linkerd2-proxy](/2020/07/23/under-the-hood-of-linkerds-state-of-the-art-rust-pro post, [Why Linkerd doesn't use Envoy](/2020/12/03/why-linkerd-doesnt-use-envoy/).) -## Versions and channels +## Getting Linkerd -Linkerd is currently published in several tracks: - -* [Linkerd 2.x stable releases](/edge/) -* [Linkerd 2.x edge releases](/edge/) -* [Linkerd 1.x.](/1/overview/) +Linkerd is available in a variety of packages and channels. See the [Linkerd +Releases](/releases/) page for details. 
## Next steps diff --git a/linkerd.io/content/2-edge/reference/architecture.md b/linkerd.io/content/2-edge/reference/architecture.md index 01c5c0e997..2bb53414bd 100644 --- a/linkerd.io/content/2-edge/reference/architecture.md +++ b/linkerd.io/content/2-edge/reference/architecture.md @@ -96,6 +96,19 @@ You can read more about these micro-proxies here: * [Under the hood of Linkerd's state-of-the-art Rust proxy, Linkerd2-proxy](/2020/07/23/under-the-hood-of-linkerds-state-of-the-art-rust-proxy-linkerd2-proxy/) +### Meshed Conncections + +When one pod establishes a TCP connection to another pod and both of those pods +are injected with the Linkerd proxy, we say that the connection is *meshed*. +The proxy in the pod that initiated the connection is called the *outbound* +proxy and the proxy in the pod that accepted the connection is called the +*inbound* proxy. + +The *outbound* proxy is responsible for service discovery, load balancing, +circuit breakers, retries, and timeouts. The *inbound* proxy is responsible for +enforcing authorization policy. Both *inbound* and *outbound* proxies report +traffic metrics about the traffic they send and receive. + ### Linkerd init container The `linkerd-init` container is added to each meshed pod as a Kubernetes [init diff --git a/linkerd.io/content/2-edge/reference/authorization-policy.md b/linkerd.io/content/2-edge/reference/authorization-policy.md index 1089b8f463..46ba134c93 100644 --- a/linkerd.io/content/2-edge/reference/authorization-policy.md +++ b/linkerd.io/content/2-edge/reference/authorization-policy.md @@ -27,6 +27,8 @@ specify the cluster-wide default policy. This field can be one of the following: - `cluster-unauthenticated`: allow traffic from both meshed and non-meshed clients in the same cluster. - `deny`: all traffic are denied. +- `audit`: Same as `all-unauthenticated` but requests get flagged in logs and + metrics. This cluster-wide default can be overridden for specific resources by setting the annotation `config.linkerd.io/default-inbound-policy` on either a pod spec @@ -62,9 +64,10 @@ overlapping `Server`s from being created. {{< note >}} When a Server resource is present, all traffic to the port on its pods will be -denied (regardless of the default policy) unless explicitly authorized. Thus, -Servers are typically paired with e.g. an AuthorizationPolicy that references -the Server, or that reference an HTTPRoute that in turn references the Server. +denied unless explicitly authorized or audit mode is enabled (with +`accessPolicy:audit`). Thus, Servers are typically paired with e.g. an +AuthorizationPolicy that references the Server, or that reference an HTTPRoute +that in turn references the Server. {{< /note >}} ### Server Spec @@ -74,11 +77,21 @@ A `Server` spec may contain the following top level fields: {{< table >}} | field| value | |------|-------| +| `accessPolicy`| [accessPolicy](#accessPolicy) declares the policy applied to traffic not matching any associated authorization policies (defaults to `deny`). | | `podSelector`| A [podSelector](#podselector) selects pods in the same namespace. | | `port`| A port name or number. Only ports in a pod spec's `ports` are considered. | | `proxyProtocol`| Configures protocol discovery for inbound connections. Supersedes the `config.linkerd.io/opaque-ports` annotation. Must be one of `unknown`,`HTTP/1`,`HTTP/2`,`gRPC`,`opaque`,`TLS`. Defaults to `unknown` if not set. 
| {{< /table >}} +#### accessPolicy + +Traffic that doesn't conform to the authorization policies associated to the +Server are denied by default. You can alter that behavior by overriding the +`accessPolicy` field, which accepts the same values as the [default +policies](#default-policies). Of particular interest is the `audit` value, which +enables [audit mode](../../features/server-policy/#audit-mode), that you can use +to test policies before enforcing them. + #### podSelector This is the [same labelSelector field in Kubernetes](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/label-selector/#LabelSelector). diff --git a/linkerd.io/content/2-edge/reference/circuit-breaking.md b/linkerd.io/content/2-edge/reference/circuit-breaking.md index 372c4d53e4..062f9b01d8 100644 --- a/linkerd.io/content/2-edge/reference/circuit-breaking.md +++ b/linkerd.io/content/2-edge/reference/circuit-breaking.md @@ -21,7 +21,7 @@ in a [load balancer](../../features/load-balancing/) (i.e., each Pod in a given Service), and failures are tracked at the level of HTTP response status codes. Circuit breaking is a client-side behavior, and is therefore performed by the -outbound side of the Linkerd proxy.[^1] Outbound proxies implement circuit +[outbound] side of the Linkerd proxy.[^1] Outbound proxies implement circuit breaking in the load balancer, by marking failing endpoints as _unavailable_. When an endpoint is unavailable, the load balancer will not select it when determining where to send a given request. This means that if only some @@ -155,3 +155,4 @@ configure parameters for the consecutive-failures failure accrual policy: [5xx server error]: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status#server_error_responses [exp-backoff]: https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ +[outbound]: ../architecture/#meshed-conncections diff --git a/linkerd.io/content/2-edge/reference/cluster-configuration.md b/linkerd.io/content/2-edge/reference/cluster-configuration.md index b7d7ac6364..99172cd797 100644 --- a/linkerd.io/content/2-edge/reference/cluster-configuration.md +++ b/linkerd.io/content/2-edge/reference/cluster-configuration.md @@ -78,6 +78,8 @@ gcloud compute firewall-rules describe gke-to-linkerd-control-plane ## Cilium +### Turn Off Socket-Level Load Balancing + Cilium can be configured to replace kube-proxy functionality through eBPF. When running in kube-proxy replacement mode, connections to a `ClusterIP` service will be established directly to the service's backend at the socket level (i.e. @@ -97,6 +99,15 @@ pods](https://docs.cilium.io/en/v1.13/network/istio/#setup-cilium) through the CLI option `--config bpf-lb-sock-hostns-only=true`, or through the Helm value `socketLB.hostNamespaceOnly=true`. +### Disable Exclusive Mode + +If you're using Cilium as your CNI and then want to install +[linkerd-cni](../../features/cni/) on top of it, make sure you install Cilium +with the option `cni.exclusive=false`. This avoids Cilium taking ownership over +the CNI configurations directory. Other CNI plugins like linkerd-cni install +themselves and operate in chain mode with the other deployed plugins by +deploying their configuration into this directory. 
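+
+As a sketch, assuming Cilium is deployed from its standard Helm chart, the
+setting can be passed at install or upgrade time:
+
+```bash
+# Chart name, release name, and namespace are assumptions; adjust to your setup.
+helm upgrade --install cilium cilium/cilium \
+  --namespace kube-system \
+  --set cni.exclusive=false
+```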
+ ## Lifecycle Hook Timeout Linkerd uses a `postStart` lifecycle hook for all control plane components, and diff --git a/linkerd.io/content/2-edge/reference/external-workload.md b/linkerd.io/content/2-edge/reference/external-workload.md new file mode 100644 index 0000000000..21f70bf986 --- /dev/null +++ b/linkerd.io/content/2-edge/reference/external-workload.md @@ -0,0 +1,105 @@ +--- +title: ExternalWorkload +--- + +Linkerd's [mesh expansion]({{< relref "../features/non-kubernetes-workloads" +>}}) functionality allows you to join workloads outside of Kubernetes into the +mesh. + +At its core, this behavior is controlled by an `ExternalWorkload` resource, +which is used by Linkerd to describe a workload that lives outside of Kubernetes +for discovery and policy. This resource contains information such as the +workload's identity, the concrete IP address as well as ports that this workload +accepts connections on. + +## ExternalWorkloads + +An ExternalWorkload is a namespace resource that defines a set of ports and an +IP address that is reachable from within the mesh. Linkerd uses that information +and translates it into `EndpointSlice`s that are then attached to `Service` objects. + +### Spec + +- `meshTLS` (required) - specified the identity information that Linkerd + requires to establish encrypted connections to this workload +- `workloadIPs` (required, at most 1) - an IP address that this workload is + reachable on +- `ports` - a list of port definitions that the workload exposes + +### MeshTLS + +- `identity` (required) - the TLS identity of the workload, proxies require this + value to establish TLS connections with the workload +- `serverName` (required) - this value is what the workload's proxy expects to + see in the `ClientHello` SNI TLS extension when other peers attempt to + initiate a TLS connection + +### Port + +- `name` - must be unique within the ports set. Each named port can be referred + to by services. +- `port` (required) - a port number that the workload is listening on +- `protocol` - protocol exposed by the port + +### Status + +- `conditions` - a list of condition objects + +### Condition + +- `lastProbeTime` - the last time the healthcheck endpoint was probed +- `lastTransitionTime` - the last time the condition transitioned from one + status to another +- `status` - status of the condition (one of True, False, Unknown) +- `type` - type of the condition (Ready is used for indicating discoverability) +- `reason` - contains a programmatic identifier indicating the reason for the + condition's last transition +- `message` - message is a human-readable message indicating details about the transition. + +## Example + +Below is an example of an `ExternalWorkload` resource that specifies a number of +ports and is selected by a service. 
+ +```yaml +apiVersion: workload.linkerd.io/v1beta1 +kind: ExternalWorkload +metadata: + name: external-workload + namespace: mixed-env + labels: + location: vm + workload_name: external-workload +spec: + meshTLS: + identity: "spiffe://root.linkerd.cluster.local/external-workload" + serverName: "external-workload.cluster.local" + workloadIPs: + - ip: 193.1.4.11 + ports: + - port: 80 + name: http + - port: 9980 + name: admin +status: + conditions: + - type: Ready + status: "True" +--- +apiVersion: v1 +kind: Service +metadata: + name: external-workload + namespace: mixed-env +spec: + type: ClusterIP + selector: + workload_name: external-workload + ports: + - port: 80 + protocol: TCP + name: http + - port: 9980 + protocol: TCP + name: admin +``` diff --git a/linkerd.io/content/2-edge/reference/helm-chart-version-matrix.md b/linkerd.io/content/2-edge/reference/helm-chart-version-matrix.md new file mode 100644 index 0000000000..5785c30ab3 --- /dev/null +++ b/linkerd.io/content/2-edge/reference/helm-chart-version-matrix.md @@ -0,0 +1,13 @@ ++++ +title = "Helm Chart Version Matrix" ++++ + +The following version matrices include only the latest versions of the stable +releases along with corresponding app and Helm versions for Linkerd and +extensions. Use these to guide you to the right Helm chart version or to +automate workflows you might have. + +* [YAML matrix](/releases/release_matrix.yaml) +* [JSON matrix](/releases/release_matrix.json) + +{{< release-data-table />}} diff --git a/linkerd.io/content/2-edge/reference/httproute.md b/linkerd.io/content/2-edge/reference/httproute.md index 21d970943b..4f699cbd89 100644 --- a/linkerd.io/content/2-edge/reference/httproute.md +++ b/linkerd.io/content/2-edge/reference/httproute.md @@ -14,8 +14,15 @@ largely the same, the `policy.linkerd.io` HTTPRoute resource is an experimental version that contains features not yet stabilized in the upstream `gateway.networking.k8s.io` HTTPRoute resource, such as [timeouts](#httproutetimeouts). Both the Linkerd and Gateway API resource -definitions may coexist within the same cluster, and both can be used to -configure policies for use with Linkerd. +definitions coexist within the same cluster, and both can be used to configure +policies for use with Linkerd. + +If the Gateway API CRDs already exist in your cluster, then Linkerd must be +installed with the `--set enableHttpRoutes=false` flag during the +`linkerd install --crds` step or with the `enableHttpRoutes=false` Helm value +when installing the `linkerd-crds` Helm chart. This avoid conflicts by +instructing Linkerd to not install the Gateway API CRDs and instead rely on the +Gateway CRDs which already exist. This documentation describes the `policy.linkerd.io` HTTPRoute resource. For a similar description of the upstream Gateway API HTTPRoute resource, refer to the diff --git a/linkerd.io/content/2-edge/reference/k8s-versions.md b/linkerd.io/content/2-edge/reference/k8s-versions.md new file mode 100644 index 0000000000..6e6bd52262 --- /dev/null +++ b/linkerd.io/content/2-edge/reference/k8s-versions.md @@ -0,0 +1,40 @@ ++++ +title = "Supported Kubernetes Versions" +description = "Reference documentation for which Linkerd version supports which Kubernetes version" ++++ + +Linkerd supports all versions of Kubernetes that were supported at the time +that a given Linkerd version ships. For example, at the time that Linkerd 2.14 +shipped, Kubernetes versions 1.26, 1.27, and 1.28 were supported, so Linkerd +2.14 supports all of those Kubernetes versions. 
(In many cases, as you'll see +below, Linkerd versions will also support older Kubernetes versions.) + +Obviously, Linkerd 2.14 has no knowledge of what changes will come _after_ +Kubernetes 1.28. In some cases, later versions of Kubernetes end up making +changes that cause older versions of Linkerd to not work: we will update the +chart below as these situations arise. + +{{< table >}} +| Linkerd Version | Minimum Kubernetes Version | Maximum Kubernetes Version | +|-----------------|----------------------------|----------------------------| +| `2.10` | `1.16` | `1.23` | +| `2.11` | `1.17` | `1.23` | +| `2.12` | `1.21` | `1.24` | +| `2.13` | `1.21` | `1.28` | +| `2.14` | `1.21` | `1.28` | +| `2.15` | `1.22` | `1.29` | +{{< /table >}} + +Note that Linkerd will almost never change the supported Kubernetes version in +a minor release, which is why the table above only lists major versions. One +known exception: Linkerd 2.11.0 supported Kubernetes 1.16, but 2.11.1 and +later required Kubernetes 1.17 as shown in the table above. + +## Edge Releases + +{{< table >}} +| Linkerd Version | Minimum Kubernetes Version | Maximum Kubernetes Version | +|-----------------|----------------------------|----------------------------| +| `edge-22.10.1` - `edge-23.12.1` | `1.21` | `1.29` | +| `edge-23.12.2` and newer | `1.22` | `1.29` | +{{< /table >}} diff --git a/linkerd.io/content/2-edge/reference/retries.md b/linkerd.io/content/2-edge/reference/retries.md new file mode 100644 index 0000000000..9cc3d3cc6a --- /dev/null +++ b/linkerd.io/content/2-edge/reference/retries.md @@ -0,0 +1,105 @@ ++++ +title = "Retries" +description = "How Linkerd implements retries." ++++ + +Linkerd can be configured to automatically retry requests when it receives a +failed response instead of immediately returning that failure to the client. +This is valuable tool for improving success rate in the face of transient +failures. + +Retries are a client-side behavior, and are therefore performed by the +outbound side of the Linkerd proxy.[^1] If retries are configured on an +HTTPRoute or GRPCRoute with multiple backends, each retry of a request can +potentially get sent to a different backend. If a request has a body larger than +64KiB then it will not be retried. + +## Configuring Retries + +Retries are configured by a set of annotations which can be set on a Kubernetes +Service resource or on a HTTPRoute or GRPCRoute which has a Service as a parent. +Client proxies will then retry failed requests to that Service or route. If any +retry configuration annotations are present on a route resource, they override +all retry configuration annotations on the parent Service. + +{{< warning >}} +Retries configured in this way are **incompatible with ServiceProfiles**. If a +[ServiceProfile](../../features/service-profiles/) is defined for a Service, +proxies will use the ServiceProfile retry configuration and ignore any retry +annotations. +{{< /warning >}} + ++ `retry.linkerd.io/http`: A comma separated list of HTTP response codes which +should be retried. Each element of the list may be + + `xxx` to retry a single response code (for example, `"504"` -- remember, + annotation values must be strings!); + + `xxx-yyy` to retry a range of response codes (for example, `500-504`); + + `gateway-error` to retry response codes 502-504; or + + `5xx` to retry all 5XX response codes. +This annotation is not valid on GRPCRoute resources. ++ `retry.linkerd.io/grpc`: A comma seperated list of gRPC status codes which +should be retried. 
Each element of the list may be + + `cancelled` + + `deadline-exceeded` + + `internal` + + `resource-exhausted` + + `unavailable` +This annotation is not valid on HTTPRoute resources. ++ `retry.linkerd.io/limit`: The maximum number of times a request can be +retried. If unspecified, the default is `1`. ++ `retry.linkerd.io/timeout`: A retry timeout after which a request is cancelled +and retried (if the retry limit has not yet been reached). If unspecified, no +retry timeout is applied. Units must be specified in this value e.g. `5s` or +`200ms`. + +## Examples + +```yaml +kind: HTTPRoute +apiVersion: gateway.networking.k8s.io/v1beta1 +metadata: + name: schlep-default + namespace: schlep + annotations: + retry.linkerd.io/http: 5xx + retry.linkerd.io/limit: "2" + retry.linkerd.io/timeout: 300ms +spec: + parentRefs: + - name: schlep + kind: Service + group: core + port: 80 + rules: + - matches: + - path: + type: PathPrefix + value: "/" +``` + +```yaml +kind: GRPCRoute +apiVersion: gateway.networking.k8s.io/v1alpha2 +metadata: + name: schlep-default + namespace: schlep + annotations: + retry.linkerd.io/grpc: internal + retry.linkerd.io/limit: "2" + retry.linkerd.io/timeout: 400ms +spec: + parentRefs: + - name: schlep + kind: Service + group: core + port: 8080 + rules: + - matches: + - method: + type: Exact + service: schlep.Schlep + method: Get +``` + +[^1]: The part of the proxy which handles connections from within the pod to the + rest of the cluster. diff --git a/linkerd.io/content/2-edge/reference/timeouts.md b/linkerd.io/content/2-edge/reference/timeouts.md new file mode 100644 index 0000000000..d651b64ef9 --- /dev/null +++ b/linkerd.io/content/2-edge/reference/timeouts.md @@ -0,0 +1,68 @@ ++++ +title = "Timeouts" +description = "How Linkerd implements timeouts." ++++ + +Linkerd can be configured with timeouts to limit the maximum amount of time on +a request before aborting. + +Timeouts are a client-side behavior, and are therefore performed by the +outbound side of the Linkerd proxy.[^1] Note that timeouts configured in this +way are not retryable -- if these timeouts are reached, the request will not be +retried. Retryable timeouts can be configured as part of +[retry configuration](../retries/). + +## Configuring Timeouts + +Timeous are configured by a set of annotations which can be set on a Kubernetes +Service resource or on a HTTPRoute or GRPCRoute which has a Service as a parent. +Client proxies will then fail requests to that Service or route once they exceed +the timeout. If any timeout configuration annotations are present on a route +resource, they override all timeout configuration annotations on the parent +Service. + +{{< warning >}} +Timeouts configured in this way are **incompatible with ServiceProfiles**. If a +[ServiceProfile](../../features/service-profiles/) is defined for a Service, +proxies will use the ServiceProfile timeout configuration and ignore any timeout +annotations. +{{< /warning >}} + ++ `timeout.linkerd.io/request`: The maximum amount of time a full +request-response stream is in flight. ++ `timeout.linkerd.io/response`: The maximum amount of time a backend response +may be in-flight. ++ `timeout.linkerd.io/idle`: The maximum amount of time a stream may be idle, +regardless of its state. + +If the [request timeout](https://gateway-api.sigs.k8s.io/api-types/httproute/#timeouts-optional) +field is set on an HTTPRoute resource, it will be used as the +`timeout.linkerd.io/request` timeout. 
However, if both the field and the +annotation are specified, the annotation will take priority. + +## Examples + +```yaml +kind: HTTPRoute +apiVersion: gateway.networking.k8s.io/v1beta1 +metadata: + name: schlep-default + namespace: schlep + annotations: + timeout.linkerd.io/request: 2s + timeout.linkerd.io/response: 1s +spec: + parentRefs: + - name: schlep + kind: Service + group: core + port: 80 + rules: + - matches: + - path: + type: PathPrefix + value: "/" +``` + +[^1]: The part of the proxy which handles connections from within the pod to the + rest of the cluster. diff --git a/linkerd.io/content/2-edge/tasks/adding-non-kubernetes-workloads.md b/linkerd.io/content/2-edge/tasks/adding-non-kubernetes-workloads.md new file mode 100644 index 0000000000..e70e751893 --- /dev/null +++ b/linkerd.io/content/2-edge/tasks/adding-non-kubernetes-workloads.md @@ -0,0 +1,540 @@ +--- +title: Adding non-Kubernetes workloads to your mesh +--- + +In this guide, we'll walk you through an example of [mesh expansion]({{< relref +"../features/non-kubernetes-workloads" >}}): setting up and configuring an +example non-Kubernetes workload and adding it to your Linkerd mesh. + +## Overall flow + +In this guide, we'll take you through how to: + +1. Install the Linkerd proxy onto a virtual or physical machine outside the + Kubernetes cluster. +1. Configure network rules so traffic is routed through the proxy. +1. Register the external workload in the mesh. +1. Exercise traffic patterns and apply authorization policies that affect the + external workload. + +We'll be using [SPIRE](https://github.com/spiffe/spire) as our identity +mechanism to generate a workload identity. + +## Prerequisites + +You will need: + +- A functioning Linkerd installation and its trust anchor. +- A cluster that you have elevated privileges to. For local development, you can + use [kind](https://kind.sigs.k8s.io/) or [k3d](https://k3d.io/). +- A physical or virtual machine. +- `NET_CAP` privileges on the machine, so iptables rules can be modified. +- IP connectivity from the machine to every pod in the mesh. +- A working DNS setup such that the machine is able to resolve DNS names for + in-cluster Kubernetes workloads. + +## Getting the current trust anchor and key + +To be able to use mutual TLS across cluster boundaries, the off-cluster machine +and the cluster need to have a shared trust anchor. For the purposes of this +tutorial, we will assume that you have access to the trust anchor certificate +and secret key for your Linkerd deployment and placed it in files called +`ca.key` and `ca.crt`. + +## Install SPIRE on your machine + +Linkerd's proxies normally obtain TLS certificates from the identity component +of Linkerd's control plane. In order to attest their identity, they use the +Kubernetes Service Account token that is provided to each Pod. + +Since our external workload lives outside of Kubernetes, the concept of Service +Account tokens does not exist. Instead, we turn to the [SPIFFE +framework](https://spiffee.io) and its SPIRE implementation to create identities +for off-cluster resources. Thus, for mesh expansion, we configure the Linkerd +proxy to obtain its certificates directly from SPIRE instead of the Linkerd's +identity service. The magic of SPIFFE is that these certificates are compatible +with those generated by Linkerd on the cluster. + +In production, you may already have your own identity infrastructure built on +top of SPIFFE that can be used by the proxies on external machines. 
For this +tutorial however, we can take you through installing and setting up a minimal +SPIRE environment on your machine. To begin with you need to install SPIRE by +downloading it from the [SPIRE GitHub releases +page](https://github.com/spiffe/spire/releases). For example: + +```bash +wget https://github.com/spiffe/SPIRE/releases/download/v1.8.2/SPIRE-1.8.2-linux-amd64-musl.tar.gz +tar zvxf SPIRE-1.8.2-linux-amd64-musl.tar.gz +cp -r SPIRE-1.8.2/. /opt/SPIRE/ +``` + +Then you need to configure the SPIRE server on your machine: + +```bash +cat >/opt/SPIRE/server.cfg </opt/SPIRE/agent.cfg < +kubectl --context=west apply -f - < + while true; do + sleep 3600; + done + serviceAccountName: client +EOF +``` + +You can also create a service that selects over both the machine as well as an +in-cluster workload: + +```yaml +kubectl apply -f - </dev/null & ``` +(We redirect to `/dev/null` just so you don't get flooded with "Handling +connection" messages for the rest of the exercise.) + Open [http://localhost:7000/](http://localhost:7000/) in your browser to see the frontend. @@ -101,372 +104,268 @@ more details on how this works.) ## Debugging -Let's use Linkerd to discover the root cause of this app's failures. To check -out the Linkerd dashboard, run: +Let's use Linkerd to discover the root cause of this app's failures. Linkerd's +proxy exposes rich metrics about the traffic that it processes, including HTTP +response codes. The metric that we're interested is `outbound_http_route_backend_response_statuses_total` +and will help us identify where HTTP errors are occuring. We can use the +`linkerd diagnostics proxy-metrics` command to get proxy metrics. Pick one of +your webapp pods and run the following command to get the metrics for HTTP 500 +responses: ```bash -linkerd viz dashboard & +linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \ +| grep outbound_http_route_backend_response_statuses_total \ +| grep http_status=\"500\" ``` -{{< fig src="/images/books/dashboard.png" title="Dashboard" >}} - -Select `booksapp` from the namespace dropdown and click on the -[Deployments](http://localhost:50750/namespaces/booksapp/deployments) workload. -You should see all the deployments in the `booksapp` namespace show up. There -will be success rate, requests per second, and latency percentiles. - -That’s cool, but you’ll notice that the success rate for `webapp` is not 100%. -This is because the traffic generator is submitting new books. You can do the -same thing yourself and push that success rate even lower. Click on `webapp` in -the Linkerd dashboard for a live debugging session. - -You should now be looking at the detail view for the `webapp` service. You’ll -see that `webapp` is taking traffic from `traffic` (the load generator), and it -has two outgoing dependencies: `authors` and `book`. One is the service for -pulling in author information and the other is the service for pulling in book -information. - -{{< fig src="/images/books/webapp-detail.png" title="Detail" >}} - -A failure in a dependent service may be exactly what’s causing the errors that -`webapp` is returning (and the errors you as a user can see when you click). We -can see that the `books` service is also failing. Let’s scroll a little further -down the page, we’ll see a live list of all traffic endpoints that `webapp` is -receiving. This is interesting: - -{{< fig src="/images/books/top.png" title="Top" >}} - -Aha! 
We can see that inbound traffic coming from the `webapp` service going to -the `books` service is failing a significant percentage of the time. That could -explain why `webapp` was throwing intermittent failures. Let’s click on the tap -(🔬) icon and then on the Start button to look at the actual request and -response stream. - -{{< fig src="/images/books/tap.png" title="Tap" >}} - -Indeed, many of these requests are returning 500’s. - -It was surprisingly easy to diagnose an intermittent issue that affected only a -single route. You now have everything you need to open a detailed bug report -explaining exactly what the root cause is. If the `books` service was your own, -you know exactly where to look in the code. - -## Service Profiles - -To understand the root cause, we used live traffic. For some issues this is -great, but what happens if the issue is intermittent and happens in the middle of -the night? [Service profiles](../../features/service-profiles/) provide Linkerd -with some additional information about your services. These define the routes -that you're serving and, among other things, allow for the collection of metrics -on a per route basis. With Prometheus storing these metrics, you'll be able to -sleep soundly and look up intermittent issues in the morning. - -One of the easiest ways to get service profiles setup is by using existing -[OpenAPI (Swagger)](https://swagger.io/docs/specification/about/) specs. This -demo has published specs for each of its services. You can create a service -profile for `webapp` by running: - -```bash -curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/booksapp/webapp.swagger \ - | linkerd -n booksapp profile --open-api - webapp \ - | kubectl -n booksapp apply -f - +This should return a metric that looks something like: + +```text +outbound_http_route_backend_response_statuses_total{ + parent_group="core", + parent_kind="Service", + parent_namespace="booksapp", + parent_name="books", + parent_port="7002", + parent_section_name="", + route_group="", + route_kind="default", + route_namespace="", + route_name="http", + backend_group="core", + backend_kind="Service", + backend_namespace="booksapp", + backend_name="books", + backend_port="7002", + backend_section_name="", + http_status="500", + error="" +} 207 ``` -This command will do three things: +This counter tells us that the webapp pod received a total of 207 HTTP 500 +responses from the `books` Service on port 7002. -1. Fetch the swagger specification for `webapp`. -1. Take the spec and convert it into a service profile by using the `profile` - command. -1. Apply this configuration to the cluster. +## HTTPRoute -Alongside `install` and `inject`, `profile` is also a pure text operation. Check -out the profile that is generated: +We know that the webapp component is getting 500s from the books component, but +it would be great to narrow this down further and get per route metrics. To do +this, we take advantage of the Gateway API and define a set of HTTPRoute +resources, each attached to the `books` Service by specifying it as their +`parent_ref`. -```yaml -apiVersion: linkerd.io/v1alpha2 -kind: ServiceProfile +```bash +kubectl apply -f - <}} -Routes configured in service profiles are different from [HTTPRoute] resources. -Service profile routes allow you to collect per-route metrics and configure -client-side behavior such as retries and timeouts. [HTTPRoute] resources, on the -other hand, can be the target of AuthorizationPolicies and allow you to specify -per-route authorization. 
-{{< /note >}} +Both of these questions can be answered by adding annotations to the Service, +HTTPRoute, or GRPCRoute resource you're sending requests to. The reason why these pieces of configuration are required is because retries can potentially be dangerous. Automatically retrying a request that changes state @@ -32,67 +23,29 @@ to recover. Check out the [retries section](../books/#retries) of the books demo for a tutorial of how to configure retries. -## Retries - -For routes that are idempotent, you can edit the service profile and add -`isRetryable` to the retryable route: - -```yaml -spec: - routes: - - name: GET /api/annotations - condition: - method: GET - pathRegex: /api/annotations - isRetryable: true ### ADD THIS LINE ### -``` +{{< warning >}} +Retries configured in this way are **incompatible with ServiceProfiles**. If a +[ServiceProfile](../../features/service-profiles/) is defined for a Service, +proxies will use the ServiceProfile retry configuration and ignore any retry +annotations. +{{< /warning >}} -Retries are supported for _all_ idempotent requests, whatever verb they use, -and [whether or not they have a body]. In particular, this mean that gRPC -requests can be retried. However, requests will not be retried if the body -exceeds 64KiB. - -[whether or not they have a body]:../../../2021/10/26/how-linkerd-retries-http-requests-with-bodies/ - -## Retry Budgets - -A retry budget is a mechanism that limits the number of retries that can be -performed against a service as a percentage of original requests. This -prevents retries from overwhelming your system. By default, retries may add at -most an additional 20% to the request load (plus an additional 10 "free" -retries per second). These settings can be adjusted by setting a `retryBudget` -on your service profile. +## Retries -```yaml -spec: - retryBudget: - retryRatio: 0.2 - minRetriesPerSecond: 10 - ttl: 10s -``` +For HTTPRoutes that are idempotent, you can add the `retry.linkerd.io/http: 5xx` +annotation which instructs Linkerd to retry any requests which fail with an HTTP +response status in the 500s. -## Monitoring Retries +Note that requests will not be retried if the body exceeds 64KiB. -Retries can be monitored by using the `linkerd viz routes` command with the `--to` -flag and the `-o wide` flag. Since retries are performed on the client-side, -we need to use the `--to` flag to see metrics for requests that one resource -is sending to another (from the server's point of view, retries are just -regular requests). When both of these flags are specified, the `linkerd routes` -command will differentiate between "effective" and "actual" traffic. +## Retry Limits -```bash -ROUTE SERVICE EFFECTIVE_SUCCESS EFFECTIVE_RPS ACTUAL_SUCCESS ACTUAL_RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 -HEAD /authors/{id}.json authors 100.00% 2.8rps 58.45% 4.7rps 7ms 25ms 37ms -[DEFAULT] authors 0.00% 0.0rps 0.00% 0.0rps 0ms 0ms 0ms -``` +You can also add the `retry.linkerd.io/limit` annotation to specify the maximum +number of times a request may be retried. By default, this limit is `1`. -Actual requests represent all requests that the client actually sends, including -original requests and retries. Effective requests only count the original -requests. Since an original request may trigger one or more retries, the actual -request volume is usually higher than the effective request volume when retries -are enabled. 
Since an original request may fail the first time, but a retry of -that request might succeed, the effective success rate is usually ([but not -always](../configuring-timeouts/#monitoring-timeouts)) higher than the -actual success rate. +## gRPC Retries -[HTTPRoute]: ../../features/httproute/ +Retries can also be configured for gRPC traffic by adding the +`retry.linkerd.io/grpc` annotation to a GRPCRoute or Service resource. The value +of this annotation is a comma seperated list of gRPC status codes that should +be retried. diff --git a/linkerd.io/content/2-edge/tasks/configuring-timeouts.md b/linkerd.io/content/2-edge/tasks/configuring-timeouts.md index 5fdfb6637f..45005a5d15 100644 --- a/linkerd.io/content/2-edge/tasks/configuring-timeouts.md +++ b/linkerd.io/content/2-edge/tasks/configuring-timeouts.md @@ -9,107 +9,17 @@ of time to wait for a response from a remote service to complete after the request is sent. If the timeout elapses without receiving a response, Linkerd will cancel the request and return a [504 Gateway Timeout] response. -Timeouts can be specified either [using HTTPRoutes](#using-httproutes) or [using -legacy ServiceProfiles](#using-serviceprofiles). Since [HTTPRoute] is a newer -configuration mechanism intended to replace [ServiceProfile]s, prefer the use of -HTTPRoute timeouts unless a ServiceProfile already exists for the Service. - -## Using HTTPRoutes - -Linkerd supports timeouts as specified in [GEP-1742], for [outbound -HTTPRoutes](../../features/httproute/#inbound-and-outbound-httproutes) -with Service parents. +Timeouts can be specified by adding annotations to HTTPRoute, GRPCRoute, or +Service resources. {{< warning >}} -Support for [GEP-1742](https://gateway-api.sigs.k8s.io/geps/gep-1742/) has not -yet been implemented by the upstream Gateway API HTTPRoute resource. The GEP has -been accepted, but it has not yet been added to the definition of the HTTPRoute -resource. This means that HTTPRoute timeout fields can currently be used only in -HTTPRoute resources with the `policy.linkerd.io` API group, *not* the -`gateway.networking.k8s.io` API group. - -When the [GEP-1742](https://gateway-api.sigs.k8s.io/geps/gep-1742/) timeout -fields are added to the upstream resource definition, Linkerd will support -timeout configuration for HTTPRoutes with both API groups. - -See [the HTTPRoute reference -documentation](../../reference/httproute/#linkerd-and-gateway-api-httproutes) -for details on the two versions of the HTTPRoute resource. +Timeouts configured in this way are **incompatible with ServiceProfiles**. If a +[ServiceProfile](../../features/service-profiles/) is defined for a Service, +proxies will use the ServiceProfile timeout configuration and ignore any timeout +annotations. {{< /warning >}} -Each [rule](../../reference/httproute/#httprouterule) in an [HTTPRoute] may -define an optional [`timeouts`](../../reference/httproute/#httproutetimeouts) -object, which can define `request` and/or `backendRequest` fields: - -- `timeouts.request` specifies the *total time* to wait for a request matching - this rule to complete (including retries). This timeout starts when the proxy - receives a request, and ends when successful response is sent to the client. -- `timeouts.backendRequest` specifies the time to wait for a single request to a - backend to complete. This timeout starts when a request is dispatched to a - [backend](../../reference/httproute/#httpbackendref), and ends when a response - is received from that backend. 
This is a subset of the `timeouts.request` - timeout. If the request fails and is retried (if applicable), the - `backendRequest` timeout will be restarted for each retry request. - -Timeout durations are specified specified as strings using the [Gateway API -duration format] specified by -[GEP-2257](https://gateway-api.sigs.k8s.io/geps/gep-2257/) -(e.g. 1h/1m/1s/1ms), and must be at least 1ms. If either field is unspecified or -set to 0, the timeout configured by that field will not be enforced. - -For example: - -```yaml -spec: - rules: - - matches: - - path: - type: RegularExpression - value: /authors/[^/]*\.json" - method: GET - timeouts: - request: 600ms - backendRequest: 300ms -``` - -## Using ServiceProfiles - -Each [route](../../reference/service-profiles/#route) in a [ServiceProfile] may -define a request timeout for requests matching that route. This timeout secifies -the maximum amount of time to wait for a response (including retries) to -complete after the request is sent. If unspecified, the default timeout is 10 -seconds. - -```yaml -spec: - routes: - - condition: - method: HEAD - pathRegex: /authors/[^/]*\.json - name: HEAD /authors/{id}.json - timeout: 300ms -``` - -Check out the [timeouts section](../books/#timeouts) of the books demo for -a tutorial of how to configure timeouts using ServiceProfiles. - -## Monitoring Timeouts - -Requests which reach the timeout will be canceled, return a [504 Gateway -Timeout] response, and count as a failure for the purposes of [effective success -rate](../configuring-retries/#monitoring-retries). Since the request was -canceled before any actual response was received, a timeout will not count -towards the actual request volume at all. This means that effective request -rate can be higher than actual request rate when timeouts are configured. -Furthermore, if a response is received just as the timeout is exceeded, it is -possible for the request to be counted as an actual success but an effective -failure. This can result in effective success rate being lower than actual -success rate. +## Timeouts -[HTTPRoute]: ../../features/httproute/ -[ServiceProfile]: ../../features/service-profiles/ -[504 Gateway Timeout]: - https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504 -[GEP-1742]: https://gateway-api.sigs.k8s.io/geps/gep-1742/ -[Gateway API duration format]: - https://gateway-api.sigs.k8s.io/geps/gep-2257/#gateway-api-duration-format +Check out the [timeouts section](../books/#timeouts) of the books demo +for a tutorial of how to configure timeouts. diff --git a/linkerd.io/content/2-edge/tasks/getting-per-route-metrics.md b/linkerd.io/content/2-edge/tasks/getting-per-route-metrics.md index 86e6a35cb7..971a01acc7 100644 --- a/linkerd.io/content/2-edge/tasks/getting-per-route-metrics.md +++ b/linkerd.io/content/2-edge/tasks/getting-per-route-metrics.md @@ -3,101 +3,22 @@ title = "Getting Per-Route Metrics" description = "Configure per-route metrics for your application." +++ -To get per-route metrics, you must first create a -[service profile](../../features/service-profiles/). Once a service -profile has been created, Linkerd will add labels to the Prometheus metrics that -associate a specific request to a specific route. - -For a tutorial that shows this functionality off, check out the +To get per-route metrics, you must create [HTTPRoute] resources. If a route has +a `parent_ref` which points to a **Service** resource, Linkerd will generate +outbound per-route traffic metrics for all HTTP traffic that it sends to that +Service. 
If a route has a `parent_ref` which points to a **Server** resource, +Linkerd will generate inbound per-route traffic metrcs for all HTTP traffic that +it receives on that Server. Note that an [HTTPRoute] can have multiple +`parent_ref`s which means that the same [HTTPRoute] resource can be used to +describe both outbound and inbound routes. + +For a tutorial that shows off per-route metrics, check out the [books demo](../books/#service-profiles). {{< note >}} Routes configured in service profiles are different from [HTTPRoute] resources. -Service profile routes allow you to collect per-route metrics and configure -client-side behavior such as retries and timeouts. [HTTPRoute] resources, on the -other hand, can be the target of AuthorizationPolicies and allow you to specify -per-route authorization. +If a [ServiceProfile](../../features/service-profiles/) is defined for a +Service, proxies will ignore any [HTTPRoute] for that Service. {{< /note >}} -You can view per-route metrics in the CLI by running `linkerd viz routes`: - -```bash -$ linkerd viz routes svc/webapp -ROUTE SERVICE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 -GET / webapp 100.00% 0.6rps 25ms 30ms 30ms -GET /authors/{id} webapp 100.00% 0.6rps 22ms 29ms 30ms -GET /books/{id} webapp 100.00% 1.2rps 18ms 29ms 30ms -POST /authors webapp 100.00% 0.6rps 32ms 46ms 49ms -POST /authors/{id}/delete webapp 100.00% 0.6rps 45ms 87ms 98ms -POST /authors/{id}/edit webapp 0.00% 0.0rps 0ms 0ms 0ms -POST /books webapp 50.76% 2.2rps 26ms 38ms 40ms -POST /books/{id}/delete webapp 100.00% 0.6rps 24ms 29ms 30ms -POST /books/{id}/edit webapp 60.71% 0.9rps 75ms 98ms 100ms -[DEFAULT] webapp 0.00% 0.0rps 0ms 0ms 0ms -``` - -The `[DEFAULT]` route is a catch-all, anything that does not match the regexes -specified in your service profile will end up there. - -It is also possible to look the metrics up by other resource types, such as: - -```bash -$ linkerd viz routes deploy/webapp -ROUTE SERVICE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 -[DEFAULT] kubernetes 0.00% 0.0rps 0ms 0ms 0ms -GET / webapp 100.00% 0.5rps 27ms 38ms 40ms -GET /authors/{id} webapp 100.00% 0.6rps 18ms 29ms 30ms -GET /books/{id} webapp 100.00% 1.1rps 17ms 28ms 30ms -POST /authors webapp 100.00% 0.5rps 25ms 30ms 30ms -POST /authors/{id}/delete webapp 100.00% 0.5rps 58ms 96ms 99ms -POST /authors/{id}/edit webapp 0.00% 0.0rps 0ms 0ms 0ms -POST /books webapp 45.58% 2.5rps 33ms 82ms 97ms -POST /books/{id}/delete webapp 100.00% 0.6rps 33ms 48ms 50ms -POST /books/{id}/edit webapp 55.36% 0.9rps 79ms 160ms 192ms -[DEFAULT] webapp 0.00% 0.0rps 0ms 0ms 0ms -``` - -Then, it is possible to filter all the way down to requests going from a -specific resource to other services: - -```bash -$ linkerd viz routes deploy/webapp --to svc/books -ROUTE SERVICE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 -DELETE /books/{id}.json books 100.00% 0.5rps 18ms 29ms 30ms -GET /books.json books 100.00% 1.1rps 7ms 12ms 18ms -GET /books/{id}.json books 100.00% 2.5rps 6ms 10ms 10ms -POST /books.json books 52.24% 2.2rps 23ms 34ms 39ms -PUT /books/{id}.json books 41.98% 1.4rps 73ms 97ms 99ms -[DEFAULT] books 0.00% 0.0rps 0ms 0ms 0ms -``` - -## Troubleshooting - -If you're not seeing any metrics, there are two likely culprits. In both cases, -`linkerd viz tap` can be used to understand the problem. 
For the resource that -the service points to, run: - -```bash -linkerd viz tap deploy/webapp -o wide | grep req -``` - -A sample output is: - -```bash -req id=3:1 proxy=in src=10.4.0.14:58562 dst=10.4.1.4:7000 tls=disabled :method=POST :authority=webapp:7000 :path=/books/24783/edit src_res=deploy/traffic src_ns=default dst_res=deploy/webapp dst_ns=default rt_route=POST /books/{id}/edit -``` - -This will select only the requests observed and show the `:authority` and -`rt_route` that was used for each request. - -- Linkerd discovers the right service profile to use via `:authority` or - `Host` headers. The name of your service profile must match these headers. - There are many reasons why these would not match, see - [ingress](../../features/ingress/) for one reason. Another would be clients that - use IPs directly such as Prometheus. -- Getting regexes to match can be tough and the ordering is important. Pay - attention to `rt_route`. If it is missing entirely, compare the `:path` to - the regex you'd like for it to match, and use a - [tester](https://regex101.com/) with the Golang flavor of regex. - [HTTPRoute]: ../../features/httproute/ diff --git a/linkerd.io/content/2-edge/tasks/graceful-shutdown.md b/linkerd.io/content/2-edge/tasks/graceful-shutdown.md index ea96f17e2e..c4584184ca 100644 --- a/linkerd.io/content/2-edge/tasks/graceful-shutdown.md +++ b/linkerd.io/content/2-edge/tasks/graceful-shutdown.md @@ -134,9 +134,20 @@ containers in the pod complete. However, the Linkerd proxy container runs continuously until it receives a TERM signal. Since Kubernetes does not give the proxy a means to know when the Cronjob has completed, by default, Job and Cronjob pods which have been meshed will continue to run even once the main -container has completed. +container has completed. You can address this either by running Linkerd as a +native sidecar or by manually shutting down the proxy. -To address this, you can issue a POST to the `/shutdown` endpoint on the proxy +### Native Sidecar + +If you use the `--set proxy.nativeSidecar=true` flag when installing Linkerd, the +Linkerd proxy will run as a [sidecar container](https://kubernetes.io/docs/concepts/workloads/pods/sidecar-containers/) +and will automatically shutdown when the main containers in the pod terminate. +Native sidecars were added in Kubernetes v1.28 and are available by default in +Kubernetes v1.29. + +### Manual shutdown + +Alternatively, you can issue a POST to the `/shutdown` endpoint on the proxy once the application completes (e.g. via `curl -X POST http://localhost:4191/shutdown`). This will terminate the proxy gracefully and allow the Job or Cronjob to complete. These shutdown requests must come on the @@ -147,5 +158,7 @@ One convenient way to call this endpoint is to wrap your application with the application that is called this way (e.g. via `linkerd-await -S $MYAPP`) will automatically call the proxy's `/shutdown` endpoint when it completes. -In the future, Kubernetes will hopefully support more container lifecycle hooks -that will allow Linkerd to handle these situations automatically. +For security reasons, the proxy's `/shutdown` endpoint is disabled by default. +In order to be able to manually shutdown the proxy, you must enable this +endpoint by installing Linkerd with the `--set proxy.enableShutdownEndpoint=true` +flag. 
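For example, a meshed Job can call the shutdown endpoint itself once its work
is done. The manifest below is a minimal sketch, assuming Linkerd was installed
with `--set proxy.enableShutdownEndpoint=true`; the Job name, image, and
workload are purely illustrative.

```bash
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: one-shot-task
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: curlimages/curl:8.7.1
        command:
        - /bin/sh
        - -c
        - |
          echo "doing the real work here..."
          # Tell the proxy to exit so the Job can complete.
          curl -X POST http://localhost:4191/shutdown
EOF
```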
diff --git a/linkerd.io/content/2-edge/tasks/install-helm.md b/linkerd.io/content/2-edge/tasks/install-helm.md index b604602aea..af0249a009 100644 --- a/linkerd.io/content/2-edge/tasks/install-helm.md +++ b/linkerd.io/content/2-edge/tasks/install-helm.md @@ -6,62 +6,50 @@ description = "Install Linkerd onto your Kubernetes cluster using Helm." Linkerd can be installed via Helm rather than with the `linkerd install` command. This is recommended for production, since it allows for repeatability. -{{< trylpt >}} +{{< releases >}} -## Prerequisite: generate identity certificates +## Prerequisite: generate mTLS certificates To do [automatic mutual TLS](../../features/automatic-mtls/), Linkerd requires trust anchor certificate and an issuer certificate and key pair. When you're -using `linkerd install`, we can generate these for you. However, for Helm, -you will need to generate these yourself. +using `linkerd install`, we can generate these for you. However, for Helm, you +will need to generate these yourself. -Please follow the instructions in [Generating your own mTLS root -certificates](../generate-certificates/) to generate these. +Please follow the instructions in +[Generating your own mTLS root certificates](../generate-certificates/) to +generate these. -## Helm install procedure for stable releases +## Helm install procedure ```bash -# To add the repo for Linkerd stable releases: -helm repo add linkerd https://helm.linkerd.io/stable - -# To add the repo for Linkerd edge releases: +# Add the Helm repo for Linkerd edge releases: helm repo add linkerd-edge https://helm.linkerd.io/edge ``` -The following instructions use the `linkerd` repo. For installing an edge -release, just replace with `linkerd-edge`, and add the `--devel` flag to all -commands. - -## Helm install procedure - You need to install two separate charts in succession: first `linkerd-crds` and then `linkerd-control-plane`. -{{< note >}} -If installing Linkerd in a cluster that uses Cilium in kube-proxy replacement -mode, additional steps may be needed to ensure service discovery works as -intended. Instrunctions are on the [Cilium cluster -configuration](../../reference/cluster-configuration/#cilium) page. -{{< /note >}} +{{< note >}} If installing Linkerd in a cluster that uses Cilium in kube-proxy +replacement mode, additional steps may be needed to ensure service discovery +works as intended. Instrunctions are on the +[Cilium cluster configuration](../../reference/cluster-configuration/#cilium) +page. {{< /note >}} ### linkerd-crds The `linkerd-crds` chart sets up the CRDs linkerd requires: ```bash -helm install linkerd-crds linkerd/linkerd-crds \ +helm install linkerd-crds linkerd-edge/linkerd-crds \ -n linkerd --create-namespace ``` -{{< note >}} -This will create the `linkerd` namespace. If it already exists or you're -creating it beforehand elsewhere in your pipeline, just omit the -`--create-namespace` flag. -{{< /note >}} +{{< note >}} This will create the `linkerd` namespace. If it already exists or +you're creating it beforehand elsewhere in your pipeline, just omit the +`--create-namespace` flag. {{< /note >}} -{{< note >}} -If you are using [Linkerd's CNI plugin](../../features/cni/), you must also add the -`--set cniEnabled=true` flag to your `helm install` command. +{{< note >}} If you are using [Linkerd's CNI plugin](../../features/cni/), you +must also add the `--set cniEnabled=true` flag to your `helm install` command. 
{{< /note >}} ### linkerd-control-plane @@ -74,25 +62,25 @@ helm install linkerd-control-plane \ --set-file identityTrustAnchorsPEM=ca.crt \ --set-file identity.issuer.tls.crtPEM=issuer.crt \ --set-file identity.issuer.tls.keyPEM=issuer.key \ - linkerd/linkerd-control-plane + linkerd-edge/linkerd-control-plane ``` -{{< note >}} -If you are using [Linkerd's CNI plugin](../../features/cni/), you must also add the -`--set cniEnabled=true` flag to your `helm install` command. +{{< note >}} If you are using [Linkerd's CNI plugin](../../features/cni/), you +must also add the `--set cniEnabled=true` flag to your `helm install` command. {{< /note >}} ## Enabling high availability mode -The `linkerd-control-plane` chart contains a file `values-ha.yaml` that overrides -some default values to set things up under a high-availability scenario, analogous -to the `--ha` option in `linkerd install`. Values such as higher number of -replicas, higher memory/cpu limits, and affinities are specified in those files. +The `linkerd-control-plane` chart contains a file `values-ha.yaml` that +overrides some default values to set things up under a high-availability +scenario, analogous to the `--ha` option in `linkerd install`. Values such as +higher number of replicas, higher memory/cpu limits, and affinities are +specified in those files. You can get `values-ha.yaml` by fetching the chart file: ```bash -helm fetch --untar linkerd/linkerd-control-plane +helm fetch --untar linkerd-edge/linkerd-control-plane ``` Then use the `-f` flag to provide this override file. For example: @@ -104,7 +92,7 @@ helm install linkerd-control-plane \ --set-file identity.issuer.tls.crtPEM=issuer.crt \ --set-file identity.issuer.tls.keyPEM=issuer.key \ -f linkerd-control-plane/values-ha.yaml \ - linkerd/linkerd-control-plane + linkerd-edge/linkerd-control-plane ``` ## Upgrading with Helm @@ -115,46 +103,43 @@ First, make sure your local Helm repos are updated: helm repo update helm search repo linkerd -NAME CHART VERSION APP VERSION DESCRIPTION -linkerd/linkerd-crds Linkerd gives you observability, reliability, and securit... -linkerd/linkerd-control-plane {{% latestversion %}} Linkerd gives you observability, reliability, and securit... +NAME CHART VERSION APP VERSION DESCRIPTION +linkerd-edge/linkerd-crds Linkerd gives you observability, reliability, and securit... +linkerd-edge/linkerd-control-plane {{% latestedge %}} Linkerd gives you observability, reliability, and securit... ``` During an upgrade, you must choose whether you want to reuse the values in the -chart or move to the values specified in the newer chart. Our advice is to use -a `values.yaml` file that stores all custom overrides that you have for your +chart or move to the values specified in the newer chart. Our advice is to use a +`values.yaml` file that stores all custom overrides that you have for your chart. The `helm upgrade` command has a number of flags that allow you to customize its behavior. Special attention should be paid to `--reuse-values` and `--reset-values` and how they behave when charts change from version to version -and/or overrides are applied through `--set` and `--set-file`. For example: +and/or overrides are applied through `--set` and `--set-file`. 
For example: - `--reuse-values` with no overrides - all values are reused - `--reuse-values` with overrides - all except the values that are overridden -are reused -- `--reset-values` with no overrides - no values are reused and all changes -from provided release are applied during the upgrade + are reused +- `--reset-values` with no overrides - no values are reused and all changes from + provided release are applied during the upgrade - `--reset-values` with overrides - no values are reused and changed from -provided release are applied together with the overrides + provided release are applied together with the overrides - no flag and no overrides - `--reuse-values` will be used by default - no flag and overrides - `--reset-values` will be used by default -Finally, before upgrading, check whether there are breaking changes to the chart -(i.e. renamed or moved keys, etc). You can consult the -[stable](https://artifacthub.io/packages/helm/linkerd2/linkerd-control-plane#values) -or the -[edge](https://artifacthub.io/packages/helm/linkerd2-edge/linkerd-control-plane#values) -chart docs, depending on -which one your are upgrading to. If there are, make the corresponding changes to +Finally, before upgrading, you can consult the +[edge chart](https://artifacthub.io/packages/helm/linkerd2-edge/linkerd-control-plane#values) +docs to check whether there are breaking changes to the chart (i.e. +renamed or moved keys, etc). If there are, make the corresponding changes to your `values.yaml` file. Then you can use: ```bash # the linkerd-crds chart currently doesn't have a values.yaml file -helm upgrade linkerd-crds linkerd/linkerd-crds +helm upgrade linkerd-crds linkerd-edge/linkerd-crds # whereas linkerd-control-plane does -helm upgrade linkerd-control-plane linkerd/linkerd-control-plane --reset-values -f values.yaml --atomic +helm upgrade linkerd-control-plane linkerd-edge/linkerd-control-plane --reset-values -f values.yaml --atomic ``` The `--atomic` flag will ensure that all changes are rolled back in case the diff --git a/linkerd.io/content/2-edge/tasks/install.md b/linkerd.io/content/2-edge/tasks/install.md index c36111e645..143255c109 100644 --- a/linkerd.io/content/2-edge/tasks/install.md +++ b/linkerd.io/content/2-edge/tasks/install.md @@ -12,12 +12,23 @@ Before you can use Linkerd, you'll need to install the [control plane](../../reference/architecture/#control-plane). This page covers how to accomplish that. +{{< note >}} + +The Linkerd project itself only produces [edge release](/releases/) artifacts. +(For more information about the different kinds of Linkerd releases, see the +[Releases and Versions](/releases/) page.) + +As such, this page contains instructions for installing the latest edge +release of Linkerd. If you are using a [stable +distribution](/releases/#stable) of Linkerd, the vendor should provide +additional guidance on installing Linkerd. + +{{< /note >}} + Linkerd's control plane can be installed in two ways: with the CLI and with Helm. The CLI is convenient and easy, but for production use cases we recommend Helm which allows for repeatability. -{{< trylpt >}} - In either case, we recommend installing the CLI itself so that you can validate the success of the installation. See the [Getting Started Guide](../../getting-started/) for how to install the CLI if you haven't done @@ -29,6 +40,10 @@ Linkerd requires a Kubernetes cluster on which to run. 
Where this cluster lives is not important: it might be hosted on a cloud provider, may be running on your local machine, or even somewhere else. +Make sure that your Linkerd version and Kubernetes version are compatible by +checking Linkerd's [supported Kubernetes +versions](../../reference/k8s-versions/). + Before installing the control plane, validate that this Kubernetes cluster is configured appropriately for Linkerd by running: diff --git a/linkerd.io/content/2-edge/tasks/per-request-policy.md b/linkerd.io/content/2-edge/tasks/per-request-policy.md new file mode 100644 index 0000000000..95dfcdd3bc --- /dev/null +++ b/linkerd.io/content/2-edge/tasks/per-request-policy.md @@ -0,0 +1,34 @@ ++++ +title = "Per-Request Policy" +description = "Using HTTP headers to specify per-request policy" +aliases = [] ++++ + +[Retries](../configuring-retries/) and [timeouts](../configuring-timeouts/) can +be configured by annotating Service, HTTPRoute, or GRPCRoute resources. This +will apply the retry or timeout policy to all requests that are sent to that +service/route. + +Additionally, retry and timeout policy can be configured for individual HTTP +requests by adding special HTTP headers to those requests. + +## Enabling Per-Request Policy + +In order to enable per-request policy, Linkerd must be installed with the +`--set policyController.additionalArgs="--allow-l5d-request-headers"` flag or +the corresponding Helm value. Enabling per-request policy is **not** +recommended if your application accepts requests from untrusted sources (e.g. +if it is an ingress) since this allows untrusted clients to specify Linkerd +policy. + +## Per-Request Policy Headers + +Once per-request policy is enabled, the following HTTP headers can be added to +a request to set or override retry and/or timeout policy for that request: + ++ `l5d-retry-http`: Overrides the `retry.linkerd.io/http` annotation ++ `l5d-retry-grpc`: Overrides the `retry.linkerd.io/grpc` annotation ++ `l5d-retry-limit`: Overrides the `retry.linkerd.io/limit` annotation ++ `l5d-retry-timeout`: Overrides the `retry.linkerd.io/timeout` annotation ++ `l5d-timeout`: Overrides the `timeout.linkerd.io/request` annotation ++ `l5d-response-timeout`: Overrides the `timeout.linkerd.io/response` annotation diff --git a/linkerd.io/content/2-edge/tasks/restricting-access.md b/linkerd.io/content/2-edge/tasks/restricting-access.md index afe2ca4f16..61f619ff17 100644 --- a/linkerd.io/content/2-edge/tasks/restricting-access.md +++ b/linkerd.io/content/2-edge/tasks/restricting-access.md @@ -167,14 +167,16 @@ explicitly create an authorization to allow those probe requests. For more information about adding route-scoped authorizations, see [Configuring Per-Route Policy](../configuring-per-route-policy/). -## Further Considerations +## Further Considerations - Audit Mode You may have noticed that there was a period of time after we created the `Server` resource but before we created the `ServerAuthorization` where all requests were being rejected. To avoid this situation in live systems, we -recommend you either create the policy resources before deploying your services -or to create the `ServiceAuthorizations` BEFORE creating the `Server` so that -clients will be authorized immediately. +recommend that you enable [audit mode](../../features/server-policy/#audit-mode) +in the `Server` resource (via `accessPolicy:audit`) and check the proxy +logs/metrics in the target services to see if traffic would get inadvertently +denied. 
Afterwards, when you're sure about your policy rules, you can fully +enable them by resetting `accessPolicy` back to `deny`. ## Per-Route Policy diff --git a/linkerd.io/content/2-edge/tasks/troubleshooting.md b/linkerd.io/content/2-edge/tasks/troubleshooting.md index 9b3072f6e2..d07d4e89b0 100644 --- a/linkerd.io/content/2-edge/tasks/troubleshooting.md +++ b/linkerd.io/content/2-edge/tasks/troubleshooting.md @@ -385,8 +385,9 @@ try installing linkerd via --set proxyInit.runAsRoot=true see https://linkerd.io/2.11/checks/#l5d-proxy-init-run-as-root for hints ``` -Kubernetes nodes running with docker as the container runtime ([CRI](https://kubernetes.io/docs/concepts/architecture/cri/)) -require the init container to run as root for iptables. +Kubernetes nodes running with docker as the container runtime +([CRI](https://kubernetes.io/docs/concepts/architecture/cri/)) require the init +container to run as root for iptables. Newer distributions of managed k8s use containerd where this is not an issue. @@ -399,8 +400,8 @@ time="2021-11-15T04:41:31Z" level=info msg="iptables-save v1.8.7 (legacy): Canno ``` See [linkerd/linkerd2#7283](https://github.com/linkerd/linkerd2/issues/7283) and -[linkerd/linkerd2#7308](https://github.com/linkerd/linkerd2/issues/7308) -for further details. +[linkerd/linkerd2#7308](https://github.com/linkerd/linkerd2/issues/7308) for +further details. ## The "linkerd-existence" checks {#l5d-existence} @@ -501,8 +502,8 @@ Example failure: Failures of such nature indicate that your roots have expired. If that is the case you will have to update both the root and issuer certificates at once. You can follow the process outlined in -[Replacing Expired Certificates](../replacing_expired_certificates/) to -get your cluster back to a stable state. +[Replacing Expired Certificates](../replacing_expired_certificates/) to get your +cluster back to a stable state. ### √ trust roots are valid for at least 60 days {#l5d-identity-trustAnchors-not-expiring-soon} @@ -646,9 +647,9 @@ Example failure: see https://linkerd.io/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints ``` -This warning indicates that the expiry of proxy-injnector webhook -cert is approaching. In order to address this -problem without incurring downtime, you can follow the process outlined in +This warning indicates that the expiry of proxy-injnector webhook cert is +approaching. In order to address this problem without incurring downtime, you +can follow the process outlined in [Automatically Rotating your webhook TLS Credentials](../automatically-rotating-webhook-tls-credentials/). ### √ sp-validator webhook has valid cert {#l5d-sp-validator-webhook-cert-valid} @@ -685,9 +686,9 @@ Example failure: see https://linkerd.io/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints ``` -This warning indicates that the expiry of sp-validator webhook -cert is approaching. In order to address this -problem without incurring downtime, you can follow the process outlined in +This warning indicates that the expiry of sp-validator webhook cert is +approaching. In order to address this problem without incurring downtime, you +can follow the process outlined in [Automatically Rotating your webhook TLS Credentials](../automatically-rotating-webhook-tls-credentials/). 
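If you want to confirm exactly when the webhook certificate will expire before
rotating it, you can inspect the certificate stored in the cluster directly.
This is a quick sketch that assumes the default secret name
(`linkerd-sp-validator-k8s-tls`) and the default `linkerd` namespace:

```bash
# Decode the sp-validator webhook certificate and print its expiry date.
kubectl -n linkerd get secret linkerd-sp-validator-k8s-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d \
  | openssl x509 -noout -enddate
```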
### √ policy-validator webhook has valid cert {#l5d-policy-validator-webhook-cert-valid} @@ -700,8 +701,8 @@ Example failure: see https://linkerd.io/checks/#l5d-policy-validator-webhook-cert-valid for hints ``` -Ensure that the `linkerd-policy-validator-k8s-tls` secret exists and contains the -appropriate `tls.crt` and `tls.key` data entries. +Ensure that the `linkerd-policy-validator-k8s-tls` secret exists and contains +the appropriate `tls.crt` and `tls.key` data entries. ```bash × policy-validator webhook has valid cert @@ -722,9 +723,9 @@ Example failure: see https://linkerd.io/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints ``` -This warning indicates that the expiry of policy-validator webhook -cert is approaching. In order to address this -problem without incurring downtime, you can follow the process outlined in +This warning indicates that the expiry of policy-validator webhook cert is +approaching. In order to address this problem without incurring downtime, you +can follow the process outlined in [Automatically Rotating your webhook TLS Credentials](../automatically-rotating-webhook-tls-credentials/). ## The "linkerd-identity-data-plane" checks {#l5d-identity-data-plane} @@ -820,9 +821,9 @@ Example failure: ``` Linkerd has a `clusterNetworks` setting which allows it to differentiate between -intra-cluster and egress traffic. This warning indicates that the cluster has -a podCIDR which is not included in Linkerd's `clusterNetworks`. Traffic to pods -in this network may not be meshed properly. To remedy this, update the +intra-cluster and egress traffic. This warning indicates that the cluster has a +podCIDR which is not included in Linkerd's `clusterNetworks`. Traffic to pods in +this network may not be meshed properly. To remedy this, update the `clusterNetworks` setting to include all pod networks in the cluster. ### √ cluster networks contains all pods {#l5d-cluster-networks-pods} @@ -840,8 +841,8 @@ Example failures: ``` Linkerd has a `clusterNetworks` setting which allows it to differentiate between -intra-cluster and egress traffic. This warning indicates that the cluster has -a pod or ClusterIP service which is not included in Linkerd's `clusterNetworks`. +intra-cluster and egress traffic. This warning indicates that the cluster has a +pod or ClusterIP service which is not included in Linkerd's `clusterNetworks`. Traffic to pods or services in this network may not be meshed properly. To remedy this, update the `clusterNetworks` setting to include all pod and service networks in the cluster. @@ -867,27 +868,83 @@ $ curl "https://versioncheck.linkerd.io/version.json?version=edge-19.1.2&uuid=te ### √ cli is up-to-date {#l5d-version-cli} -Example failure: +Example failures: + + + +**unsupported version channel** + + + +```bash +‼ cli is up-to-date + unsupported version channel: stable-2.14.10 +``` + +As of February 2024, the Linkerd project itself only produces [edge +release](/releases/) artifacts. For more details, read the [Releases and +Versions](/releases/) page. + + + +**is running version X but the latest version is Y** + + ```bash ‼ cli is up-to-date is running version 19.1.1 but the latest edge version is 19.1.2 ``` -See the page on [Upgrading Linkerd](../../upgrade/). +There is a newer version of the `linkerd` cli. See the page on +[Upgrading Linkerd](../../upgrade/). 
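If you're unsure which CLI version you have installed, or you want to move to
the newest edge release, the following commands are a quick sketch (the install
script is the same one used in the [Upgrading Linkerd](../../upgrade/) guide):

```bash
# Show the version of the locally installed CLI.
linkerd version --client

# Download and install the latest edge-release CLI.
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh
```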
## The "control-plane-version" checks {#l5d-version-control} +### √ control plane is up-to-date {#l5d-version-control-up-to-date} + Example failures: + + +**unsupported version channel** + + + +```bash +‼ control plane is up-to-date + unsupported version channel: stable-2.14.10 +``` + +As of February 2024, the Linkerd project itself only produces [edge +release](/releases/) artifacts. For more details, read the [Releases and +Versions](/releases/) page. + + + +**is running version X but the latest version is Y** + + + ```bash ‼ control plane is up-to-date is running version 19.1.1 but the latest edge version is 19.1.2 +``` + +There is a newer version of the control plane. See the page on +[Upgrading Linkerd](../../upgrade/). + +### √ control plane and cli versions match {#l5d-version-control-mismatched} + +Example failure: + +```bash ‼ control plane and cli versions match mismatched channels: running stable-2.1.0 but retrieved edge-19.1.2 ``` -See the page on [Upgrading Linkerd](../../upgrade/). +Your CLI and your control plane are running different types of releases. This +may cause issues. ## The "linkerd-control-plane-proxy" checks {#linkerd-control-plane-proxy} @@ -900,8 +957,8 @@ setting or re-install Linkerd as necessary. ### √ control plane proxies are up-to-date {#l5d-cp-proxy-version} This warning indicates the proxies running in the Linkerd control plane are -running an old version. We recommend downloading the latest Linkerd release -and [Upgrading Linkerd](../../upgrade/). +running an old version. We recommend downloading the latest Linkerd release and +[Upgrading Linkerd](../../upgrade/). ### √ control plane proxies and cli versions match {#l5d-cp-proxy-cli-version} @@ -990,8 +1047,8 @@ Example failure: config.linkerd.io/control-port ``` -`config.linkerd.io/*` or `config.alpha.linkerd.io/*` should -be annotations in order to take effect. +`config.linkerd.io/*` or `config.alpha.linkerd.io/*` should be annotations in +order to take effect. ### √ data plane service annotations are configured correctly {#l5d-data-plane-services-annotations} @@ -1004,8 +1061,7 @@ Example failure: mirror.linkerd.io/exported ``` -`mirror.linkerd.io/exported` should -be a label in order to take effect. +`mirror.linkerd.io/exported` should be a label in order to take effect. ### √ opaque ports are properly annotated {#linkerd-opaque-ports-definition} @@ -1020,42 +1076,17 @@ Example failure: If a Pod marks a port as opaque by using the `config.linkerd.io/opaque-ports` annotation, then any Service which targets that port must also use the `config.linkerd.io/opaque-ports` annotation to mark that port as opaque. Having -a port marked as opaque on the Pod but not the Service (or vice versa) can -cause inconsistent behavior depending on if traffic is sent to the Pod directly -(for example with a headless Service) or through a ClusterIP Service. This -error can be remedied by adding the `config.linkerd.io/opaque-ports` annotation -to both the Pod and Service. See +a port marked as opaque on the Pod but not the Service (or vice versa) can cause +inconsistent behavior depending on if traffic is sent to the Pod directly (for +example with a headless Service) or through a ClusterIP Service. This error can +be remedied by adding the `config.linkerd.io/opaque-ports` annotation to both +the Pod and Service. See [Protocol Detection](../../features/protocol-detection/) for more information. ## The "linkerd-ha-checks" checks {#l5d-ha} These checks are ran if Linkerd has been installed in HA mode. 
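Before digging into the individual HA checks, it can be useful to look at the
control plane Deployments directly; in HA mode, each of them should report more
than one ready replica. This is a sketch assuming the default control plane
deployment names and the `linkerd` namespace:

```bash
# In HA mode, each core control plane deployment should show multiple
# ready replicas.
kubectl -n linkerd get deploy linkerd-destination linkerd-identity linkerd-proxy-injector
```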
-### √ pod injection disabled on kube-system {#l5d-injection-disabled} - -Example warning: - -```bash -‼ pod injection disabled on kube-system - kube-system namespace needs to have the label config.linkerd.io/admission-webhooks: disabled if HA mode is enabled - see https://linkerd.io/checks/#l5d-injection-disabled for hints -``` - -Ensure the kube-system namespace has the -`config.linkerd.io/admission-webhooks:disabled` label: - -```bash -$ kubectl get namespace kube-system -oyaml -kind: Namespace -apiVersion: v1 -metadata: - name: kube-system - annotations: - linkerd.io/inject: disabled - labels: - config.linkerd.io/admission-webhooks: disabled -``` - ### √ multiple replicas of control plane pods {#l5d-control-plane-replicas} Example warning: @@ -1071,11 +1102,10 @@ replicas running. This is likely caused by insufficient node resources. ### The "extensions" checks {#extensions} -When any [Extensions](../extensions/) are installed, The Linkerd binary -tries to invoke `check --output json` on the extension binaries. -It is important that the extension binaries implement it. -For more information, See [Extension -developer docs](https://github.com/linkerd/linkerd2/blob/main/EXTENSIONS.md) +When any [Extensions](../extensions/) are installed, The Linkerd binary tries to +invoke `check --output json` on the extension binaries. It is important that the +extension binaries implement it. For more information, See +[Extension developer docs](https://github.com/linkerd/linkerd2/blob/main/EXTENSIONS.md) Example error: @@ -1083,8 +1113,9 @@ Example error: invalid extension check output from \"jaeger\" (JSON object expected) ``` -Make sure that the extension binary implements `check --output json` -which returns the healthchecks in the [expected json format](https://github.com/linkerd/linkerd2/blob/main/EXTENSIONS.md#linkerd-name-check). +Make sure that the extension binary implements `check --output json` which +returns the healthchecks in the +[expected json format](https://github.com/linkerd/linkerd2/blob/main/EXTENSIONS.md#linkerd-name-check). Example error: @@ -1262,10 +1293,10 @@ Done configuring CNI. Sleep=true ## The "linkerd-multicluster checks {#l5d-multicluster} These checks run if the service mirroring controller has been installed. -Additionally they can be ran with `linkerd multicluster check`. -Most of these checks verify that the service mirroring controllers are working -correctly along with remote gateways. Furthermore the checks ensure that -end to end TLS is possible between paired clusters. +Additionally they can be ran with `linkerd multicluster check`. Most of these +checks verify that the service mirroring controllers are working correctly along +with remote gateways. Furthermore the checks ensure that end to end TLS is +possible between paired clusters. ### √ Link CRD exists {#l5d-multicluster-link-crd-exists} @@ -1308,8 +1339,8 @@ Example error: see https://linkerd.io/checks/#l5d-smc-target-clusters-access for hints ``` -Make sure the relevant Kube-config with relevant permissions. -for the specific target cluster is present as a secret correctly +Make sure the relevant Kube-config with relevant permissions. for the specific +target cluster is present as a secret correctly ### √ clusters share trust anchors {#l5d-multicluster-clusters-share-anchors} @@ -1483,12 +1514,11 @@ Example errors: ``` The error above indicates that some mirror services in the source cluster do not -have associated link. 
These mirror services are created by the Linkerd -service mirror controller when a remote service is marked to be -mirrored. +have associated link. These mirror services are created by the Linkerd service +mirror controller when a remote service is marked to be mirrored. -Make sure services are marked to be mirrored correctly at remote, and delete -if there are any unnecessary ones. +Make sure services are marked to be mirrored correctly at remote, and delete if +there are any unnecessary ones. ### √ multicluster extension proxies are healthy {#l5d-multicluster-proxy-healthy} @@ -1499,37 +1529,37 @@ correct setting or re-install as necessary. ### √ multicluster extension proxies are up-to-date {#l5d-multicluster-proxy-cp-version} This warning indicates the proxies running in the multicluster extension are -running an old version. We recommend downloading the latest linkerd-multicluster +running an old version. We recommend downloading the latest linkerd-multicluster and upgrading. ### √ multicluster extension proxies and cli versions match {#l5d-multicluster-proxy-cli-version} -This warning indicates that the proxies running in the multicluster extension are -running a different version from the Linkerd CLI. We recommend keeping this -versions in sync by updating either the CLI or linkerd-multicluster as necessary. +This warning indicates that the proxies running in the multicluster extension +are running a different version from the Linkerd CLI. We recommend keeping this +versions in sync by updating either the CLI or linkerd-multicluster as +necessary. ## The "linkerd-viz" checks {#l5d-viz} -These checks only run when the `linkerd-viz` extension is installed. -This check is intended to verify the installation of linkerd-viz -extension which comprises of `tap`, `web`, -`metrics-api` and optional `grafana` and `prometheus` instances -along with `tap-injector` which injects the specific -tap configuration to the proxies. +These checks only run when the `linkerd-viz` extension is installed. This check +is intended to verify the installation of linkerd-viz extension which comprises +of `tap`, `web`, `metrics-api` and optional `grafana` and `prometheus` instances +along with `tap-injector` which injects the specific tap configuration to the +proxies. ### √ linkerd-viz Namespace exists {#l5d-viz-ns-exists} -This is the basic check used to verify if the linkerd-viz extension -namespace is installed or not. The extension can be installed by running -the following command: +This is the basic check used to verify if the linkerd-viz extension namespace is +installed or not. The extension can be installed by running the following +command: ```bash linkerd viz install | kubectl apply -f - ``` -The installation can be configured by using the -`--set`, `--values`, `--set-string` and `--set-file` flags. -See [Linkerd Viz Readme](https://www.github.com/linkerd/linkerd2/tree/main/viz/charts/linkerd-viz/README.md) +The installation can be configured by using the `--set`, `--values`, +`--set-string` and `--set-file` flags. See +[Linkerd Viz Readme](https://www.github.com/linkerd/linkerd2/tree/main/viz/charts/linkerd-viz/README.md) for a full list of configurable fields. ### √ linkerd-viz ClusterRoles exist {#l5d-viz-cr-exists} @@ -1591,21 +1621,20 @@ yes ### √ viz extension proxies are healthy {#l5d-viz-proxy-healthy} -This error indicates that the proxies running in the viz extension are -not healthy. Ensure that linkerd-viz has been installed with all of the -correct setting or re-install as necessary. 
+This error indicates that the proxies running in the viz extension are not +healthy. Ensure that linkerd-viz has been installed with all of the correct +setting or re-install as necessary. ### √ viz extension proxies are up-to-date {#l5d-viz-proxy-cp-version} -This warning indicates the proxies running in the viz extension are -running an old version. We recommend downloading the latest linkerd-viz -and upgrading. +This warning indicates the proxies running in the viz extension are running an +old version. We recommend downloading the latest linkerd-viz and upgrading. ### √ viz extension proxies and cli versions match {#l5d-viz-proxy-cli-version} -This warning indicates that the proxies running in the viz extension are -running a different version from the Linkerd CLI. We recommend keeping this -versions in sync by updating either the CLI or linkerd-viz as necessary. +This warning indicates that the proxies running in the viz extension are running +a different version from the Linkerd CLI. We recommend keeping this versions in +sync by updating either the CLI or linkerd-viz as necessary. ### √ tap API server has valid cert {#l5d-tap-cert-valid} @@ -1641,9 +1670,9 @@ Example failure: see https://linkerd.io/checks/#l5d-webhook-cert-not-expiring-soon for hints ``` -This warning indicates that the expiry of the tap API Server webhook -cert is approaching. In order to address this -problem without incurring downtime, you can follow the process outlined in +This warning indicates that the expiry of the tap API Server webhook cert is +approaching. In order to address this problem without incurring downtime, you +can follow the process outlined in [Automatically Rotating your webhook TLS Credentials](../automatically-rotating-webhook-tls-credentials/). ### √ tap api service is running {#l5d-tap-api} @@ -1720,8 +1749,7 @@ Make sure that the `proxy-injector` is working correctly by running see https://linkerd.io/checks/#l5d-viz-cr-exists for hints ``` -Ensure all the prometheus related resources are present and running -correctly. +Ensure all the prometheus related resources are present and running correctly. ```bash ❯ kubectl -n linkerd-viz get deploy,cm | grep prometheus @@ -1800,7 +1828,7 @@ Prometheus to scrape the data plane proxies in a namespace: linkerd viz allow-scrapes --namespace emojivoto | kubectl apply -f - ``` -Note that this warning *only* checks for the existence of the policy resources +Note that this warning _only_ checks for the existence of the policy resources generated by `linkerd viz allow-scrapes` in namespaces that contain pods with the `deny` default inbound policy. In some cases, Prometheus scrapes may also be authorized by other, user-generated authorization policies. If metrics from the @@ -1834,38 +1862,37 @@ You should see all your pods here. If they are not: ## The "linkerd-jaeger" checks {#l5d-jaeger} -These checks only run when the `linkerd-jaeger` extension is installed. -This check is intended to verify the installation of linkerd-jaeger -extension which comprises of open-census collector and jaeger -components along with `jaeger-injector` which injects the specific -trace configuration to the proxies. +These checks only run when the `linkerd-jaeger` extension is installed. This +check is intended to verify the installation of linkerd-jaeger extension which +comprises of open-census collector and jaeger components along with +`jaeger-injector` which injects the specific trace configuration to the proxies. 
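As with the other extensions, these checks run as part of `linkerd check` when
the extension is installed, but you can also run just the jaeger extension's
checks on their own, which is convenient while iterating on its configuration:

```bash
# Run only the linkerd-jaeger extension's health checks.
linkerd jaeger check
```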
### √ linkerd-jaeger extension Namespace exists {#l5d-jaeger-ns-exists} -This is the basic check used to verify if the linkerd-jaeger extension -namespace is installed or not. The extension can be installed by running -the following command +This is the basic check used to verify if the linkerd-jaeger extension namespace +is installed or not. The extension can be installed by running the following +command ```bash linkerd jaeger install | kubectl apply -f - ``` -The installation can be configured by using the -`--set`, `--values`, `--set-string` and `--set-file` flags. -See [Linkerd Jaeger Readme](https://www.github.com/linkerd/linkerd2/tree/main/jaeger/charts/linkerd-jaeger/README.md) +The installation can be configured by using the `--set`, `--values`, +`--set-string` and `--set-file` flags. See +[Linkerd Jaeger Readme](https://www.github.com/linkerd/linkerd2/tree/main/jaeger/charts/linkerd-jaeger/README.md) for a full list of configurable fields. ### √ jaeger extension proxies are healthy {#l5d-jaeger-proxy-healthy} -This error indicates that the proxies running in the jaeger extension are -not healthy. Ensure that linkerd-jaeger has been installed with all of the -correct setting or re-install as necessary. +This error indicates that the proxies running in the jaeger extension are not +healthy. Ensure that linkerd-jaeger has been installed with all of the correct +setting or re-install as necessary. ### √ jaeger extension proxies are up-to-date {#l5d-jaeger-proxy-cp-version} -This warning indicates the proxies running in the jaeger extension are -running an old version. We recommend downloading the latest linkerd-jaeger -and upgrading. +This warning indicates the proxies running in the jaeger extension are running +an old version. We recommend downloading the latest linkerd-jaeger and +upgrading. ### √ jaeger extension proxies and cli versions match {#l5d-jaeger-proxy-cli-version} @@ -1917,10 +1944,10 @@ Make sure that the `proxy-injector` is working correctly by running ## The "linkerd-buoyant" checks {#l5d-buoyant} -These checks only run when the `linkerd-buoyant` extension is installed. -This check is intended to verify the installation of linkerd-buoyant -extension which comprises `linkerd-buoyant` CLI, the `buoyant-cloud-agent` -Deployment, and the `buoyant-cloud-metrics` DaemonSet. +These checks only run when the `linkerd-buoyant` extension is installed. This +check is intended to verify the installation of linkerd-buoyant extension which +comprises `linkerd-buoyant` CLI, the `buoyant-cloud-agent` Deployment, and the +`buoyant-cloud-metrics` DaemonSet. ### √ Linkerd extension command linkerd-buoyant exists diff --git a/linkerd.io/content/2-edge/tasks/upgrade.md b/linkerd.io/content/2-edge/tasks/upgrade.md index 71fecedc61..dec5d3e1de 100644 --- a/linkerd.io/content/2-edge/tasks/upgrade.md +++ b/linkerd.io/content/2-edge/tasks/upgrade.md @@ -10,21 +10,31 @@ aliases = [ In this guide, we'll walk you through how to perform zero-downtime upgrades for Linkerd. +{{< note >}} + +This page contains instructions for upgrading to the latest edge release of +Linkerd. If you have installed a [stable distribution](/releases/#stable) of +Linkerd, the vendor may have alternative guidance on how to upgrade. You can +find more information about the different kinds of Linkerd releases on the +[Releases and Versions](/releases/) page. + +{{< /note >}} + Read through this guide carefully. 
Additionally, before starting a specific upgrade, please read through the version-specific upgrade notices below, which may contain important information about your version. -- [Upgrade notice: stable-2.14.0](#upgrade-notice-stable-2-14-0) -- [Upgrade notice: stable-2.13.0](#upgrade-notice-stable-2-13-0) -- [Upgrade notice: stable-2.12.0](#upgrade-notice-stable-2-12-0) -- [Upgrade notice: stable-2.11.0](#upgrade-notice-stable-2-11-0) -- [Upgrade notice: stable-2.10.0](#upgrade-notice-stable-2-10-0) -- [Upgrade notice: stable-2.9.4](#upgrade-notice-stable-2-9-4) -- [Upgrade notice: stable-2.9.3](#upgrade-notice-stable-2-9-3) -- [Upgrade notice: stable-2.9.0](#upgrade-notice-stable-2-9-0) +- [Upgrade notice: 2.15 and beyond](#upgrade-notice-stable-215-and-beyond) +- [Upgrade notice: stable-2.14.0](#upgrade-notice-stable-2140) +- [Upgrade notice: stable-2.13.0](#upgrade-notice-stable-2130) +- [Upgrade notice: stable-2.12.0](#upgrade-notice-stable-2120) +- [Upgrade notice: stable-2.11.0](#upgrade-notice-stable-2110) +- [Upgrade notice: stable-2.10.0](#upgrade-notice-stable-2100) ## Version numbering +### Stable releases + For stable releases, Linkerd follows a version numbering scheme of the form `2..`. In other words, "2" is a static prefix, followed by the major version, then the minor. @@ -33,6 +43,21 @@ Changes in minor versions are intended to be backwards compatible with the previous version. Changes in major version *may* introduce breaking changes, although we try to avoid that whenever possible. +### Edge releases + +For edge releases, Linkerd issues explicit [guidance about each +release](../../../releases/#edge-release-guidance). Be sure to consult this +guidance before installing any release artifact. + +{{< note >}} + +Edge releases are **not** semantically versioned; the edge release number +itself does not give you any assurance about breaking changes, +incompatibilities, etc. Instead, this information is available in the [release +notes](https://github.com/linkerd/linkerd2/releases). + +{{< /note >}} + ## Upgrade paths The following upgrade paths are generally safe. However, before starting a @@ -40,35 +65,54 @@ deploy, it is important to check the upgrade notes before proceeding—occasionally, specific minor releases may have additional restrictions. -**Within the same major version**. It is usually safe to upgrade to the latest -minor version within the same major version. In other words, if you are -currently running version *2.x.y*, upgrading to *2.x.z*, where *z* is the latest -minor version for major version *x*, is safe. This is true even if you would -skip intermediate intermediate minor versions, i.e. it is still safe even if *z -> y + 1*. +**Stable within the same major version**. It is usually safe to upgrade to the +latest minor version within the same major version. In other words, if you are +currently running version *2.x.y*, upgrading to *2.x.z*, where *z* is the +latest minor version for major version *x*, is safe. This is true even if you +would skip intermediate intermediate minor versions, i.e. it is still safe +even if *z* > *y + 1*. -**To the next major version**. It is usually safe to upgrade to the latest minor -version of the *next* major version. In other words, if you are currently -running version *2.x.y*, upgrading to *2.x + 1.w* will be safe, where *w* is the -latest minor version available for major version *x + 1*. +**Stable to the next major version**. It is usually safe to upgrade to the +latest minor version of the *next* major version. 
In other words, if you are +currently running version *2.x.y*, upgrading to *2.x + 1.w* will be safe, +where *w* is the latest minor version available for major version *x + 1*. -**To later major versions**. Upgrades that skip one or more major versions -are not supported. Instead, you should upgrade major versions incrementally. +**Stable to a later major version**. Upgrades that skip one or more major +versions are not supported. Instead, you should upgrade major versions +incrementally. -Again, please check the upgrade notes for the specific version you are upgrading -*to* for any version-specific caveats. +**Edge release to a later edge release**. This is generally safe unless +the `Cautions` for the later edge release indicate otherwise. + +Again, please check the upgrade notes or release guidance for the specific +version you are upgrading *to* for any version-specific caveats. ## Data plane vs control plane version skew -It is usually safe to run Linkerd's control plane with the data plane from one -major version earlier. (This skew is a natural consequence of upgrading.) This -is independent of minor version, i.e. a *2.x.y* data plane and a *2.x + 1.z* -control plane will work regardless of *y* and *z*. +Since a Linkerd upgrade always starts by upgrading the control plane, there is +a period during which the control plane is running the new version, but the +data plane is still running the older version. The extent to which this skew +can be supported depends on what kind of release you're running. Note that new +features introduced by the release may not be available for workloads with +older data planes. + +### Stable releases -Please check the version-specific upgrade notes before proceeding. +For stable releases, it is usually safe to upgrade one major version at a +time. This is independent of minor version, i.e. a *2.x.y* data plane and a +*2.x + 1.z* control plane will work regardless of *y* and *z*. Please check +the version-specific upgrade notes before proceeding. -Note that new features introduced by the release may not be available for -workloads with older data planes. +### Edge releases + +For edge releases, it is also usually safe to upgrade one major version at a +time. The major version of an edge release is included in the release notes +for each edge release: for example, `edge-24.4.1` is part of Linkerd 2.15, so +it should be safe to upgrade from `edge-24.4.1` to any edge release within +Linkerd 2.15 or Linkerd 2.16. + +For any situation where this is not the case, the edge release guidance will +have more information. ## Overall upgrade process @@ -88,14 +132,25 @@ of Linkerd is healthy, e.g. by using `linkerd check`. For major version upgrades, you should also ensure that your data plane is up-to-date, e.g. with `linkerd check --proxy`, to avoid unintentional version skew. +Make sure that your Linkerd version and Kubernetes version are compatible by +checking Linkerd's [supported Kubernetes +versions](../../reference/k8s-versions/). + ## Upgrading the CLI The CLI can be used to validate whether Linkerd was installed correctly. +### Stable releases + +Consult the upgrade instructions from the vendor supplying your stable release +for information about how to upgrade the CLI. 
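Whichever kind of release you're using, it can help to confirm which CLI
binary is currently on your `PATH`, and which client version it reports,
before replacing it; for example:

```bash
# Show which linkerd CLI binary is currently in use and its client version.
command -v linkerd
linkerd version --client
```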
+ +### Edge releases + To upgrade the CLI, run: ```bash -curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh +curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh ``` Alternatively, you can download the CLI directly via the [Linkerd releases @@ -109,7 +164,7 @@ linkerd version --client ## Upgrading the control plane -### With the Linkerd CLI +### Upgrading the control plane with the CLI For users who have installed Linkerd via the CLI, the `linkerd upgrade` command will upgrade the control plane. This command ensures that all of the control @@ -117,6 +172,9 @@ plane's existing configuration and TLS secrets are retained. Linkerd's CRDs should be upgraded first, using the `--crds` flag, followed by upgrading the control plane. +(If you are using a stable release, your vendor's upgrade instructions may +have more information.) + ```bash linkerd upgrade --crds | kubectl apply -f - linkerd upgrade | kubectl apply -f - @@ -129,7 +187,7 @@ present in the previous version but should not be present in this one. linkerd prune | kubectl delete -f - ``` -### With Helm +### Upgrading the control plane with Helm For Helm control plane installations, please follow the instructions at [Helm upgrade procedure](../install-helm/#helm-upgrade-procedure). @@ -230,6 +288,14 @@ version. ## Upgrade notices +This section contains release-specific information about upgrading. + +### Upgrade notice: stable-2.15 and beyond + +As of February 2024, the Linkerd project itself only produces [edge +release](/releases/) artifacts. The [Releases and Versions](/releases/) page +contains more information about the different kinds of Linkerd releases. + ### Upgrade notice: stable-2.14.0 For this release, if you're using the multicluster extension, you should re-link @@ -617,79 +683,3 @@ dropped, moving the config values underneath it into the root scope. Any values you had customized there will need to be migrated; in particular `identityTrustAnchorsPEM` in order to conserve the value you set during install." - -### Upgrade notice: stable-2.9.4 - -See upgrade notes for 2.9.3 below. - -### Upgrade notice: stable-2.9.3 - -#### Linkerd Repair - -Due to a known issue in versions stable-2.9.0, stable-2.9.1, and stable-2.9.2, -users who upgraded to one of those versions with the --prune flag (as described -above) will have deleted the `secret/linkerd-config-overrides` resource which is -necessary for performing any subsequent upgrades. Linkerd stable-2.9.3 includes -a new `linkerd repair` command which restores this deleted resource. If you see -unexpected error messages during upgrade such as "failed to read CA: not -PEM-encoded", please upgrade your CLI to stable-2.9.3 and run: - -```bash -linkerd repair | kubectl apply -f - -``` - -This will restore the `secret/linkerd-config-overrides` resource and allow you -to proceed with upgrading your control plane. - -### Upgrade notice: stable-2.9.0 - -#### Images are now hosted on ghcr.io - -As of this version images are now hosted under `ghcr.io` instead of `gcr.io`. If -you're pulling images into a private repo please make the necessary changes. - -#### Upgrading multicluster environments - -Linkerd 2.9 changes the way that some of the multicluster components work and -are installed compared to Linkerd 2.8.x. Users installing the multicluster -components for the first time with Linkerd 2.9 can ignore these instructions and -instead refer directly to the [installing -multicluster instructions](../installing-multicluster/). 
- -Users who installed the multicluster component in Linkerd 2.8.x and wish to -upgrade to Linkerd 2.9 should follow the [upgrade multicluster -instructions](/2.11/tasks/upgrade-multicluster/). - -#### Ingress behavior changes - -In previous versions when you injected your ingress controller (Nginx, Traefik, -Ambassador, etc), then the ingress' balancing/routing choices would be -overridden with Linkerd's (using service profiles, traffic splits, etc.). - -As of 2.9 the ingress's choices are honored instead, which allows preserving -things like session-stickiness. Note however that this means per-route metrics -are not collected, traffic splits will not be honored and retries/timeouts are -not applied. - -If you want to revert to the previous behavior, inject the proxy into the -ingress controller using the annotation `linkerd.io/inject: ingress`, as -explained in [using ingress](../using-ingress/) - -#### Breaking changes in Helm charts - -Some entries like `controllerLogLevel` and all the Prometheus config have -changed their position in the settings hierarchy. To get a precise view of what -has changed you can compare the -[stable-2.8.1](https://github.com/linkerd/linkerd2/blob/stable-2.8.1/charts/linkerd2/values.yaml) -and -[stable-2.9.0](https://github.com/linkerd/linkerd2/blob/stable-2.9.0/charts/linkerd2/values.yaml) -`values.yaml` files. - -#### Post-upgrade cleanup - -In order to better support cert-manager, the secrets -`linkerd-proxy-injector-tls`, `linkerd-sp-validator-tls` and `linkerd-tap-tls` -have been replaced by the secrets `linkerd-proxy-injector-k8s-tls`, -`linkerd-sp-validator-k8s-tls` and `linkerd-tap-k8s-tls` respectively. If you -upgraded through the CLI, please delete the old ones (if you upgraded through -Helm the cleanup was automated). diff --git a/linkerd.io/content/2-edge/tasks/using-ingress.md b/linkerd.io/content/2-edge/tasks/using-ingress.md index 853b7a9a20..8f96da0e34 100644 --- a/linkerd.io/content/2-edge/tasks/using-ingress.md +++ b/linkerd.io/content/2-edge/tasks/using-ingress.md @@ -69,7 +69,8 @@ details](#ingress-details) below. Common ingress options that Linkerd has been used with include: - [Ambassador (aka Emissary)](#ambassador) -- [Nginx](#nginx) +- [Nginx (community version)](#nginx-community-version) +- [Nginx (F5 NGINX version)](#nginx-f5-nginx-version) - [Traefik](#traefik) - [Traefik 1.x](#traefik-1x) - [Traefik 2.x](#traefik-2x) @@ -107,7 +108,11 @@ For a more detailed guide, we recommend reading [Installing the Emissary ingress with the Linkerd service mesh](https://buoyant.io/2021/05/24/emissary-and-linkerd-the-best-of-both-worlds/). -## Nginx +## Nginx (community version) + +This section refers to the Kubernetes community version +of the Nginx ingress controller +[kubernetes/ingress-nginx](https://github.com/kubernetes/ingress-nginx). Nginx can be meshed normally: it does not require the [ingress mode](#ingress-mode) annotation. @@ -160,6 +165,41 @@ Kubernetes resources: Setting the injection annotation at the namespace level would mesh the short-lived pod, which would prevent it from terminating as designed. +## Nginx (F5 NGINX version) + +This section refers to the Nginx ingress controller +developed and maintained by F5 NGINX +[nginxinc/kubernetes-ingress](https://github.com/nginxinc/kubernetes-ingress). + +This version of Nginx can also be meshed normally +and does not require the [ingress mode](#ingress-mode) annotation. 
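For example, one way to mesh it is to run the controller's Deployment through
`linkerd inject`, which adds the proxy injection annotation to its pod spec,
and then re-apply it. (The namespace and deployment names below are
illustrative; substitute the ones from your F5 NGINX installation.)

```bash
# Add the Linkerd proxy injection annotation to the ingress controller's
# pod template and re-apply the Deployment. Names here are illustrative.
kubectl get deployment nginx-ingress -n nginx-ingress -o yaml \
  | linkerd inject - \
  | kubectl apply -f -
```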
+ +The [VirtualServer/VirtualServerRoute CRD resource](https://docs.nginx.com/nginx-ingress-controller/configuration/virtualserver-and-virtualserverroute-resources/#virtualserverroute) +should be used instead of the `Ingress` resource (see +[this GitHub issue](https://github.com/nginxinc/kubernetes-ingress/issues/2529) +for more information). + +The `use-cluster-ip` field should be set to `true`. For example: + +```yaml +apiVersion: k8s.nginx.org/v1 +kind: VirtualServer +metadata: + name: emojivoto-web-ingress + namespace: emojivoto +spec: + ingressClassName: nginx + upstreams: + - name: web + service: web-svc + port: 80 + use-cluster-ip: true + routes: + - path: / + action: + pass: web +``` + ## Traefik Traefik should be meshed with [ingress mode enabled](#ingress-mode), i.e. with