Update docs to use stat-inbound and stat-outbound (#1833)
* Copy 2.16 docs to 2-edge

Signed-off-by: Alex Leong <[email protected]>

* update docs to use stat-inbound and stat-outbound

Signed-off-by: Alex Leong <[email protected]>

* Update authorization-policy.md

This should be all lowercase.

---------

Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Flynn <[email protected]>
adleong and kflynn authored Oct 21, 2024
1 parent ea4e366 commit 8029b03
Showing 4 changed files with 113 additions and 136 deletions.
12 changes: 6 additions & 6 deletions linkerd.io/content/2-edge/features/telemetry.md
@@ -34,8 +34,8 @@ requiring any work on the part of the developer. These features include:

This data can be consumed in several ways:

-* Through the [Linkerd CLI](../../reference/cli/), e.g. with `linkerd viz stat` and
-  `linkerd viz routes`.
+* Through the [Linkerd CLI](../../reference/cli/), e.g. with `linkerd viz stat-inbound`
+  and `linkerd viz stat-outbound`.
* Through the [Linkerd dashboard](../dashboard/), and
[pre-built Grafana dashboards](../../tasks/grafana/).
* Directly from Linkerd's built-in Prometheus instance
@@ -47,17 +47,17 @@ This data can be consumed in several ways:
This is the percentage of successful requests during a time window (1 minute by
default).

-In the output of the command `linkerd viz routes -o wide`, this metric is split
-into EFFECTIVE_SUCCESS and ACTUAL_SUCCESS. For routes configured with retries,
+In the output of the command `linkerd viz stat-outbound`, this metric is shown
+for routes and for individual backends. For routes configured with retries,
the former calculates the percentage of success after retries (as perceived by
the client-side), and the latter before retries (which can expose potential
problems with the service).

### Traffic (Requests Per Second)

This gives an overview of how much demand is placed on the service/route. As
-with success rates, `linkerd viz routes -o wide` splits this metric into
-EFFECTIVE_RPS and ACTUAL_RPS, corresponding to rates after and before retries
+with success rates, `linkerd viz stat-outbound` splits this metric into
+route level and backend level, corresponding to rates after and before retries
respectively.
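
For example, a command along these lines prints both the route-level rows and
the per-backend rows in a single table (the namespace and deployment below are
placeholders, not taken from this page):

```bash
# Illustrative only: substitute your own namespace and workload.
# Route-level rows reflect rates after retries; backend rows reflect rates
# before retries.
linkerd viz stat-outbound -n emojivoto deploy/web
```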

### Latencies
193 changes: 86 additions & 107 deletions linkerd.io/content/2-edge/tasks/books.md
@@ -104,53 +104,43 @@ more details on how this works.)

## Debugging

-Let's use Linkerd to discover the root cause of this app's failures. Linkerd's
-proxy exposes rich metrics about the traffic that it processes, including HTTP
-response codes. The metric that we're interested is `outbound_http_route_backend_response_statuses_total`
-and will help us identify where HTTP errors are occuring. We can use the
-`linkerd diagnostics proxy-metrics` command to get proxy metrics. Pick one of
-your webapp pods and run the following command to get the metrics for HTTP 500
-responses:
+Let's use Linkerd to discover the root cause of this app's failures. We can use
+the `stat-inbound` command to see the success rate of the webapp deployment:

```bash
-linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
-| grep outbound_http_route_backend_response_statuses_total \
-| grep http_status=\"500\"
+linkerd viz -n booksapp stat-inbound deploy/webapp
+NAME SERVER ROUTE TYPE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
+webapp [default]:4191 [default] 100.00% 0.30 4ms 9ms 10ms
+webapp [default]:4191 probe 100.00% 0.60 0ms 1ms 1ms
+webapp [default]:7000 probe 100.00% 0.30 2ms 2ms 2ms
+webapp [default]:7000 [default] 75.66% 8.22 18ms 65ms 93ms
```

-This should return a metric that looks something like:

-```text
-outbound_http_route_backend_response_statuses_total{
-parent_group="core",
-parent_kind="Service",
-parent_namespace="booksapp",
-parent_name="books",
-parent_port="7002",
-parent_section_name="",
-route_group="",
-route_kind="default",
-route_namespace="",
-route_name="http",
-backend_group="core",
-backend_kind="Service",
-backend_namespace="booksapp",
-backend_name="books",
-backend_port="7002",
-backend_section_name="",
-http_status="500",
-error=""
-} 207
+This shows us inbound traffic statistics. In other words, we see that the webapp
+is receiving 8.22 requests per second on port 7000 and that only 75.66% of those
+requests are successful.

+To dig into this further and find the root cause, we can look at the webapp's
+outbound traffic. This will tell us about the requests that the webapp makes to
+other services.

+```bash
+linkerd viz -n booksapp stat-outbound deploy/webapp
+NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
+webapp books:7002 [default] 77.36% 7.95 25ms 48ms 176ms 0.00% 0.00%
+└──────────────────► books:7002 77.36% 7.95 15ms 44ms 64ms 0.00%
+webapp authors:7001 [default] 100.00% 3.53 26ms 72ms 415ms 0.00% 0.00%
+└──────────────────► authors:7001 100.00% 3.53 16ms 52ms 91ms 0.00%
```

-This counter tells us that the webapp pod received a total of 207 HTTP 500
-responses from the `books` Service on port 7002.
+We see that webapp sends traffic to both the books service and the authors
+service and that the problem seems to be with the traffic to the books service.

## HTTPRoute

-We know that the webapp component is getting 500s from the books component, but
-it would be great to narrow this down further and get per route metrics. To do
-this, we take advantage of the Gateway API and define a set of HTTPRoute
+We know that the webapp component is getting failures from the books component,
+but it would be great to narrow this down further and get per route metrics. To
+do this, we take advantage of the Gateway API and define a set of HTTPRoute
resources, each attached to the `books` Service by specifying it as their
`parent_ref`.
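
As a rough illustration of what such a resource can look like (this is a
sketch, not one of the manifests used in the guide; the actual HTTPRoutes are
in the folded portion of this diff, and the path, method, and API version shown
here are assumptions):

```bash
# Hypothetical sketch of an HTTPRoute attached to the books Service.
# The match (POST /books.json) is illustrative; `group: core` denotes the
# core API group (upstream Gateway API also accepts an empty string).
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: books-create
  namespace: booksapp
spec:
  parentRefs:
    - group: core
      kind: Service
      name: books
      namespace: booksapp
      port: 7002
  rules:
    - matches:
        - path:
            value: "/books.json"
          method: POST
EOF
```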

@@ -239,36 +229,19 @@ Notice that the `Accepted` and `ResolvedRefs` conditions are `True`.
[...]
```

-With those HTTPRoutes in place, we can look at the `outbound_http_route_backend_response_statuses_total`
-metric again, and see that the route labels have been populated:
+With those HTTPRoutes in place, we can look at the outbound stats again:

```bash
-linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
-| grep outbound_http_route_backend_response_statuses_total \
-| grep http_status=\"500\"
-```

-```text
-outbound_http_route_backend_response_statuses_total{
-parent_group="core",
-parent_kind="Service",
-parent_namespace="booksapp",
-parent_name="books",
-parent_port="7002",
-parent_section_name="",
-route_group="gateway.networking.k8s.io",
-route_kind="HTTPRoute",
-route_namespace="booksapp",
-route_name="books-create",
-backend_group="core",
-backend_kind="Service",
-backend_namespace="booksapp",
-backend_name="books",
-backend_port="7002",
-backend_section_name="",
-http_status="500",
-error=""
-} 212
+linkerd viz -n booksapp stat-outbound deploy/webapp
+NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
+webapp authors:7001 [default] 100.00% 2.80 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► authors:7001 100.00% 2.80 16ms 45ms 49ms 0.00%
+webapp books:7002 books-list HTTPRoute 100.00% 1.43 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 1.43 12ms 24ms 25ms 0.00%
+webapp books:7002 books-create HTTPRoute 54.27% 2.73 27ms 207ms 441ms 0.00% 0.00%
+└─────────────────────► books:7002 54.27% 2.73 14ms 152ms 230ms 0.00%
+webapp books:7002 books-delete HTTPRoute 100.00% 0.72 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 0.72 12ms 24ms 25ms 0.00%
```

This tells us that it is requests to the `books-create` HTTPRoute which have
@@ -287,37 +260,54 @@ kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/http=5xx
```
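
If you want to confirm that the annotation landed on the route, one option (not
part of the original walkthrough) is to read it back with kubectl:

```bash
# Illustrative verification step; prints the route's annotations, which should
# now include retry.linkerd.io/http=5xx.
kubectl -n booksapp get httproutes.gateway.networking.k8s.io books-create \
  -o jsonpath='{.metadata.annotations}'
```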

-We can then see the effect of these retries by looking at Linkerd's retry
-metrics:
+We can then see the effect of these retries:

```bash
-linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
-| grep outbound_http_route_backend_response_statuses_total \
-| grep retry
-```

-```text
-outbound_http_route_retry_limit_exceeded_total{...} 222
-outbound_http_route_retry_overflow_total{...} 0
-outbound_http_route_retry_requests_total{...} 469
-outbound_http_route_retry_successes_total{...} 247
+linkerd viz -n booksapp stat-outbound deploy/webapp
+NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
+webapp books:7002 books-create HTTPRoute 73.17% 2.05 98ms 460ms 492ms 0.00% 34.22%
+└─────────────────────► books:7002 48.13% 3.12 29ms 93ms 99ms 0.00%
+webapp books:7002 books-list HTTPRoute 100.00% 1.50 25ms 48ms 49ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 1.50 12ms 24ms 25ms 0.00%
+webapp books:7002 books-delete HTTPRoute 100.00% 0.73 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 0.73 12ms 24ms 25ms 0.00%
+webapp authors:7001 [default] 100.00% 2.98 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► authors:7001 100.00% 2.98 16ms 44ms 49ms 0.00%
```

-This tells us that Linkerd made a total of 469 retry requests, of which 247 were
-successful. The remaining 222 failed and could not be retried again, since we
-didn't raise the retry limit from its default of 1.
+Notice that while individual requests to the books backend on the `books-create`
+route only have a success rate of about 50%, the overall success rate on that
+route has been raised to 73% due to retries. We can also see that 34.22% of the
+requests on this route are retries and that the improved success rate has come
+at the expense of additional RPS to the backend and increased overall latency.

-We can improve this further by increasing this limit to allow more than 1 retry
+By default, Linkerd will only attempt 1 retry per failure. We can improve
+success rate further by increasing this limit to allow more than 1 retry
per request:

```bash
kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
retry.linkerd.io/limit=3
```

-Over time you will see `outbound_http_route_retry_requests_total` and
-`outbound_http_route_retry_successes_total` increase at a much higher rate than
-`outbound_http_route_retry_limit_exceeded_total`.
+Looking at the stats again:

+```bash
+linkerd viz -n booksapp stat-outbound deploy/webapp
+NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
+webapp books:7002 books-delete HTTPRoute 100.00% 0.75 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 0.75 12ms 24ms 25ms 0.00%
+webapp authors:7001 [default] 100.00% 2.92 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► authors:7001 100.00% 2.92 18ms 46ms 49ms 0.00%
+webapp books:7002 books-create HTTPRoute 92.78% 1.62 111ms 461ms 492ms 0.00% 47.28%
+└─────────────────────► books:7002 48.91% 3.07 42ms 179ms 236ms 0.00%
+webapp books:7002 books-list HTTPRoute 100.00% 1.45 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 1.45 12ms 24ms 25ms 0.00%
+```

+We see that these additional retries have increased the overall success rate on
+this route to 92.78%.

## Timeouts

@@ -337,30 +327,19 @@ getting so many that it's hard to see what's going on!)
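
(The annotation that sets this timeout lives in the folded portion of the diff
above. For orientation, applying a per-request timeout to the route looks
roughly like the sketch below; the duration is illustrative, not the value used
in the guide.)

```bash
# Hypothetical sketch: the timeout.linkerd.io/request annotation sets a
# per-request timeout on the books-create route. The 25ms value is an
# assumption for illustration only.
kubectl -n booksapp annotate httproutes.gateway.networking.k8s.io/books-create \
  timeout.linkerd.io/request=25ms
```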
We can see the effects of this timeout by running:

```bash
-linkerd diagnostics proxy-metrics -n booksapp po/webapp-pod-here \
-| grep outbound_http_route_request_statuses_total | grep books-create
+linkerd viz -n booksapp stat-outbound deploy/webapp
+NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
+webapp authors:7001 [default] 100.00% 2.85 26ms 49ms 370ms 0.00% 0.00%
+└─────────────────────► authors:7001 100.00% 2.85 19ms 49ms 86ms 0.00%
+webapp books:7002 books-create HTTPRoute 78.90% 1.82 45ms 449ms 490ms 21.10% 47.34%
+└─────────────────────► books:7002 41.55% 3.45 24ms 134ms 227ms 11.11%
+webapp books:7002 books-list HTTPRoute 100.00% 1.40 25ms 47ms 49ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 1.40 12ms 24ms 25ms 0.00%
+webapp books:7002 books-delete HTTPRoute 100.00% 0.70 25ms 48ms 50ms 0.00% 0.00%
+└─────────────────────► books:7002 100.00% 0.70 12ms 24ms 25ms 0.00%
```

-```text
-outbound_http_route_request_statuses_total{
-[...]
-route_name="books-create",
-http_status="",
-error="REQUEST_TIMEOUT"
-} 151
-outbound_http_route_request_statuses_total{
-[...]
-route_name="books-create",
-http_status="201",
-error=""
-} 5548
-outbound_http_route_request_statuses_total{
-[...]
-route_name="books-create",
-http_status="500",
-error=""
-} 3194
-```
+We see that 21.10% of the requests are hitting this timeout.

## Clean Up

40 changes: 19 additions & 21 deletions linkerd.io/content/2-edge/tasks/fault-injection.md
@@ -53,17 +53,20 @@ After a little while, the stats will show 100% success rate. You can verify this
by running:

```bash
-linkerd viz -n booksapp stat deploy
+linkerd viz -n booksapp stat-inbound deploy
```

The output will end up looking a little like:

```bash
-NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
-authors 1/1 100.00% 7.1rps 4ms 26ms 33ms 6
-books 1/1 100.00% 8.6rps 6ms 73ms 95ms 6
-traffic 1/1 - - - - - -
-webapp 3/3 100.00% 7.9rps 20ms 76ms 95ms 9
+NAME SERVER ROUTE TYPE SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99
+authors [default]:4191 [default] 100.00% 0.20 0ms 1ms 1ms
+authors [default]:7001 [default] 100.00% 3.00 2ms 36ms 43ms
+books [default]:4191 [default] 100.00% 0.23 4ms 4ms 4ms
+books [default]:7002 [default] 100.00% 3.60 2ms 2ms 2ms
+traffic [default]:4191 [default] 100.00% 0.22 0ms 3ms 1ms
+webapp [default]:4191 [default] 100.00% 0.72 4ms 5ms 1ms
+webapp [default]:7000 [default] 100.00% 3.25 2ms 2ms 65ms
```

## Create the faulty backend
@@ -182,25 +185,20 @@ for details.
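
(The weighted HTTPRoute that produces this split is defined in the folded
portion of the diff above. As a rough, illustrative sketch, not the guide's
actual manifest, such a split can be expressed like this:)

```bash
# Hypothetical sketch of a 9:1 traffic split between the real books backend
# and the error injector. Route name, API version, and weights are assumptions;
# the error-injector port (8080) comes from the stat output below.
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: error-split
  namespace: booksapp
spec:
  parentRefs:
    - group: core
      kind: Service
      name: books
      namespace: booksapp
      port: 7002
  rules:
    - backendRefs:
        - name: books
          port: 7002
          weight: 9
        - name: error-injector
          port: 8080
          weight: 1
EOF
```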

When Linkerd sees traffic going to the `books` service, it will send 9/10
requests to the original service and 1/10 to the error injector. You can see
-what this looks like by running `stat` and filtering explicitly to just the
-requests from `webapp`:
+what this looks like by running `stat-outbound`:

```bash
-linkerd viz stat -n booksapp deploy --from deploy/webapp
-NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
-authors 1/1 98.15% 4.5rps 3ms 36ms 39ms 3
-books 1/1 100.00% 6.7rps 5ms 27ms 67ms 6
-error-injector 1/1 0.00% 0.7rps 1ms 1ms 1ms 3
+linkerd viz stat-outbound -n booksapp deploy/webapp
+NAME SERVICE ROUTE TYPE BACKEND SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TIMEOUTS RETRIES
+webapp authors:7001 [default] 98.44% 4.28 25ms 47ms 50ms 0.00% 0.00%
+└────────────────────► authors:7001 98.44% 4.28 15ms 42ms 48ms 0.00%
+webapp books:7002 error-split HTTPRoute 87.76% 7.22 26ms 49ms 333ms 0.00% 0.00%
+├────────────────────► books:7002 100.00% 6.33 14ms 42ms 83ms 0.00%
+└────────────────────► error-injector:8080 0.00% 0.88 12ms 24ms 25ms 0.00%
```

-We can also look at the success rate of the `webapp` overall to see the effects
-of the error injector. The success rate should be approximately 90%:

-```bash
-linkerd viz stat -n booksapp deploy/webapp
-NAME MESHED SUCCESS RPS LATENCY_P50 LATENCY_P95 LATENCY_P99 TCP_CONN
-webapp 3/3 88.42% 9.5rps 14ms 37ms 75ms 10
-```
+We can see here that 0.88 requests per second are being sent to the error
+injector and that the overall success rate is 87.76%.

## Cleanup

4 changes: 2 additions & 2 deletions linkerd.io/content/2-edge/tasks/multicluster.md
@@ -383,10 +383,10 @@ You'll see the `greeting from east` message! Requests from the `frontend` pod
running in `west` are being transparently forwarded to `east`. Assuming that
you're still port forwarding from the previous step, you can also reach this
with `curl http://localhost:8080/east`. Make that call a couple times and
-you'll be able to get metrics from `linkerd viz stat` as well.
+you'll be able to get metrics from `linkerd viz stat-outbound` as well.

```bash
-linkerd --context=west -n test viz stat --from deploy/frontend svc
+linkerd --context=west -n test viz stat-outbound deploy/frontend
```

We also provide a Grafana dashboard to get a feel for what's going on here (see
