ambex: failing to load envoy.json can result in "empty" snapshot being pushed to Envoy #5093

Closed
dethi opened this issue Jun 6, 2023 · 6 comments · Fixed by #5103
dethi commented Jun 6, 2023

Describe the bug
If ambex fails to load any of the Envoy config files, the file in question is completely ignored BUT the snapshot update continues. This causes ambex to save and push a new snapshot without the correct configuration. See https://github.com/emissary-ingress/emissary/blob/v3.5.2/pkg/ambex/main.go#L405-L409

For example, when envoy/envoy.json is skipped due to a loading error, Envoy gets reconfigured with none of our Listeners/Mappings/etc. Listeners then go into drain mode, and requests fail with various response flags (NC, UAEX, UH).

To Reproduce

Haven't reproduced locally (yet), but I would expect the following to work:

  1. Run ambex locally
  2. Add a valid config file in the directory being watched
  3. Validate that Ambex has pushed a snapshot and Envoy has loaded the config
  4. Mutate the file to make it invalid
  5. Validate that ambex failed to load the file, but still pushed a snapshot
  6. Validate that Envoy's loaded config is wrong (missing the configuration from the files)

Expected behavior
ambex should handle the error properly and not continue snapshot generation if it fails to load any of the given input files.

Versions (please complete the following information):

  • Emissary-Ingress: v3.5.2
  • GKE 1.25.8-gke.1000

Additional context

Logs showing the issue:

2023-06-05 22:42:29 diagd 3.5.2 [P31TAEW] INFO: EnvoyConfig: Generating V3
2023-06-05 22:42:34 diagd 3.5.2 [P31TAEW] INFO: V3Listener <V3Listener HTTP https-listener on 0.0.0.0:443 [XFP]>: generated ===========================
2023-06-05 22:42:35 diagd 3.5.2 [P31TAEW] INFO: V3Ready: ==== listen on 127.0.0.1:8006
time="2023-06-05 22:42:50.8810" level=warning msg="/ambassador/envoy/envoy.json: proto: unexpected EOF" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.update file="/go/pkg/ambex/main.go:407" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2023-06-05 22:42:50.8814" level=info msg="Saved snapshot v1165" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.csDump file="/go/pkg/ambex/main.go:350" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2023-06-05 22:42:51.0230" level=info msg="Pushing snapshot v1165" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex/updater

Note: we run with envoy_validation_timeout=0 (disabled) because, with the size of our config (328.0MB), the validation uses too much memory and is too slow. In this particular error case (proto: unexpected EOF), it is possible that envoy/envoy.json was corrupted and that having the validation enabled would have caught this before it reached ambex. But the Decode() function has a lot of other cases where it could fail, like failing to read the file, which would end up in the same situation.

You can also see that, since ambex didn't return an error, diagd thinks we successfully reconfigured Envoy with all the expected config (S13379 L1 G9199 C8503):

2023-06-05 22:42:59 diagd 3.5.2 [P31TAEW] INFO: configuration updated (incremental) from snapshot snapshot (S13379 L1 G9199 C8503)

dethi commented Jun 6, 2023

  • If we return the err instead of continuing, it causes the ambex/main-loop goroutine to exit, but the other goroutines never exit, so the entrypoint continues happily without ever updating the configuration again. Need to open another bug for that one.

  • If we return nil instead of continuing, this works as I expect. Snapshot generation is skipped and Envoy keeps the valid running config. The next time diagd generates a new config, ambex will try again (see the sketch just after this list).

    • However, if the EndpointResolver is used and some pods targeted by a mapping/service are created/deleted, ambex will fail to generate a new snapshot until diagd has generated a full new config. This isn't great. I'm actually surprised that ambex re-reads the config and does all this work. Couldn't it just update the endpoints from the previous snapshot that is already in memory instead?
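
To make the second bullet concrete, here is a minimal Go sketch of the kind of guard I mean. It is not the actual emissary code: decodeConfigFile, pushSnapshot, and the Config type are hypothetical stand-ins for what pkg/ambex/main.go does around the linked lines.

package main

import (
	"context"
	"fmt"
	"log"
	"os"
)

// Config stands in for whatever ambex decodes out of envoy.json and the other
// input files; the real code decodes protobufs, this sketch just reads bytes.
type Config struct {
	Path string
	Raw  []byte
}

func decodeConfigFile(path string) (*Config, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", path, err)
	}
	return &Config{Path: path, Raw: raw}, nil
}

// pushSnapshot stands in for ambex's "build a snapshot and hand it to the xDS
// cache" step.
func pushSnapshot(ctx context.Context, configs []*Config) error {
	log.Printf("pushing snapshot built from %d config files", len(configs))
	return nil
}

// updateFromFiles shows the proposed handling: if any input file fails to
// load, skip this snapshot entirely (return nil) so Envoy keeps the last good
// configuration and the next reconfigure event retries.
func updateFromFiles(ctx context.Context, paths []string) error {
	configs := make([]*Config, 0, len(paths))
	for _, p := range paths {
		cfg, err := decodeConfigFile(p)
		if err != nil {
			log.Printf("%s: %v; keeping previous snapshot", p, err)
			return nil
		}
		configs = append(configs, cfg)
	}
	return pushSnapshot(ctx, configs)
}

func main() {
	_ = updateFromFiles(context.Background(), []string{"/ambassador/envoy/envoy.json"})
}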

@cindymullins-dw

Thanks for reporting, @dethi. If you are able to reproduce it, please do let us know. We suspect this is a one-off error, but we'll monitor for other reports of this, and agree better error handling is needed.


dethi commented Jun 8, 2023

Yes, I can reproduce this quite easily by manually corrupting the file and sending a signal to ambex. But you are probably not interested in that.

I think I found the actual root cause of the issue: the way diagd replaces envoy/envoy.json is not atomic/consistent. With a large config like ours (320MB), it takes a few seconds for diagd to write the config:

$ ./write-speed-test.py
Took 1.46s to write the config to disk
write-speed-test.py
#!/usr/bin/env python3
import json
from contextlib import contextmanager
from timeit import default_timer

import orjson


def dump_json(obj, pretty=False) -> str:
    # There's a nicer way to do this in python, I'm sure.
    if pretty:
        return bytes.decode(
            orjson.dumps(
                obj,
                option=orjson.OPT_NON_STR_KEYS
                | orjson.OPT_SORT_KEYS
                | orjson.OPT_INDENT_2,
            )
        )
    else:
        return bytes.decode(orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS))


@contextmanager
def elapsed_timer():
    # Yields a callable that returns the seconds elapsed so far; once the
    # block exits, the callable is frozen to the total duration.
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end - start


def test():
    with open("prod-envoy.json", "r") as f:
        config = json.load(f)

    with elapsed_timer() as elapsed:
        with open("/tmp/envoy.json", "w") as output:
            output.write(dump_json(config, pretty=True))

    print(f"Took {elapsed():.2f}s to write the config to disk")


if __name__ == "__main__":
    test()

Note: this test is on my laptop with an NVMe drive; on GKE with a network-attached drive it may be slower (I didn't check whether the speed was limited by the JSON dump generation or by the disk speed).

If during that time ambex tries to read the config file, it will fail (because the file on disk is only partially written) and will thus generate an "empty" snapshot (as described in my original report above). This can happen even if Envoy validation is enabled (the default).
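
The usual way to avoid this is to write the new config to a temporary file in the same directory and then rename it over envoy.json, which is the approach the fix below takes: rename within a filesystem is atomic, so a reader sees either the old complete file or the new complete one, never a partial write. A minimal Go sketch of the pattern (illustrative names only; the actual change may need to live on the diagd/Python side, since diagd is the writer):

package main

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the destination's
// directory and then renames it into place, so a concurrent reader never
// observes a partially written file.
func writeFileAtomic(dest string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(dest), filepath.Base(dest)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // no-op after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush to disk before the rename so a crash can't leave a truncated file
	// behind the new name.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return err
	}
	return os.Rename(tmpName, dest)
}

func main() {
	// Illustrative path only; in emissary the real file is /ambassador/envoy/envoy.json.
	if err := writeFileAtomic("/tmp/envoy.json", []byte(`{"resources": []}`), 0o644); err != nil {
		panic(err)
	}
}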

I think that this scenario can happen when ambex triggers an update because of the k8s/consul watcher. From what I understand of the code, there is some kind of in-memory snapshot that is kept for the fast update of endpoints, but I don't think this in-memory snapshot contains the full configuration (this is where I'm a little bit confused and would appreciate some explanation 🙏🏻). Thus, if it can't load envoy/envoy.json during a fast update, it will generate and push an incomplete snapshot.


dethi commented Jun 8, 2023

I need to clean up my branch tomorrow, but these are the proposed fixes:
master...aristanetworks:emissary:ambex-data-race-fix

dethi added a commit to aristanetworks/emissary that referenced this issue Jun 13, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>

LanceEa commented Jun 21, 2023

  • If we return the err instead of continuing, it causes the ambex/main-loop goroutine to exit, but the other goroutines never exit, so the entrypoint continues happily without ever updating the configuration again. Need to open another bug for that one.

I'm surprised at this behavior; I would have expected the pod to restart. Did you observe this?

The reason is that the goroutines are kicked off with dgroup, which is set to block in the main entrypoint until all goroutines have completed or an error occurs.

Started here: https://github.com/emissary-ingress/emissary/blob/c28d9fa4fc51774ae7e268af8b72e95be8fe484b/cmd/entrypoint/entrypoint.go#LL170C1-L170C1
Waiting here:

return group.Wait()

  • However, if the EndpointResolver is used and some pods targeted by a mapping/service are created/deleted, ambex will fail to generate a new snapshot until diagd has generated a full new config. This isn't great. I'm actually surprised that ambex re-reads the config and does all this work. Couldn't it just update the endpoints from the previous snapshot that is already in memory instead?

I would agree; I think there is an optimization opportunity here. It would just take some effort/refactoring to make sure it's implemented properly and doesn't introduce regressions/breaking changes.

I think that this scenario can happen when ambex triggers an update because of the k8s/consul watcher. From what I understand of the code, there is some kind of in-memory snapshot that is kept for the fast update of endpoints, but I don't think this in-memory snapshot contains the full configuration (this is where I'm a little bit confused and would appreciate some explanation 🙏🏻). Thus, if it can't load envoy/envoy.json during a fast update, it will generate and push an incomplete snapshot.

Yes, you are correct. The Golang layer takes care of watching resources and of xDS (ambex), while diagd, in the Python layer, has the main task of generating most of the Envoy configuration. It hasn't been refactored into the Golang layer, which is partially the reason for the syncing via files (I believe historically it was also done for debuggability, but I could be mistaken on this).

The fastpath snapshot can be triggered by the EndpointResolver and ConsulResolver to avoid needing to do a full diagd reconfigure, thus bypassing the Python layer. This in-memory snapshot is minimal and is just the endpoints needed to leverage EDS so the connection pools are not drained. The reason the full snapshot is needed is that we use Aggregated xDS ("ADS"), which sends it over every time. Now, I'm not so certain that statement is fully true and would need to experiment, but I think there is a potential optimization where, instead of loading the envoy.json file in this use case, the existing xDS snapshot could be enriched. I think it's doable, but working with the snapshot is not always the most fun. You could take a look here: https://github.com/envoyproxy/go-control-plane/blob/main/docs/cache.md#snapshots
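
As a rough illustration of that idea, here is a sketch against the go-control-plane v3 cache API (recent releases; older ones have different NewSnapshot/SetSnapshot signatures). This is not how ambex is structured today: the point is just that if the last good resources were kept in memory, a fastpath endpoint update could rebuild the snapshot by swapping in new EDS resources instead of re-reading envoy.json.

package snapshotsketch

import (
	"context"
	"fmt"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// lastGood holds the resources from the last successful full reconfigure
// (what ambex currently rebuilds by re-reading envoy.json).
type lastGood struct {
	version   int
	resources map[resourcev3.Type][]types.Resource
}

// applyEndpointUpdate builds a new snapshot that reuses the stored listeners,
// clusters and routes and replaces only the EDS resources, then hands it to
// the snapshot cache for the given node.
func (lg *lastGood) applyEndpointUpdate(ctx context.Context, c cachev3.SnapshotCache, node string, endpoints []types.Resource) error {
	lg.version++
	res := make(map[resourcev3.Type][]types.Resource, len(lg.resources)+1)
	for t, rs := range lg.resources {
		res[t] = rs
	}
	res[resourcev3.EndpointType] = endpoints

	snap, err := cachev3.NewSnapshot(fmt.Sprintf("v%d", lg.version), res)
	if err != nil {
		return err
	}
	return c.SetSnapshot(ctx, node, snap)
}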

Hope this helps!


dethi commented Jun 22, 2023

I'm surprised at this behavior; I would have expected the pod to restart. Did you observe this?

Yes, I observed the behavior: the ambex/main-loop goroutine stopped but ambex continued to run. They are in the same group, but I'm not sure if the context is canceled when one of the goroutines returns an error. And if the context is cancelled, it probably means that one of them is not reacting to the cancellation and is blocking the exit.
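
For what it's worth, that is exactly the failure mode you get with any errgroup-style group when one worker ignores cancellation. A small stand-alone illustration using golang.org/x/sync/errgroup (not the dgroup emissary actually uses, but the semantics under discussion are the same: the first error cancels the shared context, yet Wait() still blocks until every worker returns):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// Stand-in for ambex/main-loop: returns an error, which cancels ctx.
	g.Go(func() error {
		return errors.New("failed to load envoy.json")
	})

	// A well-behaved worker notices the cancellation and returns promptly.
	g.Go(func() error {
		<-ctx.Done()
		return ctx.Err()
	})

	// A worker that never checks ctx.Done() keeps running forever, so
	// g.Wait() below never returns and the process stays up without ever
	// reconfiguring again, which is the behavior observed above.
	g.Go(func() error {
		for {
			time.Sleep(time.Second)
		}
	})

	fmt.Println(g.Wait()) // blocks forever because of the third worker
}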

Hope this helps!

It does help, thanks for the explanation.

dethi added a commit to aristanetworks/emissary that referenced this issue Jul 25, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>
LanceEa pushed a commit that referenced this issue Aug 22, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix #5093

Signed-off-by: Thibault Deutsch <[email protected]>
LanceEa pushed a commit that referenced this issue Aug 23, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix #5093

Signed-off-by: Thibault Deutsch <[email protected]>
dethi added a commit to aristanetworks/emissary that referenced this issue Oct 11, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>
dethi added a commit to aristanetworks/emissary that referenced this issue Nov 7, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>