ambex: failing to load envoy.json can result in "empty" snapshot being pushed to Envoy #5093

Closed
dethi opened this issue Jun 6, 2023 · 6 comments · Fixed by #5103
dethi commented Jun 6, 2023

Describe the bug
If ambex fails to load any of the Envoy config files, the file in question is completely ignored BUT the snapshot update continues. This causes ambex to save and push a new snapshot without the correct configuration. See https://github.com/emissary-ingress/emissary/blob/v3.5.2/pkg/ambex/main.go#L405-L409

For example, when envoy/envoy.json is skipped due to a loading error, Envoy gets reconfigured with none of our Listeners/Mappings/etc. Listeners then go into drain mode, and requests fail with various response flags (NC, UAEX, UH).

To Reproduce

Haven't reproduced locally (yet), but I would expect the following to work:

  1. Run ambex locally
  2. Add a valid config file in the directory being watched
  3. Validate that Ambex has pushed a snapshot and Envoy has loaded the config
  4. Mutate the file to make it invalid
  5. Validate that ambex failed to load the file, but still pushed a snapshot
  6. Validate that Envoy's loaded config is wrong (missing the configuration from the files)

Expected behavior
ambex should handle the error properly and not continue snapshot generation if it fails to load any of the given input files.

Versions (please complete the following information):

  • Emissary-Ingress: v3.5.2
  • GKE 1.25.8-gke.1000

Additional context

Logs showing the issue:

2023-06-05 22:42:29 diagd 3.5.2 [P31TAEW] INFO: EnvoyConfig: Generating V3
2023-06-05 22:42:34 diagd 3.5.2 [P31TAEW] INFO: V3Listener <V3Listener HTTP https-listener on 0.0.0.0:443 [XFP]>: generated ===========================
2023-06-05 22:42:35 diagd 3.5.2 [P31TAEW] INFO: V3Ready: ==== listen on 127.0.0.1:8006
time="2023-06-05 22:42:50.8810" level=warning msg="/ambassador/envoy/envoy.json: proto: unexpected EOF" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.update file="/go/pkg/ambex/main.go:407" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2023-06-05 22:42:50.8814" level=info msg="Saved snapshot v1165" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.csDump file="/go/pkg/ambex/main.go:350" CMD=entrypoint PID=1 THREAD=/ambex/main-loop
time="2023-06-05 22:42:51.0230" level=info msg="Pushing snapshot v1165" func=github.com/emissary-ingress/emissary/v3/pkg/ambex.updaterWithTicker file="/go/pkg/ambex/ratelimit.go:159" CMD=entrypoint PID=1 THREAD=/ambex/updater

Note: we run with envoy_validation_timeout=0 (disabled) because, with the size of our config (328.0MB), the validation uses too much memory and is too slow. In this particular error case (proto: unexpected EOF), it is possible that envoy/envoy.json was corrupted and that having the validation enabled would have caught this before it reached ambex. But the Decode() function has a lot of other cases where it could fail, like failing to read the file, which would end up in the same situation.

You can also see that, since ambex didn't return an error, diagd thinks we successfully reconfigured Envoy with all the expected config (S13379 L1 G9199 C8503):

2023-06-05 22:42:59 diagd 3.5.2 [P31TAEW] INFO: configuration updated (incremental) from snapshot snapshot (S13379 L1 G9199 C8503)

dethi commented Jun 6, 2023

  • If we return the err instead of continuing, it causes the ambex/main-loop goroutine to exit, but the other goroutines never exit, so the entrypoint continues happily without ever updating the configuration again. Need to open another bug for that one.

  • If we return nil instead of continuing, this works as I expect. Snapshot generation is skipped and Envoy keeps the valid running config. The next time diagd generates a new config, ambex will try again (see the sketch just after this list).

    • However, if the EndpointResolver is used and some pods targeted by a mapping/service are created/deleted, ambex will fail to generate a new snapshot until diagd has generated a full new config. This isn't great. I'm actually surprised that ambex re-reads the config and does all this work. Couldn't it just update the endpoints from the previous snapshot that is already in memory instead?
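
To make the second bullet concrete, here is a minimal Go sketch of the kind of guard I mean. It is not the actual emissary code: decodeConfigFile, pushSnapshot, and the Config type are hypothetical stand-ins for what pkg/ambex/main.go does around the linked lines.

package main

import (
	"context"
	"fmt"
	"log"
	"os"
)

// Config stands in for whatever ambex decodes out of envoy.json and the other
// input files; the real code decodes protobufs, this sketch just reads bytes.
type Config struct {
	Path string
	Raw  []byte
}

func decodeConfigFile(path string) (*Config, error) {
	raw, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", path, err)
	}
	return &Config{Path: path, Raw: raw}, nil
}

// pushSnapshot stands in for ambex's "build a snapshot and hand it to the xDS
// cache" step.
func pushSnapshot(ctx context.Context, configs []*Config) error {
	log.Printf("pushing snapshot built from %d config files", len(configs))
	return nil
}

// updateFromFiles shows the proposed handling: if any input file fails to
// load, skip this snapshot entirely (return nil) so Envoy keeps the last good
// configuration and the next reconfigure event retries.
func updateFromFiles(ctx context.Context, paths []string) error {
	configs := make([]*Config, 0, len(paths))
	for _, p := range paths {
		cfg, err := decodeConfigFile(p)
		if err != nil {
			log.Printf("%s: %v; keeping previous snapshot", p, err)
			return nil
		}
		configs = append(configs, cfg)
	}
	return pushSnapshot(ctx, configs)
}

func main() {
	_ = updateFromFiles(context.Background(), []string{"/ambassador/envoy/envoy.json"})
}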

@cindymullins-dw

Thanks for reporting, @dethi. If you are able to reproduce it, please do let us know. We suspect this is a one-off error, but we'll monitor for other reports of this, and agree better error handling is needed.


dethi commented Jun 8, 2023

Yes, I can reproduce this quite easily by manually corrupting the file and sending a signal to ambex. But you are probably not interested in that.

I think I found the actual root cause of the issue: the way diagd replaces envoy/envoy.json is not atomic/consistent. With a large config like ours (320MB), it takes a few seconds for diagd to write the config:

$ ./write-speed-test.py
Took 1.46s to write the config to disk
write-speed-test.py
#!/usr/bin/env python3
import json
from contextlib import contextmanager
from timeit import default_timer

import orjson


def dump_json(obj, pretty=False) -> str:
    # There's a nicer way to do this in python, I'm sure.
    if pretty:
        return bytes.decode(
            orjson.dumps(
                obj,
                option=orjson.OPT_NON_STR_KEYS
                | orjson.OPT_SORT_KEYS
                | orjson.OPT_INDENT_2,
            )
        )
    else:
        return bytes.decode(orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS))


@contextmanager
def elapsed_timer():
    # Yields a callable that returns the seconds elapsed so far; once the
    # block exits, the callable is frozen to the total duration.
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    end = default_timer()
    elapser = lambda: end - start


def test():
    with open("prod-envoy.json", "r") as f:
        config = json.load(f)

    with elapsed_timer() as elapsed:
        with open("/tmp/envoy.json", "w") as output:
            output.write(dump_json(config, pretty=True))

    print(f"Took {elapsed():.2f}s to write the config to disk")


if __name__ == "__main__":
    test()

Note: this test is on my laptop with an NVMe drive; on GKE with a network-attached drive it may be slower (I didn't check whether the speed was limited by the JSON dump generation or by the disk speed).

If during that time ambex tries to read the config file, it will fail (because the file on disk is only partially written) and will thus generate an "empty" snapshot (as described in my original report above). This can happen even if Envoy validation is enabled (the default).
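
The usual way to avoid this is to write the new config to a temporary file in the same directory and then rename it over envoy.json, which is the approach the fix below takes: rename within a filesystem is atomic, so a reader sees either the old complete file or the new complete one, never a partial write. A minimal Go sketch of the pattern (illustrative names only; the actual change may need to live on the diagd/Python side, since diagd is the writer):

package main

import (
	"os"
	"path/filepath"
)

// writeFileAtomic writes data to a temporary file in the destination's
// directory and then renames it into place, so a concurrent reader never
// observes a partially written file.
func writeFileAtomic(dest string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(dest), filepath.Base(dest)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // no-op after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	// Flush to disk before the rename so a crash can't leave a truncated file
	// behind the new name.
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmpName, perm); err != nil {
		return err
	}
	return os.Rename(tmpName, dest)
}

func main() {
	// Illustrative path only; in emissary the real file is /ambassador/envoy/envoy.json.
	if err := writeFileAtomic("/tmp/envoy.json", []byte(`{"resources": []}`), 0o644); err != nil {
		panic(err)
	}
}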

I think that this scenario can happen when ambex triggers an update because of the k8s/consul watcher. From what I understand of the code, there is some kind of in-memory snapshot that is kept for the fast update of endpoints, but I don't think this in-memory snapshot contains the full configuration (this is where I'm a little bit confused and would appreciate some explanation 🙏🏻). Thus, if it can't load envoy/envoy.json during a fast update, it will generate and push an incomplete snapshot.


dethi commented Jun 8, 2023

I need to clean up my branch tomorrow, but these are the proposed fixes:
master...aristanetworks:emissary:ambex-data-race-fix

dethi added a commit to aristanetworks/emissary that referenced this issue Jun 13, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>

LanceEa commented Jun 21, 2023

  • If we return the err instead of continuing, it causes the ambex/main-loop goroutine to exit, but the other goroutines never exit, so the entrypoint continues happily without ever updating the configuration again. Need to open another bug for that one.

I'm surprised at this behavior; I would have expected the pod to restart. Did you observe this?

The reason is that the goroutines are kicked off with dgroup, which is set to block in the main entrypoint until all goroutines have completed or an error occurs.

Started here: https://github.com/emissary-ingress/emissary/blob/c28d9fa4fc51774ae7e268af8b72e95be8fe484b/cmd/entrypoint/entrypoint.go#LL170C1-L170C1
Waiting here:

return group.Wait()

  • However, if the EndpointResolver is used and some pods targeted by a mapping/service are created/deleted, ambex will fail to generate a new snapshot until diagd has generated a full new config. This isn't great. I'm actually surprised that ambex re-reads the config and does all this work. Couldn't it just update the endpoints from the previous snapshot that is already in memory instead?

I would agree; I think there is an optimization opportunity here. It would just take some effort/refactoring to make sure it's implemented properly and doesn't introduce regressions/breaking changes.

I think that this scenario can happen when ambex triggers an update because of the k8s/consul watcher. From what I understand of the code, there is some kind of in-memory snapshot that is kept for the fast update of endpoints, but I don't think this in-memory snapshot contains the full configuration (this is where I'm a little bit confused and would appreciate some explanation 🙏🏻). Thus, if it can't load envoy/envoy.json during a fast update, it will generate and push an incomplete snapshot.

Yes, you are correct. The Golang layer takes care of watching resources and of xDS (ambex), while diagd, in the Python layer, has the main task of generating most of the Envoy configuration. It hasn't been refactored into the Golang layer, which is partially the reason for the syncing via files (I believe historically it was also done for debuggability, but I could be mistaken on this).

The fastpath snapshot can be triggered by the EndpointResolver and ConsulResolver to avoid needing to do a full diagd reconfigure, thus bypassing the Python layer. This in-memory snapshot is minimal and is just the endpoints needed to leverage EDS so the connection pools are not drained. The reason the full snapshot is needed is that we use Aggregated xDS ("ADS"), which sends it over every time. Now, I'm not so certain that statement is fully true and would need to experiment, but I think there is a potential optimization where, instead of loading the envoy.json file in this use case, the existing xDS snapshot could be enriched. I think it's doable, but working with the snapshot is not always the most fun. You could take a look here: https://github.com/envoyproxy/go-control-plane/blob/main/docs/cache.md#snapshots
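
As a rough illustration of that idea, here is a sketch against the go-control-plane v3 cache API (recent releases; older ones have different NewSnapshot/SetSnapshot signatures). This is not how ambex is structured today: the point is just that if the last good resources were kept in memory, a fastpath endpoint update could rebuild the snapshot by swapping in new EDS resources instead of re-reading envoy.json.

package snapshotsketch

import (
	"context"
	"fmt"

	"github.com/envoyproxy/go-control-plane/pkg/cache/types"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	resourcev3 "github.com/envoyproxy/go-control-plane/pkg/resource/v3"
)

// lastGood holds the resources from the last successful full reconfigure
// (what ambex currently rebuilds by re-reading envoy.json).
type lastGood struct {
	version   int
	resources map[resourcev3.Type][]types.Resource
}

// applyEndpointUpdate builds a new snapshot that reuses the stored listeners,
// clusters and routes and replaces only the EDS resources, then hands it to
// the snapshot cache for the given node.
func (lg *lastGood) applyEndpointUpdate(ctx context.Context, c cachev3.SnapshotCache, node string, endpoints []types.Resource) error {
	lg.version++
	res := make(map[resourcev3.Type][]types.Resource, len(lg.resources)+1)
	for t, rs := range lg.resources {
		res[t] = rs
	}
	res[resourcev3.EndpointType] = endpoints

	snap, err := cachev3.NewSnapshot(fmt.Sprintf("v%d", lg.version), res)
	if err != nil {
		return err
	}
	return c.SetSnapshot(ctx, node, snap)
}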

Hope this helps!


dethi commented Jun 22, 2023

I'm surprised at this behavior; I would have expected the pod to restart. Did you observe this?

Yes, I observed the behavior: the ambex/main-loop goroutine stopped but ambex continued to run. They are in the same group, but I'm not sure if the context is canceled when one of the goroutines returns an error. And if the context is cancelled, it probably means that one of them is not reacting to the cancellation and is blocking the exit.
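
For what it's worth, that is exactly the failure mode you get with any errgroup-style group when one worker ignores cancellation. A small stand-alone illustration using golang.org/x/sync/errgroup (not the dgroup emissary actually uses, but the semantics under discussion are the same: the first error cancels the shared context, yet Wait() still blocks until every worker returns):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	// Stand-in for ambex/main-loop: returns an error, which cancels ctx.
	g.Go(func() error {
		return errors.New("failed to load envoy.json")
	})

	// A well-behaved worker notices the cancellation and returns promptly.
	g.Go(func() error {
		<-ctx.Done()
		return ctx.Err()
	})

	// A worker that never checks ctx.Done() keeps running forever, so
	// g.Wait() below never returns and the process stays up without ever
	// reconfiguring again, which is the behavior observed above.
	g.Go(func() error {
		for {
			time.Sleep(time.Second)
		}
	})

	fmt.Println(g.Wait()) // blocks forever because of the third worker
}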

Hope this helps!

It does help, thanks for the explanation.

dethi added a commit to aristanetworks/emissary that referenced this issue Jul 25, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>
LanceEa pushed a commit that referenced this issue Aug 22, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix #5093

Signed-off-by: Thibault Deutsch <[email protected]>
LanceEa pushed a commit that referenced this issue Aug 23, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix #5093

Signed-off-by: Thibault Deutsch <[email protected]>
dethi added a commit to aristanetworks/emissary that referenced this issue Oct 11, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>
dethi added a commit to aristanetworks/emissary that referenced this issue Nov 7, 2023
Write the generated ADS config to a temporary file first, then rename
the file. This ensures that the update is atomic and that ambex won't
load a partially written file.

Fix emissary-ingress#5093

Signed-off-by: Thibault Deutsch <[email protected]>