ambex: failing to load envoy.json can result in "empty" snapshot being pushed to Envoy #5093
Comments
Thanks for reporting, @dethi. If you are able to reproduce it, please let us know. We suspect this is a one-off error, but we'll monitor for other reports of this, and we agree that better error handling is needed.
Yes I can reproduce this quite easily when manually corrupting the file and sending a signal to

I think I found the actual root cause of the issue: the way diagd replaces

$ ./write-speed-test.py
Took 1.46s to write the config to disk

write-speed-test.py

#!/usr/bin/env python3
import json
from contextlib import contextmanager
from timeit import default_timer

import orjson


def dump_json(obj, pretty=False) -> str:
    # There's a nicer way to do this in python, I'm sure.
    if pretty:
        return bytes.decode(
            orjson.dumps(
                obj,
                option=orjson.OPT_NON_STR_KEYS
                | orjson.OPT_SORT_KEYS
                | orjson.OPT_INDENT_2,
            )
        )
    else:
        return bytes.decode(orjson.dumps(obj, option=orjson.OPT_NON_STR_KEYS))


@contextmanager
def elapsed_timer():
    # Yields a callable that returns the seconds elapsed since entering the block.
    start = default_timer()
    elapser = lambda: default_timer() - start
    yield lambda: elapser()
    # Once the block exits, freeze the reported duration.
    end = default_timer()
    elapser = lambda: end - start


def test():
    with open("prod-envoy.json", "r") as f:
        config = json.load(f)

    # Time serialization and the write to disk together.
    with elapsed_timer() as elapsed:
        with open("/tmp/envoy.json", "w") as output:
            output.write(dump_json(config, pretty=True))

    print(f"Took {elapsed():.2f}s to write the config to disk")


if __name__ == "__main__":
    test()

Note: this test ran on my laptop with an NVMe drive; on GKE with a network-attached drive it may be slower (I didn't check whether the speed was limited by the JSON dump generation or by the disk speed).

If during that time

I think that this scenario can happen when
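The note above leaves open whether the time is spent generating the JSON or writing it to disk. A minimal sketch of how the two phases could be timed separately follows. It is an illustration only: it uses the standard library `json` module instead of `orjson`, the function name `time_dump_and_write` is hypothetical, and the sample data is a stand-in for the real production config.

```python
import json
import os
import tempfile
from timeit import default_timer


def time_dump_and_write(config, path):
    """Return (serialize_seconds, write_seconds) for writing config to path."""
    start = default_timer()
    payload = json.dumps(config, indent=2, sort_keys=True)
    serialized = default_timer()

    with open(path, "w") as output:
        output.write(payload)
        output.flush()
        os.fsync(output.fileno())  # count the time needed to reach the disk
    written = default_timer()

    return serialized - start, written - serialized


if __name__ == "__main__":
    # Stand-in data (assumption); a real run would load the production config.
    sample = {"listeners": [{"name": f"listener-{i}"} for i in range(1000)]}
    with tempfile.TemporaryDirectory() as tmp:
        dump_s, write_s = time_dump_and_write(sample, os.path.join(tmp, "envoy.json"))
    print(f"serialize: {dump_s:.4f}s, write: {write_s:.4f}s")
```

Either phase widens the window in which a reader could see a partially written file, so the split mostly matters for deciding where to optimize.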
I need to clean up my branch tomorrow, but these would be the proposed fixes:
Write the generated ADS config to a temporary file first, then rename the file. This ensures that the update is atomic and that ambex won't load a partially written file.
Fix emissary-ingress#5093
Signed-off-by: Thibault Deutsch <[email protected]>
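The write-then-rename approach described in the commit message can be sketched roughly as follows. This is not the code from the actual fix: the function name `write_json_atomically` is hypothetical, and the sketch assumes a POSIX filesystem where the temporary file and the target live in the same directory, which is what makes the final rename atomic.

```python
import json
import os
import tempfile


def write_json_atomically(obj, path):
    """Write obj as JSON to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target: a rename is only atomic
    # when source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(obj, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        # Readers see either the old file or the complete new one.
        os.replace(tmp_path, path)
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens the target path at any moment gets either the previous complete file or the new complete file, never a truncated one.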
I'm surprised at this behavior; I would have expected the pod to restart. Did you observe this? The reason is that the goroutines are kicked off with the started here: https://github.com/emissary-ingress/emissary/blob/c28d9fa4fc51774ae7e268af8b72e95be8fe484b/cmd/entrypoint/entrypoint.go#LL170C1-L170C1
emissary/cmd/entrypoint/entrypoint.go Line 216 in c28d9fa
I agree. I think there is an optimization opportunity here; it would just take some effort and refactoring to make sure it's implemented properly and doesn't introduce regressions or breaking changes.
Yes, you are correct. The Golang layer takes care of watching resources and xDS (ambex) while

The fastpath snapshot can be triggered by EndpointResolver and ConsulResolver to avoid needing a full diagd reconfigure, thus bypassing the Python layer. This in-memory snapshot is minimal and contains just the endpoints needed to leverage EDS, so the connection pools are not drained.

The full snapshot is needed because we use Aggregated xDS (“ADS”), which sends it over every time. Now, I'm not so certain that statement is fully true and would need to experiment, but I think there is a potential optimization where instead of loading the

Hope this helps!
Yes, I observed the behavior: goroutine
It does help, thanks for the explanation.
Describe the bug
If ambex fails to load any of the Envoy config files, the file in question is completely ignored BUT the new snapshot update continues. This causes ambex to save and push a new snapshot without the correct configuration. See https://github.com/emissary-ingress/emissary/blob/v3.5.2/pkg/ambex/main.go#L405-L409
For example, when envoy/envoy.json is skipped due to a loading error, this causes Envoy to be reconfigured with none of our Listeners/Mappings/etc. Listeners then go into drain mode, and requests fail with various response flags (NC, UAEX, UH).
To Reproduce
Haven't reproduced locally (yet), but I would expect the following to work:
Expected behavior
ambex should handle the error properly and not continue generating the snapshot if it failed to load any of the given input files.
Versions (please complete the following information):
Additional context
Logs showing the issue:
Note: we run with envoy_validation_timeout=0 (disabled) because with the size of our config (328.0MB), the validation uses too much memory and is too slow. In that particular error case, proto: unexpected EOF, it is possible that envoy/envoy.json was corrupted and that having the validation enabled would have caught this before it reached ambex. But the Decode() function has a lot of other cases where it could fail, like failing to read the file, which would end up in the same situation.
You can also see that since ambex didn't return an error, diagd thinks we successfully reconfigured Envoy with all the expected config (S13379 L1 G9199 C8503)
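The expected behavior described above, aborting the update when any input file fails to load and keeping the previous snapshot, could look something like the following sketch. ambex itself is written in Go, so this is a Python illustration of the idea only and not ambex code; the names `SnapshotLoadError`, `load_all_or_fail`, and `update_snapshot` are hypothetical, and plain JSON parsing stands in for the real decoding step.

```python
import json


class SnapshotLoadError(Exception):
    """Raised when any input file cannot be read or parsed."""


def load_all_or_fail(paths):
    """Load every config file, or raise without returning a partial result."""
    loaded = {}
    for path in paths:
        try:
            with open(path, "r") as f:
                loaded[path] = json.load(f)
        except (OSError, ValueError) as exc:
            # OSError covers read failures; ValueError covers truncated or
            # otherwise invalid JSON (json.JSONDecodeError is a subclass).
            raise SnapshotLoadError(f"failed to load {path}: {exc}") from exc
    return loaded


def update_snapshot(paths, current_snapshot):
    """Return (snapshot, ok); on any load failure keep the current snapshot."""
    try:
        return load_all_or_fail(paths), True
    except SnapshotLoadError as exc:
        print(f"keeping previous snapshot: {exc}")
        return current_snapshot, False
```

Returning an explicit failure flag also lets the caller report the failure upstream instead of assuming the reconfiguration succeeded.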