feat(parser) add flag to log diff on failure #991
Conversation
Why can't this be done at debug level?
Force-pushed from cff38b5 to 2e61ef3.
Rework a bit to create a temp dir at start and use that, since the linter doesn't like static paths under /tmp, and that's a realistic concern in container-less dev runs. The rationale for the additional config option is twofold:
Since we may ask users to enable debug logs for problems that don't require full state dumps, I think it's reasonable to add a separate flag for it. My end goal here is to provide a simpler end-user diagnostic experience while intentionally gating it behind something that you wouldn't normally enable for other reasons. Currently, getting state info out of the controller requires either hooking up a Go debugger (to inspect in-memory state) or a tcpdump sidecar (to inspect state as the controller sends it to the admin API over the network). Those have the same sensitive data concerns alongside usability hurdles, because they're general-purpose tools and require knowledge of how our systems work to extract data from them effectively. "Enable flag, copy contents of temp directory as problem occurs, ship results to us" are hopefully simpler instructions than "eh, just tcpdump it!", and dumping state further gets us something we can easily use to reproduce a problem offline in a test environment (since it's the format that deck and DB-less use).
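As a rough sketch of the temp-dir setup described above, the directory could be created once at startup with os.MkdirTemp instead of hard-coding a path under /tmp; the names below are illustrative, not necessarily the PR's.

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Create a per-process dump directory rather than using a static /tmp path.
	dumpDir, err := os.MkdirTemp("", "kic-config-dump-")
	if err != nil {
		log.Fatalf("could not create config dump directory: %v", err)
	}
	log.Printf("writing failed config dumps to %s", dumpDir)
	// dumpDir would then be passed down to the controller so every dump for
	// this process lands in one known location.
}
```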
To be clear, I'm not against adding this feature itself. This is certainly useful for debugging but I'm not convinced that this needs to be behind a new configuration flag. This should be logged at debug-level. Debug-level can have sensitive info in it and of course, one shouldn't use it in production.
In practice, we may need to use this in production sometimes. It's not always possible to replicate an unexplained issue in a test environment. It arguably should be, but the ease of doing so depends on tooling we don't control and is prone to user error. Furthermore, even when replicating production state in a test environment is possible, a perfect replica will still contain production's sensitive values.

Reworked further to instead make this part of debug-level logging, but redact certificate keys and credential secrets by default, only showing them if you further pass --log-sensitive-fields. The implementation overwrites the fields in question in state in-place after attempting to apply it. This should be fine since we generate state from scratch on every update anyway, so the mangled data only goes to the diff log.
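As a rough illustration of that in-place redaction approach, here is a minimal, self-contained sketch using stand-in structs rather than KIC's actual state types; every type, field, and function name below is an assumption made for the example.

```go
package main

import "fmt"

// Stand-ins for the certificate and credential entries in the generated state.
type Certificate struct {
	Cert string
	Key  string
}

type Credential struct {
	Username string
	Secret   string
}

type State struct {
	Certificates []Certificate
	Credentials  []Credential
}

// redact overwrites sensitive fields in place. Because state is rebuilt from
// scratch on every update, mangling this copy only affects what gets written
// to the dump/diff log.
func redact(s *State) {
	for i := range s.Certificates {
		s.Certificates[i].Key = "{redacted}"
	}
	for i := range s.Credentials {
		s.Credentials[i].Secret = "{redacted}"
	}
}

func main() {
	s := State{
		Certificates: []Certificate{{Cert: "cert-pem", Key: "key-pem"}},
		Credentials:  []Credential{{Username: "alice", Secret: "hunter2"}},
	}
	redact(&s)
	fmt.Printf("%+v\n", s) // sensitive fields now read "{redacted}"
}
```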
Force-pushed from 59f728c to 81ac9dc.
On a high level, I have several concerns:
- I'm not a fan of making KIC's logic more complex by adding new execution paths and outputs, which is what this PR does.
- As is, this PR leaks tempdirs at the rate of one per KIC process lifetime. In both non-containerized and containerized (e.g. crash loop) scenarios, this can lead to uncontrolled storage exhaustion. We have to control the number of files that KIC can create over time.
- The proposed sanitization mechanism is subtractive (and hence risks missing sensitive entities added in the future) and puts additional burden on developers: it requires them to remember that this function exists and to update it. Anonymization/sanitization is a thankless problem (ask anyone who's worked on GDPR compliance 😁) and it's better to avoid having it at all. From a security perspective, anyone with enough access to KIC to manage these settings and read the file can get the sensitive details, so the benefit of implementing sanitization here is unclear to me.
Therefore, the cost-benefit ratio of this change isn't clearly positive to me. I would like to propose a wholly different approach in line with the KISS principle. I'm not sure how much work it would be - just a thought experiment.
The 30k ft view of how KIC works today is roughly the following (stateless, except for state.State changes being incremental) pipeline triggered by the controller loop:
K8s etcd contents -1-> state.State -2-> KongState -3-> file.Contents -4-> Kong
The solution would be to refactor appropriate steps (-1-> through -4->), unless done so already, into reusable "filters". We could then implement a one-shot command-line tool (that tool might hypothetically be a cobra subcommand of KIC, or a separate binary) that would run a subset of the pipeline just once (as opposed to running in a controller loop). With such a tool, the solution to the subject problem would be:
oops, KIC is failing for a customer. Ask them to run the one-shot tool, as a k8s pod, executing steps 1, 2, 3 of the pipeline and printing/dumping the result
Having KIC structured like this would additionally help us test part of KIC's functionality in better separation.
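To make the thought experiment a bit more concrete, here is a minimal sketch (not KIC's actual code) of how the pipeline stages might be factored into reusable functions and driven once by a hypothetical cobra subcommand; all type and function names below are placeholders.

```go
package main

import (
	"fmt"
	"os"

	"github.com/spf13/cobra"
)

// Placeholder stand-ins for state.State, KongState, and file.Contents.
type ClusterState struct{ Raw string }
type KongState struct{ Raw string }
type FileContents struct{ Raw string }

// Pipeline steps 1-3, factored into plain functions so they can be reused by
// both the controller loop and a one-shot diagnostic tool.
func buildClusterState() (ClusterState, error)    { return ClusterState{"k8s objects"}, nil }
func translate(s ClusterState) (KongState, error) { return KongState{s.Raw}, nil }
func render(k KongState) (FileContents, error)    { return FileContents{k.Raw}, nil }

func main() {
	root := &cobra.Command{Use: "kic"}
	root.AddCommand(&cobra.Command{
		Use:   "dump-config",
		Short: "Run one pass of the translation pipeline and print the result",
		RunE: func(cmd *cobra.Command, args []string) error {
			state, err := buildClusterState()
			if err != nil {
				return err
			}
			kong, err := translate(state)
			if err != nil {
				return err
			}
			contents, err := render(kong)
			if err != nil {
				return err
			}
			fmt.Println(contents.Raw)
			return nil
		},
	})
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}
```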
This could be said of any new functionality. This augments
There shouldn't be more than one process under normal circumstances-- I'm not really concerned with non-containerized instances since they should only be used for development work.
That aside, I do need to rework the redaction method to produce valid configuration: the basic
We don't currently have any good means of diagnosing the cause of a conflict (the single generated Kong state prevents us from logging Kong errors on a per-resource basis) or incorrect translation (the only way to see admin API payloads is to use tcpdump). These difficulties are a consequence of our current reconciliation loop, and I agree that we'd be best served by refactoring it. However, we don't have a set idea of what that'd look like or when we'd be able to complete it, and with that in mind, I think we need to add something to better diagnose problems in the existing loop. If you want practical examples of where/how this would be useful:
Having played around with this some, it works, but purely random values always generate a diff for these resources. To avoid that, either:
As agreed in a video call (@rainest @mflendrich @shaneutt):
- the long-term solution to this problem should be a standalone tool that performs one pass of KIC and produces intermediate results
- as a stop-gap (because the long-term solution is more costly) we'll implement config dumping as a side effect of the reconciliation loop (which this PR does).
PTAL at the attached comments (written with a motivation to minimize the added tech debt associated with the stop-gap)
Add a new --log-failed-config flag. When enabled, if the reconciliation loop encounters an error, perform a diff against the last known good configuration and write it to a temporary file. At log level debug, dump a copy of the target configuration and a diff against the last known good configuration (if any) to disk. By default, redact certificate keys and credential secrets from the dump. The --log-sensitive-fields flag dumps these in the clear. These dumps can be used for offline troubleshooting outside the original Kubernetes environment. They are compatible with decK, and provide a way to inspect sync failures outside the controller in cases where the controller's error logging is ambiguous.
Codecov Report
@@            Coverage Diff             @@
##             next     #991      +/-   ##
==========================================
- Coverage   53.84%   53.77%   -0.08%
==========================================
  Files          33       34       +1
  Lines        3172     3193      +21
==========================================
+ Hits         1708     1717       +9
- Misses       1332     1345      +13
+ Partials      132      131       -1
Continue to review full report at Codecov.
Force-pushed from b024314 to 32f15a2.
Force-pushed from da2c974 to c742ec1.
Remove some cruft from old iterations. Update comments and docstrings. Only create tmpdir if dumps are enabled.
Merge didn't conflict these since they were in different locations in the file. Remove old versions in favor of the #1021 version.
Convert the dump configuration to a defined type with iota constants, rename the dump directory config, and move dump directory initialization to main, passing it to the controller constructor.
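For reference, a defined type with iota constants for the dump setting might look roughly like the sketch below; the type and constant names are illustrative guesses, not necessarily those used in the PR.

```go
package main

import "fmt"

// DumpMode is a hypothetical defined type for the dump configuration.
type DumpMode int

const (
	// DumpDisabled turns config dumping off entirely.
	DumpDisabled DumpMode = iota
	// DumpSanitized dumps config with certificate keys and credential secrets redacted.
	DumpSanitized
	// DumpSensitive dumps config including sensitive values.
	DumpSensitive
)

func main() {
	mode := DumpSanitized
	fmt.Println(mode, mode == DumpDisabled) // 1 false
}
```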
Simply save config with descriptive names. We no longer diff so we don't need to keep old good configs in memory. Log a warning if we cannot save.
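A minimal sketch of that save-and-warn behavior, assuming the rendered config is available as bytes and the dump directory was created earlier; the function and file names here are made up for illustration.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
)

// dumpConfig writes the config that failed to apply under a descriptive name.
// A write failure only logs a warning so the reconciliation loop keeps running.
func dumpConfig(dumpDir string, targetConfig []byte) {
	path := filepath.Join(dumpDir, "failed-target-config.yaml")
	if err := os.WriteFile(path, targetConfig, 0o600); err != nil {
		log.Printf("warning: could not save failed config to %s: %v", path, err)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "kic-config-dump-")
	if err != nil {
		log.Fatal(err)
	}
	dumpConfig(dir, []byte("_format_version: \"1.1\"\n"))
}
```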
A variant that incorporates #1029 is in https://github.com/Kong/kubernetes-ingress-controller/tree/feat/dump-config, pending that getting merged to next.
Co-authored-by: Michał Flendrich <[email protected]>
Re-add the config dump following the merge of the decouple refactor (PR #1029). Auto-merge created a confusing conflict, so it was easier to merge it in with strategy "theirs" and then add the dump path back in OnUpdate and its helper after the fact.
What this PR does / why we need it:
Add a --log-failed-config flag to write a config diff to disk if applying that config failed.

Background
Diagnosing reconciliation loop failures is currently difficult for end users. In the event of a failure, we log admin API responses, but those don't always provide a clear indication of the issue: information about conflicts or invalid configuration isn't always easy to link back to Kubernetes resources, and in some cases, the admin API just encounters a fatal error that provides no useful info at all. Tracebacks or debug messages in Kong logs may provide additional hints, but those aren't necessarily intuitive either.
Logging what the controller wants to apply can simplify that debugging process: if you can see the proposed config and how it differs from the last successfully-applied configuration, it should be easier to find the problem Kubernetes resources related to it, or to apply it to a test Kong instance that does not have the same Kubernetes resources available. That's useful since we usually want to avoid shipping and running debug proxy code in actual user environments if possible.
Implementation considerations