
[RFE] - Error Reporting and Normalization #126

Open
shawn-hurley opened this issue Jun 13, 2023 · 11 comments
Comments

@shawn-hurley
Contributor

As a user, understanding what high-level errors mean helps you quickly debug and solve issues. While there will always be a need for deep-dive debugging, expecting users to search through hundreds of log lines that are not for their application is, I believe, unrealistic.

We need a cascading set of error/warning information to display to users to give them a complete picture of the unhappy path scenarios.

Full Stop Errors (sometimes referred to as "fatal")

The most apparent set of errors is when one of the following occurs:

  1. The addon image cannot be run.
  2. The addon code itself fails, for example when validation of its inputs fails.
  3. The code the addon calls fails with a validation error (a rule cannot be parsed, or the providers are not configured correctly or fail to start). Timeouts fall into this category as well.

For this set of errors, we want to consider that:

  1. this means the user will get 0 meaningful data from this run.
  2. We probably want some defined types of errors, similar to (but not matching) the "ImagePullBackOff", "ContainerFailed", or "StartupProbeFailed" statuses that Kubernetes gives you (a sketch follows this list).
  3. The common heuristic for this type of error is that the user gets no data, and most likely something in the input to the analysis run must be fixed by the user.

Fall Through Errors or "Soft Errors"

There is a second set of errors that can occur. These errors will mostly happen when running the code the addon calls. The following are examples.

The clear case for this is:

  1. The addon runs successfully.
  2. The code the addon calls runs successfully, but the output contains "errors" alongside data that is complete.
  3. The user gets some set of "good" data, but some rules have errored, so there is only a subset of conclusions you can draw from the analysis run.
  4. The rules could fail for a couple of different reasons: the rule is malformed for the capability, the provider fails for that particular condition or instance, or the code locations/custom variables cannot be found and therefore the message may be malformed.

These particular scenarios should be captured and displayed. They basically say that some of the information may be useful, but you cannot say that all of the returned data is correct.
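
As an illustration only, one possible shape for a result that carries both usable data and per-rule soft errors; the field names here are assumptions, not the analyzer's actual output schema:

```go
// Hypothetical result shape for a run that succeeded overall but had
// per-rule "soft errors". All names here are assumptions.
package result

type Violation struct {
	RuleID      string
	Description string
}

type RuleError struct {
	RuleID string // which rule errored
	Reason string // malformed rule, provider failure, unresolvable location/variable
	Detail string // raw error text from the analyzer or provider
}

type AnalysisResult struct {
	Violations []Violation // the "good" data the user still gets
	RuleErrors []RuleError // conclusions drawn from these rules are suspect
}
```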

Warnings

We have the concept of warnings here that we should consider and categorize. I see this happening when something like the following occurs.

  1. The task is taking too long
  2. The addon completes with no issues being reported but has no violations to save.

Another concern in this vein: it would be nice to have some way to show the skipped and unmatched rules. This information will be vital for a user whose "new-custom-rule-x" is not being matched. If the only way to discover this is to infer it from the rule's absence on the issues screen, that seems like a less desirable UX/UI. We need some way to note this information.

Skipped rules are those filtered out by the user in the input. These should fall into the same category as the unmatched rules above.
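
One hypothetical way to surface this is a per-rule disposition, so the user is told explicitly rather than inferring from absence; the names below are assumptions only:

```go
// Hypothetical per-rule disposition so skipped and unmatched rules are
// reported explicitly instead of simply missing from the issues screen.
package result

type RuleDisposition string

const (
	RuleMatched   RuleDisposition = "Matched"   // produced violations
	RuleUnmatched RuleDisposition = "Unmatched" // evaluated, nothing matched
	RuleSkipped   RuleDisposition = "Skipped"   // filtered out by the user's input
	RuleErrored   RuleDisposition = "Errored"   // soft error, as above
)
```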

Because the search for violations is not, AFAIK, tied to a specific analysis but rather to the latest run, we need a way for users to see why specific violations/rules may have disappeared. This could be our most significant source of customer cases.

@jwmatthews
Member

Related to #123

@jortel
Contributor

jortel commented Jun 15, 2023

For this set of errors, we want to consider that:

  1. this means the user will get 0 meaningful data from this run.

Can you elaborate on this?

@shawn-hurley
Contributor Author

@jortel If an analysis run happens and the failure is fatal, then nothing will be reported back, and therefore nothing meaningful in terms of analysis; that was my thought process.

Does that help, or are we still missing something?

@jortel
Contributor

jortel commented Jun 20, 2023

@shawn-hurley There are 3 levels of fatal.

  1. The task pod has a fatal error (image pull error, etc.).
  2. The addon (adapter) has a fatal error (validation, git error, etc.).
  3. A command (analyzer or analyzer-dep) exits != 0.

Though 2 & 3 are kind of the same thing.

All errors/warnings currently are (or can be) reported in Task.Error (with further details in Task.Activity).

Related:

Seems the UI can simply enable an icon and show the Task.Errors when not empty.

(Crude) Example of TaskReport (addon) errors:

Severity | Description
Error    | Rule: Rule-001 has invalid syntax.
Warning  | Rule: Rule-002 label selector not matched.

(Crude) Example of Task errors:

Severity | Description
Error    | Pod failed: Image quay.io/addon-XXX not found

Severity | Description
Error    | Git: clone failed: fatal: repository 'http://github.com/xxxxx/' not found.
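
A rough sketch of how these tables might be modeled as structured errors on the hub; the thread only establishes that Task.Error is a single string today, so the field names below are assumptions:

```go
// Hypothetical structured error entries for Task and TaskReport,
// matching the Severity/Description columns above. Names are assumptions.
package model

type TaskError struct {
	Severity    string // "Error" or "Warning"
	Description string // e.g. "Rule: Rule-001 has invalid syntax."
}

type TaskReport struct {
	Errors []TaskError // a non-empty list is what lets the UI enable the icon
}
```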

@shawn-hurley
Contributor Author

shawn-hurley commented Jun 20, 2023

I agree this makes sense for the fatal errors. If we can show that error along with some text such as "Cluster was unable to pull the image", "git credentials were unauthorized", or "git repository not found", so much the better; the more we can bucket errors and give the user exact statements of what went wrong, the better, especially for the fatal errors.

Starting with the fatal errors, my goals are twofold:

  1. Make it obvious when simple things fail (git creds, repos not found, rules being malformed, etc.) and how to remedy the situation.
  2. Make it possible to debug more into things when hard/complicated things fail. Think of this as the escape hatch.

I would imagine the data we would want is something like this:

  1. The place in the process where the error occurred; this would likely help us if/when we get support cases.
  2. The error, bucketed into one of our predetermined buckets. If we standardize this correctly, it will allow us to write documentation that a user can easily find to help them on their way (think ImagePullBackOff or NodePressure statuses).
  3. A section for more information that says something specific about this particular failure: "git failed authentication" or "git repo was not found at ." For the analyzer, it could be something like "rules are unparsable, here is your error" or "no provider found for cap X."
  4. Severity, probably one of fatal, error, or warning. It seems like we are starting to use this terminology already; maybe we just stick with it.
  5. We want a "more information" view where we can show the full task log. I would imagine this would be useful for those cases we cannot bucket. I think the task actions (sorry if this is the wrong word, but the full accounting of everything that was run) would work here.
  6. I would love to ensure that we have some option to run things at a higher verbosity for the entire run; doing this will help on those calls with a customer where we or support need to help diagnose the issue.

Note: we may only have a few buckets right now, but I think this is ok. With the extra escape hatches, users can still get to the hard debugging answers, while hopefully the easy issues are immediately diagnosable. A rough sketch of what such an error record could look like follows.
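
Pulling the six points above together, here is one possible error record; every field name in this sketch is an assumption for discussion, not an existing API:

```go
// Hypothetical error record covering the points above; the full task log
// stays available as the escape hatch.
package report

type ErrorRecord struct {
	Stage     string // where in the process the error occurred (fetch, validate, analyze, ...)
	Bucket    string // predetermined, documentable category (think ImagePullBackOff)
	Detail    string // failure-specific text: "git failed authentication", parser output, ...
	Severity  string // fatal, error, or warning
	TaskLogID uint   // reference to the full task activity for deep debugging
}
```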

Adding lots of folks who also had to do this exercise with MTC; please tell me if I am missing a huge step that y'all had to take, or if this is overkill.

@jortel @jwmatthews @pranavgaikwad @fabianvf @mansam @djzager @jmontleon @JustinXHale

@JustinXHale
Member

What currently happens when a user runs into an error?

@shawn-hurley
Contributor Author

@JustinXHale I might be the worst person to ask, but my understanding as of today is that there is a log of actions that a task took, and if the task fails, there is an error field. IIUC, based on the issues above, it is a single string right now.

@jortel @mansam is this close to correct?

@jortel
Contributor

jortel commented Jun 20, 2023

Correct.
An error reported by the addon can be viewed through the UI kebab Analysis Details.

    status: Failed
    error: fatal: repository 'http://github.com/xxxxx/' not found.
    activity:
        - Fetching application.
        - '[CMD] Running: /usr/bin/ssh-agent -a /tmp/agent.1'
        - '[CMD] succeeded.'
        - '[SSH] Agent started.'
        - '[GIT] Cloning: https://github.com/WASdev/sample.daytrader7.git'
        - '[FILE] Created /addon/.gitconfig.'
        - '[CMD] Running: /usr/bin/git clone https://github.com/WASdev/sample.daytrader7.git /addon/source/sample'
        - '[CMD] failed: fatal: repository 'http://github.com/xxxxx/' not found.'
    taskid: 6

There is an open issue to propagate k8s errors in Task.Error.

There is another issue aligned to addon-analyzer to propagate soft errors reported (not sure how) by the analyzer as non-fatal severity errors in Task.Error.

@jortel
Contributor

jortel commented Jun 20, 2023

Another thing to keep in mind is that the quality of error information surfaced to the user depends on the quality of the information reported at the error's origin. For example, for credentials, both Git and Subversion (and SSH) can, in many cases, report almost nothing useful.
Further, they don't report in any structured way, so bucketing would require parsing stderr, which is notoriously brittle and non-deterministic; we should be cautious about doing this. IMHO, it's better to just show the user what the command reported; humans are better at interpreting the text. This is why the current approach is to show the user the raw error in the context of the Activity log.

@shawn-hurley
Contributor Author

shawn-hurley commented Jun 21, 2023

Hmm, looking for a string like "Authentication Failed" when using git doesn't seem overly burdensome or brittle to me. Am I missing something? There are two things I can think of to be worried about:

  1. False positive - something else failed, but we determined it was an authentication failure.
  2. False negative - it was an authentication problem, but we were unable to determine that.

For 2, I think it is reasonable to assume this will happen, and that is why we have the escape hatches.

For 1, I think we need to be careful, but if it is a simple check like the above, I have a tough time seeing how it could be overly brittle (see the sketch below).
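
To make the check being discussed concrete, here is a minimal sketch of bucketing git output by substring match; the strings and bucket names are assumptions, not an implemented mapping:

```go
package gitbucket

import "strings"

// ClassifyGitError is a deliberately simple, hypothetical bucketer for git
// command output. Anything unrecognized falls through to the raw error text.
func ClassifyGitError(output string) string {
	switch {
	case strings.Contains(output, "Authentication failed"):
		return "GitAuthenticationFailed" // the false-positive risk lives here
	case strings.Contains(output, "repository") && strings.Contains(output, "not found"):
		return "GitRepositoryNotFound"
	default:
		return "" // no bucket matched: show the user the raw error (the fallback)
	}
}
```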

What I worry about is that the user has no direct access to debug things (we control the way all of these tools are called), so if we don't give them actual steps to take, it could be very frustrating for users.

I think it is reasonable to not try and boil the ocean, but I do think we can take some steps for common situations to try and make this better for users.

@jortel FYI, I agree on a fallback of just showing the information to the user. I just think we can provide a better experience for certain cases, especially for things that we are putting together (setting up git/svn creds that are stored somewhere else in the hub, or rules coming from X place).

@jortel
Contributor

jortel commented Jun 22, 2023

So far, I have not read any comments that indicate opposition to the iteration proposed in the linked hub, ui, and addon issues. This would be a low-investment evolution of how/what is reported currently and does not seem to paint us into a corner.
Is there any objection?

@shawn-hurley
