Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch log formatter errors to avoid death spiral of logging (issue #2967) #2968

Merged
merged 1 commit into from
Dec 9, 2020
Merged

Catch log formatter errors to avoid death spiral of logging (issue #2967) #2968

merged 1 commit into from
Dec 9, 2020

Conversation

balgillo
Copy link
Contributor

@balgillo balgillo commented Dec 4, 2020

This PR addresses #2967

Instead of allowing an error during log line formatting to propagate into formatter_crashed, which can then lead to a death spiral of formatters crashing as they try to format the formatter_crashed error, safely format a log message about the formatter crash.

This avoids the big tests failing at debug log level, and will also be safer for production use - any undiscovered bugs in log formatting won't be able to bring down the MongooseIM process.

Example of formatted log line produced by the modified JSON formatter in case of an error during formatting:

{"when":"2020-12-04T09:16:33.003295+00:00","what":"log_format_failed","reason":"badarg","level":"error","formatter_module":"mongoose_json_formatter","class":"error","cause_mfa":"{ejabberd_router,route,4}","cause_line":144,"cause_file":"/data/developer/code/esl/MongooseIM/src/ejabberd_router.erl"}

Closes #2967

@arcusfelis
Copy link
Contributor

Hi, thanks for the PR! Not merging it yet though.

We considering to fix the issue the following way:

  • write tests, simulating any possible OTP error event (there are many types of them) - they currently could crash JSON formatter with function_clause reason.
  • modifying the code of formatters to avoid errors.
  • if we are still not happy with reliability we add try..catch.

Apparently, we would also have to investigate the running with debug log-level to get the original log event that caused crash, you've reported.

@balgillo
Copy link
Contributor Author

balgillo commented Dec 4, 2020

Thanks for taking a look @arcusfelis

  • write tests, simulating any possible OTP error event (there are many types of them) - they currently could crash JSON formatter with function_clause reason.
  • modifying the code of formatters to avoid errors.
  • if we are still not happy with reliability we add try..catch.

I think it's reasonable to do the first two - perhaps I could raise a separate issue to cover them?

I would say that from a robustness perspective, there really needs to be a backstop here, like the try/catch I've introduced. As you say, there are many possible kinds of errors and sources of errors, and the consequences of one of them slipping through the net are so severe (loss of service - the worst possible failure) that a catch-all is justified in my opinion.

So, my preference would be to add the backstop first, to gain robustness and stability, and then raise a separate issue for the underlying formatter errors.

Apparently, we would also have to investigate the running with debug log-level to get the original log event that caused crash, you've reported.

This could be raised as a further issue - errors while formatting specific log statements from the big tests.

Copy link
Contributor

@arcusfelis arcusfelis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, makes point. If we want to merge this, we have to do at least two things in the comments :)

catch
%% Exceptions during log formatting can lead to a death spiral of recursive error logging, so
%% format the formatter error in a safe way and don't allow the exception to propagate.
Class:Reason -> format_log_formatter_error(Class, Reason, Event, FConfig)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer to catch with catch error:Class:Stacktrace.

  • Don't ignore stacktrace when catching - it would pretty hard to debug without line information.
  • ignore throw and exits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion - added

end.


format_log_formatter_error(Class, Reason, #{meta := #{time := Time, mfa := Mfa, file := File, line := Line}}, FConfig) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to include the complete Event, encoded as a term.

original_event => unicode:characters_to_binary(io_lib:format("~0p", Event))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this and pushed new changes

@arcusfelis
Copy link
Contributor

perhaps I could raise a separate issue to cover them?
sure, more PR-s and more information are welcomed :)

We also planning to make a separate PR from our side with a fix of one of OTP errors, that would include tests - that PR could be used as a template for the rest of errors.

I am pretty fine with merging try..catch workaround first, after applying the comments :)

@balgillo
Copy link
Contributor Author

balgillo commented Dec 7, 2020

I've updated the fallback error handlers in response to @arcusfelis's comments.

I think I've resolved the underlying formatter crash cause as well. jiffy:encode return type is iodata() which might be a binary or a list of things, but the mongoose_json_formatter:format function assumed that the return value is binary. I think if it was actually a list, that was crashing.

#{unstructured_log =>
unicode:characters_to_binary(io_lib:format(Format, Terms))}}},
FConfig).


-spec append_lf(iodata()) -> iodata().
append_lf(IoData) when is_list(IoData) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just this is enough:

append_lf(IoData) ->
    [IoData, <<"\n">>].

I am dialyzer:

-spec format(logger:log_event(), logger:formatter_config()) -> unicode:chardata().
-spec append_lf(iodata()) -> iodata().

These two return different types.

Log formatters return unicode:chardata().

Should be:

-spec append_lf(unicode:chardata()) -> unicode:chardata().

(tldr: iodata - integers are bytes, unicode - untegers are code points).

There is no problem with the code, there is a problem with a spec :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the change - got rid of the function because it didn't seem worth having it. That means the spec has gone as well.

The old code always returned the iodata() from jiffy:encode as the unicode:chardata() return value of format. If I understand the typing correctly, iodata() is the same as unicode:chardata() except when it's an (improper) list containing integers >=128, however there won't be any of those because the list returned from jiffy:encode consists of UTF-8 binaries. So the semantic difference between iodata() and unicode:chardata() isn't a problem in this particular case (I've resisted the urge to write a code comment to this effect).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, the only difference between unicode and iodata is when lists of integers are used.
Still, it's a good idea to avoid iodata() type in the unicode code, because currently it's fine, but after 3-4 generations of code modifications, there could be a point, when someone would add iolist_to_binary(X) and another person would add io_lib:format("~ts", [Info]) (which returns code points higher 255) in two parts of the code :D

Anyway, awesome contribution :)

Copy link
Contributor

@arcusfelis arcusfelis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good

@arcusfelis arcusfelis merged commit 0834cc8 into esl:master Dec 9, 2020
@balgillo balgillo deleted the catch-log-formatter-crash branch December 9, 2020 12:03
@leszke leszke added this to the 4.1.0 milestone Feb 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Big tests fail at debug log level
3 participants