Improve error handling on worker #17

Merged
merged 1 commit into from
Oct 13, 2017

Conversation

aeroastro
Collaborator

@ryopeko

I have improved error handling in the worker.
The details are as follows.

  • Wrap Shinq::Client.dequeue and the Shinq.configuration.abort_on_error conditional block in a begin ... rescue block
    • Most MySQL-related errors (e.g. connection errors) originate in Shinq::Client.dequeue, where workers establish the connection.
    • Also, there is no strong reason not to rescue errors when --no-abort-on-error is given.
  • Log the error message instead of only raising the exception to ServerEngine
    • Developers want to see the error message in the log file rather than on the daemon's STDOUT.
  • Clean the backtrace in the error message when used with Rails
    • With the deep backtrace cleaned, developers can trace the cause of an error more easily.
  • Sleep when an error occurs
    • When an error occurs, an immediate queue_abort triggers rapid bursts of errors, which can effectively DDoS external services or drive up CPU usage and degrade other services running on the same machine.
    • Ideally such exceptions should be handled in each application, but the library needs a minimum level of failure tolerance. The proposed strategy, which has worked in our environment, is for the worker to hold the queue for a few seconds (default: 1 second) and then resume normal processing. Because this consumes wall-clock time for both the worker and the queue, it is effective whether the cause lies in a particular queue or in the system as a whole.
    • This does not cause a severe performance problem by itself (i.e. workers being too busy sleeping to serve the whole queue). Only unexpected, unhandled exceptions reach this level, so a persistent performance problem means the root cause must be fixed immediately.
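
The rescue, log, and sleep steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the SketchWorker class, its callables (dequeue, perform, abort_queue), and the injectable sleeper are assumptions made so the example runs without MySQL or ServerEngine.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the worker loop described above. The callables
# (dequeue, perform, abort_queue) and the injectable sleeper are illustrative
# assumptions, not Shinq's actual API.
class SketchWorker
  def initialize(dequeue:, perform:, abort_queue:, logger:,
                 sleep_sec_on_error: 1, sleeper: ->(sec) { sleep(sec) })
    @dequeue = dequeue
    @perform = perform
    @abort_queue = abort_queue
    @logger = logger
    @sleep_sec_on_error = sleep_sec_on_error
    @sleeper = sleeper
  end

  # Processes at most one queue entry and returns :ok, :empty, or :error.
  def run_once
    job = @dequeue.call # connection errors would typically surface here
    return :empty if job.nil?

    @perform.call(job)
    :ok
  rescue StandardError => e
    # Log to the configured logger instead of letting the exception escape.
    @logger.error("#{e.class}: #{e.message}")
    # Hold the queue briefly so repeated failures cannot turn into a burst.
    @sleeper.call(@sleep_sec_on_error)
    @abort_queue.call
    :error
  end
end

log = StringIO.new
worker = SketchWorker.new(
  dequeue: -> { raise IOError, 'connection refused' },
  perform: ->(_job) {},
  abort_queue: -> {},
  logger: Logger.new(log),
  sleep_sec_on_error: 0
)
worker.run_once # logs the error, sleeps, aborts, and returns :error
```

Sleeping before the abort is the point of the design: the failing entry stays held for the configured interval, so a queue that fails on every attempt is retried at most once per interval per worker.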


  DEFAULT = {
    require: '.',
    process: 1,
    graceful_kill_timeout: 600,
    queue_timeout: 1,
    daemonize: false,
-   abort_on_error: true
+   abort_on_error: true,
+   sleep_sec_on_error: 1,
Collaborator Author

Maybe the default could be set to 3 seconds?

Owner

I don't know whether this value is reasonable,
so I think the default should be the minimum.
If necessary, users can specify a value that is reasonable for their own environment.

Collaborator Author

I see.
Let's go with that. I will leave the value for users to customize, since this is the last resort against a queue that causes consecutive errors.
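
The outcome of this thread (a minimal default that users raise when they need to) might look like the sketch below. The option keys are taken from the diff above; the SKETCH_DEFAULTS constant and the build_config helper are hypothetical and only illustrate merging user overrides over the defaults.

```ruby
# Defaults as shown in the diff; the constant name is an assumption.
SKETCH_DEFAULTS = {
  require: '.',
  process: 1,
  graceful_kill_timeout: 600,
  queue_timeout: 1,
  daemonize: false,
  abort_on_error: true,
  sleep_sec_on_error: 1
}.freeze

# Hypothetical helper: merge user overrides over the defaults and reject
# unknown keys so a typo does not silently fall back to the default.
def build_config(overrides = {})
  unknown = overrides.keys - SKETCH_DEFAULTS.keys
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  SKETCH_DEFAULTS.merge(overrides)
end

# A user who prefers a longer pause after errors overrides just that key.
config = build_config(sleep_sec_on_error: 3)
```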


def format_error_message(error)
  if defined?(::Rails) && ::Rails.backtrace_cleaner
    backtrace = ::Rails.backtrace_cleaner.clean(error.backtrace || [])
Owner

I think the backtrace should not be trimmed,
because omitting lines makes it impossible to see the call hierarchy.

Collaborator Author

@aeroastro aeroastro Sep 29, 2017

Actually, backtrace_cleaner does not stop us from seeing the hierarchy. Rather, it helps us understand the backtrace more quickly, which makes debugging more efficient.

By default, backtrace_cleaner performs the following two operations.

  • Filter: Show paths relative to the application root instead of full paths from the root /.
  • Silencer: Exclude lines that belong to libraries rather than to the application.

http://api.rubyonrails.org/classes/ActiveSupport/BacktraceCleaner.html

The former does not break the hierarchy at all. It just removes redundant information such as /home/app-user/application/releases/20170929123456/ from each line of the backtrace. This significantly reduces noisy, redundant log output, especially on production servers.

The latter does not stop us from understanding the hierarchy either. It removes only lines belonging to external libraries, so the lines from the user's own Ruby code, where almost all errors originate, are neither removed nor reordered. On the contrary, removing irrelevant lines keeps the log concise and lets users focus on the real cause. In our environment, a typical worker backtrace has 16 lines in serverengine, 5 lines in shinq, and 15 lines in bundler, compared with a few (mainly 1 to 3) lines in our own code, which means more than 90% of the backtrace consists of irrelevant lines.

Moreover, the backtrace_cleaner settings are usually maintained alongside the application code, typically in config/initializers/backtrace_silencers.rb. This means the filters and silencers are customized appropriately by each user, and when the whole backtrace is required, a user can get it simply by calling BacktraceCleaner#remove_silencers! (and #remove_filters! to restore full paths).
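
To make the filter/silencer distinction concrete, here is a small sketch. SketchBacktraceCleaner is a simplified stand-in written in plain Ruby so it runs without Rails; it imitates the add_filter / add_silencer / clean shape of ActiveSupport::BacktraceCleaner but is not that class. The backtrace lines and the "/gems/" silencing rule are made up for illustration.

```ruby
# Simplified stand-in for a backtrace cleaner: filters rewrite each line,
# silencers drop lines. The order of the remaining lines is never changed.
class SketchBacktraceCleaner
  def initialize
    @filters = []
    @silencers = []
  end

  def add_filter(&block)
    @filters << block
  end

  def add_silencer(&block)
    @silencers << block
  end

  def clean(backtrace)
    filtered = backtrace.map { |line| @filters.reduce(line) { |l, f| f.call(l) } }
    filtered.reject { |line| @silencers.any? { |s| s.call(line) } }
  end
end

app_root = '/home/app-user/application/releases/20170929123456/'

cleaner = SketchBacktraceCleaner.new
cleaner.add_filter { |line| line.sub(app_root, '') }    # shorten application paths
cleaner.add_silencer { |line| line.include?('/gems/') } # hide library frames (assumed rule)

# Hypothetical backtrace: two application frames interleaved with library frames.
raw = [
  "#{app_root}app/workers/mail_worker.rb:12:in `perform'",
  "/usr/lib/ruby/gems/shinq/lib/shinq/launcher.rb:30:in `run'",
  "#{app_root}lib/tasks/boot.rb:3:in `start'",
  "/usr/lib/ruby/gems/serverengine/lib/serverengine/worker.rb:50:in `main'"
]

cleaned = cleaner.clean(raw)
# cleaned keeps only the two application frames, in their original order,
# with the release directory prefix removed.
```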

Collaborator Author

In case an error occurs inside an external library, I will fix the code to show the full backtrace when the cleaned backtrace is empty.
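
The fallback described here could look roughly like the following. It is a sketch under assumptions, not the merged implementation: format_error_message_sketch and its cleaner: keyword are hypothetical, and the cleaner is any object that responds to clean(backtrace), so the example runs without Rails.

```ruby
# Hypothetical formatter: use the cleaned backtrace when it has any lines,
# otherwise fall back to the full backtrace so the log is never empty.
def format_error_message_sketch(error, cleaner: nil)
  full = error.backtrace || []
  cleaned = cleaner ? cleaner.clean(full) : full
  shown = cleaned.empty? ? full : cleaned
  ["#{error.class}: #{error.message}", *shown.map { |line| "  #{line}" }].join("\n")
end

# Stand-in cleaner that silences every line, as happens when an error is
# raised entirely inside library code.
class SilenceEverything
  def clean(_backtrace)
    []
  end
end

error = StandardError.new('lost connection')
error.set_backtrace(["/usr/lib/ruby/gems/somelib/lib/somelib.rb:10:in `query'"])

message = format_error_message_sketch(error, cleaner: SilenceEverything.new)
# message still contains the library frame because the cleaned result was empty.
```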


Collaborator Author

aeroastro commented Sep 29, 2017

Thank you for reviewing this pull request.
I have posted a comment and fixed an issue with the backtrace handling.
#17 (comment)

If that does not address your concern, it would be very helpful if you let me know the details.
Thank you 🐱

@aeroastro aeroastro force-pushed the feature/error-handling branch 3 times, most recently from 4740f5d to 65ba189 Compare October 10, 2017 04:35
@ryopeko ryopeko merged commit 8c8b081 into ryopeko:master Oct 13, 2017
@aeroastro aeroastro deleted the feature/error-handling branch October 13, 2017 09:53