
ansible event catcher - mark event_monitor_running when there are no events at startup #13903

Merged · 1 commit · Feb 28, 2017

Conversation

durandom
Member

@durandom durandom commented Feb 14, 2017

The reason the EventCatcher would not start was that it did not send event_monitor_running when there are no events coming in.

So I moved that up, before the loop.
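The shape of the fix can be sketched as follows. This is a hypothetical, minimal model, not the actual ManageIQ code: the class, `notifications` array, and `monitor_events` signature are illustrative. The point is that signalling `event_monitor_running` before blocking on the event stream means startup no longer depends on an event arriving.

```ruby
# Illustrative sketch only: signal that the monitor is running *before*
# entering the event loop, so the signal fires even with zero events.
class EventCatcherSketch
  attr_reader :notifications

  def initialize
    @notifications = []
  end

  def event_monitor_running
    notifications << :event_monitor_running
  end

  def monitor_events(events)
    event_monitor_running               # moved up, before the loop
    events.each { |e| notifications << e }
  end
end

catcher = EventCatcherSketch.new
catcher.monitor_events([])              # no events at startup
# notifications still contains :event_monitor_running
```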

@miq-bot add_labels providers/ansible_tower, bug
@miq-bot assign @jrafanie

@jrafanie please review and merge - because you are the worker tamer 🦁

@@ -22,6 +23,7 @@ def poll
catch(:stop_polling) do
begin
loop do
@before_poll.call if @before_poll && @before_poll.respond_to?(:call)

nil.respond_to?(:call) is false, so code can be simplified to:

@before_poll.call if @before_poll.respond_to?(:call)


I usually don't prefer try but this time I actually like @before_poll.try(:call)

Note, @before_poll doesn't scream "proc/lambda" in my head. I wonder if it makes sense to make this interface more clear by calling it @before_poll_proc or @pre_poll_proc
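Both suggestions rest on the same plain-Ruby fact: `nil.respond_to?(:call)` is false, so the extra `nil` guard is redundant. (ActiveSupport's `try(:call)` relies on the same behavior; the snippet below uses the guard form so it runs without Rails. The `before_poll` local is a stand-in for the `@before_poll` ivar.)

```ruby
# When the hook is nil, the guarded call is a safe no-op:
before_poll = nil
called = false
before_poll.call if before_poll.respond_to?(:call)   # nothing happens, no NoMethodError

# When the hook is a proc/lambda, it responds to :call and runs:
before_poll = -> { called = true }
before_poll.call if before_poll.respond_to?(:call)
```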

@durandom durandom force-pushed the heartbeat_ansible_worker branch 2 times, most recently from 4124a67 to 2946ab9 on February 14, 2017 at 17:56
@durandom
Member Author

@chessbyte @jrafanie thanks for the review. Added your suggestions.

Hopefully this kind of micro setup is not needed anymore once the event catcher is refactored

@Fryguy
Member

Fryguy commented Feb 14, 2017

Why is there a need to have all of this proc stuff? Can you not just call heartbeat directly? All event catchers should have a heartbeat method implemented, and I can't see why it wouldn't be necessary for every provider.

@durandom
Member Author

Can you not just call heartbeat directly?

Usually EventCatcher::Runner does the heartbeat here - but that's only one thread, the one that does the event processing.

If there is a steady stream of events, this gets called every once in a while. But if no events are coming in, the Runner thread would not heartbeat.

why it wouldn't be necessary for every provider.

Looks like at least google, hawkular and vmware have the same problem :)

@agrare am I missing something here? What happens when the vmware event catcher has no events coming in? Will it still heartbeat?

@agrare
Member

agrare commented Feb 14, 2017

@durandom I don't think we do anything special for heartbeat, it's possible vmware is just verbose enough with events that it isn't an issue :)

We do have a timeout on WaitForUpdatesEx set so what I'd like to do is return an empty set of events when it times out so it will still heartbeat.

@durandom
Member Author

return an empty set of events when it times out so it will still heartbeat

That won't do, I guess, because an empty set will not reach the heartbeat here:

https://github.com/ManageIQ/manageiq/blob/master/app/models/manageiq/providers/base_manager/event_catcher/runner.rb#L190

  def process_events(events)
    events.to_miq_a.each do |event|
      heartbeat
      process_event(event)
      Thread.pass
    end
  end
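A minimal illustration of the point above: with an empty event set, the `each` block body never runs, so `heartbeat` is never reached. (This models the quoted `process_events`; `Array()` stands in for ManageIQ's `to_miq_a` coercion helper, and the counter is only for demonstration.)

```ruby
heartbeats = 0
heartbeat = -> { heartbeats += 1 }

# Mirrors the shape of the quoted process_events: heartbeat happens
# *inside* the per-event loop body.
process_events = lambda do |events|
  Array(events).each do |_event|
    heartbeat.call
    # process_event(_event) would go here
  end
end

process_events.call([])        # empty set: loop body never runs, no heartbeat
process_events.call(%i[a b])   # two events: heartbeat called twice
```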

@jrafanie
Member

jrafanie commented Feb 14, 2017

Can you not just call heartbeat directly?
Usually EventCatcher::Runner does the heartbeat here - but that's only one thread, the one that does the event processing.

I wonder if it makes sense to have all workers use a dedicated thread just for heartbeating. We used to do this years ago, before rails/activerecord was threadsafe. I think it's something we could finally do.

Although, we've had issues with the main thread getting "stuck" on a really long work item, such as processing a huge report, and I don't know that heartbeating in a dedicated thread would help there, since we'd still be reporting ourselves alive while unable to do new work or respond to requests. 😕

So, yeah, I'm torn. I don't know what a dedicated heartbeat thread buys us, since we'd still need a way to get the status of, or interrupt, the busy thread from the heartbeat thread. I don't know.

@agrare
Member

agrare commented Feb 14, 2017

because an empty set will not reach the heartbeat here

We could always heartbeat before entering the loop if we go that direction (i.e. not a dedicated heartbeat thread)

@durandom
Member Author

re "dedicated heartbeat thread"

If the consuming thread (the one that fetches events from the provider) gets stale, because of network issues, the worker should be restarted.

If the dispatching thread (the one that puts events on our miq queue) gets stale, because of who knows what craziness, the worker should also be restarted.

So, actually, my solution here could lead to a false positive, where one thread still heartbeats while the other has gone stale. 😭

Threading is wicked.

I don't know if a dedicated thread makes sense at all; doesn't it defeat the idea of heartbeating? Then you'd have one thread that just heartbeats while the other two are stale?! (just what @jrafanie said in his second paragraph)

I guess what we have now is probably best. Restart the worker if no events are coming in...

...and in the future I'd like to revisit the whole threading approach here, because I actually think it brings more complexity than the problem it solves (green threads)

close this?

@blomquisg
Member

Maybe I'm missing something very fundamental here, but we heartbeat in the do_work_loop method in the base MiqWorker::Runner class.

If I'm reading this correctly, each time the do_work method is called, the worker will heartbeat. There's an additional safeguard to heartbeat every time events are processed in case there are TONS of events and it gets stuck in the process_events method for a while.

If the worker doesn't get any events for a long time, then based on what I'm reading, the heartbeat method will be called by the do_work_loop just before it calls do_work again.
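The mechanism described above can be sketched roughly like this. This is a simplified model, not the real `MiqWorker::Runner` (the actual loop also handles exit requests, queue messages, and sleep intervals; the bounded iteration count here is only so the sketch terminates): `heartbeat` runs on every iteration, before `do_work`, so a worker that finds no events still heartbeats.

```ruby
# Simplified sketch of a worker run loop that heartbeats every iteration,
# whether or not do_work finds anything to do.
class RunnerSketch
  attr_reader :heartbeats

  def initialize
    @heartbeats = 0
  end

  def heartbeat
    @heartbeats += 1
  end

  def do_work
    # fetch/process events here; may find none at all
  end

  def do_work_loop(iterations)
    iterations.times do
      heartbeat   # runs just before do_work, every time around the loop
      do_work
    end
  end
end

runner = RunnerSketch.new
runner.do_work_loop(3)   # 3 idle iterations still produce 3 heartbeats
```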

@durandom
Member Author

@blomquisg you are absolutely right - thanks for joining in and clarifying.

This made me re-visit this whole situation.

The reason the EventCatcher would not start was that it did not send event_monitor_running when there are no events coming in.
So I moved that up, before the loop.

Nevertheless, this is an indicator for how fragile this whole setup is - not just looking for cheap excuses 😄

@durandom
Member Author

@jrafanie could you re-review in the light of @blomquisg and my latest comment?
This boils down to moving one line up now.

If you want I can also close this PR and open a new one with just my last comment as description

@jrafanie
Member

@durandom So... this statement in the description of the PR:

In case there are no new events from ansible, the worker would not heartbeat.

And this from @blomquisg seem to contradict each other:

If the worker doesn't get any events for a long time, then based on what I'm reading, the heartbeat method will be called by the do_work_loop just before it calls do_work again.

This latter statement is true. Each iteration of the do_work_loop will heartbeat, as long as we haven't overridden this method. So, what are we solving in this PR again? Could you clarify the description? It appears we are going from heartbeating with each event to heartbeating with each call to monitor_events. What is this solving?

@durandom durandom changed the title from "heartbeat ansible event catcher" to "ansible event catcher - mark event_monitor_running when there are no events at startup" on Feb 17, 2017
@durandom
Member Author

@jrafanie sure. Rewrote the description and title. Sorry for the confusion, this was a long time coming. And thanks to @blomquisg for the clarifying comment

@jrafanie
Member

@jrafanie sure. Rewrote the description and title. Sorry for the confusion, this was a long time coming. And thanks to @blomquisg for the clarifying comment

No worries, it's easy to have PR comments/descriptions go stale so I try to ask dumb questions to see if we can help future "us" avoid confusion.

@miq-bot
Member

miq-bot commented Feb 27, 2017

This pull request is not mergeable. Please rebase and repush.

in case there are no events when the event catcher starts up
it would not send event_monitor_running
@miq-bot
Member

miq-bot commented Feb 27, 2017

Checked commit durandom@e7a344e with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
1 file checked, 0 offenses detected
Everything looks good. 👍

@durandom
Member Author

@jrafanie merge?

@jrafanie
Member

Thanks @durandom, I didn't know the conflict was resolved. 👍

@jrafanie
Member

@durandom can you mark euwe/darga? I believe this is euwe/no, darga/no...

@jrafanie jrafanie merged commit 93a8fd5 into ManageIQ:master Feb 28, 2017
@jrafanie jrafanie added this to the Sprint 56 Ending Mar 13, 2017 milestone Feb 28, 2017
@durandom
Member Author

durandom commented Mar 1, 2017

@miq-bot add_label darga/no

@durandom durandom deleted the heartbeat_ansible_worker branch March 17, 2017 18:14
7 participants