
Add container events to pod #16583

Merged
agrare merged 2 commits into ManageIQ:master from zeari:container_events_to_pod on Dec 21, 2017

Conversation

@zeari commented Dec 3, 2017

BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1496179

Container events were previously unhandled; they now belong to their parent pod.
Needs to be merged with:
ManageIQ/manageiq-providers-kubernetes#181
ManageIQ/manageiq-content#225

@cben @moolitayer @enoodle Please review
cc @simon3z @bazulay

@miq-bot add_label bug, providers/containers, automate

@zeari (Author) commented Dec 3, 2017

@agrare @Ladas @cben @gmcculloug
I can't quite tell whether we need to take care of event reconnection or not: #16497

I tested this successfully locally, but I'm not sure we will get the same results in production:

  1. A failure event comes in for a brand new pod with a container that has an unresolvable image name.
    • After this change the event should show up as "Pod Container Failed".
  2. We didn't collect that pod through refresh.
  3. The event triggers a refresh, so we get the new pod and the correct policy action occurs.

Can we expect the same results with a large OpenShift env? Does the refresh triggered by the event ALWAYS happen before the policy actions?
Does refreshing after each pod event even make sense? 😕
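
As a rough illustration of the routing this PR introduces (a standalone Ruby sketch with hypothetical field names, not the actual EmsEvent parsing code): a container-scoped event carries its pod's name and namespace, so it can be resolved against the parent pod and trigger pod-level policies.

    # Hypothetical, self-contained sketch -- not the ManageIQ implementation.
    # A container event names its parent pod; the pod becomes the event target.
    Pod = Struct.new(:name, :namespace)

    def pod_for_container_event(event, pods)
      # The container itself may not be inventoried yet, but the pod is the
      # policy target we want, so look the pod up by name and namespace.
      pods.find { |p| p.name == event[:pod_name] && p.namespace == event[:namespace] }
    end

    pods  = [Pod.new("goodbye-openshift", "demo")]
    event = {:event_type => "POD_CONTAINERFAILED", :pod_name => "goodbye-openshift", :namespace => "demo"}

    target = pod_for_container_event(event, pods)
    puts target ? "target: #{target.name}" : "Unable to find target [container_group]"

If the pod has not been refreshed into inventory yet, the lookup returns nil, which is exactly the "Unable to find target" case discussed below.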

@zeari (Author) commented Dec 3, 2017

Looking into it, I still occasionally see

evm.log:[----] W, [2017-12-03T17:13:18.139644 #8150:8e7114]  WARN -- : MIQ(EmsEvent#parse_policy_parameters) Unable to find target [container_group], skipping policy evaluation

so maybe the event fires several times. oc get events on the provider side yields a few:

13m        14m         3         goodbye-openshift                   Pod       spec.containers{goodbye-openshift}   Normal    BackOff                    kubelet, ocp-compute01.10.35.48.187.xip.io   Back-off pulling image "openshift/hello-openshiftaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
13m        14m         3         goodbye-openshift                   Pod       spec.containers{goodbye-openshift}   Warning   Failed                     kubelet, ocp-compute01.10.35.48.187.xip.io   Error: ImagePullBackOff

Looks to me like we do need to reconnect targetless events on refresh as per #16497

@Ladas (Contributor) commented Dec 4, 2017

@zeari

  1. A missing image for a Container will cause a missing STI type, and then a lot of code can explode: "Make sure Container has always the right STI type" manageiq-providers-kubernetes#177

2, 3) Even with refresh + reconnect, the refresh is async, so the policy will probably be executed before the refresh happens and will not have the container/pod associated.

In a bigger env it's expected that any event tied to the creation of an entity will not have the refreshed entity in time (processing a refresh takes much more time than processing the policy).

The policy, or the state machine invoking the policy, should have an async waiting loop (code that checks whether the Pod/Container is in our DB and sends an Automate :retry if not). Then of course we will need the post-refresh event reconnect, so the association is filled.

In the H-release we will be creating the Pod/Container with the event (no reconnect should be needed), and it will be a targeted refresh based on the event, so it will be much quicker. We will still need the waiting loop, though, to make sure the policy has the record and all of the record's attributes needed for processing.
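
A rough sketch of what that waiting step could look like as an Automate method (hypothetical: the root attribute name, retry limit, and interval are assumptions; only the generic ae_result/retry conventions are standard Automate usage):

    # Hypothetical Automate waiting step -- not code from this PR.
    # If the event target is not in the DB yet, ask Automate to retry later
    # instead of failing the policy processing outright.
    MAX_RETRIES = 10

    target = $evm.root['container_group']        # assumed attribute; nil until refresh inventories the pod
    if target
      $evm.root['ae_result'] = 'ok'              # target present, let the policy proceed
    elsif $evm.root['ae_state_retries'].to_i < MAX_RETRIES
      $evm.root['ae_result']         = 'retry'   # come back once refresh has had time to run
      $evm.root['ae_retry_interval'] = '30.seconds'
    else
      $evm.root['ae_result'] = 'error'           # give up: the record never appeared
    end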

containergroup_containerkilling,Pod Container Killing,Default,container_operations
containergroup_containerstarted,Pod Container Started,Default,container_operations
containergroup_containerstopped,Pod Container Stopped,Default,container_operations
containergroup_containerunhealthy,Pod Container Unhealthy,Default,container_operations
Inline review comment (Contributor) on the event definitions above:

❤️ the names with "Pod" at the beginning; this is presently the only way for a user to know which events belong in which policies.

@zeari (Author) commented Dec 4, 2017

@Ladas

My plan was to call handle a second time on each ems_event I managed to reconnect after refresh: #16497 (comment)

So the flow would be:

  1. An event comes in for a pod that is not in the DB.
  2. An "Unable to find target" message appears.
  3. The next refresh gets the pod.
  4. After the refresh, the event reconnects and handle is called on the events we reconnected.

Do you think it's a good intermediate solution?

EDIT: this is also a good approach for alerts
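
For illustration, that intermediate solution would look roughly like the sketch below (names are assumed throughout; calling handle again after refresh is exactly the point debated next, and the approach was ultimately dropped later in this thread):

    # Hypothetical sketch of the proposed intermediate solution -- assumed names,
    # not merged code; see the rest of the thread for why it was dropped.
    def reconnect_and_rehandle_orphan_events(ems)
      # Events that arrived before their pod was inventoried and therefore
      # could not resolve a target during the first pass.
      orphans = ems.ems_events.where(:container_group_id => nil)

      orphans.each do |event|
        pod = ems.container_groups.find_by(:name => event.container_group_name)
        next unless pod                       # still not in the DB; try again after the next refresh

        event.update(:container_group => pod) # reconnect the event to its pod
        event.handle                          # second pass, now that a target exists
      end
    end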

@cben (Contributor) commented Dec 4, 2017 via email

@Ladas (Contributor) commented Dec 4, 2017

@cben yeah, I think we need a waiting step in the event handling (async wait), waiting for the entity to be in the DB. This might apply to more events, not just events upon creation, since the refresh can take a long time.

@zeari what do you mean by the :handle? Calling the Event handle again after the reconnect?

@agrare I wonder, how do we usually solve this? Was refresh_new_target the way? Was it sync? I know the current refresh_sync is blocking, so that is not an option. The async waiting step sounds like the best option.

@zeari (Author) commented Dec 4, 2017

@zeari what do you mean by the :handle? Calling the Event handle again after the reconnect?

@Ladas yes

@Ladas (Contributor) commented Dec 4, 2017

@zeari I'm not sure what the correct way to process that would be; we would have to send the Event again to invoke the handle. We should not be calling handle directly from the refresh worker.

@zeari (Author) commented Dec 4, 2017

@zeari I'm not sure what the correct way to process that would be; we would have to send the Event again to invoke the handle. We should not be calling handle directly from the refresh worker.

@gmcculloug do you know the downsides of calling handle again on an event after the first time yielded "Unable to find target"? (Expecting that this time the event will have a target.)

@gmcculloug requested a review from lfu on December 4, 2017
@lfu (Member) commented Dec 4, 2017

@zeari the only downside I can think of with raising the event a second time is that refresh would be called again.

@lfu (Member) commented Dec 4, 2017

LGTM 👍

@cben (Contributor) commented Dec 5, 2017

Continuing the event reconnect discussion on #16497. This PR LGTM 👍

@moolitayer left a comment

LGTM 👍

@zeari (Author) commented Dec 7, 2017

@lfu @gmcculloug
Can we go forward with this PR and ManageIQ/manageiq-content#225?

@zeari (Author) commented Dec 7, 2017

@miq-bot add_label gaprindashvili/yes

@zeari (Author) commented Dec 7, 2017

@miq-bot add_label fine/yes

@agrare (Member) commented Dec 7, 2017

@agrare I wonder, how do we usually solve this? Was refresh_new_target the way?

For VMs we raise a creation_event during post_refresh, which is what people usually attach their policy events to (so @gmcculloug tells me). Can we do this for containers/container_groups as well? It looks like we already have a creation_event on container_images.
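
A hedged illustration of that approach adapted to pods (hypothetical helper and event name, modeled on the description above rather than on the actual VM code or anything in this PR): the synthesized event is raised only after the record is saved, so the policy target is guaranteed to exist.

    # Hypothetical post-refresh hook -- the names and the event-raising call
    # are assumptions, not the actual ManageIQ implementation.
    def raise_pod_creation_events(newly_added_pods)
      newly_added_pods.each do |pod|
        # The pod is already saved at this point, so a policy attached to the
        # synthesized "created" event always finds its target in the DB.
        raise_synthetic_event(pod, 'containergroup_created')
      end
    end

    def raise_synthetic_event(pod, event_type)
      # Placeholder: in ManageIQ this would queue an EmsEvent; here we only log.
      puts "raising #{event_type} for pod #{pod.name}"
    end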

@cben (Contributor) commented Dec 7, 2017

Yes, there is a separate BZ to add such creation events for pods etc. These are nice because they're guaranteed to already have inventory.

But the discussion here (moved to #16497) was about processing "real" external events that arrive before inventory.

@Ladas (Contributor) commented Dec 8, 2017

@agrare as @cben says, in the world of long refreshes and fast events, any event can have its target missing

so it can be:

  • pod_1_created_event
  • refresh(started+finished)
  • pod_1_created_policy
  • pod_1_any_other_event_policy -> ( 🎉 yaay, target is in the DB)

or:

  • pod_1_created_event
  • refresh (started)
  • pod_1_any_other_event_policy -> (this goes 💥 , since the target is not in the DB yet)
  • refresh (finished)
  • pod_1_created_policy

@agrare (Member) commented Dec 8, 2017

My biggest issue with just running the event handler again is that if a customer adds some action which isn't idempotent, then in addition to triggering the policy event it could break their workflows.

This is a pretty fundamental change in how event handlers are run, and we cannot say for certain that this won't cause regressions for customers.

Today, if a customer puts a policy event on a native VmCreatedEvent from VMware, it isn't guaranteed that the target is in the DB, so this is no different; that is why we have the synthesized events when a VM is created in the DB.

@cben (Contributor) commented Dec 13, 2017

@lfu @gmcculloug @agrare so, can we go forward with this PR, ManageIQ/manageiq-content#225, and ManageIQ/manageiq-providers-kubernetes#181?
I think all 3 got positive reviews; it's just that we got into a discussion about reconnecting and re-processing events, which these 3 PRs do not do.

@lfu (Member) commented Dec 13, 2017

Sounds good to me.

@Ladas (Contributor) commented Dec 13, 2017

@cben I would probably wait with ManageIQ/manageiq-content#225, since that might cause a lot of failures with missing targets, unless we solve the "wait for the target to be in the DB" somehow.

@agrare (Member) commented Dec 13, 2017

@zeari @cben can you explain a bit better why you want to switch all of these from being associated with the container to being associated with the pod? It isn't clear to me why this would help.

My plan was to call handle a second time
Do you think it's a good intermediate solution?

Right now I'm 👎 on this unless @gmcculloug tells me I'm way off base on this and there's nothing to worry about 😄. This is a big change and we can't just spring this on customers with custom automate methods that might not behave well when run multiple times.

If we were waiting to execute the policy until the item was in the DB, maybe, but that won't handle the case where the container will never be in the DB.

@zeari (Author) commented Dec 14, 2017

@agrare Well, there isn't code to associate those events with Container in the first place: https://github.com/manageiq/manageiq/blob/master/app/models/ems_event.rb#L140

While there was some pre-work to have them attached to Container here, in the current state container events aren't handled, just as the BZ says.
Having those events target the pod fixes the issue, and it also makes sense because:

  1. Users are likely to attach policies to pods.
  2. Containers are an inseparable part of their pods (on the OpenShift side), so the events target the pod anyway.
  3. @cben, come up with a #3?

@zeari (Author) commented Dec 17, 2017

on customers with custom automate methods that might not behave well when run multiple times.

I don't think these events are currently working for any customer, as they were never associated correctly with containers 😅

@cben (Contributor) commented Dec 17, 2017 via email

@zeari (Author) commented Dec 17, 2017

EDIT: Also, can we please separate discussion of calling .handle twice? That's a separate PR, it's an orthogonal goal to make all node/pod/container events robust. Unless I'm missing some way it affects this PR?

I think we are past that discussion. It's not a viable solution and I closed the other PR.

@zeari (Author) commented Dec 21, 2017

@agrare
I removed the container prefixes in the container-specific events.
Changed pod_containercreated to just pod_created:

containergroup_killing,Pod Container Killing,Default,container_operations
containergroup_started,Pod Container Started,Default,container_operations
containergroup_stopped,Pod Container Stopped,Default,container_operations
containergroup_unhealthy,Pod Container Unhealthy,Default,container_operations
Inline review comment (Author) on the event definitions above:

@agrare I think we should at least keep the labels more indicative: Pod Container Created

@miq-bot (Member) commented Dec 21, 2017

Checked commits zeari/manageiq@81db841~...99bbac4 with ruby 2.3.3, rubocop 0.47.1, haml-lint 0.20.0, and yamllint 1.10.0
0 files checked, 0 offenses detected
Everything looks fine. 🍰

@agrare left a comment

👍 LGTM

@agrare merged commit 8e2bfac into ManageIQ:master on Dec 21, 2017
@agrare self-assigned this on Dec 21, 2017
@agrare added this to the Sprint 76 Ending Jan 1, 2018 milestone on Dec 21, 2017
simaishi pushed a commit that referenced this pull request Jan 3, 2018
@simaishi (Contributor) commented Jan 3, 2018

Gaprindashvili backport details:

$ git log -1
commit a9d71962b5e148c43fdb540a7d992ccdd1f032c4
Author: Adam Grare <[email protected]>
Date:   Thu Dec 21 08:43:48 2017 -0500

    Merge pull request #16583 from zeari/container_events_to_pod
    
    Add container events to pod
    (cherry picked from commit 8e2bfac33e7e335d07e07ca72b7f63769028527d)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1530651

@simaishi (Contributor) commented Jan 4, 2018

Fine backport details:

$ git log -1
commit fd945e735def672519fa40f48fc1ef909ae3a462
Author: Adam Grare <[email protected]>
Date:   Thu Dec 21 08:43:48 2017 -0500

    Merge pull request #16583 from zeari/container_events_to_pod
    
    Add container events to pod
    (cherry picked from commit 8e2bfac33e7e335d07e07ca72b7f63769028527d)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1530653

d-m-u pushed a commit to d-m-u/manageiq that referenced this pull request Jun 6, 2018