
Redesign Event API #383

Closed
gmarek opened this issue Aug 4, 2017 · 150 comments · Fixed by #3760
Labels
sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status

Comments

@gmarek

gmarek commented Aug 4, 2017

Feature Description

  • One-line feature description (can be used as a release note): Add more structure to Event API and change deduplication logic so Events won't overload the cluster
  • Primary contact (assignee): @gmarek
  • Responsible SIGs: instrumentation
  • KEP: new-event-api-ga-graduation
  • Design proposal link (community repo): design google doc - design discussions in GitHub are too painful for me
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred: @timothysc @wojtek-t
  • Approver (likely from SIG/area to which feature belongs): @bgrant0607 @thockin @countspongebob
  • Feature target (which target equals to which milestone):
    • Beta: 1.8 [done]
    • GA: 1.19 [done]
@idvoretskyi
Member

@gmarek the feature submission deadline has passed (Aug 1). Please submit a feature exception (https://github.com/kubernetes/features/blob/master/EXCEPTIONS.md) to have this feature present in the 1.8 release.

@idvoretskyi idvoretskyi added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Aug 9, 2017
@jdumars
Member

jdumars commented Aug 16, 2017

@gmarek as you probably saw in the other feature comments, I am trying to understand how some features didn't get into the features repo before the deadline. This is only for the purpose of improving our release process and notifications for next time, not for blaming or pointing fingers. We're also trying to understand if there was prior work done on the feature, or if it was created after the freeze date.

@gmarek
Author

gmarek commented Aug 16, 2017

Yup, there's quite some work done on this, with a (big) design doc shared with kubernetes-dev and in-depth discussion in SIG Scalability. For this we'll probably make it disabled by default, as there's not enough time to let it soak. Is it possible to have a 'quiet' release for things like that? @jdumars

@jdumars
Member

jdumars commented Aug 16, 2017

@gmarek that's an interesting question. My personal opinion is to provide as much transparency as possible, so we maintain a bond of trust with our user community. Since you get to write the release notes, you can add something short there about it. And thanks for clarifying the feature itself.

@countspongebob

Personal perspective on this, largely repeating comments I've made before. But, as this is a case in point...

  • SIG PM involvement and feature submissions have been functionally optional, with SIG PM not empowered to actually keep things out of a release.
  • There is continued confusion over what is a feature. I echo @jbeda in calling for these to be renamed "efforts". The implication would be 100% coverage, but see my first point.

We had a discussion in SIG Scalability, especially about point #2, with no clear resolution. A few of us lobbied @gmarek to do the feature submission notwithstanding the points above, and he agreed to do so.

@idvoretskyi
Member

@jdumars @countspongebob @gmarek the main point to discuss here is the formal dates and deadlines, and what happens if someone avoids them. We have agreed that the feature freeze for 1.8 (https://github.com/kubernetes/features/blob/master/release-1.8/release-1.8.md) is August 1, so all the features have to be submitted to the features repo before this date.

If the people responsible for the release and the overall community feel that this deadline is not mandatory, it can be discussed and removed. From our (PM group) standpoint, the feature freeze is necessary from a high-level point of view (including planning of the roadmap, marketing activities, etc.). But if there are some reasons why we shouldn't have a feature freeze, again, let's discuss them.

PS. It has been a long-discussed question in the community, even before SIG-PM was established. Now it might be a good time to solve it.

@idvoretskyi
Member

@countspongebob

SIG PM involvement and feature submissions have been functionally optional, with SIG PM not empowered to actually keep things out of a release.

SIG PM is not empowered, but the release team is. SIG PM is responsible for managing the features at a high level, so we can provide the release team with the clearest and most transparent information about the feature.

@gmarek
Author

gmarek commented Aug 18, 2017

@idvoretskyi - IIUC the exception process is a SIG-PM thing. I haven't heard complaints from the release team about developing features that are not enabled and don't impact current behavior (plus it's highly unlikely it will be finished in the 1.8 timeframe). I'm happy to discuss it as soon as any doubts appear.

Please correct me if I'm wrong - the goal is to track features that will ship in a current release, not the development process that may span multiple releases. If I'm not mistaken, this means that "features" (for lack of a better word) that are disabled and not ready to be enabled don't need to be tracked, right?

@gmarek
Author

gmarek commented Aug 18, 2017

Also note that it's not clear what constitutes a 'feature' and where the border is between a new feature and an 'improvement' that doesn't need a feature repo issue.

Slightly OT, but related to shipping features - it was widely acknowledged that @kubernetes/sig-scalability-misc has the power to block features which cause performance degradation bad enough to make Kubernetes clusters break our performance SLOs (this is of course decided together with the release team). This is decided close to the release dates, when scale tests on a given release are finished. I'm saying this to make clear that the feature repo can't be treated as a source of truth about the features that will ship in a given release.

@idvoretskyi
Member

@gmarek any plans to continue development of this item for 1.9?

@idvoretskyi idvoretskyi added this to the next-milestone milestone Oct 2, 2017
@bgrant0607
Member

@idvoretskyi
Member

@bgrant0607 perfect. Updating the milestone.

@idvoretskyi idvoretskyi modified the milestones: next-milestone, 1.9 Oct 2, 2017
@gmarek
Author

gmarek commented Oct 3, 2017

PR is also ready for review (not started because of 1.8): kubernetes/kubernetes#49112

@idvoretskyi idvoretskyi added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Nov 13, 2017
@idvoretskyi
Member

@gmarek can you confirm that it's alpha for 1.9?

@zacharysarah
Contributor

@gmarek 👋 Please indicate in the 1.9 feature tracking board
whether this feature needs documentation. If yes, please open a PR and add a link to the tracking spreadsheet. Thanks in advance!

@luxas luxas reopened this Nov 26, 2017
@0xmichalis 0xmichalis reopened this Nov 27, 2017
sttts pushed a commit to sttts/api that referenced this issue Nov 27, 2017
Automatic merge from submit-queue (batch tested with PRs 55952, 49112, 55450, 56178, 56151). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

New API group for Events.

Fix kubernetes/enhancements#383

cc @shyamjvs

```release-note
Add events.k8s.io api group with v1beta1 API containing redesigned Event type.
```

Kubernetes-commit: 60c20901911c710491a57eb8b9c48850cdbab054
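For context, here is a rough sketch of what recording through the new events.k8s.io group looks like with client-go's tools/events package (hedged: signatures recalled from memory and may differ across client-go versions; the component name, pod, reason, action, and node name are illustrative placeholders):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/events"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The sink writes to the events.k8s.io endpoint instead of core/v1.
	broadcaster := events.NewBroadcaster(&events.EventSinkImpl{Interface: client.EventsV1()})
	stop := make(chan struct{})
	defer close(stop)
	broadcaster.StartRecordingToSink(stop)

	recorder := broadcaster.NewRecorder(scheme.Scheme, "example-controller")

	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "foo", Namespace: "default"}}
	// Compared to the core/v1 recorder, the new one takes an explicit
	// related object and an action alongside the reason and note.
	recorder.Eventf(pod, nil, corev1.EventTypeNormal, "Scheduled", "Binding", "placed pod on node %q", "node-1")
}
```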
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2022
@logicalhan
Member

We need to find an owner for this. This is actually not GA because it's not completely migrated, and now we are in a weird position where we use two Events APIs.

@wojtek-t
Member

This is actually not GA because it's not completely migrated, and now we are in a weird position where we use two Events APIs.

They are round-trippable across each other, so it's not a big deal from that POV.
But I agree we should finally finish this migration...
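Since round-trippability is the load-bearing point here, a hedged sketch of the field correspondence between the two types (field names recalled from the k8s.io/api types; the authoritative conversion is the generated code in Kubernetes' apis/events package, and the EventSeries mapping is omitted for brevity):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	eventsv1 "k8s.io/api/events/v1"
)

// coreToEventsV1 illustrates why no information is lost: renamed fields map
// one-to-one, and legacy-only fields survive in Deprecated* mirrors.
func coreToEventsV1(in corev1.Event) eventsv1.Event {
	return eventsv1.Event{
		ObjectMeta: in.ObjectMeta,
		// Renamed fields.
		Regarding: in.InvolvedObject,
		Note:      in.Message,
		// Fields shared by both schemas.
		Reason:              in.Reason,
		Type:                in.Type,
		EventTime:           in.EventTime,
		Action:              in.Action,
		Related:             in.Related,
		ReportingController: in.ReportingController,
		ReportingInstance:   in.ReportingInstance,
		// Legacy fields carried verbatim so the conversion can be reversed.
		DeprecatedSource:         in.Source,
		DeprecatedFirstTimestamp: in.FirstTimestamp,
		DeprecatedLastTimestamp:  in.LastTimestamp,
		DeprecatedCount:          in.Count,
	}
}
```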

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@dgrisonnet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 31, 2022
@jdumars
Member

jdumars commented Sep 1, 2022

Hey folks, can we summarize what exactly is needed to wrap this up? That may help us identify people who can work on this.

@lavalamp
Member

lavalamp commented Sep 1, 2022

Hey folks, can we summarize what exactly is needed to wrap this up?

+1, there are like 300 comments on this and it's impossible to read now. I'd actually like to close this in favor of a follow-up issue, or edit the OP with the remaining needs. I don't know the state either.

@dgrisonnet
Member

The KEP is stable and the new Events API has also graduated to stable, but not all of the Kubernetes components have been migrated from the core/v1 API to the events/v1 one, which is the reason why we are keeping this issue alive. Out of all the components, I think only the kube-scheduler was migrated. There were some concerns in the past that weren't addressed, which kind of paused the total migration.

For example the kube-controller-manager migration was halted: kubernetes/kubernetes#82834

I was just discussing that topic with @liggitt today since downstream we are often hit by misconfigured clients spamming the apiserver with similar events, which often results in the creation of tens of thousands of Event objects in the apiserver. The go-to way to solve this problem is to migrate to the new Events API, since it was designed to reduce the number of calls made to the apiserver when identical events are emitted in a short span of time.

So far we've found two areas where the current implementation of the new Events API could be improved in order to make the migration a lot safer for all the components.

  1. The current UX of the events/v1 API is very different from the core/v1 one. For instance, when an event is emitted with the core/v1 API, it is directly created in the apiserver, but for the events/v1 API, it is completely different. The new API introduces the notion of EventSeries. A series is a suite of similar events that occurred in a time window of less than 6 minutes from one another. When an event is first emitted, a call will be sent to the apiserver to create the Event object. If another similar event is emitted less than 6 minutes after the first one, instead of being created on the apiserver directly, an EventSeries will be created and the counter of the event will increase in the cache. But the Event object will only be updated when the EventSeries is finished or when the event broadcaster hits a 30-minute heartbeat. So if a series of similar events occurs, which is often the case for pod startup issues, the user will only know that the event occurred multiple times after ~30 minutes, which is far from ideal (see the sketch of this logic after this comment).
    To improve that, we want to change that 30-minute heartbeat mechanism to a backoff on series, which would allow having a more responsive API without spamming the apiserver too much.

  2. The second aspect is about rate limiting. There was a time when the kubelet was creating an insane amount of unique events, which impacted the kube-apiserver massively. To prevent that, a rate-limiting mechanism was added. I know that there is one that limits the size of the queue of events to 1000 objects before dropping them (https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/event.go#L41), but Jordan mentioned that there might be something else. I still need to investigate that part, but as of today, I wouldn't be able to safely say that the new API has a mechanism to protect against a surge of unique events, which is something we definitely want to make sure of before migrating all the components.

I've only started looking at that again yesterday so there might be some more things to think about, but for now my plan is to prepare a new KEP detailing the migration process and the different safeguards that are in place to prevent any issues post-update.
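A minimal sketch of the series behavior described in point 1, assuming the 6-minute window and local counting outlined above (illustrative only, not the actual client-go implementation; the key fields and names are made up for the example):

```go
package main

import (
	"fmt"
	"time"
)

// eventKey identifies "similar" events (same object, reason, note, ...).
type eventKey struct{ regarding, reason, note string }

type cachedEvent struct {
	lastObserved time.Time
	seriesCount  int32 // repeats recorded locally, not yet on the apiserver
}

const seriesWindow = 6 * time.Minute // repeats inside this window join a series

type recorder struct{ cache map[eventKey]*cachedEvent }

// observe returns what the recorder would do for an incoming event.
func (r *recorder) observe(k eventKey, now time.Time) string {
	e, ok := r.cache[k]
	if !ok || now.Sub(e.lastObserved) >= seriesWindow {
		// First occurrence, or the previous series expired: create the Event.
		r.cache[k] = &cachedEvent{lastObserved: now}
		return "POST new Event to the apiserver"
	}
	// Repeat inside the window: only the in-memory counter moves; the
	// apiserver sees it when the series closes or on the (30m) heartbeat.
	e.seriesCount++
	e.lastObserved = now
	return fmt.Sprintf("bump local series counter (count=%d)", e.seriesCount)
}

func main() {
	r := &recorder{cache: map[eventKey]*cachedEvent{}}
	k := eventKey{"pod/foo", "FailedMount", "volume not found"}
	t0 := time.Now()
	fmt.Println(r.observe(k, t0))                     // new Event
	fmt.Println(r.observe(k, t0.Add(time.Minute)))    // counted locally only
	fmt.Println(r.observe(k, t0.Add(10*time.Minute))) // window expired: new Event
}
```

The proposed change would replace the fixed 30-minute heartbeat with a backoff schedule for flushing the series counter, so users see repeat counts sooner without extra apiserver load.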

akhilerm pushed a commit to akhilerm/apimachinery that referenced this issue Sep 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@dgrisonnet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@sftim
Contributor

sftim commented Dec 22, 2022

The KEP is stable and the new Events API has also graduated to stable, but not all of the Kubernetes components have been migrated from the core/v1 API to the events/v1 one, which is the reason why we are keeping this issue alive.

Do we also want to get kubectl to prefer the new API?

@dgrisonnet
Member

Today both APIs are treated the same way by kubectl; kubectl get events already shows events from both APIs. Going forward I think we should stick to that behavior for compatibility reasons.

@wojtek-t
Member

Today both APIs are treated the same way by kubectl; kubectl get events already shows events from both APIs. Going forward I think we should stick to that behavior for compatibility reasons.

+1

Also catching up with the rest of this thread.

but Jordan mentioned that there might be something else. I still need to investigate that part, but as of today, I wouldn't be able to safely say that the new API has a mechanism to protect against a surge of unique events, which is something we definitely want to make sure of before migrating all the components.

Interesting. The only other thing I know is the event aggregator:
https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/events_cache.go#L165

But if we suspect there are some other mechanisms, we should double-check that. It could only live in one of two places:

  • events library (client-go/tools/record)
  • in the apiserver itself (in events-specific code like registry, strategy etc.)

but for now my plan is to prepare a new KEP detailing the migration process and the different safeguards that are in place to prevent any issues post update.

@dgrisonnet - That sounds great. Are you planning to target this for 1.27?
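For reference, a sketch of where the legacy record library's protection knobs sit, via the CorrelatorOptions that (as far as I can tell from the v0.25 sources) configure both the spam filter and the aggregator mentioned above; the numbers mirror what I believe the defaults to be:

```go
package main

import "k8s.io/client-go/tools/record"

// newBroadcaster wires up the legacy (core/v1) event broadcaster with
// explicit correlator settings instead of the built-in defaults.
func newBroadcaster() record.EventBroadcaster {
	return record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
		// Spam filter: a token bucket keyed by event source and object.
		BurstSize: 25,        // believed default: 25 events allowed in a burst
		QPS:       1.0 / 300, // believed default: refill one event per 5 minutes
		// Aggregator: after MaxEvents similar events within
		// MaxIntervalInSeconds, further ones are folded into one aggregate.
		MaxEvents:            10,
		MaxIntervalInSeconds: 600,
	})
}
```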

@dgrisonnet
Member

Interesting. The only other thing I know is the event aggregator:
https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/events_cache.go#L165

Yes, that's also the only mechanism I am aware of, and so far I haven't been able to find anything else in the client, but it would be worth investigating more on the server side.

@dgrisonnet - That sounds great. Are you planning to target this for 1.27?

I haven't started writing anything yet, but I am considering submitting a KEP for 1.27. Do you think you will have enough bandwidth to review it?

@wojtek-t
Member

I haven't started writing anything yet, but I am considering submitting a KEP for 1.27. Do you think you will have enough bandwidth to review it?

Yes - I would find time for reviewing the KEP (and probably some code too), but I won't have time to contribute anything myself.

@logicalhan
Member

/assign

@ehashman
Member

With #3728 covering the remaining event API migration work, I'd like to rescope this so we can close out this enhancement.

@dgrisonnet
Member

I opened #3760 to update the criteria.
