
Redesign Event API #383

Closed
gmarek opened this issue Aug 4, 2017 · 150 comments · Fixed by #3760
Labels
sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. stage/stable Denotes an issue tracking an enhancement targeted for Stable/GA status

Comments

@gmarek

gmarek commented Aug 4, 2017

Feature Description

  • One-line feature description (can be used as a release note): Add more structure to Event API and change deduplication logic so Events won't overload the cluster
  • Primary contact (assignee): @gmarek
  • Responsible SIGs: instrumentation
  • KEP: new-event-api-ga-graduation
  • Design proposal link (community repo): design google doc - design discussions in GitHub are too painful for me
  • Reviewer(s) - (for LGTM) recommend having 2+ reviewers (at least one from code-area OWNERS file) agreed to review. Reviewers from multiple companies preferred: @timothysc @wojtek-t
  • Approver (likely from SIG/area to which feature belongs): @bgrant0607 @thockin @countspongebob
  • Feature target (which target equals to which milestone):
    • Beta: 1.8 [done]
    • GA: 1.19 [done]
@idvoretskyi
Member

@gmarek the feature submission deadline has passed (Aug 1). Please submit a feature exception (https://github.com/kubernetes/features/blob/master/EXCEPTIONS.md) to have this feature present in the 1.8 release.

@idvoretskyi idvoretskyi added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Aug 9, 2017
@jdumars
Member

jdumars commented Aug 16, 2017

@gmarek as you probably saw in the other feature comments, I am trying to understand how some features didn't get into the features repo before the deadline. This is only for the purpose of improving our release process and notifications for next time, not for blaming or pointing fingers. We're also trying to understand if there was prior work done on the feature, or if it was created after the freeze date.

@gmarek
Author

gmarek commented Aug 16, 2017

Yup, there's quite some work done on this, with a (big) design doc shared with kubernetes-dev and in-depth discussion in SIG Scalability. For this we'll probably make it disabled by default, as there's not enough time to let it soak. Is it possible to have a 'quiet' release for things like that? @jdumars

@jdumars
Member

jdumars commented Aug 16, 2017

@gmarek that's an interesting question. My personal opinion is to provide as much transparency as possible, so we maintain a bond of trust with our user community. Since you get to write the release notes, you can add something short there about it. And thanks for clarifying the feature itself.

@countspongebob

Personal perspective on this, largely repeating comments I've made before. But, as this is a case in point...

  • SIG PM involvement and feature submissions have been functionally optional, with SIG PM not empowered to actually keep things out of a release.
  • There is continued confusion over what is a feature. I echo @jbeda in calling for these to be renamed "efforts". The implication would be 100% coverage, but see my first point.

We had a discussion in SIG Scalability, especially about point #2, with no clear resolution. A few of us lobbied @gmarek to do the feature submission notwithstanding the points above, and he agreed to do so.

@idvoretskyi
Member

@jdumars @countspongebob @gmarek the main point to discuss here is the formal dates and deadlines, and what happens if someone avoids them. We have agreed that the feature freeze for 1.8 (https://github.com/kubernetes/features/blob/master/release-1.8/release-1.8.md) is August 1, so all the features have to be submitted to the features repo before this date.

If the people responsible for the release and the overall community feel that this deadline is not mandatory, it can be discussed and removed. From our (PM group) standpoint, the feature freeze is necessary from a high-level point of view (including planning of the roadmap, marketing activities, etc.). But if there are some reasons why we shouldn't have a feature freeze, again, let's discuss them.

PS. It has been a long-discussed question in the community, even before SIG-PM was established. Now it might be a good time to solve it.

@idvoretskyi
Member

@countspongebob

SIG PM involvement and feature submissions have been functionally optional, with SIG PM not empowered to actually keep things out of a release.

SIG PM is not empowered, but the release team is. SIG PM is responsible for managing the features at a high level, so we can provide the release team with the clearest and most transparent information about the feature.

@gmarek
Author

gmarek commented Aug 18, 2017

@idvoretskyi - IIUC the exception process is a SIG-PM thing. I haven't heard complaints from the release team about developing features that are not enabled and don't impact current behavior (plus it's highly unlikely it will be finished in the 1.8 timeframe). I'm happy to discuss it as soon as any doubts appear.

Please correct me if I'm wrong - the goal is to track features that will ship in a current release, not the development process that may span multiple releases. If I'm not mistaken, this means that "features" (for lack of a better word) that are disabled and not ready to be enabled don't need to be tracked, right?

@gmarek
Author

gmarek commented Aug 18, 2017

Also note that it's not clear what constitutes a 'feature' and where the border is between a new feature and an 'improvement' that doesn't need a feature repo issue.

Slightly OT, but related to shipping features - it was widely acknowledged that @kubernetes/sig-scalability-misc has the power to block features which cause performance degradation bad enough to make Kubernetes clusters break our performance SLOs (this is of course decided together with the release team). This is decided close to the release dates, when scale tests on a given release are finished. I'm saying this to make clear that the feature repo can't be treated as a source of truth about the features that will ship in a given release.

@idvoretskyi
Member

@gmarek any plans to continue development of this item for 1.9?

@idvoretskyi idvoretskyi added this to the next-milestone milestone Oct 2, 2017
@bgrant0607
Member

@idvoretskyi
Member

@bgrant0607 perfect. Updating the milestone.

@idvoretskyi idvoretskyi modified the milestones: next-milestone, 1.9 Oct 2, 2017
@gmarek
Author

gmarek commented Oct 3, 2017

PR is also ready for review (not started because of 1.8): kubernetes/kubernetes#49112

@idvoretskyi idvoretskyi added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label Nov 13, 2017
@idvoretskyi
Member

@gmarek can you confirm that it's alpha for 1.9?

@zacharysarah
Contributor

@gmarek 👋 Please indicate in the 1.9 feature tracking board
whether this feature needs documentation. If yes, please open a PR and add a link to the tracking spreadsheet. Thanks in advance!

@luxas luxas reopened this Nov 26, 2017
@0xmichalis 0xmichalis reopened this Nov 27, 2017
sttts pushed a commit to sttts/api that referenced this issue Nov 27, 2017
Automatic merge from submit-queue (batch tested with PRs 55952, 49112, 55450, 56178, 56151). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

New API group for Events.

Fix kubernetes/enhancements#383

cc @shyamjvs

```release-note
Add events.k8s.io api group with v1beta1 API containing redesigned Event type.
```

Kubernetes-commit: 60c20901911c710491a57eb8b9c48850cdbab054
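For context, here is a rough sketch of what recording through the new events.k8s.io group looks like with client-go's tools/events package (hedged: signatures recalled from memory and may differ across client-go versions; the component name, pod, reason, action, and node name are illustrative placeholders):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/events"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The sink writes to the events.k8s.io endpoint instead of core/v1.
	broadcaster := events.NewBroadcaster(&events.EventSinkImpl{Interface: client.EventsV1()})
	stop := make(chan struct{})
	defer close(stop)
	broadcaster.StartRecordingToSink(stop)

	recorder := broadcaster.NewRecorder(scheme.Scheme, "example-controller")

	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "foo", Namespace: "default"}}
	// Compared to the core/v1 recorder, the new one takes an explicit
	// related object and an action alongside the reason and note.
	recorder.Eventf(pod, nil, corev1.EventTypeNormal, "Scheduled", "Binding", "placed pod on node %q", "node-1")
}
```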
@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 12, 2022
@logicalhan
Member

We need to find an owner for this. This is actually not GA because it's not completely migrated, and now we are in a weird position where we use two Events APIs.

@wojtek-t
Member

This is actually not GA because it's not completely migrated, and now we are in a weird position where we use two Events APIs.

They are round-trippable across each other, so it's not a big deal from that POV.
But I agree we should finally finish this migration...
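Since round-trippability is the load-bearing point here, a hedged sketch of the field correspondence between the two types (field names recalled from the k8s.io/api types; the authoritative conversion is the generated code in Kubernetes' apis/events package, and the EventSeries mapping is omitted for brevity):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	eventsv1 "k8s.io/api/events/v1"
)

// coreToEventsV1 illustrates why no information is lost: renamed fields map
// one-to-one, and legacy-only fields survive in Deprecated* mirrors.
func coreToEventsV1(in corev1.Event) eventsv1.Event {
	return eventsv1.Event{
		ObjectMeta: in.ObjectMeta,
		// Renamed fields.
		Regarding: in.InvolvedObject,
		Note:      in.Message,
		// Fields shared by both schemas.
		Reason:              in.Reason,
		Type:                in.Type,
		EventTime:           in.EventTime,
		Action:              in.Action,
		Related:             in.Related,
		ReportingController: in.ReportingController,
		ReportingInstance:   in.ReportingInstance,
		// Legacy fields carried verbatim so the conversion can be reversed.
		DeprecatedSource:         in.Source,
		DeprecatedFirstTimestamp: in.FirstTimestamp,
		DeprecatedLastTimestamp:  in.LastTimestamp,
		DeprecatedCount:          in.Count,
	}
}
```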

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 10, 2022
@dgrisonnet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 31, 2022
@jdumars
Member

jdumars commented Sep 1, 2022

Hey folks, can we summarize what exactly is needed to wrap this up? That may help us identify people who can work on this.

@lavalamp
Member

lavalamp commented Sep 1, 2022

Hey folks, can we summarize what exactly is needed to wrap this up?

+1, there are like 300 comments on this and it's impossible to read now. I'd actually like to close this in favor of a follow-up issue, or edit the OP with the remaining needs. I don't know the state either.

@dgrisonnet
Member

The KEP is stable and the new Events API has also graduated to stable, but not all of the Kubernetes components have been migrated from the core/v1 API to the events/v1 one, which is the reason why we are keeping this issue alive. Out of all the components, I think only the kube-scheduler was migrated. There were some concerns in the past that weren't addressed, which kind of paused the total migration.

For example the kube-controller-manager migration was halted: kubernetes/kubernetes#82834

I was just discussing that topic with @liggitt today since downstream we are often hit by misconfigured clients spamming the apiserver with similar events, which often results in the creation of tens of thousands of Event objects in the apiserver. The go-to way to solve this problem is to migrate to the new Events API, since it was designed to reduce the number of calls made to the apiserver when identical events are emitted in a short span of time.

So far we've found two areas where the current implementation of the new Events API could be improved in order to make the migration a lot safer for all the components.

  1. The current UX of the events/v1 API is very different from the core/v1 one. For instance, when an event is emitted with the core/v1 API, it is directly created in the apiserver, but for the events/v1 API, it is completely different. The new API introduces the notion of EventSeries. A series is a suite of similar events that occurred in a time window of less than 6 minutes from one another. When an event is first emitted, a call will be sent to the apiserver to create the Event object. If another similar event is emitted less than 6 minutes after the first one, instead of being created on the apiserver directly, an EventSeries will be created and the counter of the event will increase in the cache. But the Event object will only be updated when the EventSeries is finished or when the event broadcaster hits a 30-minute heartbeat. So if a series of similar events occurs, which is often the case for pod startup issues, the user will only know that the event occurred multiple times after ~30 minutes, which is far from ideal (see the sketch of this logic after this comment).
    To improve that, we want to change that 30-minute heartbeat mechanism to a backoff on series, which would allow having a more responsive API without spamming the apiserver too much.

  2. The second aspect is about rate limiting. There was a time when the kubelet was creating an insane amount of unique events, which impacted the kube-apiserver massively. To prevent that, a rate-limiting mechanism was added. I know that there is one that limits the size of the queue of events to 1000 objects before dropping them (https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/event.go#L41), but Jordan mentioned that there might be something else. I still need to investigate that part, but as of today, I wouldn't be able to safely say that the new API has a mechanism to protect against a surge of unique events, which is something we definitely want to make sure of before migrating all the components.

I've only started looking at that again yesterday so there might be some more things to think about, but for now my plan is to prepare a new KEP detailing the migration process and the different safeguards that are in place to prevent any issues post-update.
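A minimal sketch of the series behavior described in point 1, assuming the 6-minute window and local counting outlined above (illustrative only, not the actual client-go implementation; the key fields and names are made up for the example):

```go
package main

import (
	"fmt"
	"time"
)

// eventKey identifies "similar" events (same object, reason, note, ...).
type eventKey struct{ regarding, reason, note string }

type cachedEvent struct {
	lastObserved time.Time
	seriesCount  int32 // repeats recorded locally, not yet on the apiserver
}

const seriesWindow = 6 * time.Minute // repeats inside this window join a series

type recorder struct{ cache map[eventKey]*cachedEvent }

// observe returns what the recorder would do for an incoming event.
func (r *recorder) observe(k eventKey, now time.Time) string {
	e, ok := r.cache[k]
	if !ok || now.Sub(e.lastObserved) >= seriesWindow {
		// First occurrence, or the previous series expired: create the Event.
		r.cache[k] = &cachedEvent{lastObserved: now}
		return "POST new Event to the apiserver"
	}
	// Repeat inside the window: only the in-memory counter moves; the
	// apiserver sees it when the series closes or on the (30m) heartbeat.
	e.seriesCount++
	e.lastObserved = now
	return fmt.Sprintf("bump local series counter (count=%d)", e.seriesCount)
}

func main() {
	r := &recorder{cache: map[eventKey]*cachedEvent{}}
	k := eventKey{"pod/foo", "FailedMount", "volume not found"}
	t0 := time.Now()
	fmt.Println(r.observe(k, t0))                     // new Event
	fmt.Println(r.observe(k, t0.Add(time.Minute)))    // counted locally only
	fmt.Println(r.observe(k, t0.Add(10*time.Minute))) // window expired: new Event
}
```

The proposed change would replace the fixed 30-minute heartbeat with a backoff schedule for flushing the series counter, so users see repeat counts sooner without extra apiserver load.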

akhilerm pushed a commit to akhilerm/apimachinery that referenced this issue Sep 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@dgrisonnet
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 30, 2022
@sftim
Contributor

sftim commented Dec 22, 2022

The KEP is stable and the new Events API has also graduated to stable, but not all of the Kubernetes components have been migrated from the core/v1 API to the events/v1 one, which is the reason why we are keeping this issue alive.

Do we also want to get kubectl to prefer the new API?

@dgrisonnet
Member

Today both APIs are treated the same way by kubectl; kubectl get events already shows events from both APIs. Going forward I think we should stick to that behavior for compatibility reasons.

@wojtek-t
Member

Today both APIs are treated the same way by kubectl; kubectl get events already shows events from both APIs. Going forward I think we should stick to that behavior for compatibility reasons.

+1

Also catching up with the rest of this thread.

but Jordan mentioned that there might be something else. I still need to investigate that part, but as of today, I wouldn't be able to safely say that the new API has a mechanism to protect against a surge of unique events, which is something we definitely want to make sure of before migrating all the components.

Interesting. The only other thing I know is the event aggregator:
https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/events_cache.go#L165

But if we suspect there are some other mechanisms, we should double-check that. It could only live in one of two places:

  • events library (client-go/tools/record)
  • in the apiserver itself (in events-specific code like registry, strategy etc.)

but for now my plan is to prepare a new KEP detailing the migration process and the different safeguards that are in place to prevent any issues post update.

@dgrisonnet - That sounds great. Are you planning to target this for 1.27?
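For reference, a sketch of where the legacy record library's protection knobs sit, via the CorrelatorOptions that (as far as I can tell from the v0.25 sources) configure both the spam filter and the aggregator mentioned above; the numbers mirror what I believe the defaults to be:

```go
package main

import "k8s.io/client-go/tools/record"

// newBroadcaster wires up the legacy (core/v1) event broadcaster with
// explicit correlator settings instead of the built-in defaults.
func newBroadcaster() record.EventBroadcaster {
	return record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{
		// Spam filter: a token bucket keyed by event source and object.
		BurstSize: 25,        // believed default: 25 events allowed in a burst
		QPS:       1.0 / 300, // believed default: refill one event per 5 minutes
		// Aggregator: after MaxEvents similar events within
		// MaxIntervalInSeconds, further ones are folded into one aggregate.
		MaxEvents:            10,
		MaxIntervalInSeconds: 600,
	})
}
```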

@dgrisonnet
Member

Interesting. The only other thing I know is the event aggregator:
https://github.com/kubernetes/client-go/blob/v0.25.0/tools/record/events_cache.go#L165

Yes, that's also the only mechanism I am aware of, and so far I haven't been able to find anything else in the client, but it would be worth investigating more on the server side.

@dgrisonnet - That sounds great. Are you planning to target this for 1.27?

I haven't started writing anything yet, but I am considering submitting a KEP for 1.27. Do you think you will have enough bandwidth to review it?

@wojtek-t
Member

I haven't started writing anything yet, but I am considering submitting a KEP for 1.27. Do you think you will have enough bandwidth to review it?

Yes - I would find time for reviewing the KEP (and probably some code too), but I won't have time to contribute anything myself.

@logicalhan
Member

/assign

@ehashman
Member

With #3728 covering the remaining event API migration work, I'd like to rescope this so we can close out this enhancement.

@dgrisonnet
Member

I opened #3760 to update the criteria.
