update-etcd-rewrite #202

Merged
calvix merged 4 commits into master from update-etcd-rewrite on Sep 26, 2017
Conversation

@calvix (Contributor) commented Sep 22, 2017

rewrite of #193

@r7vme (Contributor) commented Sep 22, 2017

k8s does compaction on its own (kubernetes/kubernetes#24079). Why do we need an additional one?

Copy-paste from the other ticket:
A few more things to think about: compaction happens every 5 min, and people say they have problems with the load peaks it causes [1]. So in the community they implemented retention modes [2].

1 - etcd-io/etcd#8098
2 - etcd-io/etcd#8123

WDYT @calvix @teemow? IMO we should be careful with such changes. We saw the problem only once, and on a very specific environment (one that survived multiple updates and had unoptimized cronjobs that produced thousands of resources). And finally, we are not sure that lack of compaction was the root cause.

In my opinion we should add monitoring for that first.
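
(For reference only, a minimal sketch of the flags being discussed. The retention flag exists on the etcd 3.1/3.2 series, the mode flag from [2] only lands in 3.3, and the apiserver flag name is quoted from memory, so treat all of them as assumptions to verify against the versions actually deployed:)

    # etcd: keep roughly one hour of key history and compact the rest automatically
    etcd --auto-compaction-retention=1

    # etcd >= 3.3 additionally lets you choose how the retention value is interpreted
    etcd --auto-compaction-retention=1 --auto-compaction-mode=periodic

    # kube-apiserver: the built-in compactor from kubernetes/kubernetes#24079,
    # whose interval (default 5m) is what causes the load peaks mentioned in [1]
    kube-apiserver --etcd-compaction-interval=5m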

@r7vme (Contributor) left a review comment

Feeling that this is not the right way to go. Please see details in the comment above.

@calvix (Contributor, Author) commented Sep 25, 2017

Compaction + defrag was the thing I needed to do in order to restore lycan, so from my point of view the lack of it is what left the cluster broken.

I can easily see a customer doing the same thing on a k8s cluster as we did on lycan.
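
(For reference, a minimal sketch of that manual recovery, assuming the etcd v3 API and a single endpoint on 127.0.0.1:2379; endpoints, TLS flags, and whether the quota alarm step is needed will differ per cluster:)

    # find the current revision from the endpoint status
    rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9]+')
    # compact away all key history older than that revision
    ETCDCTL_API=3 etcdctl compact "$rev"
    # defragment so the freed pages are actually released from the backend DB
    ETCDCTL_API=3 etcdctl defrag
    # if the cluster had hit its space quota, clear the NOSPACE alarm afterwards
    ETCDCTL_API=3 etcdctl alarm disarm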

@calvix (Contributor, Author) commented Sep 25, 2017

So this happened again, on a Vodafone guest cluster, so it's not something random or specific to lycan.

@calvix requested a review from teemow, September 25, 2017 15:40
@r7vme (Contributor) commented Sep 25, 2017

Okay, but I'm still not 100% confident :D

etcd-io/etcd#8009

TL;DR: even though they have autocompaction enabled, they still hit this issue. And it's related to a BoltDB bug; the fix will only land in 3.3.

For me it's still not clear what the root cause was in both cases.

@r7vme (Contributor) left a review comment

I'm OK with it if the quorum decides to merge this. But my opinion is still that autocompaction is not the proper fix.

@corest (Contributor) commented Sep 26, 2017

IMHO, a single appearance of that issue looks like a corner case, which does not give us enough information to choose a particular resolution strategy.

@teemow (Member) commented Sep 26, 2017

It happened on Lycan and on Viking (in a guest cluster). I think this can always happen, and we need to monitor it at the very least. For now, starting with the host clusters.
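
(As an illustration of what that monitoring could be based on, a hedged sketch; the metric name and endpoint below are assumptions taken from etcd's Prometheus metrics, where the 3.2 series exposes the backend size under the etcd_debugging_* prefix, so verify the exact name on the running version:)

    # backend DB size as reported on the etcd metrics endpoint;
    # alert when it grows toward the backend quota (2 GB by default)
    curl -s http://127.0.0.1:2379/metrics | grep -E '^etcd_debugging_mvcc_db_total_size_in_bytes'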

@calvix (Contributor, Author) commented Sep 26, 2017

@calvix merged commit c2f46a4 into master on Sep 26, 2017
@calvix deleted the update-etcd-rewrite branch on September 26, 2017 11:43