core: close provider/provisioner connections when not used anymore #2406
Conversation
Currently Terraform is leaking goroutines, and with that, memory. Strictly speaking this may not be a real concern for Terraform, as it's mostly used as a short-running command-line executable. But a few of us out there are using Terraform in long-running processes, and then this starts to become a problem. Besides that, it's of course good programming practice to clean up resources when they're no longer needed, so even for the standard command-line use case this seems an improvement in resource management. Personally I see no downsides, as the primary connection to the plugin is kept alive (the plugin is not killed) and only unused connections that will never be used again are closed, freeing up any related goroutines and memory.
@mitchellh @phinze I suppose you guys would want to have a look at this one 😉 If one of you could have a look/review and share your thoughts that would be great! Thanks! |
Ping @mitchellh and @phinze for a review on this one. Doesn't seem like something that has a big impact, yet it improves the general code by properly releasing any unused resources... Would be really great if this could be added before releasing 0.6.0. Thx! |
func (m *muxBroker) NextId() uint32 {
-	return atomic.AddUint32(&m.nextId, 1)
+	return rand.Uint32()
}
This change introduces a (very low) possibility of the ID repeating, no? Which would be problematic for an identifier?
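For context, the chance of a repeat among n random 32-bit IDs follows the birthday bound, roughly n(n-1)/2 divided by 2^32. A quick back-of-the-envelope calculation (the channel counts are illustrative, not measured from Terraform):

```go
package main

import "fmt"

func main() {
	// Approximate birthday-bound probability of at least one
	// collision among n IDs drawn uniformly from 2^32 values.
	for _, n := range []float64{100, 500, 10000} {
		p := n * (n - 1) / 2 / (1 << 32)
		fmt.Printf("n=%5.0f  P(collision) ≈ %.6f\n", n, p)
	}
}
```

Even at hundreds of live channels the probability is tiny per run, but it is not zero, which is the crux of this thread.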
These are pseudo-random numbers and there are 4,294,967,295 numbers to hand out. That means it would be extremely unlikely to get the same number within just a few draws of each other.
And an ID may very well be reused as soon as the previous command that was using it is closed by the new close logic. Having just a counter going up would mean there is a hard limit on the number of commands that can be executed. And even though 4,294,967,295 is quite a lot, if you use TF in an always-on setup it will hit that limit sooner or later.
But if we think this could be an issue, the only additional thing I can think of is to keep a map of generated numbers and a timestamp (something like a map[uint32]time.Time), and then whenever a new number is generated, first check whether it was already handed out. Additionally it should delete entries that are older than x time, where x would be something we agree on. To make it extremely safe I would say 60 minutes; there is no single apply/plan/refresh that takes 60 minutes, right? The cleanup could then be triggered only when NextId() is called, so it doesn't require a scheduled job of its own to periodically check and clean up.
Sounds like maybe a good add-on, let me know what you think and if you think it's needed...
Talked about this on IRC, but mirroring here - I think we're fine to punt on this for now.
Will have a look tomorrow at how much impact a solution as discussed would have. The more I think about it, the more I believe it wouldn't hurt and would make 99.99% a 100%, which is always better of course 😉
But if so, I'll open a new PR for that one...
I would feel better about reinstating the atomic increment. Packer/Terraform (both use this system, but not the same code) open a LOT of channels. Hundreds. While the chance is low, if the IDs did conflict it would be disastrous. Likely an immediate panic at some point.
The atomic increment isn't a hard limit; we just expect it to wrap around at some point (it's an unsigned 32-bit integer, so it overflows back to zero).
That's a good point and worth potentially opening in Yamux. But I think on our side we use it for other reasons.
I'll open an issue there to discuss this, but I can imagine use cases where this check would actually make sense. For Terraform it doesn't, as the channels aren't usually used for more than a couple of minutes...
So would something like this be a workaround until (if) the logic in yamux can be changed?
func (m *muxBroker) NextId() uint32 {
	// Reset the counter shortly before it reaches the maximum, so the
	// wrap-around is explicit instead of relying on uint32 overflow.
	if atomic.LoadUint32(&m.nextId) >= math.MaxUint32-10 {
		atomic.StoreUint32(&m.nextId, 0)
	}
	return atomic.AddUint32(&m.nextId, 1)
}
Maybe still a little dirty, but this would do the trick right?
Edit: With some comments explaining why that is done of course.
+1 This seems like a good compromise for now.
Sorry for not picking this up today. Had a horrible (non-productive) day with a broken MacBook.
Will do some tests and make a PR for this tomorrow so we can get it in (if all is well of course) before shipping 0.6.0...
Thanks, yeah, that would be ideal!
Thanks for taking this on @svanharmelen! Overall this LGTM - two inline questions / nits. |
Removed the `DotOrigin()` func after review of the PR. With this change the PR is up-to-spec to be merged (as just discussed with Paul in IRC).
…ners core: close provider/provisioner connections when not used anymore
core: reverting a few lines from PR #2406
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |