
(TK-202) Add support for restart and HUP handling #198

Merged
merged 19 commits into from
Jan 15, 2016

Conversation

haus
Contributor

@haus haus commented Dec 15, 2015

This PR adds a restart function for TK apps that allows trapperkeeper to cleanly handle SIGHUP by calling restart on all of the registered apps.
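The mechanism described above can be sketched roughly as follows, using the beckon signal library (which this PR's own test exercises via `beckon/raise!`). The `tk-apps` atom and the `restart` protocol function mirror names that appear later in the diff; the exact namespaces and wiring here are illustrative assumptions, not the PR's verbatim code.

```clojure
(ns example.sighup
  (:require [beckon]                                    ; JVM signal handling
            [puppetlabs.trapperkeeper.app :as app]))    ; TrapperkeeperApp protocol

(def tk-apps
  "Atom holding a vector of all TK apps running in this process."
  (atom []))

(defn register-sighup-handler
  "Replace the default SIGHUP behavior (process exit) with a handler
  that cleanly restarts every registered app. beckon models a signal's
  handlers as an atom containing a set of zero-arg fns."
  []
  (reset! (beckon/signal-atom "HUP")
          #{(fn []
              (doseq [a @tk-apps]
                (app/restart a)))}))
```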

@igalic

igalic commented Dec 15, 2015

travis-ci/travis-ci#5227

@cprice404

@igalic thanks for the heads up. That's disconcerting :/

(init [this context] (lc-fn context :init-service3))
(start [this context] (lc-fn context :start-service3))
(stop [this context] (lc-fn context :stop-service3))
(service3-fn [this] (lc-fn nil :service3-fn)))]))


I might suggest moving these down so that they're just above the tests that use them.

@cprice404

+1 other than the one minor nitpick.

Probably also worth doing a lein install and testing this with Puppet Server to make sure we didn't accidentally break any backward compat.

@cprice404

@pcarlisle do you want to review this? @haus added a test that actually exercises the signal.

@pcarlisle
Contributor

cool, looks fine to me

(internal/register-sighup-handler [app])
(beckon/raise! "HUP")
(let [start (System/currentTimeMillis)]
(while (or (not= (count @call-seq) 15)
Contributor Author


This loop logic is inverted. I'll get a fix out this afternoon.

@haus
Contributor Author

haus commented Dec 17, 2015

supposedly travis is no longer overflowing its heap, so i kicked the tests. 🎱

this))))
this)
(a/restart [this]
(run-lifecycle-fns app-context s/stop "stop" (reverse ordered-services))
Contributor


What should happen if one of these fail - e.g., if an unhandled exception is thrown up the stack from one of a service's lifecycle functions? Should the process that the apps are running in be shut down?

Contributor Author


It makes sense to me to shut the app process down in that case, but maybe @cprice404 has thoughts here.

Contributor


I think in shutdown! we'll catch an Exception thrown from an individual service's stop function, log it, and continue calling the other services so they still have a chance to stop? Should we do that here too? Maybe we should share some of this logic between the restart and shutdown!?

Contributor Author


I've just pushed out a new commit that wraps the restart in a try-catch to achieve this effect.

@camlow325
Contributor

Not that I think it's a big deal or even desirable to change, but just to confirm: it's a true statement that we won't reparse the bootstrap.cfg after a SIGHUP - so we can't end up starting different services than were present at initial start - right?

(is (= [:init-service1 :init-service2 :init-service3
        :start-service1 :start-service2 :start-service3
        :stop-service3 :stop-service2 :stop-service1]
Contributor


Nice.

@camlow325
Contributor

In earlier descriptions about this feature, I had the impression that it wasn't going to be possible for a service to store state in its context and then be able to get that state back after a restart. But it looks like it does actually preserve the service specific context across the restart. Is that intentional?

I suppose that might be nice for some services that want to conditionally skip re-initializing something if they can preserve it across a restart, but it's probably worth documenting that as such so that services are prepared for it. For example, it might be bad for a service to end up with a "dirty" context that is a combination of new stuff the service wanted to add to the context after the restart and old unintended stuff that was implicitly preserved on the context from the previous service startup.

@cprice404

@camlow325 good catch, no, that was not intentional, we should most definitely be clearing state.

@camlow325
Contributor

Overall, looks really good and seemed to do what I expected when testing it.

I left some questions about expected error handling - like should the process get shut down if the restart fails - and whether or not data is supposed to be preserved in the service context across a restart.

@cprice404 and I had also talked about the concern I had about the restart being executed from the signal handler's thread while concurrent activity might be happening on the Trapperkeeper main thread - and specifically how it might be better to make assurances that lifecycle functions only ever be executed from the same thread to avoid potential issues. That led to https://tickets.puppetlabs.com/browse/TK-300 being filed. I'm good with letting that question be sorted out there.

@haus
Contributor Author

haus commented Dec 21, 2015

@camlow325 I pushed some new commits to ensure context is being cleared on a restart.

(bootstrap-services-with-empty-config services))]
; We can't use the with-app-with-empty-config macro because we don't want to use its implicit tk-app/stop call
; We're asserting that the stop will happen because of the exception. So instead, we use the tk/run-app here to
; block on the app until the restart is called and explodes in an exception.
Contributor


These lines are really long - we typically break them at ~80 characters (in Clojure).

Contributor Author


cool. will fix up. fixed up.

@KevinCorcoran
Contributor

I took a quick peek. 🙈

@haus
Contributor Author

haus commented Jan 4, 2016

okay travis. let's try this again.

@haus
Contributor Author

haus commented Jan 4, 2016

check out that green from travis

@cprice404

We may need to be careful about the timing of merging this since it will have pretty major implications on users of TK. We should perhaps consider reviewing and merging #199 first, doing a release that contains that, and then merging this afterward. I will review #199 ASAP.

@haus
Contributor Author

haus commented Jan 4, 2016

@cprice404 sounds reasonable to me.

(stop [this context] (throw (IllegalStateException. "Exploding Service")))))]
(try
(ks-testutils/with-no-jvm-shutdown-hooks
(let [app (internal/throw-app-error-if-exists!
Contributor


I think you could combine this let block with the next one.

Contributor Author


Yup. seems to work. pushed out a commit with that change.

@camlow325
Contributor

👍 with just the remaining questions about stop/shutdown processing in the event of a restart failure.

This commit adds a `restart` function to the `tk-app` protocol.
The implementation simply stops all of the services, then calls
their `init` and `start` functions to bring them back up.

We intend to add another commit that adds support for handling
SIGHUP signals, by calling this function for all running
TK apps.
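The stop-then-reinit sequence this commit describes can be sketched as below, echoing the `run-lifecycle-fns` call visible in the diff (`(run-lifecycle-fns app-context s/stop "stop" (reverse ordered-services))`). The helper's signature and the `s/` lifecycle vars are inferred from that one diff line; treat this as an assumption-laden sketch, not the PR's exact implementation.

```clojure
;; Assumed helper: invokes the given lifecycle fn (s/init, s/start, s/stop)
;; on each service, threading each service's context through app-context.
(defn restart*
  "Stop services in reverse dependency order, then bring them all
  back up by running init and start in forward order."
  [app-context ordered-services]
  (run-lifecycle-fns app-context s/stop  "stop"  (reverse ordered-services))
  (run-lifecycle-fns app-context s/init  "init"  ordered-services)
  (run-lifecycle-fns app-context s/start "start" ordered-services))
```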
Prior to this commit, we had a single global var that we'd
populate with a reference to the last TK app that was started.
This was intended to make it possible to connect to a running
process via nrepl, and get a reference to the app.

This would cause issues if there were more than one TK app
running in the same process; this commit changes the variable
to be an atom that maintains a vector of all running TK apps.
This commit applies the indentation recommended by the pl clojure style
guide to the services tests.
The conditional in the while loop was previously inverted and only
worked because the count did eventually become 15. If it hadn't, it
would have looped forever. This commit updates the conditional to break
the loop if the count is correct or if 1000 milliseconds have elapsed.
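The corrected condition described in this commit message would look something like the following (the 15-call count and 1000 ms budget come from the commit message; the sleep interval is an illustrative choice):

```clojure
;; call-seq is the test's atom recording lifecycle invocations.
(def call-seq (atom []))

;; Keep polling only while the expected count has NOT been reached
;; AND fewer than 1000 ms have elapsed. The pre-fix version had this
;; inverted, so it only terminated because the count happened to
;; reach 15 eventually.
(let [start (System/currentTimeMillis)]
  (while (and (not= (count @call-seq) 15)
              (< (- (System/currentTimeMillis) start) 1000))
    (Thread/sleep 10)))
```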
Previously during a restart the contexts for various services would be
left intact, which is undesirable. This commit updates the restart
function to re-initialize all of the service contexts after they have
stopped.
This commit adds a simple test to validate that the context is
re-initialized after a restart.
If any step in the restart fails, it wouldn't be a good idea to attempt to continue the restart process. This commit wraps the restart in a try/catch that will cause a shutdown if an exception is thrown at any point in the restart.
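The failure handling this commit describes, reduced to its shape: if anything thrown during restart escapes, initiate shutdown rather than leaving the app half-restarted. The function and parameter names here are hypothetical; the PR's actual code wires this into its internal shutdown machinery.

```clojure
(defn restart-or-shutdown!
  "Invoke restart-fn; if it throws, report the error and invoke
  shutdown-fn so the process does not keep running half-restarted."
  [restart-fn shutdown-fn]
  (try
    (restart-fn)
    (catch Exception e
      (println "Error during restart; shutting down:" (.getMessage e))
      (shutdown-fn))))
```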
(stop [this] "Stop the services")
(restart [this] "Stop and restart the services"))

This should probably return `this` for consistency with start, stop and init.

Contributor Author


thanks. updated to do so.

@haus
Contributor Author

haus commented Jan 12, 2016

I'm seeing a test failure related to the optional deps. more to come.

@haus
Contributor Author

haus commented Jan 12, 2016

Ah, yes rebasing the optional-deps work in has broken those tests as they use build-app.

@haus
Contributor Author

haus commented Jan 12, 2016

I'm going to see about updating build-app to take both a function and maintain previous behavior. Ideally, my last commit can go away then and existing repls will continue to work.

@haus
Contributor Author

haus commented Jan 12, 2016

pushed out a commit that updates build-app to support a map or function for config-data. i've verified that the puppet-server repl works as is with these changes.

To be consistent with the other app functions, this commit updates the
restart to return `this`.
A previous iteration of test used with-test-logging-debug, but that is
no longer the case so this commit removes the refer.
:post [(satisfies? app/TrapperkeeperApp %)]}
(config/initialize-logging! config-data)
(internal/build-app* services config-data (promise)))
(let [config-data-fn #(if (map? config-data) (constantly config-data) config-data)]
Contributor


Do we need to add the map-or-fn support to the other similar functions below?

Contributor


Also, do we need the anonymous function that wraps the if map? or can we just get rid of that like:

(let [config-data-fn (if (map? config-data) (constantly config-data) config-data)] ...)

An earlier commit had changed the signature of build-app in a backward
incompatible manner. This commit updates build-app to handle the
previous behavior, where config-data was a map, as well as the new
behavior, where config-data is a function.
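The backward-compat shim described here is a one-liner: coerce a map to a zero-argument function with `constantly`, and pass a function through unchanged. Extracted into a standalone helper for illustration (the PR inlines it in `build-app`):

```clojure
(defn ->config-data-fn
  "Accept either a config map (old build-app behavior) or a zero-arg
  function returning one (new behavior); always return the fn form."
  [config-data]
  (if (map? config-data)
    (constantly config-data)
    config-data))

;; ((->config-data-fn {:webserver {:port 8080}}))
;; and
;; ((->config-data-fn (fn [] {:webserver {:port 8080}})))
;; both yield {:webserver {:port 8080}}
```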
;; This adds the TrapperkeeperApp instance to the tk-apps list, so that
;; it can be referenced in a remote nREPL session, etc.
(swap! internal/tk-apps conj app)
(internal/register-sighup-handler)


One thing I notice here:

We don't do the initial registration of the sighup handler until after the app is booted. I think this lessens the concerns that people had about a HUP causing indeterminate behavior if the user sent it before the app had finished booting up. The behavior won't be indeterminate -- it will just shut down the process.

We might want to revisit that in the future, and change this up so that the registration of the handler happens earlier and it's smart enough not to initiate the restart while the initial startup is in progress, but I think we can defer that for now. The fact that the behavior is not indeterminate seems to alleviate the most pressing concerns.

@cprice404

I left a few minor nits, and some suggestions for making one of the tests just a bit more complete. Other than that I'm +1 and will merge this today unless anyone objects.

(There is still some additional work to do before we'd roll out a release, I'm just talking about merging this PR... main implication there is that we can't do any other new TK releases that don't include this, after this is merged, unless we create a new branch... which would be nice to avoid.)

@KevinCorcoran @camlow325 @pcarlisle @senior @adreyer if you have any concerns about me merging this today once @haus has had a chance to review the last round of feedback, let me know, but I feel like I've already pretty much checked with most of you and it seems like everyone's good with it.

This commit corrects a typo, collects parens, and moves a service down a
line to respond to feedback in the PR.
This commit updates the test-context-cleared-on-restart test to include
some additional state and assertions as suggested by cprice. This
includes validating that the context is cleared after a restart and that
init is called twice during the test lifetime.
@cprice404

+1

cprice404 added a commit that referenced this pull request Jan 15, 2016
(TK-202) Add support for restart and HUP handling
@cprice404 cprice404 merged commit 0dc1c23 into puppetlabs:master Jan 15, 2016