Skip to content

Commit

Permalink
drain manager: Make probabilistic drain configurable (envoyproxy#11403)
Browse files Browse the repository at this point in the history
Add DrainStrategy enum to Options with Graceful and Immediate
Disable probabilistic drain in DrainManager if DrainStrategy == Immediate

Add integration tests
Risk Level: Low.
Testing: Integration tests, verify that the race condition from envoyproxy#11240 does not occur if the probabilistic drain is disabled.

Signed-off-by: Auni Ahsan <[email protected]>
  • Loading branch information
auni53 authored and songhu committed Jun 25, 2020
1 parent 74300a3 commit 67ff3bd
Show file tree
Hide file tree
Showing 18 changed files with 261 additions and 75 deletions.
13 changes: 12 additions & 1 deletion api/envoy/admin/v3/server_info.proto
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ message ServerInfo {
CommandLineOptions command_line_options = 6;
}

// [#next-free-field: 33]
// [#next-free-field: 34]
message CommandLineOptions {
option (udpa.annotations.versioning).previous_message_type =
"envoy.admin.v2alpha.CommandLineOptions";
Expand All @@ -75,6 +75,14 @@ message CommandLineOptions {
InitOnly = 2;
}

enum DrainStrategy {
// Gradually discourage connections over the course of the drain period.
Gradual = 0;

// Discourage all connections for the duration of the drain sequence.
Immediate = 1;
}

reserved 12, 20, 21;

reserved "max_stats", "max_obj_name_len";
Expand Down Expand Up @@ -142,6 +150,9 @@ message CommandLineOptions {
// See :option:`--drain-time-s` for details.
google.protobuf.Duration drain_time = 17;

// See :option:`--drain-strategy` for details.
DrainStrategy drain_strategy = 33;

// See :option:`--parent-shutdown-time-s` for details.
google.protobuf.Duration parent_shutdown_time = 18;

Expand Down
13 changes: 12 additions & 1 deletion api/envoy/admin/v4alpha/server_info.proto
Original file line number Diff line number Diff line change
Expand Up @@ -54,7 +54,7 @@ message ServerInfo {
CommandLineOptions command_line_options = 6;
}

// [#next-free-field: 33]
// [#next-free-field: 34]
message CommandLineOptions {
option (udpa.annotations.versioning).previous_message_type = "envoy.admin.v3.CommandLineOptions";

Expand All @@ -74,6 +74,14 @@ message CommandLineOptions {
InitOnly = 2;
}

enum DrainStrategy {
// Gradually discourage connections over the course of the drain period.
Gradual = 0;

// Discourage all connections for the duration of the drain sequence.
Immediate = 1;
}

reserved 12, 20, 21;

reserved "max_stats", "max_obj_name_len";
Expand Down Expand Up @@ -141,6 +149,9 @@ message CommandLineOptions {
// See :option:`--drain-time-s` for details.
google.protobuf.Duration drain_time = 17;

// See :option:`--drain-strategy` for details.
DrainStrategy drain_strategy = 33;

// See :option:`--parent-shutdown-time-s` for details.
google.protobuf.Duration parent_shutdown_time = 18;

Expand Down
8 changes: 8 additions & 0 deletions docs/root/operations/cli.rst
Original file line number Diff line number Diff line change
Expand Up @@ -262,6 +262,14 @@ following are the command line options that Envoy supports.
desirable to have a very long drain time. In service to service scenarios, it might be possible
to make the drain and shutdown time much shorter (e.g., 60s/90s).

.. option:: --drain-strategy <string>

*(optional)* Determine behaviour of Envoy during the hot restart drain sequence. During the drain sequence, the drain manager encourages draining through terminating connections on request completion, sending "Connection: CLOSE" on HTTP1, and sending GOAWAY on HTTP2.

* ``gradual``: *(default)* The percentage of requests encouraged to drain increases to 100% as the drain time elapses.

* ``immediate``: All requests are encouraged to drain as soon as the drain sequence begins.

.. option:: --parent-shutdown-time-s <integer>

*(optional)* The time in seconds that Envoy will wait before shutting down the parent process
Expand Down
1 change: 1 addition & 0 deletions docs/root/version_history/current.rst
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,7 @@ New Features
* router: more fine grained internal redirect configs are added to the :ref:`internal_redirect_policy
<envoy_v3_api_field_config.route.v3.RouteAction.internal_redirect_policy>` field.
* runtime: add new gauge :ref:`deprecated_feature_seen_since_process_start <runtime_stats>` that gets reset across hot restarts.
* server: add the option :option:`--drain-strategy` to enable different drain strategies for DrainManager::drainClose().
* stats: added the option to :ref:`report counters as deltas <envoy_v3_api_field_config.metrics.v3.MetricsServiceConfig.report_counters_as_deltas>` to the metrics service stats sink.
* tracing: tracing configuration has been made fully dynamic and every HTTP connection manager
can now have a separate :ref:`tracing provider <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.Tracing.provider>`.
Expand Down
13 changes: 12 additions & 1 deletion generated_api_shadow/envoy/admin/v3/server_info.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

13 changes: 12 additions & 1 deletion generated_api_shadow/envoy/admin/v4alpha/server_info.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

37 changes: 30 additions & 7 deletions include/envoy/server/options.h
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,24 @@ enum class Mode {
// to be validated in a non-prod environment.
};

/**
* During the drain sequence, different components ask the DrainManager
* whether to drain via drainClose(). This enum dictates the behaviour of
* drainClose() calls.
*/
enum class DrainStrategy {
/**
* The probability of drainClose() returning true increases from 0 to 100%
* over the duration of the drain period.
*/
Gradual,

/**
* drainClose() will return true as soon as the drain sequence is initiated.
*/
Immediate,
};

using CommandLineOptionsPtr = std::unique_ptr<envoy::admin::v3::CommandLineOptions>;

/**
Expand Down Expand Up @@ -76,10 +94,21 @@ class Options {
virtual uint32_t concurrency() const PURE;

/**
* @return the number of seconds that envoy will perform draining during a hot restart.
* @return the duration of the drain period in seconds.
*/
virtual std::chrono::seconds drainTime() const PURE;

/**
* @return the strategy that defines behaviour of DrainManager::drainClose();
*/
virtual DrainStrategy drainStrategy() const PURE;

/**
* @return the delay before shutting down the parent envoy in a hot restart,
* generally longer than drainTime().
*/
virtual std::chrono::seconds parentShutdownTime() const PURE;

/**
* @return const std::string& the path to the configuration file.
*/
Expand Down Expand Up @@ -154,12 +183,6 @@ class Options {
*/
virtual const std::string& logPath() const PURE;

/**
* @return the number of seconds that envoy will wait before shutting down the parent envoy during
* a host restart. Generally this will be longer than the drainTime() option.
*/
virtual std::chrono::seconds parentShutdownTime() const PURE;

/**
* @return the restart epoch. 0 indicates the first server start, 1 the second, and so on.
*/
Expand Down
8 changes: 7 additions & 1 deletion source/server/drain_manager_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -32,12 +32,18 @@ bool DrainManagerImpl::drainClose() const {
return false;
}

if (server_.options().drainStrategy() == Server::DrainStrategy::Immediate) {
return true;
}
ASSERT(server_.options().drainStrategy() == Server::DrainStrategy::Gradual);

// P(return true) = elapsed time / drain timeout
// If the drain deadline is exceeded, skip the probability calculation.
const MonotonicTime current_time = server_.dispatcher().timeSource().monotonicTime();
if (current_time >= drain_deadline_) {
return true;
}

// P(return true) = elapsed time / drain timeout
const auto remaining_time =
std::chrono::duration_cast<std::chrono::seconds>(drain_deadline_ - current_time);
ASSERT(server_.options().drainTime() >= remaining_time);
Expand Down
28 changes: 23 additions & 5 deletions source/server/options_impl.cc
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,10 @@ OptionsImpl::OptionsImpl(std::vector<std::string> args,
TCLAP::ValueArg<uint32_t> drain_time_s("", "drain-time-s",
"Hot restart and LDS removal drain time in seconds", false,
600, "uint32_t", cmd);
TCLAP::ValueArg<std::string> drain_strategy(
"", "drain-strategy",
"Hot restart drain sequence behaviour, one of 'gradual' (default) or 'immediate'.", false,
"gradual", "string", cmd);
TCLAP::ValueArg<uint32_t> parent_shutdown_time_s("", "parent-shutdown-time-s",
"Hot restart parent shutdown time in seconds",
false, 900, "uint32_t", cmd);
Expand Down Expand Up @@ -262,6 +266,15 @@ OptionsImpl::OptionsImpl(std::vector<std::string> args,
drain_time_ = std::chrono::seconds(drain_time_s.getValue());
parent_shutdown_time_ = std::chrono::seconds(parent_shutdown_time_s.getValue());

if (drain_strategy.getValue() == "immediate") {
drain_strategy_ = Server::DrainStrategy::Immediate;
} else if (drain_strategy.getValue() == "gradual") {
drain_strategy_ = Server::DrainStrategy::Gradual;
} else {
throw MalformedArgvException(
fmt::format("error: unknown drain-strategy '{}'", mode.getValue()));
}

if (hot_restart_version_option.getValue()) {
std::cerr << hot_restart_version_cb(!hot_restart_disabled_);
throw NoServingException();
Expand Down Expand Up @@ -365,10 +378,15 @@ Server::CommandLineOptionsPtr OptionsImpl::toCommandLineOptions() const {
}
command_line_options->mutable_file_flush_interval()->MergeFrom(
Protobuf::util::TimeUtil::MillisecondsToDuration(fileFlushIntervalMsec().count()));
command_line_options->mutable_parent_shutdown_time()->MergeFrom(
Protobuf::util::TimeUtil::SecondsToDuration(parentShutdownTime().count()));

command_line_options->mutable_drain_time()->MergeFrom(
Protobuf::util::TimeUtil::SecondsToDuration(drainTime().count()));
command_line_options->set_drain_strategy(drainStrategy() == Server::DrainStrategy::Immediate
? envoy::admin::v3::CommandLineOptions::Immediate
: envoy::admin::v3::CommandLineOptions::Gradual);
command_line_options->mutable_parent_shutdown_time()->MergeFrom(
Protobuf::util::TimeUtil::SecondsToDuration(parentShutdownTime().count()));

command_line_options->set_disable_hot_restart(hotRestartDisabled());
command_line_options->set_enable_mutex_tracing(mutexTracingEnabled());
command_line_options->set_cpuset_threads(cpusetThreadsEnabled());
Expand All @@ -387,9 +405,9 @@ OptionsImpl::OptionsImpl(const std::string& service_cluster, const std::string&
log_format_(Logger::Logger::DEFAULT_LOG_FORMAT), log_format_escaped_(false),
restart_epoch_(0u), service_cluster_(service_cluster), service_node_(service_node),
service_zone_(service_zone), file_flush_interval_msec_(10000), drain_time_(600),
parent_shutdown_time_(900), mode_(Server::Mode::Serve), hot_restart_disabled_(false),
signal_handling_enabled_(true), mutex_tracing_enabled_(false), cpuset_threads_(false),
fake_symbol_table_enabled_(false) {}
parent_shutdown_time_(900), drain_strategy_(Server::DrainStrategy::Gradual),
mode_(Server::Mode::Serve), hot_restart_disabled_(false), signal_handling_enabled_(true),
mutex_tracing_enabled_(false), cpuset_threads_(false), fake_symbol_table_enabled_(false) {}

void OptionsImpl::disableExtensions(const std::vector<std::string>& names) {
for (const auto& name : names) {
Expand Down
12 changes: 8 additions & 4 deletions source/server/options_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -65,12 +65,13 @@ class OptionsImpl : public Server::Options, protected Logger::Loggable<Logger::I
local_address_ip_version_ = local_address_ip_version;
}
void setDrainTime(std::chrono::seconds drain_time) { drain_time_ = drain_time; }
void setLogLevel(spdlog::level::level_enum log_level) { log_level_ = log_level; }
void setLogFormat(const std::string& log_format) { log_format_ = log_format; }
void setLogPath(const std::string& log_path) { log_path_ = log_path; }
void setParentShutdownTime(std::chrono::seconds parent_shutdown_time) {
parent_shutdown_time_ = parent_shutdown_time;
}
void setDrainStrategy(Server::DrainStrategy drain_strategy) { drain_strategy_ = drain_strategy; }
void setLogLevel(spdlog::level::level_enum log_level) { log_level_ = log_level; }
void setLogFormat(const std::string& log_format) { log_format_ = log_format; }
void setLogPath(const std::string& log_path) { log_path_ = log_path; }
void setRestartEpoch(uint64_t restart_epoch) { restart_epoch_ = restart_epoch; }
void setMode(Server::Mode mode) { mode_ = mode; }
void setFileFlushIntervalMsec(std::chrono::milliseconds file_flush_interval_msec) {
Expand Down Expand Up @@ -121,6 +122,9 @@ class OptionsImpl : public Server::Options, protected Logger::Loggable<Logger::I
return local_address_ip_version_;
}
std::chrono::seconds drainTime() const override { return drain_time_; }
std::chrono::seconds parentShutdownTime() const override { return parent_shutdown_time_; }
Server::DrainStrategy drainStrategy() const override { return drain_strategy_; }

spdlog::level::level_enum logLevel() const override { return log_level_; }
const std::vector<std::pair<std::string, spdlog::level::level_enum>>&
componentLogLevels() const override {
Expand All @@ -129,7 +133,6 @@ class OptionsImpl : public Server::Options, protected Logger::Loggable<Logger::I
const std::string& logFormat() const override { return log_format_; }
bool logFormatEscaped() const override { return log_format_escaped_; }
const std::string& logPath() const override { return log_path_; }
std::chrono::seconds parentShutdownTime() const override { return parent_shutdown_time_; }
uint64_t restartEpoch() const override { return restart_epoch_; }
Server::Mode mode() const override { return mode_; }
std::chrono::milliseconds fileFlushIntervalMsec() const override {
Expand Down Expand Up @@ -188,6 +191,7 @@ class OptionsImpl : public Server::Options, protected Logger::Loggable<Logger::I
std::chrono::milliseconds file_flush_interval_msec_;
std::chrono::seconds drain_time_;
std::chrono::seconds parent_shutdown_time_;
Server::DrainStrategy drain_strategy_;
Server::Mode mode_;
bool hot_restart_disabled_;
bool signal_handling_enabled_;
Expand Down
2 changes: 1 addition & 1 deletion test/integration/integration.cc
Original file line number Diff line number Diff line change
Expand Up @@ -457,7 +457,7 @@ void BaseIntegrationTest::createGeneratedApiTestServer(
test_server_ = IntegrationTestServer::create(
bootstrap_path, version_, on_server_ready_function_, on_server_init_function_, deterministic_,
timeSystem(), *api_, defer_listener_finalization_, process_object_, validator_config,
concurrency_, drain_time_, use_real_stats_);
concurrency_, drain_time_, drain_strategy_, use_real_stats_);
if (config_helper_.bootstrap().static_resources().listeners_size() > 0 &&
!defer_listener_finalization_) {

Expand Down
4 changes: 4 additions & 0 deletions test/integration/integration.h
Original file line number Diff line number Diff line change
Expand Up @@ -436,6 +436,10 @@ class BaseIntegrationTest : protected Logger::Loggable<Logger::Id::testing> {
// The duration of the drain manager graceful drain period.
std::chrono::seconds drain_time_{1};

// The DrainStrategy that dictates the behaviour of
// DrainManagerImpl::drainClose().
Server::DrainStrategy drain_strategy_{Server::DrainStrategy::Gradual};

// Member variables for xDS testing.
FakeUpstream* xds_upstream_{};
FakeHttpConnectionPtr xds_connection_;
Expand Down
Loading

0 comments on commit 67ff3bd

Please sign in to comment.