api: GoogleRE2 max_program_size should be checked by server, not client. #10971

markdroth · 2020-04-27T20:55:40Z

Commit Message: api: GoogleRE2 max_program_size should be checked by server, not client.
Additional Description: Deprecates GoogleRE2.max_program_size field so that client does not need to check this.
Risk Level: Low
Testing: N/A
Docs Changes: Included in PR
Release Notes: Included in PR

This PR is sent as a proposal to see what people think. It looks like this field was originally added by @mattklein123 at the request of @jmarantz (see #7878 (comment)). But now that we're trying to add support for regex matchers to gRPC, it occurs to us that this mechanism is very strange, because the management server is telling the client both the regex and how to validate it, rather than just doing the validation itself. (This is the moral equivalent of saying "int_value=5; please reject this config if int_value>10".) It seems a little sub-optimal to push this down to the client; ideally, it should be done when the user creates the config to begin with.

One related problem here, which I've discussed with @htuch, is the fact that the RE2 library supports checking the max program size, but it's available in C++ only. The fact that many (most?) control planes are written in languages other than C++ may have contributed to the way this was done, the logic being that Envoy could always check this, whereas the control plane might not be able to easily do so. However, with the introduction of xDS clients in languages other than C++ (specifically, grpc-java and grpc-go), it is no longer the case that we can assume that this can be done on the client.

@dfawley tells me that there is actually a way to do this same check with the go regexp library, and @ejona86 is going to look into adding support for this in the java re2j library. But even with those building blocks in place, we think it makes more sense for this check to be done on the management server than on the client, because if the check fails, the config should be rejected at that point instead of making it all the way out to the clients.

Signed-off-by: Mark D. Roth [email protected]

Signed-off-by: Mark D. Roth <[email protected]>

repokitteh-read-only · 2020-04-27T20:55:45Z

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10971 was opened by markdroth.

see: more, trace.

mattklein123 · 2020-04-27T21:12:19Z

I would be fine with this change as I think this config has caused more confusion and problems than it has solved, but I will defer to @jmarantz on this.

jmarantz

OK here's my take:

I understand the confusion here but I want to push back on the analogy of

"int_value=5; please reject this config if int_value>10".

I think it's not the same because program_size is not immediately obvious from looking at regexes, and in fact it can and does change as RE2 improves. So I'm biased to having the server protect itself from a config, which might be directly specified, and not absolutely always via a control-plane.

I didn't understand the relevance of the language the control-plane is written in: in general I think control-planes should be robust to having the Envoy reject the config, and passing clear actionable errors back to the control-plane's user.

But I think I'll also defer to @htuch on this, and his vision for the xDS APIs.

jmarantz · 2020-04-27T21:17:00Z

api/envoy/type/matcher/regex.proto

+    //
+    // This field is deprecated; regexp validation should be performed on the management server
+    // instead of being done by each individual client.
+    google.protobuf.UInt32Value max_program_size = 1 [deprecated = true];


WDYT of leaving this in, but providing a reasonable default max_program_size that is bounded, but can be overridden?

In what way is that different from the current field? If max_program_size is "not specified, the default is 100."

I think I'm getting at what @markdroth is trying to express: this is an API that is intended to be implementable by proxies other than Envoy.

I think maybe he objects to predefining what the default behavior is, and rely solely on the management plane to enforce limits, which concerns me because I am thinking from the perspective of managing a shared-tenancy proxy, and I want the default scenario to consume bounded CPU.

I think maybe we could change the comment from saying "if not specified, the default is 100." to "if not specified, the default limit is implementation-dependent".

Mark, would that be good enough?

I don't think changing the default here would address my concern. The real issue is that we need to be validating this on the management server, not on the client.

jmarantz · 2020-04-27T21:17:51Z

docs/root/version_history/current.rst

@@ -40,3 +40,6 @@ Deprecated
 * Tracing provider configuration as part of :ref:`bootstrap config <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.tracing>`
  has been deprecated in favor of configuration as part of :ref:`HTTP connection manager
  <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.Tracing.provider>`.
+* The * :ref:`GoogleRE2.max_program_size<envoy_api_field_RegexMatcher.GoogleRE2.max_program_size>`
+  field is now deprecated. Management servers are expected to validate regexp program sizes
+  instead of asking the client to do it.


here 'client' means the proxy, that is, it's a client of the control-plane?

What if the user is deploying Envoy without a control-plane? Maybe that never happens?

Here, "client" means the xDS client (i.e., Envoy or gRPC).

If they're deploying Envoy with a static config, they can statically validate the regexps.

ejona86 · 2020-04-27T21:23:07Z

docs/root/version_history/current.rst

@@ -40,3 +40,6 @@ Deprecated
 * Tracing provider configuration as part of :ref:`bootstrap config <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.tracing>`
  has been deprecated in favor of configuration as part of :ref:`HTTP connection manager
  <envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.Tracing.provider>`.
+* The * :ref:`GoogleRE2.max_program_size<envoy_api_field_RegexMatcher.GoogleRE2.max_program_size>`
+  field is now deprecated. Management servers are expected to validate regexp program sizes
+  instead of asking the client to do it.


s/asking/expecting/

ejona86 · 2020-04-27T21:25:45Z

For posterity, in Go you can use regexp/syntax to learn the program size:

regex, err := syntax.Parse("INPUT HERE", 0)
prog, err := syntax.Compile(regex)
len(prog.Inst) // this is the program size

markdroth · 2020-04-27T22:18:03Z

I think it's not the same because program_size is not immediately obvious from looking at regexes, and in fact it can and does change as RE2 improves.

I didn't realize that this would change over time as RE2 improves. That does provide a reason why this would need to be checked on the client rather than on the management server, because the management server does not know what version of RE2 the client is linked with.

That having been said, this still seems problematic, because in practice it's important to catch configuration errors in the management server, before they make their way to the client. If we're saying that there's no way this problem can be detected earlier, then we will wind up with clients rejecting their configs, which can cause an outage. I think we need to think about how this can actually be useful in practice. If a management server does not know what version of RE2 is linked into their clients, how can it catch bad configs before they get to the clients?

Note that while this issue was prompted by the need to support xDS clients other than Envoy, it really doesn't have anything directly to do with that. I think this problem would exist even if all the clients were Envoys, because each Envoy instance can be built with a different version of RE2.

jmarantz · 2020-04-27T22:22:25Z

Is it practical to have the control-plane guarantee never to send an invalid config to the proxy?

It seems there are in general semantics that could only be verified on the server, and there has to be a mechanism for the server to say "I reject this config, and the system will remain in the exact state it was before you sent it". @htuch is that right?

markdroth · 2020-04-27T22:28:11Z

Note: In the context of xDS, Envoy is the client, not the server. Calling it the server is extremely confusing. :)

I do certainly agree that there will sometimes be cases where only the client can determine if the config is invalid; after all, that's why xDS allows the client to NACK an update. However, I think we should be seeking to minimize these, because in general, the infrastructure will be easier to manage and safer when problems are caught on the control plane instead. What concerns me here is that this seems like something that cannot possibly be caught on the management server, and that makes me very nervous.

Is there some way that we catch this problem on the management server, before it gets to the client?

mattklein123 · 2020-04-27T22:29:00Z

As an aside, one thing I have considered doing with this field is using it to emit a warning/stat in the client vs. a failure. Would this be a better direction?

jmarantz · 2020-04-28T04:23:13Z

"server" vs "client"

sorry about that...definitely confusing for me to think fo Envoy as a client.

should be seeking to minimize these, because in general, the infrastructure will be easier to manage and safer when problems are caught on the control plane instead.

I have the opposite instinct. I think that problems with validation on the control-plane client are inevitable and we must do engineering to ensure the config-rejection flow is tested, stable, and user-actionable. That is most likely to be guaranteed if we have cases where it matters and we test for them.

If it makes you nervous, then I think we should make it happen enough so we become confident, because there will be some semantic corner-case some day, that will be accepted by the control-plane and rejected by Envoy (or whatever). So we need to get it right.

is there some way we could catch this problem on the management server?

Not in general, no, and as discussed, not for RE2 if the management server is compiled with a different version of RE2.

As an aside, one thing I have considered doing with this field is using it to emit a warning/stat in the client vs. a failure. Would this be a better direction?

I could live with that, if it makes life simpler.

jmarantz · 2020-04-28T12:39:30Z

/wait

htuch · 2020-04-28T15:54:22Z

Reading the above, I think it's best to do this on the control plane. Even if we have a range of clients, the control plane can be responsible for picking a conservative bound for RE2 size and validating against that. Simple control planes for trusted environments can skip validation. Those that want to enforce bounds for proxies with untrusted downstreams can implement the logic. I recommend deprecation.

markdroth · 2020-04-28T16:29:30Z

@jmarantz I do agree that we need better infrastructure in xDS for handling cases where clients reject their configs. But I think it will take us a long time to get there, because there are so many cases that the protocol itself currently doesn't support. For example, today, a NACK is just for one individual resource, and there is no way to indicate a problem with an entire tree of resources. If we get a CDS update pointing to a different EDS service name and that EDS service name does not exist, we will have already ACKed the CDS resource by the time we discover that the EDS resource does not exist, and the protocol offers no way to indicate that problem to the server. This is one of many things that I would like to see us address as part of UDPA.

That having been said, even if we had the ideal infrastructure for handling validation failures on the client, I continue to believe that our goal should be to avoid that happening in the first place, because by the time the validation failure happens on the client, it's too late to stop it from impacting user traffic. At best, a client that was already running when it got the bad config can ignore the update and keep running with its old config, but not all deployments have the right feedback infrastructure to detect when that's happening. And any client that starts up when the bad config is being served will not have any previous config to keep running with, so it will not be able to start up at all. This situation is rife for outages.

I like Harvey's suggestion of having the control plane validate this by assuming an older version of RE2.

Signed-off-by: Mark D. Roth <[email protected]>

…_size Signed-off-by: Mark D. Roth <[email protected]>

markdroth · 2020-04-28T22:45:01Z

@htuch When I ran fix_format, it generated a whole bunch of new v4alpha files, which I assume is expected. However, the docs test is now complaining about some of those files. Any idea what's wrong here or what I need to do to fix it? Thanks!

jmarantz · 2020-04-29T14:02:32Z

Catching up on this: given the above I agree it makes sense to deprecate the API field and rely on the control-plane to self-validate. Can't boil the ocean :)

jmarantz · 2020-04-29T14:04:09Z

I'll go ahead & switch the reviewer to Harvey since this is really just an API change.

Some weak regexes were causing v4alpha protos to end up in the v2 docs in envoyproxy#10971, this PR fixes. Signed-off-by: Harvey Tuch <[email protected]>

…y. (#11008) Some weak regexes were causing v4alpha protos to end up in the v2 docs in #10971, this PR fixes. Signed-off-by: Harvey Tuch <[email protected]>

htuch · 2020-04-30T12:24:31Z

@markdroth the docs issue is fixed by #1108, the CI API issue is as discussed.

…_size Signed-off-by: Mark D. Roth <[email protected]>

Signed-off-by: Mark D. Roth <[email protected]>

markdroth · 2020-04-30T16:36:29Z

@htuch Thanks for the quick fix! I've merged from master, re-ran fix_format, and applied the test patch. Let's see if everything's happy now.

moderation · 2020-04-30T20:34:42Z

This is already a big change but if you wanted to test / bump the re2 dependency you can use:

    com_googlesource_code_re2 = dict(
        sha256 = "98794bc5416326817498384a9c43cbb5a406bab8da9f84f83c39ecad43ed5cea",
        strip_prefix = "re2-2020-04-01",
        urls = ["https://github.com/google/re2/archive/2020-04-01.tar.gz"],
    ),

markdroth · 2020-04-30T20:41:27Z

@htuch I'm not sure why //test/extensions/common/redis:cluster_refresh_manager_test is failing in the CI. It passes when I run it locally. Is it possible there's some flakiness here?

…_size Signed-off-by: Mark D. Roth <[email protected]>

moderation · 2020-05-02T04:52:20Z

New re2 release 2020-05-01

    com_googlesource_code_re2 = dict(
        sha256 = "88864d7f5126bb17daa1aa8f41b05599aa6e3222e7b28a90e372db53c1c49aeb",
        strip_prefix = "re2-2020-05-01",
        urls = ["https://github.com/google/re2/archive/2020-05-01.tar.gz"],
        use_category = ["dataplane"],
    ),

htuch

LGTM, thanks!

markdroth added 2 commits April 27, 2020 13:35

api: GoogleRE2 max_program_size should be checked by server, not client.

8803614

Signed-off-by: Mark D. Roth <[email protected]>

add deprecation to release notes

5005314

Signed-off-by: Mark D. Roth <[email protected]>

repokitteh-read-only bot added the api label Apr 27, 2020

jmarantz reviewed Apr 27, 2020

View reviewed changes

ejona86 reviewed Apr 27, 2020

View reviewed changes

mattklein123 assigned jmarantz Apr 27, 2020

repokitteh-read-only bot added the waiting label Apr 28, 2020

fix wording

0be0ab1

Signed-off-by: Mark D. Roth <[email protected]>

repokitteh-read-only bot removed the waiting label Apr 28, 2020

markdroth added 5 commits April 28, 2020 12:43

fix docs

2e8e4c0

Signed-off-by: Mark D. Roth <[email protected]>

attempt to fix doc link

7479875

Signed-off-by: Mark D. Roth <[email protected]>

fix_format

38c74fc

Signed-off-by: Mark D. Roth <[email protected]>

try to make CI happy

b9162d8

Signed-off-by: Mark D. Roth <[email protected]>

Merge remote-tracking branch 'upstream/master' into regex_max_program…

2d3dd0e

…_size Signed-off-by: Mark D. Roth <[email protected]>

jmarantz assigned htuch and unassigned jmarantz Apr 29, 2020

htuch mentioned this pull request Apr 30, 2020

tools: strengthen v2 API package regexes in proto_build_targets_gen.py. #11008

Merged

markdroth added 3 commits April 30, 2020 09:26

Merge remote-tracking branch 'upstream/master' into regex_max_program…

3fdd681

…_size Signed-off-by: Mark D. Roth <[email protected]>

make api CI happy

cf6cb5f

Signed-off-by: Mark D. Roth <[email protected]>

fix_format

99bc27f

Signed-off-by: Mark D. Roth <[email protected]>

Merge remote-tracking branch 'upstream/master' into regex_max_program…

512d9ea

…_size Signed-off-by: Mark D. Roth <[email protected]>

htuch approved these changes May 3, 2020

View reviewed changes

repokitteh-read-only bot removed the api label May 3, 2020

htuch merged commit 1012516 into envoyproxy:master May 3, 2020

markdroth deleted the regex_max_program_size branch May 4, 2020 14:57

ejona86 mentioned this pull request May 4, 2020

Heads up: max_program_size deprecation will require re2j changes to allow validation envoyproxy/java-control-plane#138

Open

LisaLudique mentioned this pull request May 13, 2020

max_program_size deprecated but still enforced by client #11185

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

api: GoogleRE2 max_program_size should be checked by server, not client. #10971

api: GoogleRE2 max_program_size should be checked by server, not client. #10971

markdroth commented Apr 27, 2020

repokitteh-read-only bot commented Apr 27, 2020

mattklein123 commented Apr 27, 2020

jmarantz left a comment •

edited

Loading

jmarantz Apr 27, 2020

ejona86 Apr 27, 2020

jmarantz Apr 27, 2020

markdroth Apr 28, 2020

jmarantz Apr 27, 2020

markdroth Apr 28, 2020

ejona86 Apr 27, 2020

markdroth Apr 28, 2020

ejona86 commented Apr 27, 2020

markdroth commented Apr 27, 2020

jmarantz commented Apr 27, 2020

markdroth commented Apr 27, 2020

mattklein123 commented Apr 27, 2020

jmarantz commented Apr 28, 2020

jmarantz commented Apr 28, 2020

htuch commented Apr 28, 2020

markdroth commented Apr 28, 2020

markdroth commented Apr 28, 2020

jmarantz commented Apr 29, 2020

jmarantz commented Apr 29, 2020

htuch commented Apr 30, 2020

markdroth commented Apr 30, 2020

moderation commented Apr 30, 2020

markdroth commented Apr 30, 2020

moderation commented May 2, 2020

htuch left a comment

api: GoogleRE2 max_program_size should be checked by server, not client. #10971

api: GoogleRE2 max_program_size should be checked by server, not client. #10971

Conversation

markdroth commented Apr 27, 2020

repokitteh-read-only bot commented Apr 27, 2020

mattklein123 commented Apr 27, 2020

jmarantz left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ejona86 commented Apr 27, 2020

markdroth commented Apr 27, 2020

jmarantz commented Apr 27, 2020

markdroth commented Apr 27, 2020

mattklein123 commented Apr 27, 2020

jmarantz commented Apr 28, 2020

jmarantz commented Apr 28, 2020

htuch commented Apr 28, 2020

markdroth commented Apr 28, 2020

markdroth commented Apr 28, 2020

jmarantz commented Apr 29, 2020

jmarantz commented Apr 29, 2020

htuch commented Apr 30, 2020

markdroth commented Apr 30, 2020

moderation commented Apr 30, 2020

markdroth commented Apr 30, 2020

moderation commented May 2, 2020

htuch left a comment

Choose a reason for hiding this comment

jmarantz left a comment •

edited

Loading