Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api: GoogleRE2 max_program_size should be checked by server, not client. #10971

Merged
merged 12 commits into from
May 3, 2020

Conversation

markdroth
Copy link
Contributor

Commit Message: api: GoogleRE2 max_program_size should be checked by server, not client.
Additional Description: Deprecates GoogleRE2.max_program_size field so that client does not need to check this.
Risk Level: Low
Testing: N/A
Docs Changes: Included in PR
Release Notes: Included in PR

This PR is sent as a proposal to see what people think. It looks like this field was originally added by @mattklein123 at the request of @jmarantz (see #7878 (comment)). But now that we're trying to add support for regex matchers to gRPC, it occurs to us that this mechanism is very strange, because the management server is telling the client both the regex and how to validate it, rather than just doing the validation itself. (This is the moral equivalent of saying "int_value=5; please reject this config if int_value>10".) It seems a little sub-optimal to push this down to the client; ideally, it should be done when the user creates the config to begin with.

One related problem here, which I've discussed with @htuch, is the fact that the RE2 library supports checking the max program size, but it's available in C++ only. The fact that many (most?) control planes are written in languages other than C++ may have contributed to the way this was done, the logic being that Envoy could always check this, whereas the control plane might not be able to easily do so. However, with the introduction of xDS clients in languages other than C++ (specifically, grpc-java and grpc-go), it is no longer the case that we can assume that this can be done on the client.

@dfawley tells me that there is actually a way to do this same check with the go regexp library, and @ejona86 is going to look into adding support for this in the java re2j library. But even with those building blocks in place, we think it makes more sense for this check to be done on the management server than on the client, because if the check fails, the config should be rejected at that point instead of making it all the way out to the clients.

Signed-off-by: Mark D. Roth [email protected]

@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10971 was opened by markdroth.

see: more, trace.

@mattklein123
Copy link
Member

I would be fine with this change as I think this config has caused more confusion and problems than it has solved, but I will defer to @jmarantz on this.

Copy link
Contributor

@jmarantz jmarantz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK here's my take:

I understand the confusion here but I want to push back on the analogy of

"int_value=5; please reject this config if int_value>10".

I think it's not the same because program_size is not immediately obvious from looking at regexes, and in fact it can and does change as RE2 improves. So I'm biased to having the server protect itself from a config, which might be directly specified, and not absolutely always via a control-plane.

I didn't understand the relevance of the language the control-plane is written in: in general I think control-planes should be robust to having the Envoy reject the config, and passing clear actionable errors back to the control-plane's user.

But I think I'll also defer to @htuch on this, and his vision for the xDS APIs.

//
// This field is deprecated; regexp validation should be performed on the management server
// instead of being done by each individual client.
google.protobuf.UInt32Value max_program_size = 1 [deprecated = true];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT of leaving this in, but providing a reasonable default max_program_size that is bounded, but can be overridden?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what way is that different from the current field? If max_program_size is "not specified, the default is 100."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm getting at what @markdroth is trying to express: this is an API that is intended to be implementable by proxies other than Envoy.

I think maybe he objects to predefining what the default behavior is, and rely solely on the management plane to enforce limits, which concerns me because I am thinking from the perspective of managing a shared-tenancy proxy, and I want the default scenario to consume bounded CPU.

I think maybe we could change the comment from saying "if not specified, the default is 100." to "if not specified, the default limit is implementation-dependent".

Mark, would that be good enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think changing the default here would address my concern. The real issue is that we need to be validating this on the management server, not on the client.

@@ -40,3 +40,6 @@ Deprecated
* Tracing provider configuration as part of :ref:`bootstrap config <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.tracing>`
has been deprecated in favor of configuration as part of :ref:`HTTP connection manager
<envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.Tracing.provider>`.
* The * :ref:`GoogleRE2.max_program_size<envoy_api_field_RegexMatcher.GoogleRE2.max_program_size>`
field is now deprecated. Management servers are expected to validate regexp program sizes
instead of asking the client to do it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here 'client' means the proxy, that is, it's a client of the control-plane?

What if the user is deploying Envoy without a control-plane? Maybe that never happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, "client" means the xDS client (i.e., Envoy or gRPC).

If they're deploying Envoy with a static config, they can statically validate the regexps.

@@ -40,3 +40,6 @@ Deprecated
* Tracing provider configuration as part of :ref:`bootstrap config <envoy_v3_api_field_config.bootstrap.v3.Bootstrap.tracing>`
has been deprecated in favor of configuration as part of :ref:`HTTP connection manager
<envoy_v3_api_field_extensions.filters.network.http_connection_manager.v3.HttpConnectionManager.Tracing.provider>`.
* The * :ref:`GoogleRE2.max_program_size<envoy_api_field_RegexMatcher.GoogleRE2.max_program_size>`
field is now deprecated. Management servers are expected to validate regexp program sizes
instead of asking the client to do it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/asking/expecting/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ejona86
Copy link
Contributor

ejona86 commented Apr 27, 2020

For posterity, in Go you can use regexp/syntax to learn the program size:

regex, err := syntax.Parse("INPUT HERE", 0)
prog, err := syntax.Compile(regex)
len(prog.Inst) // this is the program size

@markdroth
Copy link
Contributor Author

I think it's not the same because program_size is not immediately obvious from looking at regexes, and in fact it can and does change as RE2 improves.

I didn't realize that this would change over time as RE2 improves. That does provide a reason why this would need to be checked on the client rather than on the management server, because the management server does not know what version of RE2 the client is linked with.

That having been said, this still seems problematic, because in practice it's important to catch configuration errors in the management server, before they make their way to the client. If we're saying that there's no way this problem can be detected earlier, then we will wind up with clients rejecting their configs, which can cause an outage. I think we need to think about how this can actually be useful in practice. If a management server does not know what version of RE2 is linked into their clients, how can it catch bad configs before they get to the clients?

Note that while this issue was prompted by the need to support xDS clients other than Envoy, it really doesn't have anything directly to do with that. I think this problem would exist even if all the clients were Envoys, because each Envoy instance can be built with a different version of RE2.

@jmarantz
Copy link
Contributor

Is it practical to have the control-plane guarantee never to send an invalid config to the proxy?

It seems there are in general semantics that could only be verified on the server, and there has to be a mechanism for the server to say "I reject this config, and the system will remain in the exact state it was before you sent it". @htuch is that right?

@markdroth
Copy link
Contributor Author

Note: In the context of xDS, Envoy is the client, not the server. Calling it the server is extremely confusing. :)

I do certainly agree that there will sometimes be cases where only the client can determine if the config is invalid; after all, that's why xDS allows the client to NACK an update. However, I think we should be seeking to minimize these, because in general, the infrastructure will be easier to manage and safer when problems are caught on the control plane instead. What concerns me here is that this seems like something that cannot possibly be caught on the management server, and that makes me very nervous.

Is there some way that we catch this problem on the management server, before it gets to the client?

@mattklein123
Copy link
Member

As an aside, one thing I have considered doing with this field is using it to emit a warning/stat in the client vs. a failure. Would this be a better direction?

@jmarantz
Copy link
Contributor

"server" vs "client"

sorry about that...definitely confusing for me to think fo Envoy as a client.

should be seeking to minimize these, because in general, the infrastructure will be easier to manage and safer when problems are caught on the control plane instead.

I have the opposite instinct. I think that problems with validation on the control-plane client are inevitable and we must do engineering to ensure the config-rejection flow is tested, stable, and user-actionable. That is most likely to be guaranteed if we have cases where it matters and we test for them.

If it makes you nervous, then I think we should make it happen enough so we become confident, because there will be some semantic corner-case some day, that will be accepted by the control-plane and rejected by Envoy (or whatever). So we need to get it right.

is there some way we could catch this problem on the management server?

Not in general, no, and as discussed, not for RE2 if the management server is compiled with a different version of RE2.

As an aside, one thing I have considered doing with this field is using it to emit a warning/stat in the client vs. a failure. Would this be a better direction?

I could live with that, if it makes life simpler.

@jmarantz
Copy link
Contributor

/wait

@htuch
Copy link
Member

htuch commented Apr 28, 2020

Reading the above, I think it's best to do this on the control plane. Even if we have a range of clients, the control plane can be responsible for picking a conservative bound for RE2 size and validating against that. Simple control planes for trusted environments can skip validation. Those that want to enforce bounds for proxies with untrusted downstreams can implement the logic. I recommend deprecation.

@markdroth
Copy link
Contributor Author

@jmarantz I do agree that we need better infrastructure in xDS for handling cases where clients reject their configs. But I think it will take us a long time to get there, because there are so many cases that the protocol itself currently doesn't support. For example, today, a NACK is just for one individual resource, and there is no way to indicate a problem with an entire tree of resources. If we get a CDS update pointing to a different EDS service name and that EDS service name does not exist, we will have already ACKed the CDS resource by the time we discover that the EDS resource does not exist, and the protocol offers no way to indicate that problem to the server. This is one of many things that I would like to see us address as part of UDPA.

That having been said, even if we had the ideal infrastructure for handling validation failures on the client, I continue to believe that our goal should be to avoid that happening in the first place, because by the time the validation failure happens on the client, it's too late to stop it from impacting user traffic. At best, a client that was already running when it got the bad config can ignore the update and keep running with its old config, but not all deployments have the right feedback infrastructure to detect when that's happening. And any client that starts up when the bad config is being served will not have any previous config to keep running with, so it will not be able to start up at all. This situation is rife for outages.

I like Harvey's suggestion of having the control plane validate this by assuming an older version of RE2.

Signed-off-by: Mark D. Roth <[email protected]>
Signed-off-by: Mark D. Roth <[email protected]>
Signed-off-by: Mark D. Roth <[email protected]>
Signed-off-by: Mark D. Roth <[email protected]>
Signed-off-by: Mark D. Roth <[email protected]>
@markdroth
Copy link
Contributor Author

@htuch When I ran fix_format, it generated a whole bunch of new v4alpha files, which I assume is expected. However, the docs test is now complaining about some of those files. Any idea what's wrong here or what I need to do to fix it? Thanks!

@jmarantz
Copy link
Contributor

Catching up on this: given the above I agree it makes sense to deprecate the API field and rely on the control-plane to self-validate. Can't boil the ocean :)

@jmarantz
Copy link
Contributor

I'll go ahead & switch the reviewer to Harvey since this is really just an API change.

@jmarantz jmarantz assigned htuch and unassigned jmarantz Apr 29, 2020
htuch added a commit to htuch/envoy that referenced this pull request Apr 30, 2020
Some weak regexes were causing v4alpha protos to end up in the v2 docs
in envoyproxy#10971, this PR fixes.

Signed-off-by: Harvey Tuch <[email protected]>
lizan pushed a commit that referenced this pull request Apr 30, 2020
…y. (#11008)

Some weak regexes were causing v4alpha protos to end up in the v2 docs
in #10971, this PR fixes.

Signed-off-by: Harvey Tuch <[email protected]>
@htuch
Copy link
Member

htuch commented Apr 30, 2020

@markdroth the docs issue is fixed by #1108, the CI API issue is as discussed.

@markdroth
Copy link
Contributor Author

@htuch Thanks for the quick fix! I've merged from master, re-ran fix_format, and applied the test patch. Let's see if everything's happy now.

@moderation
Copy link
Contributor

This is already a big change but if you wanted to test / bump the re2 dependency you can use:

    com_googlesource_code_re2 = dict(
        sha256 = "98794bc5416326817498384a9c43cbb5a406bab8da9f84f83c39ecad43ed5cea",
        strip_prefix = "re2-2020-04-01",
        urls = ["https://github.com/google/re2/archive/2020-04-01.tar.gz"],
    ),

@markdroth
Copy link
Contributor Author

@htuch I'm not sure why //test/extensions/common/redis:cluster_refresh_manager_test is failing in the CI. It passes when I run it locally. Is it possible there's some flakiness here?

@moderation
Copy link
Contributor

New re2 release 2020-05-01

    com_googlesource_code_re2 = dict(
        sha256 = "88864d7f5126bb17daa1aa8f41b05599aa6e3222e7b28a90e372db53c1c49aeb",
        strip_prefix = "re2-2020-05-01",
        urls = ["https://github.com/google/re2/archive/2020-05-01.tar.gz"],
        use_category = ["dataplane"],
    ),

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants