Support Agent Caching for Service Discovery Results #4541
Conversation
😅 Big PR, but I get why.
This looks awesome. I'm really glad to see the framework for generic agent caching (the ?cached param) start making it in, too. This is looking really good.
I left some inline comments, but nothing major stood out. One positive thing I noticed is that you didn't have to alter any of the previous agent/cache tests, which I think is really important here. That's a good sign, since a lot of those tests cover subtle behaviour and bugs we found in the past.
The one meta-comment I'd make is: maybe we should consider adding information about agent caching to the general docs (not just the API docs) as well, since I think it's a big benefit. I was thinking perhaps 1) on the architecture page and 2) maybe even a page under "Agent", though I'm not sure since it'd probably just repeat a lot of the API docs. :)
agent/cache/cache.go
Outdated
@@ -84,6 +84,7 @@ type typeEntry struct {
type ResultMeta struct {
	// Return whether or not the request was a cache hit
	Hit bool
	Age time.Duration
I would add docs to this. It may be as simple as "since the value was last loaded into the cache", but being clear that this is the age of the entry in this agent's cache, not when the data was set on the server, is an important detail. If this is always 0 for background-refreshing types, it's important to note that too.
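For example, the doc comment could read roughly like this (a sketch of possible wording only, not final text):

```go
package cache

import "time"

// ResultMeta is returned from Get to provide metadata about the cache lookup.
type ResultMeta struct {
	// Hit indicates whether or not the request was a cache hit.
	Hit bool

	// Age is how long ago this agent last loaded the value into its cache,
	// NOT how old the data is on the servers. (If background-refreshing
	// types always report 0 here, that should be called out explicitly.)
	Age time.Duration
}
```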
agent/cache/cache_test.go
Outdated
@@ -631,12 +632,15 @@ func TestCacheGet_expire(t *testing.T) {
	require.Equal(42, result)
	require.False(meta.Hit)

	time.Sleep(5 * time.Millisecond)
Sleeps at this granularity will be flaky in CI, unfortunately. I usually only go in increments of 50ms. Since all the tests are set with t.Parallel
it usually doesn't noticeably increase the time for tests since... Amdahl's law.
Agreed, although I think it's OK here - notice that the only reason for the sleep is not to wait for any background work to complete, but just to ensure that we can measure at least 5ms of age between the fetch this thread previously invoked and the following one. So I don't think it will make things flaky; it's just sanity-checking that the age actually increases by a meaningful amount between two serial calls.
I will add a comment if that isn't clear though, since for "wait for background work" purposes this kind of sleep generally is an unreliable fudge factor.
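For example, the comment could look something like this (illustrative wording only):

```go
// Sleep a short time so the next fetch observes a measurably larger Age.
// NOTE: this is NOT waiting for background work to complete (which would be
// flaky in CI); it only guarantees at least ~5ms of wall-clock time elapses
// between the two serial fetches, so the Age check below is meaningful.
time.Sleep(5 * time.Millisecond)
```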
agent/cache/request.go
Outdated
	// MaxAge if set limits how stale a cache entry can be. If it is non-zero and
	// there is an entry in cache that is older than specified, it is treated as a
	// cache miss and re-fetched. It only really makes sense for cache types that
	// have Refresh: false.
"It only really makes sense" is unclear to me. Does it do anything otherwise? etc. Docs note.
	q.QueryIDOrName,
	q.Limit,
	q.Connect,
}, nil)
This probably doesn't matter, but this will imply ordering on the hash value. The case this might matter in theory is if we ever persist the cache (we do not now) to support "warm starts" of the client.
If you want to nip this in the bud, you can tag a field and hashstructure will do the right thing. You may also want to consider just a full anonymous struct with fields set as well.
hashstructure.Hash(struct {
	Fields []interface{} `hash:"set"`
}{
	Fields: []interface{}{ ... },
})
It might not matter now, but it's probably worth thinking about both ways.
Interesting.
Yeah, I'd assumed that changing the order here would invalidate current caches; we could certainly call that out.
In the case of a full anonymous struct, it seems that would not solve the problem - if we changed the order of the fields in the source, it would presumably change the order in which reflection iterates them, right? So changing source order in either case would invalidate current caches? At least that seems to be the case given the following, which may not be what you mean:
hashstructure.Hash(struct {
	QueryIDOrName string
	Limit         int
	Connect       bool
}{
	QueryIDOrName: q.QueryIDOrName,
	Limit:         q.Limit,
	Connect:       q.Connect,
})
And someone coming along later and adding a field somewhere other than the end, like this, would cause the same problem as with the plain list I have here, right?
hashstructure.Hash(struct {
	QueryIDOrName string
	Limit         int
	OtherThing    bool
	Connect       bool
}{
	QueryIDOrName: q.QueryIDOrName,
	Limit:         q.Limit,
	OtherThing:    q.OtherThing,
	Connect:       q.Connect,
})
I think for now, given that it doesn't matter, a comment about ordering will suffice - if we later rely on the ordering (e.g. for persistence) we can make the change then?
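e.g. something as small as this (sketch):

```go
// NOTE: these fields are hashed as an ordered list, so re-ordering them (or
// inserting a new field anywhere other than the end) changes the hash and
// invalidates any previously cached entries for this request type. That's
// harmless while the cache is in-memory only, but would matter if we ever
// persist it to support warm starts.
v, err := hashstructure.Hash([]interface{}{
	q.QueryIDOrName,
	q.Limit,
	q.Connect,
}, nil)
```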
@mitchellh I added comments for all the noted places and a paragraph to the 10,000 foot view on the architecture page. I kinda think it's worth mentioning there as it is a bit of an architectural change for fault tolerance etc., but I could easily be persuaded to move it somewhere else. I kinda agree that a whole other section under Agent would largely repeat the API docs, but open to suggestions there. @pearkes do you have an opinion? The conflicts are due to another docs thing that touched an API endpoint that I just merged. I'll fix it later since I realised there is another change needed to clean up the merge there first.
…roxy processes Rationale: we have several other components of the agent that relate to Connect proxies, for example the ProxyConfigManager component needed for Envoy work. Those things are pretty separate from the focus of this package so far, which is only concerned with managing external proxy processes, so it's not a good fit to put code for that in here, yet there is a naming clash if we have other packages related to proxy functionality that are not in the `agent/proxy` package. Happy to bikeshed the name. I started by calling it `managedproxy` but `managedproxy.Manager` is especially unpleasant. `proxyprocess` seems good in that it's more specific about purpose but less clearly connected with the concept of "managed proxies". The names in use are cleaner though, e.g. `proxyprocess.Manager`. This rename was completed automatically using golang.org/x/tools/cmd/gomvpkg. Depends on #4541
In general this looks pretty good, although I have a few questions and comments.
- Is there a reason not to use the same caching mechanism for the service discovery aspects of DNS? It seems like we could use the same cache there too.
- If we have a cluster with 30k nodes and each one uses the /v1/catalog/service endpoint for getting information about 3 different services, does that mean that there will be 90k blocking queries running on the leader to do the background refresh? If so, this seems less than ideal.
- Many of the new cache types do the exact same thing: cast the cache request to a DC-specific request, update the min index and issue the RPC. Could this be refactored a little so that we don't implement the same thing 4 times? I think the main issue is that for the RPC reply we need a real object to decode into, so we would have to use factory functions to create the reply vars before handing them to the RPC call. Otherwise the only thing to track is just the name of the RPC endpoint.
- Uncached queries do not update the cache. So, for example, if you cached a prepared query execution and something else came along and executed it uncached, my expectation would be that the next cached execution should just return the value that the uncached query did. As it is, it will return potentially more stale data due to the disconnect between the caching mechanism and the uncached queries. This concern applies to all of the caching. Would it be better to be able to request a direct fetch via the caching infrastructure, so it could do the fetch, cache the result and hand it back?
agent/cache-types/prepared_query.go
Outdated
func (c *PreparedQuery) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
nit: Comment needs updating
@@ -551,6 +616,11 @@ func (s *HTTPServer) parseConsistency(resp http.ResponseWriter, req *http.Reques
		fmt.Fprint(resp, "Cannot specify ?stale with ?consistent, conflicting semantics.")
		return true
	}
	if b.UseCache && b.RequireConsistent {
		resp.WriteHeader(http.StatusBadRequest)
Should this function instead return an error and be refactored to return the BadRequestError?
Maybe not as this change would need to be propagated through a couple other places. Just a thought
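For reference, the suggested shape would be roughly this (a rough sketch, assuming the existing BadRequestError type; the message wording is illustrative and callers would need updating too):

```go
// Instead of writing the 400 here, return an error and let the endpoint
// wrapper translate it into the response.
if b.UseCache && b.RequireConsistent {
	return BadRequestError{Reason: "Cannot specify ?cached with ?consistent, conflicting semantics."}
}
```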
Could but I followed the precedent of all other helpers in this file - I think if we want to clean that up we should do it all at once separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's probably best.
Hmm great questions @mkeeler. Answers inline:
We could do that later. I mostly didn't consider it here because:
Yep. Note that it's exactly the same as what happens now without the cache. If anything it's possible we improve the situation, because if two processes on a host are blocking on the same result and we switch them to ?cached, we now only make one shared blocking query from the cache rather than two separate identical ones. Also note that these are opt-in, so just rolling this out will make zero change to current workloads. It is theoretically possible that we could get more sophisticated in the future and implement something like consul-template's deduplication here, but I doubt we will because that is so complex and brittle, and a better option is likely a more significant change to how blocking queries work to make them much more efficient.
We maybe could reduce duplication a bit here, but it's mostly, as you said, type marshalling, and the request and response types are unique in every case. I personally think it's easier to reason about with this small bit of boilerplate than trying to come up with an abstraction (which may end up being a bad one) just to save a few lines. In practice virtually the only line that is not type-specific is the RPC call, so it's hard to see how that could be factored out without something gross like generating code from a generic template! Example, type-specific lines annotated:

	var result cache.FetchResult

	// The request should be a PreparedQueryExecuteRequest.
	reqReal, ok := req.(*structs.PreparedQueryExecuteRequest) // request type specific
	if !ok { // request type specific (error handling for the type assertion)
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Fetch
	var reply structs.PreparedQueryExecuteResponse // response type specific
	if err := c.RPC.RPC("PreparedQuery.Execute", reqReal, &reply); err != nil {
		return result, err
	}

	result.Value = &reply
	result.Index = reply.QueryMeta.Index // response type specific

	return result, nil

Note that these are pretty minimal examples too - in other cases you might need to manipulate query params on the request or response, which are also all type-specific.
That's a good point to clarify. I've tried to make this entire mechanism opt-in. Caching everything and just not using the results by default could have a huge memory impact on existing workloads - imagine how large the service discovery responses are in some of our busiest users' cases, suddenly all kept in memory on every agent permanently, for no benefit, on upgrade. That seems much worse than the issue here. Even worse, users that consume service discovery APIs would have not just result memory consumed but background blocking query load added to their servers, updating a result they never use. Finally, the caching infrastructure as a read-through would break the current behaviour around consistency semantics - if you issue a request with

It would be possible to only store the "read-through" responses if there is already a staler version in the cache. But in general this problem only applies to non-background-refreshing cache entries, since background refresh ones will be as up to date as the uncached response anyway, barring small timing issues. The only non-background-refresh endpoint currently is prepared queries. This PR in fact already implements some of the moving parts needed to support that case - calls without
DNS Caching

For DNS caching, I was thinking that clients wouldn't do anything to opt in; rather, the agent would be configured to enable using the cache, just like we have DNS configuration to enable stale data. Especially with the background blocking query mechanisms, it seems like the cache could be very helpful there as well. I understand not wanting to include that here as it would make the changes that much larger, but it could be something we should consider for the future.

Blocking Query Performance

As for the performance of the blocking queries, I can agree we are no worse off in that the overall number of queries (blocking and non-blocking) will be lower. However, there is a potential for more blocking queries in flight, and with the current implementation that turns into one goroutine per blocking query. You are right that it is always better if the HTTP clients use blocking queries themselves, but if they are just periodically getting the latest data without blocking then we are worse off as far as long-running queries are concerned. Goroutines are nice, but no scheduler has negligible overhead when dealing with 100K+ tasks. This isn't really a problem with your implementation but rather a deficiency in our current implementation of blocking queries. We really ought to have a worker model for blocking queries and be able to better control our goroutine usage. My real concern is that I have seen us continually add more things that do blocking queries behind the scenes, and those operations, coupled with users' usage of consul-template or other clients utilizing blocking queries, are going to overload the leader servers like we have seen in a few select cases. The issues have cropped up a few times so far and are hard to work around without telling users to just not use the software the way they are. My fear is that if we continue utilizing this more prior to actually fixing blocking query performance we are going to run into more and more issues. I don't think that is something you need to fix in this PR, but we may want to prioritize updating the blocking query mechanism for the near future.

Cache Type Generalization

Here is a good example of what I was talking about:

func (c *CatalogServices) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.ServiceSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

	// Fetch
	var reply structs.IndexedServiceNodes
	if err := c.RPC.RPC("Catalog.ServiceNodes", reqReal, &reply); err != nil {
		return result, err
	}

	result.Value = &reply
	result.Index = reply.QueryMeta.Index
	return result, nil
}

func (c *ConnectCARoot) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.DCSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

	// Fetch
	var reply structs.IndexedCARoots
	if err := c.RPC.RPC("ConnectCA.Roots", reqReal, &reply); err != nil {
		return result, err
	}

	result.Value = &reply
	result.Index = reply.QueryMeta.Index
	return result, nil
}

The only difference between these is the function receiver, the reply type and the RPC endpoint. We could implement this in a generic way like:

package cachetype
import (
	"fmt"

	"github.com/hashicorp/consul/agent/cache"
	"github.com/hashicorp/consul/agent/structs"
)

type GenericBlockingQuery struct {
	NewReply func() interface{}
	Endpoint string
	RPC      RPC
}

func (c *GenericBlockingQuery) Fetch(opts cache.FetchOptions, req cache.Request) (cache.FetchResult, error) {
	var result cache.FetchResult

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.DCSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

	// Fetch
	reply := c.NewReply()
	if err := c.RPC.RPC(c.Endpoint, reqReal, reply); err != nil {
		return result, err
	}

	result.Value = reply
	result.Index = reply.(*QueryMeta).Index
	return result, nil
}

func (c *GenericBlockingQuery) SupportsBlocking() bool {
	return true
}

Then during registration:

a.cache.RegisterType(cachetype.CatalogServicesName, &cachetype.GenericBlockingQuery{
	Endpoint: cachetype.CatalogServicesEndpoint,
	NewReply: func() interface{} { return &structs.IndexedServiceNodes{} },
	RPC:      a,
}, &cache.RegisterOptions{
	// Maintain a blocking query, retry dropped connections quickly
	Refresh:        true,
	RefreshTimer:   0 * time.Second,
	RefreshTimeout: 10 * time.Minute,
})

This would eliminate the type-specific implementations for catalog services, CA roots, intention matches and health services and replace them with a single common implementation. I am not advocating for generalizing all of them, as some, such as prepared queries and leaf certs, need special handling; but for those other 4 where the Fetch functions are basically identical it seems like we could pull them into a common implementation and specialize only the types that actually need different handling. Personally I don't think the generic implementation is any harder to reason about, but I could be biased.

Uncached/Cached Query Interaction

As for uncached queries not updating the cache, I think your reply makes sense. If you aren't using cached queries then you shouldn't incur the extra overhead in terms of memory and a little CPU for cache management to push the replies to uncached queries into the cache. I think maybe we just need to document the behavior really well. Adding something that says "uncached queries will not update the cache even in the case where there are also cached queries executed for the same data" would suffice.
Cool @mkeeler, I think we agree on most of that then :) I agree blocking query usage is becoming more and more of a hot topic, which is why I want to work on optimizing it.
Hmm, your example misses that the request types are different too (partly because of my copy-paste comment typo, probably :D). That means all of this would need to be behind an interface method or something too:

	// The request should be a DCSpecificRequest.
	reqReal, ok := req.(*structs.DCSpecificRequest)
	if !ok {
		return result, fmt.Errorf(
			"Internal cache failure: request wrong type: %T", req)
	}

	// Set the minimum query index to our current index so we block
	reqReal.QueryOptions.MinQueryIndex = opts.MinIndex
	reqReal.QueryOptions.MaxQueryTime = opts.Timeout

Once you account for that too, I'm just not sold that the extra abstraction buys anything real - I don't anticipate this code being touched often once written, or changed in a way where doing a similar edit in 4 or 5 files is a real problem in the grand scheme of things. If we end up finding there is a whole other large bit of code we have to add to all of them, then maybe that would be a good time? Or many cache types with identical request/response types getting added?
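For concreteness, the kind of interface that would imply might look something like this (a purely hypothetical sketch with made-up names; nothing like it exists in this PR):

```go
// BlockableRequest is a hypothetical interface a generic cache type could
// require so it can set blocking options without knowing the concrete
// request struct (names invented for illustration only).
type BlockableRequest interface {
	cache.Request

	// SetBlockingOptions applies the blocking query parameters before the RPC.
	SetBlockingOptions(minIndex uint64, timeout time.Duration)
}
```

Each request struct would then need a small method implementing it, which is roughly the same amount of per-type code as the existing boilerplate - part of why the abstraction doesn't obviously pay for itself.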
…roxy processes. (#4550) * Rename agent/proxy package to reflect that it is limited to managed proxy processes Rationale: we have several other components of the agent that relate to Connect proxies, for example the ProxyConfigManager component needed for Envoy work. Those things are pretty separate from the focus of this package so far, which is only concerned with managing external proxy processes, so it's not a good fit to put code for that in here, yet there is a naming clash if we have other packages related to proxy functionality that are not in the `agent/proxy` package. Happy to bikeshed the name. I started by calling it `managedproxy` but `managedproxy.Manager` is especially unpleasant. `proxyprocess` seems good in that it's more specific about purpose but less clearly connected with the concept of "managed proxies". The names in use are cleaner though, e.g. `proxyprocess.Manager`. This rename was completed automatically using golang.org/x/tools/cmd/gomvpkg. Depends on #4541 * Fix missed windows tagged files
For the record, the problem is worse than you imply on the server already - each blocking watch, depending on the number of radix tree nodes it's watching, can end up with a whole tree of goroutines. JC and friends have noticed the effect of that in their performance analysis, and it's yet another motivator (beyond general efficiency of data transfer and encoding/decoding) to want to take a fundamental look at blocking. So the 90k blocking queries in your example are already going to be probably closer to 500k-plus goroutines on the leader. If you use ?stale it gets a bit better, as the load can be split between other servers including non-voting read replicas. There is a decent argument that all our background blocking should do that by default, which is a sensible change I'll make.
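Concretely, that default could be as small as each cache type marking its background refresh request as stale before issuing the RPC - a sketch, assuming the standard structs.QueryOptions field:

```go
// Allow the background refresh to be served by any server (followers and
// non-voting read replicas), not just the leader, to spread the load.
// A separate bound on staleness keeps badly lagging servers out of play.
reqReal.QueryOptions.AllowStale = true
```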
This would definitely be a separate change, but some version of this for DNS could be really useful. A lot of client libraries don't respect DNS TTLs, so I remember seeing sad setups with thousands of QPS of RPCs coming from DNS lookups (often the same one over and over). For queries coming from DNS I wouldn't bother with any blocking query under the hood (just let those results sit in the cache for the TTL with no invalidation). You could even use the existing TTL in

The blocking query goroutine fanout because of the fine-grained radix tree stuff is pretty interesting to optimize. This was tuned with a benchmark but was still pretty arbitrary - https://github.com/hashicorp/go-memdb/blob/master/watch-gen/main.go#L12-L17. I'll be interested to see where things go with that one :-)
@banks Not having the cache types abstraction is perfectly fine. I thought the request type was already being abstracted to a cache.Request and then coerced into a DCSpecificRequest, which all of those contain. If you think it is better left alone, that's fine with me; I am not hard set on it. Given that, I think all my concerns have been addressed.
Hey @slackpad, I love your drive-by context sharing ❤️. To share a bit more of our motivation here: this is not really a performance optimisation so much as a way to have Consul agents able to manage external proxies which need to have their service discovery "pushed" in. We could solve that part alone with blocking queries and no caching, but then proxies like the built-in one that resolve discovery at connection time break if the servers are unavailable, so this was a nice way to solve for that and get some performance benefit for other clients too if they want to use it. But I totally agree DNS is a great feature to build on this cache later, now that we have it. We may also allow caching for KV eventually too.
@mkeeler thanks. I'm going to update this to fix the comment typo and also to use stale queries with a time bound by default for service discovery, to spread this blocking query load. I think you're still "requesting changes" here...
First, congrats, because this is a very interesting move. I actually did start implementing something less well designed for DNS (but with a basic Most Frequently Used entries cache only) as we have a few operational issues on this side (especially regarding xDC calls).
I have a few remarks however:

On the theoretical side, requesting a stale read for entries is similar to using non-leader servers as a cache. From an API point of view this is the exact same thing, so not having the same mechanism as ?stale&max_stale=5m for caching is IMHO a mistake: when ?stale&max_stale=5m reaches a follower, if the data is not available on the follower side and the leader is not available, you get an error. It means that if you use ?cached&max_stale=5m in the same way, you should either get a cached value that is known to be fresher than 5m (whether or not the call has actually been made is pure detail) AND you should get an error if it is not in the cache.

Cache invalidation is hard, but adding stale-if-error makes the whole thing complex to understand as an API, IMHO, without immediate benefit (or at least none that I understand).

In our shop we use a max_stale of 72h (on DNS for instance, but also with the discovery_max_stale parameter) and we implement monitoring to be notified when we have more than 30s of delay between servers. Having 72h is almost equivalent to stale-if-error=no, but the semantics for the caller are easier to get.

Furthermore, as a Consul administrator, I really would like to use the same default mechanism as discovery_max_stale (see #3920) to replace stale by cache by default. Having a single semantics of "an error occurs if the cache is older than max_stale" would really make it useful and easier for users to understand.

I also see the will to re-use existing HTTP cache mechanisms, but as sad as caching on the Internet is (basically, hard to get right), I am not convinced by the use of Max-Age, which is very often overused.

I also wonder how blocking queries work with this cache mechanism (I mean on the client side); IIUC, the result would be strictly equivalent, right?

Kind Regards
@@ -551,6 +616,11 @@ func (s *HTTPServer) parseConsistency(resp http.ResponseWriter, req *http.Reques
		fmt.Fprint(resp, "Cannot specify ?stale with ?consistent, conflicting semantics.")
		return true
	}
	if b.UseCache && b.RequireConsistent {
		resp.WriteHeader(http.StatusBadRequest)
I have a few remarks:
- To the administrator: it would be cool to use the DiscoveryMaxStale mechanism for enabling caching by default.
- To the API user: having max-age as well as maxStale is confusing. As a non-expert Consul user, I would mix the two: max_stale can be semantically seen as max-age for a non-leader server (if you think of non-leader servers as a kind of caching mechanism). I wonder if it would be nice to merge the 2 concepts, because they are actually very close semantically for a developer.
@@ -682,6 +752,9 @@ func (s *HTTPServer) parseInternal(resp http.ResponseWriter, req *http.Request,
	if s.parseConsistency(resp, req, b) {
		return true
	}
	if parseCacheControl(resp, req, b) {
Same remark here; it might be interesting to think about whether max_stale is equivalent to max-age...
if err != nil {
	// Don't return error if StaleIfError is set and we are within it and had
	// a cached value.
	if raw != nil && m.Hit && args.QueryOptions.StaleIfError > m.Age {
I would not introduce StaleIfError, as it renders the whole thing VERY complicated to understand from a user's point of view.
@pierresouchay thanks for taking a look! Some great points - most of your suggestions are things we considered in our internal RFC process and rejected for a few reasons. I'll sum up quickly though:
There are lots of future features and optimisations possible for other use cases as you suggest but they are out of scope for now as this is so huge already.
Yeah I considered this. I think it certainly makes some sense but I want to leave it as a future enhancement as it's not needed for the original intent of this work.
StaleIfError is needed to meet the second design goal I stated for this work - to allow "fail static" behaviour for the built-in proxy and other clients. In fact it's only needed for prepared queries in this case, since the blocking refresh endpoints support "fail static" already. Without it there is no way to have prepared queries that update frequently in the happy case when servers are up, but still return cached values for a much longer period if the servers are unreachable. I do agree that there is a lot of extra complexity added here, but it seems cleaner than "breaking" the existing stale semantics. I also only expect expert-level users who are prepared to read through the details to want to use this. Caching is all about trade-offs and you should really understand them in detail to make a good decision about whether they meet your needs. I tried to stick with standard HTTP caching directives and semantics for the new features to make it clearer for people familiar with configuring HTTP caching in CDNs and similar, and to make it clear that this is a whole different mechanism than the "stale" consistency optimisation, which is a Raft thing. Does that make sense?
That commit:
I think this is still "approved" but any further feedback is welcome. We likely won't merge this directly: with our refined release schedule it makes sense to move this work onto a feature branch rather than add it piecemeal to master, so I'll do that and attempt to rebase this there before it's merged tomorrow.
…st that caching works
TODO: - docs for all endpoints to include cache support info
…to make the new data path more clear.
…ests to AllowStale so we can spread service discovery load across servers.
I rebased this onto a new feature branch off the current master so we can merge Envoy-supporting PRs without putting them on the release path for the 1.2.3 patch release until it's all done. Given the approvals here, I'll merge it to f-envoy now. If there are any additional concerns, please commit or bring them up in future review when the feature branch is merged.
* Add cache types for catalog/services and health/services and basic test that caching works * Support non-blocking cache types with Cache-Control semantics. * Update API docs to include caching info for every endpoint. * Comment updates per PR feedback. * Add note on caching to the 10,000 foot view on the architecture page to make the new data path more clear. * Document prepared query staleness quirk and force all background requests to AllowStale so we can spread service discovery load across servers.
This PR ended up being larger than I imagined. I was originally going to split this into just being the Catalog and Health changes which needed very little change to the Cache package as they have similar semantics.
I ended up just adding the ground work for Prepared Queries which can't block and ended up doing the whole thing as it felt like there was no good place to stop.
So this PR includes:
- /catalog/services/:service now supports caching with the ?cached param
- /health/services/:service now supports caching with the ?cached param
- /query/:name/execute now supports caching with the ?cached param
- Cache-Control headers to allow clients to manage the staleness of the cached response, since we need to trade keeping cached results around for a long time for failure tolerance against not serving 3-day-old responses to prepared queries on the happy path

There are a bunch of tests and documentation for this. I suggest reviewers first read the documentation on caching to understand the intended semantics before getting too involved in the code changes.
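To give a feel for the client-facing surface, here is a hedged sketch of what a cached request might look like against a local agent (the endpoint path, header values and address are illustrative; see the caching documentation added in this PR for the authoritative semantics):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Ask the local agent for a service discovery result, allowing it to be
	// served from the agent cache (?cached). The Cache-Control header asks for
	// a result no older than 30s on the happy path, while allowing a result up
	// to 3 days old to be served if the servers are unreachable.
	req, err := http.NewRequest("GET", "http://127.0.0.1:8500/v1/health/service/web?cached", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Cache-Control", "max-age=30, stale-if-error=259200")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Printf("status=%d, body bytes=%d\n", resp.StatusCode, len(body))
}
```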