
Load balancer extensibility #5598

Closed · htuch opened this issue Jan 14, 2019 · 23 comments · Fixed by #17400
Labels: area/load balancing, design proposal (Needs design doc/proposal before implementation), help wanted (Needs help!)

Comments

@htuch (Member) commented Jan 14, 2019

Load balancers seem like a natural first-class extension point in Envoy, but we don't support this today. As load balancer behaviors become more complicated (see, for example, issues such as #4685 and the need for custom locality handling in Istio, CC @rshriram @costinm), it would be great to allow for LB extensions and even CDS delivery of LB behaviors via Lua/WASM.

Complications that make this challenging include the tight integration between the LB and various ClusterManager data structures, such as the host, priority, and locality structures. We would need a tighter, more stable, and better-defined API here. Also, allowing reuse of existing LBs with just minor customization, without having to reimplement the entire LB, would be useful.

I'm opening this issue for discussion of the long-term evolution of the LB implementation; help wanted.

CC @mattklein123 @snowp @cpakulski @rshriram @costinm

@htuch added the design proposal (Needs design doc/proposal before implementation) and help wanted (Needs help!) labels on Jan 14, 2019
@venilnoronha (Member):

/sub

@mattklein123 (Member):

@htuch heads up that I was discussing this with @HenryYYang today, and I don't think we are going to end up needing this for the Redis cluster work, so Lyft won't be implementing it.

@htuch (Member, Author) commented Mar 1, 2019

Ack; it's still on our backlog but not a super high priority.

@snowp (Contributor) commented Nov 23, 2019

@markdroth I see you landed #7744 a while ago that seems to tie nicely into this issue. I was gonna spend some time thinking of how to do this over the next few months, and the config approach in your PR is very similar to what I was thinking. Is this something you're actively working on?

@markdroth (Contributor):

@snowp My impetus for #7744 was for gRPC clients being configured via xDS, not for Envoy, so I am not personally working on the Envoy-side changes to support this. But I do agree that this should be implemented in Envoy, and we had discussed the possibility of @htuch taking this on at some point. But if this is something that you need sooner, I suspect he would not mind if you take this on.

No matter who does the Envoy-side work, I would be happy to consult on functionality and semantics, to make sure that things stay consistent between Envoy and gRPC.

@htuch (Member, Author) commented Dec 2, 2019

@markdroth yeah, @snowp and I chatted at EnvoyCon. He is the Envoy-side domain expert in this area, so it would be awesome if he can own this for us.

@snowp (Contributor) commented Dec 6, 2019

One thing I'm thinking about is whether it would make sense to move the generic LB configuration to the ClusterLoadAssignment proto, which would make it possible to reconfigure arbitrary LB details through EDS (or CDS, since the proto is inlined). If we were to decouple endpoint details from the LB configuration (e.g. by using named_endpoints) and provide a way to cross-reference the structure and the endpoints, it seems like we'd be able to provide a very flexible API for statically compiling in custom load balancers.

If this general approach seems reasonable I’d be happy to put together a doc.

@markdroth (Contributor):

@snowp I don't quite understand what you mean about cross-referencing the structure and the endpoints. Can you give a concrete example of how this might work?

In general, it seems like EDS is mainly dynamic data (i.e., it changes to shift load around), whereas CDS is more configuration data (i.e., it changes only when humans modify it), and I would think that the LB policy configuration fits more into the latter category. But if there are good reasons to do it in EDS, I'm not necessarily opposed.

I have actually considered putting the LB config in EDS at least twice, and both times we didn't wind up going with that approach. Let me provide some context to explain when I considered that and why I didn't go with that approach.

gRPC has the ability to independently select the policy for each level of the routing hierarchy -- i.e., we can choose the policy for picking the locality and then separately choose the policy for picking the endpoint within the locality. We do this by having the locality-picking policy create a child policy for each locality. For each request, the parent policy picks the locality and delegates to the child policy for the chosen locality to pick the endpoint within that locality. In effect, we have a tree of LB policies whose structure matches that of the organizational hierarchy of the endpoints.

I specifically wanted to be able to support that structure when I added the new fields in #7744. The approach that I went with was essentially the same one that we use in gRPC: we have the config for the parent policy include a field that tells it what child policy to use and what config to pass to the child policy. So, for example, if we had a locality-picking policy called "closest_locality", its config might be expressed using the following proto message:

message ClosestLocalityLbConfig {
  envoy.api.v2.LoadBalancingPolicy child_policy = 1;
}

We could use this to configure the "closest_locality" LB policy for locality picking and then independently choose a "weighted_round_robin" policy for endpoint picking. Here's how it would look in CDS:

load_balancing_policy: {
  policies: {
    name: "io.grpc.builtin.closest_locality"
    typed_config: {
      type_url: "type.googleapis.com/io.grpc.ClosestLocalityLbConfig"
      value: {
        child_policy: {
          policies: {
            name: "io.grpc.builtin.weighted_round_robin"
          }
        }
      }
    }
  }
}

While working on #7744, I had originally considered saying that we would configure the locality-picking policy in CDS and then the endpoint-picking scheme in the Locality message in EDS, so that the latter could potentially even be overridden on a per-locality basis. That approach would have worked fine for hierarchical policies, but it would not have provided an intuitive way to represent the existing Envoy LB policies that are non-hierarchical. For example, I understand that Envoy's current ROUND_ROBIN policy handles weighting for both localities and endpoints in a single mechanism. Because it's non-hierarchical, it's not clear how it would be represented in a config that wants to configure each level separately. There are ways we could have made this work. One way would have been to just configure the locality-picking policy and then leave the endpoint-picking policy unset. Another way would have been to configure the same policy in both places, and have the two pieces work together to do the right thing. But neither of these seemed as flexible as the approach we wound up using: because the hierarchical structure is encoded only in the per-policy config for policies that support it, the top-level config can deal only with the top-level LB policy, and any delegation to child policies that may happen inside of the top-level policy is hidden from the rest of the system.

The other time that I considered making this configurable in EDS was for the case I described in #7454. We have a use-case where we have endpoints in two different priorities and we need to use a different endpoint-picking LB policy for each one. We had originally thought about addressing that by allowing per-priority endpoint-picking policy overrides in EDS, but @htuch suggested that we use the aggregate cluster design instead.

Stepping back a bit, I have to say that it does seem a little strange that we have two different prioritization mechanisms that both basically do the same thing, one in the aggregate cluster design and another in the priority field for localities. In the long run as part of UDPA, I wonder if it would make sense to try to restructure this such that a cluster defines priorities as a top-level concept, and then sets localities and endpoints for each priority. In other words, instead of this:

Aggregate Cluster -> Prioritized Cluster -> Prioritized Locality -> Endpoint

we would have this:

Cluster -> Priority -> Locality -> Endpoint

Then we could set defaults at the Cluster level but also override them as needed at the priority level.
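
To make that concrete, here's a rough sketch of what such a restructured resource might look like (purely illustrative; none of these field or policy names exist in xDS today, they just mirror the idea of cluster-level defaults with per-priority overrides):

cluster:
  name: my_cluster
  # Cluster-level default for picking endpoints within a locality.
  default_endpoint_picking_policy: round_robin
  priorities:
  - priority: 0
    localities:
    - locality: {zone: us-east-1a}
      endpoints: [...]
  - priority: 1
    # Per-priority override of the cluster-level default.
    endpoint_picking_policy: least_request
    localities:
    - locality: {zone: us-east-1b}
      endpoints: [...]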

Anyway, this may be more detail than you are actually interested in, but I hope the context is useful. Please let me know what you think.

@htuch (Member, Author) commented Dec 6, 2019

@snowp didn't the last discussion around EDS-overriding-CDS end up with the aggregate cluster compromise? Was that missing some use case or does custom LB introduce new options that might need overriding in EDS?

@htuch (Member, Author) commented Dec 6, 2019

@markdroth isn't "Cluster -> Priority -> Locality -> Endpoint" what we had before aggregate cluster? :)

@markdroth (Contributor):

@htuch No, before aggregate cluster, we had "Cluster -> Prioritized Locality -> Endpoint". That structure couldn't handle overriding various things on a per-priority basis, which is why we had to add aggregate cluster. What I'm proposing is combining the two places that we're doing prioritization into a single one.

@htuch (Member, Author) commented Dec 6, 2019

Yeah, that makes sense. I think this is going to be one of the more complex aspects of UDPA; right now I'm getting the sense that there are some pretty non-controversial wins we can make in UDPA, for example transport protocol, routing, and the moral equivalent of EGDS (i.e. discovering endpoint groups without any of this implied priority and policy). We're going to have to look at a few other proxies and see what they might need in terms of expressiveness, etc., here.

@snowp (Contributor) commented Dec 7, 2019

There seems to be a fundamental question we should probably figure out: how much of the existing load balancing logic should be core vs extensions? In my mind I was imagining a world where everything was just extensions, which is why I was thinking about having an arbitrary config stanza as part of the CLA proto. Imagine something like this on the CLA proto:

named_endpoints:
  foo: ...
  bar: ...
lb_config:
  name: envoy.load_balancers.priority
  config:
    priorities:
    - endpoints: [foo] # priority 0
      inner_lb:
        name: envoy.load_balancers.random
    - endpoints: [bar] # priority 1
      inner_lb:
        name: envoy.load_balancers.least_request

When I said cross-reference in the previous comment, I was referring to the fact that we're naming endpoints and then referring to them by name to define the LB structure in the config. Naming them, rather than defining them inline, means that we can refer to them multiple times:

lb_config:
  name: envoy.load_balancers.priority
  config:
    priorities:
    - endpoints: [foo, baz] # priority 0
      inner_lb:
        name: envoy.load_balancers.locality
        config:
        - weight: 10
          endpoints: [foo]
          inner_lb:
            name: envoy.load_balancers.random
        - weight: 20
          endpoints: [baz]
          inner_lb:
            name: envoy.load_balancers.random

With this, the configuration stored on the CLA is not just the LB policy, but the entire LB hierarchy. This is a pretty substantial change in how load balancing works in Envoy (the LB impls track the LB structure instead of the cluster), but it would make it possible to split up the LB code substantially and make it easier to pick and choose which LB features you want. It also substantially changes how host changes would be propagated in code, due to the current reliance on the PrioritySet, which would no longer be a core part of the Cluster.

The composable API structure would also make it a lot easier for custom LB algorithms to be used for intermediate steps:

lb_config:
  name: envoy.load_balancers.priority
  config:
    priorities:
    - endpoints: [foo] # priority 0
      inner_lb:
        name: envoy.load_balancers.random
    - endpoints: [bar, baz, bax] # priority 1
      inner_lb:
        name: my.custom.lb
        config:
          inner_lb: envoy.load_balancers.least_request

where the high-level idea is that my.custom.lb is used to select which subset of priority 1 should be targeted, and delegates the selection from that subset to the least_request LB.

This is all going down the path of trying to make everything extensible. It might be that we want to make certain things supported directly instead, which would probably warrant a different API.

@markdroth (Contributor):

@snowp In general, I agree that making everything extensible is the right approach. I think that the built-in LB policies can simply be provided as plugins that are shipped with Envoy and available by default. This is what we do in gRPC for the few LB policies that we provide out of the box.

I think there's also another benefit to what you're proposing here, which is that it actually removes the restriction that exists today where endpoints must be grouped into localities with associated priorities. Instead, the hierarchy would be completely customizable by the user: it could be a single, flat list of endpoints, or it could be a multi-level hierarchy of priority, region, locality, and endpoint. In effect, xDS would no longer be enforcing its own notion of the hierarchy in which endpoints are organized. I think this would add a lot of flexibility.
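
As an illustration only (reusing the hypothetical extension names from the earlier sketches, and assuming leaf policies could reference named endpoints directly), a completely flat cluster might need nothing more than:

named_endpoints:
  a: ...
  b: ...
  c: ...
lb_config:
  name: envoy.load_balancers.least_request
  config:
    endpoints: [a, b, c]

while a multi-level hierarchy of priority, region, locality, and endpoint would just be more levels of nested inner_lb configs, with no special treatment from xDS itself.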

One thing that may be fairly complex here is to figure out how to handle the priority failover logic that is used in choosing priority levels and in locality weighted load balancing. I don't think that's an insurmountable issue; it's just something that we need to carefully consider when we design the LB policy API. When we get further into the details, I'd be happy to show you how our API in gRPC handles this sort of thing.

The one noteworthy difference I see between what you're proposing and what we do in gRPC is that you would be explicitly setting the targets along with the config for each level of LB policy, whereas in gRPC, we usually wind up specifying those two things separately. However, the design you propose does actually seem a bit more flexible, since it allows more easily overriding the behavior for each individual child of a given node in the LB policy tree. And gRPC can certainly adapt to this approach when we start using this part of UDPA.

@htuch (Member, Author) commented Dec 10, 2019

I think the endpoint cross-refs are basically what we are thinking about with EGDS (see mentions of this in #8400); there is a lot of alignment around this idea.

@snowp (Contributor) commented Dec 10, 2019

> I think there's also another benefit to what you're proposing here, which is that it actually removes the restriction that exists today where endpoints must be grouped into localities with associated priorities. Instead, the hierarchy would be completely customizable by the user: it could be a single, flat list of endpoints, or it could be a multi-level hierarchy of priority, region, locality, and endpoint. In effect, xDS would no longer be enforcing its own notion of the hierarchy in which endpoints are organized. I think this would add a lot of flexibility.

I think this is pretty key to providing a truly extensible LB experience: having the API (and by extension the implementations) be opinionated about the hierarchy makes it hard for custom LBs to efficiently store endpoints in a different format. In Envoy, this would most likely result in LB extensions managing their own LB state in addition to the PrioritySet, resulting in a lot of wasted space and time spent processing host changes.

Another approach I had in mind for splitting the LB config and endpoints was to use endpoint metadata to carry per-LB-extension information, something like:

named_endpoints:
  foo:
    address: ...
    metadata:
      envoy.lb.priorities:
        priority: 1
      envoy.lb.endpoint_weighting:
        weight: 2
      envoy.lb.locality:
        locality:
          zone: ...

This moves the endpoints out of the dynamic LB config, which would make it very easy to provide partial EDS updates as talked about in #8400: the LB config remains relatively fixed, and endpoints can be added/removed from the endpoint map without having to worry about modifying the arbitrary tree formed by the LB config.

This trades some wire overhead (i.e. potentially lots of config per endpoint) and possibly harder-to-optimize update code (you have to keep scanning the endpoints to see which ones match a priority, locality, etc., vs. knowing the names from the LB config) for the reduction in coupling between the endpoints and the LB config.

@markdroth (Contributor):

I think the most flexible way of approaching this might be to simply have each LB policy control how it identifies its children. There are some cases where it will be useful to directly configure a policy's children in its LB config, but there are other cases where the policy will need to dynamically determine its children at run-time. That might be done based on a request header, or it might be controlled by some external control plane.

One very flexible way of doing this would be to allow each policy to create its own xDS resource type, which it could query via ADS. For example, let's say that we want a simple hierarchy of the following form:

Cluster -> Locality -> Endpoint

There are no priorities. We want to use simple weighted round-robin picking of the locality and then use the least_request policy for picking the endpoint within the locality. To do this, we can define a new xDS resource type that defines the set of localities in the cluster, which would look something like this:

message ClusterLocalities {
  message Locality {
    // Name of locality.
    string name = 1;
    // Weight.
    uint64 weight = 2;
    // List of EGDS resources for this locality.
    message Egds {
      ConfigSource source = 1;
      string name = 2;
    }
    repeated Egds egds = 3;
  }
  repeated Locality locality = 1;
}

The config message for the top-level LB policy can look something like this:

message LocalityLbConfig {
  // How to fetch the cluster locality info.
  ConfigSource cluster_locality_source = 1;
  string cluster_locality_name = 2;
  // The child policy to create for each locality.
  envoy.api.v2.LoadBalancingPolicy child_policy = 3;
}

The top-level LB policy will fetch the ClusterLocalities resource as specified in the config and create the specified child policy for each locality.

The child policy might have a config message that looks like this:

message EndpointLeastRequest {
  // List of EGDS resources for this locality.
  message Egds {
    ConfigSource source = 1;
    string name = 2;
  }
  repeated Egds egds = 1;
}

The child policy will fetch the EGDS resources and perform the least_request algorithm across the resulting endpoints to pick the endpoint for each request.

An actual configuration might look like this:

lb_config:
  name: envoy.load_balancers.locality_round_robin
  config:
    cluster_locality_source:
      ads: {}
    cluster_locality_name: "my_locality_group"
    child_policy:
      name: envoy.load_balancers.least_request

Note that in this case, the config for the top-level LB policy is fully specified in the LB config; it specifies which locality group to use directly in the config. However, the child policy specifies the name only, not the corresponding config, because the top-level policy will construct the config for the child policy based on the Locality data it obtains from the management server. (There are obviously other ways we could have structured this if we wanted, but this is just an example.)

This approach provides flexible decoupling of endpoint data from the LB config, but it would avoid the overhead of tagging everything as metadata on the endpoints, which both avoids the wire overhead and provides more flexibility for non-leaf policies that don't know or care about endpoints.

As a side note, this also makes me think that as part of UDPA, we should consider splitting up some parts of what's currently in RDS and moving it to this new LB policy mechanism instead. For example, it would be trivial to express things like cluster_header or weighted_clusters as an LB policy.
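
Purely as a hypothetical sketch (no such extension exists today), weighted_clusters could be modeled as a parent policy that weights its children and delegates the actual endpoint picking to them:

lb_config:
  name: envoy.load_balancers.weighted_clusters
  config:
    clusters:
    - name: cluster_a
      weight: 80
      inner_lb:
        name: envoy.load_balancers.round_robin
    - name: cluster_b
      weight: 20
      inner_lb:
        name: envoy.load_balancers.round_robin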

@htuch (Member, Author) commented Dec 13, 2019

@markdroth it looks like you have a lot of great ideas for where to take the API. Do you and @snowp want to put together a strawman (similar to what we did for routing) for UDPA-WG? That way we can explore this in a doc rather than GH threads, which are hard to follow, and then feed that next year into some concrete protos.

I would aim for the v3 API (which is where this present issue is likely to intersect) to be a bit more constrained. EGDS makes sense and we need that sooner rather than later, but we should probably descope as much as possible to ensure we can deliver in that time frame.

@yxue (Contributor) commented Dec 13, 2019

/cc @yxue

@jmarantz (Contributor) commented Apr 9, 2020

/cc @jmarantz

@gupta-deeptig (Contributor):

I was looking at how to implement custom load-balancing algorithms in Envoy, and based on this thread, this needs to be a cluster-based extension (similar to what was done for Redis)? The EGDS and other ideas here are not merged, right? Is there any reference on what we can use, and what the restrictions are, for a custom algorithm today?

lizan pushed a commit that referenced this issue Aug 14, 2021
Enables `LOAD_BALANCING_POLICY_CONFIG` enum value in `LbPolicy` and supports typed load balancers specified in `load_balancing_policy`. Continues work done by Charlie Getzen <[email protected]> in #15827.

Custom load balancers specified by `load_balancing_policy` are created as implementations of `ThreadAwareLoadBalancer`. Thread-local load balancers can be implemented as thread-aware load balancers that contain no logic at the thread-aware level, i.e. the purpose of the thread-aware LB is solely to contain the factory used to instantiate the thread-local LBs. (In the future it might be appropriate to provide a construct that abstracts away thread-aware aspects of `ThreadAwareLoadBalancer` for LBs that don't need to be thread-aware.)

A cluster that uses `LOAD_BALANCING_POLICY_CONFIG` may not also set a subset LB configuration. If the load balancer type makes use of subsetting, it should include a subset configuration in its own configuration message. Future work on load balancing extensions should include moving the subset LB to use load balancing extensions.

Similarly, a cluster that uses `LOAD_BALANCING_POLICY_CONFIG` may not set the `CommonLbConfig`, and it is not passed into load balancer creation (mostly owing to its dubious applicability as a top level configuration message to hierarchical load balancing policy). If the load balancer type makes use of the `CommonLbConfig`, it should include a `CommonLbConfig` in the configuration message for the load balancing policy.
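
As a rough sketch, a cluster opting into this might look like the following (illustrative only; the extension name and config type are placeholders, and the exact field layout of `load_balancing_policy` follows the proto shown earlier in this thread, so it may differ between Envoy versions):

clusters:
- name: my_cluster
  # ...other required cluster fields (endpoint discovery, timeouts) omitted...
  lb_policy: LOAD_BALANCING_POLICY_CONFIG
  load_balancing_policy:
    policies:
    - name: my.custom.lb                                  # hypothetical extension name
      typed_config:
        "@type": type.googleapis.com/my.custom.LbConfig   # placeholder config message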

Considerations for migration of existing load balancers:

- pieces of the `ThreadAwareLoadBalancerBase` implementation are specific to the built-in hashing load balancers and should be moved into a base class specifically for hashing load balancers. As it stands, custom load balancing policies are required to implement a `createLoadBalancer()` method even if the architecture of the LB policy does not require a hashing load balancer. I think we would also benefit from disentangling `ThreadAwareLoadBalancerBase` from `LoadBalancerBase`, as the former never actually does any host picking.

- as we convert existing thread-local load balancers to thread-aware load balancers, new local LBs will be re-created upon membership changes. We should provide a mechanism allowing load balancers to control whether this rebuild should occur, e.g. a callback that calls `create()` for thread-aware LBs by default, which can be overridden to do nothing for thread-local LBs.

Risk Level: low
Testing: brought up a cluster with a custom load balancer specified by `load_balancing_policy`; new unit tests included
Docs Changes: n/a
Release Notes: Enable load balancing policy extensions
Platform Specific Features: n/a
Fixes #5598

Signed-off-by: Eugene Chan <[email protected]>
@abhiroop93:

@markdroth @htuch @snowp
By my understanding, it is possible to use hierarchical load balancing. I have been trying to set that up, but there is no sample documentation for it. How do I go about setting up a hierarchical policy, e.g.:

  1. First, select the closest endpoints (by zone).
  2. Then select which endpoint to route to within that zone.

Can you share a sample doc/code snippet for the same?

@markdroth (Contributor):

We don't have any "parent" LB policies today, but we do now have the structure in place such that it should not be too hard for you to write such a policy.

What is the exact behavior you want here for selecting the zone? If you always want the closest zone for any given client, why not just have the control plane send only the endpoints in that zone to the client, so that no matter what it picks, it gets the right thing?
