-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: rework options flow to have a more consistent pattern #390
base: main
Are you sure you want to change the base?
Conversation
… middle layers to eliminate confusion
) | ||
imageResolver := imagefamily.New( | ||
operator.GetClient(), | ||
imageProvider, | ||
) | ||
launchTemplateProvider := launchtemplate.NewProvider( | ||
ctx, | ||
opts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be the highlight worth discussing.
Prior to this, in addition to passing the options (and azConfig
) directly that you can see in the changes preview, some of the providers did call options.FromContext(ctx)
again. I find this inconsistent and would prefer to stick to one side—either pass all required parameters here, or just pass in ctx
/opts
.
The current code chooses passing in opts
. This will make the providers depends on the knowledge of options (no change), while sparing operator from having to know which provider needs what. This is particularly helpful in launchTemplateProvider
, which is have a lot of demands for fields in options.
With that said, each provider have different demands in options fields, but ideally I want to keep the dependencies/pattern consistent for all providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could stay explicit on the interface and pass all variables that we currently getting from options.FromContext(ctx)
through here instead. The interface will be be big for high-demanding providers, but we will be able to cut their dependence on options
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative is to add some layering in options
to categorize them by uses, appearing at "sub-structs".
But this will force options
to contain knowledge about the exact uses of their fields, which is going to add additional complexity, given some fields are shared with multiple uses.
An extension to that is to have additional layer between options
(which will mostly take envs in raw) and this layer to map those to its provider uses. But that would be no difference that just creating structs here, which will only be better than explicit passing by extensibility and small reusability, at the cost of additional complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But all these options are not really making use of options.FromContext(ctx)
pattern established by karpenter-core and AWS provider. If we want to keep that, another options is to just pass in ctx
to all places and let options.FromContext(ctx)
retrieve the options whenever it is needed. This will assume that all instances here are singletons or at least have no differences between each, which is true(?), for now.
If you ask me, I just don't think it is a better option than others. Opinions in offline discussions I had seems to be against this pattern.
Although it is probably the best in terms of consistency.
ClusterEndpoint string // => APIServerName in bootstrap, except needs to be w/o https/port | ||
VMMemoryOverheadPercent float64 | ||
ClusterID string | ||
// Target cluster information; might be use for both bootstrapping and ARM authentications |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind this is I try to categorize this by user-facing behavior (e.g., target cluster for the operations to occur, what will be on provisioned nodes) rather than implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some should be renamed. But let's put that in a different PR
func (o *Options) Default(ctx context.Context) error { | ||
var err error | ||
|
||
var authManager *auth.AuthManager |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a flaw in the implementation, as authManager
is created in both here and operator
. Given core, there is nothing much we can do(?) if we want to follow this pattern.
ClusterName string | ||
ClusterEndpoint string // => APIServerName in bootstrap, except needs to be w/o https/port | ||
ClusterID string | ||
APIServerName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and VnetGUID
(that I just added) are the options that are not available to be configurable directly by the customer, but directly depends on other options through defaulting. Open to suggestions if this should be elsewhere.
@@ -44,13 +44,17 @@ import ( | |||
type Controller struct { | |||
kubeClient client.Client | |||
instanceProvider *instance.Provider | |||
|
|||
// Will be used to calculate the goal state | |||
opts *options.Options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another instance of "not sure if options is overkill", see the comments in operator.go
for similar discussions. But if you ask me I think this is appropriate given the nature of the controllers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just did a quick scan. I like the direction of streamlining options. However, would recommend that we keep the naming generic since the flow doesn't necessarily have to just be used for workload identities.
value: "" | ||
- name: AZURE_NODE_RESOURCE_GROUP | ||
value: ${AZURE_RESOURCE_GROUP_MC} | ||
- name: ARM_AUTH_METHOD | ||
value: "workload-identity" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the reason its called "from environment" is because its actually using a more generic/general auth option that finds the creds in the environment. This is the flow that works for workload-identity. However, I think the naming should probably keep the more generic term since that is what its actually doing. Unless you have more tied to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, by theory, there could be "from environment" auth option that is not workload identity, right?
If that's the case then I think keeping "from-environment" would be a better choice, given we doesn't seem to have ties to workload identity (right?).
// NewCredential provides a token credential | ||
func (am AuthManager) NewCredential() (azcore.TokenCredential, error) { | ||
if am.authMethod == AuthMethodWorkloadIdentity { | ||
klog.V(2).Infoln("cred: using workload identity for new credential") | ||
return azidentity.NewDefaultAzureCredential(nil) | ||
} | ||
|
||
if am.authMethod == AuthMethodSysMSI { | ||
klog.V(2).Infoln("cred: using system assigned MSI for new credential") | ||
msiCred, err := azidentity.NewManagedIdentityCredential(nil) | ||
if err != nil { | ||
return nil, err | ||
} | ||
return msiCred, nil | ||
} | ||
|
||
return nil, fmt.Errorf("cred: unsupported auth method: %s", am.authMethod) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this belong in the shared SDK? For my token fix i imagine other ppl may want the same fix. Uing azure sdk for go extensions may be a good idea for our auth so we can share that logic too. Should we move this there?
controllers := []controller.Controller{ | ||
nodeclaimgarbagecollection.NewController(kubeClient, cloudProvider), | ||
inplaceupdate.NewController(kubeClient, instanceProvider), | ||
inplaceupdate.NewController(kubeClient, instanceProvider, opts), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just options.FromContext(ctx)? or for that matter why not just pass in ctx so that future updates to the controller can access everything from the context object? Are we refactoring options out of ctx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(discussed offline)
Although both share the same principle, options.FromContext(ctx)
seems to be the pattern that is generally discouraged by all opinions we have discussed with, except karpenter-core.
For my own reasoning, this would allow us to separate options
from ctx
, which might wanted to be used for other purposes. We could still argue that options
still contain things that might not be used, but it still allow extensibility and hide the knowledge of "what this component requires" from the creation interface. So I would say it is something in between ctx
and explicitly passing parameters.
Anyway, it is not a strong opinion.
) | ||
|
||
// Default sets the default values for the options, but only for those too complicated to be in env default (e.g., depends on other envs) | ||
func (o *Options) Default(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to go this far, can't we just resolve a lot of the configuration from having clustername and rg?
Why make users pass in things like "NetworkDataplane" or "NetworkPluginMode"? Why not just use the AKS SDK to dynamically resolve that configuration for them?
For Self hosted at least that might make things easier, if the values aren't updated consistently between the two models on restart for managed that might be a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bad idea. Now that we establish this framework, we could consider simplifying the integration interface and get obvious fields by ourselves. A trade-off for this is Karpenter will have more responsibility of "getting required info of a cluster" which could at least lead to having more traffic at the initialization.
I think the changes for this, if we decided to, should be in a separate PR at least. But we can start discussing here if we have strong ideas.
} | ||
|
||
if o.ClusterID, err = getAKSClusterID(o.APIServerName); err != nil { | ||
return fmt.Errorf("failed to get ClusterID: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make these errors typed? Makes this part easier to test, rachel left this feedback in another of my prs but you touch this part a lot more so lmk what you think about changing all of these to typed errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a nice to have, but not a necessity at the moment.
// getAKSClusterID returns cluster ID based on the DNS prefix of the cluster. | ||
// The logic comes from AgentBaker and other places, originally from aks-engine | ||
// with the additional assumption of DNS prefix being the first 33 chars of FQDN | ||
func getAKSClusterID(apiServerFQDN string) (string, error) { | ||
dnsPrefix := apiServerFQDN[:33] | ||
h := fnv.New64a() | ||
h.Write([]byte(dnsPrefix)) | ||
r := rand.New(rand.NewSource(int64(h.Sum64()))) //nolint:gosec | ||
return fmt.Sprintf("%08d", r.Uint32())[:8], nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question, does this belong in agentbaker? We stole it from agentbaker will self contained have this in the pkg/agent/common or wahtever we can just import rather than redefining it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I will dive deeper into this along with the contract interface rework. Hate
func contains(slice []string, target string) bool { | ||
for _, element := range slice { | ||
if target == element { | ||
return true | ||
} | ||
} | ||
return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kubernetes helpers for slices should have this already
} | ||
|
||
if o.NodeResourceGroup == "" { | ||
return fmt.Errorf("missing field, node-resource-group") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about shared error type. we really only care that
- its missing
- what is missing
@@ -40,7 +40,7 @@ type AKS struct { | |||
Arch string | |||
TenantID string | |||
SubscriptionID string | |||
UserAssignedIdentityID string | |||
KubeletIdentityClientID string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great name change!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we worry about these values being different from those in azure json? So people may grep for UserAssignedIdentityID since its called that everywhere else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(discussed offline)
It is better this way given that the UserAssignedIdentityID
that is known by azure.json
is in fact KubeletIdentityClientID
that is known by the cluster. This confusion is the prime motivation for this entire work.
Currently, I choose to keep this variable name until we feed it to the bootstrap module. There could be even more opportunities around that area to make it even more clear...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have this knowledge, we should be aware when coming across their uses in those "everywhere else." Maybe we can catch some big bugs.
return &Provider{ | ||
kubernetesVersionCache: kubernetesVersionCache, | ||
imageCache: cache.New(imageExpirationInterval, imageCacheCleaningInterval), | ||
location: location, | ||
location: opts.Location, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just put options in the Provider struct itself and we can access location from that? I also will need some values from options in some of these providers in this pr https://github.com/Azure/karpenter-provider-azure/pull/365/files so that pattern would make access a bit easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(discussed offline)
I am still hesitate to make these providers depends on options
entirely.
Right now it does module-wise, but not the struct, as the only reference to options
is on creation function (NewProvider
). Maybe the better name to this is NewProviderFromOptions
.
But again, not a strong opinion. Wondering what do others think? Should we be more extreme on one side (e.g., using opts
/ctx
everywhere)?
memory := resources.Quantity(fmt.Sprintf("%dGi", int64(memoryGiB(sku)))) | ||
// Account for VM overhead in calculation | ||
memory.Sub(resource.MustParse(fmt.Sprintf("%dMi", int64(math.Ceil( | ||
float64(memory.Value())*options.FromContext(ctx).VMMemoryOverheadPercent/1024/1024))))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this come from core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, yes, this will introduce a difference between the two.
In general, I think our changes it will be for the better (at least, intention-wise). But the trade-off is consistency.
In this case, I am not completely sure, but leaning towards keeping current change.
Current changes mean consistency with this pattern within our provider (e.g., providers).
Restoring to original means consistency with the AWS provider. But regardless whether we keep other changes or restore all, inconsistency remains within our provider, unless we become more extreme with the use of ctx
.
// TODO(bsoghigian): this should be refactored to lo.Ternary(nodeClass.Spec.VnetSubnetID != nil, lo.FromPtr(nodeClass.Spec.VnetSubnetID), os.Getenv("AZURE_SUBNET_ID")) when we add VnetSubnetID to the nodeclass | ||
vnetSubnetComponents, err := utils.GetVnetSubnetIDComponents(options.FromContext(ctx).SubnetID) | ||
func (p *Provider) getVnetInfoLabels(_ *v1alpha2.AKSNodeClass) (map[string]string, error) { | ||
// TODO(bsoghigian): this should be refactored to lo.Ternary(nodeClass.Spec.SubnetID != nil, lo.FromPtr(nodeClass.Spec.SubnetID), p.opts.SubnetID) when we add VnetSubnetID to the nodeclass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you sure we should differ from the AKS Agentpool api in what we call SubnetID? We discussed this somewhere on the other PR and i thought we agreed on calling it VnetSubnetID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I named it SubnetID from the start but got feedback we should change it to be VNETSubnetID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change is a mistake from mass text replace. My bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But there will be similar changes (not in this PR) that I intend to rename all SubnetID
to VnetSubnetID
. We can discuss more later along with the rest of the renames.
@@ -64,7 +65,9 @@ func NewAPI() client.PricingAPI { | |||
return client.New() | |||
} | |||
|
|||
func NewProvider(ctx context.Context, pricing client.PricingAPI, region string, startAsync <-chan struct{}) *Provider { | |||
func NewProvider(ctx context.Context, opts *options.Options, pricing client.PricingAPI, startAsync <-chan struct{}) *Provider { | |||
region := opts.Location |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not just use opts.Location in the map lookup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is being used in other places (at least line 76/79) as well. But I don't think any choice matters.
Fixes #
Description
How was this change tested?
Does this change impact docs?
Release Note