
print featuregates via antctl #2082

Merged (1 commit) on Jun 3, 2021

Contributor

@luolanzone luolanzone commented Apr 13, 2021

Allow users to run antctl get featuregates inside the Agent or Controller container, or out of cluster, to get feature gate information.
Resolves #2010
Sample CLI output:

#antctl get featuregates
Antrea Agent Feature Gates
FEATUREGATE              STATUS         VERSION
AntreaProxy              Enabled        BETA
Egress                   Disabled       ALPHA
EndpointSlice            Disabled       ALPHA
Traceflow                Enabled        BETA
FlowExporter             Disabled       ALPHA
NetworkPolicyStats       Disabled       ALPHA
NodePortLocal            Disabled       ALPHA
AntreaPolicy             Enabled        BETA

Antrea Controller Feature Gates
FEATUREGATE              STATUS         VERSION
NetworkPolicyStats       Disabled       ALPHA
AntreaPolicy             Enabled        BETA
Egress                   Disabled       ALPHA
Traceflow                Enabled        BETA

Signed-off-by: Lan Luo [email protected]

Contributor

@antoninbas antoninbas left a comment

I feel like this could be a bit misleading because:

  • some features are only significant for the Agent and some only for the Controller. For example, if someone has disabled AntreaProxy in the Agent configuration but uses a recent version of Antrea (where AntreaProxy is Beta), running this command from the Controller Pod will say that AntreaProxy is enabled which is incorrect.
  • it seems that the command won't run "out-of-cluster", yet this is a useful scenario IMO. If someone wants to quickly & accurately check which features are enabled, they should be able to run it out-of-cluster and get correct information based on the contents of the ConfigMap.

@luolanzone
Contributor Author

@antoninbas for item 1, I will check and split them into two parts, one for the Agent and one for the Controller. For item 2, I will move it and make sure it can be run out of cluster; if I am not wrong, only the Controller supports that? I am a little concerned about getting the data from the ConfigMap directly, since I noticed that even if I manually change the ConfigMap, the change is not reflected in the Agent/Controller until I restart them. I think it would be better to use a ConfigMap checksum to make sure the Agent/Controller is restarted automatically when it's updated.
Another question: is there any reason we need to use one ConfigMap for both the Agent and the Controller? I feel it's better to split them, but I guess that's another story. : )
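To illustrate the checksum idea mentioned above, here is a minimal Go sketch. It is not taken from the PR: the function name configChecksum and the sample data are hypothetical, and it only shows how a stable digest of ConfigMap-style data could be computed so that a deployment tool can place it in a Pod template annotation and trigger a rollout when the data changes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configChecksum (hypothetical helper) returns a stable SHA-256 digest of
// ConfigMap-style data. Keys are sorted so that Go's random map iteration
// order does not affect the result.
func configChecksum(data map[string]string) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
		fmt.Fprintf(h, "%d:%s%d:%s", len(k), k, len(data[k]), data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Illustrative data only; the real ConfigMap keys and contents may differ.
	before := map[string]string{"antrea-agent.conf": "featureGates:\n  Egress: false\n"}
	after := map[string]string{"antrea-agent.conf": "featureGates:\n  Egress: true\n"}

	fmt.Println("checksum before:", configChecksum(before))
	fmt.Println("checksum after: ", configChecksum(after))
}
```

If the digest is written into the Pod template (for example as an annotation), any edit to the ConfigMap changes the template and therefore causes the Pods to be recreated.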

@jianjuns
Contributor

For featuregates on the agent side, I feel the only way is to read from ConfigMap if we want to get them remotely.

@luolanzone luolanzone force-pushed the featuregates branch 2 times, most recently from 760007b to b4cddf8 Compare April 14, 2021 07:16
@luolanzone
Contributor Author

Hi @jianjuns @antoninbas, could you help review again? I have refined it to get feature gate info from the ConfigMap when it's run out-of-cluster, and to read from the config file directly when it's run in a Pod. Thanks.

Contributor

@jianjuns jianjuns left a comment

I thought you will add a controller API and an agent API (for in-Pod execution) to get the accurate featuregate information. But I realized anyway you need to rely on ConfigMap for the remote execution case.
However, one problem with the current approach is that antctl version must match the controller/agent version to have the right default featuregate values. To address this, seems we still need API calls to controller (for remote execution). Not sure it is overkill.
@antoninbas ?

@antoninbas
Contributor

I thought you will add a controller API and an agent API (for in-Pod execution) to get the accurate featuregate information.

I was also expecting that.

However, one problem with the current approach is that antctl version must match the controller/agent version to have the right default featuregate values. To address this, seems we still need API calls to controller (for remote execution). Not sure it is overkill.

I think it's a bit of an issue. One solution would be to add the FeatureGate information (default values, graduation stage) to the ControllerInfo and have antctl consume that (along with the ConfigMap). That is, as an alternative to introducing a new API.

@jianjuns
Contributor

One solution would be to add the FeatureGate information (default values, graduation stage) to the ControllerInfo and have antctl consume that (along with the ConfigMap).

Returning featuregate default values by ControllerInfo sounds strange to me.

@antoninbas
Contributor

One solution would be to add the FeatureGate information (default values, graduation stage) to the ControllerInfo and have antctl consume that (along with the ConfigMap).

Returning featuregate default values by ControllerInfo sounds strange to me.

Then maybe the actual values directly? When I was working on telemetry, I had that code in the Controller.
Otherwise, and if we don't want to add a new API, I would recommend at least checking the version info in ControllerInfo and printing a warning when running antctl featuregates if there is a version mismatch.
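The version-mismatch warning suggested above could look roughly like the sketch below. This is an assumption-laden illustration, not code from the PR: normalize and mismatchWarning are hypothetical names, and ignoring pre-release and build suffixes when comparing versions is a design choice made here for simplicity.

```go
package main

import (
	"fmt"
	"strings"
)

// normalize (hypothetical) drops a leading "v" and any pre-release or build
// suffix, so "v1.1.0" and "1.1.0+abc123" compare as equal.
func normalize(v string) string {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if i := strings.IndexAny(v, "-+"); i >= 0 {
		v = v[:i]
	}
	return v
}

// mismatchWarning (hypothetical) returns a warning message when the antctl
// and Controller versions differ, or an empty string when they match.
func mismatchWarning(antctlVersion, controllerVersion string) string {
	if normalize(antctlVersion) == normalize(controllerVersion) {
		return ""
	}
	return fmt.Sprintf(
		"Warning: antctl version %s does not match Controller version %s; default feature gate values may be inaccurate",
		antctlVersion, controllerVersion)
}

func main() {
	if w := mismatchWarning("v1.0.0", "v1.1.0"); w != "" {
		fmt.Println(w)
	}
}
```

As noted in the thread, a warning like this tells the user the output may be wrong but does not make it right, which is why reporting the actual values through an API was preferred.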

@luolanzone
Contributor Author

@jianjuns, I was trying to make the source of feature gates consistent between the in-Pod and out-of-cluster cases, so I removed the original nonResourceURL API. But it is indeed a problem when the antctl version is inconsistent with the Controller/Agent version. I can add a nonResourceURL to return the default feature gates and call it from out-of-cluster.
@antoninbas since a version mismatch warning won't give the user the right feature gates list, I think it's better to add a new API. Any objection?
For the version mismatch between antctl and the Controller, maybe we need a separate enhancement issue to print the warning?

@antoninbas
Contributor

No issue on my side with adding an API in system to report feature gate enablement, but let's check with @tnqn.

If we do it, there is no need to check for version mismatch any more.

@tnqn
Member

tnqn commented Apr 15, 2021

Using the actual values as the source of truth makes sense to me. I once debugged an issue where the logs made it clear that the feature was not enabled, yet the ConfigMap showed it as enabled (it must have been overwritten after the processes had read it).
@antoninbas By system group, do you mean making it a versioned resource API? I see currently many of this kind of APIs are non-resource and just have a path and handler, which seems suitable for this one too.

@antoninbas
Contributor

@antoninbas By system group, do you mean making it a versioned resource API? I see currently many of this kind of APIs are non-resource and just have a path and handler, which seems suitable for this one too.

Yes, I was suggesting using aggregation to have it work out of cluster. We can also have antctl connect to the Controller's apiserver directly like we do for some other commands. I guess the less we rely on aggregation, the better. And non-versioned is not an issue as long as we don't evolve the API in the future...

Comment on lines 37 to 40
Command.Long = "Get current Antrea agent feature gates info"
} else if runtime.Mode == runtime.ModeController && runtime.InPod {
Command.RunE = controllerLocalRunE
Command.Long = "Get current Antrea controller feature gates info."
Contributor

we typically capitalize "Agent" and "Controller", like you did below

Comment on lines 116 to 94
if mode == "remote" {
if agentClient, err = getAgentClient(k8sClientset, antreaClientset, restconfigTmpl); err != nil {
return err
}
if controllerClient, err = getControllerClient(k8sClientset, antreaClientset, restconfigTmpl); err != nil {
return err
}
}
Contributor

why not do this in the switch statement above?

Contributor Author

I was trying to reduce the error checks in the switch cases for the Agent-only and Controller-only modes. For code consistency, I will move it back into the switch statement.

agentClient, err = rest.RESTClientFor(restconfigTmpl)
case runtime.ModeController:
controllerClient, err = rest.RESTClientFor(restconfigTmpl)

Contributor

extra empty line?

}
}

if mode == runtime.ModeAgent || mode == "remote" {
Contributor

I think it would be better to test agentClient for nil here instead of using this if condition. Same below for controllerClient.
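A minimal sketch of the nil-check suggestion, under stated assumptions: gateClient and collect are hypothetical stand-ins for the REST clients and calling code in the PR. The point is that once the clients are set up per mode, the later code only needs to ask whether a client exists.

```go
package main

import "fmt"

// gateClient is a hypothetical stand-in for the REST clients used in the PR.
type gateClient struct{ component string }

func (c *gateClient) featureGates() string {
	return c.component + " feature gates"
}

// collect queries only the clients that were actually initialized, so the
// mode checks that decided which clients to create are not repeated here.
// Concrete pointer types are used so that the nil comparison is reliable.
func collect(agentClient, controllerClient *gateClient) []string {
	var out []string
	if agentClient != nil {
		out = append(out, agentClient.featureGates())
	}
	if controllerClient != nil {
		out = append(out, controllerClient.featureGates())
	}
	return out
}

func main() {
	// Controller-only mode: the Agent client was never created.
	fmt.Println(collect(nil, &gateClient{component: "Controller"}))
}
```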

Comment on lines 44 to 62
isController := false
if runtime.Mode == runtime.ModeController {
isController = true
}
Contributor

This is more of an antctl concept. I'd rather have 2 different handlers or add a parameter to this handler function.

Comment on lines 116 to 99
if mode == "remote" {
if agentClient, err = getAgentClient(k8sClientset, antreaClientset, restconfigTmpl); err != nil {
return err
}
Contributor

I am not sure about this part. It seems that in getAgentClient you look for a "random" Node and you select the Agent instance running on that specific Node. I don't think this is great, as the feature gates may not be the same for all Agents, if we are in the middle of an upgrade for example. IMO, the best way for "remote" mode would be to only provide the feature gates for the Controller by default, but potentially introduce a new command-line option which would accept a Node name. We would then select the Agent instance running on that specific Node.

Contributor Author

Ah, OK. I thought all Agents read the same ConfigMap, so the feature gates should be the same; I didn't consider the upgrade case. But adding a Node flag will be more complicated. It looks like @jianjuns's suggestion #2082 (review) is a simpler way to handle this. What are your thoughts?

Contributor

@jianjuns jianjuns left a comment

@luolanzone @antoninbas : so I wonder why we need to complicate the featuregates command. Could we just follow other normal commands like "get versions", that:

  1. When running remotely, antctl connects to controller via K8s API, and the controller can return the features based on the ConfigMap.
  2. When running inside agent Pod, antctl connects to agent, and agent returns the features based on its internal state.

@antoninbas
Contributor

@luolanzone @antoninbas : so I wonder why we need to complicate the featuregates command. Could we just follow other normal commands like "get versions", that:

  1. When running remotely, antctl connects to controller via K8s API, and the controller can return the features based on the ConfigMap.
  2. When running inside agent Pod, antctl connects to agent, and agent returns the features based on its internal state.

That works for me. I thought that's what we were going to do from the get go, since 1) is what I had done for telemetry (retrieve the ConfigMap and "merge" it with the "defaults" information). This is the most useful case, although I can see some value in being able to get the "actual" feature gate configuration for an agent (bullet point 2).
Do you think we should distinguish between the command when it's run within the Controller Pod and the command when it's run from outside? I'm not sure it's necessary, given once again that I think 1) is the most useful case.
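The "merge the ConfigMap with the defaults" approach described above can be sketched as follows. This is not the code from the linked reporter.go; the gateSpec and gateStatus types, the merge function, and the sample values are assumptions chosen to mirror the table format in the PR description.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"text/tabwriter"
)

// gateSpec (hypothetical) holds the built-in default and graduation stage.
type gateSpec struct {
	Default bool
	Stage   string
}

// gateStatus (hypothetical) is one row of the printed table.
type gateStatus struct {
	Name, Status, Version string
}

// merge applies ConfigMap overrides on top of the built-in defaults.
// Overrides for gates that are not in defaults are ignored in this sketch.
func merge(defaults map[string]gateSpec, overrides map[string]bool) []gateStatus {
	out := make([]gateStatus, 0, len(defaults))
	for name, spec := range defaults {
		enabled := spec.Default
		if v, ok := overrides[name]; ok {
			enabled = v
		}
		status := "Disabled"
		if enabled {
			status = "Enabled"
		}
		out = append(out, gateStatus{Name: name, Status: status, Version: spec.Stage})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Illustrative values only.
	defaults := map[string]gateSpec{
		"AntreaProxy": {Default: true, Stage: "BETA"},
		"Egress":      {Default: false, Stage: "ALPHA"},
	}
	overrides := map[string]bool{"AntreaProxy": false} // as if set in the ConfigMap

	w := tabwriter.NewWriter(os.Stdout, 0, 0, 4, ' ', 0)
	fmt.Fprintln(w, "FEATUREGATE\tSTATUS\tVERSION")
	for _, g := range merge(defaults, overrides) {
		fmt.Fprintf(w, "%s\t%s\t%s\n", g.Name, g.Status, g.Version)
	}
	w.Flush()
}
```

Because the defaults come from whichever binary runs this code, the result is only accurate when that binary's version matches the component being described, which is the version concern raised earlier in the thread.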

@luolanzone
Contributor Author

Hi @jianjuns, after checking the code, it looks like "version" only prints the Controller's version, whether it's run in-Pod or out of cluster. If we only want to show the Controller's feature gates when we issue 'antctl featuregates', we can use the internal state from the Controller Pod; we just need two handlers, one in the controller package and one in the agent package. Here are the two ways to implement it that I have in mind:

  1. Show only the Controller's feature gates when running 'antctl featuregates', whether it's run out-of-cluster or in the Controller Pod.
    This is the simplest way, and we can just use two similar handlers to return their respective internal state.
  2. Show both Controller and Agent feature gates, whether it's run out-of-cluster or in the Controller Pod.
    This one requires the Controller handler to read the ConfigMap to get both the Agent and Controller feature gates; the Agent will still have its own handler to return feature gates based on its internal state.

Since you mentioned the ConfigMap, I would like to clarify: do you prefer the first, simpler way, showing only the Controller's own feature gates?

@jianjuns
Contributor

Hi @luolanzone: I would go with your #2. As @antoninbas said, what we want most is a command to get all features.

@antoninbas
Contributor

@luolanzone the out-of-cluster command (for the Controller) should definitely print both sets of feature gates (Agent + Controller), based on the ConfigMap. Some code similar to this: https://github.com/antoninbas/antrea/blob/488da983bf3c8b581a03f4550a409629ab66d6a4/pkg/controller/usagereport/reporter.go#L178-L219
Calling the command from inside the Controller Pod can do the same if it's simple, or can (I think it would be better) report only the Controller's feature gates (not using the ConfigMap), to mirror what we want to do for the Agent.

@luolanzone luolanzone force-pushed the featuregates branch 2 times, most recently from 6941ff9 to 1866513 Compare May 24, 2021 13:57
Member

@tnqn tnqn left a comment

LGTM overall. Two minor comments left.
@jianjuns @antoninbas do you want to take a look too since you reviewed it a few times?

kubeconfig.Insecure = true
kubeconfig.CAFile = ""
kubeconfig.CAData = nil
}
Member

This function may be confusing. There is a SetupKubeconfig too, so it looks like that is the secure one and this is the insecure one. Can't its caller use SetupKubeconfig?

Contributor Author

I just renamed it to SetupBaseKubeconfig; this simple one will be called in the proxy command. We don't have a secure one yet.

Contributor Author

Ah, I may have misunderstood. Let me check if it's possible to just use SetupKubeconfig for the proxy command.

Contributor Author

After trying it locally, the proxy command works with SetupKubeconfig, so I just removed SetupBaseKubeconfig.

podNameEnvKey = "POD_NAME"
podNamespaceEnvKey = "POD_NAMESPACE"
svcAcctNameEnvKey = "SERVICEACCOUNT_NAME"
antreaConfigMapEnvKey = "ANTREA_CONFIG_MAP_NAME"
Member

The other variables don't use antreaSvcAcctNameEnvKey and ANTREA_SERVICEACCOUNT_NAME, maybe keep same here? i.e.:

configMapEnvKey = "CONFIGMAP_NAME"

Contributor Author

@luolanzone luolanzone May 25, 2021

I am following the name used downstream, which was provided by @antoninbas. I guess we should keep them the same to reduce maintenance effort?

Member

I see. No strong opinion.

@luolanzone luolanzone force-pushed the featuregates branch 2 times, most recently from 9ca1c06 to 12743f4 Compare May 25, 2021 13:12
@luolanzone
Contributor Author

@antoninbas @jianjuns do you have any new comments on this PR? @tnqn recently left a few comments, which have been addressed. Please let me know if there are any new comments, or whether Quan can help move it forward. Thanks!

jianjuns previously approved these changes May 27, 2021
Contributor

@jianjuns jianjuns left a comment

LGTM

@tnqn
Member

tnqn commented May 28, 2021

@luolanzone since this is not urgent enough to be included in v1.1, I will merge it after v1.1 is released so as not to impact the release. Thanks for your patience.

@antoninbas antoninbas added this to the Antrea v1.2 release milestone May 28, 2021
@tnqn
Member

tnqn commented Jun 2, 2021

@luolanzone as Antrea is a CNCF project now, it has adopted DCO on all PRs. The PR is not mergeable because of that check. Could you sign off the commits, following https://github.com/antrea-io/antrea/blob/main/CONTRIBUTING.md#sign-off-your-work? Meanwhile, you can squash the commits.

@luolanzone
Contributor Author

luolanzone commented Jun 2, 2021

@tnqn sure, I just rebased and squashed the code, updated the commit with sign-off info, please help to check, thanks!

@tnqn
Member

tnqn commented Jun 2, 2021

/test-all

@tnqn
Member

tnqn commented Jun 2, 2021

All e2e tests failed on the antctl case, maybe due to some code conflict.

2021-06-02T07:01:17.1185324Z === RUN   TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070117Z.out_featuregates
2021-06-02T07:01:17.3005161Z ##[error]    antctl_test.go:93: Error when running `antctl [featuregates]` from antrea-agent-z9gbb: command terminated with exit code 1
2021-06-02T07:01:17.3007850Z         antctl stdout:
2021-06-02T07:01:17.3008252Z         
2021-06-02T07:01:17.3008676Z         antctl stderr:
2021-06-02T07:01:17.3009743Z         Error: unknown command "featuregates" for "antctl-coverage"
2021-06-02T07:01:17.3010746Z         Run 'antctl-coverage --help' for usage.
2021-06-02T07:01:17.3011538Z === CONT  TestAntctlAgentLocalAccess
2021-06-02T07:01:17.3014242Z ##[error]    fixtures.go:224: Exporting test logs to '/home/runner/work/antrea/antrea/log/TestAntctlAgentLocalAccess/beforeTeardown.Jun02-07-01-17'
2021-06-02T07:01:20.2840591Z ##[error]    fixtures.go:349: Deleting 'antrea-test' K8s Namespace
2021-06-02T07:01:26.2947712Z --- FAIL: TestAntctlAgentLocalAccess (47.66s)
2021-06-02T07:01:26.2949890Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070113Z.out_version (0.42s)
2021-06-02T07:01:26.2952566Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070113Z.out_log-level (0.33s)
2021-06-02T07:01:26.2955058Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070114Z.out_get_networkpolicy (0.30s)
2021-06-02T07:01:26.2957772Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070114Z.out_get_appliedtogroup (0.32s)
2021-06-02T07:01:26.2960320Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070114Z.out_get_addressgroup (0.30s)
2021-06-02T07:01:26.2962809Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070115Z.out_get_agentinfo (0.28s)
2021-06-02T07:01:26.2965282Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070115Z.out_get_podinterface (0.29s)
2021-06-02T07:01:26.2967744Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070115Z.out_get_ovsflows (0.33s)
2021-06-02T07:01:26.2970193Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070116Z.out_trace-packet (0.28s)
2021-06-02T07:01:26.2972650Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070116Z.out_supportbundle (0.53s)
2021-06-02T07:01:26.2975106Z     --- PASS: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070116Z.out_traceflow (0.26s)
2021-06-02T07:01:26.2977942Z     --- FAIL: TestAntctlAgentLocalAccess/antctl-coverage_-test.run=TestBincoverRunMain_-test.coverprofile=antctl-20210602T070117Z.out_featuregates (0.18s)

@luolanzone
Contributor Author

@tnqn antctl_test uses a specific GetDebugCommands for testing purposes. Since I added commandGroup for rawCommands, it failed to add the featuregates command as a 'get' subcommand. I just refined the code here; hopefully it fixes the test failure.

if group, ok := groupCommands[cmd.commandGroup]; ok {
currentCommand = append(currentCommand, group.Use)
currentCommand = append(currentCommand, cmd.cobraCommand.Use)
allCommands = append(allCommands, currentCommand)
Member

currentCommand should be declared in this block if it's only used in it.
But to be more generic and reuse more code, it could be similar to L132-L137:

var currentCommand []string
if group, ok := groupCommands[cmd.commandGroup]; ok {
        currentCommand = append(currentCommand, group.Use)
}
currentCommand = append(currentCommand, strings.Split(cmd.cobraCommand.Use, " ")[:1])
allCommands = append(allCommands, currentCommand)

Contributor Author

strings.Split(cmd.cobraCommand.Use, " ")[:1] is a string slice and can't be appended to another string slice directly. I'm not sure why cmd.cobraCommand.Use was split originally; I will check how to reuse the code. Thanks.

Contributor Author

It looks like strings.Split(cmd.cobraCommand.Use, " ")[:1] only converts cmd.cobraCommand.Use into a string slice so that it can be appended to allCommands, which is a two-dimensional slice. So I changed the code to be the same as L132-L137, which does the same thing once a currentCommand string slice is defined.

Member

@tnqn tnqn Jun 3, 2021

My mistake. It could use strings.Split(cmd.cobraCommand.Use, " ")[0]. The string is split because the "use" of other commands includes some arguments.
I guess the current commit will cause the supportbundle command to fail.
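The corrected form discussed in this thread can be shown in a small standalone sketch. The commandPath helper and the sample Use strings are hypothetical; only the strings.Split(..., " ")[0] expression comes from the review comments.

```go
package main

import (
	"fmt"
	"strings"
)

// commandPath (hypothetical helper) builds the argument list needed to invoke
// a command. groupUse is empty for top-level commands. use may contain
// argument placeholders after the command name, so only the first
// space-separated token is kept.
func commandPath(groupUse, use string) []string {
	var path []string
	if groupUse != "" {
		path = append(path, groupUse)
	}
	// [0] yields a single string that can be appended to a []string;
	// [:1] would yield a []string and fail to compile without "...".
	path = append(path, strings.Split(use, " ")[0])
	return path
}

func main() {
	var allCommands [][]string
	// The Use strings below are illustrative, not copied from antctl.
	allCommands = append(allCommands, commandPath("get", "featuregates"))
	allCommands = append(allCommands, commandPath("", "supportbundle [flags]"))
	fmt.Println(allCommands)
}
```

strings.Split always returns at least one element when the separator is non-empty, so indexing with [0] does not panic even for an empty Use string.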

@tnqn
Member

tnqn commented Jun 3, 2021

/test-all

@tnqn
Member

tnqn commented Jun 3, 2021

/test-networkpolicy

@tnqn
Member

tnqn commented Jun 3, 2021

/skip-e2e

It actually succeeded:

PASS
ok  	antrea.io/antrea/test/e2e	4820.880s
=== TEST SUCCESS !!! ===

@tnqn tnqn merged commit 79587c7 into antrea-io:main Jun 3, 2021
@luolanzone luolanzone deleted the featuregates branch June 4, 2021 09:37
wenqiq pushed a commit to wenqiq/antrea that referenced this pull request Jun 8, 2021
Successfully merging this pull request may close these issues.

A way to get list of FeatureGates that are enabled (set to True)
6 participants