
submit onboarding task #393

Open · wants to merge 2 commits into main

Conversation

@YQiu-oo commented Oct 2, 2024

Hi all!
Here is my onboarding task result. You can find testrun.rar on Google Drive: https://drive.google.com/file/d/1GDbRX6s0zrnv_1KY-dZEIJqkZFh_Wl71/view?usp=drive_link.

The alarm explanations are in the summary.md file.

Thanks,
Yukang Qiu

@TZ-zzz (Member) commented Oct 6, 2024

Hi @YQiu-oo, thank you for submitting the onboarding task.

Could you double-check the crd.yaml and operator.yaml you provided to Acto (the ones specified in config.json)? It seems the operator isn't being deployed correctly during the tests, as it consistently reports the following error:

E1002 05:03:19.021840       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1alpha1.TidbDashboard: failed to list *v1alpha1.TidbDashboard: the server could not find the requested resource (get tidbdashboards.pingcap.com)

One way to ensure the crd.yaml and the operator.yaml are correct would be to deploy the operator manually with the files you provided before running Acto.
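
If it helps, here is a minimal Go sketch (just an illustration, not part of the task) that asks the API server whether the tidbdashboards.pingcap.com resource from the error message is actually being served. The group/version come from the error above; using the default kubeconfig path is an assumption:

```go
// Sketch: check whether the TidbDashboard CRD is served by the API server.
// Assumes the default kubeconfig (~/.kube/config) points at the test cluster.
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	// The reflector error is about "get tidbdashboards.pingcap.com", so ask
	// whether that group/version is registered and lists the resource.
	resources, err := dc.ServerResourcesForGroupVersion("pingcap.com/v1alpha1")
	if err != nil {
		log.Fatalf("pingcap.com/v1alpha1 is not served; crd.yaml is likely missing CRDs: %v", err)
	}
	for _, r := range resources.APIResources {
		if r.Name == "tidbdashboards" {
			fmt.Println("tidbdashboards is installed and served")
			return
		}
	}
	fmt.Println("tidbdashboards is missing from pingcap.com/v1alpha1; double-check crd.yaml")
}
```

If this reports the resource as missing, the crd.yaml passed to Acto is probably incomplete.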

@YQiu-oo (Author) commented Oct 7, 2024

Hi, @TZ-zzz! I am working on the issue, but my computer now consistently crashes at the "deploy operator" stage (it previously worked with the same files) and fails to proceed further. I am not sure whether TidbDashboard matters to tidb-operator or to Acto. Will its absence prevent Acto from proceeding, or does Acto need this resource?

@TZ-zzz (Member) commented Oct 7, 2024

@YQiu-oo, yes, this issue is critical: the operator was not working at all, so all the tests Acto ran were essentially no-ops. If you check the operator logs in the testrun directory, you can see that the operator was stuck at the deploying stage and was not taking any actions.

@YQiu-oo (Author) commented Oct 8, 2024

@TZ-zzz, OK, I see, but shouldn't the operator log only be generated after the operator is deployed? In my case, I don't see any operator log while it is stuck at the deploying stage.

@TZ-zzz (Member) commented Oct 8, 2024

@YQiu-oo, you can find the operator logs of each test case in the testrun_.../trial... folders.

@YQiu-oo (Author) commented Oct 10, 2024

@TZ-zzz tidb-operator behaves strangely on my machine, so I switched to MongoDB. Could you take a look at my commit? Here is the link to testrun.rar: https://drive.google.com/file/d/1ZKEs3y4an0kbzpbNdFdRGr62F6JDtl_h/view?usp=sharing

@TZ-zzz (Member) commented Oct 10, 2024

@YQiu-oo, I think the first alarm is due to the previous changes not having been reconciled. The operator seems to be stuck waiting for the pods to be reconciled, so the bug likely originates in an earlier phase, even though it manifests in the step that Acto reports. It might be worth investigating why the previous state hasn't converged.

By the way, spec.statefulSet.meta is mapped to the Kubernetes core StatefulSet resource, so changes to that metadata should trigger a change to the StatefulSet. Labels and annotations are really important, as many operators and controllers manage Kubernetes resources based on this metadata.
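
For example (just a sketch; the namespace and StatefulSet name below are placeholders, not the actual ones from your test run), you can read the StatefulSet back after a step and check whether the labels and annotations from spec.statefulSet.meta actually propagated:

```go
// Sketch: verify that metadata changes on the CR propagated to the StatefulSet.
// "default" and "example-mongodb" are placeholder namespace/name values.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	sts, err := client.AppsV1().StatefulSets("default").Get(context.TODO(), "example-mongodb", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	// If the change was reconciled, the new labels/annotations show up here,
	// and controllers selecting on these labels will see the new values.
	fmt.Println("labels:", sts.Labels)
	fmt.Println("annotations:", sts.Annotations)
}
```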

For the second alarm, it's a misoperation rather than a false alarm. The previous step generated some invalid values that left MongoDB's internal state unready, which in turn prevented the operator from acting on the metadata.labels changes.

@TZ-zzz (Member) commented Oct 13, 2024

@YQiu-oo, could you investigate the first alarm again? The root cause is still not entirely clear.

@YQiu-oo (Author) commented Oct 13, 2024

@TZ-zzz I found that the Agent in the Pod repeatedly failed to reach its goal state, which prevented the MongoDB replica set from becoming ready. I then checked the earlier operator logs about the configuration: the configuration was missing the required passwordSecretName, which is necessary for SCRAM authentication. I suspect this mismatched configuration is why the Agent never reaches its goal state (i.e., why the previous changes haven't been reconciled).
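
As a quick sanity check (the namespace and Secret name here are placeholders, not the actual values from the test run), something like this sketch can confirm whether the password Secret that the user's passwordSecretRef points at exists at all:

```go
// Sketch: check that the password Secret needed for SCRAM authentication exists.
// "default" and "example-user-password" are placeholder values.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}
	secret, err := client.CoreV1().Secrets("default").Get(context.TODO(), "example-user-password", metav1.GetOptions{})
	if err != nil {
		// If the Secret is missing, the operator cannot configure SCRAM and the
		// agents never reach the new goal state.
		log.Fatalf("password secret not found: %v", err)
	}
	fmt.Printf("found password secret with %d keys\n", len(secret.Data))
}
```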

@tylergu (Member) commented Oct 13, 2024

@YQiu-oo Nice observation! Are you able to pinpoint the root cause in the mongodb operator which causes it to not reconcile after the system got into an error state?

@YQiu-oo (Author) commented Oct 14, 2024

@tylergu I double-checked the mongodb operator repo. If the agent's last achieved config version is not equal to targetConfigVersion, the operator reports an agent issue and logs The Agent in the Pod '%s' hasn't reached the goal state yet (goal: %d, agent: %s) (https://github.com/mongodb/mongodb-kubernetes-operator/blob/c83d4d487e36c835f022092d516ce622321172b0/pkg/agent/agent_readiness.go#L110).
GetAllDesiredMembersAndArbitersPodState (https://github.com/mongodb/mongodb-kubernetes-operator/blob/c83d4d487e36c835f022092d516ce622321172b0/pkg/agent/agent_readiness.go#L67) is the function that checks and returns the state of all desired pods in the replica set. Since the goal state is never reached, the replica set is not ready, so the operator reconciles again; the replica set is still not ready, so it reconciles again, and so on until it stops.
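
For reference, here is a simplified, self-contained sketch of that check; it is not the operator's actual code, just the shape of the logic that keeps the reconcile loop requeueing:

```go
// Sketch of the goal-state check: field and function names are illustrative,
// not the operator's real types.
package main

import "fmt"

// agentStatus stands in for the agent health status the operator reads from each pod.
type agentStatus struct {
	lastGoalVersionAchieved int64
}

// podReachedGoalState reports whether the agent has caught up with the target
// automation config version; if not, the pod has not reached its goal state.
func podReachedGoalState(status agentStatus, targetConfigVersion int64) bool {
	if status.lastGoalVersionAchieved != targetConfigVersion {
		fmt.Printf("The Agent in the Pod hasn't reached the goal state yet (goal: %d, agent: %d)\n",
			targetConfigVersion, status.lastGoalVersionAchieved)
		return false
	}
	return true
}

func main() {
	// With an invalid config (e.g. the missing passwordSecretName), the agent
	// never reports the new goal version, so every reconcile sees "not ready"
	// and requeues, which is the loop described above.
	for attempt := 1; attempt <= 3; attempt++ {
		ready := podReachedGoalState(agentStatus{lastGoalVersionAchieved: 2}, 3)
		fmt.Printf("reconcile attempt %d: replica set ready = %v\n", attempt, ready)
	}
}
```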
