
Use low-level controller and handlers in SetupWithManager #1315

Merged
merged 6 commits into from
Aug 2, 2021

Conversation

Jeffwan
Member

@Jeffwan Jeffwan commented Jul 30, 2021

Resolve #1312 #1314 #1316 #1317

  1. Since we use kubeflow/common to observe expectations, the current (kubebuilder v3) wiring is hard to make compatible with the kubeflow/common pattern, and the expectation calculation is not accurate. I decided to switch back to the kubebuilder v1 pattern and use a low-level controller in SetupWithManager. The rest of the logic and the programming model stay the same as kubebuilder v3.0, which remains compatible.

  2. Add validation back to reconciler.

  3. Set defaults in onOwnerCreateFunc

  4. Fix job status update issue.

  5. Make CI work in this dev branch
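For reference, the kubebuilder-v1-style wiring described in point 1 can be sketched as below. This is a sketch only: the `mxjobv1` import path and the reconciler type name are illustrative, and the controller-runtime API shown is the v0.x one current at the time of this PR.

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	mxjobv1 "github.com/kubeflow/tf-operator/pkg/apis/mxnet/v1" // illustrative path
)

func (r *MXJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Build a low-level controller instead of using
	// ctrl.NewControllerManagedBy(mgr).For(...).Owns(...).
	c, err := controller.New("mxjob-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	// Watch the primary resource; predicates (e.g. onOwnerCreateFunc) can be
	// passed here to set defaults on Create events.
	if err = c.Watch(&source.Kind{Type: &mxjobv1.MXJob{}}, &handler.EnqueueRequestForObject{}); err != nil {
		return err
	}
	// Watch owned Pods and enqueue the owning MXJob, so the kubeflow/common
	// expectations can observe pod create/delete events directly.
	return c.Watch(&source.Kind{Type: &corev1.Pod{}}, &handler.EnqueueRequestForOwner{
		IsController: true,
		OwnerType:    &mxjobv1.MXJob{},
	})
}
```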

@Jeffwan
Member Author

Jeffwan commented Jul 30, 2021

/cc @kubeflow/wg-training-leads

@google-oss-robot google-oss-robot requested a review from a team July 30, 2021 03:04
Member

@gaocegege gaocegege left a comment


Thanks for your contribution! 🎉 👍

pkg/controller.v1/mxnet/mxjob_controller.go (outdated, resolved)
pkg/controller.v1/tensorflow/tfjob_controller.go (outdated, resolved)
@Jeffwan
Member Author

Jeffwan commented Jul 30, 2021

We enabled tests for this branch. The SDK tests failed, and the running job takes a long time to finish.

Note: the enabled test case runs against the original operator. We have not tested the new operator yet (that should come soon).


    _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)
/usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:80
  /usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:80: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`.
 To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type,
 use `np.float64` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.float: fmt.FloatFormatter,
/usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:86
  /usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:86: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.int: fmt.IntegerFormatter,
-- Docs: https://docs.pytest.org/en/latest/warnings.html
- generated xml file: /mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/output/artifacts/junit_sdk-test.xml -
=========================== short test summary info ============================
FAILED sdk/python/test/test_e2e.py::test_sdk_e2e - RuntimeError: ("Timeout wa...
============= 1 failed, 7 passed, 3 warnings in 602.85s (0:10:02) ==============
Exception ignored in: <function ApiClient.__del__ at 0x7f4972ea8f70>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 80, in __del__
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 649, in close
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 362, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
INFO|2021-07-30T06:41:25|/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/src/kubeflow/tf-operator/py/kubeflow/tf_operator/tf_job_client.py|109| Job simple-tfjob-tests-v1 in namespace kubeflow; uid=171ed79f-1cb5-4680-9dd3-4eb790106bf3; conditions=[]
INFO|2021-07-30T06:41:55|/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/src/kubeflow/tf-operator/py/kubeflow/tf_operator/tf_job_client.py|109| Job simple-tfjob-tests-v1 in namespace kubeflow; uid=171ed79f-1cb5-4680-9dd3-4eb790106bf3; conditions=[]
INFO|2021-07-30T06:42:25|/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/src/kubeflow/tf-operator/py/kubeflow/tf_operator/tf_job_client.py|109| Job simple-tfjob-tests-v1 in namespace kubeflow; uid=171ed79f-1cb5-4680-9dd3-4eb790106bf3; conditions=[]

@gaocegege
Member

/retest

@Jeffwan
Member Author

Jeffwan commented Jul 30, 2021

I will fix the test failure today

@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

It seems tf-operator was not successfully deployed in the cluster, which causes every TFJob submitted to the cluster to fail.
I will manually pull the image locally and test it.

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tf-job-operator-token-vkgd4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tf-job-operator-token-vkgd4
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  5m5s                  default-scheduler                                      Successfully assigned kubeflow/tf-job-operator-6d9579f79f-rbmjj to ip-192-168-33-201.us-west-2.compute.internal
  Normal   Pulling    5m4s                  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Pulling image "809251082950.dkr.ecr.us-west-2.amazonaws.com/tf-operator:b8d5d3cd6e824865de196388c2c2a4e852b709e7"
  Normal   Pulled     5m2s                  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Successfully pulled image "809251082950.dkr.ecr.us-west-2.amazonaws.com/tf-operator:b8d5d3cd6e824865de196388c2c2a4e852b709e7"
  Normal   Created    3m34s (x5 over 5m2s)  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Created container tf-job-operator
  Normal   Started    3m34s (x5 over 5m2s)  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Started container tf-job-operator
  Normal   Pulled     3m34s (x4 over 5m1s)  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Container image "809251082950.dkr.ecr.us-west-2.amazonaws.com/tf-operator:b8d5d3cd6e824865de196388c2c2a4e852b709e7" already present on machine
  Warning  BackOff    3m21s (x9 over 5m)    kubelet, ip-192-168-33-201.us-west-2.compute.internal  Back-off restarting failed container

@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

em.. Our CI images are actually private, so I had to rebuild the image myself.

/opt/tf-operator.v1 flag redefined: kubeconfig
panic: /opt/tf-operator.v1 flag redefined: kubeconfig

goroutine 1 [running]:
flag.(*FlagSet).Var(0xc0000ca120, 0x1bd0d40, 0xc0001ce540, 0x195df71, 0xa, 0x197271c, 0x1b)
	/usr/local/go/src/flag/flag.go:848 +0x4ae
flag.(*FlagSet).StringVar(...)
	/usr/local/go/src/flag/flag.go:751
github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app/options.(*ServerOption).AddFlags(0xc0001ce540, 0xc0000ca120)
	/go/src/github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app/options/options.go:54 +0x8b
main.main()
	/go/src/github.com/kubeflow/tf-operator/cmd/tf-operator.v1/main.go:54 +0x50

This is because we import controller-runtime, which defines a kubeconfig flag as well. This seems like a breaking change:
kubernetes-sigs/controller-runtime#878 (comment)

The original binary should not pick up these new changes. Let me see if we can exclude those tools when building it.

CI reports a `flag redefined: kubeconfig` panic, which is caused by duplicate flag registration. See kubeflow#1316 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

It looks much better now. I will fix the rest of the failures.

This fixes an issue in the original controller. It leverages init() to register defaults into the scheme; in our universal operator project we cleaned up register.go, which broke that case.

For more details, please check kubeflow#1317 (comment)

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

/cc @gaocegege test failures are fixed. Please have another look

@Jeffwan Jeffwan requested a review from zw0610 July 31, 2021 21:15
return nil
}
// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(&source.Kind{Type: &mxjobv1.MXJob{}}, &handler.EnqueueRequestForObject{},
Member


Any reason we prefer using Watch instead of Owns?

Member Author


Watch is the low-level function provided by the builder; Owns and For can be replaced by Watch with the appropriate enqueue handlers.

We want to leverage onOwnerCreateFunc to set defaults instead of the kubebuilder way (a webhook). I noticed that Owns cannot take a pluggable handler.

If there's a way to do that, let me know.
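To illustrate the pluggable-handler point: Watch accepts predicates, so a CreateFunc can mutate the object to set defaults before it is enqueued, which Owns does not expose. A sketch against the controller-runtime v0.x API of the time; the `mxjobv1` import path is illustrative.

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	mxjobv1 "github.com/kubeflow/tf-operator/pkg/apis/mxnet/v1" // illustrative path
)

// onOwnerCreateFunc sets defaults on newly created MXJobs, replacing the
// kubebuilder-style mutating webhook.
func onOwnerCreateFunc() predicate.Funcs {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			mxjob, ok := e.Object.(*mxjobv1.MXJob)
			if !ok {
				return true
			}
			mxjobv1.SetDefaults_MXJob(mxjob) // defaulting happens here, not in a webhook
			return true                      // always enqueue the object
		},
	}
}
```

The predicate is then passed as the third argument to Watch: `c.Watch(&source.Kind{Type: &mxjobv1.MXJob{}}, &handler.EnqueueRequestForObject{}, onOwnerCreateFunc())`.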

Member


I see. It makes sense.

@Jeffwan Jeffwan requested a review from gaocegege August 2, 2021 03:37
Member

@gaocegege gaocegege left a comment


Generally LGTM!

@@ -51,7 +51,7 @@ func NewServerOption() *ServerOption {

// AddFlags adds flags for a specific CMServer to the specified FlagSet.
func (s *ServerOption) AddFlags(fs *flag.FlagSet) {
fs.StringVar(&s.Kubeconfig, "kubeconfig", "", "The path of kubeconfig file")
//fs.StringVar(&s.Kubeconfig, "kubeconfig", "", "The path of kubeconfig file")
Member


is it temporary?

Member Author


Yeah, I cannot find a way to work around it. Users can still use the KUBECONFIG environment variable to specify the path of the config file. This is the only user-facing change for tf controller.v1 users.

The next step is to build the universal operator to replace the tf controller and run it against all test cases. Once the conformance tests pass, we will remove tf controller.v1.

@@ -448,17 +456,12 @@ func onOwnerCreateFunc() func(event.CreateEvent) bool {
}

// TODO: check default setting
Member


Can we remove the TODO?

Member Author

@Jeffwan Jeffwan Aug 2, 2021


Let me double-check this comment. Let's keep it for now.

I remember the reason I put a TODO there: we use mxjobv1.SetDefaults_MXJob(mxjob) to set defaults, but we could use scheme.Scheme.Default(mxjob). They are actually equivalent, and I would prefer to change to scheme.Scheme.Default(mxjob) in the near future.

Currently we have not registered the defaulters in the scheme, so scheme.Scheme.Default(mxjob) does nothing there.

The description "check default setting" is not accurate; I should change it to something more descriptive later.
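The difference between the two defaulting paths can be mimicked with a tiny registry. This is a self-contained mock: the real code uses runtime.Scheme from k8s.io/apimachinery, and the MXJob struct and "Running" default here are stand-ins.

```go
package main

import "fmt"

// MXJob is a stand-in for the real API type.
type MXJob struct {
	CleanPodPolicy string
}

// defaulters mimics the scheme's registered defaulting funcs
// (cf. scheme.AddTypeDefaultingFunc in apimachinery).
var defaulters = map[string]func(*MXJob){}

func RegisterDefaulter(kind string, fn func(*MXJob)) { defaulters[kind] = fn }

// Default mimics scheme.Scheme.Default: it applies a defaulting func only
// if one was registered -- otherwise it is a silent no-op, which is the
// situation described in the comment above.
func Default(kind string, job *MXJob) {
	if fn, ok := defaulters[kind]; ok {
		fn(job)
	}
}

func main() {
	job := &MXJob{}
	Default("MXJob", job) // nothing registered yet: no-op
	fmt.Println("after unregistered Default:", job.CleanPodPolicy == "")

	// Registering the equivalent of SetDefaults_MXJob makes Default effective.
	RegisterDefaulter("MXJob", func(j *MXJob) {
		if j.CleanPodPolicy == "" {
			j.CleanPodPolicy = "Running"
		}
	})
	Default("MXJob", job)
	fmt.Println("after registered Default:", job.CleanPodPolicy)
}
```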

@gaocegege
Member

/lgtm

@Jeffwan
Member Author

Jeffwan commented Aug 2, 2021

/approve

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot google-oss-robot merged commit 43d950d into kubeflow:all-in-one-operator Aug 2, 2021
@Jeffwan
Member Author

Jeffwan commented Aug 2, 2021

Resolve #1299

@Jeffwan Jeffwan deleted the make_it_work branch August 2, 2021 17:25
Jeffwan added a commit that referenced this pull request Aug 5, 2021
* Use low-level controller and handlers in SetupWithManager

Signed-off-by: Jiaxin Shan <[email protected]>

* Add job validation in reconcile loop

Signed-off-by: Jiaxin Shan <[email protected]>

* Set defaults in onOwnerCreateFunc

Signed-off-by: Jiaxin Shan <[email protected]>

* Correctly update job status in apiserver

Signed-off-by: Jiaxin Shan <[email protected]>

* Remove Flag for Kubeconfig to fix flag redefined

CI reports `flag redefined: kubeconfig` issue and this is due to duplicate flag registration. See #1316 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>

* Fix tensorflow job missing default port issue

This fixes an issue in the original controller. It leverages init() to register defaults into the scheme; in our universal operator project we cleaned up register.go, which broke that case.

For more details, please check #1317 (comment)

Signed-off-by: Jiaxin Shan <[email protected]>