
Use low-level controller and handlers in SetupWithManager #1315

Merged
merged 6 commits into from
Aug 2, 2021

Conversation

Jeffwan
Member

@Jeffwan Jeffwan commented Jul 30, 2021

Resolve #1312 #1314 #1316 #1317

  1. Since we use kubeflow/common to observe expectations, the current (kubebuilder v3) wiring is hard to make compatible with the kubeflow/common pattern, and the expectation calculation is not accurate. I decided to switch back to the kubebuilder v1 pattern and use a low-level controller in SetupWithManager. The rest of the logic and the programming model stay the same as kubebuilder v3.0, which remains compatible.

  2. Add validation back to reconciler.

  3. Set defaults in onOwnerCreateFunc

  4. Fix job status update issue.

  5. Make CI work in this dev branch
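For reference, the kubebuilder-v1-style wiring described in point 1 can be sketched as below. This is a sketch only: the `mxjobv1` import path and the reconciler type name are illustrative, and the controller-runtime API shown is the v0.x one current at the time of this PR.

```go
package controllers

import (
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/source"

	mxjobv1 "github.com/kubeflow/tf-operator/pkg/apis/mxnet/v1" // illustrative path
)

func (r *MXJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Build a low-level controller instead of using
	// ctrl.NewControllerManagedBy(mgr).For(...).Owns(...).
	c, err := controller.New("mxjob-controller", mgr, controller.Options{Reconciler: r})
	if err != nil {
		return err
	}
	// Watch the primary resource; predicates (e.g. onOwnerCreateFunc) can be
	// passed here to set defaults on Create events.
	if err = c.Watch(&source.Kind{Type: &mxjobv1.MXJob{}}, &handler.EnqueueRequestForObject{}); err != nil {
		return err
	}
	// Watch owned Pods and enqueue the owning MXJob, so the kubeflow/common
	// expectations can observe pod create/delete events directly.
	return c.Watch(&source.Kind{Type: &corev1.Pod{}}, &handler.EnqueueRequestForOwner{
		IsController: true,
		OwnerType:    &mxjobv1.MXJob{},
	})
}
```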

@Jeffwan
Member Author

Jeffwan commented Jul 30, 2021

/cc @kubeflow/wg-training-leads

@google-oss-robot google-oss-robot requested a review from a team July 30, 2021 03:04
Member

@gaocegege gaocegege left a comment


Thanks for your contribution! 🎉 👍

pkg/controller.v1/mxnet/mxjob_controller.go (outdated, resolved)
pkg/controller.v1/tensorflow/tfjob_controller.go (outdated, resolved)
@Jeffwan
Member Author

Jeffwan commented Jul 30, 2021

We enabled tests for this branch. The SDK tests failed, and the running job takes a long time to finish.

Note: the enabled test case runs against the original operator. We have not tested the new operator yet (that should come soon).


    _issue_warning_captured(deprecated.JUNIT_XML_DEFAULT_FAMILY, config.hook, 2)
/usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:80
  /usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:80: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`.
 To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type,
 use `np.float64` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.float: fmt.FloatFormatter,
/usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:86
  /usr/local/lib/python3.8/dist-packages/table_logger/table_logger.py:86: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
    np.int: fmt.IntegerFormatter,
-- Docs: https://docs.pytest.org/en/latest/warnings.html
- generated xml file: /mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/output/artifacts/junit_sdk-test.xml -
=========================== short test summary info ============================
FAILED sdk/python/test/test_e2e.py::test_sdk_e2e - RuntimeError: ("Timeout wa...
============= 1 failed, 7 passed, 3 warnings in 602.85s (0:10:02) ==============
Exception ignored in: <function ApiClient.__del__ at 0x7f4972ea8f70>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 80, in __del__
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 649, in close
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 362, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
INFO|2021-07-30T06:41:25|/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/src/kubeflow/tf-operator/py/kubeflow/tf_operator/tf_job_client.py|109| Job simple-tfjob-tests-v1 in namespace kubeflow; uid=171ed79f-1cb5-4680-9dd3-4eb790106bf3; conditions=[]
INFO|2021-07-30T06:41:55|/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/src/kubeflow/tf-operator/py/kubeflow/tf_operator/tf_job_client.py|109| Job simple-tfjob-tests-v1 in namespace kubeflow; uid=171ed79f-1cb5-4680-9dd3-4eb790106bf3; conditions=[]
INFO|2021-07-30T06:42:25|/mnt/test-data-volume/kubeflow-tf-operator-presubmit-v1-1315-4082e03-8016-e389/src/kubeflow/tf-operator/py/kubeflow/tf_operator/tf_job_client.py|109| Job simple-tfjob-tests-v1 in namespace kubeflow; uid=171ed79f-1cb5-4680-9dd3-4eb790106bf3; conditions=[]

@gaocegege
Member

/retest

@Jeffwan
Member Author

Jeffwan commented Jul 30, 2021

I will fix the test failure today

@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

It seems tf-operator was not successfully deployed in the cluster, which causes every TFJob submitted to the cluster to fail.
I will manually pull the image locally and test it.

Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  tf-job-operator-token-vkgd4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  tf-job-operator-token-vkgd4
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                                                   Message
  ----     ------     ----                  ----                                                   -------
  Normal   Scheduled  5m5s                  default-scheduler                                      Successfully assigned kubeflow/tf-job-operator-6d9579f79f-rbmjj to ip-192-168-33-201.us-west-2.compute.internal
  Normal   Pulling    5m4s                  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Pulling image "809251082950.dkr.ecr.us-west-2.amazonaws.com/tf-operator:b8d5d3cd6e824865de196388c2c2a4e852b709e7"
  Normal   Pulled     5m2s                  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Successfully pulled image "809251082950.dkr.ecr.us-west-2.amazonaws.com/tf-operator:b8d5d3cd6e824865de196388c2c2a4e852b709e7"
  Normal   Created    3m34s (x5 over 5m2s)  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Created container tf-job-operator
  Normal   Started    3m34s (x5 over 5m2s)  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Started container tf-job-operator
  Normal   Pulled     3m34s (x4 over 5m1s)  kubelet, ip-192-168-33-201.us-west-2.compute.internal  Container image "809251082950.dkr.ecr.us-west-2.amazonaws.com/tf-operator:b8d5d3cd6e824865de196388c2c2a4e852b709e7" already present on machine
  Warning  BackOff    3m21s (x9 over 5m)    kubelet, ip-192-168-33-201.us-west-2.compute.internal  Back-off restarting failed container

@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

em.. Our CI images are actually private, so I had to rebuild the image myself.

/opt/tf-operator.v1 flag redefined: kubeconfig
panic: /opt/tf-operator.v1 flag redefined: kubeconfig

goroutine 1 [running]:
flag.(*FlagSet).Var(0xc0000ca120, 0x1bd0d40, 0xc0001ce540, 0x195df71, 0xa, 0x197271c, 0x1b)
	/usr/local/go/src/flag/flag.go:848 +0x4ae
flag.(*FlagSet).StringVar(...)
	/usr/local/go/src/flag/flag.go:751
github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app/options.(*ServerOption).AddFlags(0xc0001ce540, 0xc0000ca120)
	/go/src/github.com/kubeflow/tf-operator/cmd/tf-operator.v1/app/options/options.go:54 +0x8b
main.main()
	/go/src/github.com/kubeflow/tf-operator/cmd/tf-operator.v1/main.go:54 +0x50

This is because we import controller-runtime, which defines a kubeconfig flag as well. This seems like a breaking change:
kubernetes-sigs/controller-runtime#878 (comment)

The original binary should not pick up these new changes. Let me see if we can exclude those tools when building it.

CI reports a `flag redefined: kubeconfig` panic, which is caused by duplicate flag registration. See kubeflow#1316 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

It looks much better now. I will fix the rest of the failures.

This fixes an issue in the original controller. It leverages init() to register defaults into the scheme; in our universal operator project we cleaned up register.go, which broke that case.

For more details, please check kubeflow#1317 (comment)

Signed-off-by: Jiaxin Shan <[email protected]>
@Jeffwan
Member Author

Jeffwan commented Jul 31, 2021

/cc @gaocegege test failures are fixed. Please have another look

@Jeffwan Jeffwan requested a review from zw0610 July 31, 2021 21:15
return nil
}
// using onOwnerCreateFunc is easier to set defaults
if err = c.Watch(&source.Kind{Type: &mxjobv1.MXJob{}}, &handler.EnqueueRequestForObject{},
Member


Any reason we prefer using Watch instead of Owns?

Member Author


Watch is the low-level function provided by the builder; Owns and For can be replaced by Watch with the appropriate enqueue handlers.

We want to leverage onOwnerCreateFunc to set defaults instead of the kubebuilder way (a webhook). I noticed that Owns cannot take a pluggable handler.

If there's a way to do that, let me know.
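To illustrate the pluggable-handler point: Watch accepts predicates, so a CreateFunc can mutate the object to set defaults before it is enqueued, which Owns does not expose. A sketch against the controller-runtime v0.x API of the time; the `mxjobv1` import path is illustrative.

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	mxjobv1 "github.com/kubeflow/tf-operator/pkg/apis/mxnet/v1" // illustrative path
)

// onOwnerCreateFunc sets defaults on newly created MXJobs, replacing the
// kubebuilder-style mutating webhook.
func onOwnerCreateFunc() predicate.Funcs {
	return predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			mxjob, ok := e.Object.(*mxjobv1.MXJob)
			if !ok {
				return true
			}
			mxjobv1.SetDefaults_MXJob(mxjob) // defaulting happens here, not in a webhook
			return true                      // always enqueue the object
		},
	}
}
```

The predicate is then passed as the third argument to Watch: `c.Watch(&source.Kind{Type: &mxjobv1.MXJob{}}, &handler.EnqueueRequestForObject{}, onOwnerCreateFunc())`.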

Member


I see. It makes sense.

@Jeffwan Jeffwan requested a review from gaocegege August 2, 2021 03:37
Member

@gaocegege gaocegege left a comment


Generally LGTM!

@@ -51,7 +51,7 @@ func NewServerOption() *ServerOption {

// AddFlags adds flags for a specific CMServer to the specified FlagSet.
func (s *ServerOption) AddFlags(fs *flag.FlagSet) {
fs.StringVar(&s.Kubeconfig, "kubeconfig", "", "The path of kubeconfig file")
//fs.StringVar(&s.Kubeconfig, "kubeconfig", "", "The path of kubeconfig file")
Member


is it temporary?

Member Author


Yeah, I cannot find a way to work around it. Users can still use the KUBECONFIG environment variable to specify the path of the config file. This is the only user-facing change for tf controller.v1 users.

The next step is to build the universal operator to replace the tf controller and run it against all test cases. Once the conformance tests pass, we will remove tf controller.v1.

@@ -448,17 +456,12 @@ func onOwnerCreateFunc() func(event.CreateEvent) bool {
}

// TODO: check default setting
Member


Can we remove the TODO?

Member Author

@Jeffwan Jeffwan Aug 2, 2021


Let me double-check this comment. Let's keep it for now.

I remember the reason I put a TODO there: we use mxjobv1.SetDefaults_MXJob(mxjob) to set defaults, but we could use scheme.Scheme.Default(mxjob). They are actually equivalent, and I would prefer to change to scheme.Scheme.Default(mxjob) in the near future.

Currently we have not registered the defaulters in the scheme, so scheme.Scheme.Default(mxjob) does nothing there.

The description "check default setting" is not accurate; I should change it to something more descriptive later.
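The difference between the two defaulting paths can be mimicked with a tiny registry. This is a self-contained mock: the real code uses runtime.Scheme from k8s.io/apimachinery, and the MXJob struct and "Running" default here are stand-ins.

```go
package main

import "fmt"

// MXJob is a stand-in for the real API type.
type MXJob struct {
	CleanPodPolicy string
}

// defaulters mimics the scheme's registered defaulting funcs
// (cf. scheme.AddTypeDefaultingFunc in apimachinery).
var defaulters = map[string]func(*MXJob){}

func RegisterDefaulter(kind string, fn func(*MXJob)) { defaulters[kind] = fn }

// Default mimics scheme.Scheme.Default: it applies a defaulting func only
// if one was registered -- otherwise it is a silent no-op, which is the
// situation described in the comment above.
func Default(kind string, job *MXJob) {
	if fn, ok := defaulters[kind]; ok {
		fn(job)
	}
}

func main() {
	job := &MXJob{}
	Default("MXJob", job) // nothing registered yet: no-op
	fmt.Println("after unregistered Default:", job.CleanPodPolicy == "")

	// Registering the equivalent of SetDefaults_MXJob makes Default effective.
	RegisterDefaulter("MXJob", func(j *MXJob) {
		if j.CleanPodPolicy == "" {
			j.CleanPodPolicy = "Running"
		}
	})
	Default("MXJob", job)
	fmt.Println("after registered Default:", job.CleanPodPolicy)
}
```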

@gaocegege
Member

/lgtm

@Jeffwan
Member Author

Jeffwan commented Aug 2, 2021

/approve

@google-oss-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-robot google-oss-robot merged commit 43d950d into kubeflow:all-in-one-operator Aug 2, 2021
@Jeffwan
Member Author

Jeffwan commented Aug 2, 2021

Resolve #1299

@Jeffwan Jeffwan deleted the make_it_work branch August 2, 2021 17:25
Jeffwan added a commit that referenced this pull request Aug 5, 2021
* Use low-level controller and handlers in SetupWithManager

Signed-off-by: Jiaxin Shan <[email protected]>

* Add job validation in reconcile loop

Signed-off-by: Jiaxin Shan <[email protected]>

* Set defaults in onOwnerCreateFunc

Signed-off-by: Jiaxin Shan <[email protected]>

* Correctly update job status in apiserver

Signed-off-by: Jiaxin Shan <[email protected]>

* Remove Flag for Kubeconfig to fix flag redefined

CI reports `flag redefined: kubeconfig` issue and this is due to duplicate flag registration. See #1316 for more details.

Signed-off-by: Jiaxin Shan <[email protected]>

* Fix tensorflow job missing default port issue

This fixes an issue in the original controller. It leverages init() to register defaults into the scheme; in our universal operator project we cleaned up register.go, which broke that case.

For more details, please check #1317 (comment)

Signed-off-by: Jiaxin Shan <[email protected]>