Add Spark Job #1467
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here.
What to do if you already signed the CLA: Individual signers / Corporate signers
kubeflow/spark/parts.yaml
Outdated
"name": "spark", | ||
"apiVersion": "0.0.1", | ||
"kind": "ksonnet.io/parts", | ||
"description": "An empty package used as a stub for new packages.\n", |
"Holden's awesome Spark Job prototype\n"
<3 :D
/ok-to-test @holdenk did you autoformat the *sonnet files yet? Guess we'll find out...
winner, winner... Please use
Just a few nits from me. Question: these are for the Spark workers; is a controller cluster also in the plans? Awesome work :)
kubeflow/spark/all.libsonnet
Outdated
},
labels: {
"app.kubernetes.io/name": name + "-sparkoperator",
"app.kubernetes.io/version": "v2.3.1-v1alpha1",
Should this be a parameter with a default version?
Yes, good catch.
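For illustration, a minimal sketch of what the parameterized label could look like (the param name and default follow the sparkVersion change mentioned later in this PR; the exact structure inside all.libsonnet is an assumption):

```jsonnet
// Hypothetical sketch: take the operator version from a param with a default,
// instead of hard-coding "v2.3.1-v1alpha1" in the version label.
{
  parts(name, sparkVersion="v2.3.1-v1alpha1"):: {
    labels: {
      "app.kubernetes.io/name": name + "-sparkoperator",
      "app.kubernetes.io/version": sparkVersion,
    },
  },
}
```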
kubeflow/spark/all.libsonnet
Outdated
{
kind: "ServiceAccount",
name: name + "-spark",
namespace: "default",
This should be a param.
thanks for catching.
CLAs look good, thanks!
Hey @texasmichelle & @kunmingg I'm wondering about the best way to test this operator - looking at the
Ok so I can point it at this branch. Is there a reason this isn't auto pointed at the branch in PR builds (seems like it could give us bad results on PRs in general)? Unless I'm missing something, which is quite possible since it's my first pass through this code.
…On Fri, Oct 19, 2018, 3:43 PM Kunming Qu wrote:
@holdenk
1. You can edit a temporary logic change in test_deploy.py, like changing the version to use.
2. The presubmit test will then test under your new logic.
3. When the test looks good you can revert the change from 1. and merge this PR.
@holdenk
"nodes", | ||
], | ||
verbs: [ | ||
"get", |
Good girl...
While Nodes are technically read-only this is still decent practice as the object is so weird in general. Do we need specific Node information? If so, what?
I am wondering if this is why we are using ClusterRole instead of a Role.
Just a nit/question - non-blocking because IDGAF
So ClusterRole versus Role is now user-configurable: if folks don't need to run jobs outside of the namespace where they created the operator we'll just do a Role, but if they want to have the operator and jobs sit in different namespaces we use a ClusterRole.
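Roughly the idea, as a hedged sketch (the deploymentScope param name and the rule contents are assumptions, and as discussed further down this knob was eventually dropped in favour of always using a ClusterRole):

```jsonnet
// Sketch: choose Role vs ClusterRole from a hypothetical deploymentScope param.
{
  operatorRole(name, namespace, deploymentScope="cluster"):: {
    kind: if deploymentScope == "cluster" then "ClusterRole" else "Role",
    apiVersion: "rbac.authorization.k8s.io/v1beta1",
    metadata: {
      name: name + "-sparkoperator",
      // Only a namespaced Role carries a namespace; a ClusterRole is cluster-scoped.
      [if deploymentScope != "cluster" then "namespace"]: namespace,
    },
    // Illustrative rule only; the real rules come from the upstream manifest.
    rules: [
      {
        apiGroups: ["sparkoperator.k8s.io"],
        resources: ["sparkapplications"],
        verbs: ["*"],
      },
    ],
  },
}
```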
kubeflow/spark/all.libsonnet
Outdated
},
},
operatorClusterRole:: {
kind: "ClusterRole",
Why ClusterRole instead of Role? Looks like we are binding on a single namespace only, and opening up the broader permissions might be unnecessary?
https://kubernetes.io/docs/reference/access-authn-authz/rbac/#role-and-clusterrole
Not super sure, since the RBACs are based on the ones from the spark-operator project from the GCP folks. My gut is that if we wanted to support having jobs in different namespaces we'd need the operator to have a ClusterRole, but I'm not super sure. The RBAC file for the spark-operator is at https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/8c7fdbb306dfd656093c1b2a4ede901d651c9bd5/manifest/spark-operator-rbac.yaml; I could try to scope it down though and see if it still works?
Ah, I didn't realize it was a port from the spark-operator project. I can ask there. It's not necessarily a concern, just more of wondering why we needed the broader scope. If we are spanning namespaces that makes sense. No change needed, thanks for clarifying.
/retest
/retest
Looks like the spark operator is applying successfully now, so it should be ready for review again. cc @jlewi @texasmichelle
Reviewed 1 of 7 files at r1, 1 of 5 files at r4, 1 of 5 files at r10, 1 of 7 files at r11, 8 of 8 files at r13.
Reviewable status: all files reviewed, 17 unresolved discussions (waiting on @gaocegege, @holdenk, @inc0, @jlewi, @kris-nova, and @pdmack)
kubeflow/spark/all.libsonnet, line 101 at r4 (raw file):
Previously, kris-nova (Kris Nova) wrote…
Ah, I didn't realize it was a port from the spark-operator project. I can ask there. It's not necessarily a concern, just more of wondering why we needed the broader scope. If we are spanning namespaces that makes sense. No change needed, thanks for clarifying.
I think the pattern we want is to install the operator in one namespace, e.g. "kubeflow-system", and users will use a different namespace.
So I do think we need a ClusterRole because the operator will want to claim jobs in other namespaces.
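A hedged sketch of what that cluster-wide grant looks like (the names are assumptions; the actual rules mirror the upstream spark-operator RBAC manifest linked earlier):

```jsonnet
// Sketch: bind the operator's ServiceAccount (living in one namespace) to a
// ClusterRole so it can claim SparkApplications created in any namespace.
{
  operatorClusterRoleBinding(name, operatorNamespace):: {
    apiVersion: "rbac.authorization.k8s.io/v1beta1",
    kind: "ClusterRoleBinding",
    metadata: { name: name + "-sparkoperator" },
    subjects: [
      {
        kind: "ServiceAccount",
        name: name + "-sparkoperator",
        namespace: operatorNamespace,  // e.g. "kubeflow-system"
      },
    ],
    roleRef: {
      apiGroup: "rbac.authorization.k8s.io",
      kind: "ClusterRole",  // cluster-scoped, so jobs in other namespaces are visible
      name: name + "-sparkoperator",
    },
  },
}
```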
kubeflow/spark/all.libsonnet, line 26 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
Chatted with @texasmichelle about this and the weird behaviour I had seen back with the minikube tests, and it makes sense now, so I'll keep it as-is.
Getting the namespace from params is mostly legacy. There was a time when ksonnet didn't support getting the namespace from the environment, so as a workaround we got the namespace from params.
The current pattern is to always get the namespace from the environment, and if users want to deploy in a specific namespace they should create a new environment.
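As a sketch of that pattern (the component wiring and the all.parts signature here are hypothetical; the point is reading env.namespace rather than a params.namespace entry):

```jsonnet
// Sketch: a ksonnet component that takes its namespace from the environment
// rather than from params. (The all.parts signature is assumed for illustration.)
local k = import "k.libsonnet";
local all = import "kubeflow/spark/all.libsonnet";

function(env, params)
  // env.namespace is supplied per ksonnet environment, so deploying to a
  // different namespace means creating another environment, not editing params.
  local namespace = env.namespace;
  k.core.v1.list.new(all.parts(params.name, namespace))
```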
kubeflow/spark/parts.yaml, line 5 at r1 (raw file):
Previously, holdenk (Holden Karau) wrote…
<3 :D
Maybe add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ?
kubeflow/spark/README.md, line 2 at r13 (raw file):
A very early attempt at allowing Apache Spark to be used with Kubeflow. Starts a container to run the driver program in, and the rest is up to the Spark on K8s integration.
Add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator if that's what it's based on?
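For context on what the job half of the package produces: it renders a SparkApplication custom resource that the upstream operator then reconciles. A hypothetical sketch using the param names discussed in this PR (jobName, mainClass, jobArguments, sparkVersion); the field names follow the upstream operator's v1alpha1 examples and the values are purely illustrative:

```jsonnet
// Sketch of the SparkApplication a job prototype could emit; all params here
// are assumptions based on names mentioned elsewhere in this PR.
{
  sparkJob(jobName, namespace, image, mainClass, applicationFile,
           jobArguments=[], sparkVersion="v2.3.1"):: {
    apiVersion: "sparkoperator.k8s.io/v1alpha1",
    kind: "SparkApplication",
    metadata: {
      name: jobName,
      namespace: namespace,
      labels: { "app.kubernetes.io/version": sparkVersion },
    },
    spec: {
      type: "Scala",
      mode: "cluster",
      image: image,
      mainClass: mainClass,
      mainApplicationFile: applicationFile,
      arguments: jobArguments,
      driver: { cores: 0.1, coreLimit: "200m", memory: "512m",
                serviceAccount: jobName + "-spark" },
      executor: { cores: 1, instances: 1, memory: "512m" },
    },
  },
}
```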
testing/workflows/components/workflows.libsonnet, line 281 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
So this part seems to be triggered outside of the minikube tests.
That being said, I think it might make sense, in the future, to test this and other operators one by one on minikube - what do you think?
It might make sense to test on minikube one by one. That said the minikube test is probably in need of some major updating. So I don't know how useful this will be.
But I don't have a strong opinion either way.
testing/deploy_utils.py, line 100 at r10 (raw file):
Previously, holdenk (Holden Karau) wrote…
So I figured that hard-coding master was a bad idea, and installing Spark as part of the e2e minikube tests makes sure it can at least be installed. I don't feel strongly about this since we don't need it for the full workflow tests, so I'm happy to revert if this complicates matters.
The minikube test isn't using kfctl so it really isn't testing what we want anymore. So I'd probably recommend not worrying about it.
testing/deploy_utils.py, line 115 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
I wish I knew why, but it is unrelated so I'll get rid of this.
Makes sense; per the comment above, the minikube test isn't using kfctl so it's not really testing what we want.
I don't know if this comment is even still relevant.
testing/test_deploy.py, line 129 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
TODO: revert this
Still planning on reverting this?
Thanks @holdenk. It looks like the spark apply test is still failing.
I think the problem is you aren't creating the default environment.
/ok-to-test
@jlewi so I think the apply operator is succeeding; the part which is failing is the part which depended on the Python script, so I don't think it's the env issue (although that was possibly the issue before).
…link to upstream base operator in doc, remove downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now
… test the operator applies for now.
Reviewable status: 0 of 6 files reviewed, 13 unresolved discussions (waiting on @gaocegege, @holdenk, @inc0, @jlewi, @kris-nova, and @pdmack)
kubeflow/spark/all.libsonnet, line 82 at r2 (raw file):
Previously, holdenk (Holden Karau) wrote…
thanks for catching.
Done.
kubeflow/spark/all.libsonnet, line 247 at r2 (raw file):
Previously, holdenk (Holden Karau) wrote…
Yes, good catch,
Done.
kubeflow/spark/all.libsonnet, line 101 at r4 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
I think the pattern we want is to install the operator in one namespace, e.g. "kubeflow-system", and users will use a different namespace.
So I do think we need a ClusterRole because the operator will want to claim jobs in other namespaces.
Ok, I'll switch it to ClusterRole.
kubeflow/spark/parts.yaml, line 5 at r1 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Maybe add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator ?
Done
Reviewable status: 0 of 6 files reviewed, 10 unresolved discussions (waiting on @gaocegege, @inc0, @jlewi, @kris-nova, and @pdmack)
kubeflow/spark/all.libsonnet, line 140 at r4 (raw file):
Previously, holdenk (Holden Karau) wrote…
So ClusterRole versus Role is now user-configurable: if folks don't need to run jobs outside of the namespace where they created the operator we'll just do a Role, but if they want to have the operator and jobs sit in different namespaces we use a ClusterRole.
Resolved from @jlewi's comment
kubeflow/spark/README.md, line 2 at r13 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Add a link to https://github.com/GoogleCloudPlatform/spark-on-k8s-operator if that's what it's based on?
Done.
testing/workflows/components/kfctl_test.jsonnet, line 221 at r10 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
Did you modify kfctl to add this?
Backed out this change anyway, so it shouldn't matter.
testing/workflows/components/kfctl_test.jsonnet, line 235 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
Yeah, I can revert those if we want. I figured it made sense to see Spark installed on minikube even if we only used the operator on the full version in the e2e workflow.
Reverted changes to Python helper scripts.
testing/spark_temp/simple_test.sh, line 1 at r12 (raw file):
Previously, holdenk (Holden Karau) wrote…
This was for local testing, I can remove it.
Removed
testing/workflows/components/workflows.libsonnet, line 281 at r12 (raw file):
Previously, jlewi (Jeremy Lewi) wrote…
It might make sense to test on minikube one by one. That said the minikube test is probably in need of some major updating. So I don't know how useful this will be.
But I don't have a strong opinion either way.
Done. For now I took this out; it is used beyond e2e minikube, but since I don't have it wired up to also work on the e2e minikube tests it doesn't make sense to put it in.
Reviewable status: 0 of 6 files reviewed, 5 unresolved discussions (waiting on @gaocegege, @inc0, @jlewi, @kris-nova, and @pdmack)
Woo Hoo!
Test failures look unrelated, /retest
We were having quota issues earlier. Should be fixed now. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jlewi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
It's still failing before getting to the Spark-specific code, and it looks like a quota failure so let's try:
Although IDK if the bot listens to me; might need someone else to tell it to retest.
Ok, the Spark jobs passed but the notebooks test failed?
/retest
/meow
* Add a Spark operator to Kubeflow along with integration tests. Squashed commit messages:
  - Start adding converted spark operator elements
  - Can generate empty service account for Spark
  - Create the service account for the spark-operator.
  - Add clusterRole for Spark
  - Add cluster role bindings for Spark
  - Add deployment (todo cleanup name/image)
  - Can now launch spark operator
  - Put in a reasonable default for namespace (e.g default not null) and make the image used for spark-operator configurable
  - Start working to add job type
  - We can now launch and operator and launch a job, but the service accounts don't quite line up. TODO(holden) refactor the service accounts for the job to only be created in the job and move sparkJob inside of all.json as well then have an all / operator / job entry point in all.json maybe?
  - Add two hacked up temporary test scripts for use during dev (TODO refactor later into proper workflow)
  - Able to launch a job fixed coreLimit and added job arguments. Remaining TODOs are handling of nulls & svc account hack + test cleanup.
  - Start trying to re-organize the operator/job
  - Fix handling of optional jobArguments and mainClass and now it _works_ :)
  - Auto format the ksonnet.
  - Reviewer feedback: switch description of Spark operator to something meaningful, use sparkVersion param instead of hard coded v2.3.1-v1alpha1, and fix hardcoded namespace.
  - Clarify jobName param, remove Fix this since it has been integrated into all.libsonnet as intended.
  - CR feedback: change description typo and add opitonal param to spark operator for sparkVersion
  - Start trying to add spark tests to test_deploy.py
  - At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests. This reverts commit 912a763.
  - Start trying to add Spark to the e2e workflow for testing
  - Looks like the prow tests call into the python tests normally so Revert "At @kunmingg suggestion Revert "Start trying to add spark tests to test_deploy.py" to focus on prow tests." This reverts commit 6c4c81f.
  - autoformat jsonnet
  - s/core/common/ and /var/log/syslog to README
  - Race condition on first deployment
  - Start adding SparkPI job to the workflow test.
  - Generate spark operator during CI as well.
  - Fix deploy kf indent
  - Already covered by deploy.
  - Install spark operator
  - Revert "Install spark operator" This reverts commit cc559dd.
  - Test against the PR not master.
  - Fix string concat
  - Take spark-deploy out of workflows since cover in kf presub anyways.
  - Debug commit revert later.
  - idk whats going on for real. hax
  - Ok lets use where the sym link was coming from idk.
  - Debug deploy kubeflow call...
  - Pritn in maint oo.
  - Specify a name.
  - name
  - Get all.
  - More debugging also why do we eddit app.yaml; directly.
  - don't gen common import for debug
  - Just do spark-operator as verbose.
  - spelling
  - hmm namespace looked weird, lets run pytorch in verbose too so I can compare
  - put verbose at the end
  - Autoformat the json
  - Add a deployment scope and give more things a namespace
  - Format.
  - Gen pytorch and spark ops as verbose
  - idk wtf this is.
  - Don't deploy the spark job in the releaser test
  - no kfctl test either.
  - Just use name
  - We don't append any junk anymore
  - format json
  - Don't do spark in deploy_kubeflow anymore
  - Spark job deployment with workflows
  - Apply spark operator.
  - Add a sleep hack
  - Fix multi-line
  - add a working dir for the ks app
  - temp debug garbage
  - specify working dir
  - Working dir was not happy, just cd cause why not
  - testdir not appDir
  - change to tests.testDir
  - Move operator deployment
  - Make sure we are in the ks_app?
  - Remove debugging and YOLO
  - 90% less YOLO
  - Add that comma
  - Change deps
  - well CD seems to work in the other command so uhhh who knows?
  - Use runpath + pushd instead of kfctl generate
  - Just generate for now
  - Do both
  - Generate k8s
  - Install operator
  - Break down setting up the spark operator into different steps
  - We are in default rather than ghke
  - Use the run script to do the dpeloy
  - Change the namespace to stepsNamespace and add debug step cauise idk
  - Append the params to generate cmd
  - Remove params_str since we're doing list and a param of namespace
  - s/extends/extend/
  - Move params to the right place
  - Remove debug cluster step
  - Remove local test since we now use the regular e2e argo triggered tests.
  - Respond to the CR feedback
  - Fix paramterization of spark executor config.
  - Plumb through spark version to executor version label
  - Remove unecessary whitespace change in otherwise unmodified file.
* re-run autoformat
* default doesn't seem to exists anymore
* Debug the env list cause it changed
* re-run autoformat again
* Specify the env since env list shows default env is the only env present.
* Remove debug env list since the operator now works
* autofrmat and indent default
* Address CR feedback: remove deploymentscope and just use clusterole, link to upstream base operator in doc, remove downstream job test since it's triggered in both minikube and kfctl tests and we don't want to test it in minikube right now
* Take out the spark job from ther workflows in components test we just test the operator applies for now.
* Remove namespace as a param and just use the env.
* Fix end of line on namespace from ; to ,
Initial work-in-progress attempt at adding Spark to Kubeflow using the spark-on-k8s-operator as a base starting point. This is super early but I'd love people's feedback on the direction with this.
cc @texasmichelle
Known TODOs: