Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added image_version support to Dataproc #1833

Merged
merged 8 commits into from
Sep 5, 2016

Conversation

piotrpolatowski
Copy link
Contributor

Description

Adds the image_version parameter. Version list

Motivation and Context

Relates to issue Dataproc image-version param not supported

Have you tested this? If so, how?

Manually tested against Google Dataproc. Both 1.0 and 1.1 cluster creation worked.

@mention-bot
Copy link

@piotrpolatowski, thanks for your PR! By analyzing the annotation information on this pull request, we identified @constantijn to be a potential reviewer

@erikbern
Copy link
Contributor

lgtm

@Tarrasch
Copy link
Contributor

lgtm. Can someone who knows gcp review this too?

@constantijn
Copy link

I like the idea, but i don't think image version 1.1 should be the default. It might be the latest image now but eventually v1.1 will become obsolete and you won't even be able to provision that version anymore.

Can you change it so that if not specified it won't pass the image version to the API at all? (which means it'll default to giving you the latest version).

More info on dataproc versioning here:
https://cloud.google.com/dataproc/docs/concepts/versioning

@piotrpolatowski
Copy link
Contributor Author

Is there a way in luigi to define an optional parameter without default value?

@Tarrasch
Copy link
Contributor

Nope. But you can typically use default="" and do things like if my_task.my_param: { .. do stuff if the variable is set .. }

@piotrpolatowski
Copy link
Contributor Author

Done, manual test created 1.1 when no version specified, 1.0 for image_version = "1.0"

software_config = {}

if self.image_version:
software_config["imageVersion"] = self.image_version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This oneliner is much more readable imo:

software_config = { "imageVersion": self.image_version} if self.image_version else {}

@constantijn
Copy link

lgtm

@piotrpolatowski
Copy link
Contributor Author

Any idea why CI might have failed?

@constantijn
Copy link

Looks like flake8 (Stylecheck/linter) failed with:
flake8 runtests: commands[0] | flake8 --max-line-length=160 --exclude=doc,luigi/six.py,.tox ./luigi/contrib/dataproc.py:166:28: E201 whitespace after '{' ./luigi/contrib/dataproc.py:166:63: E202 whitespace before '}'

@Tarrasch
Copy link
Contributor

Well. All you need now is to add tests I believe: https://github.com/spotify/luigi/blob/master/test/contrib/dataproc_test.py

@piotrpolatowski
Copy link
Contributor Author

Other properties are not tested, I thought there is a reason for that, is there?

@Tarrasch
Copy link
Contributor

I bet it's laziness ;p. But we have the author here so I let @constantijn answer it. :)

@constantijn
Copy link

Heh, there's a full set of tests to test if the Luigi side of things work properly.
I didn't add tests that test googles API in detail, working under the assumption google tests their own API before putting it live ;)

@piotrpolatowski
Copy link
Contributor Author

piotrpolatowski commented Aug 31, 2016

I kind of agree with @constantijn we can't test every parameter.

I've added a simple unit test with image-version parameter.
@Tarrasch did you expect sth like that? The problem with unit tests it that these are actually integration tests run against Google Cloud which I don't know how to run on my local PC.

@constantijn
Copy link

Yeah they're integration tests that check if the Luigi task does what you expect it to do on google cloud.

Not sure if this is the right way to test since what you have now leaves a cluster up and running after a successful test run, which will confuse future tests runs and use up unnecessary cloud credit.

At least we need to add a delete cluster call after this one, probably also add a "2" to the cluster name so it can't interfere with the other tests.

The "hard" way would be to write a real unit test that intercepts the api calls to goole cloud (It's all REST under the hood) and runs assertions on that to make sure that the Luigi options you set result in the API calls you expect.

@constantijn
Copy link

Oh and Flake8 is complaining again ;)

./test/contrib/dataproc_test.py:149:33: W292 no newline at end of file

@constantijn
Copy link

Flake8 again:

./test/contrib/dataproc_test.py:32:1: E302 expected 2 blank lines, found 1

Other then that I'm happy if @Tarrasch is happy :)

@Tarrasch
Copy link
Contributor

Yea I just want some minimal test (something is better than nothing principle). But I totally leave it up to you guys to decide if it should be there or not. Because you guys know this code much better than me. :)

@constantijn
Copy link

@Tarrasch & @erikbern merge away ;)

@piotrpolatowski
Copy link
Contributor Author

@Tarrasch @erikbern ready to merge?

@erikbern
Copy link
Contributor

erikbern commented Sep 2, 2016

go ahead!

@Tarrasch Tarrasch merged commit add86fa into spotify:master Sep 5, 2016
@Tarrasch
Copy link
Contributor

Tarrasch commented Sep 5, 2016

@constantijn @piotrpolatowski,

Can you see if you added a regression for this test here? https://travis-ci.org/spotify/luigi/jobs/157539534

Unfortunately people without write access to this repository cannot let Travis run the gcp tests. But perhaps you can try to see if you get the same error.

@constantijn
Copy link

The error i see is:
clusterName must be a match of regex '(?:[a-z](?:[-a-z0-9]{0,53}[a-z0-9])?).
clusterName is defined with:
CLUSTER_NAME = os.environ.get('DATAPROC_TEST_CLUSTER', 'unit-test-cluster')

What's the env variable? Or what's the cluster name that gets passed to the test?

In the travis output I only see that the api didn't like the clusterName it got passed ... I can't see what the actual clusterName is that it got passed.

@Tarrasch
Copy link
Contributor

Tarrasch commented Sep 6, 2016

@constantijn, I can't help you much as I don't know anything about dataproc. Can you either send a patch increasing the debug output? Is there any way you can see how this patch can have caused the build to start failing?

@constantijn
Copy link

I'm kinda debugging in the dark here, but I can make a guess:
One things that jumps out at me is the "1.0" string defined in line 30 that gets appended to the cluster name for the new tests .... pretty sure that doesn't match with that regex. Can you try changing it to "1-0" and see if that fixes it? (Not setup myself atm to run the integration tests, or i'd do it myself)

No idea why this wasn't caught before if this is indeed the issue.

Tarrasch added a commit that referenced this pull request Sep 7, 2016
@Tarrasch
Copy link
Contributor

Tarrasch commented Sep 7, 2016

@constantijn ok I submitted a PR trying what you suggested. But if it doesn't work, perhaps you or @piotrpolatowski can try to run the tests against your own gcp cluster?

Tarrasch added a commit that referenced this pull request Sep 7, 2016
This was referenced Jun 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants