Add unit test to spark_ec2 script #134

logc · 2014-03-13T09:51:19Z

Add a unittest for a refactored function to destroy cluster. Relies on mock and moto dependencies to avoid sending out EC2 requests.

We noticed at work that the supplied script does not work always when destroying clusters, specially for regions outside "us-east-1". This pull request adds a test on the command. If this is accepted, we can continue debugging why exactly are such clusters not terminated correctly.

Run the test with:

$ python ec2/tests.py
Searching for existing cluster cluster_name...
Terminating master...
Terminating slaves...
Deleting security groups (this will take some time)...
Attempt 1
.
----------------------------------------------------------------------
Ran 1 test in 1.040s

Please notice that the test relies on moto and mock to run. Since I did not know how you handle such dependencies, I have not added a requirements.txt or a setup.py script, but can update the pull request in that sense.

Add a unittest for refactored function to destroy cluster. Relies on mock and moto dependencies to avoid sending out EC2 requests.

AmplabJenkins · 2014-03-13T10:38:02Z

Can one of the admins verify this patch?

logc · 2014-03-17T15:04:14Z

If I may clarify this pull request in any way, please ask me 😄

AmplabJenkins · 2014-03-28T01:55:08Z

Can one of the admins verify this patch?

logc · 2014-03-28T09:07:46Z

It seems there is no interest in my proposal. If the Jenkins server complains again in 2 weeks, I will just close the PR. Thanks.

rxin · 2014-04-17T00:51:01Z

Jenkins, test this please.

rxin · 2014-04-17T00:52:10Z

@shivaram can you take a look at this?

AmplabJenkins · 2014-04-17T00:53:14Z

Merged build triggered.

AmplabJenkins · 2014-04-17T01:01:51Z

Merged build started.

AmplabJenkins · 2014-04-17T01:02:46Z

Merged build finished.

AmplabJenkins · 2014-04-17T01:02:46Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14196/

shivaram · 2014-04-21T03:29:10Z

Sorry for the delay in looking at this. The refactoring into a new function destroy_cluster looks good to me. I am not sure I understand the purpose of the unit test. If we want to debug AWS consistency issues then those will not occur in the unit test ? Am I missing something ?

Regd. dependencies, we actually ship the boto version we use as a part of Spark in https://github.com/apache/spark/tree/master/ec2/third_party -- What are the licenses of moto and mock ? If they are Apache compatible and not very large, we can add them to third_party.

logc · 2014-04-21T09:05:04Z

Hi @shivaram , thanks for your comments!

The purpose of the unit test is to start defining invariants in the code, and later be sure that changes do not break invariants. That means, even if the ultimate purpose of the code is to start a machine on EC2, we can consider the purpose fulfilled if we send the corresponding request out (leaving the actual start of the machine as a responsibility of AWS). The moto library is in charge of faithfully replicating the AWS interface -- which, to my knowledge, it does -- but avoiding to start actual machines and incur in costs.

The situation is similar to when you mock out your database: you want to check your logic, and leave the intended secondary effects as a concern of the database.

You are right that this first test is quite uninteresting: it just checks that no exceptions are raised. But this was intended as a way to start setting up the test infrastructure. I intend to expand this in order to test other functions; at some point, this test will receive a (mocked) started cluster on EC2, and check that terminating it does eliminate all of its instances.

I am submitting this change set as it is because it is self-contained, and it is complete: others can start developing tests of their own, and the first test does guarantee that other changes do not break the destroy_cluster functionality.

Regarding licenses: moto is under the Apache license as you can see here, and mock is under the BSD license. As far as I know, BSD and Apache are compatible. The only restriction I can think of is that you should leave the BSD notice on the library.

shivaram · 2014-04-21T18:37:28Z

@logc I see, so the goal is to check for invariants in spark_ec2.py to detect bugs etc. Okay that sounds good.

Can you create a zip file for moto, mock similar to the boto zip file we include in third_party ? If the size of the zip file is small, I think this should be fine.

And the license stuff looks good. It'll be great if you can add the BSD parts to https://github.com/apache/spark/blob/master/LICENSE (similar to boto)

logc · 2014-04-23T11:04:08Z

@shivaram : sorry, I did not have time to work on this yesterday.

I added the library zip files and license terms as requested. Zipfiles are for the release I worked with, not the latest release. Otherwise, as you may notice, mock is roughly the same size as the boto library and moto is substantially less.

The license terms are added right underneath the boto license terms (I did not know if there is another ordering you want to keep). I am not a lawyer and I can't assure this is the way to comply with license terms, but it looks reasonable to me.

Out of curiosity: why don't you manage Python dependencies through a requirements.txt or a setup.py? It would take care of downloading such packages for you, and license terms are met automatically as far as I know ...

shivaram · 2014-04-24T04:28:50Z

ec2/tests.py

+import moto
+
+import spark_ec2
+


extra newline

The extra newline before classes is required by the flake8 tool in its default settings, which is what I am using to review Python style for this script. If you tell me this is not your enforced style, then I will change it. However, if you do not have any enforced style, please allow me to keep it and suggest flake8 defaults as the style standard.

Actually I check code in pyspark as well and we do follow the flake8 defaults of extra newline before classes. So keep this in is absolutely fine.

shivaram · 2014-04-24T04:32:00Z

Thanks @logc -- I had some style comments that I left inline. The dependencies + License look good to me.

I am not really sure why we dont use requirements.txt / setup.py -- One reason I can see is that we want it to be zero overhead for users to run spark_ec2.py without installing or setting up anything before that.

logc · 2014-04-24T10:50:34Z

I just corrected the style comments, except for one where I would like clarification on your style standard.

I understand that the Spark project uses mainly SBT for construction, installation and testing. Probably you do not want to mix in another build system for the Python part (in the sense of managing dependencies). However, please make the project aware that the second mechanism I propose --defining a setup.py-- works without the user needing anything except a Python interpreter (which is anyway needed just to launch the script).

And let me also mention that I reviewed the Travis error on my previous commit, and it seems unrelated to my changes: an SBT test for BagelSuite, whatever that is, does not halt after a large number of iterations.

shivaram · 2014-04-24T17:56:55Z

Thanks @logc for making the changes. The travis error can be ignored (we are still testing out the Travis setup). We do use Jenkins to test things before merging, so I am going to trigger that now.

One question about running tests: Do we need to somehow include the dependencies in the PYTHONPATH ? Or does running python tests.py automatically get them in somehow ?

shivaram · 2014-04-24T17:57:05Z

Jenkins, test this please

AmplabJenkins · 2014-04-24T17:57:56Z

Merged build triggered.

AmplabJenkins · 2014-04-24T17:58:05Z

Merged build started.

AmplabJenkins · 2014-04-24T18:35:43Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-04-24T18:35:43Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14442/

logc · 2014-04-25T07:46:42Z

Yes, the dependencies have to be included in PYTHONPATH. I installed them on my machine using pip, and included the zip files as requested. If you would like to work solely from the zip files, then you should extract them and run python setup.py install in each of the extracted source trees.

Again, this is something that would be automated if this script had its own setup.py, but I don't want to sound repetitive 😄

logc · 2014-04-28T10:46:49Z

@shivaram : I hope all is well with my last PYTHONPATH answer ... ?

shivaram · 2014-04-28T19:32:55Z

@logc -- Sorry I am caught with with a deadline and didn't get a chance to look at this. I think the basic idea is that we don't want users to run setup.py to use spark_ec2.py. This script is meant to be a utility script that just works out of the box without any installation. This is part of the reason why we ship the dependencies and include them in the PYTHONPATH in the shell script spark-ec2.

Anyways I think this change looks good. We could add a run-testsmethod that will put moto and mock in the PYTHONPATH, but I am not sure that is very necessary as these tests won't be executed by all users.

@pwendell can you take a look before merging ?

logc · 2014-05-22T12:55:11Z

Since this pull request does not seem to get much attention, I am closing it. Thank you for your time!

rxin · 2014-05-22T22:10:06Z

Actually no need to close this. A reminder would be fine (as you can see, there are lots of pull requests that are piling up, the rate of them coming in is higher than the rate of review).

I think this is good to have in general, but one thing is we cannot include binary artifacts (even zip files) in Spark code base (this is a rule by Apache Software Foundation). Is there anyway we can work around that?

rxin · 2014-05-22T22:10:15Z

(And yes - please re-open it) Thanks!

logc · 2014-05-27T14:10:38Z

@rxin - I did not intend to bring more attention to this PR ... if you scroll back in the discussion, you can read that I reminded the people reviewing this to keep it on their radar (or dismiss it) at least twice. I know you have lots of requests piling up; I would be also happy with a reject decision. (Well, not exactly happy, but content).

On the issue of binary artifacts, I included them on request of the pull request reviewer, because this part of the project aims to be self-contained. Do I remove them or keep them?

AmplabJenkins · 2014-05-27T14:12:58Z

Can one of the admins verify this patch?

Fixed a typo in Hadoop version in README. (cherry picked from commit d407c07) Signed-off-by: Reynold Xin <[email protected]>

rxin · 2014-06-09T05:44:32Z

@pwendell what do we do with respect to binary files included with this?

pwendell · 2014-06-09T06:00:15Z

Hey @rxin so I think the "no binaries" means "no compiled code" - I don't think it means we can't have a zip file. These zip files only contain uncompiled .py files, so I think it's okay.

That said @logc I noticed the zip files are pretty large (e.g. almost one MB total). They include a lot of docs/html/etc that aren't needed. Would you mind re-packaging them into smaller zips (i.e. just remove the stuff in there that is not needed). I do like having it be self contained.

Also, would you mind actually running these tests in the default test runner? Just add something to to the dev/run-tests script at the end. I also couldn't seem to actually get these to run, but maybe I'm not setting up the PYTHONPATH correctly:

PYTHON_PATH=ec2/third_party/mock-1.0.1.zip/mock-1.0.1 ./ec2/tests.py
Traceback (most recent call last):
  File "./ec2/tests.py", line 32, in <module>
    import mock
ImportError: No module named mock

JoshRosen · 2014-08-12T18:23:22Z

This PR is a pretty neat idea; it would be great to get more testing of the Spark EC2 scripts, especially as we add new features (like #1899). Does anyone want to adopt this code and bring it up-to-date with the latest version of Spark? If not, I might be able to do it myself as soon as I'm a little less busy.

SparkQA · 2014-09-05T23:47:27Z

Can one of the admins verify this patch?

pwendell · 2014-09-15T04:41:11Z

Let's close this issue for now and hopefully someone can take it up and bring it up to date.

JoshRosen · 2014-12-16T02:18:04Z

/bump. Let's see if we can get the auto-close script to pick up this JIRA.

close this issue

close this pr

close this please

## What changes were proposed in this pull request? `HookCallingExternalCatalog.dropPartitions` was accidentally broken during the last merge with Apache Spark. This fixes the build. ## How was this patch tested? Existing tests. Author: Herman van Hovell <[email protected]> Closes apache#134 from hvanhovell/HotFix-HookCallingExternalCatalog.

* Add dist dependency to 'make docker' * update comment * copy the correct spark dist directory * set TEMPLATE_SPARK_DIST_URI

### What changes were proposed in this pull request? Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve children output ordering information (inherit from `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in complex queries involved multiple joins. Example: ``` withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") { val df1 = spark.range(100).select($"id".as("k1")) val df2 = spark.range(100).select($"id".as("k2")) val df3 = spark.range(3).select($"id".as("k3")) val df4 = spark.range(100).select($"id".as("k4")) val plan = df1.join(df2, $"k1" === $"k2") .join(df3, $"k1" === $"k3") .join(df4, $"k1" === $"k4") .queryExecution .executedPlan } ``` Current physical plan (extra sort on `k1` before top sort merge join): ``` *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0 : +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128] : : : +- *(1) Project [id#218L AS k1#220L] : : : +- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134] : : +- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141] : +- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) ``` Ideal physical plan (no extra sort on `k1` before top sort merge join): ``` *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127] : : : +- *(1) Project [id#218L AS k1#220L] : : : +- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133] : : +- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140] : +- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) ``` ### Why are the changes needed? To avoid unnecessary sort in query, and it has most impact when users read sorted bucketed table. Though the unnecessary sort is operating on already sorted data, it would have obvious negative impact on IO and query run time if the data is large and external sorting happens. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `JoinSuite`. Closes #29181 from c21/ordering. Authored-by: Cheng Su <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>

Luis Osa added 2 commits March 13, 2014 10:49

Add unittest for destroy cluster command

2e7ca99

Add a unittest for refactored function to destroy cluster. Relies on mock and moto dependencies to avoid sending out EC2 requests.

Add pyc files to gitignore

534edcc

Add license terms and zipfiles for dependencies

4b2d264

shivaram reviewed Apr 24, 2014
View reviewed changes

Add shebang line, header and docstrings

a58bc0c

logc closed this May 22, 2014

logc reopened this May 27, 2014

jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014

Merge pull request alteryx#134 from rxin/readme

054d97b

Fixed a typo in Hadoop version in README. (cherry picked from commit d407c07) Signed-off-by: Reynold Xin <[email protected]>

logc closed this Dec 16, 2014

Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018

[SPARK-444] Add dist dependency to 'make docker' (apache#134)

58faa6a

* Add dist dependency to 'make docker' * update comment * copy the correct spark dist directory * set TEMPLATE_SPARK_DIST_URI

kiszk mentioned this pull request May 16, 2019

[SPARK-27752][Core] Upgrade lz4-java from 1.5.1 to 1.6.0 #24629

Closed

fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020

add localfile appender and pattern layout (apache#134)

5770a68

c21 mentioned this pull request Jul 22, 2020

[SPARK-32383][SQL] Preserve hash join (BHJ and SHJ) stream side ordering #29181

Closed

microbearz pushed a commit to microbearz/spark that referenced this pull request Dec 15, 2020

add localfile appender and pattern layout (apache#134)

1fa9242

Add unit test to spark_ec2 script #134

Add unit test to spark_ec2 script #134

Conversation

logc commented Mar 13, 2014

AmplabJenkins commented Mar 13, 2014

logc commented Mar 17, 2014

AmplabJenkins commented Mar 28, 2014

logc commented Mar 28, 2014

rxin commented Apr 17, 2014

rxin commented Apr 17, 2014

AmplabJenkins commented Apr 17, 2014

AmplabJenkins commented Apr 17, 2014

AmplabJenkins commented Apr 17, 2014

AmplabJenkins commented Apr 17, 2014

shivaram commented Apr 21, 2014

logc commented Apr 21, 2014

shivaram commented Apr 21, 2014

logc commented Apr 23, 2014

shivaram Apr 24, 2014

Choose a reason for hiding this comment

logc Apr 24, 2014

Choose a reason for hiding this comment

shivaram Apr 24, 2014

Choose a reason for hiding this comment

shivaram commented Apr 24, 2014

logc commented Apr 24, 2014

shivaram commented Apr 24, 2014

shivaram commented Apr 24, 2014

AmplabJenkins commented Apr 24, 2014

AmplabJenkins commented Apr 24, 2014

AmplabJenkins commented Apr 24, 2014

AmplabJenkins commented Apr 24, 2014

logc commented Apr 25, 2014

logc commented Apr 28, 2014

shivaram commented Apr 28, 2014

logc commented May 22, 2014

rxin commented May 22, 2014

rxin commented May 22, 2014

logc commented May 27, 2014

AmplabJenkins commented May 27, 2014

rxin commented Jun 9, 2014

pwendell commented Jun 9, 2014

JoshRosen commented Aug 12, 2014

SparkQA commented Sep 5, 2014

pwendell commented Sep 15, 2014

JoshRosen commented Dec 16, 2014