
Add unit test to spark_ec2 script #134

Closed
wants to merge 4 commits into from

Conversation


@logc logc commented Mar 13, 2014

Add a unittest for a refactored function to destroy cluster. Relies on mock and moto dependencies to avoid sending out EC2 requests.

We noticed at work that the supplied script does not always work when destroying clusters, especially for regions outside "us-east-1". This pull request adds a test on the command. If it is accepted, we can continue debugging why exactly such clusters are not terminated correctly.

Run the test with:

```
$ python ec2/tests.py
Searching for existing cluster cluster_name...
Terminating master...
Terminating slaves...
Deleting security groups (this will take some time)...
Attempt 1
.
----------------------------------------------------------------------
Ran 1 test in 1.040s
```

Please note that the test relies on moto and mock to run. Since I did not know how you handle such dependencies, I have not added a requirements.txt or a setup.py script, but I can update the pull request in that direction.
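For reference, a minimal sketch of what such a test can look like; the destroy_cluster(conn, cluster_name) signature shown here is hypothetical, and the function extracted in the actual patch may take different arguments:

```
# Sketch only: assumes a hypothetical destroy_cluster(conn, cluster_name)
# signature; the function refactored in this patch may differ.
import unittest

import boto
from moto import mock_ec2

import spark_ec2  # the module under test (ec2/spark_ec2.py)


class DestroyClusterTest(unittest.TestCase):

    @mock_ec2  # moto fakes the EC2 backend, so no real AWS requests go out
    def test_destroy_cluster_raises_no_exception(self):
        conn = boto.connect_ec2("fake-access-key", "fake-secret-key")
        # The first test is deliberately simple: it only checks that
        # destroying a cluster raises no exceptions.
        spark_ec2.destroy_cluster(conn, "cluster_name")


if __name__ == "__main__":
    unittest.main()
```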

Luis Osa added 2 commits March 13, 2014 10:49
Add a unittest for refactored function to destroy cluster. Relies on mock and moto dependencies to avoid sending out EC2 requests.
@AmplabJenkins

Can one of the admins verify this patch?

@logc
Author

logc commented Mar 17, 2014

If I may clarify this pull request in any way, please ask me 😄

@AmplabJenkins

Can one of the admins verify this patch?

@logc
Author

logc commented Mar 28, 2014

It seems there is no interest in my proposal. If the Jenkins server complains again in 2 weeks, I will just close the PR. Thanks.

@rxin
Contributor

rxin commented Apr 17, 2014

Jenkins, test this please.

@rxin
Contributor

rxin commented Apr 17, 2014

@shivaram can you take a look at this?

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14196/

@shivaram
Contributor

Sorry for the delay in looking at this. The refactoring into a new function destroy_cluster looks good to me. I am not sure I understand the purpose of the unit test. If we want to debug AWS consistency issues, then those will not occur in the unit test? Am I missing something?

Regarding dependencies, we actually ship the boto version we use as a part of Spark in https://github.com/apache/spark/tree/master/ec2/third_party -- What are the licenses of moto and mock? If they are Apache compatible and not very large, we can add them to third_party.

@logc
Author

logc commented Apr 21, 2014

Hi @shivaram , thanks for your comments!

The purpose of the unit test is to start defining invariants in the code, and later to be sure that changes do not break those invariants. That means that even if the ultimate purpose of the code is to start a machine on EC2, we can consider that purpose fulfilled if we send the corresponding request out (leaving the actual start of the machine as a responsibility of AWS). The moto library is in charge of faithfully replicating the AWS interface -- which, to my knowledge, it does -- while avoiding starting actual machines and incurring costs.

The situation is similar to when you mock out your database: you want to check your logic, and leave the intended secondary effects as a concern of the database.
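As a concrete illustration of that point, independent of spark_ec2 (the AMI id below is just a placeholder that moto accepts):

```
# Every boto call below hits moto's in-memory backend, not AWS.
import boto
from moto import mock_ec2


@mock_ec2
def request_and_terminate():
    conn = boto.connect_ec2("fake-key", "fake-secret")
    reservation = conn.run_instances("ami-12345678")  # placeholder AMI id
    instance = reservation.instances[0]
    # The fake backend tracks state just like AWS would.
    assert conn.get_all_instances()[0].instances[0].id == instance.id
    instance.terminate()


request_and_terminate()
```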

You are right that this first test is quite uninteresting: it just checks that no exceptions are raised. But this was intended as a way to start setting up the test infrastructure. I intend to expand this in order to test other functions; at some point, this test will receive a (mocked) started cluster on EC2, and check that terminating it does eliminate all of its instances.

I am submitting this change set as it is because it is self-contained, and it is complete: others can start developing tests of their own, and the first test does guarantee that other changes do not break the destroy_cluster functionality.

Regarding licenses: moto is under the Apache license as you can see here, and mock is under the BSD license. As far as I know, BSD and Apache are compatible. The only restriction I can think of is that you should leave the BSD notice on the library.

@shivaram
Contributor

@logc I see, so the goal is to check for invariants in spark_ec2.py to detect bugs etc. Okay that sounds good.

Can you create a zip file for moto and mock similar to the boto zip file we include in third_party? If the size of the zip file is small, I think this should be fine.

And the license stuff looks good. It'll be great if you can add the BSD parts to https://github.com/apache/spark/blob/master/LICENSE (similar to boto)

@logc
Author

logc commented Apr 23, 2014

@shivaram : sorry, I did not have time to work on this yesterday.

I added the library zip files and license terms as requested. The zip files are for the releases I worked with, not the latest ones. As you may notice, mock is roughly the same size as the boto library, and moto is substantially smaller.

The license terms are added right underneath the boto license terms (I did not know if there is another ordering you want to keep). I am not a lawyer and cannot guarantee this is the right way to comply with the license terms, but it looks reasonable to me.

Out of curiosity: why don't you manage Python dependencies through a requirements.txt or a setup.py? It would take care of downloading such packages for you, and license terms are met automatically as far as I know ...
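A minimal setup.py along those lines could look like the sketch below; the version pins are just examples, not what this PR ships:

```
# Hypothetical ec2/setup.py sketch; the version pins are examples only.
from setuptools import setup

setup(
    name="spark-ec2",
    version="0.1",
    py_modules=["spark_ec2"],
    install_requires=["boto>=2.4.1"],
    tests_require=["mock>=1.0.1", "moto>=0.2.1"],
)
```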

```
import moto

import spark_ec2
```

Contributor

extra newline

Author

The extra newline before classes is required by the flake8 tool in its default settings, which is what I am using to review Python style for this script. If you tell me this is not your enforced style, then I will change it. However, if you do not have any enforced style, please allow me to keep it and suggest flake8 defaults as the style standard.

Contributor

Actually, I check code in pyspark as well, and we do follow the flake8 default of an extra newline before classes. So keeping this in is absolutely fine.
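(The check under discussion is flake8's E302, "expected 2 blank lines"; a minimal example of code that passes it:)

```
# flake8's default E302 check requires two blank lines before a
# top-level class or function definition.
import unittest


class SomeTest(unittest.TestCase):  # two blank lines above: passes E302
    pass
```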

@shivaram
Contributor

Thanks @logc -- I had some style comments that I left inline. The dependencies + License look good to me.

I am not really sure why we don't use requirements.txt / setup.py -- One reason I can see is that we want it to be zero overhead for users to run spark_ec2.py without installing or setting up anything before that.

@logc
Author

logc commented Apr 24, 2014

I just addressed the style comments, except for one where I would like clarification on your style standard.

I understand that the Spark project mainly uses SBT for building, installation and testing. Probably you do not want to mix in another build system for the Python part (in the sense of managing dependencies). However, please make the project aware that the second mechanism I propose -- defining a setup.py -- works without the user needing anything except a Python interpreter (which is needed anyway just to launch the script).

And let me also mention that I reviewed the Travis error on my previous commit, and it seems unrelated to my changes: an SBT test for BagelSuite, whatever that is, does not halt after a large number of iterations.

@shivaram
Contributor

Thanks @logc for making the changes. The travis error can be ignored (we are still testing out the Travis setup). We do use Jenkins to test things before merging, so I am going to trigger that now.

One question about running tests: Do we need to somehow include the dependencies in the PYTHONPATH? Or does running python tests.py automatically get them in somehow?

@shivaram
Contributor

Jenkins, test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14442/

@logc
Author

logc commented Apr 25, 2014

Yes, the dependencies have to be included in PYTHONPATH. I installed them on my machine using pip, and included the zip files as requested. If you would like to work solely from the zip files, then you should extract them and run python setup.py install in each of the extracted source trees.

Again, this is something that would be automated if this script had its own setup.py, but I don't want to sound repetitive 😄
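For what it's worth, Python's zipimport can also load the bundled zips directly, without extracting them; the archive names and inner directories below are guesses at what third_party would contain:

```
# zipimport resolves packages straight from a zip on sys.path; the
# archive names and inner directories here are guesses.
import sys

sys.path.insert(0, "ec2/third_party/mock-1.0.1.zip/mock-1.0.1")
sys.path.insert(0, "ec2/third_party/moto-0.2.1.zip/moto-0.2.1")

import mock  # resolved from the zip via zipimport
import moto
```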

@logc
Author

logc commented Apr 28, 2014

@shivaram : I hope all is well with my last PYTHONPATH answer ... ?

@shivaram
Contributor

@logc -- Sorry, I am caught up with a deadline and didn't get a chance to look at this. I think the basic idea is that we don't want users to run setup.py to use spark_ec2.py. This script is meant to be a utility script that just works out of the box without any installation. This is part of the reason why we ship the dependencies and include them in the PYTHONPATH in the shell script spark-ec2.

Anyways I think this change looks good. We could add a run-tests method that will put moto and mock in the PYTHONPATH, but I am not sure that is very necessary as these tests won't be executed by all users.

@pwendell can you take a look before merging ?

@logc
Author

logc commented May 22, 2014

Since this pull request does not seem to get much attention, I am closing it. Thank you for your time!

@logc logc closed this May 22, 2014
@rxin
Contributor

rxin commented May 22, 2014

Actually, no need to close this. A reminder would be fine (as you can see, there are lots of pull requests piling up; the rate at which they come in is higher than the rate of review).

I think this is good to have in general, but one thing is that we cannot include binary artifacts (even zip files) in the Spark code base (this is a rule by the Apache Software Foundation). Is there any way we can work around that?

@rxin
Contributor

rxin commented May 22, 2014

(And yes - please re-open it) Thanks!

@logc
Author

logc commented May 27, 2014

@rxin - I did not intend to bring more attention to this PR ... if you scroll back in the discussion, you can read that I reminded the people reviewing this to keep it on their radar (or dismiss it) at least twice. I know you have lots of requests piling up; I would be also happy with a reject decision. (Well, not exactly happy, but content).

On the issue of binary artifacts, I included them at the request of the pull request reviewer, because this part of the project aims to be self-contained. Do I remove them or keep them?

@logc logc reopened this May 27, 2014
@AmplabJenkins

Can one of the admins verify this patch?

jhartlaub referenced this pull request in jhartlaub/spark May 27, 2014
Fixed a typo in Hadoop version in README.

(cherry picked from commit d407c07)
Signed-off-by: Reynold Xin <[email protected]>
@rxin
Contributor

rxin commented Jun 9, 2014

@pwendell what do we do with respect to binary files included with this?

@pwendell
Contributor

pwendell commented Jun 9, 2014

Hey @rxin, so I think "no binaries" means "no compiled code" - I don't think it means we can't have a zip file. These zip files only contain uncompiled .py files, so I think it's okay.

That said, @logc, I noticed the zip files are pretty large (almost one MB total). They include a lot of docs/html/etc. that aren't needed. Would you mind re-packaging them into smaller zips, i.e. just removing the stuff in there that is not needed? I do like having it be self-contained.
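One way to do that repackaging, as a rough sketch (the archive name and the entries kept are hypothetical):

```
# Rough sketch: rebuild a dependency zip keeping only what is needed
# at runtime; archive name and kept entries are hypothetical.
import zipfile

KEEP = ("mock-1.0.1/mock.py", "mock-1.0.1/LICENSE.txt", "mock-1.0.1/setup.py")

with zipfile.ZipFile("mock-1.0.1.zip") as src:
    with zipfile.ZipFile("mock-1.0.1-slim.zip", "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name in KEEP:  # drop docs, html, tests, etc.
                dst.writestr(name, src.read(name))
```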

Also, would you mind actually running these tests in the default test runner? Just add something to the dev/run-tests script at the end. I also couldn't seem to actually get these to run, but maybe I'm not setting up the PYTHONPATH correctly:

```
PYTHON_PATH=ec2/third_party/mock-1.0.1.zip/mock-1.0.1 ./ec2/tests.py
Traceback (most recent call last):
  File "./ec2/tests.py", line 32, in <module>
    import mock
ImportError: No module named mock
```

@JoshRosen
Contributor

This PR is a pretty neat idea; it would be great to get more testing of the Spark EC2 scripts, especially as we add new features (like #1899). Does anyone want to adopt this code and bring it up-to-date with the latest version of Spark? If not, I might be able to do it myself as soon as I'm a little less busy.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@pwendell
Contributor

Let's close this issue for now and hopefully someone can take it up and bring it up to date.

@JoshRosen
Contributor

/bump. Let's see if we can get the auto-close script to pick up this JIRA.

close this issue

close this pr

close this please

@logc logc closed this Dec 16, 2014
ericl pushed a commit to ericl/spark that referenced this pull request Dec 21, 2016
## What changes were proposed in this pull request?
`HookCallingExternalCatalog.dropPartitions` was accidentally broken during the last merge with Apache Spark. This fixes the build.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <[email protected]>

Closes apache#134 from hvanhovell/HotFix-HookCallingExternalCatalog.
Igosuki pushed a commit to Adikteev/spark that referenced this pull request Jul 31, 2018
* Add dist dependency to 'make docker'

* update comment

* copy the correct spark dist directory

* set TEMPLATE_SPARK_DIST_URI
fishcus pushed a commit to fishcus/spark that referenced this pull request Jul 8, 2020
cloud-fan pushed a commit that referenced this pull request Jul 27, 2020
### What changes were proposed in this pull request?

Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve their children's output ordering information (inherited from `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sorts in complex queries involving multiple joins.

Example:

```
withSQLConf(
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") {
      val df1 = spark.range(100).select($"id".as("k1"))
      val df2 = spark.range(100).select($"id".as("k2"))
      val df3 = spark.range(3).select($"id".as("k3"))
      val df4 = spark.range(100).select($"id".as("k4"))
      val plan = df1.join(df2, $"k1" === $"k2")
        .join(df3, $"k1" === $"k3")
        .join(df4, $"k1" === $"k4")
        .queryExecution
        .executedPlan
}
```

Current physical plan (extra sort on `k1` before top sort merge join):

```
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0
:  +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:     :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:     :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128]
:     :  :     +- *(1) Project [id#218L AS k1#220L]
:     :  :        +- *(1) Range (0, 100, step=1, splits=2)
:     :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134]
:     :        +- *(3) Project [id#222L AS k2#224L]
:     :           +- *(3) Range (0, 100, step=1, splits=2)
:     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141]
:        +- *(5) Project [id#226L AS k3#228L]
:           +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148]
      +- *(7) Project [id#230L AS k4#232L]
         +- *(7) Range (0, 100, step=1, splits=2)
```

Ideal physical plan (no extra sort on `k1` before top sort merge join):

```
*(9) SortMergeJoin [k1#220L], [k4#232L], Inner
:- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight
:  :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner
:  :  :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0
:  :  :  +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127]
:  :  :     +- *(1) Project [id#218L AS k1#220L]
:  :  :        +- *(1) Range (0, 100, step=1, splits=2)
:  :  +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0
:  :     +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133]
:  :        +- *(3) Project [id#222L AS k2#224L]
:  :           +- *(3) Range (0, 100, step=1, splits=2)
:  +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140]
:     +- *(5) Project [id#226L AS k3#228L]
:        +- *(5) Range (0, 3, step=1, splits=2)
+- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146]
      +- *(7) Project [id#230L AS k4#232L]
         +- *(7) Range (0, 100, step=1, splits=2)
```

### Why are the changes needed?

To avoid unnecessary sorts in queries; this has the most impact when users read sorted bucketed tables.
Though the unnecessary sort operates on already sorted data, it would have an obvious negative impact on IO and query run time if the data is large and external sorting happens.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite`.

Closes #29181 from c21/ordering.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
microbearz pushed a commit to microbearz/spark that referenced this pull request Dec 15, 2020