Add unit test to spark_ec2 script #134
Conversation
Add a unit test for a refactored function to destroy a cluster. It relies on the mock and moto dependencies to avoid sending out EC2 requests.
Can one of the admins verify this patch?
If I may clarify this pull request in any way, please ask me 😄
Can one of the admins verify this patch?
It seems there is no interest in my proposal. If the Jenkins server complains again in 2 weeks, I will just close the PR. Thanks.
Jenkins, test this please.
@shivaram can you take a look at this?
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14196/
Sorry for the delay in looking at this. The refactoring into a new function looks fine. Regarding dependencies: we actually ship the boto version we use as a part of Spark in https://github.com/apache/spark/tree/master/ec2/third_party -- what are the licenses of moto and mock? If they are Apache compatible and not very large, we can add them to third_party.
Hi @shivaram, thanks for your comments! The purpose of the unit test is to start defining invariants in the code, and later be sure that changes do not break those invariants. That means, even if the ultimate purpose of the code is to start a machine on EC2, we can consider the purpose fulfilled if we send the corresponding request out (leaving the actual start of the machine as a responsibility of AWS). The moto library is in charge of faithfully replicating the AWS interface -- which, to my knowledge, it does -- while avoiding starting actual machines and incurring costs. The situation is similar to mocking out your database: you want to check your logic, and leave the intended side effects as a concern of the database.

You are right that this first test is quite uninteresting: it just checks that no exceptions are raised. But it was intended as a way to start setting up the test infrastructure. I intend to expand this in order to test other functions; at some point, this test will receive a (mocked) started cluster on EC2 and check that terminating it does eliminate all of its instances. I am submitting this change set as it is because it is self-contained and complete: others can start developing tests of their own, and the first test does guarantee that other changes do not break the behavior it covers.

Regarding licenses: moto is under the Apache license, as you can see here, and mock is under the BSD license. As far as I know, BSD and Apache are compatible. The only restriction I can think of is that you should keep the BSD notice with the library.
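For illustration, here is a minimal sketch of the kind of invariant test described above, written against moto's in-memory EC2 backend. It uses boto3 and moto's mock_ec2 decorator (moto before 5.x; newer releases use mock_aws), and the inline termination loop merely stands in for the refactored destroy-cluster function in spark_ec2.py, which is not reproduced here; all names are illustrative assumptions rather than the code in this PR.

```python
import unittest

import boto3
from moto import mock_ec2


class DestroyClusterTest(unittest.TestCase):
    @mock_ec2
    def test_terminate_removes_all_cluster_instances(self):
        # All EC2 calls below hit moto's fake backend; nothing reaches AWS.
        ec2 = boto3.resource("ec2", region_name="us-west-2")
        instances = ec2.create_instances(
            ImageId="ami-12345678", MinCount=2, MaxCount=2
        )

        # Stand-in for the destroy-cluster logic under test.
        for instance in instances:
            instance.terminate()

        # Invariant: no instance of the (mocked) cluster is left alive.
        alive = [
            i for i in ec2.instances.all()
            if i.state["Name"] not in ("shutting-down", "terminated")
        ]
        self.assertEqual(alive, [])


if __name__ == "__main__":
    unittest.main()
```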
@logc I see, so the goal is to check for invariants in spark_ec2.py to detect bugs, etc. Okay, that sounds good. Can you create zip files for moto and mock, similar to the boto zip file we include in third_party? If the size of the zip files is small, I think this should be fine. And the license stuff looks good. It'll be great if you can add the BSD parts to https://github.com/apache/spark/blob/master/LICENSE (similar to boto).
@shivaram: sorry, I did not have time to work on this yesterday. I added the library zip files and license terms as requested. The zip files are for the release I worked with, not the latest release. Otherwise, as you may notice, the license terms are added right underneath the boto entry.

Out of curiosity: why don't you manage Python dependencies through a requirements.txt or a setup.py?
```python
import moto

import spark_ec2
```
extra newline
The extra newline before classes is required by the flake8 tool in its default settings, which is what I am using to review Python style for this script. If you tell me this is not your enforced style, then I will change it. However, if you do not have any enforced style, please allow me to keep it and suggest the flake8 defaults as the style standard.
Actually, I check the code in pyspark as well, and we do follow the flake8 default of an extra newline before classes. So keeping this in is absolutely fine.
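For reference, the flake8 default being discussed here corresponds to pycodestyle rule E302, which expects two blank lines before a top-level class or function definition; the class name below is purely illustrative.

```python
import unittest


class SparkEC2StyleExample(unittest.TestCase):
    """Two blank lines above this class keep flake8 (rule E302) happy."""

    def test_nothing(self):
        self.assertTrue(True)
```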
Thanks @logc -- I had some style comments that I left inline. The dependencies and license look good to me. I am not really sure why we don't use something like that for managing the Python dependencies ourselves.
I just corrected the style issues, except for one where I would like clarification on your style standard. I understand that the Spark project mainly uses SBT for building, installing, and testing, and you probably do not want to mix in another build system for the Python part (in the sense of managing dependencies). However, please make the project aware of the second mechanism I propose, defining a setup.py for the script.

And let me also mention that I reviewed the Travis error on my previous commit, and it seems unrelated to my changes: an SBT test fails in a module this pull request does not touch.
Thanks @logc for making the changes. The Travis error can be ignored (we are still testing out the Travis setup). We do use Jenkins to test things before merging, so I am going to trigger that now. One question about running tests: do we need to somehow include the dependencies in the PYTHONPATH, or does running python tests.py automatically pick them up somehow?
Jenkins, test this please.
Merged build triggered.
Merged build started.
Merged build finished. All automated tests passed.
All automated tests passed.
Yes, the dependencies have to be included in PYTHONPATH. I installed them on my machine using pip. Again, this is something that would be automated if this script had its own setup.py.
@shivaram: I hope all is well with my last PYTHONPATH answer...?
@logc -- Sorry, I am caught up with a deadline and didn't get a chance to look at this. I think the basic idea is that we don't want users to run setup.py to use spark_ec2.py. This script is meant to be a utility script that just works out of the box without any installation. This is part of the reason why we ship the dependencies and include them in the PYTHONPATH in the shell script that launches it.

Anyways, I think this change looks good. We could add a way to run these tests as part of our regular testing as well.

@pwendell can you take a look before merging?
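To illustrate the "works out of the box" idea, here is a minimal sketch of how a tests.py could pick up bundled zips without any installation step, in the spirit of the shipped boto archive. The third_party directory, the zip file names, and the assumption that each zip contains its package at the top level are all placeholders, not the actual layout of this PR.

```python
import os
import sys

# Assumed layout: the library zips sit in a third_party/ directory next to
# this script, mirroring how the bundled boto archive is shipped.
EC2_DIR = os.path.dirname(os.path.abspath(__file__))
THIRD_PARTY_DIR = os.path.join(EC2_DIR, "third_party")

for archive in ("mock.zip", "moto.zip"):  # placeholder file names
    archive_path = os.path.join(THIRD_PARTY_DIR, archive)
    if os.path.exists(archive_path):
        # zipimport lets Python import pure-Python packages straight from a zip.
        sys.path.insert(0, archive_path)

import mock  # noqa
import moto  # noqa

import spark_ec2  # noqa
```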
Since this pull request does not seem to get much attention, I am closing it. Thank you for your time!
Actually, no need to close this. A reminder would be fine (as you can see, there are lots of pull requests piling up; the rate of them coming in is higher than the rate of review). I think this is good to have in general, but one thing is that we cannot include binary artifacts (even zip files) in the Spark code base (this is a rule by the Apache Software Foundation). Is there any way we can work around that?
(And yes - please re-open it.) Thanks!
@rxin - I did not intend to bring more attention to this PR... if you scroll back in the discussion, you can read that I reminded the people reviewing this to keep it on their radar (or dismiss it) at least twice. I know you have lots of requests piling up; I would also be happy with a reject decision. (Well, not exactly happy, but content.)

On the issue of binary artifacts: I included them at the request of the pull request reviewer, because this part of the project aims to be self-contained. Should I remove them or keep them?
Can one of the admins verify this patch?
Fixed a typo in Hadoop version in README. (cherry picked from commit d407c07) Signed-off-by: Reynold Xin <[email protected]>
@pwendell what do we do with respect to binary files included with this?
Hey @rxin, so I think the "no binaries" rule means "no compiled code" -- I don't think it means we can't have a zip file. These zip files only contain uncompiled Python source files.

That said, @logc, I noticed the zip files are pretty large (almost one MB in total). They include a lot of docs/html/etc. that aren't needed. Would you mind re-packaging them into smaller zips (i.e. just remove the stuff in there that is not needed)? I do like having it be self-contained. Also, would you mind actually running these tests in the default test runner? Just add an entry for them to the script that runs the other tests.
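As a rough sketch of one way the archives could be slimmed down, the helper below copies only the Python sources into a new zip. The file names are placeholders, and moto in particular may also need some non-.py data files, so the filter would have to be checked against what each library actually imports at runtime.

```python
import zipfile


def slim_zip(src_path, dst_path):
    """Copy only the .py entries from src_path into a new, smaller zip."""
    with zipfile.ZipFile(src_path) as original, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as slimmed:
        for name in original.namelist():
            # Drop docs, html, and test data; keep the importable sources.
            if name.endswith(".py"):
                slimmed.writestr(name, original.read(name))


if __name__ == "__main__":
    slim_zip("moto.zip", "moto-slim.zip")  # placeholder file names
    slim_zip("mock.zip", "mock-slim.zip")
```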
This PR is a pretty neat idea; it would be great to get more testing of the Spark EC2 scripts, especially as we add new features (like #1899). Does anyone want to adopt this code and bring it up-to-date with the latest version of Spark? If not, I might be able to do it myself as soon as I'm a little less busy.
Can one of the admins verify this patch?
Let's close this issue for now and hopefully someone can take it up and bring it up to date.
/bump. Let's see if we can get the auto-close script to pick up this JIRA. Close this issue. Close this PR. Close this, please.
## What changes were proposed in this pull request? `HookCallingExternalCatalog.dropPartitions` was accidentally broken during the last merge with Apache Spark. This fixes the build. ## How was this patch tested? Existing tests. Author: Herman van Hovell <[email protected]> Closes apache#134 from hvanhovell/HotFix-HookCallingExternalCatalog.
* Add dist dependency to 'make docker'
* update comment
* copy the correct spark dist directory
* set TEMPLATE_SPARK_DIST_URI
### What changes were proposed in this pull request? Currently `BroadcastHashJoinExec` and `ShuffledHashJoinExec` do not preserve children output ordering information (inherit from `SparkPlan.outputOrdering`, which is Nil). This can add unnecessary sort in complex queries involved multiple joins. Example: ``` withSQLConf( SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "50") { val df1 = spark.range(100).select($"id".as("k1")) val df2 = spark.range(100).select($"id".as("k2")) val df3 = spark.range(3).select($"id".as("k3")) val df4 = spark.range(100).select($"id".as("k4")) val plan = df1.join(df2, $"k1" === $"k2") .join(df3, $"k1" === $"k3") .join(df4, $"k1" === $"k4") .queryExecution .executedPlan } ``` Current physical plan (extra sort on `k1` before top sort merge join): ``` *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) Sort [k1#220L ASC NULLS FIRST], false, 0 : +- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#128] : : : +- *(1) Project [id#218L AS k1#220L] : : : +- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#134] : : +- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#141] : +- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#148] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) ``` Ideal physical plan (no extra sort on `k1` before top sort merge join): ``` *(9) SortMergeJoin [k1#220L], [k4#232L], Inner :- *(6) BroadcastHashJoin [k1#220L], [k3#228L], Inner, BuildRight : :- *(6) SortMergeJoin [k1#220L], [k2#224L], Inner : : :- *(2) Sort [k1#220L ASC NULLS FIRST], false, 0 : : : +- Exchange hashpartitioning(k1#220L, 5), true, [id=#127] : : : +- *(1) Project [id#218L AS k1#220L] : : : +- *(1) Range (0, 100, step=1, splits=2) : : +- *(4) Sort [k2#224L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(k2#224L, 5), true, [id=#133] : : +- *(3) Project [id#222L AS k2#224L] : : +- *(3) Range (0, 100, step=1, splits=2) : +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])), [id=#140] : +- *(5) Project [id#226L AS k3#228L] : +- *(5) Range (0, 3, step=1, splits=2) +- *(8) Sort [k4#232L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(k4#232L, 5), true, [id=#146] +- *(7) Project [id#230L AS k4#232L] +- *(7) Range (0, 100, step=1, splits=2) ``` ### Why are the changes needed? To avoid unnecessary sort in query, and it has most impact when users read sorted bucketed table. Though the unnecessary sort is operating on already sorted data, it would have obvious negative impact on IO and query run time if the data is large and external sorting happens. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `JoinSuite`. Closes #29181 from c21/ordering. Authored-by: Cheng Su <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
Add a unit test for a refactored function to destroy a cluster. It relies on the mock and moto dependencies to avoid sending out EC2 requests.
We noticed at work that the supplied script does not always work when destroying clusters, especially for regions outside "us-east-1". This pull request adds a test for the command. If this is accepted, we can continue debugging why exactly such clusters are not terminated correctly.

Run the test with python tests.py.

Please notice that the test relies on moto and mock to run. Since I did not know how you handle such dependencies, I have not added a requirements.txt or a setup.py script, but I can update the pull request in that sense.