[SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. #16082
Conversation
Force-pushed from dc7127c to 7c2d0fc.
Force-pushed from 7c2d0fc to ab51ae3.
Test build #69412 has finished for PR 16082 at commit
Test build #69415 has finished for PR 16082 at commit
Test build #69420 has finished for PR 16082 at commit
Test build #69416 has finished for PR 16082 at commit
Thanks for taking this on. I've got a few minor questions but overall looks useful.
@@ -17,6 +17,8 @@
global-exclude *.py[cod] __pycache__ .DS_Store
recursive-include deps/jars *.jar
graft deps/bin
graft deps/data
graft deps/licenses
Since we later limit this to text files, maybe recursive-include would be better?
Done.
data_files = glob.glob(os.path.join(LICENSES_PATH, "*"))
Is this necessary? None of the other packages we're making seem to need it.
I think the question is whether whatever product this produces includes copies of third-party software, as in an assembly jar. If so, the licenses should be included. If no Java code is being packaged, most of this won't be relevant.
Oh no, I mean including the license files is reasonable; it's more whether it's necessary to use data_files here (and down below) when all of the other packages we are producing simply use package_data to specify the backing files.
Makes sense, I'll remove the data_files-related code to keep them consistent.
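For illustration, a minimal sketch of the package_data approach being discussed, assuming the deps/data and deps/licenses staging directories from this PR; the package names, directory mapping, and glob patterns below are assumptions for the example, not the exact final setup.py:

```python
# Minimal sketch only -- assumes a deps/ staging layout like the one in this PR.
# Instead of data_files, expose the staged directories as packages and list
# their contents with package_data, mirroring how jars/bin/examples are handled.
from setuptools import setup

setup(
    name="pyspark",
    version="2.1.0.dev0",
    packages=[
        "pyspark",
        "pyspark.data",       # hypothetical package backed by deps/data
        "pyspark.licenses",   # hypothetical package backed by deps/licenses
    ],
    package_dir={
        "pyspark.data": "deps/data",
        "pyspark.licenses": "deps/licenses",
    },
    package_data={
        "pyspark.data": ["*.data", "*.txt", "*/*.data", "*/*.txt"],
        "pyspark.licenses": ["*.txt"],
    },
)
```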
@@ -17,6 +17,8 @@
global-exclude *.py[cod] __pycache__ .DS_Store
recursive-include deps/jars *.jar
graft deps/bin
graft deps/data
Since we limit this to *.data and *.txt later, would it make sense to restrict this here as well? That being said, I can see an argument that we might have CSV or TSV files in the future, in which case we'd want to switch the downstream include to be * rather than just txt/data.
Yeah, I think limiting it in one place would be enough.
On second thought, I think it would be better to keep all the packages consistent (specifying the patterns to include in both MANIFEST.in and setup.py), so I changed the graft to recursive-include with proper patterns, as you suggested.
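For reference, a rough sketch of what the resulting MANIFEST.in entries could look like; the include patterns are an assumption based on the *.data/*.txt discussion above, not a verbatim copy of the final file:

```
global-exclude *.py[cod] __pycache__ .DS_Store
recursive-include deps/jars *.jar
graft deps/bin
recursive-include deps/data *.data *.txt
recursive-include deps/licenses *.txt
```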
'pyspark.examples.src.main.python': ['*.py', '*/*.py']},
data_files=[('', data_files)],
As mentioned above - do we really need this directive?
Also, if you can tag your Python PRs with [PYTHON], it makes searching for them a bit faster.
SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")

DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")
Super minor, and sorry I didn't notice it earlier, but we define all of the _PATHs before the targets; just for someone skimming the code, it would be good to keep the current flow.
Done. Good catch!
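For illustration, the grouping being suggested might look roughly like this; the SPARK_HOME/TEMP_PATH values and the *_PATH/*_TARGET names other than those shown in the snippet above are assumptions for the sketch:

```python
# Illustrative sketch of the suggested ordering: all *_PATH constants (source
# locations under SPARK_HOME) first, then all *_TARGET constants (staging
# locations under the temporary deps/ directory).
import os

SPARK_HOME = os.path.abspath("../")   # assumed for the sketch
TEMP_PATH = "deps"                    # assumed for the sketch

DATA_PATH = os.path.join(SPARK_HOME, "data")          # assumed name
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")

SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses")  # assumed name
```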
One minor nit, but other than that LGTM - thank you for taking the time to do this @lins05 :)
This looks good to me, maybe @srowen, @davies, or @JoshRosen could do a final pass/merge?
Test build #69636 has finished for PR 16082 at commit
Test build #69637 has finished for PR 16082 at commit
[SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package.

## What changes were proposed in this pull request?

Since we already include the python examples in the pyspark package, we should include the example data with it as well. We should also include the third-party licenses since we distribute their jars with the pyspark package.

## How was this patch tested?

Manually tested with python2.7 and python3.4

```sh
$ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
$ cd python
$ python setup.py sdist
$ pip install dist/pyspark-2.1.0.dev0.tar.gz
$ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
graphx
mllib
streaming
$ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/
$ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/licenses/ | head -5
LICENSE-AnchorJS.txt
LICENSE-DPark.txt
LICENSE-Mockito.txt
LICENSE-SnapTree.txt
LICENSE-antlr.txt
```

Author: Shuai Lin <[email protected]>

Closes #16082 from lins05/include-data-in-pyspark-dist.

(cherry picked from commit bd9a4a5)
Signed-off-by: Sean Owen <[email protected]>
Merged to master, and 2.1 as it's relevant to packaging of the release.
What changes were proposed in this pull request?
Since we already include the python examples in the pyspark package, we should include the example data with it as well.
We should also include the third-party licenses since we distribute their jars with the pyspark package.
How was this patch tested?
Manually tested with python2.7 and python3.4
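As a rough sanity check after installing the built sdist, something like the following could be used to confirm that the data and license files ended up inside the installed package; this snippet is illustrative and assumes the pyspark/data and pyspark/licenses layout shown in the test output above:

```python
# Illustrative check only: list the bundled example data and license files
# from wherever pip installed the pyspark package.
import os
import pyspark

pkg_dir = os.path.dirname(pyspark.__file__)
print(sorted(os.listdir(os.path.join(pkg_dir, "data"))))          # e.g. graphx, mllib, streaming
print(sorted(os.listdir(os.path.join(pkg_dir, "licenses")))[:5])  # e.g. LICENSE-*.txt files
```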