
[SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. #16082

Closed
lins05 wants to merge 6 commits into apache:master from lins05:include-data-in-pyspark-dist

Conversation

@lins05 (Contributor) commented Nov 30, 2016

What changes were proposed in this pull request?

Since we already include the Python examples in the pyspark package, we should include the example data with them as well.

We should also include the third-party licenses, since we distribute their jars with the pyspark package.

How was this patch tested?

Manually tested with Python 2.7 and Python 3.4:

$ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
$ cd python
$ python setup.py sdist
$ pip install dist/pyspark-2.1.0.dev0.tar.gz

$ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
graphx
mllib
streaming

$ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/

$ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/licenses/ | head -5
LICENSE-AnchorJS.txt
LICENSE-DPark.txt
LICENSE-Mockito.txt
LICENSE-SnapTree.txt
LICENSE-antlr.txt
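
After a pip install like the one above, the bundled example data can be located relative to the installed package. A small illustration (hedged; the directory layout is taken from the ls output above):

```python
import os
import pyspark

# Locate the example data shipped inside the installed pyspark package.
data_dir = os.path.join(os.path.dirname(pyspark.__file__), "data")
print(sorted(os.listdir(data_dir)))  # expect: ['graphx', 'mllib', 'streaming']
```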

@lins05 changed the title from "[SPARK-18652] Include the data in pyspark package." to "[SPARK-18652] Include the example data and third-party licenses in pyspark package." on Nov 30, 2016
@lins05 force-pushed the include-data-in-pyspark-dist branch from dc7127c to 7c2d0fc on November 30, 2016 17:44
@lins05 force-pushed the include-data-in-pyspark-dist branch from 7c2d0fc to ab51ae3 on November 30, 2016 17:45
@SparkQA commented Nov 30, 2016

Test build #69412 has finished for PR 16082 at commit 43019db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 30, 2016

Test build #69415 has finished for PR 16082 at commit 9735e20.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 30, 2016

Test build #69420 has finished for PR 16082 at commit 90ca61d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 30, 2016

Test build #69416 has finished for PR 16082 at commit ab51ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@lins05 (Contributor Author) commented Dec 2, 2016

@holdenk @rxin Can we merge this?

@holdenk (Contributor) left a comment

Thanks for taking this on. I've got a few minor questions but overall looks useful.

@@ -17,6 +17,8 @@
global-exclude *.py[cod] __pycache__ .DS_Store
recursive-include deps/jars *.jar
graft deps/bin
graft deps/data
graft deps/licenses
Contributor:
Since we later limit this to text files, maybe recursive-include would be better?

Contributor Author:
Done.


data_files = glob.glob(os.path.join(LICENSES_PATH, "*"))
Contributor:
Is this necessary? None of the other packages we're making seem to need it.

Member:
I think the question is whether whatever product this produces includes copies of third-party software, as in an assembly jar. If so, the licenses should be included. If no Java code is being packaged, most of this won't be relevant.

Contributor:
Oh no, I mean that including the license files is reasonable; it's more a question of whether it's necessary to use data_files here (and down below) when all of the other packages we're producing simply use package_data to specify the backing files.

Contributor Author:
Makes sense, I'll remove the data_files usage to keep them consistent.
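
For reference, a minimal sketch of the package_data approach under discussion (illustrative only; the package name, package_dir mapping, and patterns here are assumptions, not the exact diff):

```python
# Sketch: package_data keeps the license files inside the installed pyspark
# package, consistent with how the jars, scripts, and examples are declared,
# whereas data_files=[('', files)] would install them relative to sys.prefix,
# outside the package.
from setuptools import setup

setup(
    name='pyspark-packaging-sketch',  # hypothetical name for this sketch
    version='0.0.0',
    packages=['pyspark', 'pyspark.licenses'],
    package_dir={'pyspark.licenses': 'deps/licenses'},
    package_data={'pyspark.licenses': ['*.txt']},  # the bundled licenses are .txt files
)
```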

@@ -17,6 +17,8 @@
global-exclude *.py[cod] __pycache__ .DS_Store
recursive-include deps/jars *.jar
graft deps/bin
graft deps/data
Contributor:
Since we limit this to *.data and *.txt later, would it make sense to restrict it here as well? That said, I can see an argument that we might have CSV or TSV files in the future, in which case we'd want to switch the downstream include to * rather than just txt/data.

Contributor Author:
Yeah, I think limiting it in one place would be enough.

Contributor Author:
On second thought, I think it would be better to keep all the packages consistent (specifying the patterns to include in both MANIFEST.in and setup.py), so I changed the graft to recursive-include with the proper patterns, as you suggested.
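
Roughly what that change looks like in MANIFEST.in (a sketch based on the *.data/*.txt patterns discussed above, not the exact diff):

```
# graft copies everything under a directory:
#   graft deps/data
# recursive-include restricts the copy to the named patterns:
recursive-include deps/data *.data *.txt
recursive-include deps/licenses *.txt
```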

'pyspark.examples.src.main.python': ['*.py', '*/*.py']},
data_files=[('', data_files)],
Contributor:
As mentioned above - do we really need this directive?

@holdenk (Contributor) commented Dec 2, 2016

Also, if you can tag your Python PRs with [PYTHON], it makes searching for them a bit faster.

@lins05 changed the title from "[SPARK-18652] Include the example data and third-party licenses in pyspark package." to "[SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package." on Dec 4, 2016
SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")

DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")
Contributor:
Super minor, and sorry I didn't notice it earlier, but we define all of the _PATHs before the _TARGETs; just for someone skimming the code, it would be good to keep the current flow.

Contributor Author:
Done. Good catch!
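
For illustration, a sketch of the reordered constants (the SPARK_HOME/TEMP_PATH stubs and the DATA_PATH/LICENSES_TARGET names are assumptions following the existing naming pattern):

```python
import os

SPARK_HOME = os.environ.get("SPARK_HOME", ".")  # stub for this sketch
TEMP_PATH = "deps"  # assumed staging dir, matching the deps/ paths in MANIFEST.in

# Source locations under SPARK_HOME grouped first ...
DATA_PATH = os.path.join(SPARK_HOME, "data")
LICENSES_PATH = os.path.join(SPARK_HOME, "licenses")

# ... then the staging targets under TEMP_PATH.
SCRIPTS_TARGET = os.path.join(TEMP_PATH, "bin")
JARS_TARGET = os.path.join(TEMP_PATH, "jars")
EXAMPLES_TARGET = os.path.join(TEMP_PATH, "examples")
DATA_TARGET = os.path.join(TEMP_PATH, "data")
LICENSES_TARGET = os.path.join(TEMP_PATH, "licenses")
```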

@holdenk (Contributor) commented Dec 4, 2016

One minor nit, but other than that LGTM - thank you for taking the time to do this @lins05 :)

@holdenk (Contributor) commented Dec 4, 2016

This looks good to me, maybe @srowen, @davies, or @JoshRosen could do a final pass/merge?

@SparkQA commented Dec 4, 2016

Test build #69636 has finished for PR 16082 at commit 5706d8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 4, 2016

Test build #69637 has finished for PR 16082 at commit 8d3ef53.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 6, 2016
…es in pyspark package.

Author: Shuai Lin <[email protected]>

Closes #16082 from lins05/include-data-in-pyspark-dist.

(cherry picked from commit bd9a4a5)
Signed-off-by: Sean Owen <[email protected]>
@srowen (Member) commented Dec 6, 2016

Merged to master and 2.1, as it's relevant to packaging of the release.

@asfgit closed this in bd9a4a5 on Dec 6, 2016
@lins05 deleted the include-data-in-pyspark-dist branch on December 8, 2016 15:18
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…es in pyspark package.

Author: Shuai Lin <[email protected]>

Closes apache#16082 from lins05/include-data-in-pyspark-dist.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…es in pyspark package.

Author: Shuai Lin <[email protected]>

Closes apache#16082 from lins05/include-data-in-pyspark-dist.