
Prevent module shadowing in pyspark_runner.py #2232

Merged 1 commit into spotify:master on Sep 19, 2017
Conversation

@adaitche
Contributor
Motivation

PySparkTask uses spark-submit to run the script pyspark_runner.py. Because it is run as a script, the modules and packages from its directory (luigi/contrib/) shadow globally installed modules. In particular, the hdfs package used by webhdfs_client.py is shadowed by the luigi.contrib.hdfs package, so PySparkTask does not work with webhdfs.

An example task which highlights the problem can be found below:

import luigi
import luigi.contrib.hdfs
import luigi.contrib.spark


class MyTask(luigi.contrib.spark.PySparkTask):
    def output(self):
        # A target that never exists, so main() always runs.
        return luigi.LocalTarget("not_present.txt")

    def main(self, sc):
        # Before this patch, the `hdfs` package that WebHdfsClient relies on
        # resolves to luigi.contrib.hdfs instead, and this call fails.
        client = luigi.contrib.hdfs.webhdfs_client.WebHdfsClient("some_host")
        print(client.client)

Description

The problem is resolved by moving the script's directory to the end of the sys.path list in pyspark_runner.py.
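A minimal sketch of the approach (illustrative, not the exact patch): when Python executes a file as a script, it places that file's directory at the front of sys.path, so the fix amounts to demoting that entry to the back.

```python
import os
import sys

def demote_script_dir(path_list, script_dir):
    """Move script_dir from wherever it sits in path_list to the end,
    so sibling modules (e.g. luigi/contrib/hdfs) stop shadowing
    globally installed packages of the same name (e.g. `hdfs`)."""
    if script_dir in path_list:
        path_list.remove(script_dir)
        path_list.append(script_dir)
    return path_list

# In pyspark_runner.py this would be applied to sys.path itself, e.g.:
#   demote_script_dir(sys.path, os.path.dirname(os.path.abspath(__file__)))
```

The helper name and structure are hypothetical; the patch itself only needs the remove/append pair applied to sys.path.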

Testing

I tested manually that this resolves the problem (it has been running for several weeks in production). I thought about adding a unit test for this but did not come up with a good solution. The difficulty is that the problem occurs only when pyspark_runner.py is run as a script, and running the script necessarily tries to start Spark by creating a SparkContext, which I assume is not desirable in unit tests. Suggestions on how to test this are welcome.
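The shadowing mechanism itself can be demonstrated without Spark. The snippet below (not from the PR; module and directory names are invented for illustration) shows that whichever directory sits earlier in sys.path supplies a contested module name, and that moving a directory to the end flips the winner:

```python
import importlib
import os
import shutil
import sys
import tempfile

# Two directories each provide a module called demo_mod; the one earlier
# in sys.path wins -- the same mechanism by which luigi/contrib/hdfs
# shadowed the globally installed `hdfs` package.
script_dir = tempfile.mkdtemp()   # stands in for luigi/contrib/
site_dir = tempfile.mkdtemp()     # stands in for site-packages
for d, tag in [(script_dir, "script_dir"), (site_dir, "site_packages")]:
    with open(os.path.join(d, "demo_mod.py"), "w") as f:
        f.write("TAG = %r\n" % tag)

sys.path[:0] = [script_dir, site_dir]   # script_dir first: it shadows site_dir
importlib.invalidate_caches()
import demo_mod
assert demo_mod.TAG == "script_dir"

# The fix in spirit: move the script's directory to the END of sys.path.
sys.path.remove(script_dir)
sys.path.append(script_dir)
del sys.modules["demo_mod"]             # force a fresh import
importlib.invalidate_caches()
import demo_mod
assert demo_mod.TAG == "site_packages"  # the global package wins again

shutil.rmtree(script_dir)
shutil.rmtree(site_dir)
```

This only demonstrates the import-order behavior; it does not exercise pyspark_runner.py itself, which is the part that is hard to test without a SparkContext.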

@dlstadther
Collaborator

Seems reasonable to me. Mind blaming another pyspark contributor for a confirming review?

Thanks!

@adaitche
Contributor Author

@jthi3rry Could you take a quick look at the patch?

@jthi3rry
Contributor

LGTM @adaitche @dlstadther

@dlstadther dlstadther merged commit cff9d34 into spotify:master Sep 19, 2017
@dlstadther
Collaborator

Thanks @adaitche !
