Prevent module shadowing in pyspark_runner.py #2232
Merged
Motivation
PySparkTask uses spark-submit to run the script pyspark_runner.py. Because it runs as a script, the modules and packages in its directory (luigi/contrib/) shadow the globally installed ones. In particular, the hdfs package used by webhdfs_client.py is shadowed by the luigi.contrib.hdfs package, so PySparkTask does not work together with webhdfs. An example task which highlights the problem can be found below:
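The shadowing mechanism itself can be reproduced without Spark or luigi. The following self-contained sketch uses illustrative names (a local statistics.py standing in for luigi.contrib.hdfs, the standard-library statistics module standing in for the installed hdfs package):

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# A script-local statistics.py shadows the stdlib statistics module,
# just as luigi/contrib/hdfs shadows the third-party hdfs package when
# pyspark_runner.py is run as a script by spark-submit.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'statistics.py'), 'w') as f:
        f.write('SHADOW = True\n')
    with open(os.path.join(d, 'runner.py'), 'w') as f:
        f.write(textwrap.dedent('''
            import statistics
            # sys.path[0] is the script directory, so the local
            # statistics.py wins over the standard library.
            print(hasattr(statistics, 'SHADOW'))
        '''))
    proc = subprocess.run([sys.executable, os.path.join(d, 'runner.py')],
                          capture_output=True, text=True)
shadowed = proc.stdout.strip()
print(shadowed)  # "True": the installed module was shadowed
```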
Description
The problem is resolved by putting the current directory at the end of the path list sys.path in pyspark_runner.py.
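A minimal sketch of the idea (the exact code in pyspark_runner.py may differ):

```python
import sys

# When pyspark_runner.py is run as a script, sys.path[0] is its own
# directory (luigi/contrib/). Moving that entry to the end makes
# globally installed packages such as 'hdfs' take precedence, while the
# script's directory is still searched as a last resort.
script_dir = sys.path[0]
sys.path.append(sys.path.pop(0))
assert sys.path[-1] == script_dir  # script directory is now searched last
```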
Testing
I tested manually that this resolves the problem (for several weeks in production). I thought about adding a unit test for this but did not come up with a good solution. The difficulty is that the problem occurs only when pyspark_runner.py is run as a script. But running the script necessarily tries to start Spark by creating a SparkContext, which I guess is not desired for the unit tests. Suggestions on how to test this are welcome.
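One possible direction, sketched here with hypothetical names: reproduce the directory layout in a sandbox and exercise only the sys.path reordering, so no SparkContext is needed. A contrib-like directory holds a runner-style script plus a shadowing module, while a separate directory on PYTHONPATH plays the role of the globally installed package:

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Sandbox layout: contrib/ contains the runner-like script and a
# shadowing mymod.py; site/ (on PYTHONPATH) contains the "globally
# installed" mymod.py. After the reordering under test, the script
# must import the global module, not its sibling.
with tempfile.TemporaryDirectory() as root:
    contrib = os.path.join(root, 'contrib')
    site = os.path.join(root, 'site')
    os.makedirs(contrib)
    os.makedirs(site)
    with open(os.path.join(contrib, 'mymod.py'), 'w') as f:
        f.write("ORIGIN = 'shadow'\n")
    with open(os.path.join(site, 'mymod.py'), 'w') as f:
        f.write("ORIGIN = 'global'\n")
    with open(os.path.join(contrib, 'runner.py'), 'w') as f:
        f.write(textwrap.dedent('''
            import sys
            sys.path.append(sys.path.pop(0))  # the fix under test
            import mymod
            print(mymod.ORIGIN)
        '''))
    env = dict(os.environ, PYTHONPATH=site)
    proc = subprocess.run(
        [sys.executable, os.path.join(contrib, 'runner.py')],
        capture_output=True, text=True, env=env)
result = proc.stdout.strip()
print(result)  # 'global': the installed package wins after reordering
```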