
LIVY-194. Share SparkContext across languages #178

Closed
wants to merge 1 commit into from

Conversation

@zjffdu (Contributor) commented Aug 11, 2016

This PR shares the same SparkContext across languages (Scala/Python/R). It introduces a new shared interpreter kind that supports Scala, Python, and R, all backed by the same SparkContext.

Main changes:

  • A new interpreter kind, SharedInterpreter, which contains a map of the other interpreter kinds. The input code format for SharedInterpreter is %kind code (see the sketch after this list), e.g.
%spark sc.parallelize(1 to 10).sum()
  • SparkContext/SQLContext are created by SparkFactory so that the same SparkContext/SQLContext can be shared; the Scala, Python, and R interpreters all obtain their SparkContext/SQLContext from SparkFactory.
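
For illustration, below is a minimal, self-contained Scala sketch of the %kind dispatch idea. The trait and method names here are assumptions made for the sketch, not the actual classes in this PR:

trait Interpreter {
  def execute(code: String): String
}

// Hypothetical dispatcher: holds one interpreter per kind and routes "%kind code" input.
class SharedInterpreter(interpreters: Map[String, Interpreter]) extends Interpreter {
  override def execute(code: String): String = {
    val trimmed = code.trim
    require(trimmed.startsWith("%"), "input must be of the form: %kind code")
    val spaceIdx = trimmed.indexOf(' ')
    val (kind, body) =
      if (spaceIdx == -1) (trimmed.drop(1), "")
      else (trimmed.substring(1, spaceIdx), trimmed.substring(spaceIdx + 1))
    interpreters.get(kind) match {
      case Some(interp) => interp.execute(body)                 // delegate to the per-language interpreter
      case None         => s"Unknown interpreter kind: %$kind"  // expected kinds: spark, pyspark, sparkr
    }
  }
}

A SharedInterpreter built with Map("spark" -> scalaInterp, "pyspark" -> pythonInterp, "sparkr" -> rInterp) would route %spark sc.parallelize(1 to 10).sum() to the Scala interpreter, while all three interpreters obtain the same SparkContext/SQLContext from SparkFactory.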

@zjffdu (Contributor, Author) commented Aug 11, 2016

@vanzin Please help review the approach; I will add more tests later.

@codecov-io commented Aug 11, 2016

Current coverage is 69.83% (diff: 57.97%)

Merging #178 into master will decrease coverage by 0.91%

@@             master       #178   diff @@
==========================================
  Files            91         81    -10   
  Lines          4697       3915   -782   
  Methods           0          0          
  Messages          0          0          
  Branches        811        655   -156   
==========================================
- Hits           3323       2734   -589   
+ Misses          899        805    -94   
+ Partials        475        376    -99   

Powered by Codecov. Last update 69ac11e...2e4dd74

@alex-the-man (Contributor) commented Aug 11, 2016

Besides using different languages with the same SparkContext, we also want to support the use case where multiple interactive sessions share the same SparkContext.

Some of our customers have a team of data scientists working on the same set of data. The data scientists want to share cached RDDs while still having their own interactive sessions, and most of them want to use Python. I think your approach doesn't support this use case because there can be at most one interpreter per language.

One alternative is, instead of creating a new SharedInterpreter, to make interpreters children of interactive sessions. Each session has exactly one SparkContext and can have multiple interpreters. The SparkContext is shared among all interpreters in that session, and interpreters in a session can use the same or different languages.

The REST interface will look like this:
To create a session (for a SparkContext):

POST /sessions

Then, to create an interpreter:

POST /sessions/0/interpreter
{"kind":"pyspark"}

To post a statement:

POST /sessions/0/interpreter/0/statements
{"code":"1+1"}

I think this design is more flexible and supports more potential use cases. What do you think?

@zjffdu (Contributor, Author) commented Aug 12, 2016

Thanks @tc0312 for the quick feedback. Your use case is very interesting, but I feel the scope is a little big and it doesn't seem easy to implement, so I would suggest putting it in another ticket. Here are my concerns:

  • It adds a new REST API for users. We need to be careful about adding new REST APIs, because once added we have to maintain them for backward compatibility.
  • How do we create a new interpreter in an existing session? Say we have created such a session (the YARN app is running) and now want to add a new PySparkInterpreter. How do we ask the driver to launch another Python process? It seems we would need a new protocol between the RSC client and server for that. And even if we implement it, I am concerned about the scalability of running multiple SparkIMain/Python/R processes in one JVM.

@alex-the-man (Contributor) commented
Can you close and reopen the PR to run the tests again?

@zjffdu closed this Aug 17, 2016
@zjffdu reopened this Aug 17, 2016
@alex-the-man (Contributor) commented
This indicates my test is flaky. I will stress test it tomorrow.

@alex-the-man (Contributor) commented
The previous test failure was caused by a bad timeout value. I'm going to fix it in LIVY-186.

@zjffdu (Contributor, Author) commented Dec 13, 2016

@tc0312 @linchan-ms Have you made any progress on this, such as a design doc or something else?
This PR would implement SparkContext sharing between languages. If your idea of multiple interpreters per language would not break this, then I think I can continue this work as part of the whole implementation. What do you think?

@alex-the-man (Contributor) commented
We talked to @felixcheung recently, and he told us that, depending on configuration, Zeppelin supports multiple interpreters per language too.

@zjffdu (Contributor, Author) commented Dec 13, 2016

That's correct. Zeppelin has a mode named scoped which supports multiple interpreters per language, but SparkContext sharing across languages is supported regardless of that configuration.

@alex-the-man (Contributor) commented Dec 14, 2016

If a customer wants two PySpark interpreters with different virtualenvs or Python versions for visualization and one Scala interpreter for computation, this PR doesn't seem to support that.

@jerryshao (Contributor) commented
I think what @tc0312 mentioned about multiple interpreters per SparkContext could cover the scenario here. The shared session is just one specific use case of multiple interpreters per SparkContext (one Python/Scala/R interpreter per SparkContext).

For better code design and evolution, I would suggest building a skeleton for multiple interpreters per SparkContext first; then we could add the shared session as a special case (if this feature is not urgent).

@alex-the-man (Contributor) commented Dec 16, 2016

One thing our customers want in Zeppelin and Jupyter is sharing the SparkContext but not variables; I think that's scoped mode in Zeppelin. This PR doesn't support that use case.

@gss2002 commented Feb 8, 2017

@tc0312 and @zjffdu, is there any status on this initiative? We keep getting asked about context sharing between the different languages from Zeppelin, specifically SparkR to Scala for DataFrames and such.

@zjffdu (Contributor, Author) commented Feb 10, 2017

Thanks for your interest in this feature, @gss2002; we are working on a more sophisticated design.

@zjffdu closed this Aug 24, 2017