[SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide
## What changes were proposed in this pull request?
* Add all R examples for the ML wrappers that were added during the 2.1 release cycle.
* Split the whole ```ml.R``` example file into individual examples, one per algorithm, so that users can rerun them conveniently.
* Add corresponding examples to ML user guide.
* Update ML section of SparkR user guide.

Note: The MLlib Scala/Java/Python examples are kept consistent with one another; however, the SparkR examples may differ from them, since R users may use the algorithms in a different way, for example, using an R ```formula``` to specify ```featuresCol``` and ```labelCol```.
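
To illustrate the note above, here is a minimal SparkR sketch of the formula-based style. It is not taken from this commit. It assumes a local Spark 2.1+ installation with the SparkR package on the R library path, and the `iris` dataset and column choices are purely illustrative.

```r
# Assumption: SparkR is installed and loadable in this R session.
library(SparkR)
sparkR.session()

# Convert a local R data.frame to a SparkDataFrame.
# SparkR replaces "." in column names with "_" (e.g. Sepal.Length -> Sepal_Length).
df <- createDataFrame(iris)

# The formula plays the role of labelCol/featuresCol:
# the label is on the left of `~`, the features are on the right.
model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")

# Inspect the fitted model and score the training data.
summary(model)
predictions <- predict(model, df)
head(select(predictions, "Sepal_Length", "prediction"))

sparkR.session.stop()
```

In the Scala/Java/Python APIs the same model would typically be configured by assembling a feature vector column and naming it via `featuresCol`, which is why the R examples can look different.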

## How was this patch tested?
Run all examples manually.

Author: Yanbo Liang

Closes apache#16148 from yanboliang/spark-18325.
yanboliang committed Dec 8, 2016
1 parent b47b892 commit 9bf8f3c
Showing 19 changed files with 810 additions and 178 deletions.
67 changes: 64 additions & 3 deletions docs/ml-classification-regression.md
@@ -75,6 +75,13 @@ More details on parameters can be found in the [Python API documentation](api/py
{% include_example python/ml/logistic_regression_with_elastic_net.py %}
</div>

<div data-lang="r" markdown="1">

More details on parameters can be found in the [R API documentation](api/R/spark.logit.html).

{% include_example binomial r/ml/logit.R %}
</div>

</div>

The `spark.ml` implementation of logistic regression also supports
@@ -171,6 +178,13 @@ model with elastic net regularization.
{% include_example python/ml/multiclass_logistic_regression_with_elastic_net.py %}
</div>

<div data-lang="r" markdown="1">

More details on parameters can be found in the [R API documentation](api/R/spark.logit.html).

{% include_example multinomial r/ml/logit.R %}
</div>

</div>


@@ -242,6 +256,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat

{% include_example python/ml/random_forest_classifier_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.randomForest.html) for more details.

{% include_example classification r/ml/randomForest.R %}
</div>

</div>

## Gradient-boosted tree classifier
@@ -275,6 +297,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat

{% include_example python/ml/gradient_boosted_tree_classifier_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gbt.html) for more details.

{% include_example classification r/ml/gbt.R %}
</div>

</div>

## Multilayer perceptron classifier
@@ -324,6 +354,13 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
{% include_example python/ml/multilayer_perceptron_classification.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.mlp.html) for more details.

{% include_example r/ml/mlp.R %}
</div>

</div>


@@ -400,7 +437,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat

Refer to the [R API docs](api/R/spark.naiveBayes.html) for more details.

{% include_example naiveBayes r/ml.R %}
{% include_example r/ml/naiveBayes.R %}
</div>

</div>
@@ -584,7 +621,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.

Refer to the [R API docs](api/R/spark.glm.html) for more details.

{% include_example glm r/ml.R %}
{% include_example r/ml/glm.R %}
</div>

</div>
@@ -656,6 +693,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.

{% include_example python/ml/random_forest_regressor_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.randomForest.html) for more details.

{% include_example regression r/ml/randomForest.R %}
</div>

</div>

## Gradient-boosted tree regression
@@ -689,6 +734,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.

{% include_example python/ml/gradient_boosted_tree_regressor_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gbt.html) for more details.

{% include_example regression r/ml/gbt.R %}
</div>

</div>


@@ -780,7 +833,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.regression.

Refer to the [R API docs](api/R/spark.survreg.html) for more details.

{% include_example survreg r/ml.R %}
{% include_example r/ml/survreg.R %}
</div>

</div>
@@ -853,6 +906,14 @@ Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.ml.html#pyspa

{% include_example python/ml/isotonic_regression_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [`IsotonicRegression` R API docs](api/R/spark.isoreg.html) for more details on the API.

{% include_example r/ml/isoreg.R %}
</div>

</div>

# Linear methods
18 changes: 17 additions & 1 deletion docs/ml-clustering.md
@@ -91,7 +91,7 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example kmeans r/ml.R %}
{% include_example r/ml/kmeans.R %}
</div>

</div>
@@ -126,6 +126,14 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.

{% include_example python/ml/lda_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}
</div>

</div>

## Bisecting k-means
@@ -241,4 +249,12 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.

{% include_example python/ml/gaussian_mixture_example.py %}
</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}
</div>

</div>
8 changes: 8 additions & 0 deletions docs/ml-collaborative-filtering.md
@@ -149,4 +149,12 @@ als = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
{% endhighlight %}

</div>

<div data-lang="r" markdown="1">

Refer to the [R API docs](api/R/spark.als.html) for more details.

{% include_example r/ml/als.R %}
</div>

</div>
46 changes: 20 additions & 26 deletions docs/sparkr.md
@@ -512,39 +512,33 @@ head(teenagers)

# Machine Learning

SparkR supports the following machine learning algorithms currently: `Generalized Linear Model`, `Accelerated Failure Time (AFT) Survival Regression Model`, `Naive Bayes Model` and `KMeans Model`.
Under the hood, SparkR uses MLlib to train the model.
Users can call `summary` to print a summary of the fitted model, [predict](api/R/predict.html) to make predictions on new data, and [write.ml](api/R/write.ml.html)/[read.ml](api/R/read.ml.html) to save/load fitted models.
SparkR supports a subset of the available R formula operators for model fitting, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘.

## Algorithms

### Generalized Linear Model

[spark.glm()](api/R/spark.glm.html) or [glm()](api/R/glm.html) fits generalized linear model against a Spark DataFrame.
Currently "gaussian", "binomial", "poisson" and "gamma" families are supported.
{% include_example glm r/ml.R %}

### Accelerated Failure Time (AFT) Survival Regression Model

[spark.survreg()](api/R/spark.survreg.html) fits an accelerated failure time (AFT) survival regression model on a SparkDataFrame.
Note that the formula of [spark.survreg()](api/R/spark.survreg.html) does not support operator '.' currently.
{% include_example survreg r/ml.R %}

### Naive Bayes Model

[spark.naiveBayes()](api/R/spark.naiveBayes.html) fits a Bernoulli naive Bayes model against a SparkDataFrame. Only categorical data is supported.
{% include_example naiveBayes r/ml.R %}

### KMeans Model
SparkR supports the following machine learning algorithms currently:

* [`spark.glm`](api/R/spark.glm.html) or [`glm`](api/R/glm.html): [`Generalized Linear Model`](ml-classification-regression.html#generalized-linear-regression)
* [`spark.survreg`](api/R/spark.survreg.html): [`Accelerated Failure Time (AFT) Survival Regression Model`](ml-classification-regression.html#survival-regression)
* [`spark.naiveBayes`](api/R/spark.naiveBayes.html): [`Naive Bayes Model`](ml-classification-regression.html#naive-bayes)
* [`spark.kmeans`](api/R/spark.kmeans.html): [`K-Means Model`](ml-clustering.html#k-means)
* [`spark.logit`](api/R/spark.logit.html): [`Logistic Regression Model`](ml-classification-regression.html#logistic-regression)
* [`spark.isoreg`](api/R/spark.isoreg.html): [`Isotonic Regression Model`](ml-classification-regression.html#isotonic-regression)
* [`spark.gaussianMixture`](api/R/spark.gaussianMixture.html): [`Gaussian Mixture Model`](ml-clustering.html#gaussian-mixture-model-gmm)
* [`spark.lda`](api/R/spark.lda.html): [`Latent Dirichlet Allocation (LDA) Model`](ml-clustering.html#latent-dirichlet-allocation-lda)
* [`spark.mlp`](api/R/spark.mlp.html): [`Multilayer Perceptron Classification Model`](ml-classification-regression.html#multilayer-perceptron-classifier)
* [`spark.gbt`](api/R/spark.gbt.html): `Gradient Boosted Tree Model for` [`Regression`](ml-classification-regression.html#gradient-boosted-tree-regression) `and` [`Classification`](ml-classification-regression.html#gradient-boosted-tree-classifier)
* [`spark.randomForest`](api/R/spark.randomForest.html): `Random Forest Model for` [`Regression`](ml-classification-regression.html#random-forest-regression) `and` [`Classification`](ml-classification-regression.html#random-forest-classifier)
* [`spark.als`](api/R/spark.als.html): [`Alternating Least Squares (ALS) matrix factorization Model`](ml-collaborative-filtering.html#collaborative-filtering)
* [`spark.kstest`](api/R/spark.kstest.html): `Kolmogorov-Smirnov Test`

Under the hood, SparkR uses MLlib to train the model. Please refer to the corresponding section of MLlib user guide for example code.
Users can call `summary` to print a summary of the fitted model, [predict](api/R/predict.html) to make predictions on new data, and [write.ml](api/R/write.ml.html)/[read.ml](api/R/read.ml.html) to save/load fitted models.
SparkR supports a subset of the available R formula operators for model fitting, including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘.

[spark.kmeans()](api/R/spark.kmeans.html) fits a k-means clustering model against a Spark DataFrame, similarly to R's kmeans().
{% include_example kmeans r/ml.R %}

## Model persistence

The following example shows how to save/load a MLlib model by SparkR.
{% include_example read_write r/ml.R %}
{% include_example read_write r/ml/ml.R %}
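
The save/load round trip described above might look roughly like the following. This is a hedged sketch and not the contents of the included example file. It assumes SparkR is loadable and that the temporary path is writable by Spark; the model and dataset are illustrative.

```r
# Assumption: SparkR is installed and loadable in this R session.
library(SparkR)
sparkR.session()

training <- createDataFrame(iris)
model <- spark.glm(training, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")

# Save the fitted model to a temporary location (illustrative path).
modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
write.ml(model, modelPath)

# Load it back and confirm that it can still be summarized.
restored <- read.ml(modelPath)
summary(restored)

# Clean up the saved model directory and stop the session.
unlink(modelPath, recursive = TRUE)
sparkR.session.stop()
```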

# R Function Name Conflicts
