[WIP][SPARK-1871][MLLIB] Improve MLlib guide for v1.0
Some improvements to the MLlib guide:

1. [SPARK-1872] Update API links for unidoc.
2. [SPARK-1783] Added `page.displayTitle` to the global layout. If it is defined, use it instead of `page.title` for title display.
3. Add more Java/Python examples.

Author: Xiangrui Meng <[email protected]>

Closes apache#816 from mengxr/mllib-doc and squashes the following commits:

ec2e407 [Xiangrui Meng] format scala example for ALS
cd9f40b [Xiangrui Meng] add a paragraph to summarize distributed matrix types
4617f04 [Xiangrui Meng] add python example to loadLibSVMFile and fix Java example
d6509c2 [Xiangrui Meng] [SPARK-1783] update mllib titles
561fdc0 [Xiangrui Meng] add a displayTitle option to global layout
195d06f [Xiangrui Meng] add Java example for summary stats and minor fix
9f1ff89 [Xiangrui Meng] update java api links in mllib-basics
7dad18e [Xiangrui Meng] update java api links in NB
3a0f4a6 [Xiangrui Meng] api/pyspark -> api/python
35bdeb9 [Xiangrui Meng] api/mllib -> api/scala
e4afaa8 [Xiangrui Meng] explicitly state what might change
mengxr authored and mateiz committed May 19, 2014
1 parent 4ce4793 commit df0aa83
Showing 10 changed files with 153 additions and 90 deletions.
6 changes: 5 additions & 1 deletion docs/_layouts/global.html
@@ -114,7 +114,11 @@
</div>

<div class="container" id="content">
<h1 class="title">{{ page.title }}</h1>
{% if page.displayTitle %}
<h1 class="title">{{ page.displayTitle }}</h1>
{% else %}
<h1 class="title">{{ page.title }}</h1>
{% endif %}

{{ content }}

125 changes: 86 additions & 39 deletions docs/mllib-basics.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Basics
title: Basics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
---

* Table of contents
@@ -26,11 +27,11 @@
of the vector.
<div data-lang="scala" markdown="1">

The base class of local vectors is
[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
using the factory methods implemented in
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
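// A short sketch of the factory methods described above:
// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying the indices and values
// of its nonzero entries.
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))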
@@ -53,11 +54,11 @@
Scala imports `scala.collection.immutable.Vector` by default, so you have to import `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
<div data-lang="java" markdown="1">

The base class of local vectors is
[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
using the factory methods implemented in
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create local vectors.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
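import org.apache.spark.mllib.linalg.Vectors;

// A short sketch of the factory methods described above:
// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero indices and values.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});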
@@ -78,13 +79,13 @@
MLlib recognizes the following types as dense vectors:

and the following as sparse vectors:

* MLlib's [`SparseVector`](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html).
* MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
* SciPy's
[`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
with a single column

We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
in [`Vectors`](api/pyspark/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.

{% highlight python %}
import numpy as np
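import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# A short sketch of the vector types listed above:
# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector via the factory methods.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))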
@@ -117,7 +118,7 @@
For multiclass classification, labels should be class indices starting from zero:
<div data-lang="scala" markdown="1">

A labeled point is represented by the case class
[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
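import org.apache.spark.mllib.regression.LabeledPoint

// A short sketch: create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))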
@@ -134,7 +135,7 @@
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
<div data-lang="java" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).

{% highlight java %}
import org.apache.spark.mllib.linalg.Vectors;
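import org.apache.spark.mllib.regression.LabeledPoint;

// A short sketch: create a labeled point with a positive label and a dense feature vector.
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));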
@@ -151,7 +152,7 @@
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
<div data-lang="python" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html).
[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).

{% highlight python %}
from pyspark.mllib.linalg import SparseVector
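from pyspark.mllib.regression import LabeledPoint

# A short sketch: create labeled points with dense and sparse feature vectors.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))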
@@ -184,28 +185,40 @@
After loading, the feature indices are converted to zero-based.
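(In the file itself, each line represents a labeled sparse feature vector in the form `label index1:value1 index2:value2 ...`, with one-based indices in ascending order, e.g., `1.0 1:2.5 3:1.0`.)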
<div class="codetabs">
<div data-lang="scala" markdown="1">

[`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
examples stored in LIBSVM format.

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
examples stored in LIBSVM format.

{% highlight java %}
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDD;
import org.apache.spark.api.java.JavaRDD;

RDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc, "mllib/data/sample_libsvm_data.txt");
JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "mllib/data/sample_libsvm_data.txt").toJavaRDD();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
examples stored in LIBSVM format.

{% highlight python %}
from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>
</div>
@@ -227,10 +240,10 @@
We are going to add sparse matrices in the next release.
<div data-lang="scala" markdown="1">

The base class of local matrices is
[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
matrices.

{% highlight scala %}
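import org.apache.spark.mllib.linalg.{Matrix, Matrices}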
@@ -244,10 +257,10 @@
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
<div data-lang="java" markdown="1">

The base class of local matrices is
[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices.

{% highlight java %}
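import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// A short sketch: create the dense 3x2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)),
// passing its values in column-major order.
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});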
@@ -269,6 +282,15 @@
and distributed matrices. Converting a distributed matrix to a different format may require a
global shuffle, which is quite expensive. We implemented three types of distributed matrices in
this release and will add more types in the future.

The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
We assume that the number of columns is not huge for a `RowMatrix`.
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and executing joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
backed by an RDD of its entries.

***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
@@ -284,7 +306,7 @@
limited by the integer range but it should be much smaller in practice.
<div class="codetabs">
<div data-lang="scala" markdown="1">

A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.

{% highlight scala %}
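import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()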
@@ -303,7 +325,7 @@
val n = mat.numCols()

<div data-lang="java" markdown="1">

A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.

{% highlight java %}
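import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from a JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();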
@@ -333,8 +355,8 @@
which could be faster if the rows are sparse.
<div class="codetabs">
<div data-lang="scala" markdown="1">

`RowMatrix#computeColumnSummaryStatistics` returns an instance of
[`MultivariateStatisticalSummary`](api/mllib/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

@@ -355,6 +377,31 @@
println(summary.numNonzeros) // number of nonzeros in each column
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;

RowMatrix mat = ... // a RowMatrix

// Compute column summary statistics.
MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
System.out.println(summary.variance()); // column-wise variance
System.out.println(summary.numNonzeros()); // number of nonzeros in each column

// Compute the covariance matrix.
Matrix cov = mat.computeCovariance();
{% endhighlight %}
</div>
</div>

### IndexedRowMatrix
@@ -366,9 +413,9 @@
an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector.
<div data-lang="scala" markdown="1">

An
[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
can be created from an `RDD[IndexedRow]` instance, where
[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.
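
A minimal sketch, assuming an existing `RDD[IndexedRow]`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
{% endhighlight %}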

@@ -391,9 +438,9 @@
val rowMat: RowMatrix = mat.toRowMatrix()
<div data-lang="java" markdown="1">

An
[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from a `JavaRDD<IndexedRow>` instance, where
[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

@@ -427,9 +474,9 @@
dimensions of the matrix are huge and the matrix is very sparse.
<div data-lang="scala" markdown="1">

A
[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
computation for `CoordinateMatrix`.
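
A minimal sketch, assuming an existing `RDD[MatrixEntry]`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
{% endhighlight %}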
@@ -453,21 +500,21 @@
val indexedRowMatrix = mat.toIndexedRowMatrix()
<div data-lang="java" markdown="1">

A
[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.

{% highlight scala %}
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix mat = new CoordinateMatrix(entries);
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());

// Get its size.
long m = mat.numRows();
5 changes: 3 additions & 2 deletions docs/mllib-clustering.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Clustering
title: Clustering - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
---

* Table of contents
@@ -40,7 +41,7 @@
a given dataset, the algorithm returns the best clustering result).
The following code snippets can be executed in `spark-shell`.

In the following example, after loading and parsing data, we use the
[`KMeans`](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
into two clusters. The number of desired clusters is passed to the algorithm. We then compute the Within
Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the
optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
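
A condensed sketch of that flow (the file name is illustrative; `sc` is an existing SparkContext):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data: one space-separated numeric vector per line.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing the Within Set Sum of Squared Errors.
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
{% endhighlight %}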
29 changes: 17 additions & 12 deletions docs/mllib-collaborative-filtering.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
title: Collaborative Filtering - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
---

* Table of contents
@@ -48,7 +49,7 @@
user for an item.

<div data-lang="scala" markdown="1">
In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS$)
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
method, which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

@@ -58,25 +59,29 @@
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map{ case Rating(user, product, rate) => (user, product)}
val predictions = model.predict(usersProducts).map{
case Rating(user, product, rate) => ((user, product), rate)
val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product)
}
val ratesAndPreds = ratings.map{
case Rating(user, product, rate) => ((user, product), rate)
val predictions =
model.predict(usersProducts).map { case Rating(user, product, rate) =>
((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}
3 changes: 2 additions & 1 deletion docs/mllib-decision-tree.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Decision Tree
title: Decision Tree - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Decision Tree
---

* Table of contents