[WIP][SPARK-1871][MLLIB] Improve MLlib guide for v1.0
Some improvements to the MLlib guide:

1. [SPARK-1872] Update API links for unidoc.
2. [SPARK-1783] Added `page.displayTitle` to the global layout. If it is defined, use it instead of `page.title` for title display.
3. Add more Java/Python examples.

Author: Xiangrui Meng <[email protected]>

Closes apache#816 from mengxr/mllib-doc and squashes the following commits:

ec2e407 [Xiangrui Meng] format scala example for ALS
cd9f40b [Xiangrui Meng] add a paragraph to summarize distributed matrix types
4617f04 [Xiangrui Meng] add python example to loadLibSVMFile and fix Java example
d6509c2 [Xiangrui Meng] [SPARK-1783] update mllib titles
561fdc0 [Xiangrui Meng] add a displayTitle option to global layout
195d06f [Xiangrui Meng] add Java example for summary stats and minor fix
9f1ff89 [Xiangrui Meng] update java api links in mllib-basics
7dad18e [Xiangrui Meng] update java api links in NB
3a0f4a6 [Xiangrui Meng] api/pyspark -> api/python
35bdeb9 [Xiangrui Meng] api/mllib -> api/scala
e4afaa8 [Xiangrui Meng] explicitly state what might change
mengxr authored and mateiz committed May 19, 2014
1 parent 4ce4793 commit df0aa83
Showing 10 changed files with 153 additions and 90 deletions.
6 changes: 5 additions & 1 deletion docs/_layouts/global.html
@@ -114,7 +114,11 @@
</div>

<div class="container" id="content">
<h1 class="title">{{ page.title }}</h1>
{% if page.displayTitle %}
<h1 class="title">{{ page.displayTitle }}</h1>
{% else %}
<h1 class="title">{{ page.title }}</h1>
{% endif %}

{{ content }}

125 changes: 86 additions & 39 deletions docs/mllib-basics.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Basics
title: Basics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
---

* Table of contents
@@ -26,11 +27,11 @@
of the vector.
<div data-lang="scala" markdown="1">

The base class of local vectors is
[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
using the factory methods implemented in
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.

{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
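// A short sketch of the factory methods described above:
// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying the indices and values
// of its nonzero entries.
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))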
@@ -53,11 +54,11 @@
Scala imports `scala.collection.immutable.Vector` by default, so you have to import `org.apache.spark.mllib.linalg.Vector` explicitly to use MLlib's `Vector`.
<div data-lang="java" markdown="1">

The base class of local vectors is
[`Vector`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
implementations: [`DenseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseVector) and
[`SparseVector`](api/mllib/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
[`Vector`](api/java/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/java/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
using the factory methods implemented in
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) to create local vectors.
[`Vectors`](api/java/org/apache/spark/mllib/linalg/Vector.html) to create local vectors.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
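import org.apache.spark.mllib.linalg.Vectors;

// A short sketch of the factory methods described above:
// Create a dense vector (1.0, 0.0, 3.0).
Vector dv = Vectors.dense(1.0, 0.0, 3.0);
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero indices and values.
Vector sv = Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0});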
@@ -78,13 +79,13 @@
MLlib recognizes the following types as dense vectors:

and the following as sparse vectors:

* MLlib's [`SparseVector`](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html).
* MLlib's [`SparseVector`](api/python/pyspark.mllib.linalg.SparseVector-class.html).
* SciPy's
[`csc_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix)
with a single column

We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
in [`Vectors`](api/pyspark/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.
in [`Vectors`](api/python/pyspark.mllib.linalg.Vectors-class.html) to create sparse vectors.

{% highlight python %}
import numpy as np
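import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# A short sketch of the vector types listed above:
# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector via the factory methods.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))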
@@ -117,7 +118,7 @@
For multiclass classification, labels should be class indices starting from zero:
<div data-lang="scala" markdown="1">

A labeled point is represented by the case class
[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
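import org.apache.spark.mllib.regression.LabeledPoint

// A short sketch: create a labeled point with a positive label and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))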
@@ -134,7 +135,7 @@
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
<div data-lang="java" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint).
[`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html).

{% highlight java %}
import org.apache.spark.mllib.linalg.Vectors;
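import org.apache.spark.mllib.regression.LabeledPoint;

// A short sketch: create a labeled point with a positive label and a dense feature vector.
LabeledPoint pos = new LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0));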
@@ -151,7 +152,7 @@
LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new double[] {1.0, 3.0}));
<div data-lang="python" markdown="1">

A labeled point is represented by
[`LabeledPoint`](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html).
[`LabeledPoint`](api/python/pyspark.mllib.regression.LabeledPoint-class.html).

{% highlight python %}
from pyspark.mllib.linalg import SparseVector
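from pyspark.mllib.regression import LabeledPoint

# A short sketch: create labeled points with dense and sparse feature vectors.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))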
@@ -184,28 +185,40 @@
After loading, the feature indices are converted to zero-based.
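(In the file itself, each line represents a labeled sparse feature vector in the form `label index1:value1 index2:value2 ...`, with one-based indices in ascending order, e.g., `1.0 1:2.5 3:1.0`.)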
<div class="codetabs">
<div data-lang="scala" markdown="1">

[`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
examples stored in LIBSVM format.

{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`MLUtils.loadLibSVMFile`](api/mllib/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
[`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training
examples stored in LIBSVM format.

{% highlight java %}
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDD;
import org.apache.spark.api.java.JavaRDD;

RDD<LabeledPoint> training = MLUtils.loadLibSVMFile(jsc, "mllib/data/sample_libsvm_data.txt");
JavaRDD<LabeledPoint> examples =
  MLUtils.loadLibSVMFile(jsc.sc(), "mllib/data/sample_libsvm_data.txt").toJavaRDD();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
[`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.util.MLUtils-class.html) reads training
examples stored in LIBSVM format.

{% highlight python %}
from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
{% endhighlight %}
</div>
</div>
@@ -227,10 +240,10 @@
We are going to add sparse matrices in the next release.
<div data-lang="scala" markdown="1">

The base class of local matrices is
[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
matrices.

{% highlight scala %}
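import org.apache.spark.mllib.linalg.{Matrix, Matrices}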
@@ -244,10 +257,10 @@
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
<div data-lang="java" markdown="1">

The base class of local matrices is
[`Matrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide one
implementation: [`DenseMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.DenseMatrix).
[`Matrix`](api/java/org/apache/spark/mllib/linalg/Matrix.html), and we provide one
implementation: [`DenseMatrix`](api/java/org/apache/spark/mllib/linalg/DenseMatrix.html).
Sparse matrices will be added in the next release. We recommend using the factory methods implemented
in [`Matrices`](api/mllib/index.html#org.apache.spark.mllib.linalg.Matrices) to create local
in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local
matrices.

{% highlight java %}
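import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Matrices;

// A short sketch: create the dense 3x2 matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)),
// passing its values in column-major order.
Matrix dm = Matrices.dense(3, 2, new double[] {1.0, 3.0, 5.0, 2.0, 4.0, 6.0});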
@@ -269,6 +282,15 @@
and distributed matrices. Converting a distributed matrix to a different format may require a
global shuffle, which is quite expensive. We implemented three types of distributed matrices in
this release and will add more types in the future.

The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors.
It is backed by an RDD of its rows, where each row is a local vector.
We assume that the number of columns is not huge for a `RowMatrix`.
An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and executing joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix) format,
backed by an RDD of its entries.

***Note***

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
@@ -284,7 +306,7 @@
limited by the integer range but it should be much smaller in practice.
<div class="codetabs">
<div data-lang="scala" markdown="1">

A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
created from an `RDD[Vector]` instance. Then we can compute its column summary statistics.

{% highlight scala %}
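import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()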
@@ -303,7 +325,7 @@
val n = mat.numCols()

<div data-lang="java" markdown="1">

A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.

{% highlight java %}
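import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;

JavaRDD<Vector> rows = ... // a JavaRDD of local vectors
// Create a RowMatrix from a JavaRDD<Vector>.
RowMatrix mat = new RowMatrix(rows.rdd());

// Get its size.
long m = mat.numRows();
long n = mat.numCols();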
@@ -333,8 +355,8 @@
which could be faster if the rows are sparse.
<div class="codetabs">
<div data-lang="scala" markdown="1">

`RowMatrix#computeColumnSummaryStatistics` returns an instance of
[`MultivariateStatisticalSummary`](api/mllib/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
[`RowMatrix#computeColumnSummaryStatistics`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) returns an instance of
[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

@@ -355,6 +377,31 @@
println(summary.numNonzeros) // number of nonzeros in each column
val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

[`RowMatrix#computeColumnSummaryStatistics`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html#computeColumnSummaryStatistics()) returns an instance of
[`MultivariateStatisticalSummary`](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.

{% highlight java %}
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary;

RowMatrix mat = ... // a RowMatrix

// Compute column summary statistics.
MultivariateStatisticalSummary summary = mat.computeColumnSummaryStatistics();
System.out.println(summary.mean()); // a dense vector containing the mean value for each column
System.out.println(summary.variance()); // column-wise variance
System.out.println(summary.numNonzeros()); // number of nonzeros in each column

// Compute the covariance matrix.
Matrix cov = mat.computeCovariance();
{% endhighlight %}
</div>
</div>

### IndexedRowMatrix
@@ -366,9 +413,9 @@
an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector.
<div data-lang="scala" markdown="1">

An
[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
can be created from an `RDD[IndexedRow]` instance, where
[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.
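
A minimal sketch, assuming an existing `RDD[IndexedRow]`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
{% endhighlight %}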

@@ -391,9 +438,9 @@
val rowMat: RowMatrix = mat.toRowMatrix()
<div data-lang="java" markdown="1">

An
[`IndexedRowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
[`IndexedRowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from a `JavaRDD<IndexedRow>` instance, where
[`IndexedRow`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
[`IndexedRow`](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices.

@@ -427,9 +474,9 @@
dimensions of the matrix are huge and the matrix is very sparse.
<div data-lang="scala" markdown="1">

A
[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`. In this release, we do not provide other
computation for `CoordinateMatrix`.
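
A minimal sketch, assuming an existing `RDD[MatrixEntry]`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
{% endhighlight %}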
@@ -453,21 +500,21 @@
val indexedRowMatrix = mat.toIndexedRowMatrix()
<div data-lang="java" markdown="1">

A
[`CoordinateMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
[`CoordinateMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from a `JavaRDD<MatrixEntry>` instance, where
[`MatrixEntry`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
[`MatrixEntry`](api/java/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`.

{% highlight scala %}
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
// Create a CoordinateMatrix from a JavaRDD<MatrixEntry>.
CoordinateMatrix mat = new CoordinateMatrix(entries);
CoordinateMatrix mat = new CoordinateMatrix(entries.rdd());

// Get its size.
long m = mat.numRows();
5 changes: 3 additions & 2 deletions docs/mllib-clustering.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Clustering
title: Clustering - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
---

* Table of contents
@@ -40,7 +41,7 @@
a given dataset, the algorithm returns the best clustering result).
The following code snippets can be executed in `spark-shell`.

In the following example, after loading and parsing data, we use the
[`KMeans`](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
into two clusters. The number of desired clusters is passed to the algorithm. We then compute the Within
Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the
optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
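
A condensed sketch of that flow (the file name is illustrative; `sc` is an existing SparkContext):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data: one space-separated numeric vector per line.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing the Within Set Sum of Squared Errors.
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)
{% endhighlight %}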
29 changes: 17 additions & 12 deletions docs/mllib-collaborative-filtering.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
title: Collaborative Filtering - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Collaborative Filtering
---

* Table of contents
@@ -48,7 +49,7 @@
user for an item.

<div data-lang="scala" markdown="1">
In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS$)
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
method, which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.

@@ -58,25 +59,29 @@
import org.apache.spark.mllib.recommendation.Rating

// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val rank = 10
val numIterations = 20
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map{ case Rating(user, product, rate) => (user, product)}
val predictions = model.predict(usersProducts).map{
case Rating(user, product, rate) => ((user, product), rate)
val usersProducts = ratings.map { case Rating(user, product, rate) =>
(user, product)
}
val ratesAndPreds = ratings.map{
case Rating(user, product, rate) => ((user, product), rate)
val predictions =
model.predict(usersProducts).map { case Rating(user, product, rate) =>
((user, product), rate)
}
val ratesAndPreds = ratings.map { case Rating(user, product, rate) =>
((user, product), rate)
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
val err = (r1 - r2)
err * err
}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}
3 changes: 2 additions & 1 deletion docs/mllib-decision-tree.md
@@ -1,6 +1,7 @@
---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Decision Tree
title: Decision Tree - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Decision Tree
---

* Table of contents