[SPARK-25584][ML][DOC] datasource for libsvm user guide
## What changes were proposed in this pull request?
It seems that documentation for the libsvm data source was not added in #22675.
This PR adds it.

## How was this patch tested?
The docs were built locally.
![Image](https://user-images.githubusercontent.com/7322292/62044350-4ad51480-b235-11e9-8f09-cbcbe9d3b7f9.png)

Closes #25286 from zhengruifeng/doc_libsvm_data_source.

Authored-by: zhengruifeng <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
zhengruifeng authored and srowen committed Aug 1, 2019
1 parent 80ab19b commit b29829e
Showing 2 changed files with 118 additions and 4 deletions.
2 changes: 1 addition & 1 deletion docs/_data/menu-ml.yaml
@@ -1,7 +1,7 @@
- text: Basic statistics
url: ml-statistics.html
- text: Data sources
url: ml-datasource
url: ml-datasource.html
- text: Pipelines
url: ml-pipeline.html
- text: Extracting, transforming and selecting features
120 changes: 117 additions & 3 deletions docs/ml-datasource.md
@@ -63,7 +63,7 @@ scala> df.select("image.origin", "image.width", "image.height").show(truncate=fa

<div data-lang="java" markdown="1">
[`ImageDataSource`](api/java/org/apache/spark/ml/source/image/ImageDataSource.html)
implements Spark SQL data source API for loading image data as DataFrame.
implements Spark SQL data source API for loading image data as a DataFrame.

{% highlight java %}
Dataset<Row> imagesDF = spark.read().format("image").option("dropInvalid", true).load("data/mllib/images/origin/kittens");
@@ -83,7 +83,7 @@ Will output:
</div>

<div data-lang="python" markdown="1">
In PySpark we provide Spark SQL data source API for loading image data as DataFrame.
In PySpark we provide Spark SQL data source API for loading image data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("image").option("dropInvalid", True).load("data/mllib/images/origin/kittens")
@@ -100,7 +100,7 @@ In PySpark we provide Spark SQL data source API for loading image data as DataFr
</div>

<div data-lang="r" markdown="1">
In SparkR we provide Spark SQL data source API for loading image data as DataFrame.
In SparkR we provide Spark SQL data source API for loading image data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/images/origin/kittens", "image")
@@ -120,4 +120,118 @@ In SparkR we provide Spark SQL data source API for loading image data as DataFra
</div>


</div>


## LIBSVM data source

The `LIBSVM` data source loads files in the LIBSVM format from a file or directory.
The loaded DataFrame has two columns: `label`, which holds the labels as doubles, and `features`, which holds the feature vectors as Vectors.
The schemas of the columns are:
- label: `DoubleType` (represents the instance label)
- features: `VectorUDT` (represents the feature vector)
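
As a rough illustration of the schema above, the sketch below parses one line of LIBSVM text (`label index:value ...`, with 1-based ascending indices) into a label and a sparse representation with 0-based indices, which is the shape the `features` column prints as `(size,[indices],[values])`. The helper name `parse_libsvm_line` and the sample line are hypothetical; this is plain Python for illustration, not Spark's implementation.

```python
from typing import List, Tuple


def parse_libsvm_line(line: str, num_features: int) -> Tuple[float, int, List[int], List[float]]:
    """Parse 'label idx:val idx:val ...' into (label, size, indices, values)."""
    tokens = line.split()
    label = float(tokens[0])  # the label is stored as a double
    indices: List[int] = []
    values: List[float] = []
    for token in tokens[1:]:
        idx_str, val_str = token.split(":", 1)
        index = int(idx_str) - 1  # LIBSVM indices are 1-based; store them 0-based
        if index < 0 or index >= num_features:
            raise ValueError(f"feature index {idx_str} is out of range for {num_features} features")
        if indices and index <= indices[-1]:
            raise ValueError("feature indices must be in ascending order")
        indices.append(index)
        values.append(float(val_str))
    return label, num_features, indices, values


# Made-up sample line: label 1, three non-zero features out of 780.
label, size, indices, values = parse_libsvm_line("1 3:0.5 10:2.0 780:1.0", num_features=780)
print(label, size, indices, values)  # 1.0 780 [2, 9, 779] [0.5, 2.0, 1.0]
```

Only the non-zero entries are stored, which is why the sample output in the tabs below shows vectors as a size followed by index and value lists.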

<div class="codetabs">
<div data-lang="scala" markdown="1">
[`LibSVMDataSource`](api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource)
implements the Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight scala %}
scala> val df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
[`LibSVMDataSource`](api/java/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
implements the Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight java %}
Dataset<Row> df = spark.read().format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt");
df.show(10);
/*
Will output:
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
*/
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">
In PySpark we provide the Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight python %}
>>> df = spark.read.format("libsvm").option("numFeatures", "780").load("data/mllib/sample_libsvm_data.txt")
>>> df.show(10)
+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows
{% endhighlight %}
</div>

<div data-lang="r" markdown="1">
In SparkR we provide the Spark SQL data source API for loading `LIBSVM` data as a DataFrame.

{% highlight r %}
> df = read.df("data/mllib/sample_libsvm_data.txt", "libsvm")
> head(select(df, df$label, df$features), 10)
   label                      features
1      0 <environment: 0x7fe6d35366e8>
2      1 <environment: 0x7fe6d353bf78>
3      1 <environment: 0x7fe6d3541840>
4      1 <environment: 0x7fe6d3545108>
5      1 <environment: 0x7fe6d354c8e0>
6      0 <environment: 0x7fe6d35501a8>
7      1 <environment: 0x7fe6d3555a70>
8      1 <environment: 0x7fe6d3559338>
9      0 <environment: 0x7fe6d355cc00>
10     0 <environment: 0x7fe6d35643d8>

{% endhighlight %}
</div>


</div>
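
The Scala, Java, and Python examples above pass `numFeatures` explicitly. When the option is left out, the vector size has to be determined from the data itself, which costs an additional pass over the input. A minimal plain-Python sketch of that inference follows, assuming the size is taken as the largest 1-based feature index seen; the helper `infer_num_features` and the sample lines are hypothetical and this is not Spark's code.

```python
from typing import Iterable


def infer_num_features(lines: Iterable[str]) -> int:
    """Return the largest 1-based feature index found, i.e. the inferred vector size."""
    max_index = 0
    for line in lines:
        for token in line.split()[1:]:  # skip the leading label
            index = int(token.split(":", 1)[0])
            max_index = max(max_index, index)
    return max_index


# Made-up sample lines in LIBSVM format.
sample = ["0 128:51 129:159", "1 159:124 160:253 692:12"]
print(infer_num_features(sample))  # 692
```

Because the inferred size depends on which rows are present, two files loaded separately may end up with different vector sizes. Setting `numFeatures` explicitly is one way to keep them consistent.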
