[SPARK-12159] [ML] Add user guide section for IndexToString transformer #10166
Closed
Commits (9)
- 10ba98a: documentation for the IndexToString label transformer (BenFradet)
- 3e54dc0: removed setLabels calls for the IndexToString user guide (BenFradet)
- e7a549e: added line regarding labels being inferred from metadata (BenFradet)
- 27b2d0d: scala index to string example (BenFradet)
- d275a9a: java IndexToString example (BenFradet)
- 6327b7a: python IndexToString example (BenFradet)
- cb33653: include_example in the docs for IndexToString (BenFradet)
- 9591007: fixed python indent (BenFradet)
- 9398743: forgot whitespace (BenFradet)
@@ -835,10 +835,10 @@ dctDf.select("featuresDCT").show(3);

`StringIndexer` encodes a string column of labels to a column of label indices.
The indices are in `[0, numLabels)`, ordered by label frequencies.
So the most frequent label gets index `0`.
If the input column is numeric, we cast it to string and index the string
values. When downstream pipeline components such as `Estimator` or
`Transformer` make use of this string-indexed label, you must set the input
column of the component to this string-indexed column name. In many cases,
you can set the input column with `setInputCol`.
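The frequency ordering can be sketched in plain Python. This is a conceptual illustration only, not Spark's implementation; in particular, the tie-breaking order for equally frequent labels is an assumption of this sketch.

```python
from collections import Counter

def string_indexer_fit(values):
    # Count label frequencies, then assign index 0 to the most frequent label.
    # Alphabetical tie-breaking is an assumption here, not a Spark guarantee.
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

categories = ["a", "b", "c", "a", "a", "c"]
index_map = string_indexer_fit(categories)
indexed = [index_map[c] for c in categories]
# "a" (3 occurrences) -> 0.0, "c" (2) -> 1.0, "b" (1) -> 2.0
```

The resulting mapping matches the `StringIndexer` example below: `a` is the most frequent label and receives index `0.0`.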

**Examples**
@@ -951,9 +951,157 @@ indexed.show()
</div>
</div>

## IndexToString

Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
back to a column containing the original labels as strings. The common use case
is to produce indices from labels with `StringIndexer`, train a model with those
indices and retrieve the original labels from the column of predicted indices
with `IndexToString`. However, you are free to supply your own labels.

**Examples**

Building on the `StringIndexer` example, let's assume we have the following
DataFrame with columns `id` and `categoryIndex`:

~~~~
 id | categoryIndex
----|---------------
 0  | 0.0
 1  | 2.0
 2  | 1.0
 3  | 0.0
 4  | 0.0
 5  | 1.0
~~~~

Applying `IndexToString` with `categoryIndex` as the input column,
`originalCategory` as the output column and the previous `StringIndexer`'s
labels as labels, we are able to retrieve our original labels:

~~~~
 id | categoryIndex | originalCategory
----|---------------|------------------
 0  | 0.0           | a
 1  | 2.0           | b
 2  | 1.0           | c
 3  | 0.0           | a
 4  | 0.0           | a
 5  | 1.0           | c
~~~~
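Conceptually, `IndexToString` is just the inverse lookup: position `i` in the labels array holds the original string for index `i`. A minimal plain-Python sketch (illustrative, not the Spark code):

```python
def index_to_string(indices, labels):
    # labels[i] is the original string for index i, as produced by the indexer.
    return [labels[int(i)] for i in indices]

labels = ["a", "c", "b"]  # frequency order from the StringIndexer example
restored = index_to_string([0.0, 2.0, 1.0, 0.0, 0.0, 1.0], labels)
# restored == ["a", "b", "c", "a", "a", "c"]
```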
<div class="codetabs">
<div data-lang="scala" markdown="1">

Refer to the [IndexToString Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IndexToString)
for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c")
)).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)
val indexed = indexer.transform(df)

val converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
  .setLabels(indexer.labels)
Review comment: You probably don't need to specify labels; they should be pulled from column metadata.
val converted = converter.transform(indexed)
converted.select("id", "originalCategory").show()
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">

Refer to the [IndexToString Java docs](api/java/org/apache/spark/ml/feature/IndexToString.html)
for more details on the API.

{% highlight java %}
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
  RowFactory.create(0, "a"),
  RowFactory.create(1, "b"),
  RowFactory.create(2, "c"),
  RowFactory.create(3, "a"),
  RowFactory.create(4, "a"),
  RowFactory.create(5, "c")
));
StructType schema = new StructType(new StructField[]{
  new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
  new StructField("category", DataTypes.StringType, false, Metadata.empty())
});
DataFrame df = sqlContext.createDataFrame(jrdd, schema);
StringIndexerModel indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df);
DataFrame indexed = indexer.transform(df);

IndexToString converter = new IndexToString()
  .setInputCol("categoryIndex")
  .setOutputCol("originalCategory")
  .setLabels(indexer.labels());
DataFrame converted = converter.transform(indexed);
converted.select("id", "originalCategory").show();
{% endhighlight %}
</div>

<div data-lang="python" markdown="1">

Refer to the [IndexToString Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IndexToString)
for more details on the API.

{% highlight python %}
from pyspark.ml.feature import IndexToString, StringIndexer

df = sqlContext.createDataFrame([
    (0, "a"),
    (1, "b"),
    (2, "c"),
    (3, "a"),
    (4, "a"),
    (5, "c")
], ["id", "category"])

stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = stringIndexer.fit(df)
indexed = model.transform(df)
converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory", labels=model.labels)
converted = converter.transform(indexed)
converted.select("id", "originalCategory").show()
{% endhighlight %}
</div>
</div>
## OneHotEncoder

[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
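The encoding itself is simple to sketch in plain Python. This is illustrative only; Spark returns sparse vectors, and whether one category is dropped (so that it maps to the all-zero vector) is controlled by an encoder parameter.

```python
def one_hot(index, num_categories, drop_last=True):
    # With drop_last=True the vector has num_categories - 1 slots and the
    # last category is represented by the all-zero vector.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    i = int(index)
    if i < size:
        vec[i] = 1.0
    return vec

one_hot(0.0, 3)  # [1.0, 0.0]
one_hot(2.0, 3)  # [0.0, 0.0] -- the dropped last category
```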
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -979,10 +1127,11 @@ val indexer = new StringIndexer()
  .fit(df)
val indexed = indexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show()
{% endhighlight %}
</div>
@@ -1015,7 +1164,7 @@ JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
  RowFactory.create(5, "c")
));
StructType schema = new StructType(new StructField[]{
  new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
  new StructField("category", DataTypes.StringType, false, Metadata.empty())
});
DataFrame df = sqlContext.createDataFrame(jrdd, schema);

@@ -1029,6 +1178,7 @@ OneHotEncoder encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec");
DataFrame encoded = encoder.transform(indexed);
encoded.select("id", "categoryVec").show();
{% endhighlight %}
</div>
@@ -1054,6 +1204,7 @@ model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(includeFirst=False, inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
encoded.select("id", "categoryVec").show()
{% endhighlight %}
</div>
</div>
@@ -1582,7 +1733,7 @@ from pyspark.mllib.linalg import Vectors

data = [(Vectors.dense([1.0, 2.0, 3.0]),), (Vectors.dense([4.0, 5.0, 6.0]),)]
df = sqlContext.createDataFrame(data, ["vector"])
transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
                                 inputCol="vector", outputCol="transformedVector")
transformer.transform(df).show()
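The underlying operation is just a component-wise (Hadamard) product, sketched here in plain Python for clarity:

```python
def elementwise_product(scaling_vec, vec):
    # Multiply each component of the input vector by the matching weight.
    return [w * x for w, x in zip(scaling_vec, vec)]

elementwise_product([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])  # [0.0, 2.0, 6.0]
elementwise_product([0.0, 1.0, 2.0], [4.0, 5.0, 6.0])  # [0.0, 5.0, 12.0]
```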
@@ -1713,15 +1864,15 @@ print(output.select("features", "clicked").first())
sub-array of the original features. It is useful for extracting features from a vector column.

`VectorSlicer` accepts a vector column with specified indices, then outputs a new vector column
whose values are selected via those indices. There are two types of indices:

1. Integer indices that represent positions in the vector, set with `setIndices()`;

2. String indices that represent the names of features in the vector, set with `setNames()`.
*This requires the vector column to have an `AttributeGroup` since the implementation matches on
the name field of an `Attribute`.*

Specification by integer and by string are both acceptable. Moreover, you can use integer indices and
string names simultaneously. At least one feature must be selected. Duplicate features are not
allowed, so there can be no overlap between selected indices and names. Note that if names of
features are selected, an exception will be thrown when empty input attributes are encountered.
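The selection logic can be sketched in plain Python. This is a conceptual model only; the `attrs` dictionary stands in for the name-to-position mapping that Spark reads from the column's `AttributeGroup`.

```python
def vector_slicer(vec, indices=(), names=(), attrs=None):
    # Gather positions from the integer indices, then resolve feature
    # names to positions via attrs (the AttributeGroup stand-in).
    positions = list(indices)
    for name in names:
        positions.append(attrs[name])
    return [vec[p] for p in positions]

features = [0.0, 10.0, 0.5]
vector_slicer(features, indices=[1, 2])                # [10.0, 0.5]
vector_slicer(features, names=["f2", "f3"],
              attrs={"f1": 0, "f2": 1, "f3": 2})       # [10.0, 0.5]
```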
@@ -1734,9 +1885,9 @@ followed by the selected names (in the order given).
Suppose that we have a DataFrame with the column `userFeatures`:

~~~
 userFeatures
------------------
 [0.0, 10.0, 0.5]
~~~

`userFeatures` is a vector column that contains three user features. Assuming that the first column

@@ -1750,7 +1901,7 @@ column named `features`:
 [0.0, 10.0, 0.5] | [10.0, 0.5]
~~~

Suppose also that we have potential input attributes for `userFeatures`, i.e.
`["f1", "f2", "f3"]`; then we can use `setNames("f2", "f3")` to select them.
Review comment: Would you mind moving these to examples/ and pulling the code snippets in here using the include_example functionality? You can find examples of include_example in this .md file. This makes the examples easier to test and maintain.