Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11497] [MLlib] [Python] PySpark RowMatrix Constructor Has Type Erasure Issue #9458

Conversation

dusenberrymw
Copy link
Contributor

As noted in PR #9441, implementing tallSkinnyQR uncovered a bug with our PySpark RowMatrix constructor. As discussed on the dev list here, there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a RowMatrix from an RDD[Vector] in PythonMLlibAPI, the Vector type is erased, resulting in an RDD[Object]. Thus, when calling Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an Object cannot be cast to a Spark Vector. As noted in the aforementioned dev list thread, this issue was also encountered with DecisionTrees, and the fix involved an explicit retag of the RDD with a Vector type. IndexedRowMatrix and CoordinateMatrix do not appear to have this issue likely due to their related helper functions in PythonMLlibAPI creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the createRowMatrix helper function in PythonMLlibAPI. This PR blocks #9441, so once this is merged, the other can be rebased.

cc @holdenk

@SparkQA
Copy link

SparkQA commented Nov 4, 2015

Test build #44992 has finished for PR 9458 at commit c1258a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dusenberrymw
Copy link
Contributor Author

@mengxr This PR is ready for review and discussion.

@dusenberrymw
Copy link
Contributor Author

Note: The following reproduces this issue on the latest Git head, 1.5.1, and 1.5.0:

from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
mat = RowMatrix(rows)
mat._java_matrix_wrapper.call("tallSkinnyQR", True)

Should result in the following exception:

java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;

@jkbradley
Copy link
Member

Sorry for the long delay! I looked through the discussions, and this LGTM. I'll run tests once before merging it.

@SparkQA
Copy link

SparkQA commented Dec 11, 2015

Test build #2207 has finished for PR 9458 at commit c1258a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Merged with master, branch-1.6 and branch-1.5
This might not make 1.6.0, but we'll see.

asfgit pushed a commit that referenced this pull request Dec 11, 2015
…rasure Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <[email protected]>

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b82203)
Signed-off-by: Joseph K. Bradley <[email protected]>
asfgit pushed a commit that referenced this pull request Dec 11, 2015
…rasure Issue

As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.

This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  This PR blocks #9441, so once this is merged, the other can be rebased.

cc holdenk

Author: Mike Dusenberry <[email protected]>

Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.

(cherry picked from commit 1b82203)
Signed-off-by: Joseph K. Bradley <[email protected]>
@asfgit asfgit closed this in 1b82203 Dec 11, 2015
@dusenberrymw
Copy link
Contributor Author

Great, thanks @jkbradley!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants