[SPARK-11439][ML] Optimization of creating sparse feature without dense one #9756

nakul02 · 2015-11-17T03:57:36Z

Sparse feature generated in LinearDataGenerator does not create dense vectors as an intermediate any more.
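
A minimal sketch of the idea, in plain Scala without Spark on the classpath: decide which positions are non-zero first and draw values only for those, so no dense array of length `nFeatures` is ever built. The names `SparseFeatureSketch` and `sparseRow` are hypothetical, the convention that a feature is kept with probability `1 - sparsity` is an assumption, and the actual patch may differ. The two arrays returned are the kind of input that the diff in this PR passes to `Vectors.sparse(size, indices, values)`.

```scala
import scala.util.Random

object SparseFeatureSketch {
  /** Returns (indices, values) for one sparse row without building a dense row first. */
  def sparseRow(nFeatures: Int, sparsity: Double, rnd: Random): (Array[Int], Array[Double]) = {
    // Assumption: feature i is kept with probability (1 - sparsity).
    val kept = for (i <- 0 until nFeatures if rnd.nextDouble() >= sparsity)
      yield (i, rnd.nextDouble()) // a value is drawn only for the kept positions
    val (indices, values) = kept.unzip
    (indices.toArray, values.toArray)
  }
}
```

Because the indices come from an increasing range, they are already sorted, which is the order a sparse vector constructor typically expects.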

SparkQA · 2015-11-17T04:57:00Z

Test build #46052 has finished for PR 9756 at commit b5038e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nakul02 · 2015-11-23T19:13:59Z

@Lewuathe, could you please take a look at this?

Lewuathe · 2015-11-24T01:21:35Z

mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala

-      if (sparsity == 0.0) {
+    val rnd = new Random(seed)
+    val rndG = new Random(seed)
+    if (sparsity <= 0.0) {


sparsity is assumed to be between 0.0 and 1.0. Can we write if (sparsity == 0.0) here to clarify?

nakul02 · 2015-11-24T01:28:54Z

@Lewuathe - pushed fixes.

Lewuathe · 2015-11-24T01:31:31Z

mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala

+        }.unzip
+        val features = Vectors.sparse(weights.length, indices.toArray, values.toArray)
+        val label = BLAS.dot(Vectors.dense(weights), features) +
+          intercept + eps * rndG.nextGaussian()


Is there any reason to use separate generators, rnd and rndG, for the Gaussian distribution?
If keeping rnd and rndG separate reproduced the same test results as before, that would be reasonable.
But in this case the number of calls to rnd during vector generation has already changed, so the separation may no longer be necessary. Using only rnd is sufficient.

I am working on this now. Regenerating results from the tests.
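
To illustrate why the choice between one and two generators changes the expected numbers, here is a small sketch using plain `scala.util.Random`. It is not code from this patch, and the names `RngSketch` and `secondUniform` are hypothetical. It draws a uniform value, optionally draws a Gaussian from the same generator, and then returns a second uniform value.

```scala
import scala.util.Random

object RngSketch {
  /** Second uniform draw from a seeded generator, with or without a Gaussian draw in between. */
  def secondUniform(seed: Long, drawGaussianBetween: Boolean): Double = {
    val rnd = new Random(seed)
    rnd.nextDouble()                            // first uniform draw
    if (drawGaussianBetween) rnd.nextGaussian() // consumes state from the same stream
    rnd.nextDouble()                            // second uniform draw
  }
}
```

With a fixed seed each variant is reproducible on its own, but the two variants disagree on the second uniform draw. Moving the Gaussian noise onto the shared generator therefore changes the generated data, so the reference values in the tests have to be regenerated.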

Lewuathe · 2015-11-24T01:46:44Z

@nakul02 Thank you so much for the quick update!

SparkQA · 2015-11-24T02:26:50Z

Test build #46573 has finished for PR 9756 at commit 70917f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nakul02 · 2015-11-24T02:31:45Z

@Lewuathe - sorry, it took a while to regenerate numbers for all the tests

SparkQA · 2015-11-24T03:22:12Z

Test build #46578 has finished for PR 9756 at commit aa8fef1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Lewuathe · 2015-11-24T09:13:42Z

mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

-         as.numeric.data.V3. 5.712356)
+            (Intercept)       5.260103
+            as.numeric.d1.V2. 3.725522
+            as.numeric.d1.V3. 5.711203


Could you remove the unnecessary whitespace?

nakul02 · 2015-11-24T17:55:01Z

@Lewuathe - removed the extraneous whitespace.

nakul02 · 2015-11-24T18:06:04Z

AmplabJenkins failed with the following error:
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'

I assume this is a temporary issue.

nakul02 · 2015-11-25T02:02:10Z

jenkins retest this please

SparkQA · 2015-11-25T20:25:29Z

Test build #46700 has finished for PR 9756 at commit 327babd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-01T21:05:18Z

Test build #46980 has finished for PR 9756 at commit 7c17454.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nakul02 · 2015-12-01T21:35:00Z

@Lewuathe - Can you take a look at this?

Lewuathe · 2015-12-02T02:23:59Z

@nakul02 LGTM. It seems that retesting on Jenkins requires some admin privilege.

nakul02 · 2015-12-03T00:45:58Z

Thanks @Lewuathe!
Retesting does require committer privileges. I tried, and now I know.

holdenk · 2015-12-03T03:59:31Z

LGTM pending tests - maybe @mengxr or @srowen who are two of the more recent committers working in this file could take a look.

holdenk · 2015-12-03T04:01:29Z

Also +1 on having more of the R code in the test comments so it's easier to regenerate the next time we need to.

SparkQA · 2015-12-03T04:51:50Z

Test build #47113 has finished for PR 9756 at commit 89d84b8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2015-12-03T08:50:49Z

mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala

+        LabeledPoint(label, features)
+      }
+    } else {
+      val sparseRnd = new Random(seed)


Why a second Random in this block?

The original code uses two RNGs. The sparse RNG is at line 138 in the original file.

Yes, but why keep it? It looks like an oversight and is not necessary.

nakul02 · 2015-12-03T20:00:30Z

@srowen - I've refactored generateLinearInput.
I am not entirely sure what to set the thresholds at.
The correctness of the test depends on the way in which the RNG is called, which is why I needed to re-evaluate all the results in R and put in the new numbers.
I do not have a background in ML and don't know how much of a threshold to set before the results are considered incorrect.

srowen · 2015-12-03T20:05:50Z

These tests haven't set thresholds in a principled way, and clearly they're wrong in some cases. It's easier to see when they're too tight. The goal is to detect obviously wrong behavior, so erring on the side of too wide is OK. It's not a huge deal, but I think it might be worth correcting them as we go. Make them 10x bigger where they've failed.

nakul02 · 2015-12-03T20:12:06Z

mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

-      assert(model.summary.r2 ~== 0.9998749 relTol 1E-5)
+      assert(model.summary.meanSquaredError ~== 0.00985449 relTol 1E-5)
+      assert(model.summary.meanAbsoluteError ~== 0.07961668 relTol 1E-5)
+      assert(model.summary.r2 ~== 0.9998737 relTol 1E-5)


Increasing the threshold by 10x won't work here.

Yes, "or more" -- I'm just saying nobody expects an exact tolerance based on some principled analysis.

@srowen - all of the tests in these 2 files (LinearRegressionSuite & RegressionEvaluatorSuite) seed the Random number generator with a fixed number (42 in this case).
The Random number generator then, for a given platform, should always create the same pseudo random sequence. When this is true, the tests will pass with the set thresholds (or maybe lower).
As luck would have it, tests haven't failed with either JDK7 or 8 on Linux or Mac (or so I understand).

For a "principled" analysis, one would calculate the result and the threshold for every possible pseudo-random sequence of a given size (10000 for one of the test cases). The lowest threshold allowed would then be set in the test. This is obviously a lot of work and could be set aside as a separate JIRA if someone really wants it done this way.

As to increasing it 10x, IMHO - there is no more discipline (or reason) in doing this than there was in setting it to the values that are present.

Yes, I get all that. I'm not suggesting trying a bunch of seeds, though any data so generated should produce the same answer within some tolerance. The same goes for your new generation process. The fact that the test then fails means that either your data generation process is wrong or the test is. So something has to be done, right?

You did, but your change suggests that the 'expected value' of the data changed, and it is not clear we should believe that. Hence, fix the threshold. Yes, 10x isn't any more principled, but it has the advantage of not being incorrect: if anything, it is too loose.

Really the current change is only very slightly suboptimal and just pushes the tiny problem to a future change. Maybe it is worth punting on, even though making the test righter here seems easy.
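
The tolerance trade-off in this exchange can be made concrete with a small relative-tolerance check. This is a hypothetical helper written for illustration, not the `~==` / `relTol` implementation in Spark's test utilities, and the tolerance values tried below are illustrative only. The inputs are the old and new R reference values for the V3 coefficient quoted earlier in this thread, assuming the removed and added lines refer to the same coefficient.

```scala
object RelTolSketch {
  /** True when |x - y| is smaller than eps times the smaller of the two magnitudes. */
  def withinRelTol(x: Double, y: Double, eps: Double): Boolean =
    x == y || math.abs(x - y) < eps * math.min(math.abs(x), math.abs(y))

  def main(args: Array[String]): Unit = {
    val oldV3 = 5.712356 // value on the removed line of the diff
    val newV3 = 5.711203 // value on the added line of the diff
    for (eps <- Seq(1e-5, 1e-4, 1e-3))
      println(s"relTol $eps: ${withinRelTol(oldV3, newV3, eps)}")
  }
}
```

The two values differ by about 2e-4 in relative terms, so the check fails at 1e-5 and 1e-4 and passes at 1e-3. A tolerance that is too tight turns a harmless change in the random sequence into a test failure, while a looser one still catches results that are obviously wrong.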

nakul02 · 2015-12-04T18:39:42Z

@srowen - I've increased the threshold values 10x.

srowen · 2015-12-04T18:42:55Z

OK, in the end I think it's close enough, even if not 100% what I had in mind. But the right thing to do is a bit subjective here. Any other thoughts? Let me retest it.

SparkQA · 2015-12-04T19:27:45Z

Test build #2172 has finished for PR 9756 at commit 60b0092.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-12-04T19:36:29Z

Test build #47203 has finished for PR 9756 at commit 60b0092.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nakul02 · 2015-12-04T20:07:45Z

@srowen - I agree, the right thing to do is a bit subjective. Nothing else to add though.

srowen · 2015-12-07T09:39:30Z

I'm going to merge this if there are no other comments.

srowen · 2015-12-08T11:08:58Z

Merged to master

Efficient functional implementation of generateLinearInput

b5038e8

Lewuathe reviewed Nov 24, 2015

Bug fix in generation of sparse vector.

70917f8

Lewuathe reviewed Nov 24, 2015

Removed use of separate Random num generator for nextGaussian.

aa8fef1

Lewuathe reviewed Nov 24, 2015

Nakul Jindal added 4 commits November 24, 2015 09:44

Removed extraneous whitespace.

97c812f

Another attempt at removing extraneous space.

b1809a5

Removed extra line in LinearDataGenerator

841066b

Added newline to the end of LinearDataGenerator.

a195841

Changed (0 to a.length) to a.indices

327babd

Nakul Jindal added 3 commits November 25, 2015 14:18

Merge branch 'master' into SPARK-11439_sparse_without_creating_dense_…

7f5d016

…feature

Merge branch 'master' into SPARK-11439_sparse_without_creating_dense_…

3cda336

…feature

Merge branch 'master' into SPARK-11439_sparse_without_creating_dense_…

7c17454

…feature

Merge branch 'master' into SPARK-11439_sparse_without_creating_dense_…

89d84b8

…feature

srowen reviewed Dec 3, 2015

Nakul Jindal added 2 commits December 3, 2015 11:11

Merge branch 'master' into SPARK-11439_sparse_without_creating_dense_…

02e11b7

…feature

Refactored code in generateLinearInput per comments in PR.

97cc2bf

nakul02 reviewed Dec 3, 2015

Increased threshold to 10x of previous values.

60b0092

asfgit closed this in 037b7e7 Dec 8, 2015
