Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-11439][ML] Optimization of creating sparse feature without dense one #9756

Conversation

nakul02
Copy link
Member

@nakul02 nakul02 commented Nov 17, 2015

Sparse feature generated in LinearDataGenerator does not create dense vectors as an intermediate any more.

@SparkQA
Copy link

SparkQA commented Nov 17, 2015

Test build #46052 has finished for PR 9756 at commit b5038e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nakul02
Copy link
Member Author

nakul02 commented Nov 23, 2015

@Lewuathe, could you please take a look at this?

if (sparsity == 0.0) {
val rnd = new Random(seed)
val rndG = new Random(seed)
if (sparsity <= 0.0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sparsicy is assumed between 0.0 and 1.0. Can we write if (sparsity == 0.0) here to clarify?

@nakul02
Copy link
Member Author

nakul02 commented Nov 24, 2015

@Lewuathe - pushed fixes.

}.unzip
val features = Vectors.sparse(weights.length, indices.toArray, values.toArray)
val label = BLAS.dot(Vectors.dense(weights), features) +
intercept + eps * rndG.nextGaussian()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to use separately rnd and rndG for gaussian distribution?
I think that if the same test result will be achieved because we use rnd and rndG separetely, it is reasonable.
But in this case, the number of use of rnd in generation of vector is already changed. So it might be unnecessary anymore. Only use of rndis sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on this now. Regenerating results from the tests.

@Lewuathe
Copy link
Contributor

@nakul02 Thank you so much for quick updating!

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46573 has finished for PR 9756 at commit 70917f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nakul02
Copy link
Member Author

nakul02 commented Nov 24, 2015

@Lewuathe - sorry, it took a while to regenerate numbers for all the tests

@SparkQA
Copy link

SparkQA commented Nov 24, 2015

Test build #46578 has finished for PR 9756 at commit aa8fef1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

as.numeric.data.V3. 5.712356)
(Intercept) 5.260103
as.numeric.d1.V2. 3.725522
as.numeric.d1.V3. 5.711203
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you remove unnecessary white spaces?

@nakul02
Copy link
Member Author

nakul02 commented Nov 24, 2015

@Lewuathe - removed the extraneous whitespace.

@nakul02
Copy link
Member Author

nakul02 commented Nov 24, 2015

AmplabJenkins failed with the following error:
ERROR: Timeout after 15 minutes
ERROR: Error fetching remote repo 'origin'

I assume this is a temporary issue.

@nakul02
Copy link
Member Author

nakul02 commented Nov 25, 2015

jenkins retest this please

@SparkQA
Copy link

SparkQA commented Nov 25, 2015

Test build #46700 has finished for PR 9756 at commit 327babd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 1, 2015

Test build #46980 has finished for PR 9756 at commit 7c17454.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nakul02
Copy link
Member Author

nakul02 commented Dec 1, 2015

@Lewuathe - Can you take a look at this?

@Lewuathe
Copy link
Contributor

Lewuathe commented Dec 2, 2015

@nakul02 LGTM. It seems that retesting on Jenkins requires some admin privilege.

@nakul02
Copy link
Member Author

nakul02 commented Dec 3, 2015

Thanks @Lewuathe!
Retesting does require committer privileges. I tried and now I know

@holdenk
Copy link
Contributor

holdenk commented Dec 3, 2015

LGTM pending tests - maybe @mengxr or @srowen who are two of the more recent committers working in this file could take a look.

@holdenk
Copy link
Contributor

holdenk commented Dec 3, 2015

Also like +1 on having more of the R code in the tests comments so its easier to regenerate the next time we need to.

@SparkQA
Copy link

SparkQA commented Dec 3, 2015

Test build #47113 has finished for PR 9756 at commit 89d84b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

LabeledPoint(label, features)
}
} else {
val sparseRnd = new Random(seed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a second Random in this block?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original code uses two RNGs. The sparse RNG is at line 138 in the original file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but why keep it? it looks like an oversight and is not necessary.

@nakul02
Copy link
Member Author

nakul02 commented Dec 3, 2015

@srowen - I've refactored generateLinearInput.
I am not entirely sure what to set the thresholds at.
The correctness of the test depends on the way in which the RNG is called. Which is why I needed to re-evaulate all the results in R and stick in the numbers.
I do not have a background in ML and don't know how much of a threshold to set before the results are considered incorrect.

@srowen
Copy link
Member

srowen commented Dec 3, 2015

These tests haven't set thresholds in a principled way, and clearly they're wrong in some cases. It's easier to see when they're too tight. The goal is detect obviously wrong behavior, so erring on the side of too wide is OK. It's not a huge deal, but I think it might be worth correcting them where we go. Make them 10x bigger where they've failed.

assert(model.summary.r2 ~== 0.9998749 relTol 1E-5)
assert(model.summary.meanSquaredError ~== 0.00985449 relTol 1E-5)
assert(model.summary.meanAbsoluteError ~== 0.07961668 relTol 1E-5)
assert(model.summary.r2 ~== 0.9998737 relTol 1E-5)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increasing the threshold by 10x won't work here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, "or more" -- I'm just saying nobody expects an exact tolerance based on some principled analysis.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen - all of the tests in these 2 files (LinearRegressionSuite & RegressionEvaluatorSuite) seed the Random number generator with a fixed number (42 in this case).
The Random number generator then, for a given platform, should always create the same pseudo random sequence. When this is true, the tests will pass with the set thresholds (or maybe lower).
As luck would have it, tests haven't failed with either JDK7 or 8 on Linux or Mac (or so I understand).

For a "principled" analysis, for all possible pseudo-random sequences of a given size (10000 for one of the test cases), that are possible, one would calculate the result and the threshold. The lowest threshold allowed would then be set into the test. This is obviously a lot of work and could be set aside as a separate JIRA if someone really wants it done this way.

As to increasing it 10x, IMHO - there is no more discipline (or reason) in doing this than there was in setting it to the values that are present.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I get all that. I'm not suggesting trying a bunch of seeds though any data so generated should produce the same answer within some tolerance. Same goes for your new generation process. The fact that the test then fails means your data generation process is wrong or the test is. So, something has to be done right?

You did, but your change suggests that the 'expected value' of the data changed. It is not clear we should believe that. Hence fix the threshold and yes 10x isn't any more principled but has the advantage of being not incorrect in that it is too loose if anything.

Really the current change is only very slightly suboptimal and just pushes the tiny problem to a future change. Maybe it is worth punting on, even though making the test righter here seems easy.

@nakul02
Copy link
Member Author

nakul02 commented Dec 4, 2015

@srowen - I've increased the threshold values 10x times.

@srowen
Copy link
Member

srowen commented Dec 4, 2015

OK in the end I think it's close enough even if not 100% what I had in mind. But the right thing to do is a bit subjective here. Any other thoughts? let me retest it.

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #2172 has finished for PR 9756 at commit 60b0092.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 4, 2015

Test build #47203 has finished for PR 9756 at commit 60b0092.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nakul02
Copy link
Member Author

nakul02 commented Dec 4, 2015

@srowen - I agree, the right thing to do is a bit subjective. Nothing else to add though.

@srowen
Copy link
Member

srowen commented Dec 7, 2015

I'm going to merge this if there are no other comments.

@srowen
Copy link
Member

srowen commented Dec 8, 2015

Merged to master

@asfgit asfgit closed this in 037b7e7 Dec 8, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants