[SPARK-11439][ML] Optimization of creating sparse feature without dense one #9756
@@ -24,7 +24,7 @@ import com.github.fommil.netlib.BLAS.{getInstance => blas}

 import org.apache.spark.SparkContext
 import org.apache.spark.annotation.{DeveloperApi, Since}
-import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.linalg.{BLAS, Vectors}
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.RDD
@@ -131,39 +131,36 @@ object LinearDataGenerator {
       eps: Double,
       sparsity: Double): Seq[LabeledPoint] = {
     require(0.0 <= sparsity && sparsity <= 1.0)
-    val rnd = new Random(seed)
-    val x = Array.fill[Array[Double]](nPoints)(
-      Array.fill[Double](weights.length)(rnd.nextDouble()))
-
-    val sparseRnd = new Random(seed)
-    x.foreach { v =>
-      var i = 0
-      val len = v.length
-      while (i < len) {
-        if (sparseRnd.nextDouble() < sparsity) {
-          v(i) = 0.0
-        } else {
-          v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
-        }
-        i += 1
-      }
-    }
-
-    val y = x.map { xi =>
-      blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
-    }
-
-    y.zip(x).map { p =>
-      if (sparsity == 0.0) {
+    val rnd = new Random(seed)
+    val rndG = new Random(seed)
+    if (sparsity <= 0.0) {
+      (0 until nPoints).map { _ =>
+        val features = Vectors.dense((0 until weights.length).map { i =>
+          (rnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
Comment: Maybe you could refactor this into a small `def`?
Reply: The way the array is built for dense and sparse features is different. I suspect it might be more clunky to try and find something common between them to refactor something out into a local def.
Comment: This line is exactly the same though -- I'm talking about a local function.
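The duplicated expression under discussion is the uniform-to-target rescaling. A minimal sketch, with illustrative names rather than the PR's actual code, of what a shared function could look like:

```scala
import scala.util.Random

// Sketch of the reviewer's suggestion: factor the shared scaling expression
// into one function. A Uniform(0,1) draw has mean 0.5 and variance 1/12, so
// (u - 0.5) * sqrt(12 * variance) + mean yields the requested mean/variance.
object RescaleSketch {
  def rescale(u: Double, mean: Double, variance: Double): Double =
    (u - 0.5) * math.sqrt(12.0 * variance) + mean

  def main(args: Array[String]): Unit = {
    val rnd = new Random(42L)
    val sample = Array.fill(100000)(rescale(rnd.nextDouble(), 2.0, 3.0))
    val empiricalMean = sample.sum / sample.length
    println(f"empirical mean ~ $empiricalMean%.2f") // should land near 2.0
  }
}
```

Both the dense and the sparse branch could then call `rescale` instead of repeating the formula.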
+        }.toArray)
+        val label = BLAS.dot(Vectors.dense(weights), features) +
+          intercept + eps * rndG.nextGaussian()
-        // Return LabeledPoints with DenseVector
-        LabeledPoint(p._1, Vectors.dense(p._2))
-      } else {
+        LabeledPoint(label, features)
+      }
+    } else {
+      val sparseRnd = new Random(seed)
Comment: Why a second `Random`?
Reply: The original code uses two RNGs. The sparse RNG is at line 138 in the original file.
Comment: Yes, but why keep it? It looks like an oversight and is not necessary.
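For context on this exchange: two `Random` instances constructed with the same seed emit identical streams, so a second generator with the same seed adds nothing unless the two are consumed in different patterns. A small illustration (assumed names, not PR code):

```scala
import scala.util.Random

// Two Randoms with the same seed produce the same sequence of draws.
object SameSeedDemo {
  def firstDraws(seed: Long, n: Int): Seq[Double] = {
    val r = new Random(seed)
    Seq.fill(n)(r.nextDouble())
  }

  def main(args: Array[String]): Unit = {
    // Identical seeds, identical streams:
    println(firstDraws(7L, 3) == firstDraws(7L, 3)) // true
  }
}
```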
+      (0 until nPoints).map { _ =>
+        val (values, indices) = (0 until weights.length).filter { _ =>
+          sparseRnd.nextDouble() >= sparsity }.map { i =>
+          ((rnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i), i)
+        }.unzip
Comment: I might be over-thinking this, but I wonder if it's significantly more efficient to choose the indices as an array, and then map that to values.
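A sketch of the reviewer's alternative: select the kept indices first, then map them to values, avoiding the `(value, index)` tuple build plus `unzip`. All names here are illustrative, not the PR's actual code:

```scala
import scala.util.Random

// Indices-first construction of one sparse row: filter the index range once,
// then generate a value only for each surviving index.
object IndicesFirstSketch {
  def sparseRow(
      dim: Int,
      sparsity: Double,
      keepRnd: Random,
      valueRnd: Random,
      xMean: Array[Double],
      xVariance: Array[Double]): (Array[Int], Array[Double]) = {
    val indices = (0 until dim).filter(_ => keepRnd.nextDouble() >= sparsity).toArray
    val values = indices.map { i =>
      (valueRnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
    }
    (indices, values)
  }

  def main(args: Array[String]): Unit = {
    val (idx, vals) = sparseRow(10, 0.5, new Random(1L), new Random(2L),
      Array.fill(10)(1.0), Array.fill(10)(1.0))
    println(idx.length == vals.length) // true: one value per kept index
  }
}
```

This also keeps `indices` sorted by construction, which sparse-vector constructors typically require.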
+        val features = Vectors.sparse(weights.length, indices.toArray, values.toArray)
+        val label = BLAS.dot(Vectors.dense(weights), features) +
+          intercept + eps * rndG.nextGaussian()
Comment: Is there any reason to use `rnd` and `rndG` separately?
Reply: I am working on this now. Regenerating results from the tests.
-        // Return LabeledPoints with SparseVector
-        LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
+        LabeledPoint(label, features)
       }
     }
   }
Comment: Please remove the unnecessary empty line.
   /**
    * Generate an RDD containing sample data for Linear Regression models - including Ridge, Lasso,
    * and unregularized variants.
Comment: `sparsity` is assumed to be between 0.0 and 1.0. Can we write `if (sparsity == 0.0)` here to clarify?
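On this last point: given the `require` at the top of the method, `sparsity <= 0.0` and `sparsity == 0.0` select the same branch, so the suggestion is purely about readability. A minimal check of that equivalence, with an illustrative name:

```scala
// Under require(0.0 <= sparsity && sparsity <= 1.0), the only value for which
// sparsity <= 0.0 holds is exactly 0.0, so == and <= are interchangeable here.
object SparsityGuard {
  def isDenseBranch(sparsity: Double): Boolean = {
    require(0.0 <= sparsity && sparsity <= 1.0)
    sparsity <= 0.0
  }

  def main(args: Array[String]): Unit =
    println(isDenseBranch(0.0)) // true
}
```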