
[SPARK-11439][ML] Optimization of creating sparse feature without dense one #9756

Closed
@@ -24,7 +24,7 @@ import com.github.fommil.netlib.BLAS.{getInstance => blas}

import org.apache.spark.SparkContext
import org.apache.spark.annotation.{DeveloperApi, Since}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.{BLAS, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

@@ -131,39 +131,36 @@ object LinearDataGenerator {
eps: Double,
sparsity: Double): Seq[LabeledPoint] = {
require(0.0 <= sparsity && sparsity <= 1.0)
val rnd = new Random(seed)
val x = Array.fill[Array[Double]](nPoints)(
Array.fill[Double](weights.length)(rnd.nextDouble()))

val sparseRnd = new Random(seed)
x.foreach { v =>
var i = 0
val len = v.length
while (i < len) {
if (sparseRnd.nextDouble() < sparsity) {
v(i) = 0.0
} else {
v(i) = (v(i) - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
}
i += 1
}
}

val y = x.map { xi =>
blas.ddot(weights.length, xi, 1, weights, 1) + intercept + eps * rnd.nextGaussian()
}

y.zip(x).map { p =>
if (sparsity == 0.0) {
val rnd = new Random(seed)
val rndG = new Random(seed)
if (sparsity <= 0.0) {
Contributor:

sparsity is assumed to be between 0.0 and 1.0. Can we write if (sparsity == 0.0) here to clarify?

(0 until nPoints).map { _ =>
val features = Vectors.dense((0 until weights.length).map { i =>
(rnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
Member:

Maybe you could refactor this into a small def local to the method to avoid repeating it?

Member Author:

The way the array is built for dense and sparse features is different. I suspect it might be clunkier to try to find something common between them to refactor out into a local def.

Member:

This line is exactly the same though -- I'm talking about a local function.
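A minimal sketch of the local helper being suggested (the name randomFeature is hypothetical, and it assumes rnd, xMean, and xVariance are in scope inside the method, as in the surrounding code):

// Hypothetical local def: draw one uniform value and rescale it to the
// requested mean and variance, matching the repeated expression above.
def randomFeature(i: Int): Double =
  (rnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)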

}.toArray)
val label = BLAS.dot(Vectors.dense(weights), features) +
intercept + eps * rndG.nextGaussian()
// Return LabeledPoints with DenseVector
LabeledPoint(p._1, Vectors.dense(p._2))
} else {
LabeledPoint(label, features)
}
} else {
val sparseRnd = new Random(seed)
Member:

Why a second Random in this block?

Member Author:

The original code uses two RNGs. The sparse RNG is at line 138 in the original file.

Member:

Yes, but why keep it? It looks like an oversight and is not necessary.

(0 until nPoints).map { _ =>
val (values, indices) = (0 until weights.length).filter { _ =>
sparseRnd.nextDouble() >= sparsity }.map { i =>
Contributor:

sparsity specifies the ratio of non-zero values in the sparse vector. So, to be correct, we can write:

sparseRnd.nextDouble() <= sparsity

((rnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i), i)
}.unzip
Member:

I might be over-thinking this, but I wonder if it's significantly more efficient to choose the indices as an array, and then map that to values.
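A rough sketch of that alternative, purely illustrative and assuming the same rnd, sparseRnd, xMean, and xVariance as the surrounding code:

// Choose the kept indices first as an array, then map them to values.
val indices = (0 until weights.length)
  .filter(_ => sparseRnd.nextDouble() >= sparsity).toArray
val values = indices.map { i =>
  (rnd.nextDouble() - 0.5) * math.sqrt(12.0 * xVariance(i)) + xMean(i)
}
// Vectors.sparse(weights.length, indices, values) can then be built directly.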

val features = Vectors.sparse(weights.length, indices.toArray, values.toArray)
val label = BLAS.dot(Vectors.dense(weights), features) +
intercept + eps * rndG.nextGaussian()
Contributor:

Is there any reason to use rnd and rndG separately for the Gaussian distribution?
If keeping rnd and rndG separate reproduced the same test results, that would be reasonable.
But in this case the number of calls to rnd during vector generation has already changed, so the separation may no longer be necessary. Using only rnd should be sufficient.

Member Author:

I am working on this now. Regenerating results from the tests.
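For reference, a one-line sketch of what using only rnd would look like here, assuming the feature values above were drawn from the same rnd:

val label = BLAS.dot(Vectors.dense(weights), features) +
  intercept + eps * rnd.nextGaussian()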

// Return LabeledPoints with SparseVector
LabeledPoint(p._1, Vectors.dense(p._2).toSparse)
LabeledPoint(label, features)
}
}
}


Contributor:

Please remove the unnecessary empty line.
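For orientation, a hedged usage sketch of the generator changed above; the parameter list is inferred from the hunk header and the argument values are purely illustrative:

import org.apache.spark.mllib.util.LinearDataGenerator

// With sparsity > 0 the new code builds SparseVector features directly
// instead of creating a DenseVector and calling toSparse on it.
val points = LinearDataGenerator.generateLinearInput(
  intercept = 6.3,
  weights = Array(4.7, 7.2),
  xMean = Array(0.9, -1.3),
  xVariance = Array(0.7, 1.2),
  nPoints = 10000,
  seed = 42,
  eps = 0.1,
  sparsity = 0.2)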

/**
* Generate an RDD containing sample data for Linear Regression models - including Ridge, Lasso,
* and unregularized variants.
@@ -63,14 +63,14 @@ class RegressionEvaluatorSuite extends SparkFunSuite with MLlibTestSparkContext

// default = rmse
val evaluator = new RegressionEvaluator()
assert(evaluator.evaluate(predictions) ~== 0.1019382 absTol 0.001)
assert(evaluator.evaluate(predictions) ~== 0.09849278 absTol 0.001)

// r2 score
evaluator.setMetricName("r2")
assert(evaluator.evaluate(predictions) ~== 0.9998196 absTol 0.001)
assert(evaluator.evaluate(predictions) ~== 0.9998313 absTol 0.001)

// mae
evaluator.setMetricName("mae")
assert(evaluator.evaluate(predictions) ~== 0.08036075 absTol 0.001)
assert(evaluator.evaluate(predictions) ~== 0.07893991 absTol 0.001)
}
}
@@ -41,6 +41,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
In `LinearRegressionSuite`, we will make sure that the model trained by SparkML
is the same as the one trained by R's glmnet package. The following instruction
describes how to reproduce the data in R.
In a spark-shell, use the following code:

import org.apache.spark.mllib.util.LinearDataGenerator
val data =
@@ -591,21 +592,44 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
}

/*
Use the following R code to generate model training results.

predictions <- predict(fit, newx=features)
residuals <- label - predictions
> mean(residuals^2) # MSE
[1] 0.009720325
> mean(abs(residuals)) # MAD
[1] 0.07863206
> cor(predictions, label)^2# r^2
[,1]
s0 0.9998749
# Use the following R code to generate model training results.

# path/part-00000 is the file generated by running LinearDataGenerator.generateLinearInput
# as described before the beforeAll() method.
d1 <- read.csv("path/part-00000", header=FALSE, stringsAsFactors=FALSE)
fit <- glm(V1 ~ V2 + V3, data = d1, family = "gaussian")
f1 <- data.frame(as.numeric(d1$V2), as.numeric(d1$V3))
predictions <- predict(fit, newdata=f1)
l1 <- as.numeric(d1$V1)

residuals <- l1 - predictions
> mean(residuals^2) # MSE
[1] 0.009925853
> mean(abs(residuals)) # MAD
[1] 0.07975305
> cor(predictions, l1)^2 # r^2
[1] 0.9998722

> summary(fit)

Call:
glm(formula = V1 ~ V2 + V3, family = "gaussian", data = d1)

Deviance Residuals:
Min 1Q Median 3Q Max
-0.38568 -0.06770 0.00022 0.06724 0.39246

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.2988232 0.0018684 3371 <2e-16 ***
V2 4.7015664 0.0011880 3958 <2e-16 ***
V3 7.1995670 0.0009127 7888 <2e-16 ***
---
....
*/
assert(model.summary.meanSquaredError ~== 0.00972035 relTol 1E-5)
assert(model.summary.meanAbsoluteError ~== 0.07863206 relTol 1E-5)
assert(model.summary.r2 ~== 0.9998749 relTol 1E-5)
assert(model.summary.meanSquaredError ~== 0.009925853 relTol 1E-5)
assert(model.summary.meanAbsoluteError ~== 0.07975305 relTol 1E-5)
assert(model.summary.r2 ~== 0.9998722 relTol 1E-5)

// Normal solver uses "WeightedLeastSquares". This algorithm does not generate
// objective history because it does not run through iterations.
@@ -620,9 +644,9 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
// To clarify that the normal solver is used here.
assert(model.summary.objectiveHistory.length == 1)
assert(model.summary.objectiveHistory(0) == 0.0)
val devianceResidualsR = Array(-0.35566, 0.34504)
val seCoefR = Array(0.0011756, 0.0009032, 0.0018489)
val tValsR = Array(3998, 7971, 3407)
val devianceResidualsR = Array(-0.38568, 0.39246)
val seCoefR = Array(0.0011880, 0.0009127, 0.0018684)
val tValsR = Array(3958, 7888, 3371)
val pValsR = Array(0, 0, 0)
model.summary.devianceResiduals.zip(devianceResidualsR).foreach { x =>
assert(x._1 ~== x._2 absTol 1E-5) }