Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net #16139

Closed
wants to merge 4 commits into from

Conversation

sethah
Copy link
Contributor

@sethah sethah commented Dec 5, 2016

What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.

How was this patch tested?

Docs only. Generated documentation to make sure Latex looks ok.

@SparkQA
Copy link

SparkQA commented Dec 5, 2016

Test build #69661 has finished for PR 16139 at commit c39aa6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Unlike the original dataset which can only be stored in a distributed system,
these statistics can be loaded into memory on a single machine if the number of features is relatively small, and then we can solve the objective function through Cholesky factorization on the driver.
This objective function has an analytic solution and it requires only one pass over the data to collect necessary statistics to solve. For an
$n \times m$ data matrix, these statistics require only $O(m^2)$ storage and so can be stored on a single machine when $n$ (the number of features) is
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$n$ (the number of features) -> $m$ (the number of features)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thanks!

@SparkQA
Copy link

SparkQA commented Dec 5, 2016

Test build #69671 has finished for PR 16139 at commit 1049a6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor Author

sethah commented Dec 6, 2016

ping @yanboliang

Copy link
Contributor

@yanboliang yanboliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethah Thanks for working on this, I left some comments.

@@ -59,17 +59,22 @@ Given $n$ weighted observations $(w_i, a_i, b_i)$:

The number of features for each observation is $m$. We use the following weighted least squares formulation:
`\[
minimize_{x}\frac{1}{2} \sum_{i=1}^n \frac{w_i(a_i^T x -b_i)^2}{\sum_{k=1}^n w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
\min_{\mathbf{x}}\frac{1}{2} \sum_{i=1}^n \frac{w_i(\mathbf{a}_i^T \mathbf{x} -b_i)^2}{\sum_{k=1}^n w_k} + \frac{1}{2}\frac{\lambda}{\delta}\sum_{j=1}^m(\sigma_{j} x_{j})^2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This formulation is out of date, we should update it to include L1/elasticNet regularization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. Done.


WeightedLeastSquares only supports L2 regularization and provides options to enable or disable regularization and standardization.
In order to make the normal equation approach efficient, WeightedLeastSquares requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.
Spark ML currently supports two types of solvers for the normal equations: Cholesky factorization and Quasi-Newton methods (L-BFGS/OWL-QN). Cholesky factorization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML -> MLlib, "Spark ML" is not an official name of the component.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Spark ML currently supports two types of solvers for the normal equations: Cholesky factorization and Quasi-Newton methods (L-BFGS/OWL-QN). Cholesky factorization
depends on a positive definite covariance matrix (e.g. columns of the data matrix must be linearly independent) and will fail if this condition is violated. Quasi-Newton methods
are still capable of providing a reasonable solution even when the covariance matrix is not positive definite, so the normal equation solver can also fall back to
Quasi-Newton methods in this case. This fallback is currently always enabled for the `LinearRegression` estimator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and GeneralizedLinearRegression estimator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

are still capable of providing a reasonable solution even when the covariance matrix is not positive definite, so the normal equation solver can also fall back to
Quasi-Newton methods in this case. This fallback is currently always enabled for the `LinearRegression` estimator.

`WeightedLeastSquares` supports L1, L2, and elastic-net regularization and provides options to enable or disable regularization and standardization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding following clarification should be more clear?

  • For L2 or no regularization, Cholesky solver is the default choice and will fall back to Quasi-Newton solver if the covariance matrix is not positive definite.
  • For L1/elasticNet regularization, Quasi-Newton solver is the default and only choice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note about it. It's a bit unclear to me who the audience is here. Since WLS is private, this seems more informational than anything. So I just mentioned that L1 has no analytical solution and requires QN solver. Let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current note suffices.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The audience here is developers/contributors, and the current change is perfectly OK.

@SparkQA
Copy link

SparkQA commented Dec 7, 2016

Test build #69766 has finished for PR 16139 at commit 2ab9675.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the one nit.

WeightedLeastSquares only supports L2 regularization and provides options to enable or disable regularization and standardization.
In order to make the normal equation approach efficient, WeightedLeastSquares requires that the number of features be no more than 4096. For larger problems, use L-BFGS instead.
Spark MLlib currently supports two types of solvers for the normal equations: Cholesky factorization and Quasi-Newton methods (L-BFGS/OWL-QN). Cholesky factorization
depends on a positive definite covariance matrix (e.g. columns of the data matrix must be linearly independent) and will fail if this condition is violated. Quasi-Newton methods
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"e.g." -> "i.e."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@SparkQA
Copy link

SparkQA commented Dec 8, 2016

Test build #69838 has finished for PR 16139 at commit 4931133.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

LGTM, merged into master and branch-2.1. Thanks for all.

asfgit pushed a commit that referenced this pull request Dec 8, 2016
…or L1 and elastic-net

## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.

## How was this patch tested?

Docs only. Generated documentation to make sure Latex looks ok.

Author: sethah <[email protected]>

Closes #16139 from sethah/SPARK-18705.

(cherry picked from commit 8225361)
Signed-off-by: Yanbo Liang <[email protected]>
@asfgit asfgit closed this in 8225361 Dec 8, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
…or L1 and elastic-net

## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.

## How was this patch tested?

Docs only. Generated documentation to make sure Latex looks ok.

Author: sethah <[email protected]>

Closes apache#16139 from sethah/SPARK-18705.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…or L1 and elastic-net

## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.

## How was this patch tested?

Docs only. Generated documentation to make sure Latex looks ok.

Author: sethah <[email protected]>

Closes apache#16139 from sethah/SPARK-18705.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants