
Fast approximation of biglasso? #12

Closed
privefl opened this issue Dec 28, 2017 · 6 comments
Comments

@privefl
Owner

privefl commented Dec 28, 2017

Find out whether there is a fast, near-optimal rule approximation for fitting multivariate linear/logistic regression on biobank-scale datasets in a few hours (or minutes).

@kaneplusplus

You probably want to start with the STRONG rules, which eliminate regressors that don't project well onto the response, based on the penalty. You can then apply the lasso to the remaining regressors.
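For reference, the screening step is easy to sketch. Below is a minimal R version of the sequential strong rule from Tibshirani et al. (2012) — not taken from biglasso; the function and variable names are mine. At `lambda_k`, a predictor is discarded when its absolute correlation with the residuals from the previous `lambda` falls below `2 * lambda_k - lambda_prev`.

```r
# Sequential strong rule for the lasso (Tibshirani et al., 2012).
# Keeps predictor j at lambda_k only if
#   |x_j' resids| / n >= 2 * lambda_k - lambda_prev,
# where resids are the residuals from the fit at lambda_prev
# (predictors assumed standardized).
strong_rule_keep <- function(X, resids, lambda_k, lambda_prev) {
  cors <- abs(crossprod(X, resids)) / nrow(X)
  which(cors >= 2 * lambda_k - lambda_prev)
}
```

You then fit the lasso on the surviving columns only, and check the KKT conditions on the discarded ones to catch the (rare) violations.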

If you still have too much data, you can trade estimator accuracy for computational complexity, either by reducing the numerical accuracy of the slope coefficients in the standard implementation or by using ADMM. I personally prefer the former, which is probably faster when the data are randomly distributed over concurrent partitions.

@privefl
Owner Author

privefl commented Jan 2, 2018

Thanks for the tips.
I was thinking of maybe using the strong rules without checking the KKT conditions.

For now, I don't have the time to test this, but hopefully I will someday.
Or maybe someone else before me :-)

@kaneplusplus

I need to implement this for a book I'm writing. If you can wait a few weeks then I can provide a reference implementation.

@privefl
Owner Author

privefl commented Jan 3, 2018

Strong rules are implemented in the biglasso package by @YaohuiZeng. I'm also using the code from this package.

I'm looking forward to seeing your implementation.

@privefl mentioned this issue Jan 31, 2018
@kaneplusplus

The crux of STRONG is checking the KKT conditions. Below is reference code, similar to what I have in the book chapter, to do this.

Note that if you were going to optimize for performance, you'd probably want the vector of slope coefficients b to be sparse, and you could easily parallelize the call to apply for the case where you have a lot of active slope coefficients.

Also, note that you'd probably want to change the b == 0 check to test whether b is close to zero. I believe this tolerance is 1e-6 in the glmnet Mortran code.

kkt_violation <- function(X, y, b, lambda, alpha) {
  # Calculate the residuals.
  resids <- y - X %*% b

  # Calculate the projection of each variable onto the residuals.
  s <- apply(X, 2, function(xj) crossprod(xj, resids)) /
    lambda / alpha / nrow(X)

  # Return a vector indicating where the KKT conditions have been violated
  # by the variables that are currently zero.
  ret <- rep(FALSE, length(b))
  ret[(b == 0 & abs(s) >= 1)] <- TRUE
  ret
}
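A quick way to sanity-check the function above on simulated data (the variable names here are mine; this assumes kkt_violation from the snippet above is defined):

```r
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] * 2 + rnorm(n)   # only the first column is truly active
b <- rep(0, p)               # pretend every coefficient was screened out
# The first column is strongly correlated with y, so with all coefficients
# at zero it should be flagged as a KKT violation at a moderate lambda.
kkt_violation(X, y, b, lambda = 1, alpha = 1)
```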

@privefl
Copy link
Owner Author

privefl commented Feb 9, 2018

Now much faster with #14
