
Fast approximation of biglasso? #12

Closed
privefl opened this issue Dec 28, 2017 · 6 comments
Comments

@privefl
Owner

privefl commented Dec 28, 2017

Find out whether there is a fast, near-optimal rule approximation for fitting multivariate linear/logistic regression on biobank-scale datasets in a few hours (or minutes).

@kaneplusplus

You probably want to start with the STRONG rules, which eliminate regressors that don't project well onto the response, based on the penalty. You can then apply the lasso to the remaining regressors.
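For reference, the screening step is easy to sketch. Below is a minimal R version of the sequential strong rule from Tibshirani et al. (2012) — not taken from biglasso; the function and variable names are mine. At `lambda_k`, a predictor is discarded when its absolute correlation with the residuals from the previous `lambda` falls below `2 * lambda_k - lambda_prev`.

```r
# Sequential strong rule for the lasso (Tibshirani et al., 2012).
# Keeps predictor j at lambda_k only if
#   |x_j' resids| / n >= 2 * lambda_k - lambda_prev,
# where resids are the residuals from the fit at lambda_prev
# (predictors assumed standardized).
strong_rule_keep <- function(X, resids, lambda_k, lambda_prev) {
  cors <- abs(crossprod(X, resids)) / nrow(X)
  which(cors >= 2 * lambda_k - lambda_prev)
}
```

You then fit the lasso on the surviving columns only, and check the KKT conditions on the discarded ones to catch the (rare) violations.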

If you still have too much data, you can trade estimator accuracy for computational complexity, either by reducing the numerical accuracy of the slope coefficients in the standard implementation or by using ADMM. I personally prefer the former, which is probably faster when the data are randomly distributed over concurrent partitions.

@privefl
Owner Author

privefl commented Jan 2, 2018

Thanks for the tips.
I was thinking of maybe using the strong rules without checking the KKT conditions.

For now, I don't have the time to test this, but hopefully I will someday.
Or maybe someone else before me :-)

@kaneplusplus

I need to implement this for a book I'm writing. If you can wait a few weeks then I can provide a reference implementation.

@privefl
Owner Author

privefl commented Jan 3, 2018

Strong rules are implemented in the biglasso package by @YaohuiZeng. I'm also using the code from this package.

I'm looking forward to seeing your implementation.

@privefl mentioned this issue Jan 31, 2018
@kaneplusplus

The crux of STRONG is checking the KKT conditions. Below is reference code, similar to what I have in the book chapter, to do this.

Note that if you were going to optimize for performance, you'd probably want the vector of slope coefficients b to be sparse, and you could easily parallelize the call to apply for the case where you have a lot of active slope coefficients.

Also, note that you'd probably want to change the b == 0 check to test whether b is close to zero. I believe this tolerance is 1e-6 in the glmnet Mortran code.

kkt_violation <- function(X, y, b, lambda, alpha) {
  # Calculate the residuals.
  resids <- y - X %*% b

  # Calculate the projection of each variable onto the residuals.
  s <- apply(X, 2, function(xj) crossprod(xj, resids)) /
    lambda / alpha / nrow(X)

  # Return a vector indicating where the KKT conditions have been violated
  # by the variables that are currently zero.
  ret <- rep(FALSE, length(b))
  ret[(b == 0 & abs(s) >= 1)] <- TRUE
  ret
}
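A quick way to sanity-check the function above on simulated data (the variable names here are mine; this assumes kkt_violation from the snippet above is defined):

```r
set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] * 2 + rnorm(n)   # only the first column is truly active
b <- rep(0, p)               # pretend every coefficient was screened out
# The first column is strongly correlated with y, so with all coefficients
# at zero it should be flagged as a KKT violation at a moderate lambda.
kkt_violation(X, y, b, lambda = 1, alpha = 1)
```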

@privefl
Copy link
Owner Author

privefl commented Feb 9, 2018

Now much faster with #14
