-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-6705][MLLIB] Add fit intercept api to ml logisticregression #5301
Conversation
I have the fit intercept enabled by default for logistic regression, I wonder what others think here. I understand that it enables allocation by default which is undesirable, but one needs to have a very strong reason for not having an intercept term enabled so it is the safer default from a statistical sense. Explicitly modeling the intercept by adding a column of all 1s does not work. I believe the reason is that since the API for LogisticRegressionWithLBFGS forces column normalization, and a column of all 1s has 0 variance so dividing by 0 kills it.
Can one of the admins verify this patch? |
@oefirouz Thanks for the PR! I agree we need to support this option, and that it should be set to True by default. Can you please make a JIRA and put it in the PR title? "[SPARK-####] [MLLIB] Add fit..." I believe you're correct about normalization causing problems for an all-ones column; the other issue is that LR needs to know not to regularize the intercept term. I'll look at the code now |
* param for fitting the intercept term | ||
* @group param | ||
*/ | ||
val fitIntercept: BooleanParam = new BooleanParam(this, "fitIntercept", "fits the intercept term or not") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"fits the intercept term or not" --> "Indicates whether to fit an intercept term"
It looks fine to me other than that one comment. However, could you please add a unit test (or modify an existing one) to test this feature in org/apache/spark/ml/classification/LogisticRegressionSuite.scala? Thanks! |
Added unit tests and changed docs in line with PR comments
@jkbradley Thanks for your comments! I've updated the PR. |
* param for fitting the intercept term | ||
* @group param | ||
*/ | ||
val fitIntercept: BooleanParam = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about this, can you please make this default to true? (Add "Some(true)" as a 4th argument, plus update doc.) Thanks!
@oefirouz Just a few more comments, and then that should be it. Thanks! |
Made the trait default true Changed float comparisons to === in unit tests
Forgot to update doc
ok to test |
Test FAILed. |
Whoops, add this to the logisticRegression and not the optimizer
Test PASSed. |
Test build #636 has started for PR 5301 at commit |
LGTM once the full tests pass |
Test build #636 has finished for PR 5301 at commit
|
One comment before we merge this: Is it useful to specify the bias constant? In LIBLINEAR, there is an option called |
I'm going to say no; that sounds like an extra option we could add later. More details on that API:
Later on, I could imagine us adding the option to regularize the intercept. |
I'll go ahead and merge this with master. Thanks @oefirouz ! |
I have the fit intercept enabled by default for logistic regression, I
wonder what others think here. I understand that it enables allocation
by default which is undesirable, but one needs to have a very strong
reason for not having an intercept term enabled so it is the safer
default from a statistical sense.
Explicitly modeling the intercept by adding a column of all 1s does not
work. I believe the reason is that since the API for
LogisticRegressionWithLBFGS forces column normalization, and a column of all
1s has 0 variance so dividing by 0 kills it.