
[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440

Closed

Conversation

erikerlandson
Contributor

What changes were proposed in this pull request?

Using a test statistic as a measure of decision tree split quality provides a useful split-halting criterion that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.

https://issues.apache.org/jira/browse/SPARK-15699

http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/
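For concreteness, here is a minimal usage sketch of the proposed option. It is hypothetical: the impurity string "chisquared" and the reading of setMinInfoGain as a p-value threshold reflect this PR's proposal, not a released Spark API.

import org.apache.spark.ml.classification.DecisionTreeClassifier

// Hypothetical usage sketch: "chisquared" is the impurity name proposed in this PR;
// for test-statistic impurities, minInfoGain is interpreted as a p-value threshold.
val dt = new DecisionTreeClassifier()
  .setImpurity("chisquared")   // alongside the existing "gini" and "entropy"
  .setMinInfoGain(0.05)        // reject splits whose chi-squared p-value exceeds 0.05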

How was this patch tested?

I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.

@erikerlandson
Contributor Author

This is a re-submission of #13438 to fix the target branch.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59740 has finished for PR 13440 at commit 04c1316.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59745 has finished for PR 13440 at commit 1136518.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59751 has finished for PR 13440 at commit 6d38cfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2016

Test build #64309 has finished for PR 13440 at commit 6d38cfd.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Oct 7, 2016

Is this something you're still working on? If so, it would be good to merge in the latest master. We can also check with @jkbradley to see if he has some review bandwidth.

@erikerlandson
Contributor Author

@holdenk yes, I'll rebase it this week.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66679 has finished for PR 13440 at commit b199ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 11, 2016

Test build #66756 has finished for PR 13440 at commit 83f5e83.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@erikerlandson
Contributor Author

test this please

@SparkQA

SparkQA commented Oct 12, 2016

Test build #66766 has finished for PR 13440 at commit 83f5e83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@erikerlandson
Contributor Author

@holdenk @jkbradley looks like it's clean again

@wangmiao1981
Contributor

@erikerlandson Are you still working on this PR? Thanks! Miao

@erikerlandson
Contributor Author

Hi @wangmiao1981,

I am still interested in this, but I don't have any sense of whether upstream is interested. Does upstream have any intention of accepting it?

@SparkQA

SparkQA commented Feb 16, 2017

Test build #73006 has started for PR 13440 at commit 61cbf7c.

@shaneknapp
Contributor

I stopped the build as I need to restart Jenkins... I'll retrigger it when we're back up and running.

@wangmiao1981
Contributor

@erikerlandson I am just helping clear out the stale PRs. :) I have no idea whether they intend to accept it.

@shaneknapp
Contributor

test this please

@SparkQA

SparkQA commented Feb 16, 2017

Test build #73008 has finished for PR 13440 at commit 61cbf7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangmiao1981
Contributor

@thunterdb Can you take a look? Thanks!

*/
@Since("2.0.0")
@DeveloperApi
def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
Contributor

Scala: do not add a default implementation; it causes issues with Java compatibility.

Contributor

you cannot guarantee compatibility with existing code here, since you would break the bytecode either way.

Contributor Author

IIRC, there were other impurity measures using this 'unsupported' idiom. However, I'm fine with making it abstract, if that's the direction you are thinking.
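For reference, a sketch of the "unsupported" idiom mentioned above, as used by the existing mllib impurities for the calculate() variant they don't support (a paraphrase of the pattern, not the exact Spark source):

// Sketch of the existing pattern: e.g. Gini and Entropy implement the
// categorical-counts variant of calculate() and throw for the variance-style one.
def calculate(count: Double, sum: Double, sumSquares: Double): Double =
  throw new UnsupportedOperationException("ChiSquared.calculate")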

*/
@Since("2.0.0")
@DeveloperApi
def isTestStatistic: Boolean = false
Contributor

Scala: do not add a default implementation; it causes issues with Java compatibility.

* Utility functions for Impurity measures
*/
@Since("2.0.0")
@DeveloperApi
Contributor

there is no need for this object to be publicly exposed?

Contributor Author

I don't think so. I don't recall any specific motivation for making it public, and historically Spark seems to default things to "minimum visibility." The only method currently defined here is an implementation detail for hacking p-values into the existing "gain" system, where larger is assumed to be better.

.setMinInfoGain(0.01)
val treeModel = dt.fit(train)

// The tree should use exactly one of the 3 features: featue(0)
Contributor

nit: feature

Contributor Author

👍

@Since("2.0.0")
@Experimental
object ChiSquared extends Impurity {
private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest()
Contributor

why not a private val?

Contributor Author

IIRC it was to "allocate on the stack"
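For comparison, the two forms under discussion (a sketch only; both hold a single reusable ChiSquareTest instance, the object form being a lazily initialized singleton):

import org.apache.commons.math3.stat.inference.ChiSquareTest

object ChiSquaredSketch {
  // as in the PR: a lazily initialized singleton object
  private object CSTest extends ChiSquareTest

  // as the reviewer suggests: a plain field holding one reusable instance
  private val csTest = new ChiSquareTest()
}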

@erikerlandson
Contributor Author

Ready for review

@willb
Contributor

willb commented Jul 21, 2017

@thunterdb can you take a look at this now that 2.2 is out?

@erikerlandson
Contributor Author

rebased to latest head of master

@SparkQA

SparkQA commented Aug 1, 2017

Test build #80093 has finished for PR 13440 at commit bb2f660.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@willb
Contributor

willb commented Aug 23, 2017

@HyukjinKwon @thunterdb can you all take a look at this? It's been under review for quite a long time!

@HyukjinKwon
Member

I don't have enough ML knowledge to review this. I can cc ML committers who, judging from git blame, have some relevant expertise, but I'd hope for some sign-offs from them here first.

@srowen
Member

srowen commented Aug 24, 2017

I've not seen chi-squared used as a split statistic; when is it theoretically better than entropy, or something a bit more fundamental like KL divergence? It makes some sense, but it does require some assumptions about the data.

@erikerlandson
Contributor Author

@srowen I discuss some of these questions in the blog post, but the tl;dr is that split quality measures based on statistical tests with p-values are in some sense "less arbitrary." Specifying a p-value as a split-quality halting condition has essentially the same semantics regardless of the test. Most such tests also intrinsically account for decreasing population sizes: as the splitting progresses and population sizes decrease, it inherently takes a larger and larger population difference to meet the p-value threshold.

On the more pragmatic side, in that post I also demonstrate chi-squared split quality generating a more parsimonious tree than the other metrics, one that does a better job of ignoring poor-quality features.
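For a rough illustration of the computation being described (not the PR's exact code, but the same commons-math3 test it relies on):

import org.apache.commons.math3.stat.inference.ChiSquareTest

// Per-class label counts in the left and right children of a candidate split.
val leftCounts  = Array(40L, 10L)
val rightCounts = Array(15L, 35L)

// 2 x numClasses contingency table: rows are the two children, columns the classes.
val table = Array(leftCounts, rightCounts)

// p-value of a chi-squared test of independence between "which child" and "which class";
// a smaller p-value indicates a higher-quality split, and as populations shrink it takes
// a proportionally larger difference between the children to reach the same p-value.
val pval = new ChiSquareTest().chiSquareTest(table)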

@felixcheung
Member

@srowen @thunterdb any more thoughts on this?
how about @sethah @yanboliang @jkbradley?

@willb
Contributor

willb commented Oct 18, 2017

I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this?

@jsigee87

jsigee87 commented Apr 8, 2018

Is this still being considered?

@SparkQA

SparkQA commented Aug 2, 2018

Test build #94042 has finished for PR 13440 at commit bb2f660.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 21, 2018

Test build #95018 has finished for PR 13440 at commit bb2f660.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@willb
Contributor

willb commented Sep 17, 2018

@srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look? It's been open for quite a while now.

// a larger-is-better gain value for the minimum-gain threshold
val minGain =
  if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
  else metadata.minInfoGain
Member

Kind of a design question here... right now the caller has to switch logic based on what's inside metadata. Can methods like metadata.minInfoGain just implement different logic when the impurity is a test statistic, and so on? That is, push this down towards the impurity implementation? I wonder if isTestStatistic could go away with the right API, but I am not familiar with the details of what that would require.

Contributor Author

The main issue I recall was that all of the existing metrics assume some kind of "larger is better" gain, and p-values are "smaller is better." I'll take another pass over it and see if I can push that distinction down so it doesn't require exposing new methods.
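To make that concrete, this is the mapping the PR uses to fold smaller-is-better p-values into the existing larger-is-better gain machinery (the definition appears later in the diff):

// -log flips the ordering: smaller p-values (stronger evidence) map to larger "gain",
// and the 1e-20 floor keeps the result finite for vanishingly small p-values.
def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))

assert(pValToGain(0.001) > pValToGain(0.05))  // stronger evidence => larger gain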

* :: Experimental ::
* Class for calculating Chi Squared as a split quality metric during binary classification.
*/
@Since("2.2.0")
Member

This will have to be 2.5.0 for the moment

Contributor Author

I'll update those. 3.0 might be a good target, especially if I can't do this without the new isTestStatistic.

* Get this impurity instance.
* This is useful for passing impurity parameters to a Strategy in Java.
*/
@Since("1.1.0")
Member

I think I'd label all these as Since 2.5.0 even if they override a method that existed earlier.

*/
@Since("2.0.0")
@DeveloperApi
def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))
Member

Private to Spark?

*/
@Since("2.2.0")
@DeveloperApi
def isTestStatistic: Boolean
Member

Adding methods to a public trait is technically an API-breaking change. This might be considered a Developer API even though it's not labeled that way. Still, if we can avoid adding to the API here, it'd be better.

Contributor Author

Can this be customized or extended externally to Spark? I'm wondering why it is public.

*/
@Since("2.2.0")
@DeveloperApi
def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =
Member

It looks like this new method doesn't make sense to implement for the existing implementations, only the new one. That kind of suggests to me it isn't part of the generic API for an impurity. Is this really something that belongs inside the logic of the implementations? Maybe there's a more general method that needs to be exposed, which can then be specialized for all implementations.

Contributor Author

I'll consider if there's a unifying idea here. pval-based metrics require integrating information across the new split children, which I believe was not the case for existing methods.

Contributor Author

I suspect that the generalization is closer to my newer signature
val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
where you have all the context from the left and right nodes. The existing gain-based calculation should fit into this framework, just doing its current weighted average of purity gain.
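A sketch of how the existing gain calculation could fit that same (left, right) signature, using ImpurityCalculator's existing count/copy/add/calculate methods (illustrative only; this is not code from the PR):

import org.apache.spark.mllib.tree.impurity.ImpurityCalculator

def weightedGain(left: ImpurityCalculator, right: ImpurityCalculator): Double = {
  val nL = left.count.toDouble
  val nR = right.count.toDouble
  val n = nL + nR
  val parent = left.copy.add(right)  // parent statistics are the sum of the two children
  // classic larger-is-better information gain: parent impurity minus the
  // weighted average of the child impurities
  parent.calculate() - (nL / n) * left.calculate() - (nR / n) * right.calculate()
}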

Contributor Author

@srowen @willb
I've paged the design of the metrics back into my head. In general, Impurity already uses methods that are only defined on certain impurity subclasses, so this new method does not change that situation.

My take on the "problem" is that the existing measures are all based on a localized concept of "purity" (or impurity) that can be calculated using only the data at a single node. Splitting based on statistical tests (p-values) breaks that model, since it is making use of a more generalized concept of split quality that requires the sample populations of both children from a candidate split. A maximally general signature would probably involve the parent and both children.

Another kink in the current design is that ImpurityCalculator is essentially parallel to Impurity, and in fact ImpurityCalculator#calculate() is how impurity measures are currently requested. Impurity seems somewhat redundant, and might be factored out in favor of ImpurityCalculator. The current signature calculate() might be generalized into a more inclusive concept of split quality that expects to make use of {parent,left,right}.

Calls to calculate() are not very widespread, but threading that change through is outside the scope of this particular PR. If people are interested in that kind of refactoring, I could look into it in the near future, but probably not in the next couple of weeks.

That kind of change would also be API-breaking, and so a good target for 3.0.

@srowen
Member

srowen commented Sep 22, 2018

Yeah, I take your point that the Impurity trait already defines two methods, only one of which is implemented for each of the subclasses. It's already a funky design that probably should have been generalized differently. I think a rewrite for Spark 3 would be worthwhile, personally. I'm also not quite sure of the difference between the Impurity and ImpurityCalculator classes; it seems like Impurity should fold into ImpurityCalculator.

Is the single method we really want to define something like computeInformationGain(ImpurityCalculator, ImpurityCalculator)? Even the new method you've added is not directly computing info gain, nor were the existing ones in Impurity. But that's the thing we need an abstraction for, over several implementations, it seems.

Well, I think either this gets a bigger redesign in 3.0, or we try to get it into 2.5 and accept some API changes. I think I lean towards a bolder breaking change to fix it up in 3.0, unless there's a pressing need for this metric.
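A minimal sketch of the single abstraction being suggested (the trait and method names here are hypothetical, not an existing Spark API):

import org.apache.spark.mllib.tree.impurity.ImpurityCalculator

trait SplitCriterion {
  // Larger-is-better quality of splitting a node into the given left and right children;
  // gini/entropy would implement this as weighted info gain, chi-squared as a transformed p-value.
  def computeInformationGain(left: ImpurityCalculator, right: ImpurityCalculator): Double
}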

@erikerlandson
Contributor Author

I think targeting 3.0 with a refactor makes the most sense. There's no way to do this without making small breaking changes, but slightly larger changes could clean up the design. ImpurityCalculator can subsume Impurity, and a more general rethinking of gain and impurity can be accommodated too.

@erikerlandson
Contributor Author

Update: I'm consulting with some teammates about what it might mean to also support Bayesian variations on split quality, since there has been a lot of interest in recent years in Bayesian alternatives to more traditional p-value-based statistics.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 18, 2020
@github-actions github-actions bot closed this Jan 20, 2020