
Evaluation Engine policy on datasets without a default target #13

Open

janvanrijn opened this issue Oct 4, 2018 · 14 comments


@janvanrijn
Member

Several datasets do not have a specific target. Also, multitask datasets do not have a single target, which complicates the calculation of meta-features such as classcount, entropy, landmarkers, and mean mutual information. There are several things we could do:

  • in case of no single/valid class, do not calculate these features (a rough sketch of what this could look like follows below)
  • define meta-features on the task level. We should do so anyway at some point. This does not solve the multi-target problem, though
  • ... ?

@mfeurer @amueller @joaquinvanschoren @berndbischl @giuseppec @ja-thomas
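To make the first option concrete, here is a minimal sketch of how target-dependent meta-features could be skipped when a dataset has no single valid default target. This is only an illustration under assumptions: the class, method, and meta-feature names are hypothetical and do not reflect the actual evaluation-engine API.

```java
// Hypothetical illustration of option 1: skip target-dependent
// meta-features when there is no single valid default target.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MetaFeaturePolicy {

    /** Target-dependent meta-features are only well defined for a single target attribute. */
    static boolean hasSingleValidTarget(String defaultTargetAttribute) {
        return defaultTargetAttribute != null
                && !defaultTargetAttribute.isEmpty()
                && !defaultTargetAttribute.contains(",");   // comma-separated = multi-target
    }

    static List<String> metaFeaturesToCompute(String defaultTargetAttribute) {
        // Target-free meta-features can always be computed.
        List<String> features = new ArrayList<>(Arrays.asList(
                "NumberOfInstances", "NumberOfFeatures", "NumberOfMissingValues"));
        if (hasSingleValidTarget(defaultTargetAttribute)) {
            // classcount, entropy, mutual information, landmarkers, ... need a target
            features.addAll(Arrays.asList(
                    "NumberOfClasses", "ClassEntropy", "MeanMutualInformation"));
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(metaFeaturesToCompute("class"));   // all meta-features
        System.out.println(metaFeaturesToCompute(null));      // target-free subset only
        System.out.println(metaFeaturesToCompute("y1,y2"));   // multi-target: subset only
    }
}
```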

@amueller

amueller commented Oct 4, 2018

I would do the first for now. Yes, ideally these should be on a task level, but honestly I feel like computing meta-features is something that can be done so easily locally that we shouldn't worry about it too much for now.
(yes, the local meta-features might not be as reproducible, but neither is the rest of the local pipeline).

@janvanrijn
Member Author

I would do the first for now.

Just to manage expectations: for now I would do neither of the two options, as this seems like a nice issue for a hackathon, to be picked up by someone from the community, or to be accompanied by a research project.

@janvanrijn
Member Author

I'm just saying this so that this discussion converges on the 'ideal' situation, rather than a quick-and-dirty hack that will do the trick for now.

@amueller

amueller commented Oct 4, 2018

I don't understand how you can not do the first option, if the first option is not to compute them.

@janvanrijn
Member Author

Coding-wise that would be rather trivial, but getting this into production requires additional time investment:

  • checking the exact set of meta-features that depend on a target feature
  • updating unit tests
  • code review
  • deleting meta-features from server
  • restarting a set of evaluation engine instances

Yes, this can all be done, but in combination with all the other maintenance tasks that I perform(ed) on the various components of OpenML, I would like to avoid over-committing to maintenance work.

@joaquinvanschoren
Contributor

joaquinvanschoren commented Oct 4, 2018 via email

@janvanrijn
Member Author

Shall I just look around for someone to do this now (or do it myself)?

It would be perfect if you could find someone. Otherwise I will do it once I have time for this. Please make sure to review PR #11 first, as it introduced a proper unit testing framework and the extension can build upon it.

@janvanrijn
Member Author

Joaquin and I have agreed that I would do the coding, and he would make sure that the meta-features get recalculated on the server.

Just for anyone who thought this would be an easy issue: please check the diff on PR #14 (+405 −368). This does not even take into account the changes I made to the Java connector (a different, more low-level library).

In order to properly unit test the functions that do the trick, I had to restructure some things. Furthermore, I found out that the "quick and fast" calculation of the first 10 meta-features involved some code duplication, and I fitted this into the general framework as well. This all required quite some changes, but altogether I think this update makes the repository more maintainable.
I also extended the set of "quick and fast" meta-features. All meta-features except landmarkers will be calculated instantly during the first pass over the dataset. (I see no reason not to do this, as these are all reasonably fast.)
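To illustrate what such a single pass could look like, here is a small sketch that accumulates simple (non-landmarker) statistics while streaming over the instances once. The class and method names are hypothetical, not the actual evaluation-engine code, and the target-dependent parts assume a single class label is available.

```java
// Hypothetical sketch of single-pass meta-feature computation: simple
// statistics are accumulated while iterating over the instances once;
// landmarkers (which require model fitting) are left out of this pass.
import java.util.HashMap;
import java.util.Map;

public class SinglePassMetaFeatures {
    private int numInstances = 0;
    private int numMissingValues = 0;
    private final Map<String, Integer> classCounts = new HashMap<>();

    /** Called once per instance during the single pass over the dataset. */
    public void update(double[] featureValues, String classLabel) {
        numInstances++;
        for (double v : featureValues) {
            if (Double.isNaN(v)) {
                numMissingValues++;
            }
        }
        if (classLabel != null) {   // only meaningful when a single target exists
            classCounts.merge(classLabel, 1, Integer::sum);
        }
    }

    public int getNumberOfInstances() { return numInstances; }

    public int getNumberOfMissingValues() { return numMissingValues; }

    public int getNumberOfClasses() { return classCounts.size(); }

    /** Shannon entropy of the class distribution, in bits. */
    public double getClassEntropy() {
        double entropy = 0.0;
        for (int count : classCounts.values()) {
            double p = (double) count / numInstances;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }
}
```

The point of the design is simply that everything except landmarkers can be expressed as running counts, so adding more "quick and fast" meta-features does not add extra passes over the data.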

@joaquinvanschoren I think this PR is ready for review. Given the size of the change, I would be surprised if there were no mistakes in it. Please have a thorough look at it, and feel free to run / extend the unit tests.

@berndbischl

My personal opinion is to remove the meta-features, except for a very small, well-defined set, until it is well defined how this would work for the remaining features. Which apparently it is not, hence this thread.
I know I have mentioned something similar before, and this might create some "anger" among the more meta-learning-interested crowd. Please believe me that this is not my intention, but I am willing to defend my position here if you find this worthwhile.

@joaquinvanschoren
Contributor

joaquinvanschoren commented Oct 4, 2018

@berndbischl: you mean remove the meta-features for datasets without a target feature and for multi-target datasets? That is exactly what we have done here.
For the normal supervised datasets the meta-features are well defined, and many other people and I use them regularly.

@janvanrijn
Member Author

I just spent 2 days refactoring and improving the part where meta-features get calculated. The PR is under review by @joaquinvanschoren and I think it will be merged soon. Removing them now seems like a bad idea :)

Additionally, I think that meta-features on the task level would solve almost all of our problems. This could be a nice task for a moderately experienced contributor.

@amueller

amueller commented Oct 5, 2018

I think @berndbischl meant removing all meta-features, which is also the direction I tend toward. At the current stage of OpenML it seems mostly a hassle.

@amueller

amueller commented Oct 5, 2018

I think 2 days of work are not an argument for design decisions.
I spent weeks on code for sklearn that got thrown away because it wasn't the right solution.

@janvanrijn
Member Author

I think 2 days of work are not an argument for design decisions.

I completely agree with this point.

I think @berndbischl meant remove all meta-features. Which is also the direction I tend to. At the current stage of openml it seems mostly a hassle.

This part I don't agree with. I do agree that the system currently contains meta-features that might not be as well-defined as we originally thought when we implemented them, so we can definitely change / improve that. But the meta-features are currently:

  • present in most of the search functionality on the frontend
  • used in the search functionality in the API, which was needed to make the selection for the OpenML100
  • mentioned in many of our publications and in publications by other researchers that cite OpenML
  • mentioned in all of our talks.

For these reasons I am highly against dropping this functionality.
