[FEA] Genetic Programming for Feature Engineering #2121
Comments
@aerdem4 so, are you only looking for a gpu-accelerated |
@teju85 I think all of them are the same except the metric. Multiple options for the metric would be nice, but Spearman is the most useful.
Alright, whose idea of a joke was it to tag this with Good First Issue? I'm looking at you @WXBN ! ;)
@aerdem4 we are going to have an intern provide us with an initial implementation of this in cuML! For starters, can we assume a max program AST depth of 10 or so? Or do you think that's too low to begin with? In practice, what's the deepest program you've come across?
@teju85 thanks for the good news! I think 10 is enough for AST depth. Generated features don't need to be very complex, but they should capture the interactions the model can't. If the intern needs any help, I would be happy to be involved, btw.
Tagging @vimarsh6739, who'll be implementing this.
A simple Kaggle test case: I can also create artificial datasets to test whether GP can reverse-engineer the features that contribute to the target.
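One way such an artificial dataset could be built is sketched below. This is only an illustration: `make_dataset`, `mean_abs_error`, and the formula in `hidden` are hypothetical names and choices made for this example, not anything taken from gplearn or cuML. The idea is to plant a known feature in the target, then check that an expression matching it scores better than an unrelated one.

```python
import random

def make_dataset(hidden_feature, n_rows=500, n_cols=4, noise=0.05, seed=0):
    """Build rows of uniform random columns and a target driven by a known formula."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    # The target is the planted feature plus a little Gaussian noise.
    y = [hidden_feature(row) + rng.gauss(0.0, noise) for row in X]
    return X, y

def mean_abs_error(candidate, X, y):
    """Average absolute gap between a candidate expression and the target."""
    return sum(abs(candidate(row) - t) for row, t in zip(X, y)) / len(y)

# Hypothetical ground truth that a GP run would be expected to rediscover.
def hidden(row):
    return row[0] * row[1] - row[2]

X, y = make_dataset(hidden)
err_true = mean_abs_error(hidden, X, y)                   # planted formula
err_wrong = mean_abs_error(lambda r: r[0] + r[1], X, y)   # unrelated guess
print(f"planted: {err_true:.3f}  unrelated: {err_wrong:.3f}")
```

Because the planted formula is known, the best program a GP run returns can be compared directly against it, which is harder to do with a real Kaggle dataset.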
This PR proposes some of the basic, core (GPU-friendly!) data structures for implementing gplearn in cuML, in order to address issue #2121. Tagging all who will be involved in this development: @vinaydes @venkywonka @vimarsh6739.
PS: It also contains an experimental register-based stack implementation that will be useful when implementing CUDA-based AST evaluation, which is needed for organizing tournaments.
Authors: - Thejaswi. N. S (@teju85)
Approvers: - Corey J. Nolet (@cjnolet)
URL: #3387
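To make the link between stack-based AST evaluation and tournaments concrete, here is a minimal CPU-side sketch in Python. It is not the data structure from the PR: the node encoding (`("var", i)`, `("const", v)`, `("add",)` and so on), the function names, and the mean-absolute-error fitness are all assumptions made for illustration. A program is stored as a flat, prefix-ordered list and evaluated by scanning it right to left with a stack, which avoids recursion and pointer-chasing.

```python
import random

# Hypothetical binary operators for this sketch.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def evaluate(program, row):
    """Evaluate a prefix-ordered program on one data row using a stack."""
    stack = []
    for node in reversed(program):      # scan right to left
        kind = node[0]
        if kind == "var":
            stack.append(row[node[1]])  # push a column value
        elif kind == "const":
            stack.append(node[1])       # push a literal
        else:
            a = stack.pop()             # left operand
            b = stack.pop()             # right operand
            stack.append(OPS[kind](a, b))
    return stack.pop()

def fitness(program, X, y):
    """Negative mean absolute error, so that larger is better."""
    return -sum(abs(evaluate(program, r) - t) for r, t in zip(X, y)) / len(y)

def tournament(population, scores, size, rng):
    """Pick `size` random contenders and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]

X = [[1.0, 2.0, 3.0], [0.5, 4.0, 1.0]]
y = [-1.0, 1.0]  # equals x0 * x1 - x2 for each row
population = [
    [("sub",), ("mul",), ("var", 0), ("var", 1), ("var", 2)],  # x0 * x1 - x2
    [("add",), ("var", 0), ("var", 1)],                        # x0 + x1
    [("const", 0.0)],                                          # constant 0
]
scores = [fitness(p, X, y) for p in population]
best = tournament(population, scores, size=3, rng=random.Random(0))
```

Each program's evaluation is independent of the others, which is the property that makes this loop a natural candidate for one-thread-per-program or one-thread-per-row parallelism on a GPU.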
Is your feature request related to a problem? Please describe.
Genetic Programming (GP) is very useful for feature engineering, but its main challenge is time complexity. Luckily, GP algorithms are easily parallelizable, so I believe they are a good fit for cuML.
Example: Let's assume you have two columns, A and B, and a binary target. This target is 1 most of the time when A > B. It is very difficult for a tree-based model to learn this, but GP can engineer this feature for you.
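The A > B example can be checked numerically with a small, self-contained sketch. Everything here is an assumption made for illustration: the uniform synthetic data, the 10% label-flip rate standing in for "most of the time", and the hand-rolled `spearman` helper (gplearn ships its own Spearman metric, which is not used here). The sketch scores each raw column and the engineered feature `A - B` by absolute Spearman correlation with the target.

```python
import random

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

rng = random.Random(0)
A = [rng.random() for _ in range(2000)]
B = [rng.random() for _ in range(2000)]
# Target is 1 when A > B, with roughly 10% of the labels flipped.
target = [int(a > b) ^ int(rng.random() < 0.1) for a, b in zip(A, B)]

diff = [a - b for a, b in zip(A, B)]  # the feature GP is hoped to discover
scores = {
    "A": abs(spearman(A, target)),
    "B": abs(spearman(B, target)),
    "A - B": abs(spearman(diff, target)),
}
for name, score in scores.items():
    print(f"{name:6s} {score:.3f}")
```

The engineered difference should rank the target noticeably better than either raw column, and it turns the diagonal boundary A > B into a single threshold that a tree-based model can split on directly.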
Describe the solution you'd like
I would like to have the functionality of gplearn (https://gplearn.readthedocs.io/en/stable/) accelerated on GPU.