Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve models generation #96

Merged

Conversation

HealthyPear
Copy link
Member

@HealthyPear HealthyPear commented Jan 25, 2021

Requirements

Summary of expected modifications

  • Refactoring of the I/O from protopipe.scripts.build_models to protopipe.mva.io
  • Implementation of Random Forest energy regressor
  • Inclusion of missing features from CTAMARS related to energy regressor
  • Explicit inclusion of all model class attributes from scikit-learn
  • Improvement of configuration files to maximize usage
  • Final testing
  • Update documentation
  • BONUS feature: testing via integration tests

- refactored I/O from protopipe.mva.io
- add Random Forest energy regressor
- explicit all class options from scikit-learn
- better organized information
- smaller fixes
@HealthyPear HealthyPear added enhancement New feature or request machine learning labels Jan 25, 2021
@HealthyPear HealthyPear added this to the v0.5.0 milestone Jan 25, 2021
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@codecov
Copy link

codecov bot commented Jan 25, 2021

Codecov Report

Merging #96 (d10dac3) into master (166fbe0) will increase coverage by 0.00%.
The diff coverage is 77.77%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #96   +/-   ##
=======================================
  Coverage   48.92%   48.93%           
=======================================
  Files          22       23    +1     
  Lines        2001     2058   +57     
=======================================
+ Hits          979     1007   +28     
- Misses       1022     1051   +29     
Impacted Files Coverage Δ
protopipe/scripts/write_dl2.py 4.91% <3.84%> (+<0.01%) ⬆️
protopipe/scripts/build_model.py 86.32% <86.30%> (-0.29%) ⬇️
protopipe/mva/utils.py 30.65% <87.50%> (+0.22%) ⬆️
protopipe/mva/__init__.py 100.00% <100.00%> (ø)
protopipe/mva/io.py 100.00% <100.00%> (ø)
protopipe/mva/train_model.py 64.00% <100.00%> (-24.24%) ⬇️
protopipe/scripts/data_training.py 94.87% <100.00%> (+0.27%) ⬆️
protopipe/scripts/tests/test_pipeline.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 166fbe0...d10dac3. Read the comment docs.

@HealthyPear HealthyPear linked an issue Feb 23, 2021 that may be closed by this pull request
kosack
kosack previously approved these changes Apr 13, 2021
@HealthyPear HealthyPear marked this pull request as ready for review April 14, 2021 18:01
@HealthyPear HealthyPear requested a review from kosack April 14, 2021 18:01
@HealthyPear HealthyPear mentioned this pull request Apr 14, 2021
1 task
Copy link
Contributor

@kosack kosack left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Had just a few small inline changes to suggest.

A general question: is there a way to include another YAML file in a YAML file, via some pre-processor perhaps? (jinja2 maybe?) Right now if you make any changes, you have a lot of YAML files to update, and many of them have a lot of the same text in them, with only the regressor changing. So having a file for parameters and another with the regressor config would simplify that a lot. That is not necessary for this PR, but is something to think about as a refactoring.

docs/scripts/model_building.rst Show resolved Hide resolved
docs/scripts/model_building.rst Show resolved Hide resolved
protopipe/scripts/write_dl2.py Outdated Show resolved Hide resolved
protopipe/scripts/write_dl2.py Outdated Show resolved Hide resolved
protopipe/aux/example_config_files/AdaBoostRegressor.yaml Outdated Show resolved Hide resolved
protopipe/mva/train_model.py Show resolved Hide resolved
except:
pass
except KeyError as e:
print(e)
Copy link
Contributor

@kosack kosack Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this really silently fail (well partially silently, as it just prints the error and continues)? If so, maybe add a comment explaining why the error is caught but nothing is done with it. Otherwise, use a warning or raise the exception.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it a second time I decided that it was not done well so I rewrote it.

basically now:

  • if label is None, it means that we are doing an energy regressor, so all this code block doesn't happen
  • if it's not None, label is always added to the dataframe, but the 2 energy keys are checked for existence since in our reference analysis we need energy as 1 of the features

unfortunately right now this is needed to make DL2 work here, so I raise an error if the model is a classifier and those keys are not selected for usage (or I could add them always?)

@HealthyPear HealthyPear requested a review from kosack April 15, 2021 10:09
# This is needed because our reference analysis uses energy as
# feature for classification
# We should propably support a more elastic choice in the future.
if not all(i in derived_features for i in ["log10_reco_energy", "log10_reco_energy_tel"]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also write this as

if not {"log10_reco_energy", "log10_reco_energy_tel"}.issubset(set(derived_features)):

Do you really need this to fail if missing? Isn't it just a choice in the config file whether or not to use energy in the feature list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this is what I was referring to above: right now the DL2 script expects these 2 parameters as features because this is how our reference analysis works, but I plan to make it more flexible (also the refactored version should)

I will leave this for later and for now just rely on the error message

@HealthyPear HealthyPear merged commit 4928ae4 into cta-observatory:master Apr 15, 2021
@HealthyPear HealthyPear deleted the feature-add_RF_for_energy branch April 15, 2021 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request machine learning
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better handling of model features
2 participants