
Allow selecting N samples for training and testing #949

Merged · vuillaut merged 29 commits into master from nsamples on Apr 8, 2022

Conversation

vuillaut
Member

@vuillaut vuillaut commented Mar 21, 2022

Early review is welcome

Fixes #948

@vuillaut vuillaut changed the title WIP: Allow selecting N samples for training and testing Allow selecting N samples for training and testing Mar 21, 2022
@vuillaut vuillaut added the ready for review Pull request is ready for review label Mar 21, 2022
@codecov

codecov bot commented Mar 21, 2022

Codecov Report

Merging #949 (6efdfd1) into master (a39a5df) will decrease coverage by 0.06%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master     #949      +/-   ##
==========================================
- Coverage   85.49%   85.43%   -0.07%     
==========================================
  Files          78       78              
  Lines        6461     6494      +33     
==========================================
+ Hits         5524     5548      +24     
- Misses        937      946       +9     
Impacted Files                                     Coverage Δ
lstchain/high_level/significance_calculation.py    14.04% <0.00%> (ø)
lstchain/io/config.py                             100.00% <ø> (ø)
lstchain/reco/dl1_to_dl2.py                        73.52% <81.25%> (-0.94%) ⬇️
lstchain/reco/tests/test_utils.py                 100.00% <100.00%> (ø)
lstchain/reco/utils.py                             76.11% <100.00%> (+0.40%) ⬆️
lstchain/tests/test_lstchain.py                    96.96% <100.00%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a39a5df...6efdfd1.

@vuillaut vuillaut requested a review from maxnoe March 22, 2022 09:45
@moralejo
Collaborator

Can you briefly describe how the number n_events is used? It seems to me that you are using n_events gammas in total, but only a part of them are used in the g/h RF training, together with n_events protons. Probably it would be useful to set the numbers of each species separately.

@vuillaut
Member Author

vuillaut commented Mar 22, 2022

Can you briefly describe how the number n_events is used? It seems to me that you are using n_events gammas in total, but only a part of them are used in the g/h RF training, together with n_events protons. Probably it would be useful to set the numbers of each species separately.

In the present config, n_events sets the maximum number of events to be used for each particle.
So during the training stage:

  • a max of n_events gammas will be kept in total, and then used for training as usual
  • a max of n_events protons will be kept

If the same config is applied to the dl1_to_dl2 step, then again, a maximum of n_events will be kept for each particle.
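
For illustration, a minimal sketch of how such a per-particle cap could be applied to a pandas DataFrame of DL1 events (cap_events is a hypothetical helper, not the actual lstchain function):

import pandas as pd

def cap_events(df: pd.DataFrame, n_events=None, seed=42):
    # Keep at most n_events rows, drawn at random; None keeps everything.
    if n_events is None or len(df) <= n_events:
        return df
    return df.sample(n=n_events, random_state=seed)

# Applied independently to each particle type:
# gammas = cap_events(gammas, n_events)
# protons = cap_events(protons, n_events)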

For a more fine-grained description, I can think of something like:

"n_events": {
     "training_gammas": null,
     "training_protons": null,
}

And not allow this option for the inference step.
Otherwise, we will have to add:

"n_events": {
     "training_gammas": null,
     "training_protons": null,
     "dl1_to_dl2": null
}

Not sure this last one is really useful...

@moralejo
Collaborator

I would define this option only for the training stage.
Then, I think it would be better to be able to control what number of events is used in each of the trainings, right? As of now, the g/h training would have fewer gammas than protons, since those used for the energy reco are (rightly) not re-used for the g/h training.
Actually if the fraction of gammas that is used for E-reco is known, then it is enough to control the total numbers of training g's and p's separately.

@vuillaut
Member Author

I would define this option only for the training stage.

Ok I will apply these changes.

Then, I think it would be better to be able to control what number of events is used in each of the trainings, right? As of now, the g/h training would have fewer gammas than protons, since those used for the energy reco are (rightly) not re-used for the g/h training. Actually if the fraction of gammas that is used for E-reco is known, then it is enough to control the total numbers of training g's and p's separately.

The fraction of gammas used for the regressors is known (20%); it is the same for all regressors, but it is not configurable at the moment. I will modify the code so we can also control that number.

@vuillaut
Member Author

vuillaut commented Mar 23, 2022

For reference, this is the current training process:

graph LR
    GAMMA[gammas] -->|100%| REG(regressors) --> |dump| DISK
    GAMMA --> S(split)
    S --> |80%| g_train
    S --> |20%| g_test
    g_train --> |reg training| tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |100%| D
    D --> cls(classifier)
    cls--> |dump| DISK
    classDef empty width:0px,height:0px;

We keep all gammas for the final regressors saved on disk.
I think the rationale was to train the best possible regressors. However, it might introduce a small bias down the road, since during inference we apply that fully grown RF to the data for regression before applying the classifier. In my opinion that bias is negligible; I am mentioning it for reference.
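
For illustration, a rough scikit-learn sketch of this workflow, using toy data and made-up feature names (lstchain's actual training code differs):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for the gamma DL1 table (features + true energy).
gammas = pd.DataFrame({"intensity": rng.uniform(2, 5, 1000),
                       "length": rng.uniform(0, 1, 1000),
                       "mc_energy": rng.uniform(-1, 2, 1000)})
features = ["intensity", "length"]

# Final regressor: trained on 100% of the gammas and dumped to disk as-is.
final_reg = RandomForestRegressor(n_estimators=50).fit(
    gammas[features], gammas["mc_energy"])

# Temporary regressor: trained on 80% only, so that the remaining 20% can be
# energy-reconstructed without having been seen during training.
g_train, g_test = train_test_split(gammas, test_size=0.2, random_state=0)
tmp_reg = RandomForestRegressor(n_estimators=50).fit(
    g_train[features], g_train["mc_energy"])

# These tmp-regressed gammas (plus 100% of the protons) feed the classifier
# training, while inference applies final_reg before the classifier: hence
# the small train/apply mismatch mentioned above.
g_test = g_test.assign(reco_energy=tmp_reg.predict(g_test[features]))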

@vuillaut
Member Author

New implementation with the config:

graph LR
    GAMMA[gammas] -->|`gamma_regressors`| REG(regressors) --> DISK
    GAMMA --> S(split)
    S --> |`gamma_classifier_train`| g_train
    S --> |`gamma_classifier_test`| g_test
    g_train --> |reg training| tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |`proton_classifier`| D
    D --> cls(classifier)
    cls--> DISK
    classDef empty width:0px,height:0px;

Comment on lines 158 to 161
"gamma_regressors": 0.99,
"gamma_classifier_train": 0.78,
"gamma_classifier_test": 0.21,
"proton_classifier": 0.98
Collaborator

gamma_classifier_test seems to me a bit of a confusing name. This is actually the fraction used in the training of the classifier, right? (it is "test" only from the point of view of the tmp_regressors).

Also, if we are not using the full gamma statistics, we could use the same training sample for the tmp regressors and the final ones. In that case, wouldn't it be enough, e.g.:

"gamma_regressors": 0.70
"gamma_classifier": 0.30 (could default to 1 - fraction_gamma_regressors)
"proton_classifier": 1.00

Collaborator

Hi, this comment about the option naming still holds. My "forget my previous comment" refers to another one that I deleted (it was about the test with fractions adding to more than 1, to make it fail).

@vuillaut
Member Author

gamma_classifier_test seems to me a bit of a confusing name. This is actually the fraction used in the training of the classifier, right? (it is "test" only from the point of view of the tmp_regressors).

I agree but failed to find something more explicit and clear. Any suggestion?

@vuillaut
Member Author

Also, if we are not using the full gamma statistics, we could use the same training sample for the tmp regressors and the final ones. In that case, wouldn't it be enough, e.g.:
"gamma_regressors": 0.70
"gamma_classifier": 0.30 (could default to 1 - fraction_gamma_regressors)
"proton_classifier": 1.00

🤔
Not sure I understand.
The gammas used for the classifier are not a subset of the gamma_regressors ones.

So you could set 70% of the gammas for the regressors (gamma_regressors=0.7) and then gamma_classifier_train=0.6, gamma_classifier_test=0.4 (summing to one) to use all gammas for the classifier, or choose any other split.

@moralejo
Collaborator

Perhaps I am not understanding.

If we have enough statistics, the ideal thing would be to (i) train the regressors on some part of the gamma sample; (ii) apply those regressors to another independent part of the gamma sample (perhaps all the rest of it), and to the (desired fraction of the) proton sample; (iii) use the p and g "regressed" samples from (ii) for the classifier training.

In this way the classifier training would be done with events reconstructed in the same way (same regressors) as we will use for the real data and the "MC-IRF" sample.

As of now, the final regressors are not those used in the creation of the classifier training sample. Probably the effect is not large.

@vuillaut
Member Author

Perhaps I am not understanding.

If we have enough statistics, the ideal thing would be to (i) train the regressors on some part of the gamma sample; (ii) apply those regressors to another independent part of the gamma sample (perhaps all the rest of it), and to the (desired fraction of the) proton sample; (iii) use the p and g "regressed" samples from (ii) for the classifier training.

In this way the classifier training would be done with events reconstructed in the same way (same regressors) as we will use for the real data and the "MC-IRF" sample.

As of now, the final regressors are not those used in the creation of the classifier training sample. Probably the effect is not large.

Yes, you got it right. That is what I explained in
#949 (comment)

So, do you suggest that we also change the workflow?

@moralejo
Collaborator

The problem I see with the current workflow is that if you use a large fraction of the gammas (as we do now) for the "tmp regressors", then the gamma statistics for the classifier training are small. On the other hand, if you increase the gamma stats for the classifier, then the tmp regressors will be more and more different from (worse than) those we will later use for data (which are created using all gammas). So I think a cleaner solution is to go for a single set of regressors that can also be used to create the classifier-training sample. But there may be more opinions. @rlopezcoto ? @maxnoe ?

@maxnoe
Member

maxnoe commented Mar 24, 2022

Yes.

  1. Split gammas into a train set for the energy regressor and a train set for the g/h classifier.

  2. Train regressor on regressor train set

  3. Apply regressor to classifier train set

  4. Train classifier
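
For illustration, a minimal sketch of that four-step flow (toy data, made-up feature names, and an arbitrary split fraction; not lstchain's actual code):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def toy(label, n=1000):
    # Toy DL1-like table: two features, a true energy, and a particle label.
    return pd.DataFrame({"intensity": rng.uniform(2, 5, n),
                         "length": rng.uniform(0, 1, n),
                         "mc_energy": rng.uniform(-1, 2, n),
                         "is_gamma": label})

gammas, protons = toy(1), toy(0)
features = ["intensity", "length"]

# 1. Split gammas into a regressor train set and a classifier train set.
g_reg, g_cls = train_test_split(gammas, test_size=0.5, random_state=1)

# 2. Train the (single) energy regressor on the regressor set.
reg = RandomForestRegressor(n_estimators=50).fit(g_reg[features],
                                                 g_reg["mc_energy"])

# 3. Apply it to the classifier gammas and to the protons, so the classifier
#    sees events reconstructed exactly as real data would be.
g_cls = g_cls.assign(reco_energy=reg.predict(g_cls[features]))
protons = protons.assign(reco_energy=reg.predict(protons[features]))

# 4. Train the g/h classifier on the regressed samples.
cls_train = pd.concat([g_cls, protons])
cls = RandomForestClassifier(n_estimators=50).fit(
    cls_train[features + ["reco_energy"]], cls_train["is_gamma"])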

@vuillaut
Member Author

vuillaut commented Mar 24, 2022

Ok, this will decrease the final regressors' performance, and it is a change from what was decided early in the development of lstchain - so this is not just a new feature allowing one to configure the number of samples for training.
Note that the number of gammas used to train the classifier would now be configurable, and we could set it to, say, 80% (rather than the current 20%). I think this would improve the classifier's performance without decreasing the final regressors' performance.
Do we have an up-to-date feature importance plot for the classifier somewhere?
Are the regressors' outputs so important for the classifier?

For reference, this would be the new workflow:

graph LR
    GAMMA[gammas] --> Sg(split)
    Sg --> |p| g_train
    Sg --> |1-p| g_test 

    g_train --> |training| reg(regressors)
    reg --> |dump| DISK
    reg --- A[ ]:::empty
    g_test --- A
    A --> |apply reg| g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    PROTON[protons] ------ B[ ]:::empty 
    reg --- B
    B --> |apply reg| p_test_dl2 --- D
    D --> |train| cls(classifier)
    cls--> |dump| DISK
    classDef empty width:0px,height:0px;

@moralejo
Collaborator

Note that "apply regressors" is missing for the protons in the graph

Probably the importance of the regressor outputs in the classifier is not large; after all, they are built from the same inputs that go into the classifier anyway.

If that is true, and for the case in which no regressor outputs are fed to the classifier, we may want to use the full gamma sample for both regressors and classifier. Shall we just allow the fractions of gammas (for regressors and classifier) to add up to more than 1? (If they add up to <= 1, then independent event samples are drawn from the gamma sample; otherwise we just take the requested fraction for each at random from the original sample.)
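
A sketch of that sampling rule (draw_gamma_samples is a hypothetical helper, not something this PR implements):

import pandas as pd

def draw_gamma_samples(gammas: pd.DataFrame, f_reg, f_cls, seed=0):
    # Fractions fitting within the sample: draw disjoint subsets.
    # Fractions adding up to more than 1: draw each independently at random,
    # so events may be shared between the two samples.
    n = len(gammas)
    if f_reg + f_cls <= 1:
        shuffled = gammas.sample(frac=1, random_state=seed)
        reg_sample = shuffled.iloc[:int(f_reg * n)]
        cls_sample = shuffled.iloc[int(f_reg * n):int((f_reg + f_cls) * n)]
    else:
        reg_sample = gammas.sample(frac=f_reg, random_state=seed)
        cls_sample = gammas.sample(frac=f_cls, random_state=seed + 1)
    return reg_sample, cls_sample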

@vuillaut
Member Author

Note that "apply regressors" is missing for the protons in the graph

Right, thanks! I edited it accordingly.

@vuillaut
Member Author

vuillaut commented Apr 2, 2022

This is pending; I am not sure whether I should completely refactor the training workflow in the end.

@moralejo
Collaborator

moralejo commented Apr 2, 2022

This is pending; I am not sure whether I should completely refactor the training workflow in the end.

I think we can do without that for now. But the naming of the config file options is misleading, in my opinion (see #949 (comment)).

@vuillaut
Member Author

vuillaut commented Apr 2, 2022

This is pending; I am not sure whether I should completely refactor the training workflow in the end.

I think we can do without that for now. But the naming of the config file options is misleading, in my opinion (see #949 (comment)).

What about:

"gamma_regressors", "gamma_tmp_regressors", "gamma_classifier", "proton_classifier"

@vuillaut
Member Author

vuillaut commented Apr 4, 2022

I renamed the variables.

Note that the workflow diagram will be added to the built documentation; that should help to understand what is what.

Comment on lines 38 to 43
"n_training_events": {
"gamma_regressors": null,
"gamma_tmp_regressors": null,
"gamma_classifier": 0.2,
"proton_classifier": null
},
Collaborator

Then the default is now:
100% of gammas for (final) regressors
80% of gammas for tmp regressors
20% of gammas for classifier
100% of protons for classifier

correct? Should we write a comment in the json of what the default means?

Collaborator

And the written out regressors are those using 100% of the gammas - but the diagram indicates they would be the 80% ones. Or did I misunderstand?

Member

JSON has no comments.

@vuillaut
Member Author

vuillaut commented Apr 5, 2022

Then the default is now:
100% of gammas for (final) regressors
80% of gammas for tmp regressors
20% of gammas for classifier
100% of protons for classifier

correct? Should we write a comment in the json of what the default means?

That is correct, and this corresponds to the current situation.

I could replace the null values with the actual values in the standard config - that would probably be more explicit.
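
For instance, spelled out with the defaults listed above (illustrative; the exact block in the standard config may differ):

"n_training_events": {
    "gamma_regressors": 1.0,
    "gamma_tmp_regressors": 0.8,
    "gamma_classifier": 0.2,
    "proton_classifier": 1.0
}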

Max is correct: we cannot put comments directly in the JSON file.

@vuillaut
Member Author

vuillaut commented Apr 5, 2022

And the written out regressors are those using 100% of the gammas - but the diagram indicates they would be the 80% ones. Or did I misunderstand?

I am sorry @moralejo, I think you are confusing this with the diagram I made when we discussed whether to also refactor the training process.

The diagram that will be in the doc is this one:

graph LR
    GAMMA[gammas] -->|#`gamma_regressors`| REG(regressors) --> DISK
    GAMMA --> S(split)
    S --> |#`gamma_tmp_regressors`| g_train
    S --> |#`gamma_classifier`| g_test
    g_train --> tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |#`proton_classifier`| D
    D --> cls(classifier)
    cls--> DISK
    classDef empty width:0px,height:0px;

@moralejo
Collaborator

moralejo commented Apr 5, 2022

Then the default is now:
100% of gammas for (final) regressors
80% of gammas for tmp regressors
20% of gammas for classifier
100% of protons for classifier

correct? Should we write a comment in the json of what the default means?

That is correct, and this corresponds to the current situation.

I could replace the null values with the actual values as standard - that would probably be more explicit.

yes, I think so, thanks!

@moralejo
Collaborator

moralejo commented Apr 5, 2022

About the diagram: ok, I see, sorry for the confusion!
Just make the settings explicit in the default config for clarity, and we can move on.

@vuillaut
Member Author

vuillaut commented Apr 5, 2022

Done

@vuillaut vuillaut requested a review from moralejo April 5, 2022 14:53
@vuillaut
Member Author

vuillaut commented Apr 8, 2022

Thank you for the discussion and review. Merging.

@vuillaut vuillaut merged commit c8abcb2 into master Apr 8, 2022
@vuillaut vuillaut deleted the nsamples branch April 8, 2022 07:57