
Allow selecting N samples for training and testing #949

Merged · vuillaut merged 29 commits into master from nsamples on Apr 8, 2022

Conversation

vuillaut
Member

@vuillaut vuillaut commented Mar 21, 2022

Early review is welcome

Fixes #948

@vuillaut vuillaut changed the title WIP: Allow selecting N samples for training and testing Allow selecting N samples for training and testing Mar 21, 2022
@vuillaut vuillaut added the ready for review Pull request is ready for review label Mar 21, 2022
@codecov

codecov bot commented Mar 21, 2022

Codecov Report

Merging #949 (6efdfd1) into master (a39a5df) will decrease coverage by 0.06%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master     #949      +/-   ##
==========================================
- Coverage   85.49%   85.43%   -0.07%     
==========================================
  Files          78       78              
  Lines        6461     6494      +33     
==========================================
+ Hits         5524     5548      +24     
- Misses        937      946       +9     
Impacted Files                                     Coverage Δ
lstchain/high_level/significance_calculation.py    14.04% <0.00%> (ø)
lstchain/io/config.py                             100.00% <ø> (ø)
lstchain/reco/dl1_to_dl2.py                        73.52% <81.25%> (-0.94%) ⬇️
lstchain/reco/tests/test_utils.py                 100.00% <100.00%> (ø)
lstchain/reco/utils.py                             76.11% <100.00%> (+0.40%) ⬆️
lstchain/tests/test_lstchain.py                    96.96% <100.00%> (+0.09%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a39a5df...6efdfd1.

@vuillaut vuillaut requested a review from maxnoe March 22, 2022 09:45
@moralejo
Collaborator

Can you briefly describe how the number n_events is used? It seems to me that you are using n_events gammas in total, but only a part of them are used in the g/h RF training, together with n_events protons. Probably it would be useful to set the numbers of each species separately.

@vuillaut
Member Author

vuillaut commented Mar 22, 2022

Can you briefly describe how the number n_events is used? It seems to me that you are using n_events gammas in total, but only a part of them are used in the g/h RF training, together with n_events protons. Probably it would be useful to set the numbers of each species separately.

In the present config, n_events sets the maximum number of events to be used for each particle.
So during the training stage:

  • a max of n_events gammas will be kept in total, and then used for training as usual
  • a max of n_events protons will be kept

If the same config is applied to the dl1_to_dl2 step, then again, a maximum of n_events will be kept for each particle.
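
For illustration, a minimal sketch of how such a per-particle cap could be applied to a pandas DataFrame of DL1 events (cap_events is a hypothetical helper, not the actual lstchain function):

import pandas as pd

def cap_events(df: pd.DataFrame, n_events=None, seed=42):
    # Keep at most n_events rows, drawn at random; None keeps everything.
    if n_events is None or len(df) <= n_events:
        return df
    return df.sample(n=n_events, random_state=seed)

# Applied independently to each particle type:
# gammas = cap_events(gammas, n_events)
# protons = cap_events(protons, n_events)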

For a more fine-grained description, I can think of something like:

"n_events": {
     "training_gammas": null,
     "training_protons": null,
}

And not allow this option for the inference step.
Otherwise, we will have to add:

"n_events": {
     "training_gammas": null,
     "training_protons": null,
     "dl1_to_dl2": null
}

Not sure this last one is really useful...

@moralejo
Collaborator

I would define this option only for the training stage.
Then, I think it would be better to be able to control what number of events is used in each of the trainings, right? As of now, the g/h training would have fewer gammas than protons, since those used for the energy reco are (rightly) not re-used for the g/h training.
Actually if the fraction of gammas that is used for E-reco is known, then it is enough to control the total numbers of training g's and p's separately.

@vuillaut
Member Author

I would define this option only for the training stage.

Ok I will apply these changes.

Then, I think it would be better to be able to control what number of events is used in each of the trainings, right? As of now, the g/h training would have fewer gammas than protons, since those used for the energy reco are (rightly) not re-used for the g/h training. Actually if the fraction of gammas that is used for E-reco is known, then it is enough to control the total numbers of training g's and p's separately.

The fraction of gammas used for the regressors is known (20%); it is the same for all regressors, but it is not configurable at the moment. I will modify the code so we can also control that number.

@vuillaut
Member Author

vuillaut commented Mar 23, 2022

For reference, this is the current training process:

graph LR
    GAMMA[gammas] -->|100%| REG(regressors) --> |dump| DISK
    GAMMA --> S(split)
    S --> |80%| g_train
    S --> |20%| g_test
    g_train --> |reg training| tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |100%| D
    D --> cls(classifier)
    cls--> |dump| DISK
    classDef empty width:0px,height:0px;

We keep all gammas for the final regressors saved on disk.
I think the rationale was to train the best possible regressors. However, it might introduce a small bias down the road, since during inference we apply that fully grown RF to the data for regression before applying the classifier. In my opinion that bias is negligible; I am mentioning it for reference.
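
For illustration, a rough scikit-learn sketch of this workflow, using toy data and made-up feature names (lstchain's actual training code differs):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-in for the gamma DL1 table (features + true energy).
gammas = pd.DataFrame({"intensity": rng.uniform(2, 5, 1000),
                       "length": rng.uniform(0, 1, 1000),
                       "mc_energy": rng.uniform(-1, 2, 1000)})
features = ["intensity", "length"]

# Final regressor: trained on 100% of the gammas and dumped to disk as-is.
final_reg = RandomForestRegressor(n_estimators=50).fit(
    gammas[features], gammas["mc_energy"])

# Temporary regressor: trained on 80% only, so that the remaining 20% can be
# energy-reconstructed without having been seen during training.
g_train, g_test = train_test_split(gammas, test_size=0.2, random_state=0)
tmp_reg = RandomForestRegressor(n_estimators=50).fit(
    g_train[features], g_train["mc_energy"])

# These tmp-regressed gammas (plus 100% of the protons) feed the classifier
# training, while inference applies final_reg before the classifier: hence
# the small train/apply mismatch mentioned above.
g_test = g_test.assign(reco_energy=tmp_reg.predict(g_test[features]))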

@vuillaut
Member Author

New implementation with the config:

graph LR
    GAMMA[gammas] -->|`gamma_regressors`| REG(regressors) --> DISK
    GAMMA --> S(split)
    S --> |`gamma_classifier_train`| g_train
    S --> |`gamma_classifier_test`| g_test
    g_train --> |reg training| tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |`proton_classifier`| D
    D --> cls(classifier)
    cls--> DISK
    classDef empty width:0px,height:0px;

Comment on lines 158 to 161
"gamma_regressors": 0.99,
"gamma_classifier_train": 0.78,
"gamma_classifier_test": 0.21,
"proton_classifier": 0.98
Collaborator

gamma_classifier_test seems to me a bit of a confusing name. This is actually the fraction used in the training of the classifier, right? (it is "test" only from the point of view of the tmp_regressors).

Also, if we are not using the full gamma statistics, we could use the same training sample for the tmp regressors and the final ones. In that case, wouldn't it be enough, e.g.:

"gamma_regressors": 0.70
"gamma_classifier": 0.30 (could default to 1 - fraction_gamma_regressors)
"proton_classifier": 1.00

Collaborator

Hi, this comment about the option naming still holds. My "forget my previous comment" refers to another one that I deleted (it was about the test with fractions adding to more than 1, to make it fail).

@vuillaut
Member Author

gamma_classifier_test seems to me a bit of a confusing name. This is actually the fraction used in the training of the classifier, right? (it is "test" only from the point of view of the tmp_regressors).

I agree but failed to find something more explicit and clear. Any suggestion?

@vuillaut
Member Author

Also, if we are not using the full gamma statistics, we could use the same training sample for the tmp regressors and the final ones. In that case, wouldn't it be enough, e.g.:
"gamma_regressors": 0.70
"gamma_classifier": 0.30 (could default to 1 - fraction_gamma_regressors)
"proton_classifier": 1.00

🤔
Not sure I understand.
The gammas used for the classifier are not a subset of the gamma_regressors ones.

So you could set 70% of the gammas for the regressors (gamma_regressors=0.7) and then gamma_classifier_train=0.6, gamma_classifier_test=0.4 (summing to one) to use all gammas for the classifier, or choose any other split.

@moralejo
Collaborator

Perhaps I am not understanding.

If we have enough statistics, the ideal thing would be to (i) train the regressors on some part of the gamma sample; (ii) apply those regressors to another independent part of the gamma sample (perhaps all the rest of it), and to the (desired fraction of the) proton sample; (iii) use the p and g "regressed" samples from (ii) for the classifier training.

In this way the classifier training would be done with events reconstructed in the same way (same regressors) as we will use for the real data and the "MC-IRF" sample.

As of now, the final regressors are not those used in the creation of the classifier training sample. Probably the effect is not large.

@vuillaut
Member Author

Perhaps I am not understanding.

If we have enough statistics, the ideal thing would be to (i) train the regressors on some part of the gamma sample; (ii) apply those regressors to another independent part of the gamma sample (perhaps all the rest of it), and to the (desired fraction of the) proton sample; (iii) use the p and g "regressed" samples from (ii) for the classifier training.

In this way the classifier training would be done with events reconstructed in the same way (same regressors) as we will use for the real data and the "MC-IRF" sample.

As of now, the final regressors are not those used in the creation of the classifier training sample. Probably the effect is not large.

Yes, you got it right. That is what I explained in
#949 (comment)

So, do you suggest that we also change the workflow?

@moralejo
Collaborator

The problem I see with the current workflow is that if you use a large fraction of the gammas (as we do now) for the "tmp regressors", then the gamma statistics for the classifier training are small. On the other hand, if you increase the gamma stats for the classifier, then the tmp regressors will be more and more different from (worse than) those we will later use for data (which are created using all gammas). So I think a cleaner solution is to go for a single set of regressors that can also be used to create the classifier-training sample. But there may be more opinions. @rlopezcoto ? @maxnoe ?

@maxnoe
Member

maxnoe commented Mar 24, 2022

Yes.

  1. Split gammas into a train set for the energy regressor and a train set for the g/h classifier.

  2. Train regressor on regressor train set

  3. Apply regressor to classifier train set

  4. Train classifier
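
For illustration, a minimal sketch of that four-step flow (toy data, made-up feature names, and an arbitrary split fraction; not lstchain's actual code):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def toy(label, n=1000):
    # Toy DL1-like table: two features, a true energy, and a particle label.
    return pd.DataFrame({"intensity": rng.uniform(2, 5, n),
                         "length": rng.uniform(0, 1, n),
                         "mc_energy": rng.uniform(-1, 2, n),
                         "is_gamma": label})

gammas, protons = toy(1), toy(0)
features = ["intensity", "length"]

# 1. Split gammas into a regressor train set and a classifier train set.
g_reg, g_cls = train_test_split(gammas, test_size=0.5, random_state=1)

# 2. Train the (single) energy regressor on the regressor set.
reg = RandomForestRegressor(n_estimators=50).fit(g_reg[features],
                                                 g_reg["mc_energy"])

# 3. Apply it to the classifier gammas and to the protons, so the classifier
#    sees events reconstructed exactly as real data would be.
g_cls = g_cls.assign(reco_energy=reg.predict(g_cls[features]))
protons = protons.assign(reco_energy=reg.predict(protons[features]))

# 4. Train the g/h classifier on the regressed samples.
cls_train = pd.concat([g_cls, protons])
cls = RandomForestClassifier(n_estimators=50).fit(
    cls_train[features + ["reco_energy"]], cls_train["is_gamma"])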

@vuillaut
Member Author

vuillaut commented Mar 24, 2022

Ok, this will decrease the final regressors' performance, and it is a change from what was decided early in the development of lstchain - so this is not just a new feature allowing one to configure the number of samples for training.
Note that the number of gammas used to train the classifier would now be configurable, and we could set it to, say, 80% (rather than the current 20%). I think this would improve the classifier's performance without decreasing the final regressors' performance.
Do we have an up-to-date feature importance plot for the classifier somewhere?
Are the regressors' outputs so important for the classifier?

For reference, this would be the new workflow:

graph LR
    GAMMA[gammas] --> Sg(split)
    Sg --> |p| g_train
    Sg --> |1-p| g_test 

    g_train --> |training| reg(regressors)
    reg --> |dump| DISK
    reg --- A[ ]:::empty
    g_test --- A
    A --> |apply reg| g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    PROTON[protons] ------ B[ ]:::empty 
    reg --- B
    B --> |apply reg| p_test_dl2 --- D
    D --> |train| cls(classifier)
    cls--> |dump| DISK
    classDef empty width:0px,height:0px;

@moralejo
Collaborator

Note that "apply regressors" is missing for the protons in the graph

Probably the importance of the regressor outputs in the classifier is not large; after all, they are built from the same inputs that go into the classifier anyway.

If that is true, and for the case in which no regressor outputs are fed to the classifier, we may want to use the full gamma sample for both regressors and classifier. Shall we just allow the fractions of gammas (for regressors and classifier) to add up to more than 1? (If they add up to <= 1, then independent event samples are drawn from the gamma sample; otherwise we just take the requested fraction for each at random from the original sample.)
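
A sketch of that sampling rule (draw_gamma_samples is a hypothetical helper, not something this PR implements):

import pandas as pd

def draw_gamma_samples(gammas: pd.DataFrame, f_reg, f_cls, seed=0):
    # Fractions fitting within the sample: draw disjoint subsets.
    # Fractions adding up to more than 1: draw each independently at random,
    # so events may be shared between the two samples.
    n = len(gammas)
    if f_reg + f_cls <= 1:
        shuffled = gammas.sample(frac=1, random_state=seed)
        reg_sample = shuffled.iloc[:int(f_reg * n)]
        cls_sample = shuffled.iloc[int(f_reg * n):int((f_reg + f_cls) * n)]
    else:
        reg_sample = gammas.sample(frac=f_reg, random_state=seed)
        cls_sample = gammas.sample(frac=f_cls, random_state=seed + 1)
    return reg_sample, cls_sample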

@vuillaut
Member Author

Note that "apply regressors" is missing for the protons in the graph

Right, thanks! I edited it accordingly.

@vuillaut
Member Author

vuillaut commented Apr 2, 2022

This is pending; I am not sure whether I should completely refactor the training workflow in the end.

@moralejo
Collaborator

moralejo commented Apr 2, 2022

This is pending; I am not sure whether I should completely refactor the training workflow in the end.

I think we can do without that for now. But the naming of the config file options is misleading, in my opinion (see #949 (comment)).

@vuillaut
Member Author

vuillaut commented Apr 2, 2022

This is pending; I am not sure whether I should completely refactor the training workflow in the end.

I think we can do without that for now. But the naming of the config file options is misleading, in my opinion (see #949 (comment)).

What about:

"gamma_regressors", "gamma_tmp_regressors", "gamma_classifier", "proton_classifier"

@vuillaut
Member Author

vuillaut commented Apr 4, 2022

I renamed the variables.

Note that the workflow diagram will be added to the built documentation; that should help to understand what is what.

Comment on lines 38 to 43
"n_training_events": {
"gamma_regressors": null,
"gamma_tmp_regressors": null,
"gamma_classifier": 0.2,
"proton_classifier": null
},
Collaborator

Then the default is now:
100% of gammas for (final) regressors
80% of gammas for tmp regressors
20% of gammas for classifier
100% of protons for classifier

correct? Should we write a comment in the json of what the default means?

Collaborator

And the written out regressors are those using 100% of the gammas - but the diagram indicates they would be the 80% ones. Or did I misunderstand?

Member

JSON has no comments.

@vuillaut
Member Author

vuillaut commented Apr 5, 2022

Then the default is now:
100% of gammas for (final) regressors
80% of gammas for tmp regressors
20% of gammas for classifier
100% of protons for classifier

correct? Should we write a comment in the json of what the default means?

That is correct, and this corresponds to the current situation.

I could replace the null values with the actual values in the standard config - that would probably be more explicit.
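
For instance, spelled out with the defaults listed above (illustrative; the exact block in the standard config may differ):

"n_training_events": {
    "gamma_regressors": 1.0,
    "gamma_tmp_regressors": 0.8,
    "gamma_classifier": 0.2,
    "proton_classifier": 1.0
}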

Max is correct: we cannot put comments directly in the JSON file.

@vuillaut
Member Author

vuillaut commented Apr 5, 2022

And the written out regressors are those using 100% of the gammas - but the diagram indicates they would be the 80% ones. Or did I misunderstand?

I am sorry @moralejo, I think you are confusing this with the diagram I made when we discussed whether to also refactor the training process.

The diagram that will be in the doc is this one:

graph LR
    GAMMA[gammas] -->|#`gamma_regressors`| REG(regressors) --> DISK
    GAMMA --> S(split)
    S --> |#`gamma_tmp_regressors`| g_train
    S --> |#`gamma_classifier`| g_test
    g_train --> tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |#`proton_classifier`| D
    D --> cls(classifier)
    cls--> DISK
    classDef empty width:0px,height:0px;

@moralejo
Collaborator

moralejo commented Apr 5, 2022

Then the default is now:
100% of gammas for (final) regressors
80% of gammas for tmp regressors
20% of gammas for classifier
100% of protons for classifier

correct? Should we write a comment in the json of what the default means?

That is correct, and this corresponds to the current situation.

I could replace the null values with the actual values as standard - that would probably be more explicit.

yes, I think so, thanks!

@moralejo
Collaborator

moralejo commented Apr 5, 2022

About the diagram: ok, I see, sorry for the confusion!
Just make the settings explicit in the default config for clarity, and we can move on.

@vuillaut
Member Author

vuillaut commented Apr 5, 2022

Done

@vuillaut vuillaut requested a review from moralejo April 5, 2022 14:53
@vuillaut
Member Author

vuillaut commented Apr 8, 2022

Thank you for the discussion and review. Merging.

@vuillaut vuillaut merged commit c8abcb2 into master Apr 8, 2022
@vuillaut vuillaut deleted the nsamples branch April 8, 2022 07:57