Allow selecting N samples for training and testing #949
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #949      +/-   ##
==========================================
- Coverage   85.49%   85.43%   -0.07%
==========================================
  Files          78       78
  Lines        6461     6494      +33
==========================================
+ Hits         5524     5548      +24
- Misses        937      946       +9
```
Continue to review full report at Codecov.
Can you briefly describe how the number `n_events` is used? It seems to me that you are using `n_events` gammas in total, but only a part of them are used in the g/h RF training, together with `n_events` protons. Probably it would be useful to set the numbers of each species separately.
In the present config, […]
If the same config is applied to […]
For a more fine-grained description, I can think of something like: […]
And not allow this option for the inference step.
Not sure this last one is really useful...
I would define this option only for the training stage.
Ok, I will apply these changes.
The fraction of gammas used for the regressors is fixed (20%); it is the same for all regressors but not configurable at the moment. I will modify the code so we can also control that number.
For reference, this is the current training process:

```mermaid
graph LR
    GAMMA[gammas] --> |100%| REG(regressors) --> |dump| DISK
    GAMMA --> S(split)
    S --> |80%| g_train
    S --> |20%| g_test
    g_train --> |reg training| tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |100%| D
    D --> cls(classifier)
    cls --> |dump| DISK
    classDef empty width:0px,height:0px;
```

We keep all gammas for the final regressors saved on disk.
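As a rough Python sketch of that flow (the helpers `train_regressors`, `apply_regressors`, `train_classifier`, and `save` are placeholders for illustration, not lstchain's actual API):

```python
from sklearn.model_selection import train_test_split

# Final regressors: trained on 100% of the gammas, then written to disk.
regressors = train_regressors(gammas)                  # placeholder helper
save(regressors, "regressors.sav")                     # placeholder helper

# 80/20 split: temporary regressors are trained on 80% of the gammas and
# used only to produce DL2 parameters for the remaining 20%.
g_train, g_test = train_test_split(gammas, test_size=0.2)
tmp_regressors = train_regressors(g_train)
g_test_dl2 = apply_regressors(tmp_regressors, g_test)  # placeholder helper

# The classifier is trained on the regressed 20% of gammas plus 100% of
# the protons, then written to disk.
classifier = train_classifier(g_test_dl2, protons)     # placeholder helper
save(classifier, "classifier.sav")
```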
New implementation with the config:

```mermaid
graph LR
    GAMMA[gammas] --> |`gamma_regressors`| REG(regressors) --> DISK
    GAMMA --> S(split)
    S --> |`gamma_classifier_train`| g_train
    S --> |`gamma_classifier_test`| g_test
    g_train --> |reg training| tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |`proton_classifier`| D
    D --> cls(classifier)
    cls --> DISK
    classDef empty width:0px,height:0px;
```
lstchain/tests/test_lstchain.py (outdated)

```json
"gamma_regressors": 0.99,
"gamma_classifier_train": 0.78,
"gamma_classifier_test": 0.21,
"proton_classifier": 0.98
```
`gamma_classifier_test` seems to me a bit of a confusing name. This is actually the fraction used in the training of the classifier, right? (It is "test" only from the point of view of the tmp_regressors.)
Also, if we are not using the full gamma statistics, we could use the same training sample for the tmp regressors and the final ones. In that case, wouldn't it be enough to have, e.g.:

```json
"gamma_regressors": 0.70,
"gamma_classifier": 0.30,
"proton_classifier": 1.00
```

(`gamma_classifier` could default to 1 - fraction_gamma_regressors.)
Hi, this comment about the option naming still holds. My "forget my previous comment" refers to another one that I deleted (it was about the test with fractions adding up to more than 1, to make it fail).
I agree, but I failed to find something more explicit and clear. Any suggestion?
🤔 So you could set 70% to be used for the regressors (…)
Perhaps I am not understanding. If we have enough statistics, the ideal thing would be to (i) train the regressors on some part of the gamma sample; (ii) apply those regressors to another independent part of the gamma sample (perhaps all the rest of it), and to the (desired fraction of the) proton sample; (iii) use the p and g "regressed" samples from (ii) for the classifier training. In this way the classifier training would be done with events reconstructed in the same way (same regressors) as we will use for the real data and the "MC-IRF" sample. As of now, the final regressors are not those used in the creation of the classifier training sample. Probably the effect is not large.
Yes, you got it right. That is what I explained in […]. So, do you suggest that we also change the workflow?
The problem I see with the current workflow is that if you use a large fraction of the gammas (as we do now) for the "tmp regressors", then the gamma statistics for the classifier training are small. On the other hand, if you increase the gamma stats for the classifier, then the tmp regressors will be more and more different (worse) from those we will later use for data (which are created using all gammas). So I think a cleaner solution is to go for a single set of regressors that can be used also to create the classifier-training sample. But there may be more opinions. @rlopezcoto ? @maxnoe ?
Yes.
Ok, this will decrease the performance of the final regressors, and it is a change from what was decided early in the development of lstchain - so this is not just a new feature allowing one to configure the number of samples for training. For reference, this would be the new workflow:

```mermaid
graph LR
    GAMMA[gammas] --> Sg(split)
    Sg --> |p| g_train
    Sg --> |1-p| g_test
    g_train --> |training| reg(regressors)
    reg --> |dump| DISK
    reg --- A[ ]:::empty
    g_test --- A
    A --> |apply reg| g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    PROTON[protons] ------ B[ ]:::empty
    reg --- B
    B --> |apply reg| p_test_dl2 --- D
    D --> |train| cls(classifier)
    cls --> |dump| DISK
    classDef empty width:0px,height:0px;
```
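In the same placeholder-helper style as the earlier sketch, with `p` the configurable fraction of gammas used for the regressors, the proposed workflow would be roughly:

```python
from sklearn.model_selection import train_test_split

# A single set of regressors, trained on a fraction p of the gammas and
# also used to build the classifier training sample.
g_train, g_test = train_test_split(gammas, train_size=p)
regressors = train_regressors(g_train)                 # placeholder helper
save(regressors, "regressors.sav")                     # placeholder helper

# Apply the *same* regressors to the held-out gammas and to the protons,
# so the classifier is trained on events reconstructed in the same way as
# the real data will be.
g_test_dl2 = apply_regressors(regressors, g_test)      # placeholder helper
p_test_dl2 = apply_regressors(regressors, protons)

classifier = train_classifier(g_test_dl2, p_test_dl2)  # placeholder helper
save(classifier, "classifier.sav")
```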
Note that "apply regressors" is missing for the protons in the graph. Probably the importance of the regressor outputs in the classifier is not large; after all, they are built from the same inputs that go into the classifier anyway. If that is true, and for the case in which no regressor outputs are fed to the classifier, we may want to use the full gamma sample for regressors and classifier. Shall we just allow the fractions of gammas (for regressors and classifier) to add up to more than 1? (If they add up to <= 1, then independent event samples are drawn from the gamma sample; otherwise we just take the requested fraction for each at random from the original sample.)
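A minimal sketch of that drawing rule, assuming the events sit in a pandas DataFrame (the function name is made up for illustration):

```python
import pandas as pd

def draw_gamma_samples(gammas: pd.DataFrame, f_reg: float, f_cls: float,
                       random_state=None):
    """Draw the regressor and classifier gamma samples.

    If f_reg + f_cls <= 1, the two samples are disjoint; otherwise each
    requested fraction is drawn independently at random from the full
    sample, so events may end up in both.
    """
    reg_sample = gammas.sample(frac=f_reg, random_state=random_state)
    if f_reg + f_cls <= 1:
        remaining = gammas.drop(reg_sample.index)
        n_cls = round(f_cls * len(gammas))
        cls_sample = remaining.sample(n=n_cls, random_state=random_state)
    else:
        cls_sample = gammas.sample(frac=f_cls, random_state=random_state)
    return reg_sample, cls_sample
```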
Right, thanks! I edited accordingly.
This is pending; I am not sure whether I should completely refactor the training workflow in the end.
I think we can do without that for now. But the naming of the config file options is misleading, in my opinion (see #949 (comment)).
What about: […]?
I renamed the variables. Note that the diagram of the workflow will be added to the built documentation, which should help to understand what is what.
"n_training_events": { | ||
"gamma_regressors": null, | ||
"gamma_tmp_regressors": null, | ||
"gamma_classifier": 0.2, | ||
"proton_classifier": null | ||
}, |
Then the default is now:

- 100% of gammas for the (final) regressors
- 80% of gammas for the tmp regressors
- 20% of gammas for the classifier
- 100% of protons for the classifier

Correct? Should we write a comment in the json of what the default means?
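For illustration, a sketch of how null entries could fall back to these defaults (the section name `n_training_events` is from this PR; the helper itself is hypothetical, not lstchain's actual code):

```python
# Defaults for the "n_training_events" config section: a null entry in
# the JSON (None after json.load) falls back to the value listed here.
TRAINING_DEFAULTS = {
    "gamma_regressors": 1.0,      # 100% of gammas for the final regressors
    "gamma_tmp_regressors": 0.8,  # 80% of gammas for the tmp regressors
    "gamma_classifier": 0.2,      # 20% of gammas for the classifier
    "proton_classifier": 1.0,     # 100% of protons for the classifier
}

def resolve_training_fractions(config: dict) -> dict:
    """Hypothetical helper: fill null/missing entries with the defaults."""
    section = config.get("n_training_events", {})
    return {key: default if section.get(key) is None else section[key]
            for key, default in TRAINING_DEFAULTS.items()}
```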
And the regressors written out to disk are those using 100% of the gammas - but the diagram indicates they would be the 80% ones. Or did I misunderstand?
JSON has no comments.
That is correct, and this corresponds to the current situation. I could replace the […]. Max is correct: we cannot add comments directly in the JSON file.
I am sorry @moralejo, I think you are confusing it with the diagram I made when we discussed whether to refactor the training process as well. The diagram that will be in the doc is this one:

```mermaid
graph LR
    GAMMA[gammas] --> |#`gamma_regressors`| REG(regressors) --> DISK
    GAMMA --> S(split)
    S --> |#`gamma_tmp_regressors`| g_train
    S --> |#`gamma_classifier`| g_test
    g_train --> tmp_reg(tmp regressors)
    tmp_reg --- A[ ]:::empty
    g_test --- A
    A --> g_test_dl2
    g_test_dl2 --- D[ ]:::empty
    protons -------- |#`proton_classifier`| D
    D --> cls(classifier)
    cls --> DISK
    classDef empty width:0px,height:0px;
```
Yes, I think so, thanks!
About the diagram: ok, I see, sorry for the confusion!
Done.
Thank you for the exchanges and the review. I will merge.
Early review is welcome.

Fixes #948