
Training metrics #100

Closed
rcmalli opened this issue Aug 11, 2019 · 7 comments
Labels
feature (Is an improvement or enhancement), help wanted (Open to be worked on)

Comments

@rcmalli

rcmalli commented Aug 11, 2019

Should we have training accuracy calculation automated?

Currently I am handling like this

import torch
import pytorch_lightning as ptl

class Model(ptl.LightningModule):

    def __init__(self):
        super(Model, self).__init__()
        self.training_correct_counter = 0

    def training_step(self, batch, batch_nb):
        # ... forward pass producing predictions y_hat and targets y ...
        if batch_nb == 0:
            self.training_correct_counter = (torch.max(y_hat, 1)[1].view(y.size()) == y).sum()
        else:
            self.training_correct_counter += (torch.max(y_hat, 1)[1].view(y.size()) == y).sum()
        return {'loss': self.my_loss(y_hat, y)}

    def validation_end(self, outputs):
        # ...
        train_avg_acc = 100 * self.training_correct_counter / len(self.tng_dataloader.dataset)
        return {'Training/_accuracy': train_avg_acc}
@rcmalli rcmalli added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Aug 11, 2019
@williamFalcon
Contributor

Just calculate accuracy in training_step. You can do whatever in there; it's not just for the loss.
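
The per-batch computation suggested here can be sketched as a small helper (the name `batch_accuracy` is hypothetical, not part of Lightning's API):

```python
import torch

def batch_accuracy(y_hat, y):
    # Fraction of predictions in this batch that match the targets.
    preds = torch.argmax(y_hat, dim=1)
    return (preds == y).float().mean().item()
```

Inside `training_step` one could call this on the batch's `y_hat` and `y` and include the result in the returned dict for logging; note this yields a per-batch value only, which is the limitation discussed below.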

@minhptx

minhptx commented Oct 25, 2019

I think the problem here is that if metrics are calculated in training_step, they are only calculated for one batch. I need to tweak the code as @rcmalli did to aggregate over the whole epoch.

Can we have a function called training_end where we can calculate metrics for the whole epoch? (Something similar to validation_end, but for training.)
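
The epoch-level aggregation being asked for can be done by hand with a small accumulator kept on the module; this is a sketch (the class name `EpochAccuracy` is made up for illustration):

```python
import torch

class EpochAccuracy:
    # Accumulate correct/total counts batch by batch, then compute the
    # epoch accuracy once at the end of the epoch.
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_hat, y):
        preds = torch.argmax(y_hat, dim=1)
        self.correct += (preds == y).sum().item()
        self.total += y.numel()

    def compute(self):
        acc = self.correct / self.total
        self.correct = self.total = 0  # reset for the next epoch
        return acc
```

One would call `update` from `training_step` and `compute` wherever the end-of-epoch hook lands; this replaces the manual `batch_nb == 0` reset logic in the snippet above.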

@expectopatronum
Contributor

@minhptx Did you implement this? I also want to collect my training metrics after each epoch but as far as I understood the new method training_end just collects the output for the whole batch and not all batches in an epoch.

@Jonathan-LeRoux

I'm also interested in such a feature. It took me a little while to understand that training_end and validation_end do not have the same behavior, which is misleading. It might be clearer to have training_end be whatever happens at the end of an epoch, and to rename the current training_end to training_step_end.

@captainvera

captainvera commented Mar 3, 2020

@Jonathan-LeRoux I'm in the same boat. It is super misleading that validation_end and training_end have different behaviour. It took me a while to understand what was going on.

Continuing this discussion @williamFalcon, I think this thread's name is misleading. There's absolutely no reason for lightning to automatically calculate accuracy. On the other hand, it would be super useful if lightning could keep the list of outputs of training_step just like it does for validation_step with validation_end.

Correct me if I'm wrong, but the only way to calculate these metrics is for me to save a state of (y_hat, target) throughout the entire epoch and calculate metrics at certain points. My point is, if I am not supposed to keep state to track validation metrics why would we break that philosophy with the training metrics?

edit:
There are metrics we can calculate per batch, such as accuracy, and just keep a running average; for those we could use external loggers. On the other hand, metrics like F1 need to be calculated over the entirety of the dataset, so pushing values to the loggers at each training step seems useless for this purpose (of course, we could keep running averages of precision etc., but you get the point).
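
To make the F1 point concrete, here is a sketch of computing binary F1 once over predictions and targets collected across all batches of an epoch (the function name and the decision to hand-roll F1 instead of using a library are assumptions for illustration):

```python
import torch

def f1_from_epoch(preds_per_batch, targets_per_batch):
    # Binary F1 computed once over the whole epoch's predictions,
    # not averaged over per-batch F1 scores.
    preds = torch.cat(preds_per_batch)
    targets = torch.cat(targets_per_batch)
    tp = ((preds == 1) & (targets == 1)).sum().item()
    fp = ((preds == 1) & (targets == 0)).sum().item()
    fn = ((preds == 0) & (targets == 1)).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Averaging per-batch F1 values generally gives a different (biased) number, which is why the whole-epoch state is needed.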

@Borda
Member

Borda commented Mar 4, 2020

@captainvera have you checked the recent changes in #776 #889 #950?
Anyway, a PR with suggestions is welcome 🤖

@failable

@captainvera May I ask how you compute metrics like F1 in the current version? I tried to do it in validation_epoch_end, but it seemed that to access the data loader via val_dataloader I would need to handle things like moving tensors to the correct devices manually...
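
One way to avoid touching val_dataloader at all is to aggregate the outputs that validation_step already returns, since Lightning collects them into the list passed to validation_epoch_end. A sketch, assuming each step returns a dict with (hypothetical) keys 'preds' and 'target':

```python
import torch

def collect_epoch_outputs(outputs):
    # 'outputs' is a list with one dict per validation_step call.
    # Concatenate the per-batch tensors into epoch-level tensors;
    # the tensors are already on whatever device the steps produced.
    preds = torch.cat([o['preds'] for o in outputs])
    targets = torch.cat([o['target'] for o in outputs])
    return preds, targets
```

With the epoch-level `preds` and `targets` in hand, a whole-dataset metric like F1 can be computed in one call, with no manual device handling.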

8 participants