Evaluation during training #840

Closed
YCAyca opened this issue Mar 4, 2022 · 13 comments

YCAyca commented Mar 4, 2022

Hello, I was wondering if there is a way to evaluate the model during training, not at the end of training with a second command like test.py, so that validation accuracy and loss can be compared with training loss and accuracy (I didn't see any training accuracy information in my TensorBoard graphs either). Do you have any idea? Having train loss + accuracy and validation loss + accuracy would be very useful for real research and fine-tuning. For example, here is my train log in TensorBoard with only loss information:

[Screenshot: TensorBoard training log showing only the loss curves]

@MartinHahner (Contributor)

See #414.


YCAyca commented Mar 5, 2022

Thank you for the answer! I got the reason, but if I didn't care about the training time, how would it be possible to add this mechanism? Do you have any idea? The code seemed a little bit too complicated to me for this.

@MartinHahner (Contributor)

In the train loop after calling train_one_epoch (e.g. here), you could add code from here and there.

Then these lines can be omitted since you already evaluated during training.
If you can make it work, it would be great if you could create a pull request.
As I said before, I think this should be the way to go.

Something that came to my mind while I was formulating this answer: maybe the memory footprint gets too large when you construct the test_set, test_loader, and sampler in the train loop (because then you effectively have two datasets loaded at the same time, training and evaluation). Maybe that's the reason why it was done one after the other (training, then evaluation) and not interleaved.

test_set, test_loader, sampler = build_dataloader(
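
For illustration, here is a rough sketch of that interleaving, assuming it lives inside tools/train.py: the keyword arguments to build_dataloader follow the signature quoted later in this thread, while eval_one_epoch (from the evaluation utilities), eval_output_dir, and the args/cfg names are assumptions whose exact names and signatures should be checked against the repo.

# Sketch only, not a drop-in change: build the evaluation loader once,
# outside the epoch loop, then evaluate after every training epoch.
test_set, test_loader, test_sampler = build_dataloader(
    dataset_cfg=cfg.DATA_CONFIG,
    class_names=cfg.CLASS_NAMES,
    batch_size=args.batch_size,
    dist=False,
    workers=args.workers,
    logger=logger,
    training=False,   # evaluation split, no augmentation
)

for cur_epoch in range(start_epoch, total_epochs):
    # ... the existing call to train_one_epoch(...) stays here unchanged ...

    model.eval()
    with torch.no_grad():
        # eval_one_epoch's signature is an assumption; check eval_utils.py
        eval_one_epoch(cfg, model, test_loader, cur_epoch, logger,
                       dist_test=False, result_dir=eval_output_dir)
    model.train()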


YCAyca commented Mar 5, 2022

I'm done with training-time evaluation, thank you for your instructions. Another very important thing for the research part that I am trying to implement is adding "training accuracy" information. So basically I do the same steps for the training data, but I don't know why I get the error below. Don't the train and test loaders load the data in a similar way?

# Evaluate the validation dataset: validation loss + accuracy
tmp = str(ckpt_name) + '.pth'
eval_output_dir = ckpt_save_dir / 'validation_accuracy' / ('checkpoint_epoch_%d' % trained_epoch)  # result.pkl
tb_log = SummaryWriter(log_dir=str(ckpt_save_dir / 'validation_accuracy' / 'tensorboard'))
eval_single_ckpt(model, test_loader, tmp, eval_output_dir, logger, trained_epoch,
                 dist_test=False, cfg=cfg, tb_log=tb_log, val_loss=True)

# Evaluate the train dataset: train accuracy
eval_output_dir = ckpt_save_dir / 'train_accuracy' / ('checkpoint_epoch_%d' % trained_epoch)
tb_log = SummaryWriter(log_dir=str(ckpt_save_dir / 'train_accuracy' / 'tensorboard'))
eval_single_ckpt(model, train_loader, tmp, eval_output_dir, logger, trained_epoch,
                 dist_test=False, cfg=cfg, tb_log=tb_log)

[Screenshot: error traceback (missing 'calib')]


YCAyca commented Mar 5, 2022

Sure, I will send a pull request after fixing the train accuracy problem!

@MartinHahner (Contributor)

I don't know, you need to debug where 'calib' gets added (or not added) to the batch_dict.
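
One lightweight way to do that debugging is a small standalone helper (hypothetical, not part of OpenPCDet) dropped in right before the model call, to report which expected keys are missing from batch_dict:

def report_missing_keys(batch_dict, required=('calib', 'image_shape')):
    # Hypothetical debugging helper: prints any expected keys that are absent.
    missing = [k for k in required if k not in batch_dict]
    if missing:
        print('Missing keys:', missing, '| present keys:', sorted(batch_dict.keys()))

# Example usage inside the loop, before the forward pass:
# report_missing_keys(batch_dict)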


LouisSF commented Mar 7, 2022

I was working on this for my own research problems but faced the same issue regarding calib. I'm not quite sure what causes it, so for now I'm just using the following workaround:

for idx_anno, anno in enumerate(annos):
    try:
        calib = anno['calib']
    except KeyError:
        print("No calib found!")
        calib = self.prev_calib          # fall back to the previous frame's calib
    self.prev_calib = calib
    try:
        image_shape = anno['image_shape']
    except KeyError:
        print("No image_shape found!")
        image_shape = self.prev_image_shape   # fall back to the previous frame's image_shape
    self.prev_image_shape = image_shape

I also had to do this for image_shape, as I was facing the same issue there. This will probably cause some slight accuracy errors, but there are only a few KITTI frames for which we get a KeyError.
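
For what it's worth, the same fallback can be written a bit more compactly with dict.get, under the same assumption that self.prev_calib and self.prev_image_shape hold the values from the previous frame:

# Equivalent, slightly more compact form of the workaround above (sketch).
for idx_anno, anno in enumerate(annos):
    if 'calib' not in anno:
        print("No calib found!")
    calib = anno.get('calib', self.prev_calib)
    self.prev_calib = calib

    if 'image_shape' not in anno:
        print("No image_shape found!")
    image_shape = anno.get('image_shape', self.prev_image_shape)
    self.prev_image_shape = image_shape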


YCAyca commented Mar 7, 2022

Actually, I updated the code and now I obtain train loss, train accuracy, and validation accuracy during training. But somehow it affected my results, so when I run inference I now obtain worse results than before (not too bad: the train accuracy info doesn't look right, but the inference results are still usable, with a small drop in accuracy).

As a result we obtain these two TensorBoard logs, one for train loss and accuracy and one for validation accuracy:

[Screenshots: TensorBoard logs, one for train loss and accuracy and one for validation accuracy]

The main problem is that when you use the dataloader for the training set (where you pass training=True as a parameter), you don't get the calib info in your dataset, whereas with training=False you do:

def build_dataloader(dataset_cfg, class_names, batch_size, dist, root_path=None, workers=4,
                     logger=None, training=True, merge_all_iters_to_one_epoch=False, total_epochs=0):

    dataset = __all__[dataset_cfg.DATASET](
        dataset_cfg=dataset_cfg,
        class_names=class_names,
        root_path=root_path,
        training=training,
        logger=logger,
    )

    print("DATASET", training, dataset[0])  # the test dataset includes "calib" info but the train dataset doesn't

    if merge_all_iters_to_one_epoch:
        assert hasattr(dataset, 'merge_all_iters_to_one_epoch')
        dataset.merge_all_iters_to_one_epoch(merge=True, epochs=total_epochs)

    if dist:
        if training:
            sampler = torch.utils.data.distributed.DistributedSampler(dataset)
        else:
            rank, world_size = common_utils.get_dist_info()
            sampler = DistributedSampler(dataset, world_size, rank, shuffle=False)
    else:
        sampler = None

    dataloader = DataLoader(
        dataset, batch_size=batch_size, pin_memory=True, num_workers=workers,
        shuffle=(sampler is None) and training, collate_fn=dataset.collate_batch,
        drop_last=False, sampler=sampler, timeout=0
    )

    return dataset, dataloader, sampler

I tracked the issue: even at the first initialization in kitti_dataset.py (class KittiDataset(DatasetTemplate)), both the train and test datasets come with the same information; the calibration info is deleted in the prepare_data() function in dataset.py. I commented out the lines where training mode causes calib to be deleted, and it worked:

def prepare_data(self, data_dict):
    """
    Args:
        data_dict:
            points: optional, (N, 3 + C_in)
            gt_boxes: optional, (N, 7 + C) [x, y, z, dx, dy, dz, heading, ...]
            gt_names: optional, (N), string
            ...

    Returns:
        data_dict:
            frame_id: string
            points: (N, 3 + C_in)
            gt_boxes: optional, (N, 7 + C) [x, y, z, dx, dy, dz, heading, ...]
            gt_names: optional, (N), string
            use_lead_xyz: bool
            voxels: optional (num_voxels, max_points_per_voxel, 3 + C)
            voxel_coords: optional (num_voxels, 3)
            voxel_num_points: optional (num_voxels)
            ...
    """
    # if self.training:
    #     assert 'gt_boxes' in data_dict, 'gt_boxes should be provided for training'
    #     gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)
    #
    #     data_dict = self.data_augmentor.forward(
    #         data_dict={
    #             **data_dict,
    #             'gt_boxes_mask': gt_boxes_mask
    #         }
    #     )

    if data_dict.get('gt_boxes', None) is not None:
        selected = common_utils.keep_arrays_by_name(data_dict['gt_names'], self.class_names)
        data_dict['gt_boxes'] = data_dict['gt_boxes'][selected]
        data_dict['gt_names'] = data_dict['gt_names'][selected]
        gt_classes = np.array([self.class_names.index(n) + 1 for n in data_dict['gt_names']], dtype=np.int32)
        gt_boxes = np.concatenate((data_dict['gt_boxes'], gt_classes.reshape(-1, 1).astype(np.float32)), axis=1)
        data_dict['gt_boxes'] = gt_boxes

        if data_dict.get('gt_boxes2d', None) is not None:
            data_dict['gt_boxes2d'] = data_dict['gt_boxes2d'][selected]

    if data_dict.get('points', None) is not None:
        data_dict = self.point_feature_encoder.forward(data_dict)

    data_dict = self.data_processor.forward(
        data_dict=data_dict
    )

    # if self.training and len(data_dict['gt_boxes']) == 0:
    #     new_index = np.random.randint(self.__len__())
    #     return self.__getitem__(new_index)

    data_dict.pop('gt_names', None)

    return data_dict


YCAyca commented Mar 7, 2022

And finally I realized the situation...

In dataset.py, the following part causes the training dataset to lose the calib info, as I said. But I understand the reason now: if the dataset is in training mode, we apply data augmentation in self.data_augmentor.forward(), and in that function the calib info is deleted; only the ground-truth boxes are updated according to the augmentation technique. So since I removed data augmentation from training, my results got somewhat worse. To keep data augmentation and still be able to obtain train-time accuracy, I will try to load the train data in test mode with some additional control flags (see the sketch after the snippet below).

if self.training:
    assert 'gt_boxes' in data_dict, 'gt_boxes should be provided for training'
    gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)

    data_dict = self.data_augmentor.forward(
        data_dict={
            **data_dict,
            'gt_boxes_mask': gt_boxes_mask
        }
    )
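
A possible sketch of that "load the train data in test mode" idea: copy the data config, point its test split at the train split, and build a second loader with training=False, so the real training loader keeps augmentation while the evaluation copy keeps calib. The DATA_SPLIT and INFO_PATH keys below follow the KITTI dataset config and may need adjusting for other datasets; the args/cfg names follow tools/train.py.

import copy

# Sketch: a second dataloader over the *training* split, built in evaluation mode.
train_eval_cfg = copy.deepcopy(cfg.DATA_CONFIG)
train_eval_cfg.DATA_SPLIT['test'] = 'train'
train_eval_cfg.INFO_PATH['test'] = train_eval_cfg.INFO_PATH['train']

train_eval_set, train_eval_loader, _ = build_dataloader(
    dataset_cfg=train_eval_cfg,
    class_names=cfg.CLASS_NAMES,
    batch_size=args.batch_size,
    dist=False,
    workers=args.workers,
    logger=logger,
    training=False,   # eval mode: no augmentation, so 'calib' is kept
)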


YCAyca commented Mar 7, 2022

It's done and I will open a pull request, but the commits got a little complicated since the last 3 commits include the changes for this part, and I don't know exactly how to select which parts to include in the pull request and which to leave out. Maybe you can take a look at the pull request and we can discuss it, @MartinHahner.



github-actions bot commented Apr 7, 2022

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Apr 7, 2022


This issue was closed because it has been inactive for 14 days since being marked as stale.
