Evaluation during training #840

Closed
YCAyca opened this issue Mar 4, 2022 · 13 comments

YCAyca commented Mar 4, 2022

Hello, I was wondering if there is a way to evaluate the model during training, not at the end of training with a second command like test.py, so that validation accuracy and loss can be compared with training loss and accuracy (I didn't see any training accuracy information in my TensorBoard graphs either). Do you have any idea? Having train loss + accuracy and validation loss + accuracy would be very useful for real research and fine-tuning. For example, here is my train log in TensorBoard with only loss information:

[Screenshot: TensorBoard training log showing only the loss curves]

@MartinHahner (Contributor)

See #414.


YCAyca commented Mar 5, 2022

Thank you for the answer! I got the reason, but if I didn't care about the training time, how would it be possible to add this mechanism? Do you have any idea? The code seemed a little bit too complicated to me for this.

@MartinHahner (Contributor)

In the train loop after calling train_one_epoch (e.g. here), you could add code from here and there.

Then these lines can be omitted since you already evaluated during training.
If you can make it work, it would be great if you could create a pull request.
As I said before, I think this should be the way to go.

Something that came to my mind while I was formulating this answer: maybe the memory footprint gets too large when you construct the test_set, test_loader, and sampler in the train loop (because then you effectively have two datasets loaded at the same time, training and evaluation). Maybe that's the reason why it was done one after the other (training, then evaluation) and not interleaved.

test_set, test_loader, sampler = build_dataloader(
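
For illustration, here is a rough sketch of that interleaving, assuming it lives inside tools/train.py: the keyword arguments to build_dataloader follow the signature quoted later in this thread, while eval_one_epoch (from the evaluation utilities), eval_output_dir, and the args/cfg names are assumptions whose exact names and signatures should be checked against the repo.

# Sketch only, not a drop-in change: build the evaluation loader once,
# outside the epoch loop, then evaluate after every training epoch.
test_set, test_loader, test_sampler = build_dataloader(
    dataset_cfg=cfg.DATA_CONFIG,
    class_names=cfg.CLASS_NAMES,
    batch_size=args.batch_size,
    dist=False,
    workers=args.workers,
    logger=logger,
    training=False,   # evaluation split, no augmentation
)

for cur_epoch in range(start_epoch, total_epochs):
    # ... the existing call to train_one_epoch(...) stays here unchanged ...

    model.eval()
    with torch.no_grad():
        # eval_one_epoch's signature is an assumption; check eval_utils.py
        eval_one_epoch(cfg, model, test_loader, cur_epoch, logger,
                       dist_test=False, result_dir=eval_output_dir)
    model.train()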


YCAyca commented Mar 5, 2022

I'm done with training-time evaluation, thank you for your instructions. Another very important thing for the research part that I am trying to implement is adding "training accuracy" information. So basically I do the same steps for the training data, but I don't know why I get the error below. Don't the train and test loaders load the data in a similar way?

# Evaluate the validation dataset: validation loss + accuracy
tmp = str(ckpt_name) + '.pth'
eval_output_dir = ckpt_save_dir / 'validation_accuracy' / ('checkpoint_epoch_%d' % trained_epoch)  # result.pkl
tb_log = SummaryWriter(log_dir=str(ckpt_save_dir / 'validation_accuracy' / 'tensorboard'))
eval_single_ckpt(model, test_loader, tmp, eval_output_dir, logger, trained_epoch,
                 dist_test=False, cfg=cfg, tb_log=tb_log, val_loss=True)

# Evaluate the train dataset: train accuracy
eval_output_dir = ckpt_save_dir / 'train_accuracy' / ('checkpoint_epoch_%d' % trained_epoch)
tb_log = SummaryWriter(log_dir=str(ckpt_save_dir / 'train_accuracy' / 'tensorboard'))
eval_single_ckpt(model, train_loader, tmp, eval_output_dir, logger, trained_epoch,
                 dist_test=False, cfg=cfg, tb_log=tb_log)

[Screenshot: error traceback (missing 'calib')]


YCAyca commented Mar 5, 2022

Sure, I will send a pull request after fixing the train accuracy problem!

@MartinHahner (Contributor)

I don't know, you need to debug where 'calib' gets added (or not added) to the batch_dict.
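
One lightweight way to do that debugging is a small standalone helper (hypothetical, not part of OpenPCDet) dropped in right before the model call, to report which expected keys are missing from batch_dict:

def report_missing_keys(batch_dict, required=('calib', 'image_shape')):
    # Hypothetical debugging helper: prints any expected keys that are absent.
    missing = [k for k in required if k not in batch_dict]
    if missing:
        print('Missing keys:', missing, '| present keys:', sorted(batch_dict.keys()))

# Example usage inside the loop, before the forward pass:
# report_missing_keys(batch_dict)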


LouisSF commented Mar 7, 2022

I was working on this for my own research problems but faced the same issue regarding calib. I'm not quite sure what causes it, so for now I'm just using the following workaround:

for idx_anno, anno in enumerate(annos):
    try:
        calib = anno['calib']
    except KeyError:
        print("No calib found!")
        calib = self.prev_calib          # fall back to the previous frame's calib
    self.prev_calib = calib
    try:
        image_shape = anno['image_shape']
    except KeyError:
        print("No image_shape found!")
        image_shape = self.prev_image_shape   # fall back to the previous frame's image_shape
    self.prev_image_shape = image_shape

I also had to do this for image_shape, as I was facing the same issue there. This will probably cause some slight accuracy errors, but there are only a few KITTI frames for which we get a KeyError.
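
For what it's worth, the same fallback can be written a bit more compactly with dict.get, under the same assumption that self.prev_calib and self.prev_image_shape hold the values from the previous frame:

# Equivalent, slightly more compact form of the workaround above (sketch).
for idx_anno, anno in enumerate(annos):
    if 'calib' not in anno:
        print("No calib found!")
    calib = anno.get('calib', self.prev_calib)
    self.prev_calib = calib

    if 'image_shape' not in anno:
        print("No image_shape found!")
    image_shape = anno.get('image_shape', self.prev_image_shape)
    self.prev_image_shape = image_shape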


YCAyca commented Mar 7, 2022

Actually, I updated the code and now I obtain train loss, train accuracy, and validation accuracy during training. But somehow it affected my results, so when I run inference I now obtain worse results than before (not too bad: the train accuracy info doesn't look right, but the inference results are still usable, with a small drop in accuracy).

As a result we obtain these two TensorBoard logs, one for train loss and accuracy and one for validation accuracy:

[Screenshots: TensorBoard logs, one for train loss and accuracy and one for validation accuracy]

The main problem is that when you use the dataloader for the training set (where you pass training=True as a parameter), you don't get the calib info in your dataset, whereas with training=False you do:

def build_dataloader(dataset_cfg, class_names, batch_size, dist, root_path=None, workers=4,
                     logger=None, training=True, merge_all_iters_to_one_epoch=False, total_epochs=0):

    dataset = __all__[dataset_cfg.DATASET](
        dataset_cfg=dataset_cfg,
        class_names=class_names,
        root_path=root_path,
        training=training,
        logger=logger,
    )

    print("DATASET", training, dataset[0])  # the test dataset includes "calib" info but the train dataset doesn't

    if merge_all_iters_to_one_epoch:
        assert hasattr(dataset, 'merge_all_iters_to_one_epoch')
        dataset.merge_all_iters_to_one_epoch(merge=True, epochs=total_epochs)

    if dist:
        if training:
            sampler = torch.utils.data.distributed.DistributedSampler(dataset)
        else:
            rank, world_size = common_utils.get_dist_info()
            sampler = DistributedSampler(dataset, world_size, rank, shuffle=False)
    else:
        sampler = None

    dataloader = DataLoader(
        dataset, batch_size=batch_size, pin_memory=True, num_workers=workers,
        shuffle=(sampler is None) and training, collate_fn=dataset.collate_batch,
        drop_last=False, sampler=sampler, timeout=0
    )

    return dataset, dataloader, sampler

I tracked the issue: even at the first initialization in kitti_dataset.py (class KittiDataset(DatasetTemplate)), both the train and test datasets come with the same information; the calibration info is deleted in the prepare_data() function in dataset.py. I commented out the lines where training mode causes calib to be deleted, and it worked:

def prepare_data(self, data_dict):
    """
    Args:
        data_dict:
            points: optional, (N, 3 + C_in)
            gt_boxes: optional, (N, 7 + C) [x, y, z, dx, dy, dz, heading, ...]
            gt_names: optional, (N), string
            ...

    Returns:
        data_dict:
            frame_id: string
            points: (N, 3 + C_in)
            gt_boxes: optional, (N, 7 + C) [x, y, z, dx, dy, dz, heading, ...]
            gt_names: optional, (N), string
            use_lead_xyz: bool
            voxels: optional (num_voxels, max_points_per_voxel, 3 + C)
            voxel_coords: optional (num_voxels, 3)
            voxel_num_points: optional (num_voxels)
            ...
    """
    # if self.training:
    #     assert 'gt_boxes' in data_dict, 'gt_boxes should be provided for training'
    #     gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)
    #
    #     data_dict = self.data_augmentor.forward(
    #         data_dict={
    #             **data_dict,
    #             'gt_boxes_mask': gt_boxes_mask
    #         }
    #     )

    if data_dict.get('gt_boxes', None) is not None:
        selected = common_utils.keep_arrays_by_name(data_dict['gt_names'], self.class_names)
        data_dict['gt_boxes'] = data_dict['gt_boxes'][selected]
        data_dict['gt_names'] = data_dict['gt_names'][selected]
        gt_classes = np.array([self.class_names.index(n) + 1 for n in data_dict['gt_names']], dtype=np.int32)
        gt_boxes = np.concatenate((data_dict['gt_boxes'], gt_classes.reshape(-1, 1).astype(np.float32)), axis=1)
        data_dict['gt_boxes'] = gt_boxes

        if data_dict.get('gt_boxes2d', None) is not None:
            data_dict['gt_boxes2d'] = data_dict['gt_boxes2d'][selected]

    if data_dict.get('points', None) is not None:
        data_dict = self.point_feature_encoder.forward(data_dict)

    data_dict = self.data_processor.forward(
        data_dict=data_dict
    )

    # if self.training and len(data_dict['gt_boxes']) == 0:
    #     new_index = np.random.randint(self.__len__())
    #     return self.__getitem__(new_index)

    data_dict.pop('gt_names', None)

    return data_dict


YCAyca commented Mar 7, 2022

And finally I realized the situation...

In dataset.py, the following part causes the training dataset to lose the calib info, as I said. But I understand the reason now: if the dataset is in training mode, we apply data augmentation in self.data_augmentor.forward(), and in that function the calib info is deleted; only the ground-truth boxes are updated according to the augmentation technique. So since I removed data augmentation from training, my results got somewhat worse. To keep data augmentation and still be able to obtain train-time accuracy, I will try to load the train data in test mode with some additional control flags (see the sketch after the snippet below).

if self.training:
    assert 'gt_boxes' in data_dict, 'gt_boxes should be provided for training'
    gt_boxes_mask = np.array([n in self.class_names for n in data_dict['gt_names']], dtype=np.bool_)

    data_dict = self.data_augmentor.forward(
        data_dict={
            **data_dict,
            'gt_boxes_mask': gt_boxes_mask
        }
    )
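
A possible sketch of that "load the train data in test mode" idea: copy the data config, point its test split at the train split, and build a second loader with training=False, so the real training loader keeps augmentation while the evaluation copy keeps calib. The DATA_SPLIT and INFO_PATH keys below follow the KITTI dataset config and may need adjusting for other datasets; the args/cfg names follow tools/train.py.

import copy

# Sketch: a second dataloader over the *training* split, built in evaluation mode.
train_eval_cfg = copy.deepcopy(cfg.DATA_CONFIG)
train_eval_cfg.DATA_SPLIT['test'] = 'train'
train_eval_cfg.INFO_PATH['test'] = train_eval_cfg.INFO_PATH['train']

train_eval_set, train_eval_loader, _ = build_dataloader(
    dataset_cfg=train_eval_cfg,
    class_names=cfg.CLASS_NAMES,
    batch_size=args.batch_size,
    dist=False,
    workers=args.workers,
    logger=logger,
    training=False,   # eval mode: no augmentation, so 'calib' is kept
)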


YCAyca commented Mar 7, 2022

It's done and I will open a pull request, but the commits got a little complicated since the last 3 commits include the changes for this part, and I don't know exactly how to select which parts to include in the pull request and which to leave out. Maybe you can take a look at the pull request and we can discuss it, @MartinHahner.



github-actions bot commented Apr 7, 2022

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Apr 7, 2022


This issue was closed because it has been inactive for 14 days since being marked as stale.
