Can't resume DiT training #1269

carzacc · 2023-08-26T16:43:21Z

I am using DiT and trying to fine-tune it for layout analysis (object detection) on a dataset other than PubLayNet. The end goal is to fine-tune it to go beyond its current classification capabilities.

If I try to resume training, whether I use the config in object_detection or the generated one in the output directory, I get warnings like the following:

WARNING [08/26 17:26:55 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
backbone.fpn_lateral2.{bias, weight}
backbone.fpn_lateral3.{bias, weight}
backbone.fpn_lateral4.{bias, weight}
backbone.fpn_lateral5.{bias, weight}
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [08/26 17:26:55 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  backbone.bottom_up.backbone.backbone.fpn_lateral2.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output2.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral3.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output3.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral4.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output4.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_lateral5.{bias, weight}
  backbone.bottom_up.backbone.backbone.fpn_output5.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.conv.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.objectness_logits.{bias, weight}
  backbone.bottom_up.backbone.proposal_generator.rpn_head.anchor_deltas.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_head.fc1.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_head.fc2.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_predictor.cls_score.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.box_predictor.bbox_pred.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn1.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn2.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn3.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.mask_fcn4.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.deconv.{bias, weight}
  backbone.bottom_up.backbone.roi_heads.mask_head.predictor.{bias, weight}

When training then starts, the loss is very high (seemingly random, between 2 and 5) instead of close to 0.5, which is where I had left it.
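Comparing the two key lists in the warning suggests that every checkpoint key has had the same string prepended before matching. A minimal sketch of how that could be checked, using plain Python lists in place of real state dicts; `find_added_prefix` is a hypothetical helper, not part of DiT, detectron2, or fvcore, and the `{bias, weight}` shorthand from the log is expanded by hand for a few sample keys:

```python
def find_added_prefix(missing_keys, unexpected_keys):
    """Return the single prefix p such that every unexpected key is
    p + <some missing key>, or None if no such common prefix exists."""
    missing = set(missing_keys)
    common = None
    for key in unexpected_keys:
        # All ways to read this key as "prefix + a key the model expects".
        prefixes = {key[: len(key) - len(m)] for m in missing if key.endswith(m)}
        prefixes.discard("")  # an exact match is not a prefix problem
        common = prefixes if common is None else common & prefixes
    if common and len(common) == 1:
        return next(iter(common))
    return None


# A few keys copied from the warning above ({bias, weight} expanded by hand).
missing = [
    "backbone.fpn_lateral2.weight",
    "proposal_generator.rpn_head.conv.bias",
    "roi_heads.box_head.fc1.weight",
]
unexpected = [
    "backbone.bottom_up.backbone.backbone.fpn_lateral2.weight",
    "backbone.bottom_up.backbone.proposal_generator.rpn_head.conv.bias",
    "backbone.bottom_up.backbone.roi_heads.box_head.fc1.weight",
]

prefix = find_added_prefix(missing, unexpected)
print(prefix)
```

If a single prefix comes back, the saved weights are probably intact and only the key names are being rewritten at load time, which would also explain why the loss restarts from a random-looking value.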

I need to be able to resume because the policies of the university cluster I am using don't allow long training sessions.

Platform: Ubuntu 20.04.6, CUDA 11.4
Python version: 3.9.17
PyTorch version (GPU?): 1.9.1+cu111

carzacc · 2023-08-26T16:45:02Z

By the way, just FYI, I have opened PR #1242 which corrects one of the training examples in the README.

carzacc · 2023-09-01T09:00:02Z

https://github.com/microsoft/unilm/blob/b60c741f746877293bb85eed6806736fc8fa0ffd/dit/object_detection/ditod/mycheckpointer.py#L199C1-L205C10

By not appending that prefix I managed to fix the issue and resume training correctly. Why is it there?
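One possible alternative to removing the prefix outright is to add it only when a key does not already match the model. This is a minimal sketch under assumptions, not the repository's code: `add_prefix_if_needed` is a hypothetical helper, plain dicts stand in for state dicts, and the `blocks.0.attn.qkv.weight` key is an invented example of what a backbone-only pretrained checkpoint might contain.

```python
PREFIX = "backbone.bottom_up.backbone."  # the prefix seen in the warning


def add_prefix_if_needed(checkpoint_state, model_keys, prefix=PREFIX):
    """Prefix a checkpoint key only when the bare key is unknown to the
    model but the prefixed key is known; otherwise keep the key as-is."""
    model_keys = set(model_keys)
    remapped = {}
    for key, value in checkpoint_state.items():
        if key not in model_keys and prefix + key in model_keys:
            remapped[prefix + key] = value  # backbone-only checkpoint
        else:
            remapped[key] = value  # already matches, or left for the loader to report
    return remapped


# Hypothetical model keys: one backbone weight plus one detection-head weight.
model_keys = [
    "backbone.bottom_up.backbone.blocks.0.attn.qkv.weight",
    "roi_heads.box_head.fc1.weight",
]

# A checkpoint saved during training already uses the full model key names.
resumed = {
    "backbone.bottom_up.backbone.blocks.0.attn.qkv.weight": 1,
    "roi_heads.box_head.fc1.weight": 2,
}
# A backbone-only pretrained checkpoint (assumed layout) uses bare names.
pretrained = {"blocks.0.attn.qkv.weight": 3}

resumed_out = add_prefix_if_needed(resumed, model_keys)
pretrained_out = add_prefix_if_needed(pretrained, model_keys)
```

With this shape, a checkpoint written to the output directory passes through unchanged, while a bare backbone checkpoint still gets remapped. Whether that matches the reason the prefix was added in the first place is the open question above.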
