Joint dataset training question #6904

HeChengHui · 2022-03-08T13:34:45Z

Search before asking

I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

I would like to try out joint dataset training as seen here with COCO + VisDrone2019-det. However, I am not sure if I should start with pre-trained weights (v5m6, s6,n6) or start from scratch (if that is possible).

Additional

No response

glenn-jocher · 2022-03-08T13:37:55Z

@HeChengHui see https://community.ultralytics.com/t/how-to-combine-weights-to-detect-from-multiple-datasets/38/9

HeChengHui · 2022-03-08T14:16:03Z

@glenn-jocher Thank you for referring to that link.

I would like to ask if it would be fine to use the pre-trained m6,s6, and n6 as the starting model to train for COCO+VisDrone? From what I understand, the pre-trained models are trained from COCO. Would doing so cause any unwanted behaviors down the road?

Further, would using a pre-trained model to train on another dataset cause the final model to be bloated with extra parameters from its pre-trained dataset? If that is the case, is it possible to train from scratch?

glenn-jocher · 2022-03-08T21:20:41Z

@HeChengHui yes you can use YOLOv5 pretrained models to start training any dataset or combination of datasets.

HeChengHui · 2022-03-15T17:09:34Z

@glenn-jocher
Thanks for the forum! I managed to merge all the image of COCO and VisDrone, with a labels folder with only 1 class.
I tried training with the following code:
python train.py --device 0 --weights '' --cfg yolov5s_cocoVisdrone.yaml --data coco_visdrone.yaml --batch-size -1 --epochs 300 --evolve --hyp hyp.scratch-low_cocoVisdrone.yaml --data VisDrone.yaml --imgsz 1920 --cache --name yolov5s_cocoVisdrone
but ran into CUDA memory problem even with --batch-size -1

train: weights='', cfg=yolov5s_cocoVisdrone.yaml, data=VisDrone.yaml, hyp=hyp.scratch-low_cocoVisdrone.yaml, epochs=300, batch_size=-1, imgsz=1920, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=300, bucket=, cache=ram, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=yolov5s_cocoVisdrone, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5
YOLOv5  v6.1-36-gc09fb2a torch 1.10.2 CUDA:0 (NVIDIA GeForce RTX 3080 Laptop GPU, 16384MiB)
Overriding model.yaml nc=1 with nc=10
Overriding model.yaml anchors with anchors=3

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     40455  models.yolo.Detect                      [10, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Model Summary: 270 layers, 7046599 parameters, 7046599 gradients, 15.9 GFLOPs

AutoBatch: Computing optimal batch size for --imgsz 1920
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3080 Laptop GPU) 16.00G total, 0.06G reserved, 0.05G allocated, 15.88G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7046599       143.3         1.755         46.88         31.25      (1, 3, 1920, 1920)                    list
     7046599       286.7         3.511         52.06          57.3      (2, 3, 1920, 1920)                    list
     7046599       573.4         7.348         93.72         109.4      (4, 3, 1920, 1920)                    list
     7046599        1147        14.577         195.8         208.4      (8, 3, 1920, 1920)                    list
CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 16.00 GiB total capacity; 13.79 GiB already allocated; 0 bytes free; 13.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

Could this be some error in my settings or just a case of autobatch not working here?
Furthermore, why is it saying Overriding model.yaml nc=1 with nc=10? I had nc in --cfg and --data set to 1.

glenn-jocher · 2022-03-15T17:31:56Z

@HeChengHui your data.yaml nc=10 so it's using that rather than the conflicting nc from your model.yaml.

Based on your partially completed AutoBatch results it seems like your card can support maybe --batch 4 or --batch 8. Experiment to see what works.

HeChengHui · 2022-03-15T17:55:20Z

@glenn-jocher
ah my bad I accidentally added 2 --data

After setting up the environment again, autobatch seems to work when I tried --batch-size 16 and it implemented a size of 7.

HeChengHui · 2022-03-21T10:37:11Z

@glenn-jocher
I tried training using:
python train.py --device 0 --weights yolov5s.pt --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone --evolve

Even on 34 epoch, my metrics (mAP, recall, precision logged into wandb) are all 0. Could this be due to the hyperparameter evolution?

glenn-jocher · 2022-03-21T13:05:11Z

@HeChengHui I don't know what you mean by zero mAP on epoch 34 of 300, as --evolve does not compute mAP until the final epoch. Also note that --batch 1 is extremely small and not recommended.

HeChengHui · 2022-03-21T13:29:46Z

@glenn-jocher

I don't know what you mean by zero mAP on epoch 34 of 300

I refer to the metrics shown in wandb. Is it only evaluated after 300 epochs instead of every epoch?

Also note that --batch 1 is extremely small and not recommended.

Does --batch-size -1 not help to find the best batch size?

glenn-jocher · 2022-03-21T13:38:22Z

Does --batch-size -1 not help to find the best batch size?

Oh yes! Didn't notice the -1. -1 will implement AutoBatch to automatically find the best batch size. But yes during evolution mAP is only evaluated on the final epoch, so there's no way to know it's value until a generation is finished.

HeChengHui · 2022-03-21T14:04:37Z

@glenn-jocher
I see! No wonder it is not showing anything. If I want further training after 300 epochs, is it correct to use back python train.py --device 0 --weights yolov5s.pt --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone --evolve, but change the weights to the best resulting model?

glenn-jocher · 2022-03-21T15:02:56Z

@HeChengHui it sounds like you should just train normally rather than using --evolve. --evolve is intended to take several weeks with significant resources, and it does not return a model, it only returns evolve hyperparameters on your base scenario that you can then use to train a model.

If you just want to train a model don't use --evolve.

HeChengHui · 2022-03-21T15:05:16Z

@glenn-jocher
I see. But since I have already started the process, would it be much more beneficial to use the resulting 300 epochs hyperparameters to train?

glenn-jocher · 2022-03-21T15:06:56Z

@HeChengHui you're not understanding evolution. One training is one generation. Evolution relies on many (hundreds) of generations to evolve optimal hyperparameters. See hyperparameter evolution tutorial for details:

YOLOv5 Tutorials

Train Custom Data 🚀 RECOMMENDED
Tips for Best Training Results ☘️ RECOMMENDED
Weights & Biases Logging 🌟 NEW
Supervisely Ecosystem 🌟 NEW
Multi-GPU Training
PyTorch Hub ⭐ NEW
TFLite, ONNX, CoreML, TensorRT Export 🚀
Test-Time Augmentation (TTA)
Model Ensembling
Model Pruning/Sparsity
Hyperparameter Evolution
Transfer Learning with Frozen Layers ⭐ NEW
TensorRT Deployment

Good luck 🍀 and let us know if you have any other questions!

HeChengHui · 2022-03-24T09:31:24Z

@glenn-jocher

While looking through the different model configurations under models/hub I am interested in using p2.yaml since it also includes detecting xsmall objects which could be useful for my use case of aerial tracking.
To further decrease the network size to increase speed, I deleted the large detection head as follows:

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 2], 1, Concat, [1]],  # cat backbone P2
   [-1, 1, C3, [128, False]],  # 21 (P2/4-xsmall)

   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 18], 1, Concat, [1]],  # cat head P3
   [-1, 3, C3, [256, False]],  # 24 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 27 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 30 (P5/32-large)

   [[21, 24, 27, 30], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4, P5)
  ]

I would like to clarify the purpose of # 17 (P3/8-small) block and if deleting will affect the performance of the model?

glenn-jocher · 2022-03-24T09:54:16Z

@HeChengHui sure you can delete larger output blocks if you don't need them. Results will vary based on your dataset and training settings like --img-size naturally.

Another option for small object detection would just be to train and detect at larger --img-size with the normal P5 models.

HeChengHui · 2022-03-24T10:26:55Z

@glenn-jocher

I see.
How about deleting

 [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

Would that affect the performance? The purpose is to reduce model size and increase speed.

glenn-jocher · 2022-03-24T10:36:46Z

@HeChengHui you can delete anything you want, but if you delete intermediate layers you need to correctly reconnect the remaining layers, i.e. the [[-1, 18], 1, Concat, [1]], # cat head P3 layer depends on layer 18 which will no longer be there if you delete layer 17 etc.

HeChengHui · 2022-03-24T14:10:51Z

@glenn-jocher
Thank you for the suggestion. I tried combining p2 and v5s configuration as follow:

# YOLOv5 v6.0 head with (P2, P3, P4) outputs
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [128, 1, 1]],  #14
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  #15
   [[-1, 2], 1, Concat, [1]],  # cat backbone P2
   [-1, 1, C3, [128, False]],  # 17 (P2/4-xsmall)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 4], 1, Concat, [1]],  # cat head P3
   [-1, 3, C3, [512, False]],  # 20 (P3/8-small)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 23 (P4/16-medium)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4)
  ]

Would this be a valid configuration?

glenn-jocher · 2022-03-24T14:55:00Z

@HeChengHui you can run any model yaml through yolo.py to verify it works and profile it etc.

python models/yolo.py --cfg yolov5s.yaml

HeChengHui · 2022-03-24T15:38:03Z

@glenn-jocher

ohh thank you.

                 from  n    params  module                                  arguments
  0                -1  1      7040  models.common.Conv                      [3, 64, 6, 2, 2]
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  2                -1  3    156928  models.common.C3                        [128, 128, 3]
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  4                -1  6   1118208  models.common.C3                        [256, 256, 6]
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  6                -1  9   6433792  models.common.C3                        [512, 512, 9]
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
  8                -1  3   9971712  models.common.C3                        [1024, 1024, 3]
  9                -1  1   2624512  models.common.SPPF                      [1024, 1024, 5]
 10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  3   2757632  models.common.C3                        [1024, 512, 3, False]
 14                -1  1     65792  models.common.Conv                      [512, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 2]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
 19           [-1, 4]  1         0  models.common.Concat                    [1]
 20                -1  3    690688  models.common.C3                        [512, 256, 3, False]
 21                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
 22          [-1, 14]  1         0  models.common.Concat                    [1]
 23                -1  3   2561024  models.common.C3                        [640, 512, 3, False]
 24      [17, 20, 23]  1    229245  Detect                                  [80, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Traceback (most recent call last):
  File "models/yolo.py", line 309, in <module>
    model = Model(opt.cfg).to(device)
  File "models/yolo.py", line 112, in __init__
    m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))])  # forward
  File "models/yolo.py", line 126, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "models/yolo.py", line 149, in _forward_once
    x = m(x)  # run
  File "C:\Users\chenghui\anaconda3\envs\yolov5\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\YoloV5\yolov5\models\common.py", line 275, in forward
    return torch.cat(x, self.d)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 32 but got size 64 for tensor number 1 in the list.

Seems like something went wrong with the concat layer? Any advice on how to debug this?

glenn-jocher · 2022-03-24T16:24:04Z

@HeChengHui we don't provide support for model customizations, sorry. Perhaps a community member can assist.

HeChengHui · 2022-03-24T16:26:56Z

@glenn-jocher
Alright, thanks for the help!

HeChengHui · 2022-03-25T15:21:25Z

@glenn-jocher

I am training a model using python train.py --device 0 --weights '' --cfg yolov5s-p234.yaml --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone_p234.
The batch size is 24 after autobatch.

After training for 60 epochs, it suddenly failed with CUDA OOM. I looked around and I might need to lower the batch size. However, is there a way to lower the batch size while resuming training? Or must I restart from scratch?

glenn-jocher · 2022-03-25T15:24:58Z

@HeChengHui hi sorry to hear that! That's very strange. Is there any other GPU memory usage on the instance?

AutoBatch seeks to set a batch size for 90% CUDA memory utilization, but perhaps we should reduce the default value to 85%.

You can not modify any parameters on resume, but you can go into train.py and customize the code to force it to a different batch size, i.e.:

yolov5/train.py

Lines 70 to 73 in 7a2a118

    
           save_dir, epochs, batch_size, weights, single_cls, evolve, data, cfg, resume, noval, nosave, workers, freeze = \ 
        
               Path(opt.save_dir), opt.epochs, opt.batch_size, opt.weights, opt.single_cls, opt.evolve, opt.data, opt.cfg, \ 
        
               opt.resume, opt.noval, opt.nosave, opt.workers, opt.freeze

HeChengHui · 2022-03-25T15:31:52Z

@glenn-jocher
Yes it is quite weird indeed given how it survived 60epochs. I have tried --resume after restarting my laptop but the same error still occurs.
I have also checked with wandb and my GPU Memory Allocated (%) stayed at 94.52% throughout before it crashed.

You can not modify any parameters on resume, but you can go into train.py and customize the code to force it to a different batch size

Alright thank you!

HeChengHui · 2022-03-29T14:53:27Z

@glenn-jocher
Hello I have 2 questions

An error occurred when fusing layers for a custom fpn config

300 epochs completed in 42.485 hours.
Optimizer stripped from runs\train\yolov5s_motVisdrone_fpn234\weights\last.pt, 5.2MB
Optimizer stripped from runs\train\yolov5s_motVisdrone_fpn234\weights\best.pt, 5.2MB

Validating runs\train\yolov5s_motVisdrone_fpn234\weights\best.pt...
Fusing layers...
Traceback (most recent call last):
  File "train.py", line 643, in <module>
    main(opt)
  File "train.py", line 539, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 434, in train
    model=attempt_load(f, device).half(),
  File "D:\YoloV5\yolov5\models\experimental.py", line 98, in attempt_load
    model.append(ckpt.fuse().eval() if fuse else ckpt.eval())  # fused or un-fused model in eval mode
  File "D:\YoloV5\yolov5\models\yolo.py", line 225, in fuse
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
  File "D:\YoloV5\yolov5\utils\torch_utils.py", line 202, in fuse_conv_and_bn
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

wandb: Waiting for W&B process to finish, PID 15008... (failed 1). Press ctrl-c to abort syncing.
wandb:
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▃▄▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█████████████████
wandb:   metrics/mAP_0.5:0.95 ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb:      metrics/precision ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇█▇▇▇████▇▇█▇▇██▇███
wandb:         metrics/recall ▁▄▅▅▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇▇▇▇███████████
wandb:         train/box_loss █▆▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/obj_loss █▆▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
wandb:           val/box_loss █▆▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss █▅▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:                  x/lr0 ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:                  x/lr1 ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:                  x/lr2 ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:
wandb: Run summary:
wandb:             best/epoch 299
wandb:           best/mAP_0.5 0.53313
wandb:      best/mAP_0.5:0.95 0.2063
wandb:         best/precision 0.65371
wandb:            best/recall 0.47009
wandb:        metrics/mAP_0.5 0.53313
wandb:   metrics/mAP_0.5:0.95 0.2063
wandb:      metrics/precision 0.65371
wandb:         metrics/recall 0.47009
wandb:         train/box_loss 0.02225
wandb:         train/cls_loss 0.0
wandb:         train/obj_loss 0.03103
wandb:           val/box_loss 0.04784
wandb:           val/cls_loss 0.0
wandb:           val/obj_loss 0.01575
wandb:                  x/lr0 0.00017
wandb:                  x/lr1 0.00017
wandb:                  x/lr2 0.00017
wandb:
wandb: Synced 6 W&B file(s), 325 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced yolov5s_motVisdrone_fpn234: https://wandb.ai/d3cpt/train/runs/1mxmji3y
wandb: Find logs at: .\wandb\run-20220328_031510-1mxmji3y\logs\debug.log

It seems to exit okay but I am not sure if the error is going to cause any problem

I was running test on weights trained to epoch 118, 150 and 155 (out of 300) without realising that the optimiser is not stripped. I would like to check the effects of not stripping the optimiser besides a bigger model size (slower inference speed?)

glenn-jocher · 2022-03-29T14:58:51Z

@HeChengHui it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and pip install requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Models and datasets download automatically from the latest YOLOv5 release when first requested.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

HeChengHui · 2022-03-29T15:23:10Z

@glenn-jocher
I am sure that the environment is correct because I managed to successfully train 2 prior models for 300epochs. Any advice on how to remedy this error or do I have to restart training?

glenn-jocher · 2022-03-29T15:29:21Z

I am sure that the environment is correct

If you trained successfully on another environment, then the independent variable that has changed is your environment, not YOLOv5. Logically you should start examining your environment for issues, or use a working one:

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Google Colab and Kaggle notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

HeChengHui · 2022-03-29T15:36:31Z

If you trained successfully on another environment, then the independent variable that has changed is your environment, not YOLOv5. Logically you should start examining your environment for issues, or use a working one:

Sorry, I meant that I have also managed to train 2 models with no errors in the same environment.
Does that error make the final model invalid?

glenn-jocher · 2022-03-29T15:39:07Z

@HeChengHui your environment is up to you. If you have a reproducible error specific to YOLOv5, then please submit a bug report with code to reproduce.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

✅ Minimal – Use as little code as possible to produce the problem
✅ Complete – Provide all parts someone else needs to reproduce the problem
✅ Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

✅ Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
✅ Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

HeChengHui · 2022-04-07T09:57:58Z

@glenn-jocher
I would like to check if changing the activation function requires --weighs '' --cfg yolov5s.yaml or I can just use --weights yolov5s.pt

glenn-jocher · 2022-04-07T11:32:11Z

@HeChengHui default activation function for YOLOv5 is SiLU:

yolov5/models/common.py

Line 44 in 0ca85ed

    
           self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

HeChengHui · 2022-04-07T11:34:17Z

@glenn-jocher

sorry, I was asking more specifically after changing the activation function, do I need to train the model from scratch using --weighs '' --cfg yolov5s.yaml or I can just use a pretrained one like --weights yolov5s.pt

glenn-jocher · 2022-04-07T11:51:46Z

@HeChengHui I don't understand your question. Nothing is changeable about a trained model. Any changes you make to modules it uses will result in errors or worse results.

HeChengHui · 2022-04-07T11:54:08Z

@glenn-jocher
I am not trying to change the AF after training. I meant to ask before training starts, do I need to train the model from scratch using the cfg file or can I use --weights yolov5s.pt to use the new AF.

glenn-jocher · 2022-04-07T12:00:08Z

@HeChengHui oh, you can do both depending whether you want to start from pretrained weights or not. See Train Custom Data tutorial for details:

YOLOv5 Tutorials

Train Custom Data 🚀 RECOMMENDED
Tips for Best Training Results ☘️ RECOMMENDED
Weights & Biases Logging 🌟 NEW
Roboflow for Datasets, Labeling, and Active Learning 🌟 NEW
Multi-GPU Training
PyTorch Hub ⭐ NEW
TFLite, ONNX, CoreML, TensorRT Export 🚀
Test-Time Augmentation (TTA)
Model Ensembling
Model Pruning/Sparsity
Hyperparameter Evolution
Transfer Learning with Frozen Layers ⭐ NEW
Architecture Summary ⭐ NEW

Good luck 🍀 and let us know if you have any other questions!

HeChengHui · 2022-04-07T12:01:08Z

@glenn-jocher
ok! thanks for the clarification.

HeChengHui · 2022-04-08T10:44:46Z

@glenn-jocher

How do I check the activation of a pretrained model?
In loss.py, I see that the loss function is weighted as follows:
self.balance = {3: [4.0, 1.0, 0.4]}.get(det.nl, [4.0, 1.0, 0.25, 0.06, 0.02]) # P3-P7
Does this mean that P2 is not considered when calculating loss? Because I would like to use P2 as a detection head for xsmall objects.

HeChengHui · 2022-04-16T09:20:25Z

@glenn-jocher

Hi, I would like to clarify the purpose of the test split during training. My understanding is that validation is done on the validation split. Does the test split contribute in any way?

glenn-jocher · 2022-04-16T13:08:11Z

test split is not used during training

github-actions · 2022-05-17T00:22:02Z

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Access additional YOLOv5 🚀 resources:

Wiki – https://github.com/ultralytics/yolov5/wiki
Tutorials – https://docs.ultralytics.com/yolov5
Docs – https://docs.ultralytics.com

Access additional Ultralytics ⚡ resources:

Ultralytics HUB – https://ultralytics.com/hub
Vision API – https://ultralytics.com/yolov5
About Us – https://ultralytics.com/about
Join Our Team – https://ultralytics.com/work
Contact Us – https://ultralytics.com/contact

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

HeChengHui added the question Further information is requested label Mar 8, 2022

github-actions bot added the Stale Stale and schedule for closing soon label May 17, 2022

github-actions bot closed this as completed May 24, 2022

Joint dataset training question #6904

Joint dataset training question #6904

Comments

HeChengHui commented Mar 8, 2022

Search before asking

Question

Additional

glenn-jocher commented Mar 8, 2022

HeChengHui commented Mar 8, 2022 • edited Loading

glenn-jocher commented Mar 8, 2022

HeChengHui commented Mar 15, 2022 • edited Loading

glenn-jocher commented Mar 15, 2022 • edited Loading

HeChengHui commented Mar 15, 2022 • edited Loading

HeChengHui commented Mar 21, 2022

glenn-jocher commented Mar 21, 2022

HeChengHui commented Mar 21, 2022

glenn-jocher commented Mar 21, 2022

HeChengHui commented Mar 21, 2022

glenn-jocher commented Mar 21, 2022

HeChengHui commented Mar 21, 2022

glenn-jocher commented Mar 21, 2022 • edited by UltralyticsAssistant Loading

YOLOv5 Tutorials

HeChengHui commented Mar 24, 2022 • edited Loading

glenn-jocher commented Mar 24, 2022

HeChengHui commented Mar 24, 2022

glenn-jocher commented Mar 24, 2022

HeChengHui commented Mar 24, 2022

glenn-jocher commented Mar 24, 2022

HeChengHui commented Mar 24, 2022

glenn-jocher commented Mar 24, 2022

HeChengHui commented Mar 24, 2022 • edited Loading

HeChengHui commented Mar 25, 2022

glenn-jocher commented Mar 25, 2022

HeChengHui commented Mar 25, 2022 • edited Loading

HeChengHui commented Mar 29, 2022

glenn-jocher commented Mar 29, 2022 • edited by UltralyticsAssistant Loading

Requirements

Environments

Status

HeChengHui commented Mar 29, 2022

glenn-jocher commented Mar 29, 2022 • edited by UltralyticsAssistant Loading

Environments

Status

HeChengHui commented Mar 29, 2022 • edited Loading

glenn-jocher commented Mar 29, 2022 • edited Loading

How to create a Minimal, Reproducible Example

HeChengHui commented Apr 7, 2022

glenn-jocher commented Apr 7, 2022

HeChengHui commented Apr 7, 2022 • edited Loading

glenn-jocher commented Apr 7, 2022

HeChengHui commented Apr 7, 2022

glenn-jocher commented Apr 7, 2022 • edited by UltralyticsAssistant Loading

YOLOv5 Tutorials

HeChengHui commented Apr 7, 2022

HeChengHui commented Apr 8, 2022

HeChengHui commented Apr 16, 2022

glenn-jocher commented Apr 16, 2022

github-actions bot commented May 17, 2022 • edited by glenn-jocher Loading

HeChengHui commented Mar 8, 2022 •

edited

Loading

HeChengHui commented Mar 15, 2022 •

edited

Loading

glenn-jocher commented Mar 15, 2022 •

edited

Loading

HeChengHui commented Mar 15, 2022 •

edited

Loading

glenn-jocher commented Mar 21, 2022 •

edited by UltralyticsAssistant

Loading

HeChengHui commented Mar 24, 2022 •

edited

Loading

HeChengHui commented Mar 24, 2022 •

edited

Loading

HeChengHui commented Mar 25, 2022 •

edited

Loading

glenn-jocher commented Mar 29, 2022 •

edited by UltralyticsAssistant

Loading

glenn-jocher commented Mar 29, 2022 •

edited by UltralyticsAssistant

Loading

HeChengHui commented Mar 29, 2022 •

edited

Loading

glenn-jocher commented Mar 29, 2022 •

edited

Loading

HeChengHui commented Apr 7, 2022 •

edited

Loading

glenn-jocher commented Apr 7, 2022 •

edited by UltralyticsAssistant

Loading

github-actions bot commented May 17, 2022 •

edited by glenn-jocher

Loading