
Help me, binary segmentation acc error! #2628

Closed
xiaoaxiaoxiaocao opened this issue Feb 21, 2023 · 13 comments
xiaoaxiaoxiaocao commented Feb 21, 2023

I have a custom dataset (modeled after the DRIVE dataset) with two categories, foreground and background. The annotation image values are divided by 128, which is equivalent to '1 if value >= 128 else 0'. The training results are as follows, and I don't know how to improve them. Help me, thanks!
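(For reference, for 8-bit pixel values in [0, 255] the two formulations really are the same mapping; a minimal sketch verifying this, not part of the original report:

import numpy as np

values = np.arange(256, dtype=np.uint8)  # every possible 8-bit pixel value
# integer division by 128 and thresholding at 128 both map [0, 127] -> 0 and [128, 255] -> 1
assert np.array_equal(values // 128, (values >= 128).astype(np.uint8))
)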

2023-02-21 10:46:45,168 - mmseg - INFO - Iter [1700/20000]      lr: 9.240e-03, eta: 1:06:42, time: 0.211, data_time: 0.004, memory: 11397, decode.loss_ce: 0.0502, decode.acc_seg: 99.0194, aux.loss_ce: 0.0214, aux.acc_seg: 99.0194, loss: 0.0716
2023-02-21 10:46:55,700 - mmseg - INFO - Iter [1750/20000]      lr: 9.217e-03, eta: 1:06:27, time: 0.211, data_time: 0.004, memory: 11397, decode.loss_ce: 0.0315, decode.acc_seg: 99.5005, aux.loss_ce: 0.0128, aux.acc_seg: 99.5005, loss: 0.0444
2023-02-21 10:47:06,254 - mmseg - INFO - Iter [1800/20000]      lr: 9.195e-03, eta: 1:06:12, time: 0.211, data_time: 0.004, memory: 11397, decode.loss_ce: 0.0369, decode.acc_seg: 99.3797, aux.loss_ce: 0.0145, aux.acc_seg: 99.3797, loss: 0.0514
2023-02-21 10:47:16,802 - mmseg - INFO - Iter [1850/20000]      lr: 9.172e-03, eta: 1:05:58, time: 0.211, data_time: 0.004, memory: 11397, decode.loss_ce: 0.0535, decode.acc_seg: 99.0453, aux.loss_ce: 0.0209, aux.acc_seg: 99.0453, loss: 0.0744
2023-02-21 10:47:27,349 - mmseg - INFO - Iter [1900/20000]      lr: 9.150e-03, eta: 1:05:43, time: 0.211, data_time: 0.004, memory: 11397, decode.loss_ce: 0.0499, decode.acc_seg: 99.1083, aux.loss_ce: 0.0198, aux.acc_seg: 99.1083, loss: 0.0697
2023-02-21 10:47:37,877 - mmseg - INFO - Iter [1950/20000]      lr: 9.127e-03, eta: 1:05:29, time: 0.211, data_time: 0.004, memory: 11397, decode.loss_ce: 0.0223, decode.acc_seg: 99.6419, aux.loss_ce: 0.0096, aux.acc_seg: 99.6419, loss: 0.0319


+--------------+-------+-------+
|    Class     |  IoU  |  Acc  |
+--------------+-------+-------+
|  background  | 99.57 | 100.0 |
| Manipulation |  0.0  |  0.0  |
+--------------+-------+-------+
2023-02-21 10:52:53,215 - mmseg - INFO - Summary:
2023-02-21 10:52:53,215 - mmseg - INFO - 
+-------+-------+------+
|  aAcc |  mIoU | mAcc |
+-------+-------+------+
| 99.57 | 49.78 | 50.0 |
+-------+-------+------+
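(For reference, the summary follows directly from the per-class table: mIoU = (99.57 + 0.0) / 2 ≈ 49.78 and mAcc = (100.0 + 0.0) / 2 = 50.0. In other words, the foreground ('Manipulation') class gets no correct predictions at all, and only the background class contributes to the scores.)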

My config:

_base_ = [
    '../../../configs/_base_/models/fastfcn_r50-d32_jpu_psp.py', '../../../configs/_base_/datasets/manipulation.py',
    '../../../configs/_base_/default_runtime.py', '../../../configs/_base_/schedules/schedule_20k.py'
]

model = dict(
    decode_head=dict(num_classes=2,
                    out_channels=2,
                    loss_decode=dict(
                        type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)
                    ),
    auxiliary_head=dict(num_classes=2,
                        out_channels=2,
                        loss_decode=dict(
                            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)
                        ))

# dataset settings
dataset_type = 'ManipulationDataset' #change
data_root = '/home/featurize/data/manipulation'  

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (320, 320)

@MeowZheng (Collaborator)

Would you like to provide the full config of the dataset settings?

@xiaoaxiaoxiaocao (Author) commented Feb 23, 2023

Would you like to provide the full config of the dataset settings?

# dataset settings
dataset_type = 'ManipulationDataset' #change
data_root = '/home/featurize/data/manipulation'   

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1280, 640), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2560, 640),
        # img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    samples_per_gpu=6,
    workers_per_gpu=4,
    train=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=train_pipeline),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline))

optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0005)
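(For context, the config above assumes a dataset class registered under the name 'ManipulationDataset'. Below is a minimal sketch of what such a registration typically looks like in mmsegmentation 0.x, modeled on the DRIVE dataset class; the file suffixes and palette here are assumptions, not taken from this thread:

# Hypothetical ManipulationDataset registration (mmsegmentation 0.x style).
from mmseg.datasets.builder import DATASETS
from mmseg.datasets.custom import CustomDataset


@DATASETS.register_module()
class ManipulationDataset(CustomDataset):
    CLASSES = ('background', 'Manipulation')
    PALETTE = [[120, 120, 120], [6, 230, 230]]

    def __init__(self, **kwargs):
        super(ManipulationDataset, self).__init__(
            img_suffix='.jpg',        # assumption: image file extension
            seg_map_suffix='.png',    # assumption: annotation file extension
            reduce_zero_label=False,  # background (label 0) is a real class here
            **kwargs)
)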

@xiexinch (Collaborator)

Hi @xiaoaxiaoxiaocao,
It seems that your config has no problem. Could you provide the full config and some example images?

@xiaoaxiaoxiaocao (Author)

@xiexinch, the full config and some example images:
Link: https://pan.baidu.com/s/11MtqqT5LvgnoQhFhyrshEg (extraction code: 8cqv)
thanks for your help!

@csatsurnh (Collaborator)

Did you use the annotations in data_process for training? It seems that all the annotation images are pure black.

@xiaoaxiaoxiaocao (Author)

Yes, I used the data in data_process for training. I referred to the DRIVE dataset: the annotation image (data_ori) values are divided by 128, which is equivalent to '1 if value >= 128 else 0'.

@csatsurnh (Collaborator)

What do you mean by "divided by 128 is equivalent to '1 if value >= 128 else 0'"? Would you mind providing the complete preprocessing code you used?

@xiaoaxiaoxiaocao (Author)

preprocessing code:

for i, file in enumerate(files):
    img_path = os.path.join(root, file)
    img = cv2.imread(img_path)
    img_train_path = os.path.join(train_path, file)
    cv2.imwrite(img_train_path, img[:, :, 0] // 128)
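
(One way to sanity-check the result of this step is to inspect the label values actually stored in the processed masks; a minimal sketch, where train_path is the output directory from the loop above:

# Count the label values present in the processed annotation masks.
import os
import cv2
import numpy as np

for file in os.listdir(train_path):
    mask = cv2.imread(os.path.join(train_path, file), cv2.IMREAD_UNCHANGED)
    values, counts = np.unique(mask, return_counts=True)
    print(file, dict(zip(values.tolist(), counts.tolist())))
    # Expect both 0 (background) and 1 (foreground); if only 0 ever appears,
    # every pixel has been mapped to background.
)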

@xiaoaxiaoxiaocao (Author)

@xiexinch @csatsurnh If I do not do the above data preprocessing, the following error will be reported:

2023-03-05 10:38:24,533 - mmseg - INFO - workflow: [('train', 1)], max: 80000 iters
2023-03-05 10:38:24,533 - mmseg - INFO - Checkpoints will be saved to /home/xiaojie/demo/mmsegmentation/work_dirs/deeplabv3_r50-d8_512x512_80k_mani_test2 by HardDiskBackend.
Traceback (most recent call last):
  File "tools/train.py", line 242, in <module>
    main()
  File "tools/train.py", line 238, in main
    meta=meta)
  File "/home/xiaojie/demo/mmsegmentation/mmseg/apis/train.py", line 194, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/runner/iter_based_runner.py", line 70, in train
    self.call_hook('after_train_iter')
  File "/home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/mmcv/runner/hooks/optimizer.py", line 65, in after_train_iter
    runner.outputs['loss'].backward()
  File "/home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 256, 40, 40], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(256, 2, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x52ef590
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 2, 256, 40, 40,
strideA = 409600, 1600, 40, 1,
output: TensorDescriptor 0x52f9160
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 2, 2, 40, 40,
strideA = 3200, 1600, 40, 1,
weight: FilterDescriptor 0x7f2df402d8a0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 2, 256, 1, 1,
Pointer addresses:
input: 0x7f2dffc30000
output: 0x7f2ee0fd5800
weight: 0x7f2ee0f8b200

terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:1230 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2fa724c7d2 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0x239de (0x7f2fdfe319de in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x22d (0x7f2fdfe3357d in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x301898 (0x7f305c665898 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f2fa7235005 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: torch::autograd::SavedVariable::reset_data() + 0xa1 (0x7f2fe301eee1 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: + 0x2838bab (0x7f2fe292fbab in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0x2eef0a2 (0x7f2fe2fe60a2 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f2fe2fe614f in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x2ed0e67 (0x7f2fe2fc7e67 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: c10::TensorImpl::release_resources() + 0x20 (0x7f2fa7234eb0 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #11: + 0x1edf69 (0x7f305c551f69 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: + 0x4e5818 (0x7f305c849818 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: THPVariable_subclass_dealloc(_object*) + 0x299 (0x7f305c849b19 in /home/xiaojie/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: python() [0x4ac8e7]
frame #15: python() [0x4acafd]
frame #16: python() [0x4d5894]
frame #17: python() [0x4bbc68]
frame #18: python() [0x4d05bb]
frame #19: python() [0x4d05d1]
frame #20: python() [0x4d05d1]
frame #21: python() [0x4a1947]

frame #25: python() [0x5449b9]
frame #27: __libc_start_main + 0xf3 (0x7f3074609083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #28: python() [0x54472e]

Aborted (core dumped)

@xiexinch (Collaborator)

preprocessing code:

for i, file in enumerate(files):
    img_path = os.path.join(root, file)
    img = cv2.imread(img_path)
    img_train_path = os.path.join(train_path, file)
    cv2.imwrite(img_train_path, img[:, :, 0] // 128)

Hi @xiaoaxiaoxiaocao,
It is possible that all pixels are labeled as background after this processing.
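
(A more defensive variant of that preprocessing would not rely on the source masks containing exactly 0 and 255; the sketch below, which is not the thread author's code, treats any non-zero pixel as foreground:

# Hedged sketch: treat any non-zero pixel as foreground instead of dividing by 128.
import os
import cv2
import numpy as np

for file in files:  # 'files', 'root' and 'train_path' as in the original loop
    img = cv2.imread(os.path.join(root, file), cv2.IMREAD_GRAYSCALE)
    mask = (img > 0).astype(np.uint8)  # 0 = background, 1 = foreground
    cv2.imwrite(os.path.join(train_path, file), mask)
)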

@xiaoaxiaoxiaocao (Author)

@xiexinch

Why? I referred to the preprocessing of the DRIVE dataset, and its preprocessing was done in this way.

@xiexinch (Collaborator)

@xiexinch

Why? I referred to the preprocessing of the DRIVE dataset, and its preprocessing was done in this way.

We don't know whether your data is the same as DRIVE's, so there is no guarantee that DRIVE's processing will also work on your dataset.

@xiexinch (Collaborator)

Closing the issue, as there has been no activity for a while.
We hope your issue has been resolved.
If not, please feel free to open a new one.
