CUDA error: an illegal memory access was encountered #42
Comments
Hi @rassabin |
Hi @rassabin @xvjiarui, my env:
2020-08-02 17:45:33,618 - mmseg - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
CUDA available: True
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.243
GPU 0: GeForce RTX 2060 SUPER
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.0
PyTorch compiling details: PyTorch built with:
- GCC 7.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 10.1
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
- CuDNN 7.6.3
- Magma 2.5.2
- Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
TorchVision: 0.6.0a0+82fd1c8
OpenCV: 4.3.0
MMCV: 1.0.4
MMSegmentation: 0.5.0+2b801de
MMCV Compiler: GCC 7.5
MMCV CUDA Compiler: 10.1

my configs:
2020-08-02 17:37:38,478 - mmseg - INFO - Distributed training: True
2020-08-02 17:37:38,682 - mmseg - INFO - Config:
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained='open-mmlab://resnet50_v1c',
backbone=dict(
type='ResNetV1c',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
dilations=(1, 1, 2, 4),
strides=(1, 2, 1, 1),
norm_cfg=dict(type='SyncBN', requires_grad=True),
norm_eval=False,
style='pytorch',
contract_dilation=True),
decode_head=dict(
type='FCNHead',
in_channels=2048,
in_index=3,
channels=512,
num_convs=2,
concat_input=True,
dropout_ratio=0.1,
num_classes=17,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
auxiliary_head=dict(
type='FCNHead',
in_channels=1024,
in_index=2,
channels=256,
num_convs=1,
concat_input=False,
dropout_ratio=0.1,
num_classes=17,
norm_cfg=dict(type='SyncBN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
dataset_type = 'SkmtDataset'
data_root = 'data/VOCdevkit/Seg/skmt5'
img_norm_cfg = dict(
mean=[34.73, 34.81, 34.45], std=[13.96, 13.93, 14.05], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=4,
workers_per_gpu=4,
train=dict(
type='SkmtDataset',
data_root='data/VOCdevkit/Seg/skmt5',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
split='ImageSets/Segmentation/train.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
dict(type='RandomFlip', flip_ratio=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]),
val=dict(
type='SkmtDataset',
data_root='data/VOCdevkit/Seg/skmt5',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
split='ImageSets/Segmentation/val.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='SkmtDataset',
data_root='data/VOCdevkit/Seg/skmt5',
img_dir='JPEGImages',
ann_dir='SegmentationClass',
split='ImageSets/Segmentation/val.txt',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2048, 512),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[34.73, 34.81, 34.45],
std=[13.96, 13.93, 14.05],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
log_config = dict(
interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
total_iters = 20000
checkpoint_config = dict(by_epoch=False, interval=2000)
evaluation = dict(interval=2000, metric='mIoU')
work_dir = './work_dirs/fcn_r50-d8_512x512_20k_voc12aug'
gpu_ids = range(0, 1)

my errors:
2020-08-02 17:45:34,250 - mmseg - INFO - Loaded 107 images
2020-08-02 17:45:35,864 - mmseg - INFO - Loaded 107 images
2020-08-02 17:45:35,864 - mmseg - INFO - Start running, host: liuxin@liuxin, work_dir: /media/Program/CV/Project/SKMT/mmsegmentation/work_dirs/fcn_r50-d8_512x512_20k_voc12aug
2020-08-02 17:45:35,864 - mmseg - INFO - workflow: [('train', 1)], max: 20000 iters
Traceback (most recent call last):
File "tools/train.py", line 159, in <module>
main()
File "tools/train.py", line 155, in main
meta=meta)
File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/apis/train.py", line 105, in train_segmentor
runner.run(data_loaders, cfg.workflow, cfg.total_iters)
File "/media/Program/CV/Project/mmcv/mmcv/runner/iter_based_runner.py", line 119, in run
iter_runner(iter_loaders[i], **kwargs)
File "/media/Program/CV/Project/mmcv/mmcv/runner/iter_based_runner.py", line 55, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/media/Program/CV/Project/mmcv/mmcv/parallel/distributed.py", line 36, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/models/segmentors/base.py", line 153, in train_step
loss, log_vars = self._parse_losses(losses)
File "/media/Program/CV/Project/SKMT/mmsegmentation/mmseg/models/segmentors/base.py", line 204, in _parse_losses
log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /opt/conda/conda-bld/pytorch_1587428398394/work/c10/cuda/CUDACachingAllocator.cpp:771)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7f7201a47b5e in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x6d0 (0x7f7201802e30 in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7201a356ed in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x51ee0a (0x7f722ea62e0a in /home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x1a311e (0x557099a8f11e in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #5: <unknown function> + 0xfdfc8 (0x5570999e9fc8 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #6: <unknown function> + 0x10f147 (0x5570999fb147 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #7: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #8: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #9: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #10: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #11: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #12: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #13: <unknown function> + 0x10f15d (0x5570999fb15d in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #14: PyDict_SetItem + 0x502 (0x557099a50172 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #15: PyDict_SetItemString + 0x4f (0x557099a50c4f in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #16: PyImport_Cleanup + 0xa0 (0x557099a95760 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #17: Py_FinalizeEx + 0x67 (0x557099b10817 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #18: <unknown function> + 0x2373d3 (0x557099b233d3 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #19: _Py_UnixMain + 0x3c (0x557099b236fc in /home/liuxin/anaconda3/envs/mmlab/bin/python)
frame #20: __libc_start_main + 0xe7 (0x7f7250c9bb97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: <unknown function> + 0x1dc3c0 (0x557099ac83c0 in /home/liuxin/anaconda3/envs/mmlab/bin/python)
Traceback (most recent call last):
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/liuxin/anaconda3/envs/mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/liuxin/anaconda3/envs/mmlab/bin/python', '-u', 'tools/train.py', '--local_rank=0', 'configs/fcn/fcn_r50-d8_512x512_20k_voc12aug.py', '--launcher', 'pytorch']' died with <Signals.SIGABRT: 6>.

I use my custom dataset (num_classes is 17, including background); the original images are RGB and the gt_seg_map is a P-mode PIL image. |
I solved this problem by converting the label images to single-channel maps whose pixel values are the class ids. |
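A minimal sketch of that conversion, assuming a color-to-class mapping of your own (COLOR_TO_ID below is a made-up example, not from this thread):

# Sketch: convert RGB (or palette-rendered) masks into single-channel maps
# whose pixel values are class ids. COLOR_TO_ID is a hypothetical mapping --
# replace it with your dataset's actual palette.
import numpy as np
from PIL import Image

COLOR_TO_ID = {(0, 0, 0): 0, (128, 0, 0): 1, (0, 128, 0): 2}

def rgb_mask_to_class_ids(src_path, dst_path):
    rgb = np.array(Image.open(src_path).convert('RGB'))
    ids = np.zeros(rgb.shape[:2], dtype=np.uint8)
    for color, class_id in COLOR_TO_ID.items():
        ids[np.all(rgb == np.array(color), axis=-1)] = class_id
    Image.fromarray(ids, mode='L').save(dst_path)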
Hi @UESTC-Liuxin |
Hi, did you add it to the docs? I encountered the same problem. At the beginning, I ran with 1 sample per GPU and it worked fine.
After I changed the setting to 2 samples per GPU, it raised the error.
The error:
Any ideas about that? Thank you very much~ |
Hi @fangruizhu |
Thanks for the quick reply! @xvjiarui |
Hi @fangruizhu
|
Thanks! @xvjiarui When I trained on a single GPU with multiple samples, I got the same error. The error occurs at register_module. Would that cause problems?
|
Hi @fangruizhu |
Thanks! @xvjiarui I fixed the error by reinstalling the mmcv lite version, which disables the CUDA compiler in mmcv. I think the error may have been caused by the CUDA environment. |
Because we use labelme and some data annotators wrote a wrong new label, the annotations contained 11 classes rather than 10... |
Check your labels and the dataset configs if you encounter this error. |
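A quick sanity check along those lines; the path and NUM_CLASSES below follow the config posted earlier in this thread, so adjust both to your setup:

# Sketch: verify every mask pixel is either a valid class id or the ignore
# value; out-of-range ids are a common cause of this CUDA illegal-access error.
import glob
import numpy as np
from PIL import Image

NUM_CLASSES = 17    # must match num_classes in the model config
IGNORE_INDEX = 255  # default ignore value used by mmseg losses

for path in glob.glob('data/VOCdevkit/Seg/skmt5/SegmentationClass/*.png'):
    ids = np.unique(np.array(Image.open(path)))
    bad = ids[(ids >= NUM_CLASSES) & (ids != IGNORE_INDEX)]
    if bad.size:
        print(path, 'has out-of-range label ids:', bad.tolist())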
In my case, I used my custom dataset with a deeplabv3plus config.
What was a trap is that we have to change num_classes in configs/deeplabv3plus/deeplabv3plus_r50-d8_512x512_160k_ade20k.py, as well as num_classes in configs/_base_/models/deeplabv3plus_r50-d8.py. The config style differs slightly depending on the dataset type you use. Moreover, the same information (num_classes, in my case) appears here and there across the relevant config files, so we really have to be careful that we have completely modified the configs for our custom dataset... |
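A sketch of how that override might look in a single derived config, so num_classes only has to be edited in one place; the _base_ paths are assumed from the standard repo layout and may differ in your mmsegmentation version:

# deeplabv3plus_r50-d8_512x512_160k_mydataset.py -- hypothetical file name.
_base_ = [
    '../_base_/models/deeplabv3plus_r50-d8.py',
    '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_160k.py',
]
# Override num_classes for both heads instead of editing the base model file.
model = dict(
    decode_head=dict(num_classes=17),
    auxiliary_head=dict(num_classes=17))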
Hello, I would like to reopen this issue because I have exactly the same problem described above. I want to run the segmentation training code for Cityscapes as follows with 1 GPU. However, I obtain this error:
The error occurs at the same location (as for @xvjiarui above). I attempted to follow the other issues' comments.
|
From checking your labels:
As for a custom dataset that does not ignore the background: during the padding process, add another index for the padded elements so it does not conflict with the ids that are used in your loss function. For example: |
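The original example is not preserved in this thread; a minimal sketch of the idea, using the mmseg defaults (255 is already the default ignore value in mmseg heads, shown explicitly here for clarity):

# Sketch: pad segmentation maps with a value that is *not* a valid class id
# and keep the loss ignoring it, so padded pixels never enter the loss as a
# bogus extra class.
train_pipeline = [
    # ... earlier transforms ...
    dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
model = dict(
    decode_head=dict(ignore_index=255),
    auxiliary_head=dict(ignore_index=255))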
The error was encountered during the training process with these configs:
The script takes approximately 4-5 GB of the 11 GB of GPU memory available and returns this error:
#ERROR
But if I reduce the image size by half with the same number of images per GPU (2), the script takes approximately 2 GB of GPU memory and everything works fine.
Also, I want to add that with another PyTorch script using my own DataLoader I am able to fill the GPU completely (11 GB) during training, with the same Torch version and the same hardware.