CUDA error: an illegal memory access was encountered #270

bkbjsd · 2020-11-21T04:10:50Z

xvjiarui · 2020-11-23T05:21:50Z

Hi @bkbjsd
Which config are you using?
You may want to check whether num_classes is set correctly.

bkbjsd · 2020-11-24T07:10:39Z

I checked it and retried several times; it does not seem to be a num_classes problem.
I am pasting the whole environment, config, and error message:

Run:

(py38_source) # mmsegmentation_ python tools/train.py --no-validate configs/luke/pspnet_ABCDataset.py

2020-11-24 14:53:47,964 - mmseg - INFO - Environment info:

sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
GPU 0: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda-11.1
NVCC: Build cuda_11.1.TC455_06.29069683_0
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.8.0a0+8819bad
PyTorch compiling details: PyTorch built with:

GCC 9.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.9.0.dev20201118
OpenCV: 4.4.0
MMCV: 1.2.6
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.1
MMSegmentation: 0.8.0+

2020-11-24 14:53:47,964 - mmseg - INFO - Distributed training: False
2020-11-24 14:53:48,710 - mmseg - INFO - Config:
ABC_data_root = '/home//code/remote/mmsegmentation_/data/ABC_58G'
ABC_img_dir = 'images'
ABC_ann_dir = 'labels'
ABC_split_dir = 'splits'
ABC_work_dir = '/home//code/remote/mmsegmentation_/work_dirs/ABC_exp'
ABC_cfg_from_file = '/home//code/remote/mmsegmentation_/configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py'
ABC_checkpoint_load_from = '/home//code/remote/mmsegmentation_/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'
ABC_classes = ('void', 's_w_d', 's_y_d', 'ds_w_dn', 'ds_y_dn', 'sb_w_do',
'sb_y_do', 'b_w_g', 'b_y_g', 'db_w_g', 'db_y_g', 'db_w_s',
's_w_s', 'ds_w_s', 's_w_c', 's_y_c', 's_w_p', 's_n_p',
'c_wy_z', 'a_w_u', 'a_w_t', 'a_w_tl', 'a_w_tr', 'a_w_tlr',
'a_w_l', 'a_w_r', 'a_w_lr', 'a_n_lu', 'a_w_tu', 'a_w_m',
'a_y_t', 'b_n_sr', 'd_wy_za', 'r_wy_np', 'vom_wy_n',
'om_n_n', 'noise', 'ignored')
ABC_palette = [[0, 0, 0], [70, 130, 180], [220, 20, 60], [128, 0, 128],
[255, 0, 0], [0, 0, 60], [0, 60, 100], [0, 0, 142],
[119, 11, 32], [244, 35, 232], [0, 0, 160], [153, 153, 153],
[220, 220, 0], [250, 170, 30], [102, 102, 156], [128, 0, 0],
[128, 64, 128],
[238, 232, 170], [190, 153, 153], [0, 0, 230], [128, 128, 0],
[128, 78, 160], [150, 100, 100], [255, 165, 0],
[180, 165, 180], [107, 142, 35], [201, 255, 229],
[0, 191, 255], [51, 255, 51], [250, 128, 114], [127, 255, 0],
[255, 128, 0], [0, 255, 255], [178, 132,
190], [128, 128, 64],
[102, 0, 204], [0, 153, 153], [255, 255, 255]]
ABC_num_classes = 38
ABC_samples_per_gpu = 8
ABC_workers_per_gpu = 1
ABC_crop_size = (758, 768)
dataset_type = 'ABCDataset'
data_root = '/home//code/remote/mmsegmentation_/data/ABC_58G'
work_dir = '/home//code/remote/mmsegmentation_/work_dirs/ABC_exp_202011241453'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (758, 768)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(type='Resize', img_scale=(3384, 2710), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(758, 768), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(758, 768), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3384, 2710),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=8,
workers_per_gpu=1,
train=dict(
type='ABCDataset',
data_root=
'/home//code/remote/mmsegmentation_/data/ABC_58G',
img_dir='images',
ann_dir='labels',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations'),
dict(
type='Resize', img_scale=(3384, 2710), ratio_range=(0.5, 2.0)),
dict(type='RandomCrop', crop_size=(758, 768), cat_max_ratio=0.75),
dict(type='RandomFlip', prob=0.5),
dict(type='PhotoMetricDistortion'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size=(758, 768), pad_val=0, seg_pad_val=255),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_semantic_seg'])
],
split='splits/train.txt'),
val=dict(
type='ABCDataset',
data_root=
'/home//code/remote/mmsegmentation_/data/ABC_58G',
img_dir='images',
ann_dir='labels',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3384, 2710),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
split='splits/val.txt'),
test=dict(
type='ABCDataset',
data_root=
'/home//code/remote/mmsegmentation_/data/ABC_58G',
img_dir='images',
ann_dir='labels',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(3384, 2710),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
],
split='splits/val.txt'))
norm_cfg = dict(type='BN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained='open-mmlab://resnet50_v1c',
backbone=dict(
type='ResNetV1c',
depth=50,
num_stages=4,
out_indices=(0, 1, 2, 3),
dilations=(1, 1, 2, 4),
strides=(1, 2, 1, 1),
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=False,
style='pytorch',
contract_dilation=True),
decode_head=dict(
type='PSPHead',
in_channels=2048,
in_index=3,
channels=512,
pool_scales=(1, 2, 3, 6),
dropout_ratio=0.1,
num_classes=38,
norm_cfg=dict(type='BN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
auxiliary_head=dict(
type='FCNHead',
in_channels=1024,
in_index=2,
channels=256,
num_convs=1,
concat_input=False,
dropout_ratio=0.1,
num_classes=38,
norm_cfg=dict(type='BN', requires_grad=True),
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)))
train_cfg = dict()
test_cfg = dict(mode='whole')
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optimizer_config = dict()
lr_config = dict(policy='poly', power=0.9, min_lr=0.0001, by_epoch=False)
runner = dict(type='IterBasedRunner', max_iters=40000)
checkpoint_config = dict(by_epoch=False, interval=200)
evaluation = dict(interval=200, metric='mIoU')
log_config = dict(
interval=20, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = '/home//code/remote/mmsegmentation_/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'
resume_from = None
workflow = [('train', 1)]
cudnn_benchmark = True
gpu_ids = range(0, 1)

2020-11-24 14:53:49,047 - mmseg - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2020-11-24 14:53:49,049 - mmseg - INFO - EncoderDecoder(
(backbone): ResNetV1c(
(stem): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU(inplace=True)
(3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(4): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(5): ReLU(inplace=True)
(6): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(7): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(8): ReLU(inplace=True)
)
(maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
(layer1): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer2): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
(1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer3): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(3): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(4): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(5): Bottleneck(
(conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
(layer4): ResLayer(
(0): Bottleneck(
(conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(downsample): Sequential(
(0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
)
(1): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
(2): Bottleneck(
(conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4), bias=False)
(bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
)
)
)
(decode_head): PSPHead(
input_transform=None, ignore_index=255, align_corners=False
(loss_decode): CrossEntropyLoss()
(conv_seg): Conv2d(512, 38, kernel_size=(1, 1), stride=(1, 1))
(dropout): Dropout2d(p=0.1, inplace=False)
(psp_modules): PPM(
(0): Sequential(
(0): AdaptiveAvgPool2d(output_size=1)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(1): Sequential(
(0): AdaptiveAvgPool2d(output_size=2)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(2): Sequential(
(0): AdaptiveAvgPool2d(output_size=3)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(3): Sequential(
(0): AdaptiveAvgPool2d(output_size=6)
(1): ConvModule(
(conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
)
(bottleneck): ConvModule(
(conv): Conv2d(4096, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
(auxiliary_head): FCNHead(
input_transform=None, ignore_index=255, align_corners=False
(loss_decode): CrossEntropyLoss()
(conv_seg): Conv2d(256, 38, kernel_size=(1, 1), stride=(1, 1))
(dropout): Dropout2d(p=0.1, inplace=False)
(convs): Sequential(
(0): ConvModule(
(conv): Conv2d(1024, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(bn): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(activate): ReLU(inplace=True)
)
)
)
)

Running Error Message:

2020-11-24 14:53:49,090 - mmseg - INFO - Loaded 26494 images
fatal: not a git repository (or any of the parent directories): .git
2020-11-24 14:53:50,657 - mmseg - INFO - load checkpoint from /home//code/remote/mmsegmentation_/checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
2020-11-24 14:53:50,740 - mmseg - WARNING - The model and loaded state dict do not match exactly

size mismatch for decode_head.conv_seg.weight: copying a param with shape torch.Size([19, 512, 1, 1]) from checkpoint, the shape in current model is torch.Size([38, 512, 1, 1]).
size mismatch for decode_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([38]).
size mismatch for auxiliary_head.conv_seg.weight: copying a param with shape torch.Size([19, 256, 1, 1]) from checkpoint, the shape in current model is torch.Size([38, 256, 1, 1]).
size mismatch for auxiliary_head.conv_seg.bias: copying a param with shape torch.Size([19]) from checkpoint, the shape in current model is torch.Size([38]).
2020-11-24 14:53:50,745 - mmseg - INFO - Start running, host: @ai-server24G, work_dir: /home//code/remote/mmsegmentation_/work_dirs/ABC_exp_202011241453
2020-11-24 14:53:50,745 - mmseg - INFO - workflow: [('train', 1)], max: 40000 iters
Traceback (most recent call last):
File "tools/train.py", line 166, in <module>
main()
File "tools/train.py", line 155, in main
train_segmentor(
File "/home//code/remote/mmsegmentation_/mmseg/apis/train.py", line 116, in train_segmentor
runner.run(data_loaders, cfg.workflow)
File "/home//code/remote/mmcv_/mmcv/runner/iter_based_runner.py", line 130, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home//code/remote/mmcv_/mmcv/runner/iter_based_runner.py", line 60, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home//code/remote/mmcv_/mmcv/parallel/data_parallel.py", line 67, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/base.py", line 152, in train_step
losses = self(**data_batch)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home//code/remote/mmcv_/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/base.py", line 122, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/encoder_decoder.py", line 162, in forward_train
loss_aux = self._auxiliary_head_forward_train(
File "/home//code/remote/mmsegmentation_/mmseg/models/segmentors/encoder_decoder.py", line 124, in _auxiliary_head_forward_train
loss_aux = self.auxiliary_head.forward_train(
File "/home//code/remote/mmsegmentation_/mmseg/models/decode_heads/decode_head.py", line 186, in forward_train
seg_logits = self.forward(inputs)
File "/home//code/remote/mmsegmentation_/mmseg/models/decode_heads/fcn_head.py", line 72, in forward
output = self.convs(x)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home//code/remote/mmcv_/mmcv/cnn/bricks/conv_module.py", line 192, in forward
x = self.conv(x)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 393, in forward
return self._conv_forward(input, self.weight)
File "/home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 389, in _conv_forward
return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([8, 1024, 95, 96], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(1024, 256, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
data_type = CUDNN_DATA_FLOAT
padding = [1, 1, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x5605aca58f20
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 8, 1024, 95, 96,
strideA = 9338880, 9120, 96, 1,
output: TensorDescriptor 0x5605aca54d10
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 8, 256, 95, 96,
strideA = 2334720, 9120, 96, 1,
weight: FilterDescriptor 0x5605aca59ee0
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 256, 1024, 3, 3,
Pointer addresses:
input: 0x7f9642e60000
output: 0x7f97f8298000
weight: 0x7f9ae2800000

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6c (0x7f9bda09092c in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x1cb08 (0x7f9bda0d1b08 in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x51 (0x7f9bda07ab21 in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #3: + 0x89731a (0x7f9bf2b3831a in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x8973e5 (0x7f9bf2b383e5 in /home//anaconda3/envs/py38_source/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #22: __libc_start_main + 0xf3 (0x7f9c2cc240b3 in /lib/x86_64-linux-gnu/libc.so.6)

[2] 9376 abort (core dumped) python tools/train.py --no-validate configs/luke/pspnet_ABCDataset.py

Shiming94 · 2020-11-25T16:51:15Z

Hi @bkbjsd,

I ran into the same problem. Did you solve it? I even get the same error message.

wanghao9610 · 2020-11-27T08:02:18Z

I have run into the same issue. Most likely num_classes does not match the label indices in your ground truth. I suggest carefully checking the label indices of your ground truth against the num_classes setting in the model config. For example, if your ground-truth label indices range from 0 to 19 (inclusive), you have to set num_classes to 20.
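
A minimal sketch of such a check, assuming the annotation masks have already been loaded as integer NumPy arrays and that 255 is the ignore index (the model summary above prints `ignore_index=255` for both heads). The function name `find_invalid_labels` and the toy mask are hypothetical and not part of MMSegmentation.

```python
import numpy as np


def find_invalid_labels(mask, num_classes, ignore_index=255):
    """Return the sorted label values in `mask` that fall outside the valid set.

    A value counts as valid if it lies in [0, num_classes - 1]
    or equals ignore_index.
    """
    values = np.unique(mask)
    in_range = (values >= 0) & (values < num_classes)
    ignored = values == ignore_index
    return values[~(in_range | ignored)].tolist()


# Toy 2x3 mask: 38 is out of range for num_classes=38, 255 is ignored.
mask = np.array([[0, 5, 37], [38, 255, 1]], dtype=np.uint8)
print(find_invalid_labels(mask, num_classes=38))  # [38]
```

Running this over every annotation file and collecting the non-empty results should show quickly whether any mask contains a value the configured heads cannot represent.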

JPLAY0 · 2020-12-04T06:50:15Z

I have encountered the same problem and solved it.

I'm sure the problem comes from the dataset. Take a closer look at the range of category labels in the dataset and at num_classes in the configuration file. Also note how ignore_index is used in the dataset implementation: does the dataset contain unlabeled data?
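
If a dataset does store unlabeled pixels under some value of its own, one possible approach is to remap that value to the ignore index before training. The sketch below assumes masks are NumPy arrays and that 255 is the ignore index. The helper `remap_to_ignore` and the choice of 37 as the "unlabeled" value are purely illustrative.

```python
import numpy as np


def remap_to_ignore(mask, unlabeled_values, ignore_index=255):
    """Return a copy of `mask` with every value in `unlabeled_values`
    replaced by ignore_index. The input mask is left unchanged."""
    out = mask.copy()
    out[np.isin(mask, unlabeled_values)] = ignore_index
    return out


# Hypothetical toy mask in which raw value 37 marks unlabeled pixels.
mask = np.array([[0, 36, 37], [5, 37, 1]], dtype=np.uint8)
print(remap_to_ignore(mask, unlabeled_values=[37]))
```

After such a remap, num_classes has to count only the classes that remain as training targets, and those class IDs still need to run contiguously from 0.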

NickChang97 · 2020-12-10T06:37:55Z

I also encountered it and solved it. Check whether the label indices in your data are consistent with your num_classes. Except for the ignore index, all of them should be in the range 0 to num_classes-1.

Shiming94 · 2020-12-11T14:49:46Z

> I also encountered it and solved it. Check whether the label indices in your data are consistent with your num_classes. Except for the ignore index, all of them should be in the range 0 to num_classes-1.

Hi Nick, thanks a lot for your reply. It really solved my problem. But could you please explain further what "Except for the ignore index, all of them should be in the range 0 to num_classes-1" means? That is, how do I implement the ignore index? Thanks a lot.

NickChang97 · 2020-12-11T15:38:36Z

> > I also encountered it and solved it. Check whether the label indices in your data are consistent with your num_classes. Except for the ignore index, all of them should be in the range 0 to num_classes-1.
>
> Hi Nick, thanks a lot for your reply. It really solved my problem. But could you please explain further what "Except for the ignore index, all of them should be in the range 0 to num_classes-1" means? That is, how do I implement the ignore index? Thanks a lot.

I did not use the ignore index myself, so you may want to check the code that handles it. I remember the value is 255.
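
To illustrate what an ignore index does in a per-pixel cross-entropy loss, here is a small NumPy sketch: pixels whose label equals the ignore index are dropped before the loss is averaged. This is not MMSegmentation's implementation. The function name and toy data are hypothetical, and 255 is assumed because the model summary above prints `ignore_index=255`.

```python
import numpy as np


def cross_entropy_with_ignore(logits, labels, ignore_index=255):
    """Mean cross-entropy over the pixels whose label is not ignore_index.

    logits: float array of shape (num_pixels, num_classes)
    labels: int array of shape (num_pixels,)
    """
    keep = labels != ignore_index
    if not keep.any():
        return 0.0  # nothing to supervise
    kept_logits = logits[keep]
    kept_labels = labels[keep]
    # Numerically stable log-softmax over the class axis.
    shifted = kept_logits - kept_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select the log-probability of the true class for each kept pixel.
    picked = log_probs[np.arange(kept_labels.size), kept_labels]
    return float(-picked.mean())


logits = np.zeros((3, 4))        # uniform predictions over 4 classes
labels = np.array([0, 3, 255])   # the last pixel is unlabeled
loss = cross_entropy_with_ignore(logits, labels)
print(loss)  # log(4), roughly 1.386
```

In this sketch a label such as 4 with only 4 classes would raise an IndexError on the class lookup. The same kind of out-of-range index inside a GPU kernel is a plausible source of the illegal memory access discussed in this thread.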

xvjiarui self-assigned this Nov 23, 2020

bkbjsd changed the title ~~Many days still can't solve it: CUDA error: an illegal memory access was encountered~~ still can't solve it: CUDA error: an illegal memory access was encountered Nov 27, 2020

bkbjsd changed the title ~~still can't solve it: CUDA error: an illegal memory access was encountered~~ Still can't solve it: CUDA error: an illegal memory access was encountered Nov 29, 2020

bkbjsd changed the title ~~Still can't solve it: CUDA error: an illegal memory access was encountered~~ Surely can't solve it: CUDA error: an illegal memory access was encountered Dec 3, 2020

bkbjsd changed the title ~~Surely can't solve it: CUDA error: an illegal memory access was encountered~~ CUDA error: an illegal memory access was encountered Dec 7, 2020

xvjiarui closed this as completed Jan 5, 2021

Junjun2016 added the FAQ label Aug 4, 2021

sainivedh19pt mentioned this issue Feb 28, 2022

CUDA error with several attention heads #1330

Closed

xiexinch mentioned this issue Aug 23, 2022

CUDA Illegal memory access was encountered #1941

Closed
