A summary of issues (问题总结) #9

Open
chairc opened this issue Nov 4, 2023 · 12 comments


chairc commented Nov 4, 2023

This issue collects common problems and their solutions. If your problem is not covered here, please open a new issue and I will answer it.

Thanks to everyone who reports bugs and contributes code and pull requests.

Problem quick navigation
Q1. What is the purpose of this project? What significance does it have?
Q2. How should I choose appropriate parameters during training?
Q3. How can I accelerate image generation during training?
Q4. Why am I encountering numerous CUDA or cuDNN errors such as THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp or RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR during training?
Q5. Why do I see noise issues in the generated images?
Q6. How should the dataset be divided? How do I set up conditional and unconditional training?
Q7. The training was interrupted unexpectedly. How can I resume training?
Q8. The training time for each epoch is too long. How can I use a pretrained model?
Q9. Why does using a 32×32 model to generate 64×64 or 128×128 images result in distortion and extra objects?
Q10. Why do I get a RuntimeError: Address already in use error when starting training?
Q11. I encountered a ValueError: Imaginary component XXX error when calculating FID. How can I resolve it?

@chairc chairc self-assigned this Nov 4, 2023
@chairc chairc added the Note!!!! Please attention!!! label Nov 4, 2023
@chairc chairc pinned this issue Nov 4, 2023
Repository owner locked and limited conversation to collaborators Nov 4, 2023
@chairc chairc changed the title A summary of issues A summary of issues (问题总结) Nov 4, 2023

chairc commented Jan 2, 2024

Question 1: What is the purpose of this project? What significance does it have?
Answer: This project is a basic reimplementation of DDPM (Denoising Diffusion Probabilistic Models) and DDIM (Denoising Diffusion Implicit Models), two classic deep learning algorithms for image generation, and is meant as an entry point to the field. It gives you an intuitive understanding of the algorithms' underlying principles, and the code structure mirrors the structure of the papers, which makes it easier to learn from.


chairc commented Jan 2, 2024

Question 2: How should I choose appropriate parameters during training?
Answer: Customize the argparse values in the tools/train.py file. For the meaning of each training parameter, refer to the Parameter Explanation.


chairc commented Jan 2, 2024

Question 3: How can I accelerate image generation during training?
Answer: Set --sample to ddim. DDIM sampling typically needs far fewer denoising steps than DDPM sampling, so images are generated faster.


chairc commented Jan 5, 2024

Question 4: Why am I encountering numerous CUDA or cuDNN errors such as THCudaCheck FAIL file=../aten/src/THC/THCCachingHostAllocator.cpp or RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR during training?
Answer: Check that the --num_classes value in argparse matches the number of classes in your dataset. A common cause of these errors is a --num_classes value that is smaller than the actual number of classes.
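One way to catch this before training starts is to compare the class labels against --num_classes on the CPU, where the failure is easy to read. The sketch below is only an illustration: the helper name find_out_of_range_labels and the example labels are hypothetical, and it assumes (as is common for class-conditional models) that labels are integers in the range 0 to num_classes - 1.

```python
def find_out_of_range_labels(labels, num_classes):
    """Return the distinct labels that fall outside the valid range [0, num_classes)."""
    return sorted({label for label in labels if not 0 <= label < num_classes})


# Hypothetical example: the dataset has 3 classes (0, 1, 2),
# but --num_classes was set to 2.
labels = [0, 1, 2, 1, 0, 2]
bad_labels = find_out_of_range_labels(labels, num_classes=2)

if bad_labels:
    print(f"Labels {bad_labels} do not fit num_classes=2; increase --num_classes.")
```

Running a check like this once over the dataset labels is much cheaper than decoding a CUDA stack trace afterwards.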



chairc commented Jan 9, 2024

Question 5: Why do I see noise issues in the generated images?
Answer: Noise usually appears because the model configuration used for generation does not match the one used during training. Check that --act, --num_classes, and --sample are set correctly. Also inspect the validation images saved after each training round to confirm that the model has actually converged.

The image and question come from #7.

image
image
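The configuration check described in the answer to Question 5 can be sketched as a comparison of two settings dictionaries. This is a minimal illustration, not part of the project: the function name find_config_mismatches and the example values ("gelu", "silu", "ddpm", 10) are hypothetical, and only the three parameter names come from the answer above.

```python
def find_config_mismatches(train_config, generate_config,
                           keys=("act", "num_classes", "sample")):
    """Return {key: (training value, generation value)} for every key that differs."""
    mismatches = {}
    for key in keys:
        trained = train_config.get(key)
        requested = generate_config.get(key)
        if trained != requested:
            mismatches[key] = (trained, requested)
    return mismatches


# Hypothetical settings: the activation function differs between the two runs.
train_config = {"act": "gelu", "num_classes": 10, "sample": "ddpm"}
generate_config = {"act": "silu", "num_classes": 10, "sample": "ddpm"}

mismatches = find_config_mismatches(train_config, generate_config)
print(mismatches)  # {'act': ('gelu', 'silu')}
```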


chairc commented Jan 9, 2024

Question 6: How should the dataset be divided? How do I set up conditional and unconditional training?
Answer: You can store the dataset anywhere on your computer; just set --dataset_path accordingly.

For unconditional training, place all data in one folder. For example, if the file path is /path/dataset/unconditional/images, store all images in the images folder and set --dataset_path to /path/dataset/unconditional.

For conditional training, put images of the same class into the corresponding folder. For example, if you have the folders class0 and class1 and the main directory is /path/dataset/conditional, the two folder paths are /path/dataset/conditional/class0 and /path/dataset/conditional/class1. Once the dataset is organized, set --num_classes to the number of classes (this is not required after version 1.1.4). The configuration is then complete.

Refer to the diagram below for the detailed structure.

image
image
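The two layouts from Question 6 can also be reproduced in a throwaway directory to see what the trainer would find under --dataset_path. The sketch below uses only the Python standard library; the helper list_class_folders is hypothetical and is not how the project itself scans the dataset.

```python
import tempfile
from pathlib import Path


def list_class_folders(dataset_path):
    """Return the sorted names of the sub-folders directly under dataset_path."""
    return sorted(p.name for p in Path(dataset_path).iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as root:
    unconditional = Path(root) / "dataset" / "unconditional"
    conditional = Path(root) / "dataset" / "conditional"

    # Unconditional: every image goes into a single "images" folder.
    (unconditional / "images").mkdir(parents=True)

    # Conditional: one folder per class.
    for class_name in ("class0", "class1"):
        (conditional / class_name).mkdir(parents=True)

    unconditional_folders = list_class_folders(unconditional)
    conditional_folders = list_class_folders(conditional)

print(unconditional_folders)  # ['images']
print(conditional_folders)    # ['class0', 'class1']
```

For the conditional layout, the number of folders listed is the value that --num_classes should match.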


chairc commented Jan 9, 2024

Question 7: The training was interrupted unexpectedly. How can I resume training?
Answer: Don't worry, the trainer can resume training through the --resume and --start_epoch parameters. To resume on a single GPU, run python train.py --resume True, which resumes from ckpt_last.pt by default. To resume from a specific epoch, for example epoch 50, run python train.py --resume True --start_epoch 50. The trainer then loads the weights saved at epoch 49 and continues training from epoch 50 (this requires --save_model_interval to be set to True). For distributed training, make sure all processes from the previous run have terminated before resuming. If any process is still alive, you will get an error saying that the address is already in use.
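The resume rule above, including the off-by-one between --start_epoch and the weights that get loaded, can be written down as a small function. This is a sketch of the described behaviour only, not the trainer's code: the function checkpoint_to_load is hypothetical, and the per-epoch file name pattern ckpt_<epoch>.pt is an assumption made for illustration (only ckpt_last.pt is named in the answer).

```python
def checkpoint_to_load(resume, start_epoch=None):
    """Pick the checkpoint file for a run, following the rule described above."""
    if not resume:
        return None  # fresh training run, nothing to load
    if start_epoch is None:
        return "ckpt_last.pt"  # default when only --resume is given
    if start_epoch < 1:
        raise ValueError("start_epoch must be at least 1 to have a previous epoch")
    # Assumed naming scheme: weights saved at the end of epoch N live in ckpt_N.pt.
    return f"ckpt_{start_epoch - 1}.pt"


print(checkpoint_to_load(resume=True))                  # ckpt_last.pt
print(checkpoint_to_load(resume=True, start_epoch=50))  # ckpt_49.pt
```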



chairc commented Jan 11, 2024

Question 8: The training time for each epoch is too long. How can I use a pretrained model?
Answer: Pretrained models are published with each major version Release, so watch for new releases. To use one, download a model whose parameters (network, image_size, act, and so on) match yours into any local folder. Then run python train.py --pretrain True --pretrain_path /your/pretrain/model.pt to load the pretrained weights. Alternatively, edit the --pretrain and --pretrain_path parameters directly in train.py.



chairc commented Jan 11, 2024

Question 9: Why does using a 32×32 model to generate 64×64 or 128×128 images result in distortion and extra objects?
Answer: This is caused by the mismatch between the size the model was trained on and the size being generated. For images such as defect textures, where there is no distinct object (for example the NRSD or NEU datasets), generating a larger size directly usually does not show these problems. For images that contain a background and specific, distinctive features (for example Cifar10 or CelebA-HQ), use super-resolution or resizing to enlarge the output instead. If you really need large images and have enough GPU memory, train directly on large images.



chairc commented Jan 11, 2024

Question 10: Why do I get a RuntimeError: Address already in use error when starting training?
Answer: This usually happens with distributed training. To fix it, open a console and run htop or top, look for a process whose name starts with mp and that shows high usage, and terminate it with the kill command. Then run nvidia-smi to confirm that GPU memory usage has dropped back to 0.
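Before relaunching, it can also help to confirm that the port is really free again. The sketch below is an assumption-laden illustration using only the standard library: is_port_free is a hypothetical helper, and which port to check depends on your launch settings (PyTorch's distributed launcher commonly defaults to 29500, but the port this project uses is not stated here). The demo holds a port chosen by the operating system to show both outcomes.

```python
import socket


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # something else is still bound to this port
        return True


# Demo: occupy a port picked by the OS, the way a leftover process would.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))
holder.listen(1)
port = holder.getsockname()[1]

free_while_held = is_port_free(port)    # False: the "leftover" socket holds it
holder.close()
free_after_close = is_port_free(port)   # True once the holder is gone

print(free_while_held, free_after_close)
```

If the check still reports the port as busy after you have killed the leftover processes, another program is using it and a different port is needed.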


@chairc chairc added the documentation Improvements or additions to documentation label Mar 12, 2024

chairc commented May 6, 2024

Question 11: I encountered a ValueError: Imaginary component XXX error when calculating FID. How can I resolve it?
Answer: This error is caused by a scipy version that is too new. Downgrade scipy to version 1.11.1 to resolve it.
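To see whether your environment is affected, you can compare the installed scipy version with 1.11.1. The sketch below is a rough check under simple assumptions: the helpers parse_version and is_newer_than_pinned are hypothetical, and the parser only looks at the leading numeric parts of the version string.

```python
from importlib import metadata

PINNED_VERSION = "1.11.1"  # the version recommended in the answer above


def parse_version(text):
    """Turn '1.11.1' into (1, 11, 1); non-numeric parts such as '0rc1' are dropped."""
    return tuple(int(part) for part in text.split(".")[:3] if part.isdigit())


def is_newer_than_pinned(installed, pinned=PINNED_VERSION):
    """Return True if the installed version is newer than the pinned one."""
    return parse_version(installed) > parse_version(pinned)


try:
    installed_version = metadata.version("scipy")
except metadata.PackageNotFoundError:
    installed_version = None

if installed_version is None:
    print("scipy is not installed in this environment")
elif is_newer_than_pinned(installed_version):
    print(f"scipy {installed_version} is newer than {PINNED_VERSION}; "
          f"consider: pip install scipy=={PINNED_VERSION}")
else:
    print(f"scipy {installed_version} is not newer than {PINNED_VERSION}")
```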


chairc commented Oct 29, 2024

Question 12: During training, I encountered the FutureWarning: 'torch.cuda.amp.autocast(args...)' is deprecated. Please use 'torch.amp.autocast('cuda', args...)' instead. How should I resolve this issue?
Answer: Either downgrade PyTorch to a version below 2.4.1, or change the line with autocast(enabled=amp): so that it reads with torch.amp.autocast("cuda", enabled=amp):.
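If the same code has to run on both older and newer PyTorch versions, the replacement can be wrapped in a small helper that prefers the new call and falls back to the old one. This is a sketch under assumptions, not the project's code: make_autocast is a hypothetical helper, and it takes the torch module as an argument so that the snippet still runs where PyTorch is not installed.

```python
def make_autocast(torch_module, enabled):
    """Return an autocast context manager, preferring torch.amp.autocast("cuda", ...)."""
    amp_module = getattr(torch_module, "amp", None)
    if amp_module is not None and hasattr(amp_module, "autocast"):
        # Newer API, the one recommended by the FutureWarning.
        return amp_module.autocast("cuda", enabled=enabled)
    # Older API, deprecated in recent PyTorch releases.
    return torch_module.cuda.amp.autocast(enabled=enabled)


try:
    import torch
except ImportError:
    torch = None  # PyTorch is optional for this illustration

if torch is not None:
    amp = torch.cuda.is_available()  # only enable mixed precision when a GPU exists
    with make_autocast(torch, enabled=amp):
        pass  # the forward pass and loss computation would go here
```

In the training code itself, the direct one-line replacement from the answer is enough; the helper only matters when several PyTorch versions must be supported.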

