
[BUG] finetuning property fitting with multi-dimensional data causes error #4108

Closed
theAfish opened this issue Sep 6, 2024 · 6 comments · Fixed by #4145
theAfish commented Sep 6, 2024

Bug summary

I have tested the new property fitting model in a fine-tuning procedure with the pre-trained OpenLAM_2.2.0_27heads_beta3.pt.
The dataset I used is the one in the examples folder; its property has a dimension of 3. Training raised an error about a tensor size mismatch; see the Error Log below.

DeePMD-kit Version

DeePMD-kit v3.0.0a1.dev320+g46632f90

Backend and its version

torch 2.4.1+cu121

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Commands: dp --pt train input_finetune.json --finetune OpenLAM_2.2.0_27heads_beta3.pt

Input File:
input_finetune.json

The data files I used are in examples/property/data
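The attached JSON is linked rather than inlined. As a rough, hedged sketch only (the values here are illustrative and follow the layout of the examples/property input; the attached file is authoritative), a 3-dimensional property fitting net is configured along these lines:

"fitting_net": {
    "type": "property",
    "task_dim": 3,
    "neuron": [240, 240, 240],
    "resnet_dt": true,
    "seed": 1
}

Here task_dim sets the output dimension of the property head, which is what determines the trailing dimension of out_bias in the error below.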

Error Log:

To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2024-09-06 17:33:46,629] DEEPMD INFO    DeePMD version: 3.0.0a1.dev320+g46632f90
[2024-09-06 17:33:46,629] DEEPMD INFO    Configuration path: input_finetune.json
[2024-09-06 17:33:46,672] DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[2024-09-06 17:33:46,672] DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[2024-09-06 17:33:46,672] DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[2024-09-06 17:33:46,672] DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[2024-09-06 17:33:46,672] DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[2024-09-06 17:33:46,672] DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[2024-09-06 17:33:46,672] DEEPMD INFO    Please read and cite:
[2024-09-06 17:33:46,672] DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2024-09-06 17:33:46,672] DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2024-09-06 17:33:46,672] DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
[2024-09-06 17:33:46,672] DEEPMD INFO    -------------------------------------------------------------------------------
[2024-09-06 17:33:46,672] DEEPMD INFO    installed to:          /home/notfish/dev/dc-dev/deepmd
[2024-09-06 17:33:46,672] DEEPMD INFO                           /home/notfish/dev/dp/lib/python3.10/site-packages/deepmd
[2024-09-06 17:33:46,672] DEEPMD INFO    source:                v3.0.0a0-320-g46632f90
[2024-09-06 17:33:46,672] DEEPMD INFO    source brach:          devel
[2024-09-06 17:33:46,672] DEEPMD INFO    source commit:         46632f90
[2024-09-06 17:33:46,672] DEEPMD INFO    source commit at:      2024-09-04 00:33:34 +0000
[2024-09-06 17:33:46,672] DEEPMD INFO    use float prec:        double
[2024-09-06 17:33:46,672] DEEPMD INFO    build variant:         cpu
[2024-09-06 17:33:46,672] DEEPMD INFO    Backend:               PyTorch
[2024-09-06 17:33:46,672] DEEPMD INFO    PT ver:                v2.4.1+cu121-g38b96d3399a
[2024-09-06 17:33:46,672] DEEPMD INFO    Enable custom OP:      False
[2024-09-06 17:33:46,672] DEEPMD INFO    running on:            theNotfish
[2024-09-06 17:33:46,672] DEEPMD INFO    computing device:      cuda:0
[2024-09-06 17:33:46,672] DEEPMD INFO    CUDA_VISIBLE_DEVICES:  unset
[2024-09-06 17:33:46,672] DEEPMD INFO    Count of visible GPUs: 1
[2024-09-06 17:33:46,672] DEEPMD INFO    num_intra_threads:     0
[2024-09-06 17:33:46,672] DEEPMD INFO    num_inter_threads:     0
[2024-09-06 17:33:46,672] DEEPMD INFO    -------------------------------------------------------------------------------
/home/notfish/dev/dc-dev/deepmd/pt/utils/finetune.py:139: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(finetune_model, map_location=env.DEVICE)
[2024-09-06 17:33:47,902] DEEPMD WARNING The fitting net will be re-init instead of using that in the pretrained model! The bias_adjust_mode will be set-by-statistic!
[2024-09-06 17:33:47,927] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2024-09-06 17:33:47,959] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2024-09-06 17:33:47,987] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2024-09-06 17:33:48,113] DEEPMD INFO    training data with min nbor dist: 0.9608642172055677
[2024-09-06 17:33:48,113] DEEPMD INFO    training data with max nbor size: [21]
[2024-09-06 17:33:48,117] DEEPMD INFO    If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2024-09-06 17:33:48,119] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2024-09-06 17:33:48,273] DEEPMD INFO    training data with min nbor dist: 0.9608642172055677
[2024-09-06 17:33:48,274] DEEPMD INFO    training data with max nbor size: [21]
[2024-09-06 17:33:48,547] DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
[2024-09-06 17:33:48,548] DEEPMD INFO    found 2 system(s):
[2024-09-06 17:33:48,548] DEEPMD INFO                                        system  natoms  bch_sz   n_bch       prob  pbc
[2024-09-06 17:33:48,548] DEEPMD INFO                                ../data/data_0      20       1      80  5.000e-01    F
[2024-09-06 17:33:48,548] DEEPMD INFO                                ../data/data_1      22       1      80  5.000e-01    F
[2024-09-06 17:33:48,548] DEEPMD INFO    --------------------------------------------------------------------------------------
[2024-09-06 17:33:48,551] DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
[2024-09-06 17:33:48,551] DEEPMD INFO    found 1 system(s):
[2024-09-06 17:33:48,551] DEEPMD INFO                                        system  natoms  bch_sz   n_bch       prob  pbc
[2024-09-06 17:33:48,551] DEEPMD INFO                                ../data/data_2      24       1      80  1.000e+00    F
[2024-09-06 17:33:48,551] DEEPMD INFO    --------------------------------------------------------------------------------------
[2024-09-06 17:33:48,552] DEEPMD INFO    Resuming from OpenLAM_2.2.0_27heads_beta3.pt.
/home/notfish/dev/dc-dev/deepmd/pt/train/training.py:404: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(resume_model, map_location=DEVICE)
Traceback (most recent call last):
  File "/home/notfish/dev/dp/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/notfish/dev/dc-dev/deepmd/main.py", line 923, in main
    deepmd_main(args)
  File "/home/notfish/dev/dp/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/notfish/dev/dc-dev/deepmd/pt/entrypoints/main.py", line 563, in main
    train(FLAGS)
  File "/home/notfish/dev/dc-dev/deepmd/pt/entrypoints/main.py", line 327, in train
    trainer = get_trainer(
  File "/home/notfish/dev/dc-dev/deepmd/pt/entrypoints/main.py", line 190, in get_trainer
    trainer = training.Trainer(
  File "/home/notfish/dev/dc-dev/deepmd/pt/train/training.py", line 516, in __init__
    self.wrapper.load_state_dict(state_dict)
  File "/home/notfish/dev/dp/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ModelWrapper:
        size mismatch for model.Default.atomic_model.out_bias: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 3]).
        size mismatch for model.Default.atomic_model.out_std: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 3]).
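
For context, the failure is PyTorch's strict load_state_dict shape check, not anything specific to the property data itself. A minimal sketch, independent of DeePMD-kit (the module and buffer names are illustrative), that reproduces the same class of error:

import torch

class Tiny(torch.nn.Module):
    # Mimics an atomic model holding a per-element output bias of shape
    # (1, ntypes, odim): odim = 1 for an energy head, 3 for this property.
    def __init__(self, odim):
        super().__init__()
        self.register_buffer("out_bias", torch.zeros(1, 118, odim))

pretrained = Tiny(odim=1)  # checkpoint side: torch.Size([1, 118, 1])
finetuned = Tiny(odim=3)   # current model:   torch.Size([1, 118, 3])

# Strict loading compares shapes key by key and raises
# "size mismatch for out_bias: copying a param with shape ...".
finetuned.load_state_dict(pretrained.state_dict())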

Steps to Reproduce

Run the command above with the attached input file and the datasets in examples/property/data.

Further Information, Files, and Links

No response

theAfish added the bug label Sep 6, 2024
wanghan-iapcm (Collaborator) commented:

The model OpenLAM_2.2.0_27heads_beta3.pt is for the v3.0.0 beta release, but you are using the v3.0.0 alpha. Please upgrade to the beta release.

theAfish (Author) commented Sep 9, 2024

I am using the newest version of the devel branch, which contains the new PropertyFitting functions. In my case, OpenLAM_2.2.0_27heads_beta3.pt seems to work properly for fine-tuning with 1D property data. Since v3.0.0b3 does not contain this specific feature, I'm not sure whether this issue in fitting multi-dimensional property data is caused by the model's version or by the code.

njzjz (Member) commented Sep 9, 2024

Which commit are you using? 46632f9 does not contain PropertyFitting.

theAfish (Author) commented:

It should be #3867.

njzjz (Member) commented Sep 13, 2024

It looks like a bug in fine-tuning, not related to property fitting itself: out_bias should not be loaded. @iProzd could you take a look at the finetune code?

Chengqian-Zhang (Collaborator) commented Sep 14, 2024

This bug appears when the fine-tuning task's label is multi-dimensional. DOS fitting, property fitting, polar fitting, and dipole fitting all report this bug when fine-tuning from a multitask pretrained model.

Chengqian-Zhang self-assigned this Sep 14, 2024
github-merge-queue bot pushed a commit that referenced this issue Sep 25, 2024
…finetuning property fitting with multi-dimensional data causes error (#4145)

Fix issue #4108 

If a pretrained model is labeled with energy, its `out_bias` is one-dimensional. If we then fine-tune a dos/polar/dipole/property model from this pretrained model, the `out_bias` of the fine-tuning model is multi-dimensional (for example, numb_dos = 250), and an error occurs:

RuntimeError: Error(s) in loading state_dict for ModelWrapper:
        size mismatch for model.Default.atomic_model.out_bias: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 250]).
        size mismatch for model.Default.atomic_model.out_std: copying a param with shape torch.Size([1, 118, 1]) from checkpoint, the shape in current model is torch.Size([1, 118, 250]).

When using a new fitting net, the old `out_bias` is useless because the new bias is recomputed later from the fine-tuning data. So we do not need to load the old `out_bias` when fine-tuning with a new fitting net.
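
A minimal sketch of the idea (illustrative only, not the actual patch in #4145; the function name and the `new_fitting` flag are hypothetical):

def collect_finetune_params(pretrained_state: dict, new_fitting: bool) -> dict:
    """Filter a pretrained state dict before loading it for fine-tuning."""
    if not new_fitting:
        return dict(pretrained_state)
    # With a re-initialized fitting net, the pretrained out_bias/out_std carry
    # the old output dimension; they are recomputed from the fine-tuning data
    # statistics later, so skip loading them to avoid the strict shape check.
    return {
        key: value
        for key, value in pretrained_state.items()
        if not key.endswith(("out_bias", "out_std"))
    }

This matches the warning already printed in the log above ("The fitting net will be re-init instead of using that in the pretrained model! The bias_adjust_mode will be set-by-statistic!"): with a re-initialized fitting net, the bias is set from statistics, not from the checkpoint.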

## Summary by CodeRabbit

- **New Features**
  - Enhanced parameter collection for fine-tuning, refining criteria for parameter retention.
  - Introduced a model checkpoint file for saving and resuming training states, facilitating iterative development.

- **Tests**
  - Added a new test class to validate training and fine-tuning processes, ensuring model performance consistency across configurations.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
iProzd closed this as completed Sep 26, 2024
njzjz added this to the v3.0.0 milestone Sep 26, 2024