Fix GPU UTs #3203

Merged: 13 commits merged into deepmodeling:devel from fix_gpu_ut on Jan 31, 2024

Conversation

iProzd (Collaborator) commented Jan 30, 2024

This PR fixes the GPU UTs.
It deletes PREPROCESS_DEVICE in the torch data preprocessing and uses the training DEVICE instead; this workaround will be removed after the dataset is reformatted.
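
A minimal sketch of the kind of change described, assuming the training device is exposed as `DEVICE` in `deepmd.pt.utils.env`; the function below is hypothetical and only illustrates preprocessing on the training device rather than a separate PREPROCESS_DEVICE, not the PR's actual diff.

```python
# Illustrative only: preprocess tensors directly on the training DEVICE instead
# of a separate PREPROCESS_DEVICE. The import path is an assumption of this sketch.
import torch

from deepmd.pt.utils.env import DEVICE  # training device, e.g. "cuda:0" or "cpu"


def preprocess_frame(coord, box):
    """Move raw per-frame arrays onto the training device for preprocessing."""
    coord_t = torch.as_tensor(coord, dtype=torch.float64, device=DEVICE)
    box_t = torch.as_tensor(box, dtype=torch.float64, device=DEVICE)
    return coord_t, box_t
```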

codecov bot commented Jan 30, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison: base (b800043) 74.22% vs. head (1c37f44) 74.32%.

Files                           Patch %   Lines
deepmd/pt/utils/dataset.py      70.00%    3 Missing ⚠️
deepmd/pt/utils/preprocess.py   80.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #3203      +/-   ##
==========================================
+ Coverage   74.22%   74.32%   +0.09%     
==========================================
  Files         313      344      +31     
  Lines       27343    31867    +4524     
  Branches      908     1592     +684     
==========================================
+ Hits        20296    23685    +3389     
- Misses       6510     7257     +747     
- Partials      537      925     +388     


iProzd (Collaborator, Author) commented Jan 30, 2024

This PR still has problems: when data preprocessing runs on the GPU, some UTs stop (e.g. test_LKF.py, test_saveload_dpa1.py, ...). I'm working on this.

@iProzd iProzd assigned njzjz and wanghan-iapcm and unassigned njzjz and wanghan-iapcm Jan 30, 2024
@iProzd iProzd closed this Jan 30, 2024
@iProzd iProzd reopened this Jan 30, 2024
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 30, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Jan 30, 2024
* throw errors when PyTorch CXX11 ABI is different from TensorFlow (deepmodeling#3201)

If the ABIs differ, the following error is thrown:
```
-- PyTorch CXX11 ABI: 0
CMake Error at CMakeLists.txt:162 (message):
  PyTorch CXX11 ABI mismatch TensorFlow: 0 != 1
```
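
For reference, the same comparison can be sketched from Python; `torch._C._GLIBCXX_USE_CXX11_ABI` and `tf.sysconfig.get_compile_flags()` are the usual places to read the two ABIs, but this snippet only illustrates the check, not the CMake logic added in deepmodeling#3201.

```python
# Illustration of the ABI comparison that the CMake check performs at configure time.
import tensorflow as tf
import torch

pt_abi = int(torch._C._GLIBCXX_USE_CXX11_ABI)
tf_abi = next(
    (
        int(flag.split("=", 1)[1])
        for flag in tf.sysconfig.get_compile_flags()
        if flag.startswith("-D_GLIBCXX_USE_CXX11_ABI")
    ),
    None,
)
if tf_abi is not None and pt_abi != tf_abi:
    raise RuntimeError(f"PyTorch CXX11 ABI mismatch TensorFlow: {pt_abi} != {tf_abi}")
```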

Signed-off-by: Jinzhe Zeng <[email protected]>

* allow disabling TensorFlow backend during Python installation (deepmodeling#3200)

Fix deepmodeling#3120.

One can disable building the TensorFlow backend during `pip install` by
setting `DP_ENABLE_TENSORFLOW=0`.
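
A usage sketch of that switch, equivalent to running `DP_ENABLE_TENSORFLOW=0 pip install .` from a shell; only the environment variable name comes from the commit message above, the rest is illustrative.

```python
# Illustrative only: source install with the TensorFlow backend disabled.
import os
import subprocess
import sys

env = dict(os.environ, DP_ENABLE_TENSORFLOW="0")
subprocess.run([sys.executable, "-m", "pip", "install", "."], check=True, env=env)
```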

---------

Signed-off-by: Jinzhe Zeng <[email protected]>

* breaking: pt: add dp model format and refactor pt impl for the fitting net. (deepmodeling#3199)

- add dp model format (backend independent definition) for the fitting
- refactor torch support, compatible with dp model format
- fix mlp issue: the idt should only be used when a skip connection is
available.
- add tools `to_numpy_array` and `to_torch_tensor` (see the sketch below).
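
A hedged sketch of what such conversion helpers typically look like; the real implementations live in the deepmd code base and may handle dtypes and devices differently. Only the two function names come from the commit message.

```python
# Illustrative minimal versions of the conversion helpers named above.
import torch


def to_numpy_array(x):
    """Detach a (possibly GPU) torch tensor and return it as a numpy array."""
    if x is None:
        return None
    return x.detach().cpu().numpy()


def to_torch_tensor(x, device="cpu"):
    """Wrap a numpy array as a torch tensor on the given device."""
    if x is None:
        return None
    return torch.as_tensor(x, device=device)
```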

---------

Co-authored-by: Han Wang <[email protected]>

* remove duplicated fitting output check. fix codeql (deepmodeling#3202)

Co-authored-by: Han Wang <[email protected]>

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Han Wang <[email protected]>
Co-authored-by: Han Wang <[email protected]>
This reverts commit cb4cc67.
(Code scanning alerts were dismissed on: deepmd/pt/model/task/ener.py, deepmd/model_format/fitting.py, source/tests/common/test_model_format_utils.py, source/tests/pt/test_utils.py, source/tests/pt/test_ener_fitting.py.)
@iProzd iProzd requested a review from njzjz January 30, 2024 12:04
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 30, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Jan 30, 2024

njzjz (Member) commented Jan 30, 2024

I got the following errors on my local machine:

========================= short test summary info =========================
FAILED source/tests/pt/test_calculator.py::TestCalculator::test_calculator - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_deeppot.py::TestDeepPot::test_dp_test - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_deeppot.py::TestDeepPot::test_uni - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_dp_test.py::TestDPTest::test_dp_test - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_jit.py::TestEnergyModelSeA::test_jit - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_jit.py::TestEnergyModelDPA1::test_jit - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_jit.py::TestEnergyModelDPA2::test_jit - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_training.py::TestEnergyModelSeA::test_dp_train - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_training.py::TestEnergyModelDPA1::test_dp_train - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_training.py::TestEnergyModelDPA2::test_dp_train - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
============ 10 failed, 70 passed, 20 skipped, 96 warnings in 109.94s (0:01:49) ============

njzjz (Member) commented Jan 30, 2024

> I got the following errors on my local machine:

I found that it's a PyTorch bug (pytorch/pytorch#110940), which has been fixed in v2.2.0 (released 4 hours ago).
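
A quick local check, assuming one just wants to confirm that the installed PyTorch already contains that fix; the 2.2.0 threshold comes from the comment above and the snippet itself is only illustrative.

```python
# Check whether the installed PyTorch is new enough to include the upstream fix
# for pytorch/pytorch#110940 (first released in v2.2.0).
import torch
from packaging.version import Version

installed = Version(torch.__version__.split("+")[0])
if installed < Version("2.2.0"):
    print(f"PyTorch {installed}: the optimizer device bug may still affect GPU UTs.")
else:
    print(f"PyTorch {installed}: the upstream fix is included.")
```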

njzjz (Member) left a comment


This PR has resolved the PT issues. I can run `pytest source/tests/pt` with no problem after upgrading PyTorch to 2.2.

Another issue is that when we run the TF and PT tests together (i.e., `pytest source/tests`), an OOM error is thrown. The reason might be that `set_memory_growth` is not set for some sessions. It can be resolved in another PR.
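
A sketch of the kind of fix hinted at, using TensorFlow's standard `tf.config.experimental.set_memory_growth` API; where exactly this should be applied in deepmd's test setup is an assumption, since the comment notes it will be handled in another PR.

```python
# Ask TensorFlow to grow GPU memory on demand instead of pre-allocating it all,
# so that TF and PT test suites can share a single GPU.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```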

@wanghan-iapcm wanghan-iapcm merged commit 7f069cc into deepmodeling:devel Jan 31, 2024
46 of 47 checks passed
@njzjz njzjz mentioned this pull request Apr 2, 2024
@iProzd iProzd deleted the fix_gpu_ut branch April 24, 2024 09:12