Fix GPU UTs #3203

Merged: 13 commits merged into deepmodeling:devel from fix_gpu_ut on Jan 31, 2024

Conversation

iProzd (Collaborator) commented Jan 30, 2024

This PR fixes the GPU UTs.
It deletes PREPROCESS_DEVICE in the torch data preprocessing and uses the training DEVICE instead; this workaround will be removed after the dataset is reformatted.
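
A minimal sketch of the kind of change described, assuming the training device is exposed as `DEVICE` in `deepmd.pt.utils.env`; the function below is hypothetical and only illustrates preprocessing on the training device rather than a separate PREPROCESS_DEVICE, not the PR's actual diff.

```python
# Illustrative only: preprocess tensors directly on the training DEVICE instead
# of a separate PREPROCESS_DEVICE. The import path is an assumption of this sketch.
import torch

from deepmd.pt.utils.env import DEVICE  # training device, e.g. "cuda:0" or "cpu"


def preprocess_frame(coord, box):
    """Move raw per-frame arrays onto the training device for preprocessing."""
    coord_t = torch.as_tensor(coord, dtype=torch.float64, device=DEVICE)
    box_t = torch.as_tensor(box, dtype=torch.float64, device=DEVICE)
    return coord_t, box_t
```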

codecov bot commented Jan 30, 2024

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison: base (b800043) 74.22% vs. head (1c37f44) 74.32%.

Files                           Patch %   Lines
deepmd/pt/utils/dataset.py      70.00%    3 Missing ⚠️
deepmd/pt/utils/preprocess.py   80.00%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #3203      +/-   ##
==========================================
+ Coverage   74.22%   74.32%   +0.09%     
==========================================
  Files         313      344      +31     
  Lines       27343    31867    +4524     
  Branches      908     1592     +684     
==========================================
+ Hits        20296    23685    +3389     
- Misses       6510     7257     +747     
- Partials      537      925     +388     


iProzd (Collaborator, Author) commented Jan 30, 2024

This PR still has problems: when data preprocessing runs on the GPU, some UTs stop (e.g. test_LKF.py, test_saveload_dpa1.py, ...). I'm working on this.

@iProzd iProzd assigned njzjz and wanghan-iapcm and unassigned njzjz and wanghan-iapcm Jan 30, 2024
@iProzd iProzd closed this Jan 30, 2024
@iProzd iProzd reopened this Jan 30, 2024
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 30, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Jan 30, 2024
* throw errors when PyTorch CXX11 ABI is different from TensorFlow (deepmodeling#3201)

If the ABIs differ, the following error is thrown:
```
-- PyTorch CXX11 ABI: 0
CMake Error at CMakeLists.txt:162 (message):
  PyTorch CXX11 ABI mismatch TensorFlow: 0 != 1
```
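
For reference, the same comparison can be sketched from Python; `torch._C._GLIBCXX_USE_CXX11_ABI` and `tf.sysconfig.get_compile_flags()` are the usual places to read the two ABIs, but this snippet only illustrates the check, not the CMake logic added in deepmodeling#3201.

```python
# Illustration of the ABI comparison that the CMake check performs at configure time.
import tensorflow as tf
import torch

pt_abi = int(torch._C._GLIBCXX_USE_CXX11_ABI)
tf_abi = next(
    (
        int(flag.split("=", 1)[1])
        for flag in tf.sysconfig.get_compile_flags()
        if flag.startswith("-D_GLIBCXX_USE_CXX11_ABI")
    ),
    None,
)
if tf_abi is not None and pt_abi != tf_abi:
    raise RuntimeError(f"PyTorch CXX11 ABI mismatch TensorFlow: {pt_abi} != {tf_abi}")
```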

Signed-off-by: Jinzhe Zeng <[email protected]>

* allow disabling TensorFlow backend during Python installation (deepmodeling#3200)

Fix deepmodeling#3120.

One can disable building the TensorFlow backend during `pip install` by
setting `DP_ENABLE_TENSORFLOW=0`.
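
A usage sketch of that switch, equivalent to running `DP_ENABLE_TENSORFLOW=0 pip install .` from a shell; only the environment variable name comes from the commit message above, the rest is illustrative.

```python
# Illustrative only: source install with the TensorFlow backend disabled.
import os
import subprocess
import sys

env = dict(os.environ, DP_ENABLE_TENSORFLOW="0")
subprocess.run([sys.executable, "-m", "pip", "install", "."], check=True, env=env)
```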

---------

Signed-off-by: Jinzhe Zeng <[email protected]>

* breaking: pt: add dp model format and refactor pt impl for the fitting net. (deepmodeling#3199)

- add dp model format (backend independent definition) for the fitting
- refactor torch support, compatible with dp model format
- fix mlp issue: the idt should only be used when a skip connection is
available.
- add tools `to_numpy_array` and `to_torch_tensor` (see the sketch below).
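
A hedged sketch of what such conversion helpers typically look like; the real implementations live in the deepmd code base and may handle dtypes and devices differently. Only the two function names come from the commit message.

```python
# Illustrative minimal versions of the conversion helpers named above.
import torch


def to_numpy_array(x):
    """Detach a (possibly GPU) torch tensor and return it as a numpy array."""
    if x is None:
        return None
    return x.detach().cpu().numpy()


def to_torch_tensor(x, device="cpu"):
    """Wrap a numpy array as a torch tensor on the given device."""
    if x is None:
        return None
    return torch.as_tensor(x, device=device)
```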

---------

Co-authored-by: Han Wang <[email protected]>

* remove duplicated fitting output check. fix codeql (deepmodeling#3202)

Co-authored-by: Han Wang <[email protected]>

---------

Signed-off-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Jinzhe Zeng <[email protected]>
Co-authored-by: Han Wang <[email protected]>
Co-authored-by: Han Wang <[email protected]>
This reverts commit cb4cc67.
(Code scanning alerts were dismissed on: deepmd/pt/model/task/ener.py, deepmd/model_format/fitting.py, source/tests/common/test_model_format_utils.py, source/tests/pt/test_utils.py, source/tests/pt/test_ener_fitting.py.)
@iProzd iProzd requested a review from njzjz January 30, 2024 12:04
@njzjz njzjz added the Test CUDA Trigger test CUDA workflow label Jan 30, 2024
@github-actions github-actions bot removed the Test CUDA Trigger test CUDA workflow label Jan 30, 2024

njzjz (Member) commented Jan 30, 2024

I got the following errors on my local machine:

========================= short test summary info =========================
FAILED source/tests/pt/test_calculator.py::TestCalculator::test_calculator - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_deeppot.py::TestDeepPot::test_dp_test - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_deeppot.py::TestDeepPot::test_uni - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_dp_test.py::TestDPTest::test_dp_test - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_jit.py::TestEnergyModelSeA::test_jit - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_jit.py::TestEnergyModelDPA1::test_jit - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_jit.py::TestEnergyModelDPA2::test_jit - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_training.py::TestEnergyModelSeA::test_dp_train - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_training.py::TestEnergyModelDPA1::test_dp_train - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
FAILED source/tests/pt/test_training.py::TestEnergyModelDPA2::test_dp_train - RuntimeError: Tensors of the same index must be on the same device and the same dtype except `step` tensors that can be CPU and float32 notwithstanding
============ 10 failed, 70 passed, 20 skipped, 96 warnings in 109.94s (0:01:49) ============

njzjz (Member) commented Jan 30, 2024

> I got the following errors on my local machine:

I found that it's a PyTorch bug (pytorch/pytorch#110940), which has been fixed in v2.2.0 (released 4 hours ago).
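
A quick local check, assuming one just wants to confirm that the installed PyTorch already contains that fix; the 2.2.0 threshold comes from the comment above and the snippet itself is only illustrative.

```python
# Check whether the installed PyTorch is new enough to include the upstream fix
# for pytorch/pytorch#110940 (first released in v2.2.0).
import torch
from packaging.version import Version

installed = Version(torch.__version__.split("+")[0])
if installed < Version("2.2.0"):
    print(f"PyTorch {installed}: the optimizer device bug may still affect GPU UTs.")
else:
    print(f"PyTorch {installed}: the upstream fix is included.")
```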

njzjz (Member) left a comment


This PR has resolved the PT issues. I can run `pytest source/tests/pt` with no problem after upgrading PyTorch to 2.2.

Another issue is that when we run the TF and PT tests together (i.e., `pytest source/tests`), an OOM error is thrown. The reason might be that `set_memory_growth` is not set for some sessions. It can be resolved in another PR.
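
A sketch of the kind of fix hinted at, using TensorFlow's standard `tf.config.experimental.set_memory_growth` API; where exactly this should be applied in deepmd's test setup is an assumption, since the comment notes it will be handled in another PR.

```python
# Ask TensorFlow to grow GPU memory on demand instead of pre-allocating it all,
# so that TF and PT test suites can share a single GPU.
import tensorflow as tf

for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```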

@wanghan-iapcm wanghan-iapcm merged commit 7f069cc into deepmodeling:devel Jan 31, 2024
46 of 47 checks passed
@njzjz njzjz mentioned this pull request Apr 2, 2024
@iProzd iProzd deleted the fix_gpu_ut branch April 24, 2024 09:12