[BUG] The wdl_8gpu.py script execution has halted and training cannot proceed. #453

Open
redzhang1990 opened this issue Jun 26, 2024 · 1 comment

Comments

@redzhang1990

Describe the bug
The wdl_8gpu.py script execution has halted and training cannot proceed.

To Reproduce
Steps to reproduce the behavior:

  1. bash preprocess.sh 0 ./criteo_data nvt 0 1 1
  2. Enter the HugeCTR Docker container and execute "HUGECTR_LOG_LEVEL=3 python samples/wdl/wdl_8gpu.py"

Expected behavior
Training should proceed. Instead, it halts after these log lines are printed:
"[HCTR][09:37:51.908][INFO][RK0][main]: Training source file: ./criteo_data/train/_file_list.txt
[HCTR][09:37:51.908][INFO][RK0][main]: Evaluation source file: ./criteo_data/val/_file_list.txt"

Screenshots
The log output at the point where training halts looks like this:
image

Environment (please complete the following information):

  • Graphics card: NVIDIA A10
  • CUDA version: 12.0.146
  • Docker image: nvcr.io/nvidia/merlin/merlin-hugectr:24.06

Additional context
After adding "i64_input_key=True," to the solver in wdl_8gpu.py, the issue is fixed.
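
For readers hitting the same hang, the workaround can be sketched roughly as follows. This is a minimal illustration, not the exact contents of wdl_8gpu.py: the helper names build_solver_kwargs and create_solver are hypothetical, and the batch sizes, learning rate, and GPU list are assumed values that may differ from the sample script. The only change that matters here is i64_input_key=True.

```python
def build_solver_kwargs(i64_keys: bool = True) -> dict:
    """Keyword arguments for hugectr.CreateSolver.

    All numeric values are illustrative assumptions, not copied from
    the sample script. The relevant setting is i64_input_key.
    """
    return {
        "max_eval_batches": 4000,
        "batchsize_eval": 2720,
        "batchsize": 2720,
        "lr": 0.001,
        "vvgpu": [[0, 1, 2, 3, 4, 5, 6, 7]],  # one node, eight GPUs
        "repeat_dataset": True,
        # Parquet data produced by the NVTabular preprocessing uses
        # 64-bit keys, so the solver must be told to read them as I64.
        "i64_input_key": i64_keys,
    }


def create_solver(i64_keys: bool = True):
    """Create the solver; hugectr is imported lazily because it is
    normally only available inside the HugeCTR container."""
    import hugectr

    return hugectr.CreateSolver(**build_solver_kwargs(i64_keys))
```

Inside the container, calling create_solver() in place of the original hugectr.CreateSolver(...) call should apply the same fix as editing the script by hand.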

@JacoCheung
Collaborator

Hi @redzhang1990, I think the behavior is as expected.

solver::i64_input_key defaults to False. However, data preprocessed via preprocess.sh is written with int64_t keys, which requires i64_input_key=True in the Solver.

See the note:

i64_input_key: For the Parquet format dataset generated by NVTabular, only I64 is allowed.

Thanks!
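
One way to confirm the key width before launching training is to inspect the Parquet schema. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the categorical column names (C1 to C26) are assumed to match what the preprocessing script writes, and pyarrow is assumed to be installed in the environment.

```python
def needs_i64_input_key(column_dtypes, key_columns):
    """Return True if any of the given key columns is stored as int64.

    column_dtypes maps column name -> dtype string (e.g. "int64").
    Key columns missing from the mapping are ignored.
    """
    return any(
        column_dtypes[name] == "int64"
        for name in key_columns
        if name in column_dtypes
    )


def parquet_column_dtypes(path):
    """Read only the schema of a Parquet file and return name -> dtype string.

    pyarrow is imported lazily so the check above works without it.
    """
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)
    return {name: str(dtype) for name, dtype in zip(schema.names, schema.types)}


# Assumed categorical column names for the Criteo dataset.
CATEGORICAL_COLUMNS = ["C" + str(i) for i in range(1, 27)]
```

Passing the result of parquet_column_dtypes for one of the Parquet files under ./criteo_data/train to needs_i64_input_key, together with CATEGORICAL_COLUMNS, indicates whether the solver needs i64_input_key=True.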
