
Windows 10 Pro, CUDA 11.0 +PyTorch 1.7.1+ issue #866

Closed
maaquib opened this issue Dec 11, 2020 · 11 comments
Labels: triaged_wait (Waiting for the Reporter's response)

Comments

@maaquib (Collaborator) commented Dec 11, 2020

On behalf of @jeffxtang

On my Windows 10 Pro machine I had CUDA 11.0 and PyTorch 1.7.1+ (it was 1.7.0, but was upgraded to 1.7.1 when I ran python .\ts_scripts\install_dependencies.py --environment=dev with == changed to >= in requirements\torch.txt).
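For context, the requirements\torch.txt edit described above would look something like the following diff (illustrative only; the exact pinned versions and other entries in the file may differ):

```diff
-torch==1.7.0
+torch>=1.7.0
```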

(base) PS C:\Users\Warrior\github> & 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 452.57       Driver Version: 452.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------|
|   0  Quadro GP100       WDDM  | 00000000:01:00.0 Off |                  Off |
| 26%   31C    P0    25W / 235W |     89MiB / 16384MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

pip list | grep torch
torch    1.7.1+cu110

# inside ipython
In [6]: torch.cuda.get_device_name(0)
Out[6]: 'Quadro GP100'
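For anyone reproducing this setup, a quick way to confirm which CUDA build PyTorch is using (a minimal sketch; assumes PyTorch is installed):

```python
import torch

# Report the installed PyTorch build and, if a GPU is visible,
# the CUDA toolkit version the wheel was compiled against.
print(torch.__version__)                  # e.g. '1.7.1+cu110'
if torch.cuda.is_available():
    print(torch.version.cuda)             # e.g. '11.0'
    print(torch.cuda.get_device_name(0))  # e.g. 'Quadro GP100'
else:
    print("CUDA not available to this PyTorch build")
```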

The first time I ran python .\torchserve_sanity.py, 151 tests completed, 20 failed - result win_gpu_torch171_sanity_test.txt

The TorchServe on Windows troubleshooting guide says "you may have to change the port number for the inference, management and metrics APIs as specified in frontend/server/src/test/resources/config.properties, all files in frontend/server/src/test/resources/snapshot/*, and frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java", but it's unclear exactly how to make those changes. So I discarded the requirements/torch.txt changes, installed torch 1.6.0 etc., and reran the sanity test. The result is in win_gpu_torch160_sanity_test.txt, with the same number of failed tests, which may be caused by the same error (OSError: [WinError 10013] An attempt was made to access a socket) reported in issue #828.

Originally posted by @jeffxtang in #851 (comment)

@lokeshgupta1975 (Collaborator):

@jeffxtang, can you now verify and close this issue?

@jeffxtang (Contributor):

> @jeffxtang, can you now verify and close this issue?

Do you mean to test again and verify that the problem is gone, or just verify what @maaquib (thanks!) posted for me and close it? I may not have time to test this again today.

@jeffxtang (Contributor):

Did another sanity test with commit 2541292 and was stuck at 82% EXECUTING for about an hour. The full log is attached. Is this still caused by "OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions"?

sanity_log_win_gpu_pt171.txt

@jeffxtang (Contributor):

About the error "OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions": it was mentioned at #828 (comment), and the Windows Native Troubleshooting guide may help - "If you are building from source then you may have to change the port number for inference, management and metrics apis as specified in frontend/server/src/test/resources/config.properties"

curl http://127.0.0.1:8080/predictions/densenet161 -T kitten_small.jpg works for me. But how should the lines below in config.properties be changed? I tried changing inference_address to https://127.0.0.1:8080 or 9000 but still got errors.

inference_address=https://127.0.0.1:8443
management_address=https://127.0.0.1:8444
metrics_address=https://127.0.0.1:8445

@maaquib maaquib added this to the v0.3.0 milestone Dec 16, 2020
@pytorch pytorch deleted a comment from dhaniram-kshirsagar Dec 16, 2020
@harshbafna (Contributor):

> About the error "OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions", it was mentioned at #828 (comment) and the Windows Native Troubleshooting may help - "If you are building from source then you may have to change the port number for inference, management and metrics apis as specified in frontend/server/src/test/resources/config.properties"
>
> curl http://127.0.0.1:8080/predictions/densenet161 -T kitten_small.jpg works for me. But how should the lines below in config.properties be changed? I tried changing inference_address to be https://127.0.0.1:8080 or 9000 but still got errors.
>
> inference_address=https://127.0.0.1:8443
> management_address=https://127.0.0.1:8444
> metrics_address=https://127.0.0.1:8445

@jeffxtang: For running on HTTPS, you will need to generate and provide the private_key_file and certificate_file parameters in your config.properties file.
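For reference, a minimal config.properties sketch with the SSL parameters mentioned above might look like this (the key and certificate file names are placeholders; they must point at files you generate yourself, e.g. with openssl):

```
inference_address=https://127.0.0.1:8443
management_address=https://127.0.0.1:8444
metrics_address=https://127.0.0.1:8445
# Placeholder paths - substitute your own generated key/certificate
private_key_file=mykey.key
certificate_file=mycert.pem
```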

@harshbafna (Contributor):

> Did another sanity test with the commit 2541292 and was stuck at 82% EXECUTING for about an hour. Full log is attached. Is this still caused by "OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions"?
>
> sanity_log_win_gpu_pt171.txt

@jeffxtang:

From the shared logs, I am still observing the same error at different places and a bunch of test cases have failed.

    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "C:\Users\Warrior\repos\serve\ts\model_service_worker.py", line 182, in <module>
    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "C:\Users\Warrior\repos\serve\ts\model_service_worker.py", line 141, in run_server
    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.sock.bind((self.sock_name, int(s...
    2020-12-15 12:16:56,694 [INFO ] W-9012-respheader_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions

The TorchServe sanity suite executes a bunch of integration test cases which use different ports to load the model workers. In your case it seems your user doesn't have access to some ports, or the required ports are already in use by some other process, hence the errors. We haven't observed any such error on our build/test machines/environments, which have full permissions for accessing these resources.
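As a quick way to check whether this is a port-permission or port-in-use problem, one could probe the candidate ports directly. This is a hypothetical sketch, not part of TorchServe; the 9000+ range is an assumption based on the W-9012 worker name in the log above.

```python
import socket

def port_bindable(host: str, port: int) -> bool:
    """Try to bind a TCP socket to (host, port).

    Both WinError 10013 (permission denied, e.g. Windows excluded
    port ranges) and 'address already in use' surface as OSError.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((host, port))
        return True
    except OSError:
        return False

if __name__ == "__main__":
    # Probe a range of likely backend-worker ports (assumed range).
    for port in range(9000, 9016):
        status = "free" if port_bindable("127.0.0.1", port) else "blocked/in use"
        print(f"{port}: {status}")
```

On Windows, `netsh interface ipv4 show excludedportrange protocol=tcp` can also show reserved port ranges that trigger WinError 10013.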

From the other threads, I understand that you were able to successfully install from source and run basic model serving through the REST as well as gRPC APIs, and it's the frontend build that is failing for you?

@jeffxtang (Contributor):

@harshbafna yes, I'm able to run basic model serving with the REST and gRPC APIs but haven't made the sanity test work. I'll try to make HTTPS work to see if it'll fix the problem.

@jeffxtang (Contributor):

I followed the two options in the configuration docs' Enable SSL examples and retried the sanity test. For option 1, I was still stuck at 82% with the same errors; for option 2, the test finished quickly with these messages:

153 tests completed, 1 failed, 152 skipped
> Task :server:test FAILED
FAILURE: Build failed with an exception.

@maaquib (Collaborator, Author) commented Dec 16, 2020

@jeffxtang What was the failed test? Can you provide the logs?

@jeffxtang (Contributor):

> @jeffxtang What was the failed test? Can you provide the logs?

sanity_log_win_failed.txt

@harshbafna (Contributor):

@jeffxtang

Could you also zip and attach the reports generated at the following path:

file:///C:/Users/Warrior/repos/serve/frontend/server/build/reports/

The shared logs show it ran into a FileNotFound exception while trying to start the test suite.

> I followed the configuration Enable SSL's Examples' two options and retried the sanity test but for option 1 I was still stuck at 82% with the same errors and for option 2, the test finished quickly with the messages

You should not make any changes in the config.properties file when running the test suites; it will mess up the expected outputs. The reason for pointing you to that doc was that you were trying to use https in the config.properties.

The sanity suite, regression suite, or any test case should be executed without any changes in the config, unless you have made corresponding changes in the code.

To run the sanity suite, all you need to do is run ts_scripts/install_dependencies.py followed by torchserve_sanity.py.
Please also ensure you have gone through all the prerequisites specified in the TorchServe on Windows native documentation.

@maaquib maaquib modified the milestones: v0.3.0, v0.4.0 Dec 18, 2020
@harshbafna harshbafna added the triaged_wait Waiting for the Reporter's resp label Dec 21, 2020