torchserve sanity is failing on windows native #828

Closed
dhaniram-kshirsagar opened this issue Dec 3, 2020 · 11 comments
Labels
bug Something isn't working

Comments

@dhaniram-kshirsagar
Contributor

dhaniram-kshirsagar commented Dec 3, 2020

The latest changes to the torchserve-sdk UT step in torchserve_sanity.py cause the sanity script to fail on native Windows. This was tested on Windows Server 2019 using CodeBuild.

Failure Logs [if any]

2020/12/03 13:53:19 Running command cd serving-sdk/ && mvn clean install -q && cd ../

At C:\codebuild\output\tmp\script.ps1:5 char:17
+ cd serving-sdk/ && mvn clean install -q && cd ../
+ ~~
The token '&&' is not a valid statement separator in this version.
At C:\codebuild\output\tmp\script.ps1:5 char:41
+ cd serving-sdk/ && mvn clean install -q && cd ../
+ ~~
The token '&&' is not a valid statement separator in this version.
+ CategoryInfo : ParserError: (:) [], ParentContainsErrorRecordException
+ FullyQualifiedErrorId : InvalidEndOfLine
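The log shows the PowerShell parser rejecting `&&` as a statement separator. One way a Python script can avoid shell-specific chaining altogether is to pass the working directory to `subprocess` instead of using `cd`. This is a minimal sketch of that idea, not the fix that was applied in the repository; `run_in_dir` is a hypothetical helper name, and the demo runs the Python interpreter rather than `mvn` so that it works anywhere.

```python
import subprocess
import sys
from pathlib import Path


def run_in_dir(args, directory):
    """Run a command with its working directory set to `directory`.

    Using cwd= replaces `cd dir && cmd && cd ..`, so no shell statement
    separator is involved and the caller's own directory never changes.
    Returns the command's exit code.
    """
    completed = subprocess.run(args, cwd=str(Path(directory)), check=False)
    return completed.returncode


if __name__ == "__main__":
    # Demo command: print the working directory seen by the child process.
    code = run_in_dir([sys.executable, "-c", "import os; print(os.getcwd())"], ".")
    print("exit code:", code)
```

For the Maven step this would look like `run_in_dir([mvn_path, "clean", "install", "-q"], "serving-sdk")`. On Windows, `mvn` is usually a `.cmd` wrapper, so resolving `mvn_path` with `shutil.which("mvn")` first is probably needed; that part is an assumption and untested here.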

@dhaniram-kshirsagar dhaniram-kshirsagar self-assigned this Dec 3, 2020
@chauhang chauhang added the bug Something isn't working label Dec 4, 2020
@harshbafna harshbafna linked a pull request Dec 9, 2020 that will close this issue
2 tasks
@harshbafna harshbafna added the duplicate This issue or pull request already exists label Dec 9, 2020
@harshbafna harshbafna removed a link to a pull request Dec 9, 2020
2 tasks
@harshbafna harshbafna removed the duplicate This issue or pull request already exists label Dec 9, 2020
@jeffxtang
Contributor

Previous full logs were uploaded at #835. This is the new full log running as Admin on Windows CPU with 2 failed tests.
sanity_log_win_admin.txt

@dhaniram-kshirsagar
Contributor Author

dhaniram-kshirsagar commented Dec 10, 2020

@jeffxtang Looking at the latest logs, it seems to be the same issue: OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions. Can you please refer to the troubleshooting section of the Windows native installation page and change the backend port range from 9000 to 9500 [or any other suitable range]?

Also, please let me know the configuration of the machine you are testing on.
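To pick a suitable range, it can help to check which local ports can actually be bound before editing any configuration. The sketch below is a generic socket probe, not part of TorchServe; `can_bind` and `blocked_ports` are hypothetical helper names, and the range 9000 to 9009 is only an illustration based on the port mentioned above.

```python
import socket


def can_bind(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Includes WinError 10013 (access forbidden) and "address in use".
            return False
    return True


def blocked_ports(start, count, host="127.0.0.1"):
    """Return the ports in [start, start + count) that cannot be bound."""
    return [p for p in range(start, start + count) if not can_bind(p, host)]


if __name__ == "__main__":
    # Illustrative range only; adjust to the range you intend to use.
    print("blocked:", blocked_ports(9000, 10))
```

If ports are reported as blocked while nothing is listening on them, one possible cause on Windows is a reserved port range, which `netsh interface ipv4 show excludedportrange protocol=tcp` lists. Whether that applies to this machine is not established in this thread.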

@jeffxtang
Contributor

It's weird... I thought I didn't see the permission error anymore after I ran as Admin.
I looked at the troubleshooting link, but in the config.properties file I don't see port 9000 or anything about the backend. I did change the 3 addresses to:

inference_address=https://127.0.0.1:9443
management_address=https://127.0.0.1:9444
metrics_address=https://127.0.0.1:9445

After that, running the sanity test produced the same OSError: [WinError 10013] error.
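The three settings above look like the frontend listener addresses rather than the backend port range mentioned earlier, which may be why the error did not change. As a rough aid, the sketch below extracts the ports from `*_address` entries so they can be compared with whatever range is failing. `ports_from_properties` is a hypothetical helper, and the parsing covers only simple `key=value` lines, not the full Java properties syntax.

```python
from urllib.parse import urlsplit


def ports_from_properties(text):
    """Map each `*_address` key in properties-style text to its port."""
    ports = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not key=value pairs.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.endswith("_address"):
            port = urlsplit(value).port
            if port is not None:
                ports[key] = port
    return ports


if __name__ == "__main__":
    sample = """
    inference_address=https://127.0.0.1:9443
    management_address=https://127.0.0.1:9444
    metrics_address=https://127.0.0.1:9445
    """
    print(ports_from_properties(sample))
```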

My machine runs Windows 10 Pro.

What exact changes should be made to the config.properties and "all files in frontend/server/src/test/resources/snapshot/* and frontend/server/src/main/java/org/pytorch/serve/util/ConfigManager.java" as mentioned in the troubleshooting section? After the changes, do I need to clear the cache before running the sanity test script (if yes, how)? Thanks...

@lokeshgupta1975
Collaborator

@jeffxtang Can you verify and close this issue?

@jeffxtang
Contributor

I pulled the latest from master about 5 hours ago, the latest commit being:

git log
commit f3a6d7658fd68729a26eddcefa9243e3b79b5d18 (HEAD -> master, origin/release/0.3.0_rc, origin/master, origin/HEAD)
Author: Hamid Shojanazeri <[email protected]>
Date:   Mon Dec 14 12:30:17 2020 -0800

    converting the offsets for explanations to current device (#900)

I then ran the two install scripts followed by the sanity test, which got stuck at 82%:

python .\ts_scripts\install_dependencies.py --environment=dev
python .\ts_scripts\install_from_src.py
python .\torchserve_sanity.py
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: master

torchserve==0.3.0
torch-model-archiver==0.2.1

Python version: 3.8 (64-bit runtime)
Python executable: C:\Users\Warrior\anaconda3\envs\py38\python.exe

Versions of relevant python libraries:
numpy==1.19.4
torch==1.7.1+cu110
torch-model-archiver==0.2.1b20201214
torchaudio==0.7.2
torchserve==0.3.0b20201214
torchtext==0.8.1
torchvision==0.8.2+cu110
torch==1.7.1+cu110
torchtext==0.8.1
torchvision==0.8.2+cu110
torchaudio==0.7.2

Java Version:
openjdk 11.0.2 2019-01-15
OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)

OS: Microsoft Windows 10 Pro
GCC version: (GCC) 6.3.0
Clang version: N/A
CMake version: N/A

Is CUDA available: Yes
CUDA runtime version: 8.0.44
GPU models and configuration:
GPU 0: Quadro GP100
Nvidia driver version: 452.57
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn_ops_train64_8.dll

## Started frontend build and tests
## In directory: C:\Users\Warrior\repos\serve | Executing command: frontend\gradlew -p frontend clean build

> Task :server:killServer
No server running!

> Task :modelarchive:test

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.CoverageTest > test PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > archiveTest STANDARD_ERROR
    log4j:WARN No appenders could be found for logger (org.pytorch.serve.archive.ModelArchive).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > archiveTest PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > test PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testAllowedMultiUrls PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testAllowedURL PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testBlockedUrl PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testFileAlreadyExist PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testLocalFile PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testMalformLocalURL PASSED

ModelArchiverSuite > TorchServe > org.pytorch.serve.archive.ModelArchiveTest > testMalformedURL PASSED
<==========---> 82% EXECUTING [-27882695ms]

@harshbafna
Contributor

@jeffxtang: The front-end Gradle build suite, executed as part of the sanity suite, takes around 8 minutes to complete. Did your run finish after some time, or was it stuck there indefinitely?

@jeffxtang
Contributor

It was stuck there like 30 minutes.
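One way to tell a slow build from a hung one is to run the step under an explicit time limit. This is a minimal sketch, assuming a hypothetical helper `run_with_timeout`; the limit you choose is a judgment call (the roughly 8 minutes quoted above is the only figure this thread provides), and the demo uses a short Python child process instead of Gradle.

```python
import subprocess
import sys


def run_with_timeout(args, timeout_seconds):
    """Run a command; return its exit code, or None if it timed out.

    subprocess.run kills the direct child when the timeout expires.
    """
    try:
        completed = subprocess.run(args, timeout=timeout_seconds, check=False)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # Demo: a child that sleeps longer than the limit is reported as None.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print("result:", run_with_timeout(slow, 0.5))
```

Only the direct child is killed on timeout. Build tools that spawn daemons or worker processes may leave those running, so this helps with diagnosis more than with cleanup.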

@chauhang
Contributor

@dhaniram-kshirsagar Please verify that the gRPC tests are working fine on Windows. Based on the logs above, this is the part where things are stuck. The same tests are taking a long time to finish on Ubuntu as well.

@harshbafna
Contributor

@chauhang: There was a problem this morning (IST) with the pip 20.3.2 release, which was getting into a recursive loop while installing any package and thereby increasing the overall build time. We had created a PR (#904) to freeze the pip version to the last stable major release, 20.3. However, later in the day, this release was yanked from PyPI.
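A build script can guard against a release like this by checking the installed pip version before installing dependencies. The sketch below is illustrative only and is not what PR #904 does; `KNOWN_BAD_PIP` and `is_known_bad_pip` are hypothetical names, and the single listed version comes from the comment above.

```python
from importlib.metadata import PackageNotFoundError, version  # Python 3.8+

# Versions reported as problematic in this thread; extend as needed.
KNOWN_BAD_PIP = frozenset({"20.3.2"})


def is_known_bad_pip(pip_version, bad_versions=KNOWN_BAD_PIP):
    """Return True if the given pip version string is on the bad list."""
    return pip_version.strip() in bad_versions


if __name__ == "__main__":
    try:
        installed = version("pip")
    except PackageNotFoundError:
        installed = None
    if installed is None:
        print("pip is not installed in this environment")
    elif is_known_bad_pip(installed):
        # Pin target taken from the comment above (20.3).
        print(f"pip {installed} is flagged; consider: python -m pip install pip==20.3")
    else:
        print(f"pip {installed} is not on the flagged list")
```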

We don't see any problem with the gRPC test cases on any platform. Also, our sanity and regression CI builds run regularly on Windows Server 2019 using Windows PowerShell without any problems.

The other gRPC-related problem reported by @jeffxtang in #907 is not a Windows issue but a generic documentation issue, and I have created #908 for it.

@jeffxtang
Contributor

It was stuck there like 30 minutes.

Did another test and it was stuck for about an hour. The full log was uploaded at issue #866.

@harshbafna
Contributor

Closing this, as the sanity suite is working fine on Windows.

The problem faced by @jeffxtang is already being tracked through #866.

5 participants