
Worker Died in load_model inside model_service_worker.py #789

Closed
Aldo-Aditiya opened this issue Nov 19, 2020 · 2 comments

Comments

@Aldo-Aditiya

Context

  • torchserve version: 0.2.0
  • torch version: 1.6.0
  • torchvision version [if any]: 0.7.0
  • java version: installed from main readme
  • Operating System and version: ubuntu 20.04

Your Environment

  • Installed using source? [yes/no]: yes
  • Are you planning to deploy it using docker container? [yes/no]: yes (I deployed it in docker)
  • Is it a CPU or GPU environment?: GPU
  • Using a default/custom handler? [If possible upload/share custom handler/model]: Custom
  • What kind of model is it e.g. vision, text, audio?: Vision

Expected Behavior

My custom model should load and serve successfully with TorchServe.

Current Behavior

  • My custom model fails to run
  • On closer investigation, the worker dies and restarts every few seconds, as shown in the ts_log excerpt below

Steps to Reproduce

  1. Created a model following https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html
  2. Exported the model to a .pt file
  3. Created a custom handler, and created the MAR file with this command:
torch-model-archiver --model-name backside_ver3 \
--version 1.0 \
--serialized-file /workspace/model/trigger_v3_17-11-10/model_jit.pt  \
--export-path ./model-store \
--extra-files ./index_to_name.json,./backside_trigger.py \
--handler ./handler.py
  4. Ran torchserve (the config file is only used to set the IP and port):
torchserve --start --ncs \
--ts-config ./config.properties \
--model-store model-store \
--models backside_ver3=backside_ver3.mar

...

Failure Logs [if any]

2020-11-19 08:11:12,077 [INFO ] main org.pytorch.serve.ModelServer - 
Torchserve version: 0.2.0
TS Home: /home/venv/lib/python3.6/site-packages
Current directory: /workspace
Temp directory: /home/model-server/tmp
Number of GPUs: 1
Number of CPUs: 12
Max heap size: 3986 M
Python executable: /home/venv/bin/python3
Config file: /workspace/model/trigger_v3_17-11-10/serve/config.properties
Inference address: http://0.0.0.0:15008
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /workspace/model/trigger_v3_17-11-10/serve
Initial Models: backside_ver3=backside_ver3.mar
Log dir: /workspace/logs
Metrics dir: /workspace/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 1
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Prefer direct buffer: false
Custom python dependency for model allowed: false
Metrics report format: prometheus
Enable metrics API: true
2020-11-19 08:11:12,099 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: backside_ver3.mar
2020-11-19 08:11:12,696 [INFO ] main org.pytorch.serve.archive.ModelArchive - eTag 383a7786684a459d922863d6131af313
2020-11-19 08:11:12,704 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model backside_ver3
2020-11-19 08:11:12,704 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model backside_ver3
2020-11-19 08:11:12,704 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model backside_ver3 loaded.
2020-11-19 08:11:12,704 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: backside_ver3, count: 1
2020-11-19 08:11:12,712 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2020-11-19 08:11:12,761 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:15008
2020-11-19 08:11:12,762 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2020-11-19 08:11:12,762 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2020-11-19 08:11:12,763 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2020-11-19 08:11:12,763 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
2020-11-19 08:11:12,766 [WARN ] pool-2-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2020-11-19 08:11:12,791 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.ts.sock.9000
2020-11-19 08:11:12,791 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]4582
2020-11-19 08:11:12,791 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-11-19 08:11:12,791 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.9
2020-11-19 08:11:12,792 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-backside_ver3_1.0 State change null -> WORKER_STARTED
2020-11-19 08:11:12,795 [INFO ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2020-11-19 08:11:12,803 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2020-11-19 08:11:13,175 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
2020-11-19 08:11:13,176 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-11-19 08:11:13,176 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 176, in <module>
2020-11-19 08:11:13,176 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-11-19 08:11:13,176 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 148, in run_server
2020-11-19 08:11:13,176 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-11-19 08:11:13,176 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 112, in handle_connection
2020-11-19 08:11:13,176 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2020-11-19 08:11:13,177 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-11-19 08:11:13,177 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 85, in load_model
2020-11-19 08:11:13,177 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2020-11-19 08:11:13,177 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-11-19 08:11:13,177 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_loader.py", line 84, in load
2020-11-19 08:11:13,177 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
	at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-11-19 08:11:13,177 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     module = importlib.import_module(module_name)
2020-11-19 08:11:13,179 [WARN ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: backside_ver3, error: Worker died.
2020-11-19 08:11:13,179 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-backside_ver3_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-11-19 08:11:13,179 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
2020-11-19 08:11:13,179 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return _bootstrap._gcd_import(name[level:], package, level)
2020-11-19 08:11:13,179 [WARN ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-backside_ver3_1.0-stderr
2020-11-19 08:11:13,179 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
2020-11-19 08:11:13,179 [WARN ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-backside_ver3_1.0-stdout
2020-11-19 08:11:13,179 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
2020-11-19 08:11:13,180 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-backside_ver3_1.0-stdout
2020-11-19 08:11:13,180 [INFO ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2020-11-19 08:11:13,184 [INFO ] W-9000-backside_ver3_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-backside_ver3_1.0-stderr
2020-11-19 08:11:14,256 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.ts.sock.9000
2020-11-19 08:11:14,256 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]4618
2020-11-19 08:11:14,256 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-11-19 08:11:14,256 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-backside_ver3_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-11-19 08:11:14,256 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.9
2020-11-19 08:11:14,256 [INFO ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2020-11-19 08:11:14,257 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2020-11-19 08:11:14,604 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
2020-11-19 08:11:14,605 [INFO ] epollEventLoopGroup-5-2 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2020-11-19 08:11:14,605 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-11-19 08:11:14,605 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - System state is : WORKER_STARTED
2020-11-19 08:11:14,605 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 176, in <module>
2020-11-19 08:11:14,605 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2133)
	at java.base/java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:432)
	at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:129)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)
2020-11-19 08:11:14,605 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-11-19 08:11:14,605 [WARN ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: backside_ver3, error: Worker died.
2020-11-19 08:11:14,605 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 148, in run_server
2020-11-19 08:11:14,605 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-backside_ver3_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-11-19 08:11:14,605 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-11-19 08:11:14,605 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 112, in handle_connection
2020-11-19 08:11:14,606 [WARN ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-backside_ver3_1.0-stderr
2020-11-19 08:11:14,606 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-11-19 08:11:14,606 [WARN ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - terminateIOStreams() threadName=W-9000-backside_ver3_1.0-stdout
2020-11-19 08:11:14,606 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/venv/lib/python3.6/site-packages/ts/model_service_worker.py", line 85, in load_model
2020-11-19 08:11:14,606 [INFO ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9000 in 1 seconds.
2020-11-19 08:11:14,606 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-backside_ver3_1.0-stdout
2020-11-19 08:11:14,610 [INFO ] W-9000-backside_ver3_1.0-stderr org.pytorch.serve.wlm.WorkerLifeCycle - Stopped Scanner - W-9000-backside_ver3_1.0-stderr
2020-11-19 08:11:15,681 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Listening on port: /home/model-server/tmp/.ts.sock.9000
2020-11-19 08:11:15,681 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - [PID]4636
2020-11-19 08:11:15,681 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-11-19 08:11:15,681 [DEBUG] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-backside_ver3_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-11-19 08:11:15,682 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.9
2020-11-19 08:11:15,682 [INFO ] W-9000-backside_ver3_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /home/model-server/tmp/.ts.sock.9000
2020-11-19 08:11:15,683 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /home/model-server/tmp/.ts.sock.9000.
2020-11-19 08:11:16,045 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process died.
2020-11-19 08:11:16,045 [INFO ] epollEventLoopGroup-5-3 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
2020-11-19 08:11:16,045 [INFO ] W-9000-backside_ver3_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):

Any help would be greatly appreciated 🙏

@Aldo-Aditiya
Author

The problem has been resolved. It turns out I hadn't implemented the handler interface that TorchServe expects. If anyone is having the same problem, I suggest looking at these:

https://towardsdatascience.com/deploy-models-and-create-custom-handlers-in-torchserve-fc2d048fbe91

#779

@Tazovsky

> It seems the problem has been resolved. Turns out I didn't implement the right handler to be directed to torchserve. If anyone is having the same problem, I suggest you look at these:
>
> https://towardsdatascience.com/deploy-models-and-create-custom-handlers-in-torchserve-fc2d048fbe91
>
> #779

You have shared a URL to a blog post where I have to pay before I can see the full article. Are you kidding?
