The user reported this error when trying to deploy a merlin-tensorflow image >= 23.04. They are able to deploy the merlin-tensorflow:23.02 image on Azure Databricks. One main difference is the CUDA version in these images.
Spark driver could not be reached on startup. This issue can be caused by invalid Spark configurations or malfunctioning [init scripts](https://docs.microsoft.com/azure/databricks/clusters/init-scripts#global-and-cluster-named-init-script-logs). Please refer to the Spark driver logs to troubleshoot this issue, and contact Databricks if the problem persists.
Internal error message: Spark failed to start: Could not connect to driver instance. Possible reason: network misconfiguration.
Bug description
The user reported this error when trying to deploy a `merlin-tensorflow` image >= 23.04. They are able to deploy the `merlin-tensorflow:23.02` image on Azure Databricks. One main difference is the CUDA version in these images.
Steps/Code to reproduce bug
Expected behavior
Environment details
Additional context
An engineer from the RAPIDS team did some debugging of the Spark cluster issue that this user is facing with the merlin-tensorflow:23.04 image. They spent some time converting the instructions from https://docs.databricks.com/clusters/custom-containers.html#option-2-build-your-own-docker-base into tests that we can run with Container Canary:
https://github.com/NVIDIA/container-canary/blob/main/examples/databricks.yaml
Here are some quick notes on running the test:
https://gist.github.com/jacobtomlinson/73f30f5657a370e7ed2a559b0eb7123f
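To compare the working and failing images side by side, the Container Canary checks above could be run against both tags. The following is only a rough sketch, not the procedure from the linked notes: it assumes the `canary` CLI is installed and invoked as `canary validate --file <manifest> <image>`, that the manifest can be passed as a URL (the raw form of the `databricks.yaml` link above), and that the images live under `nvcr.io/nvidia/merlin/merlin-tensorflow`. The helper name `build_command` is hypothetical.

```python
import shutil
import subprocess

# Assumption: raw form of the databricks.yaml manifest linked above.
CHECKS_URL = "https://github.com/NVIDIA/container-canary/raw/main/examples/databricks.yaml"
# Assumption: registry path of the Merlin TensorFlow images.
IMAGE_REPO = "nvcr.io/nvidia/merlin/merlin-tensorflow"


def build_command(tag, checks=CHECKS_URL, repo=IMAGE_REPO):
    """Return the (assumed) canary invocation for one image tag as an argv list."""
    return ["canary", "validate", "--file", checks, f"{repo}:{tag}"]


def main():
    # 23.02 is reported to work on Azure Databricks; 23.04 is reported to fail.
    for tag in ("23.02", "23.04"):
        cmd = build_command(tag)
        print(" ".join(cmd))
        if shutil.which("canary") is None:
            # canary is not installed: only show the command that would be run.
            continue
        result = subprocess.run(cmd)
        print(f"{tag}: exit code {result.returncode}")


if __name__ == "__main__":
    main()
```

If the 23.02 image passes a check that 23.04 fails, that check is a reasonable place to start looking for the difference between the two images.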