online-deployment issue "The Managed Inference service creation is taking longer than our normal time" #4088

Closed
bhushangholave opened this issue Nov 9, 2021 · 10 comments
Assignees: -
Labels: customer-reported, extension/ml, Machine Learning, Service Attention

Comments

@bhushangholave

az --version
azure-cli 2.29.0 *

az extension list

[
{
"experimental": false,
"extensionType": "whl",
"name": "ml",
"path": "/home/icertisadmin/.azure/cliextensions/ml",
"preview": true,
"version": "2.0.2"
}
]

Description of issue (in as much detail as possible)

Command:- az ml online-deployment create --name blue --endpoint advanced-ner-endpoint-demo -f deployment.yml --all-traffic
Exception details
Creating or updating online_deployments
Check: endpoint advanced-endpoint exists
The deployment request bhushan-workspace-advanced-ner-endpoint-demo-8612672 was accepted. ARM deployment URI for reference:
https://ms.portal.azure.com/#blade/HubsExtension/DeploymentDetailsBlade/overview/id/%2Fsubscriptions%2F2b044625-9119-453a-8f50-53426430883b%2
Registering model version (ad2f42d3-xxxx-xxxx-be9f-2e20ef135c16 1 ) Done (1s)
Registering code version (c665e43f-xxxx-xxxx-a98d-a84911248e91 1 ) Done (1s)
Registering environment version (32890045-xxxx-48d3-a43c-6f2ec0323d08 1 ) Done (6s)
Creating deployment blue ..........
Code: ResourceDeploymentFailure
Message: The resource operation completed with terminal provisioning state 'Failed'.
Exception Details: (DeploymentTimedOut) The Managed Inference service creation is taking longer than our normal time.


ghost added the needs-triage, question, and customer-reported labels Nov 9, 2021
ghost removed the needs-triage label Nov 9, 2021
yonzhan added the Service Attention label and removed the question label Nov 9, 2021
@ghost

ghost commented Nov 9, 2021

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

@yonzhan
Collaborator

yonzhan commented Nov 9, 2021

route to service team

@diondrapeck
Member

Acknowledged - beginning investigation and will reach out for more information if needed.

@vs-li

vs-li commented Nov 10, 2021

Hi @bhushangholave! We found an error stating that the operation timed out waiting for an image build to complete. If you check the Azure portal, are you able to find an image build error under operation details?

@vizhur

vizhur commented Nov 10, 2021

AzureML doesn't have access to the image build log for privacy reasons. The log is available to the user, though, either from the storage account associated with the workspace or from the container registry associated with the workspace (ACR task runs). A timeout during the image build usually points to a poorly specified environment: pip fails to resolve the conflicts in the user's dependencies within a reasonable time.

The fix on the user's end is to inspect the dependencies and resolve the conflicts. This should be easy to reproduce locally by materializing the environment from the conda spec.

A temporary workaround that might work is to pin pip<=20.2.4, which uses the older resolver that ignores some of the conflicts.
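
As a rough illustration of the pip pin, here is a minimal Python sketch that rewrites the top-level `- pip` entry of a conda spec so the environment is built with the older resolver. The helper name `pin_pip` and the sample spec are hypothetical and not taken from this issue; it assumes the spec lists pip as a plain `- pip` (or `- pip=<version>`) dependency.

```python
import re

# Pin suggested as a workaround in this thread; adjust if needed.
PIP_PIN = "pip<=20.2.4"

# Matches a list item that is exactly `pip`, optionally with a version
# constraint. It does not match the nested `- pip:` mapping, nor other
# packages such as `pip-tools`.
_PIP_ITEM = re.compile(
    r"^([ \t]*-[ \t]*)pip(?:[ \t]*[<>=!~][^\s#]*)?[ \t]*$",
    re.MULTILINE,
)


def pin_pip(conda_yaml: str, pin: str = PIP_PIN) -> str:
    """Return the conda spec text with its `- pip` dependency pinned."""
    return _PIP_ITEM.sub(lambda m: m.group(1) + pin, conda_yaml)


if __name__ == "__main__":
    # Hypothetical spec, for illustration only.
    spec = (
        "dependencies:\n"
        "  - python=3.8\n"
        "  - pip\n"
        "  - pip:\n"
        "    - some-package==1.0\n"
    )
    print(pin_pip(spec))
```

The text-level rewrite keeps comments and ordering in the spec intact. It is a stopgap only, since the underlying conflict in the pip dependencies is still there.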

@bhushangholave
Author

@vs-li and @vizhur, thank you for your comments.
It looks like a timeout issue to me, but it reproduces every time.
The error details in the ARM deployment are:

{
    "status": "Failed",
    "error": {
        "code": "DeploymentTimedOut",
        "message": "The Managed Inference service creation is taking longer than our normal time.",
        "details": [],
        "additionalInfo": []
    }
}
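
For anyone scripting around this failure, a minimal Python sketch of reading such a payload follows. The function name `summarize_arm_error` is hypothetical, and the sketch assumes any entries in `details` are objects with a `code` field. The payload above has an empty `details` list, which is why the root cause is not visible from the ARM error alone.

```python
import json


def summarize_arm_error(payload: str) -> str:
    """Return 'code: message' for a failed operation, else its status."""
    doc = json.loads(payload)
    error = doc.get("error") or {}
    if doc.get("status") == "Failed" and error:
        summary = f"{error.get('code', 'Unknown')}: {error.get('message', '')}"
        # Assumption: each details entry is an object carrying a `code`.
        nested = [d.get("code", "") for d in error.get("details", [])]
        if nested:
            return f"{summary} (details: {', '.join(nested)})"
        return f"{summary} (no details)"
    return doc.get("status", "Unknown")


if __name__ == "__main__":
    payload = """
    {
        "status": "Failed",
        "error": {
            "code": "DeploymentTimedOut",
            "message": "The Managed Inference service creation is taking longer than our normal time.",
            "details": [],
            "additionalInfo": []
        }
    }
    """
    print(summarize_arm_error(payload))
```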


@bhushangholave
Author

bhushangholave commented Nov 11, 2021

@vs-li In the Azure portal, the Environment tab is still stuck at installing pip dependencies.
The same environment works in our other workspace.
I'm attaching the logs and the dependencies file.
The build timed out after 1h30m:
Installing pip dependencies: ...working...
Run ID: cf4 timed out after 1h30m0s

Attachments: Environment.log, pipdependencies file.txt

@vizhur

vizhur commented Nov 12, 2021

@bhushangholave you have a dependency conflict. Unfortunately, conda swallows pip's log, so you cannot see pip's struggles. If you run pip install -r requirements.txt with those deps using the latest pip, you should see where pip gets stuck. There are a bunch of tickets filed against pip for this kind of issue; feel free to file one too.

Just to unblock yourself, you can try pinning pip<=20.2.4 to use the old resolver, which ignores some of the conflicts.
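
To avoid waiting out a 1h30m build just to see where pip stalls, the local repro can be wrapped in a short timeout. The following is a minimal Python sketch, not an AzureML or pip feature: `run_with_timeout` is a hypothetical helper, and the timeout value is arbitrary.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (timed_out, combined stdout/stderr text)."""
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return False, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may come back as bytes or str depending on the
        # Python version and platform, so handle both.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return True, out


if __name__ == "__main__":
    # Quick self-check with a command that finishes immediately.
    timed_out, output = run_with_timeout(
        [sys.executable, "-c", "print('ok')"], 30
    )
    print(timed_out, output.strip())
```

In a throwaway virtual environment, calling `run_with_timeout([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], 600)` and reading the tail of the returned output should show which package pip was still trying to resolve when the timeout hit.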

@bhushangholave
Author

@vizhur you are right. I corrected my dependencies file and it worked. But the portal or the deployment creation should surface the correct error and logs for debugging.

@RakeshMohanMSFT
Contributor

@bhushangholave Since the workaround helped, closing this thread for now.
