504 error during deployment or destroying resources #116

bryanheo · 2022-08-10T08:46:54Z

Hello

We are deploying NetApp CVO in AWS through Terraform, and we sometimes get a 504 error during deployment, as shown below, even though the actual resources are successfully created in AWS. Because of the error, the TF state file is not updated and we have to redeploy (destroying the existing AWS resources through CloudFormation and redeploying through Terraform Enterprise). When we redeploy, it works fine. The error also sometimes happens when we destroy TF resources.
Is this a known issue, or is it something you can investigate?

504 error during the deployment

504 error while destroying TF resources

Error: code: 504, message: 
│ 
│   with module.usw2.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {

Regards
Moon

suhasbshekar · 2022-08-12T19:05:00Z

We have not seen this kind of issue before. Could you send the playbook (.tf) file you are using,
so that we can try to reproduce it on our end?

bryanheo · 2022-08-15T14:57:25Z

@suhasbshekar the error does not always happen, but it sometimes happens, along with other error messages like the ones below.
In addition, when we deploy a CVO HA cluster, it always takes 35 minutes. Is that normal?

Could you let me know a safe way to upload the files so that you can investigate?

Error 1

╷
│ Error: Post "https://netapp-cloud-account.auth0.com/oauth/token": dial tcp: lookup netapp-cloud-account.auth0.com on 127.0.0.1:53: read udp 127.0.0.1:57538->127.0.0.1:53: read: connection refused
│ 
│ 
╵

Error 2

│ Error: Post "https://cloudmanager.cloud.netapp.com/occm/api/aws/ha/working-environments": dial tcp: lookup cloudmanager.cloud.netapp.com on 127.0.0.1:53: read udp 127.0.0.1:54913->127.0.0.1:53: read: connection refused
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│

Error 3

╷
│ Error: code: 500, message: {"message":"Server Fault","causeMessage":"ConnectException: Connection refused (Connection refused)"}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 4

 Error: code: 400, message: Failure received for messageId JDxc6CJu with context . Failure message: occm: Name or service not known
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 5

╷
│ Error: code: 400, message: Failure received for messageId Va9yIR5c with context . Failure message: {"message":"Connection refused: occm/10.5.20.4:80","cause":null,"stackTrace":[{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":96,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":82,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"recover","fileName":"Try.scala","lineNumber":233,"className":"scala.util.Failure","nativeMethod":false},{"methodName":"run","fileName":"Promise.scala","lineNumber":450,"className":"scala.concurrent.impl.Promise$Transformation","nativeMethod":false},{"methodName":"processBatch","fileName":"BatchingExecutor.scala","lineNumber":55,"className":"akka.dispatch.BatchingExecutor$AbstractBatch","nativeMethod":false},{"methodName":"$anonfun$run$1","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"apply","fileName":"JFunction0$mcV$sp.scala","lineNumber":18,"className":"scala.runtime.java8.JFunction0$mcV$sp","nativeMethod":false},{"methodName":"withBlockContext","fileName":"BlockContext.scala","lineNumber":94,"className":"scala.concurrent.BlockContext$","nativeMethod":false},{"methodName":"run","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"run","fileName":"AbstractDispatcher.scala","lineNumber":47,"className":"akka.dispatch.TaskInvocation","nativeMethod":false},{"methodName":"exec","fileName":"ForkJoinExecutorConfigurator.scala","lineNumber":47,"className":"akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask","nativeMethod":false},{"methodName":"doExec","f
ileName":"ForkJoinTask.java","lineNumber":289,"className":"java.util.concurrent.ForkJoinTask","nativeMethod":false},{"methodName":"runTask","fileName":"ForkJoinPool.java","lineNumber":1056,"className":"java.util.concurrent.ForkJoinPool$WorkQueue","nativeMethod":false},{"methodName":"runWorker","fileName":"ForkJoinPool.java","lineNumber":1692,"className":"java.util.concurrent.ForkJoinPool","nativeMethod":false},{"methodName":"run","fileName":"ForkJoinWorkerThread.java","lineNumber":175,"className":"java.util.concurrent.ForkJoinWorkerThread","nativeMethod":false}],"localizedMessage":"Connection refused: occm/10.5.20.4:80","suppressed":[]}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {

suhasbshekar · 2022-08-16T16:50:59Z

Yes, sometimes it takes 35 minutes or more. We test with the demo version or with simple inputs; the time depends on the complexity of the inputs used.

edarzi · 2022-08-17T06:02:59Z

It can reach 35 minutes for HA.
Is the 504 issue reproducible? In that specific case, it seems that your connector was restarted due to health failures.

bryanheo · 2022-08-17T10:23:14Z

@edarzi the 504 error happens while the mediator is being created.
I am trying to debug the issue, but the Cloud Manager timeline does not show the error and the CVO clusters are successfully created after the error. To update the TF state file, I have to destroy the CVOs via CloudFormation and redeploy through TF. Is there any way to investigate this? How can I check whether the connector was restarted during the deployment?

bryanheo · 2022-08-17T10:25:16Z

Could you also let us know how to import netapp-cloudmanager_cvo_aws into the TF state file?

bryanheo · 2022-08-18T10:39:56Z

@edarzi @suhasbshekar as required, I have created a NetApp support case (2009274344) and uploaded the playbook file to the case.
We are using a connector policy as guided by NetApp (https://docs.netapp.com/us-en/cloud-manager-setup-admin/reference-permissions-aws.html).
Could you have a look?

edarzi · 2022-08-18T12:40:00Z

> Could you let us know how to import netapp-cloudmanager_cvo_aws in TF state file as well?

https://registry.terraform.io/providers/NetApp/netapp-cloudmanager/latest/docs/data-sources/cvo_aws

bryanheo · 2022-08-25T12:29:25Z

@edarzi @lonico we still have the same issue, and we are trying to import the resources rather than deleting the CVO through CloudFormation. Could we import the CVO resources with 'terraform import' rather than using the data source?

module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Creating...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Still creating... [10s elapsed]
╷
│ Error: code: 400, message: {"message":"The name netappamtnuse1pri is already used by another working environment. Please use another one.","causeMessage":"BadRequestException: The name netappamtnuse1pri is already used by another working environment. Please use another one."}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵
moonyoung.heo@C02C35ZVMD6T ap-netapp-np % terraform import module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this VsaWorkingEnvironment-xxxxx
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Importing from ID "VsaWorkingEnvironment-xxxxx"...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Import prepared!
  Prepared netapp-cloudmanager_cvo_aws for import
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Refreshing state... [id=VsaWorkingEnvironment-xxxxx]
╷
│ Error: code: 400, message: Missing X-Agent-Id header
│ 
│ 
╵

lonico · 2022-08-25T17:08:37Z

No, we don't support importing a connector. The APIs do not allow us to fetch enough information.

It would be better if Cloud Manager could provide an API to create a connector, rather than us having to go through the Cloud Provider APIs and Cloud Manager APIs. This introduces a level of complexity.

bryanheo · 2022-09-01T08:52:14Z

@lonico @edarzi @suhasbshekar the issue keeps happening, both from Terraform Enterprise and from a local laptop. I cannot see any error on the Cloud Manager timeline. The CVOs are successfully deployed in AWS even though the error occurs, but I have to redeploy them because the TF state file is inconsistent.
Do you have any way to find out why the 504 error happens?

lonico · 2022-09-01T14:29:52Z

@bryanheo Since it looks like a Cloud Manager issue, I would suggest you open a case to track this issue.

@suhasbshekar @edarzi Should we retry on such an error? How many times? Can we be more specific about the context?

bryanheo · 2022-09-02T11:03:02Z

@lonico Thank you for your suggestion. I am not sure whether this issue is related to Cloud Manager, because I did not get a 504 error when I deployed CVO manually through Cloud Manager. Anyway, as you suggested, I will create a case on the NetApp support site.

edarzi · 2022-09-02T11:14:49Z

We will need some more details in order to track and debug this. Ping me at [email protected]

bryanheo · 2022-09-02T21:47:02Z

@edarzi Thank you for your reply. As mentioned earlier, I have uploaded our entire TF code to NetApp support case (2009274344). Could you have a look? If you cannot access the case, please let me know.

edarzi · 2022-09-03T10:03:33Z

I will need logs from the connector

bryanheo · 2022-09-05T09:51:24Z

@edarzi could you let me know how to get the logs from the connector? Could we use AutoSupport?

edarzi · 2022-09-05T10:04:49Z

You can download the AutoSupport file from the Cloud Manager UI and send it to my email, please.
You can also send me the service manager log from: /opt/application/netapp/cloudmanager/log/service-manager.log

lonico · 2022-09-08T16:40:26Z

@edarzi Any update on this? We're attempting to add a retry, but without understanding the root cause, we don't know whether a retry would help, or how many times / how long we should retry.
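
For context on the retry question above, here is a minimal shell sketch of a bounded retry with exponential backoff on a 504 status. It is not the provider's implementation: `retry_on_504`, the attempt count, and the delay are hypothetical, and the wrapped command is assumed to print an HTTP status code on stdout.

```shell
# retry_on_504 MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: COMMAND is assumed to print an HTTP status code on stdout.
# Retries only while the status is 504, doubling the delay after each attempt.
retry_on_504() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    status=$("$@")
    if [ "$status" != "504" ]; then
      # Any non-504 status is returned to the caller as-is.
      echo "$status"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      # Out of attempts: report the last status and signal failure.
      echo "$status"
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For example, `retry_on_504 3 5 some_command` (where `some_command` is a placeholder) would try up to three times, waiting 5 and then 10 seconds between attempts. Note that a create request which ended with a 504 may already have succeeded on the backend, as reported earlier in this thread, so repeating it is only safe when the request is idempotent or the caller first checks whether the resource already exists.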

bryanheo · 2022-09-08T16:51:46Z

@edarzi I sent an email with the AutoSupport file from the Cloud Manager UI, but the file is about 30MB and it was rejected by your mail server. Could you let me know where to upload the 30MB file? (The NetApp Support ticket does not allow the AutoSupport 7z file either.)
In addition, I do not know how to get /opt/application/netapp/cloudmanager/log/service-manager.log. Could you let me know how to get the log file?

lonico · 2022-09-09T14:03:14Z

We released 22.9.0 yesterday (9/8). It provides some retries on 504 errors. Can you see if it helps?

bryanheo · 2022-09-12T14:16:48Z

@lonico I have deployed NetApp CVO clusters several times with 22.9.0 and I have not seen the 504 error so far.
It looks better than the previous version. I will let you know if we get the error again.

lonico · 2022-09-12T14:32:58Z

That's great news. As you know, we added a retry on 504. You can see it in the logs by setting TF_LOG to DEBUG or TRACE.
I'm curious to see whether it always works on the first retry (which would indicate some sort of transient issue) or whether we need to retry several times.
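
A minimal shell sketch of how the retries might be inspected. `TF_LOG` and `TF_LOG_PATH` are standard Terraform environment variables; the log file name, the helper `count_504`, and the assumption that each 504 response shows up in the debug log as `code: 504` (the format of the error text earlier in this thread) are illustrative only, and the provider's actual log lines may differ.

```shell
# Enable Terraform debug logging and send it to a file instead of stderr.
export TF_LOG=DEBUG
export TF_LOG_PATH="${TF_LOG_PATH:-./terraform-debug.log}"

# count_504 LOGFILE: print the number of log lines containing "code: 504".
# Hypothetical helper; the matched string is an assumption about the log format.
count_504() {
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c 'code: 504' "$1" || true
}
```

After a `terraform apply` run with these variables set, `count_504 "$TF_LOG_PATH"` would give a rough idea of how many 504 responses were seen before the operation succeeded.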

laagabi · 2022-09-26T14:45:20Z

Hi @lonico

I'm Gabor with NetApp Tech Support and have been working with the customer on this issue.

@bryanheo as discussed, for me to investigate from the Cloud Manager end, we would need verbose logging enabled in Cloud Manager. This might allow us to see how long it takes Cloud Manager to process the requests, so that we can proactively enhance the software to work better with Terraform.

Once done, simply trigger a Cloud Manager AutoSupport and I will review it.

bryanheo · 2022-11-04T09:15:53Z

Hi @lonico
I thought the issue had been resolved, but it has happened again.
As mentioned above, the NetApp AWS resources were successfully created, but because of the 504 error the Terraform state was not updated. In other words, we have to redeploy the cluster. Could you investigate it?

lonico added the "backend" label (Issues related to the backend service or the APIs.) on Aug 17, 2022

lonico mentioned this issue on Sep 6, 2022: Error Code 504 message while deploying CVO #130 (Closed)
