504 error during deployment or destroying resources #116

Open
bryanheo opened this issue Aug 10, 2022 · 25 comments
Labels
backend Issues related to the backend service or the APIs.

Comments

@bryanheo

Hello

We are deploying NetApp CVO in AWS through Terraform, and we sometimes get a 504 error during deployment, as shown below, even though the actual resources are successfully created in AWS. Because of the error, the TF state file is not updated and we have to redeploy (destroying the existing AWS resources through CloudFormation and redeploying with Terraform Enterprise). The redeployment then works fine. The error also sometimes happens when we destroy TF resources.
Is this a known issue, or is it something you can investigate?

504 error during the deployment
[Screenshot: 2022-08-10 at 09 35 24]

504 error during destroying TF resources

Error: code: 504, message: 
│ 
│   with module.usw2.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {

Regards
Moon

@suhasbshekar
Collaborator

We have not seen this kind of issue before. Could you send the playbook (.tf) file you are using,
so that we can try to reproduce it on our end?

@bryanheo
Author

bryanheo commented Aug 15, 2022

@suhasbshekar The error does not always happen, but it sometimes happens along with other error messages like the ones below.
In addition, when we deploy a CVO HA cluster, it always takes 35 minutes. Is that normal?

Could you let me know a safe way to upload the files so that you can investigate?

Error 1

╷
│ Error: Post "https://netapp-cloud-account.auth0.com/oauth/token": dial tcp: lookup netapp-cloud-account.auth0.com on 127.0.0.1:53: read udp 127.0.0.1:57538->127.0.0.1:53: read: connection refused
│ 
│ 
╵

Error 2

│ Error: Post "https://cloudmanager.cloud.netapp.com/occm/api/aws/ha/working-environments": dial tcp: lookup cloudmanager.cloud.netapp.com on 127.0.0.1:53: read udp 127.0.0.1:54913->127.0.0.1:53: read: connection refused
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 

Error 3

╷
│ Error: code: 500, message: {"message":"Server Fault","causeMessage":"ConnectException: Connection refused (Connection refused)"}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 4

 Error: code: 400, message: Failure received for messageId JDxc6CJu with context . Failure message: occm: Name or service not known
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵

Error 5

╷
│ Error: code: 400, message: Failure received for messageId Va9yIR5c with context . Failure message: {"message":"Connection refused: occm/10.5.20.4:80","cause":null,"stackTrace":[{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":96,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"applyOrElse","fileName":"MessageDispatcherActor.scala","lineNumber":82,"className":"com.cloudmanager.messagepoller.poller.actor.MessageDispatcherBehavior$$anonfun$handleMessage$3","nativeMethod":false},{"methodName":"recover","fileName":"Try.scala","lineNumber":233,"className":"scala.util.Failure","nativeMethod":false},{"methodName":"run","fileName":"Promise.scala","lineNumber":450,"className":"scala.concurrent.impl.Promise$Transformation","nativeMethod":false},{"methodName":"processBatch","fileName":"BatchingExecutor.scala","lineNumber":55,"className":"akka.dispatch.BatchingExecutor$AbstractBatch","nativeMethod":false},{"methodName":"$anonfun$run$1","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"apply","fileName":"JFunction0$mcV$sp.scala","lineNumber":18,"className":"scala.runtime.java8.JFunction0$mcV$sp","nativeMethod":false},{"methodName":"withBlockContext","fileName":"BlockContext.scala","lineNumber":94,"className":"scala.concurrent.BlockContext$","nativeMethod":false},{"methodName":"run","fileName":"BatchingExecutor.scala","lineNumber":92,"className":"akka.dispatch.BatchingExecutor$BlockableBatch","nativeMethod":false},{"methodName":"run","fileName":"AbstractDispatcher.scala","lineNumber":47,"className":"akka.dispatch.TaskInvocation","nativeMethod":false},{"methodName":"exec","fileName":"ForkJoinExecutorConfigurator.scala","lineNumber":47,"className":"akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask","nativeMethod":false},{"methodName":"doExec","fileName":"ForkJoinTask.java","lineNumber":289,"className":"java.util.concurrent.ForkJoinTask","nativeMethod":false},{"methodName":"runTask","fileName":"ForkJoinPool.java","lineNumber":1056,"className":"java.util.concurrent.ForkJoinPool$WorkQueue","nativeMethod":false},{"methodName":"runWorker","fileName":"ForkJoinPool.java","lineNumber":1692,"className":"java.util.concurrent.ForkJoinPool","nativeMethod":false},{"methodName":"run","fileName":"ForkJoinWorkerThread.java","lineNumber":175,"className":"java.util.concurrent.ForkJoinWorkerThread","nativeMethod":false}],"localizedMessage":"Connection refused: occm/10.5.20.4:80","suppressed":[]}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
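
Errors 1 and 2 above are DNS lookup failures: the resolver at 127.0.0.1:53 refused the connection, so the request never left the machine running Terraform. A quick way to check name resolution from that machine might look like the sketch below. `can_resolve` is a hypothetical helper written for illustration; the two hostnames are copied from the error messages.

```python
import socket


def can_resolve(hostname, port=443):
    """Return True if the local resolver can look up hostname."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved (or the resolver is unreachable).
        return False


if __name__ == "__main__":
    # Hostnames taken from Error 1 and Error 2.
    for host in ("netapp-cloud-account.auth0.com", "cloudmanager.cloud.netapp.com"):
        print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

If either name fails to resolve intermittently, the problem is likely on the client side (local DNS) rather than in the provider or Cloud Manager.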

@suhasbshekar
Collaborator

Yes, sometimes it will take 35 minutes or more. We test with the demo version or with simple inputs; the time depends on the complexity of the various inputs used.

@edarzi

edarzi commented Aug 17, 2022

It can reach 35 minutes for HA.
Is the 504 issue reproducible? In that specific case, it seems that your connector was restarted due to health failures.

@bryanheo
Author

@edarzi The 504 error happens while the mediator is being created.
I am trying to debug the issue, but the Cloud Manager timeline does not show the error, and the CVO clusters are successfully created after the error. In order to update the TF state file, I have to destroy the CVOs via CloudFormation and redeploy through TF again. Is there any way to investigate this? How can I check whether the connector was restarted during the deployment?

[Screenshot: 2022-08-17 at 11 19 05]

@bryanheo
Author

bryanheo commented Aug 17, 2022

Could you let us know how to import netapp-cloudmanager_cvo_aws in TF state file as well?

@lonico added the backend label on Aug 17, 2022

@bryanheo
Author

bryanheo commented Aug 18, 2022

@edarzi @suhasbshekar As requested, I have created a NetApp support case (2009274344) and uploaded the playbook file to the case.
We are using a connector policy as guided by NetApp (https://docs.netapp.com/us-en/cloud-manager-setup-admin/reference-permissions-aws.html).
Could you have a look?

@edarzi

edarzi commented Aug 18, 2022

Could you let us know how to import netapp-cloudmanager_cvo_aws in TF state file as well?

https://registry.terraform.io/providers/NetApp/netapp-cloudmanager/latest/docs/data-sources/cvo_aws

@bryanheo
Author

bryanheo commented Aug 25, 2022

@edarzi @lonico We still have the same issue, and we are trying to import the resources rather than deleting the CVO through CloudFormation. Could we import the CVO resources with 'terraform import' rather than using a data source?

module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Creating...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Still creating... [10s elapsed]
╷
│ Error: code: 400, message: {"message":"The name netappamtnuse1pri is already used by another working environment. Please use another one.","causeMessage":"BadRequestException: The name netappamtnuse1pri is already used by another working environment. Please use another one."}
│ 
│   with module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this,
│   on ../../../tf-module-aws-netapp/modules/cvo/cvo.tf line 1, in resource "netapp-cloudmanager_cvo_aws" "this":
│    1: resource "netapp-cloudmanager_cvo_aws" "this" {
│ 
╵
moonyoung.heo@C02C35ZVMD6T ap-netapp-np % terraform import module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this VsaWorkingEnvironment-xxxxx
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Importing from ID "VsaWorkingEnvironment-xxxxx"...
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Import prepared!
  Prepared netapp-cloudmanager_cvo_aws for import
module.use1.module.cvo.netapp-cloudmanager_cvo_aws.this: Refreshing state... [id=VsaWorkingEnvironment-xxxxx]
╷
│ Error: code: 400, message: Missing X-Agent-Id header
│ 
│ 
╵

@lonico
Contributor

lonico commented Aug 25, 2022

No, we don't support importing a connector. The APIs do not allow us to fetch enough information.

It would be better if Cloud Manager could provide an API to create a connector, rather than us having to go through the Cloud Provider APIs and Cloud Manager APIs. This introduces a level of complexity.

@bryanheo
Author

bryanheo commented Sep 1, 2022

@lonico @edarzi @suhasbshekar The issue keeps happening, both from Terraform Enterprise and from a local laptop. I cannot see any error on the Cloud Manager timeline. The CVOs are successfully deployed in AWS even though the error occurs, but I have to redeploy them because of the inconsistent TF state file.
Do you have any way to find out why the 504 error happens?


@lonico
Contributor

lonico commented Sep 1, 2022

@bryanheo Since it looks like a Cloud Manager issue, I would suggest you open a case to track this issue.

@suhasbshekar @edarzi Should we retry on such an error? How many times? Can we be more specific about the context?

@bryanheo
Author

bryanheo commented Sep 2, 2022

@lonico Thank you for your suggestion. I am not sure whether this issue is related to Cloud Manager, because I did not get a 504 error when I deployed CVO manually through Cloud Manager. Anyway, as you suggested, I will create a case on the NetApp support site.

@edarzi

edarzi commented Sep 2, 2022

Will need some more details in order to track and debug. Ping me at [email protected]

@bryanheo
Author

bryanheo commented Sep 2, 2022

@edarzi Thank you for your reply. As mentioned earlier, I have uploaded our entire TF code to the NetApp support case (2009274344). Could you have a look? If you cannot access the case, please let me know.

@edarzi

edarzi commented Sep 3, 2022 via email

@bryanheo
Author

bryanheo commented Sep 5, 2022

@edarzi could you let me know how to get the logs from the connector? Could we use AutoSupport?

@edarzi

edarzi commented Sep 5, 2022

You can download the AutoSupport file from the Cloud Manager UI and send it to my email, please.
You can also send me the service manager log from: /opt/application/netapp/cloudmanager/log/service-manager.log

@lonico
Contributor

lonico commented Sep 8, 2022

@edarzi Any update on this? We're attempting to add a retry, but without understanding the root cause, we don't know whether a retry would help, or how many times / how long we should retry.

@bryanheo
Author

bryanheo commented Sep 8, 2022

@edarzi I emailed the AutoSupport file from the Cloud Manager UI, but the file is about 30 MB and was rejected by your mail server. Could you let me know where to upload the 30 MB file? (The NetApp Support ticket does not accept the AutoSupport 7z file either.)
In addition, I do not know how to get /opt/application/netapp/cloudmanager/log/service-manager.log. Could you let me know how to get the log file?


@lonico
Contributor

lonico commented Sep 9, 2022

We released 22.9.0 yesterday (9/8). It provides some retries on 504 errors. Can you see if it helps?

@bryanheo
Author

@lonico I have deployed NetApp CVO clusters several times with 22.9.0 and have not seen the 504 error so far.
It looks better than the previous version. I will let you know if we get the error again.

@lonico
Contributor

lonico commented Sep 12, 2022

That's great news. As you know, we added a retry on 504. You could see it in the logs by setting TF_LOG to DEBUG or TRACE.
I'm curious to see whether it always works on the first retry (which would indicate some sort of transient issue) or whether we need to retry several times.
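
For readers following along, the idea of retrying on a 504 can be pictured roughly as below. This is a minimal sketch under stated assumptions, not the provider's actual implementation: `call_with_retry`, `max_attempts`, and `base_delay` are hypothetical names, and the simulated backend stands in for the Cloud Manager API. The function returns the number of attempts used, which is the figure to look at when checking whether the first retry is enough.

```python
import time


def call_with_retry(request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request() until it returns a status other than 504.

    request is any zero-argument callable returning an HTTP status code.
    Returns (status, attempts). Hypothetical helper for illustration only.
    """
    delay = base_delay
    status = None
    for attempt in range(1, max_attempts + 1):
        status = request()
        if status != 504:
            return status, attempt
        if attempt < max_attempts:
            sleep(delay)
            delay *= 2  # exponential backoff between attempts
    # Every attempt returned 504; report the last status and the attempt count.
    return status, max_attempts


# Simulated backend (assumption): two gateway timeouts, then success.
responses = iter([504, 504, 200])
status, attempts = call_with_retry(lambda: next(responses), sleep=lambda s: None)
print(status, attempts)  # prints: 200 3
```

Only 504 is retried here; any other status, including other errors, is returned immediately so that genuine failures are not masked.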

@laagabi

laagabi commented Sep 26, 2022

Hi @lonico

I'm Gabor with NetApp Tech Support and have been working with the customer on this issue.

@bryanheo As discussed, for me to investigate from the Cloud Manager end, we would need to enable verbose logging in Cloud Manager. This might allow us to see how long it takes Cloud Manager to process the requests, so that we can proactively enhance the software to work better with Terraform.

Once that is done, simply trigger a Cloud Manager AutoSupport and I will review it.

@bryanheo
Author

bryanheo commented Nov 4, 2022

Hi @lonico
I thought the issue had been resolved, but it has happened again.
As mentioned above, the NetApp AWS resources were successfully created, but because of the 504 error the Terraform state was not updated. In other words, we have to redeploy the cluster. Could you investigate?

