
Flink Application failing with hash collision message #69

Closed
soumyasmruti opened this issue Aug 7, 2019 · 22 comments
@soumyasmruti

The Flink application is failing with this message. What could be the reason, and how can I debug it?

{"json":{"app_name":"streaming-job","ns":"flink-operator","phase":"ClusterStarting"},"level":"info","msg":"Handle state skipped for application, lastSeenError UnknownMethod call failed with status FAILED and message []: found hash collision for deployment, you must do a clean deploy","ts":"2019-08-07T04:21:52Z"}
@cjmakes

cjmakes commented Aug 7, 2019

+1. Have been seeing this for around a week. I have been trying to debug but haven't been able to get anywhere further than reading through pkg/controller/flink/flink.go:341.

My current understanding is that when making a deployment, the operator runs through a list of currently deployed applications and hashes them to make sure it doesn't deploy the same application twice.

My current (baseless) guess is that the application being deployed is added to the list before the hash check happens. This is just a guess; if someone from the team could either confirm or correct me, that would be awesome.
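To make the discussion concrete, here is a simplified, hypothetical sketch (not the operator's actual code) of the kind of check being described: the operator derives a deterministic hash from the application spec, labels the deployment with it, and reports a "hash collision" when a live deployment carries the expected hash but its contents no longer match what the operator would generate now.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// specHash derives a short deterministic hash from an application spec.
// This is an illustrative stand-in for the operator's real hash function.
func specHash(spec string) string {
	h := fnv.New32a()
	h.Write([]byte(spec))
	return fmt.Sprintf("%08x", h.Sum32())
}

// detectCollision sketches the failure mode: a live deployment carries the
// same hash label as the desired spec, yet its contents differ from what the
// operator would generate, so it is flagged as a collision requiring a clean
// deploy.
func detectCollision(liveHashLabel, liveContents, desiredSpec string) bool {
	return liveHashLabel == specHash(desiredSpec) && liveContents != desiredSpec
}

func main() {
	desired := "app-spec-v1"
	// The live deployment was created from the same spec, but its stored
	// contents were mutated after creation (e.g. defaults filled in), so the
	// equality check fails even though the hash label matches.
	fmt.Println(detectCollision(specHash(desired), "app-spec-v1+defaults", desired)) // true
}
```

As the later comments in this thread show, the mutation in question turned out to be the Kubernetes API server filling in defaulted fields on the live objects.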

@anandswaminathan
Copy link
Contributor

anandswaminathan commented Aug 7, 2019

@soumyasmruti @cjmakes

There are two cases where this can happen:

  • We recently published a bunch of backward-incompatible changes (as we moved to Beta). This includes modifying the hash function to fix several issues. You would hit this issue if the hash function in the operator changes while the application is in a non-Running/DeployFailed state.

This can happen in the following cases, but only when the application is deployed for the first time ever. If there is already a job running, updates to an existing application will move the application to the DeployFailed state.

  1. The application is updated/modified while it is stuck and unable to progress (due to underlying issues such as the pod/container not coming up or crashing).
  2. A newer version of the operator (with a modified hash function) is deployed while the application is in a non-running state.

As the message states, to get around the issue, you will have to delete and recreate the application resource. There is no real Flink job or application underneath when the application is in "ClusterStarting" state.

Let me know if you guys performed any of the above operations ^

@cjmakes

cjmakes commented Aug 7, 2019

We recently published a bunch of backward incompatible changes (as we moved to Beta)

I was hitting this before the upgrade to beta.

The application is updated/modified when it is stuck and unable to progress (due to underlying issues pod/container not coming up or crashing)

This is on the first deploy when there is nothing running. There doesn't seem to be a problem with the underlying container. I can get to the flink gui and even submit a job.

A newer version of the operator being deployed (with a modified hash function) when the application is currently in non running state.

I haven't upgraded the operator with any applications deployed.

As the message states, to get around the issue, you will have to delete and recreate the application resource

I have tried deleting the application, reinstalling the operator, even starting a new cluster.

Do you have any tips on how we could go about debugging this? Enabling debug logging in the operator or a guide to writing tests to try and expose this?

@anandswaminathan
Contributor

anandswaminathan commented Aug 7, 2019

@cjmakes

I have not hit the issue, so I am just listing cases where this can happen. It happened once in my local setup where I had multiple operators running. My previous reply was more towards the issue of ClusterStarting that @soumyasmruti indicated.

Can you provide more information:

  • The logs
  • The phase where the application is stuck
  • Output of the flink application status, pods, deployments.

Can you also verify that you are not running multiple versions of the operator?

@cjmakes

cjmakes commented Aug 7, 2019

I've attached logs of the JobManager, TaskManager, and Operator. The application is stuck in CreatingCluster. Also attaching the description of the application.

Confirmed there is only 1 instance of 1 version of the operator running.

jm.log
tm.log
operator.log
application.log

How can I set the operator log level to debug mode?

@anandswaminathan
Contributor

@cjmakes Can you share the output of the deployment and pod status?

@soumyasmruti
Author

@anandswaminathan What I did was pull the latest changes yesterday, build the flink operator, and deploy it locally. I found this error when I tried to run my applications, which were built on the Dockerfiles for the word count example. I see there aren't any changes to the Dockerfiles, so maybe something broke in there? Those examples worked fine with version 0.1.3.

@soumyasmruti
Author

I have very similar output logs to @cjmakes.

@anandswaminathan
Contributor

You can modify the log level by setting the logger section in the config map:

logger:
  show-source: true
  level: 5

@cjmakes

cjmakes commented Aug 7, 2019

Both jm and tm pods are running, and both jm and tm deployments are available.

@anandswaminathan
Contributor

@cjmakes Can you share the output similar to application.log so that I can debug why the equals check is failing.

@cjmakes

cjmakes commented Aug 7, 2019

deployments.log
pods.log

@anandswaminathan
Contributor

I am debugging this.

Here is the place that throws the error - https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flink/flink.go#L341

@cjmakes

cjmakes commented Aug 7, 2019

Have you been able to replicate the error? I'm currently going through the code and I suspect it has something to do with the Controller.DeploymentMatches logic at flink.go:158.

@anandswaminathan
Contributor

@cjmakes Nope. I am not able to replicate it.

You can also see our integration tests here: https://travis-ci.org/lyft/flinkk8soperator/builds/568614007

Also tried https://github.com/lyft/flinkk8soperator/tree/master/integ and https://github.com/lyft/flinkk8soperator/blob/master/docs/quick-start-guide.md on the latest image - docker.io/lyft/flinkk8soperator:500fe6bd40da8efca4a48bbb1104896be2c1fae8

@anandswaminathan
Contributor

Looks like apiequality.Semantic.DeepEqual on the volumes is returning false.

@cjmakes

cjmakes commented Aug 7, 2019

Thank you for your time tracking this down!

What could cause this/what would be the solution?

Can I ask how you determined that? I'm trying to hone my Kubernetes debugging skills.

@anandswaminathan
Contributor

anandswaminathan commented Aug 7, 2019

I have a PR here: #71 (But not sure if I want to merge)

@cjmakes I synced with @soumyasmruti (over email), and he shared his application.yaml. Then I ran the operator for his application with a few added log lines.

@soumyasmruti You can fix this issue by setting emptyDir explicitly:

      volumes:
      - emptyDir: {}
        name: data-dir

@cjmakes Similarly, in your case, can you add this? Set defaultMode explicitly:

  volumes:
    - configMap:
        defaultMode: 420
        name: good-generator-configmap
      name: config-vol

I tried your applications with this, and they proceeded. Let me know if it works for you.

@cjmakes

cjmakes commented Aug 8, 2019

@anandswaminathan, thank you so much for your help with this; that has fixed my issue!

@anandswaminathan
Contributor

anandswaminathan commented Aug 12, 2019

@soumyasmruti @cjmakes

I have merged the PR: #71. It removes the extra equality checks.

Use the latest operator image: docker.io/lyft/flinkk8soperator:d8cbd7481943739740947f5adbd7debd2c0ebd1c

You no longer need to follow the temporary fix I mentioned above - #69 (comment). Your original yaml should work now.

Please try and let me know.

@soumyasmruti
Author

Thanks @anandswaminathan

@anandswaminathan
Contributor

@soumyasmruti Closing this issue.
