
Flink Application failing with hash collision message #69

Closed
soumyasmruti opened this issue Aug 7, 2019 · 22 comments
@soumyasmruti

The Flink application is failing with this message. What could be the reason, and how can I debug it?

{"json":{"app_name":"streaming-job","ns":"flink-operator","phase":"ClusterStarting"},"level":"info","msg":"Handle state skipped for application, lastSeenError UnknownMethod call failed with status FAILED and message []: found hash collision for deployment, you must do a clean deploy","ts":"2019-08-07T04:21:52Z"}
@cjmakes

cjmakes commented Aug 7, 2019

+1. Have been seeing this for around a week. I have been trying to debug but haven't been able to get anywhere further than reading through pkg/controller/flink/flink.go:341.

My current understanding is that when making a deployment, the operator runs through a list of currently deployed applications and hashes them to make sure it doesn't deploy the same application twice.

My current (baseless) guess is that the application being deployed is added to the list before the hash check happens. This is just a guess; if someone from the team could either confirm or correct me, that would be awesome.
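To make the discussion concrete, here is a simplified, hypothetical sketch (not the operator's actual code) of the kind of check being described: the operator derives a deterministic hash from the application spec, labels the deployment with it, and reports a "hash collision" when a live deployment carries the expected hash but its contents no longer match what the operator would generate now.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// specHash derives a short deterministic hash from an application spec.
// This is an illustrative stand-in for the operator's real hash function.
func specHash(spec string) string {
	h := fnv.New32a()
	h.Write([]byte(spec))
	return fmt.Sprintf("%08x", h.Sum32())
}

// detectCollision sketches the failure mode: a live deployment carries the
// same hash label as the desired spec, yet its contents differ from what the
// operator would generate, so it is flagged as a collision requiring a clean
// deploy.
func detectCollision(liveHashLabel, liveContents, desiredSpec string) bool {
	return liveHashLabel == specHash(desiredSpec) && liveContents != desiredSpec
}

func main() {
	desired := "app-spec-v1"
	// The live deployment was created from the same spec, but its stored
	// contents were mutated after creation (e.g. defaults filled in), so the
	// equality check fails even though the hash label matches.
	fmt.Println(detectCollision(specHash(desired), "app-spec-v1+defaults", desired)) // true
}
```

As the later comments in this thread show, the mutation in question turned out to be the Kubernetes API server filling in defaulted fields on the live objects.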

@anandswaminathan
Copy link
Contributor

anandswaminathan commented Aug 7, 2019

@soumyasmruti @cjmakes

There are two cases where this can happen:

  • We recently published a bunch of backward-incompatible changes (as we moved to Beta). This includes modifying the hash function to fix several issues. You would hit this issue if the hash function in the operator changes while the application is in a non-Running/DeployFailed state.

This can happen in the following cases, but only when the application is deployed for the first time ever. If there is already a job running, updates to an existing application will move the application to the DeployFailed state.

  1. The application is updated/modified while it is stuck and unable to progress (due to underlying issues such as the pod/container not coming up or crashing).
  2. A newer version of the operator (with a modified hash function) is deployed while the application is in a non-running state.

As the message states, to get around the issue, you will have to delete and recreate the application resource. There is no real Flink job or application underneath when the application is in "ClusterStarting" state.

Let me know if you guys performed any of the above operations ^

@cjmakes

cjmakes commented Aug 7, 2019

We recently published a bunch of backward incompatible changes (as we moved to Beta)

I was hitting this before the upgrade to beta.

The application is updated/modified when it is stuck and unable to progress (due to underlying issues pod/container not coming up or crashing)

This is on the first deploy when there is nothing running. There doesn't seem to be a problem with the underlying container. I can get to the flink gui and even submit a job.

A newer version of the operator being deployed (with a modified hash function) when the application is currently in non running state.

I haven't upgraded the operator with any applications deployed.

As the message states, to get around the issue, you will have to delete and recreate the application resource

I have tried deleting the application, reinstalling the operator, even starting a new cluster.

Do you have any tips on how we could go about debugging this? Enabling debug logging in the operator or a guide to writing tests to try and expose this?

@anandswaminathan
Contributor

anandswaminathan commented Aug 7, 2019

@cjmakes

I have not hit the issue, so I am just listing cases where this can happen. It happened once in my local setup where I had multiple operators running. My previous reply was more towards the issue of ClusterStarting that @soumyasmruti indicated.

Can you provide more information:

  • The logs
  • The phase where the application is stuck
  • Output of the flink application status, pods, deployments.

Can you also verify that you are not running multiple versions of the operator?

@cjmakes

cjmakes commented Aug 7, 2019

I've attached logs of the JobManager, TaskManager, and Operator. The application is stuck in CreatingCluster. Also attaching the description of the application.

Confirmed there is only 1 instance of 1 version of the operator running.

jm.log
tm.log
operator.log
application.log

How can I set the operator log level to debug mode?

@anandswaminathan
Contributor

@cjmakes Can you share the output of the deployment and pod status?

@soumyasmruti
Author

@anandswaminathan What I did was pull the latest changes yesterday, build the flink operator, and deploy it locally. I found this error when I tried to run my applications, which were built on the Dockerfiles for the word count example. I see there aren't any changes to the Dockerfiles, so maybe something broke in there? Those examples worked fine with version 0.1.3.

@soumyasmruti
Author

I have very similar output logs to @cjmakes.

@anandswaminathan
Contributor

You can modify the log level by setting the logger section in the config map:

logger:
  show-source: true
  level: 5

@cjmakes

cjmakes commented Aug 7, 2019

Both jm and tm pods are running, and both jm and tm deployments are available.

@anandswaminathan
Contributor

@cjmakes Can you share the output similar to application.log so that I can debug why the equals check is failing.

@cjmakes

cjmakes commented Aug 7, 2019

deployments.log
pods.log

@anandswaminathan
Contributor

I am debugging this.

Here is the place that throws the error - https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flink/flink.go#L341

@cjmakes

cjmakes commented Aug 7, 2019

Have you been able to replicate the error? I'm currently going through the code and I suspect it has something to do with the Controller.DeploymentMatches logic at flink.go:158.

@anandswaminathan
Contributor

@cjmakes Nope. I am not able to replicate it.

You can also see our integration tests here: https://travis-ci.org/lyft/flinkk8soperator/builds/568614007

Also tried https://github.com/lyft/flinkk8soperator/tree/master/integ and https://github.com/lyft/flinkk8soperator/blob/master/docs/quick-start-guide.md on the latest image - docker.io/lyft/flinkk8soperator:500fe6bd40da8efca4a48bbb1104896be2c1fae8

@anandswaminathan
Contributor

Looks like apiequality.Semantic.DeepEqual on the volumes is returning false.

@cjmakes

cjmakes commented Aug 7, 2019

Thank you for your time tracking this down!

What could cause this/what would be the solution?

Can I ask how you determined that? I'm trying to hone my Kubernetes debugging skills.

@anandswaminathan
Contributor

anandswaminathan commented Aug 7, 2019

I have a PR here: #71 (But not sure if I want to merge)

@cjmakes I synced with @soumyasmruti (over email), and he shared his application.yaml. Then I ran the operator for his application with a few added log lines.

@soumyasmruti You can fix this issue by setting emptyDir explicitly:

      volumes:
      - emptyDir: {}
        name: data-dir

@cjmakes Similarly, in your case, can you add this? Set defaultMode explicitly:

  volumes:
    - configMap:
        defaultMode: 420
        name: good-generator-configmap
      name: config-vol

I tried your applications with this, and they proceeded. Let me know if it works for you.

@cjmakes

cjmakes commented Aug 8, 2019

@anandswaminathan, thank you so much for your help with this; that has fixed my issue!

@anandswaminathan
Contributor

anandswaminathan commented Aug 12, 2019

@soumyasmruti @cjmakes

I have merged the PR: #71. It removes the extra equality checks.

Use the latest operator image: docker.io/lyft/flinkk8soperator:d8cbd7481943739740947f5adbd7debd2c0ebd1c

You no longer need to follow the temporary fix I mentioned above - #69 (comment). Your original yaml should work now.

Please try and let me know.

@soumyasmruti
Author

Thanks @anandswaminathan

@anandswaminathan
Contributor

@soumyasmruti Closing this issue.
