Flink Application failing with hash collision message #69
Comments
+1. I have been seeing this for around a week. I have been trying to debug it but haven't gotten any further than reading through pkg/controller/flink/flink.go:341. My current understanding is that when making a deployment, the operator runs through a list of currently deployed applications and hashes them to make sure that it doesn't deploy the same application twice. My current (baseless) guess is that the application being deployed is added to the list before the hash check happens. This is only a guess; if someone from the team could either confirm or correct me, that would be awesome. |
There are two cases for this to happen
This can happen in the following cases, but only when the application is deployed for the first time ever. If a job is already running, updates to an existing application will move the application to the DeployFailed state.
As the message states, to get around the issue you will have to delete and recreate the application resource. There is no real Flink job or application underneath when the application is in the "ClusterStarting" state. Let me know if you performed any of the above operations. |
I was hitting this before the upgrade to beta
This is on the first deploy, when there is nothing running. There doesn't seem to be a problem with the underlying container. I can get to the Flink GUI and even submit a job.
I haven't upgraded the operator with any applications deployed.
I have tried deleting the application, reinstalling the operator, even starting a new cluster. Do you have any tips on how we could go about debugging this? Enabling debug logging in the operator or a guide to writing tests to try and expose this? |
I have not hit the issue, so I am just listing cases where this can happen. It happened once in my local setup, where I had multiple operators running. My previous reply was aimed more at the ClusterStarting issue that @soumyasmruti indicated. Can you provide more information
Can you also verify that you are not running multiple versions of the operator? |
I've attached logs of the JobManager, TaskManager and Operator. The application is stuck in CreatingCluster. I am also attaching the description of the application. Confirmed there is only 1 instance of 1 version of the operator running. (Attachment: jm.log) How can I set the operator log level to debug mode? |
@cjmakes Can you share the output of the deployment and pod status? |
@anandswaminathan What I did was pull the latest changes from yesterday, build the Flink operator, and deploy it in my local setup. I found this error when I tried to run my applications, which were built from the Dockerfiles for the word count example. I see there aren't any changes to the Dockerfiles, so maybe something broke in there? Those examples worked fine with version 0.1.3. |
I have very similar output logs to @cjmakes |
You can modify the log level by setting
|
Both jm and tm pods are running and both jm and tm deployments are available. |
@cjmakes Can you share the output similar to |
I am debugging this. Here is the place that throws error - https://github.com/lyft/flinkk8soperator/blob/master/pkg/controller/flink/flink.go#L341 |
Have you been able to replicate the error? I'm currently going through the code and I suspect it has something to do with the |
@cjmakes Nope. I am not able to replicate it. You can also see our integration tests here: https://travis-ci.org/lyft/flinkk8soperator/builds/568614007 Also tried https://github.com/lyft/flinkk8soperator/tree/master/integ and https://github.com/lyft/flinkk8soperator/blob/master/docs/quick-start-guide.md on the latest image - docker.io/lyft/flinkk8soperator:500fe6bd40da8efca4a48bbb1104896be2c1fae8 |
Looks like the apiequality.Semantic.DeepEqual(volume) is returning false. |
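For readers following along, here is a minimal Go sketch of how a spec comparison like this can return false even though nothing was changed by the user. It is not the operator's code: the `Volume` and `Deployment` types, the `applyServerDefaults` and `isHashCollision` helpers, the deployment name, and the choice of `DefaultMode` as the differing field are all hypothetical, and it uses the standard library's `reflect.DeepEqual` rather than `apiequality.Semantic.DeepEqual`. The thread does not state which field differed in this case.

```go
package main

import (
	"fmt"
	"reflect"
)

// Volume is a hypothetical, simplified stand-in for a Kubernetes volume.
type Volume struct {
	Name        string
	SecretName  string
	DefaultMode *int32 // nil when the user's YAML omits the field
}

// Deployment is a hypothetical, simplified stand-in for a deployment.
type Deployment struct {
	Name    string // assumed to embed the application hash
	Volumes []Volume
}

// applyServerDefaults mimics an API server filling in an omitted field
// before the object is stored.
func applyServerDefaults(d Deployment) Deployment {
	out := Deployment{Name: d.Name, Volumes: make([]Volume, len(d.Volumes))}
	copy(out.Volumes, d.Volumes)
	for i := range out.Volumes {
		if out.Volumes[i].DefaultMode == nil {
			mode := int32(420) // 0644 in octal
			out.Volumes[i].DefaultMode = &mode
		}
	}
	return out
}

// isHashCollision reports whether a stored deployment carries the same
// hashed name as the desired one while its volumes compare as different.
func isHashCollision(desired, stored Deployment) bool {
	return desired.Name == stored.Name &&
		!reflect.DeepEqual(desired.Volumes, stored.Volumes)
}

func main() {
	desired := Deployment{
		Name:    "wordcount-abc123-jm", // hypothetical hashed name
		Volumes: []Volume{{Name: "creds", SecretName: "my-secret"}},
	}
	stored := applyServerDefaults(desired)

	// The stored copy gained a default, so the comparison fails.
	fmt.Println(isHashCollision(desired, stored)) // true

	// Setting the field explicitly makes both sides agree again.
	mode := int32(420)
	desired.Volumes[0].DefaultMode = &mode
	fmt.Println(isHashCollision(desired, stored)) // false
}
```

Under these assumptions, the "collision" is really the same application compared against its own defaulted copy, which would be consistent with the error appearing on a first deploy with nothing else running.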
Thank you for taking the time to track this down! What could cause this, and what would be the solution? Can I ask how you determined that? I'm trying to hone my Kubernetes debugging skills. |
I have a PR here: #71 (but I am not sure if I want to merge it). @cjmakes I synced with @soumyasmruti (over email), and he shared his application.yaml. Then I ran the operator for his application with a few added log lines. @soumyasmruti You can fix this issue by setting
@cjmakes Similarly in your case can you add this. Set
I tried your applications with this, and they proceeded. Let me know if it works for you. |
@anandswaminathan, thank you so much for your help with this; that has fixed my issue! |
I have merged the PR: #71 Use the latest operator image: docker.io/lyft/flinkk8soperator:d8cbd7481943739740947f5adbd7debd2c0ebd1c You do not need to follow the temporary fix I mentioned above - #69 (comment). Your original YAML should work now. Please try it and let me know. |
Thanks @anandswaminathan |
@soumyasmruti Closing this issue. |
The Flink application is failing with this message. What could be the reason, and how can I debug it?