seldon-controller-manager crashing #2066

Closed
akshatgit opened this issue Jul 2, 2020 · 7 comments
Labels
bug triage Needs to be triaged and prioritised accordingly

Comments

@akshatgit

Describe the bug

We are running Seldon with Kubeflow on GCP. It was working fine, but for the past 2 days the controller has been crashing. The pod goes into CrashLoopBackOff.

Logs from seldon-controller-manager:

`E0702 17:48:38.000449 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:76
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:65
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:522
/usr/local/go/src/runtime/panic.go:82
/usr/local/go/src/runtime/signal_unix.go:390
/workspace/controllers/seldondeployment_prepackaged_servers.go:255
/workspace/controllers/seldondeployment_prepackaged_servers.go:255
/workspace/controllers/seldondeployment_controller.go:452
/workspace/controllers/seldondeployment_controller.go:1350
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1337
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x11bcfcb]

goroutine 358 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/runtime/runtime.go:58 +0x105
panic(0x130e6a0, 0x21c4930)
/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/seldonio/seldon-core/operator/apis/machinelearning/v1.IsPrepack(...)
/workspace/apis/machinelearning/v1/seldondeployment_webhook.go:125
github.com/seldonio/seldon-core/operator/controllers.createStandaloneModelServers(0xc0000d6180, 0xc000b23860, 0xc000d6a6c0, 0xc000978300, 0xc000296fa0, 0xc0001232b0, 0x0)
/workspace/controllers/seldondeployment_prepackaged_servers.go:255 +0x1cb
github.com/seldonio/seldon-core/operator/controllers.createComponents(0xc0000d6180, 0xc000b23860, 0x1698ce0, 0xc00092d5a0, 0x21f7760, 0xc000902f50, 0x8)
/workspace/controllers/seldondeployment_controller.go:452 +0x15a4
github.com/seldonio/seldon-core/operator/controllers.(*SeldonDeploymentReconciler).Reconcile(0xc0000d6180, 0xc000902f50, 0x8, 0xc000882080, 0x16, 0x21d8ce0, 0x42bd21, 0x166c700, 0xc000b21d88)
/workspace/controllers/seldondeployment_controller.go:1350 +0x4ae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00068e000, 0x1358a80, 0xc00050c0c0, 0x1358a00)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:216 +0x149
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00068e000, 0xc0004d0800)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192 +0xb5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker(0xc00068e000)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171 +0x2b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0003c44d0)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152 +0x54
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0003c44d0, 0x3b9aca00, 0x0, 0xc000260f01, 0xc000020000)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0003c44d0, 0x3b9aca00, 0xc000020000)
/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88 +0x4d
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:157 +0x311`

To reproduce

  1. We deployed a PyTorch model with the Python wrapper.

Expected behaviour

The model was supposed to be deployed with seldon-core-microservice.

Environment

  • Cloud Provider: GCP
  • Kubernetes Cluster Version [Output of kubectl version]:
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.0", GitCommit:"9e991415386e4cf155a24b1da15becaa390438d8", GitTreeState:"clean", BuildDate:"2020-03-26T06:17:09Z", GoVersion:"go1.14", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.10-gke.37", GitCommit:"34a615f32e9a0c9e97cdb9f749adb392758349a6", GitTreeState:"clean", BuildDate:"2020-04-10T21:17:01Z", GoVersion:"go1.12.12b4", Compiler:"gc", Platform:"linux/amd64"}
  • Deployed Seldon System Images: [Output of kubectl get --namespace seldon-system deploy seldon-controller-manager -o yaml | grep seldonio]:
docker.io/seldonio/seldon-core-operator:1.0.1
@akshatgit akshatgit added bug triage Needs to be triaged and prioritised accordingly labels Jul 2, 2020
@ukclivecox
Contributor

Did this occur when you deployed a particular model? It may be an issue that the webhooks are not being called.
Are you able to update the kfdef to the one in the manifests v1.1 branch for Seldon, which is the next release?
https://github.com/kubeflow/manifests/tree/v1.1-branch/seldon
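
For readers following along, here is a minimal Go sketch of how a skipped defaulting webhook could surface as the nil pointer dereference in the log above. It is not the operator's actual code: the `PredictiveUnit` struct is a simplified, hypothetical stand-in, the helper names `isPrepackUnsafe`, `isPrepackSafe`, and `panics` are invented for illustration, and the assumption that a webhook normally populates `Implementation` is inferred from this thread rather than confirmed.

```go
package main

import "fmt"

// PredictiveUnit is a hypothetical, simplified stand-in for a graph node in a
// SeldonDeployment. Implementation is a pointer so that "unset" differs from
// "empty"; this sketch ASSUMES a defaulting webhook normally fills it in.
type PredictiveUnit struct {
	Name           string
	Implementation *string
}

// isPrepackUnsafe dereferences Implementation without checking it, so it
// panics with a nil pointer dereference when defaulting was skipped.
func isPrepackUnsafe(pu *PredictiveUnit) bool {
	return *pu.Implementation != "" && *pu.Implementation != "UNKNOWN_IMPLEMENTATION"
}

// isPrepackSafe treats a missing Implementation as "not a prepackaged server"
// instead of crashing the caller.
func isPrepackSafe(pu *PredictiveUnit) bool {
	if pu == nil || pu.Implementation == nil {
		return false
	}
	return *pu.Implementation != "" && *pu.Implementation != "UNKNOWN_IMPLEMENTATION"
}

// panics reports whether calling f panics.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	impl := "SKLEARN_SERVER" // illustrative value only
	defaulted := &PredictiveUnit{Name: "classifier", Implementation: &impl}
	undefaulted := &PredictiveUnit{Name: "classifier"} // webhook never ran

	fmt.Println(isPrepackSafe(defaulted))
	fmt.Println(isPrepackSafe(undefaulted))
	fmt.Println(panics(func() { isPrepackUnsafe(undefaulted) }))
}
```

Run as is, this prints `true`, `false`, `true`: the guarded check tolerates the undefaulted resource, while the unguarded one panics. In a controller, an unguarded dereference like this would be hit again on every restart for as long as the offending resource exists, which would be consistent with a crash loop.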

@ukclivecox
Contributor

Can you show the YAML of your resource? You could also try changing the apiVersion of the SeldonDeployment to v1:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment

@akshatgit
Author

My resource definition for the model already uses the following:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment

I am not sure, but I think this happened when I tried to deploy a model. After that, I tried restarting seldon-controller-manager. It started logging the following errors for old Seldon deployments and then hit the segmentation violation again:

github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:192
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:171
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:88

@ukclivecox
Contributor

It looks like it's not calling the webhook for the model. Have you followed: https://www.kubeflow.org/docs/components/serving/seldon/

In particular, have you done this for your namespace:

kubectl label namespace my-namespace serving.kubeflow.org/inferenceservice=enabled

@akshatgit
Author

akshatgit commented Jul 2, 2020

Before this issue, I was able to deploy the test deployment given in the doc. Should I re-deploy seldon?

@akshatgit akshatgit reopened this Jul 2, 2020
@akshatgit
Author

OK, so the issue is solved now. In my setup, Kubeflow lives in the 'kubeflow' namespace and Seldon deploys models into the 'models' namespace, but at one point I had tried deploying a Seldon model in the 'kubeflow' namespace. I deleted that sdep and restarted, and it worked. However, this is just my theory. @cliveseldon you may close this issue if you think this is correct. Thanks a lot for your help!

@ukclivecox
Contributor

Yes, you can't deploy in the kubeflow namespace.
Glad it's working!
