[ENH] Force-delete for bad builds #482

Closed
iameskild opened this issue May 5, 2023 · 5 comments

Comments

@iameskild
Contributor

iameskild commented May 5, 2023

Occasionally builds wind up in a bad state and need to be deleted. At the moment, there is no convenient way of doing this other than deleting the build from the DB directly (AFAIK).

As an example, build 155 has been running for a few weeks now:

[Screenshot 2023-05-05 at 14 37 08]


conda-store version: 0.4.14
@wroddenMSS

I've experienced something similar. Builds will get stuck for days and any new builds will be listed as "Queued" while they are stuck. Has there been any progress on a fix for this?

An example build that has been getting stuck regularly is attached. We are using conda-store v0.4.14.
tensorflow_build.txt

@iameskild
Contributor Author

Hi @wroddenMSS, while I don't have a long-term fix, I can share a possible workaround. My hypothesis is that the database is in a bad state somehow and we need to manually delete the tainted records; to do this, you will need access to the underlying PostgreSQL database.

On your local machine, that means you can connect to it via psql (or a database GUI). If conda-store is running on a Nebari cluster, then you will need access to the underlying Kubernetes API (via kubectl).

In either case, you will next need the database credentials to log in.

Locally, these are in a config.json/config.yaml or similar. On Nebari, you can get these secrets with this command:

kubectl get secrets -n dev conda-store-secret -o jsonpath='{.data.config\.json}'
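
Note that values under `.data` in a Kubernetes secret are base64-encoded, so the output of the command above has to be decoded before the credentials are readable. A minimal Python sketch of that step, assuming the decoded value is JSON (as the `config.json` key name suggests); the helper name `decode_secret_value` and the sample payload are made up for illustration and are not real conda-store configuration keys:

```python
import base64
import json


def decode_secret_value(encoded: str) -> dict:
    """Decode one base64-encoded value from a Kubernetes secret and parse it as JSON."""
    raw = base64.b64decode(encoded)
    return json.loads(raw)


# Made-up stand-in for the string printed by the kubectl command.
sample = base64.b64encode(
    json.dumps({"example-key": "example-value"}).encode()
).decode()

config = decode_secret_value(sample)
print(config)
```

In practice you would paste the string printed by kubectl in place of `sample`.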

From there, you can use kubectl exec to connect to the database mounted to the nebari-conda-store-server-xxxx pod.

CAUTION: when modifying the database, please use caution, as it could break irreparably!

Once connected to the database, you can try and delete the build that is causing trouble; here's more detail on the conda-store database.
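
Before deleting anything, it may help to list which builds look stuck with a read-only query. The sketch below runs against an in-memory SQLite database so it can be tried safely; the `build` table and its `id`, `status`, and `started_on` columns are a simplified, hypothetical stand-in, and the real conda-store schema on PostgreSQL may differ:

```python
import sqlite3
from datetime import datetime, timedelta

# Hypothetical, simplified table. NOT the actual conda-store schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE build (id INTEGER PRIMARY KEY, status TEXT, started_on TEXT)"
)

now = datetime(2023, 5, 5, 14, 0, 0)
rows = [
    (154, "COMPLETED", (now - timedelta(days=30)).isoformat()),
    (155, "BUILDING", (now - timedelta(days=21)).isoformat()),  # stuck for weeks
    (156, "QUEUED", None),                                      # not started yet
    (157, "BUILDING", (now - timedelta(minutes=5)).isoformat()),
]
conn.executemany("INSERT INTO build VALUES (?, ?, ?)", rows)


def find_stuck_builds(conn, now, max_age=timedelta(days=1)):
    """Return ids of builds that have been in BUILDING longer than max_age."""
    cutoff = (now - max_age).isoformat()
    cur = conn.execute(
        "SELECT id FROM build "
        "WHERE status = 'BUILDING' AND started_on < ? ORDER BY id",
        (cutoff,),
    )
    return [row[0] for row in cur]


stuck = find_stuck_builds(conn, now)
print(stuck)
```

The timestamps are stored as ISO 8601 strings of identical format, so comparing them as text orders them chronologically. The one-day threshold is an arbitrary choice for the example.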

cc @pierrotsmnrd @dharhas

@costrouc
Member

So, a few things: a build lingering in the BUILDING state doesn't mean that the build is still building. It happens when the worker is interrupted and does not update the database when the build fails.

I've merged a PR, #530, which is now in main and marks these builds as FAILED; that is the proper fix for this. Once a build is in the FAILED state, it is possible to delete it, so a force delete is not necessary.
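
To illustrate the idea (this is not the code from #530), here is a small self-contained Python sketch: builds recorded as BUILDING that no worker is actually processing get flipped to FAILED, after which they can be deleted. The `Build` class, the `active_ids` argument, the helper names, and the rule that only finished builds are deletable are all assumptions made for this sketch:

```python
from dataclasses import dataclass
from enum import Enum


class BuildStatus(Enum):
    QUEUED = "QUEUED"
    BUILDING = "BUILDING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Build:
    id: int
    status: BuildStatus


def mark_interrupted_as_failed(builds, active_ids):
    """Flip BUILDING builds that no worker is processing to FAILED; return their ids."""
    changed = []
    for build in builds:
        if build.status is BuildStatus.BUILDING and build.id not in active_ids:
            build.status = BuildStatus.FAILED
            changed.append(build.id)
    return changed


def can_delete(build):
    # Assumption for this sketch: only finished builds may be deleted.
    return build.status in (BuildStatus.COMPLETED, BuildStatus.FAILED)


builds = [Build(155, BuildStatus.BUILDING), Build(157, BuildStatus.BUILDING)]
deletable_before = can_delete(builds[0])

# Pretend a worker is still actively processing build 157 only.
changed = mark_interrupted_as_failed(builds, active_ids={157})
print(deletable_before, changed, can_delete(builds[0]), can_delete(builds[1]))
```

Here build 155 starts out undeletable, is marked FAILED because no worker owns it, and then becomes deletable, while build 157 is left alone.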

@costrouc
Member

With the upcoming release of conda-store the following issues should help with this issue:

@pavithraes
Member

I'll close this as complete because it was addressed by #530, and we have #306 tracking #531. :)

Thanks to everyone who participated in this discussion & contribution. :)


5 participants