[ENH] Force-delete for bad builds #482

iameskild · 2023-05-05T18:38:05Z

Occasionally builds wind up in a bad state and need to be deleted. At the moment, there is no convenient way of doing this other than deleting the build from the DB directly (AFAIK).

As an example, build 155 has been running for a few weeks now:

conda-store version: 0.4.14

wroddenMSS · 2023-08-23T20:54:58Z

I've experienced something similar. Builds will get stuck for days and any new builds will be listed as "Queued" while they are stuck. Has there been any progress on a fix for this?

An example build that regularly gets stuck is attached. We are using conda-store v0.4.14.
tensorflow_build.txt

iameskild · 2023-08-23T23:17:44Z

Hi @wroddenMSS, while I don't have a long-term fix, I can share a possible workaround. My hypothesis is that the database is in a bad state somehow and we need to manually delete the tainted records; to do this, you will need access to the underlying PostgreSQL database.

On your local machine, that means connecting to it via psql (or a database GUI). If conda-store is running on a Nebari cluster, you will need access to the underlying Kubernetes API (via kubectl).

In either case, next you will need the database credentials to login.

Locally, these are in a config.json/config.yaml or similar. On Nebari, you can get these secrets with this command:

kubectl get secrets -n dev conda-store-secret -o jsonpath='{.data.config\.json}'
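
Kubernetes returns Secret values base64-encoded, so the command above prints an encoded string rather than readable credentials. Below is a minimal sketch for decoding it. It assumes the decoded payload is a JSON object (suggested by the `config.json` key name, but not confirmed in this thread). The function name `decode_secret_value` is made up for illustration.

```python
import base64
import json


def decode_secret_value(encoded: str) -> dict:
    """Decode a base64 string such as the one printed by the kubectl command.

    Assumption: the decoded payload is a JSON object.
    """
    raw = base64.b64decode(encoded.strip())
    return json.loads(raw.decode("utf-8"))


if __name__ == "__main__":
    # Stand-in value for demonstration only; the real value comes from kubectl.
    sample = base64.b64encode(b'{"example-key": "example-value"}').decode("ascii")
    print(decode_secret_value(sample))
```

If the payload turns out not to be JSON, `base64.b64decode` alone is enough to recover the raw text.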

From there you can kubectl exec to connect to the database mounted to the nebari-conda-store-server-xxxx pod.

CAUTION: when modifying the database, please use caution, as it could break irreconcilably!

Once connected to the database, you can try and delete the build that is causing trouble; here's more detail on the conda-store database.

cc @pierrotsmnrd @dharhas

costrouc · 2023-08-24T01:10:23Z

So, a few things. A build lingering in the BUILDING state doesn't mean that it is still building. It happens when the worker is interrupted and does not update the database when the build fails.

I've merged a PR, #530, which is now in main, to mark these builds as FAILED; that is the proper fix for this. Once a build is in the FAILED state it can be deleted, and a force delete is not necessary.
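
To illustrate the idea of moving stale BUILDING records to FAILED, here is a rough sketch against an in-memory SQLite database. Everything in it is an assumption made for illustration: the `build` table layout, the `started_on` column, the age-based cutoff, and the function name `mark_stale_builds_failed`. It is not conda-store's actual schema and not the logic implemented in #530.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def mark_stale_builds_failed(conn, max_age, now=None):
    """Set status to FAILED for builds that have been BUILDING longer than max_age.

    Hypothetical schema: build(id, status, started_on as ISO-8601 UTC text).
    Returns the number of rows updated.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - max_age).isoformat()
    cur = conn.execute(
        "UPDATE build SET status = 'FAILED' "
        "WHERE status = 'BUILDING' AND started_on < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE build (id INTEGER PRIMARY KEY, status TEXT, started_on TEXT)"
    )
    now = datetime(2023, 8, 24, tzinfo=timezone.utc)
    conn.executemany(
        "INSERT INTO build VALUES (?, ?, ?)",
        [
            # One build stuck for weeks, one that started minutes ago.
            (155, "BUILDING", (now - timedelta(weeks=3)).isoformat()),
            (156, "BUILDING", (now - timedelta(minutes=5)).isoformat()),
        ],
    )
    print(mark_stale_builds_failed(conn, timedelta(days=1), now=now))
```

The string comparison on `started_on` only works because every timestamp is written in the same ISO-8601 UTC format; a real database would use a proper timestamp column.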

costrouc · 2023-08-24T01:18:40Z

With the upcoming release of conda-store the following issues should help with this issue:

pavithraes · 2023-08-30T12:34:35Z

I'll close this as complete because it was addressed by #530, and we have #306 tracking #531. :)

Thanks to everyone who participated in this discussion & contribution. :)

iameskild added the type: enhancement 💅🏼 label May 5, 2023

trallard added needs: discussion 💬 area: user experience 👩🏻‍💻 Items impacting the end-user experience labels Jul 21, 2023

pavithraes closed this as completed Aug 30, 2023
