
Training failed. Failed to create trained image after successful training run #366

Open
balmoral opened this issue Oct 6, 2024 · 1 comment


balmoral commented Oct 6, 2024

I'm fine-tuning sakemin/musicgen-fine-tuner:bc57274e.

Same error as #308, which was closed without resolution.

Error message after otherwise successful execution: "Training failed. Failed to create trained image after successful training run"

Last line of logs is "Executor: All workers completed successfully"

Training has completed successfully many times in the past. The failure may be related to later versions of sakemin/musicgen-fine-tuner.

Would appreciate any suggestions or assistance.

Thanks!


balmoral commented Oct 7, 2024

Here's some info which I hope helps.

It appears the training destination is being lost after the request is sent to the Replicate server, and I suspect this is what causes the failure to create the trained image.

I am using a local clone of the replicate-python library for debugging.

In "training.py" I have a trace to see what is sent to the server as the training request body. Output is:

body={'input': {'dataset_path': 'https://replicate.delivery/pbxt/LdAeyWGSwFjAUV1DmFcrbXoAYZTHEFZoUsoSqBUbxwfUFvuV/9a4c6d7d2137baa2ef54ba859e5ed6ac.zip', 'one_same_description': False, 'auto_labelling': False, 'drop_vocals': True, 'model_version': 'melody', 'lr': 0.01, 'epochs': 1, 'updates_per_epoch': 10, 'batch_size': 8}, 'destination': 'balmoral/66e7855017380a888168a6bb'} 

So destination is definitely being sent.

And the model exists (checked before creating the training in my code, and also visually checked on web).
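As a cheap local sanity check before looking up the model, the destination string can be validated against the `owner/name` shape that the traced value above follows. This is only a sketch: `split_destination` is a hypothetical helper, not part of replicate-python, and it checks the shape of the string only, not whether the model exists on the server.

```python
from typing import Tuple


def split_destination(destination: str) -> Tuple[str, str]:
    """Split an 'owner/name' destination string into its two parts.

    Hypothetical helper for illustration; raises ValueError when the
    string does not have exactly one '/' with text on both sides.
    """
    owner, sep, name = destination.partition("/")
    if not sep or not owner or not name or "/" in name:
        raise ValueError(f"expected 'owner/name', got {destination!r}")
    return owner, name


# Destination value taken from the traced request body.
print(split_destination("balmoral/66e7855017380a888168a6bb"))
```

The destination in the trace passes this check, which is consistent with the problem being on the server side rather than a malformed value.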

When retrieving the training in our code, the destination is None.

The JSON reported for the training on the web at https://replicate.com/p/1nyf5jmwm1rgp0cjckjssqn8cw is also missing the destination:

{
  "completed_at": "2024-10-07T03:01:07.459319Z",
  "created_at": "2024-10-07T02:53:42.432000Z",
  "data_removed": false,
  "error": "Failed to create trained image after successful training run.",
  "id": "1nyf5jmwm1rgp0cjckjssqn8cw",
  "input": {
    "lr": 0.01,
    "epochs": 1,
    "batch_size": 8,
    "drop_vocals": true,
    "dataset_path": "https://replicate.delivery/pbxt/LdAeyWGSwFjAUV1DmFcrbXoAYZTHEFZoUsoSqBUbxwfUFvuV/9a4c6d7d2137baa2ef54ba859e5ed6ac.zip",
    "model_version": "melody",
    "auto_labelling": false,
    "updates_per_epoch": 10,
    "one_same_description": false
  },
  "logs": ...,
  "metrics": {
    "predict_time": 353.364993846,
    "total_time": 445.027319
  },
  "output": null,
  "started_at": "2024-10-07T02:55:14.094325Z",
  "status": "failed",
  "urls": {
    "get": "https://api.replicate.com/v1/trainings/1nyf5jmwm1rgp0cjckjssqn8cw",
    "cancel": "https://api.replicate.com/v1/trainings/1nyf5jmwm1rgp0cjckjssqn8cw/cancel"
  },
  "version": "8d02c56b9a3d69abd2f1d6cc1a65027de5bfef7f0d34bd23e0624ecabb65acac",
  "_extras": {
    "api_token_name": "AiMuse2",
    "created_by": {
      "kind": "user",
      "url": "https://replicate.com/balmoral",
      "username": "balmoral"
    },
    "deployment": null,
    "hardware": "8x A40 (Large)",
    "input_files": [],
    "is_immutable": false,
    "is_shared": false,
    "is_waiting_for_boot": null,
    "may_have_sensitive_output": false,
    "official_model": null,
    "output_files": [],
    "source": "api"
  }
}
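The comparison between the traced request body and the reported JSON can be automated with a small diff. The sketch below is a minimal illustration, assuming both are available as plain dicts; `dropped_fields` is a hypothetical helper, and the `input` dicts are trimmed to two fields to keep the example short.

```python
from typing import Any, List, Mapping


def dropped_fields(sent: Mapping[str, Any], reported: Mapping[str, Any]) -> List[str]:
    """Return top-level keys that were sent but are absent or null in the reported JSON."""
    return sorted(key for key in sent if reported.get(key) is None)


# Request body as traced in training.py (input trimmed for brevity).
sent = {
    "input": {"model_version": "melody", "epochs": 1},
    "destination": "balmoral/66e7855017380a888168a6bb",
}

# Training JSON as reported on the web (trimmed to a few fields).
reported = {
    "id": "1nyf5jmwm1rgp0cjckjssqn8cw",
    "input": {"model_version": "melody", "epochs": 1},
    "status": "failed",
    "output": None,
}

print(dropped_fields(sent, reported))  # prints ['destination']
```

With the values from this report, the only top-level field that does not survive the round trip is `destination`.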

Thanks for your assistance
