
RuntimeError! #219

Open
SomnusQue opened this issue Jan 21, 2024 · 12 comments

Comments

@SomnusQue

SomnusQue commented Jan 21, 2024

I ran auto_100weight_inherit_100to75.sh and hit this problem. I think I have prepared everything for this project, but there are still some problems I can't solve. Could somebody please help me?

@SomnusQue
Author

[Screenshot of the error]

@wkcn
Contributor

wkcn commented Jan 21, 2024

Hi @SomnusQue , thanks for your attention to our work!

Are you using the latest TinyCLIP code?

This is a bug that is triggered on PyTorch 2.x.
We fixed it by adding this line: https://github.com/microsoft/Cream/blob/main/TinyCLIP/src/open_clip/model.py#L28

import functools
from torch.utils.checkpoint import checkpoint

checkpoint = functools.partial(checkpoint, use_reentrant=False)
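
For readers unfamiliar with the pattern: functools.partial returns a callable with some arguments pre-filled, so rebinding the name `checkpoint` makes every later call in the module pass use_reentrant=False without touching each call site. A minimal sketch of that mechanism, using a hypothetical stand-in instead of torch.utils.checkpoint.checkpoint so it runs without PyTorch:

```python
import functools

# Hypothetical stand-in for torch.utils.checkpoint.checkpoint: it only
# calls the function and reports which use_reentrant value it received.
def checkpoint(function, *args, use_reentrant=None):
    return {"result": function(*args), "use_reentrant": use_reentrant}

# Same pattern as the fix: rebind the name so every later call
# gets use_reentrant=False pre-filled.
checkpoint = functools.partial(checkpoint, use_reentrant=False)

def block(x):
    # Hypothetical layer: doubles its input.
    return 2 * x

out = checkpoint(block, 21)
print(out)  # {'result': 42, 'use_reentrant': False}
```

With the real function, the rebinding must run after the import and before the call sites execute; a call that passes use_reentrant=... explicitly would still override the pre-filled value.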

@SomnusQue
Author

OMG! The author answered my question! The code I have really doesn't have this line! Thanks for your patience!
But I'm wondering: when was the code updated?

@SomnusQue
Author

Also, is this loss normal?
[Screenshot of the training loss]

@wkcn
Contributor

wkcn commented Jan 21, 2024

@SomnusQue I fixed the bug on Jan. 11, 2024 (https://github.com/microsoft/Cream/pull/218/files#diff-2c756c8b8b99609dee1b59ce4dcfaf773aa9afbc84e093e03e3e0de653fa0124R28).

You can visualize the loss curve in wandb. The loss is normal as long as it is decreasing :)
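
The "decreasing" rule of thumb can be checked roughly on exported loss values by comparing the average of the first and last few logged steps. This is a heuristic sketch, not a criterion from the TinyCLIP authors; the loss values below are synthetic and the window size is an arbitrary choice:

```python
from statistics import mean

def is_decreasing(losses, window=10):
    """Rough trend check: is the mean of the last `window` losses
    lower than the mean of the first `window` losses?"""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    return mean(losses[-window:]) < mean(losses[:window])

# Synthetic, noisy but downward-trending losses (not from this issue).
losses = [5.0 - 0.03 * step + (0.2 if step % 2 else -0.2) for step in range(100)]
print(is_decreasing(losses))  # True
```

Averaging over a window keeps step-to-step noise from flipping the answer; a flat or rising curve returns False.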

@SomnusQue
Author

Thanks for your patience! I can't use wandb on our cluster (because it needs network access?), so I changed '--report-to wandb' to '--report-to tensorboard' in the .sh file. Is there anything else in the code that needs to be changed?

@wkcn
Contributor

wkcn commented Jan 21, 2024

@SomnusQue
No code change is required. Alternatively, you can set the environment variable WANDB_MODE=offline, and the wandb log will be saved to a local file. Then run wandb sync <file path> to upload the log.
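
A minimal sketch of that offline workflow in Python. It assumes offline runs are written to wandb's usual ./wandb/offline-run-* directories (check the path wandb prints at the end of the run); the helper names are hypothetical:

```python
import os
from pathlib import Path

def offline_env(base=None):
    """Copy of the environment (or of `base`) with WANDB_MODE=offline."""
    env = dict(os.environ if base is None else base)
    env["WANDB_MODE"] = "offline"
    return env

def sync_commands(wandb_dir="wandb"):
    """One `wandb sync <path>` command per offline run directory.

    Assumption: offline runs are stored as <wandb_dir>/offline-run-*.
    """
    runs = sorted(Path(wandb_dir).glob("offline-run-*"))
    return [["wandb", "sync", str(run)] for run in runs]
```

For example, on the cluster: `subprocess.run(["bash", "auto_100weight_inherit_100to75.sh"], env=offline_env(), check=True)`; later, on a machine with network access: `for cmd in sync_commands(): subprocess.run(cmd, check=True)`.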

@SomnusQue
Author

Sorry to bother you again...
[Screenshot of the TensorBoard results]
The results in TensorBoard look like something went wrong...
[Screenshot of the final training epoch]
This is the result of the final epoch of my training.

@SomnusQue
Author

[Screenshots of the bash script]
This is our bash file. Is there something wrong with it?

@wkcn
Contributor

wkcn commented Jan 22, 2024

Sorry, I have not tested TensorBoard yet.

The training data in the provided script is synthetic.
It should be replaced with the following arguments:

 --train-data <your yfcc_path or laion_path/> \
 --dataset-type webdataset \
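
The same change expressed as a Python helper that rewrites an argument list. This is a sketch assuming the original script passes --train-data and/or --dataset-type with exactly one value each (e.g. --dataset-type synthetic); the shard pattern and the --batch-size entry in the example are hypothetical placeholders:

```python
def use_webdataset(argv, train_data):
    """Return a new argv with --train-data and --dataset-type overridden.

    Any existing occurrence of either flag (and its single value) is dropped,
    then both flags are appended with the new values.
    """
    overrides = {"--train-data": train_data, "--dataset-type": "webdataset"}
    out, i = [], 0
    while i < len(argv):
        if argv[i] in overrides:
            i += 2  # skip the flag and its old value
            continue
        out.append(argv[i])
        i += 1
    for flag, value in overrides.items():
        out += [flag, value]
    return out

# Hypothetical example: replace the synthetic data settings.
argv = ["--dataset-type", "synthetic", "--batch-size", "1024"]
print(use_webdataset(argv, "/data/laion/{00000..00099}.tar"))
```

The original list is left unchanged, so the synthetic configuration can still be reused for quick smoke tests.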

@SomnusQue
Author

I downloaded the LAION file and put it in '/.cache/clip/'.
Is this the path I need to pass?
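
webdataset reads samples from .tar shards, so one quick sanity check is whether the directory given to --train-data actually contains such files. Note that '/.cache/clip/' (leading slash) is an absolute path at the filesystem root, which differs from '~/.cache/clip/' in the home directory. A hypothetical helper, as a sketch:

```python
from pathlib import Path

def find_shards(directory):
    """Sorted names of .tar files directly inside `directory`.

    An empty list means the directory is missing or holds no webdataset shards.
    """
    return sorted(p.name for p in Path(directory).expanduser().glob("*.tar"))

shards = find_shards("~/.cache/clip")  # hypothetical location; use your own path
print(len(shards), shards[:3])
```

If the list is empty, the downloaded LAION files are probably not yet in webdataset (.tar shard) form.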

@wkcn
Contributor

wkcn commented Jan 23, 2024
