TrainingDivergedException executing TransferFreshFruit Example + Differences in ATLearn Embeddings #3335
Unanswered
rd-peter-braun asked this question in Q&A
Replies: 0 comments
I got the following exception during the SoftmaxCrossEntropy evaluation when executing the TransferFreshFruit example:
Caused by: ai.djl.TrainingDivergedException: The Loss became NaN, try reduce learning rate,add clipGradient option to your optimizer, check input data and loss calculation.
    at ai.djl.training.listener.DivergenceCheckTrainingListener.onTrainingBatch(DivergenceCheckTrainingListener.java:27) ~[api-0.28.0.jar:na]
It occurs during the second epoch.
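For context, here is a minimal plain-Java sketch of one generic way a softmax cross-entropy value can turn into NaN: very large logits overflow when exponentiated directly. This is not DJL's implementation, and I have not confirmed that this is what happens in my run; the class and method names (StableCrossEntropy, naive, stable) are made up for illustration.

```java
public final class StableCrossEntropy {
    private StableCrossEntropy() {}

    /** Naive form: exponentiates raw logits, so large values overflow to Infinity. */
    public static double naive(double[] logits, int label) {
        double sum = 0.0;
        for (double z : logits) {
            sum += Math.exp(z);
        }
        // Infinity / Infinity yields NaN, which then propagates through log.
        return -Math.log(Math.exp(logits[label]) / sum);
    }

    /** Stable form: subtracts the largest logit before exponentiating (log-sum-exp). */
    public static double stable(double[] logits, int label) {
        double max = Double.NEGATIVE_INFINITY;
        for (double z : logits) {
            max = Math.max(max, z);
        }
        double sum = 0.0;
        for (double z : logits) {
            sum += Math.exp(z - max);
        }
        return Math.log(sum) + max - logits[label];
    }

    public static void main(String[] args) {
        // Hypothetical logits, chosen only to trigger the overflow.
        double[] logits = {1000.0, 0.0, -5.0};
        System.out.println("naive  = " + naive(logits, 0));
        System.out.println("stable = " + stable(logits, 0));
    }
}
```

If the logits themselves already contain NaN or Infinity (for example after one overly large weight update), even the stable form returns a non-finite value, which would fit the hints about learning rate and gradient clipping in the exception message.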
I tried it with the DJL embedding model djl://ai.djl.pytorch/resnet18_embedding and with a self-generated one created with ATLearn.
The latter had a different final layer: .addSingleton(nd -> nd.squeeze(new int[] {2, 3})) did not work because its output had only 2 dimensions.
I use a custom training set containing 83 classes, each with 250 images (JPG, 640x480), so my dataset construction differs from the DJL example.
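To make the shape difference concrete, here is a small plain-Java sketch of the logic I would expect a guarded squeeze to follow: drop axes 2 and 3 only when they exist and have size 1. SqueezeGuard and squeezeIfPresent are hypothetical names, and the feature size 512 is only an illustrative value. In DJL the same check could presumably be made on the rank from nd.getShape().dimension() before calling squeeze.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public final class SqueezeGuard {
    private SqueezeGuard() {}

    /**
     * Returns the shape after removing the requested axes, skipping any axis
     * that does not exist or whose size is not 1.
     */
    public static long[] squeezeIfPresent(long[] shape, int... axes) {
        boolean[] drop = new boolean[shape.length];
        for (int axis : axes) {
            if (axis >= 0 && axis < shape.length && shape[axis] == 1) {
                drop[axis] = true;
            }
        }
        return IntStream.range(0, shape.length)
                .filter(i -> !drop[i])
                .mapToLong(i -> shape[i])
                .toArray();
    }

    public static void main(String[] args) {
        // 4-D embedding output (batch, features, 1, 1): axes 2 and 3 are dropped.
        System.out.println(Arrays.toString(squeezeIfPresent(new long[] {32, 512, 1, 1}, 2, 3)));
        // 2-D embedding output (batch, features): nothing to squeeze, shape is unchanged.
        System.out.println(Arrays.toString(squeezeIfPresent(new long[] {32, 512}, 2, 3)));
    }
}
```

Both calls end up with the same 2-D shape, which is what the layers after the embedding would need regardless of which model produced it.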
properties + code example:
trained-model-name: "eb_resnet_18" // props.getTrainedModelName()...
trainset-path: "trainset" // contains 83 classes....
epochs: 25
engine: PyTorch
device: CPU
batch-size: 32
learning-rate: 0.001
patience: 4
train-param: true