possible Student T instability? #110

Closed
DHekstra opened this issue Jun 23, 2023 · 11 comments

Comments

@DHekstra

See attached files. Performing two-step inference for data processed in CrystFEL by AP, Careless run by KIW. NLL term diverges. This seems to be the key part of the traceback:

`Traceback (most recent call last):
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
run_careless(parser)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
history = model.train_model(
^^^^^^^^^^^^^^^^^^
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
_history = train_step((self, data))
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:

Detected at node 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal' defined at (most recent call last):
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
sys.exit(main())
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
run_careless(parser)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
history = model.train_model(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
_history = train_step((self, data))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 159, in train_step
history = model.train_step((data,))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 1050, in train_step
y_pred = self(x, training=True)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 558, in call
return super().call(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/base_layer.py", line 1145, in call
outputs = call_fn(inputs, *args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 121, in call
z_f = self.surrogate_posterior.sample(self.mc_sample_size)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/surrogate_posteriors.py", line 50, in sample
s = self.distribution.sample(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1205, in sample
return self._call_sample_n(sample_shape, seed, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1182, in _call_sample_n
samples = self._sample_n(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/truncated_normal.py", line 251, in _sample_n
return tf.random.stateless_parameterized_truncated_normal(
Node: 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal'
Detected at node 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal' defined at (most recent call last):
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/bin/careless", line 8, in <module>
sys.exit(main())
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 9, in main
run_careless(parser)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/careless.py", line 53, in run_careless
history = model.train_model(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 173, in train_model
_history = train_step((self, data))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 159, in train_step
history = model.train_step((data,))
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 1050, in train_step
y_pred = self(x, training=True)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/training.py", line 558, in call
return super().call(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/engine/base_layer.py", line 1145, in call
outputs = call_fn(inputs, *args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 96, in error_handler
return fn(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/variational.py", line 121, in call
z_f = self.surrogate_posterior.sample(self.mc_sample_size)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/careless/models/merging/surrogate_posteriors.py", line 50, in sample
s = self.distribution.sample(*args, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1205, in sample
return self._call_sample_n(sample_shape, seed, **kwargs)
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/distribution.py", line 1182, in _call_sample_n
samples = self._sample_n(
File "/home/groups/brunger/kiwhite/software/anaconda3/envs/careless/lib/python3.11/site-packages/tensorflow_probability/python/distributions/truncated_normal.py", line 251, in _sample_n
return tf.random.stateless_parameterized_truncated_normal(
Node: 'variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal'
2 root error(s) found.
(0) INVALID_ARGUMENT: Invalid parameters
[[{{node variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal}}]]
[[variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal/_14]]
(1) INVALID_ARGUMENT: Invalid parameters
[[{{node variational_merging_model/TruncatedNormal_CONSTRUCTED_AT_top_level/sample/stateless_parameterized_truncated_normal/StatelessParameterizedTruncatedNormal}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_step_6249]`

careless_22576794.out.txt
careless_22576794.err.txt
inputs_params.log.txt
slurm_script.txt

@kmdalton
Member

Did all of the failed runs include --refine-uncertainties?

Can you try increasing --mc-samples to 20 if you have enough memory, or decreasing --learning-rate to 1e-4 if you don't?
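
The intuition behind raising the sample count can be sketched in a few lines: a Monte Carlo average over more draws has lower variance, so a training objective estimated from it is less noisy. This is a toy illustration in plain Python, not careless code; the integrand (`x**2` with `x ~ N(0, 1)`), the trial count, and the seed are all assumptions chosen for the demonstration.

```python
import random
import statistics


def mc_estimate(n_samples, rng):
    """Average n_samples draws of a noisy quantity (here x**2 with x ~ N(0, 1))."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n_samples)) / n_samples


def estimator_variance(n_samples, n_trials=2000, seed=0):
    """Empirical variance of the Monte Carlo estimate across repeated trials."""
    rng = random.Random(seed)
    estimates = [mc_estimate(n_samples, rng) for _ in range(n_trials)]
    return statistics.variance(estimates)


var_1 = estimator_variance(1)    # one sample per estimate
var_20 = estimator_variance(20)  # twenty samples per estimate
print(f"variance with 1 sample:   {var_1:.3f}")
print(f"variance with 20 samples: {var_20:.3f}")
```

The variance of the estimate shrinks roughly in proportion to the number of samples, at the cost of memory and compute, which is why lowering the learning rate is offered as the fallback.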

@DHekstra
Author

yes, I think all runs, successful and failed, had --refine-uncertainties.
I'll try both suggestions.

@kmdalton
Member

okay -- can you compile a list of parameters for which runs succeeded and failed? maybe the failures have something in common.
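
A minimal sketch of that comparison, assuming each run is recorded as a list of command-line flags plus a success marker. The record layout and the flag names (`--option-a` and so on) are hypothetical placeholders, not real careless options or real results from these runs.

```python
def flags_unique_to_failures(runs):
    """Return flags present in every failed run and absent from every successful run."""
    failed = [set(r["flags"]) for r in runs if not r["ok"]]
    passed = [set(r["flags"]) for r in runs if r["ok"]]
    if not failed:
        return set()
    common_to_failed = set.intersection(*failed)
    seen_in_passed = set().union(*passed)
    return common_to_failed - seen_in_passed


# Hypothetical run records, for illustration only.
runs = [
    {"flags": ["--option-a", "--option-b"], "ok": False},
    {"flags": ["--option-a", "--option-c"], "ok": False},
    {"flags": ["--option-b", "--option-c"], "ok": True},
]

suspects = flags_unique_to_failures(runs)
print(sorted(suspects))
```

A flag that survives this filter is only a candidate cause; a control run without it is still needed to confirm.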

@DHekstra
Author

Tentatively the same problem as #61 which was resolved by #62. I'll report back.

@kmdalton
Member

not sure if this is causal, but the ev11 likelihood should be adjusted to use a shift in its bijectors for transformed variables:

self.Sdfac = tfu.TransformedVariable(1., tfb.Softplus())
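
Why a shift helps can be shown without TensorFlow. A plain softplus maps a very negative unconstrained value to exactly zero in floating point, while a shifted softplus is bounded below by the shift. The sketch below uses only the standard library; the shift value `1e-6` and the function names are assumptions for illustration, and the exact change made in careless is not shown in this thread. In TensorFlow Probability the analogous construction would presumably be something like `tfb.Chain([tfb.Shift(eps), tfb.Softplus()])`.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def shifted_softplus(x, shift=1e-6):
    """Softplus followed by a constant shift, so the output is never below `shift`."""
    return softplus(x) + shift


raw = -800.0  # an unconstrained value that has drifted far negative during training

plain_scale = softplus(raw)            # exp(-800) underflows, so this is exactly 0.0
shifted_scale = shifted_softplus(raw)  # stays at the shift value

print(plain_scale, shifted_scale)
```

A scale-like parameter that reaches zero leads to division by zero or an invalid distribution downstream, whereas the shifted version keeps it strictly positive. Single-precision arithmetic underflows at much less extreme inputs than the double-precision value used here.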

@DHekstra
Author

note to self:

  • adjusting the learning rate may help
  • adjusting --epsilon may help.

@kmdalton
Member

@DHekstra, is it true that the common factor in failed training runs was not Student T but rather image layers?

@DHekstra
Author

yes, that is true. this batch of runs did not include a no-image-layer "control". the no-image-layer case did complete without problems previously.

@kmdalton
Member

@DorisMai found a bug (#122) in the surrogate posteriors which could have been leading to numerical instability. After I do the next release, it'd be nice to see if your issues go away.

@kmdalton
Member

Okay, @DHekstra, please give version 0.3.5 a try when you have a chance.

@kmdalton
Member

I think this is fully addressed by #167 and #168. I'm closing this until we hear of numerical issues cropping up again.
