Allow training on larger datasets #2

Merged 1 commit into normalization from dg/train on Jul 31, 2020
Conversation

DhairyaLGandhi (Member)

So this was a bit unexpected, but it seems to be related to machine precision, since in many cases it was very difficult to get the model to fail even with worst-case inputs. The background is that the current definition of softplus leads to some ill-behaved gradients for large negative values.

julia> gradient(log ∘ exp, -100)
(0.9999999999999999,)
julia> gradient(log ∘ exp, -100 + eps(Float32))
(Inf32,)

(The analytic derivative of log ∘ exp is 1, but the backward pass divides by exp(x), and 1 / exp(-100f0) already overflows Float32, which is presumably where the Inf32 comes from.) This eventually manifested in the model like so:

julia> gradient(x -> sum(softplus.(x)), randn(Float32, 3,3) .* 10f1 )
(Float32[NaN NaN 1.0; NaN 0.0014634054 NaN; NaN 1.7532707f-27 1.0],)

Defining the adjoint for softplus fixes this. There is an upstream PR for that, but I am adding the adjoint here so we don't have to worry about the dependencies too much for the time being.
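
For reference, here is a minimal sketch of what such an adjoint could look like, assuming NNlib's softplus and sigmoid and Zygote's @adjoint macro; the rule in the actual commit may differ. It uses the identity softplus'(x) == sigmoid(x), which stays in (0, 1) and therefore avoids the Inf/Inf and 0/0 intermediates that produce the NaNs above. The rule is written at the broadcast level because that is how softplus is applied in the failing example.

using Zygote, NNlib
using Zygote: @adjoint
using Base.Broadcast: broadcasted

# Hypothetical sketch, not necessarily the committed code: pull back through an
# element-wise softplus using sigmoid, which is bounded in (0, 1) and so cannot
# overflow or underflow the way a naive exp(x) / (1 + exp(x)) pullback can.
@adjoint function broadcasted(::typeof(NNlib.softplus), x)
    y = NNlib.softplus.(x)
    return y, Δ -> (nothing, Δ .* NNlib.sigmoid.(x))
end

With a rule like this in place, the call from the example above, gradient(x -> sum(softplus.(x)), randn(Float32, 3, 3) .* 10f1), should return sigmoid values instead of NaNs.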

cc @rkurchin

DhairyaLGandhi changed the title from "Allow training on larger dataset" to "Allow training on larger datasets" on Jul 31, 2020
rkurchin (Member)

Just tested this 5x on the 15k test set and 5x on the 32k test set with no NaNs; merging!

rkurchin merged commit f0d32fc into normalization on Jul 31, 2020
rkurchin deleted the dg/train branch on July 31, 2020 at 18:02