This issue was moved to a discussion. You can continue the conversation there.

Basic gradient_descent optimization fails #86

Closed
Affie opened this issue Jun 18, 2021 · 43 comments

@Affie (Contributor) commented Jun 18, 2021

Hi, I'm new to Manopt and I'm trying to figure out how to integrate Manopt into IncrementalInference (see also JuliaRobotics/IncrementalInference.jl#1276 for more complex attempts).

The optimization is formulated as a factor graph, and this example represents a simple factor graph with one variable and two priors (so essentially finding the point that minimizes the combined distance to two points). I don't get the results I expect, so I made this MWE:

using Manopt
using Manifolds
using StaticArrays

# "factor"
function Prior_distance(M, meas, p)		
    return distance(M, meas, p)
end

## sanity check factors
M = TranslationGroup(2)

p = SA[1., 0]
q = SA[1., 1]
Prior_distance(M, p, q)

p = SA[0., 1]
q = SA[1., 1]
Prior_distance(M, p, q)

p = SA[0., -1]
q = SA[0., 0]
Prior_distance(M, p, q)

p = SA[0., -1]
q = SA[0., -1]
Prior_distance(M, p, q)

## Try  with Manopt

priorx1 = SA[1.1, 1.0]
priorx2 = SA[0.9, 1.0]

# Factor graph with 2 priors on variable x
function F(M,x) 
    return Prior_distance(M, priorx1, x) + Prior_distance(M, priorx2, x)
end

function gradF(M, x)
    Manifolds.gradient(M,(x)->F(M,x), x)
end

# start value
x0 = SA[0.,0]
# test cost function
F(M, x0)
gradF(M, x0)

# run
xMean = gradient_descent(M, F, gradF, x0)
F(M, xMean)
gradF(M, xMean)

#= results
# I expect [1,1]
julia> xMean = gradient_descent(M, F, gradF, x0)
2-element MVector{2, Float64} with indices SOneTo(2):
  0.9859635170635069
 -0.0031281100934246187

# the cost is still high
julia> F(M, xMean)
2.0163938998034947

# and the gradient also
julia> gradF(M, xMean)
2-element MVector{2, Float64} with indices SOneTo(2):
 -0.02757083871664094
 -1.9899485375183275
=#

cc @dehann

EDIT: I'm on
[1cead3c2] Manifolds v0.5.4
[0fc0a36d] Manopt v0.3.9

@kellertuer (Member)

Interesting.
First, you do not really have to use the AD tools (from Manifolds.gradient), since for distance there are gradients available as grad_distance(M, p, x, exponent=2) (that is, the default gradient is for the distance squared), so it is a little more efficient to use

gradF(M, x) = grad_distance(M, priorx1, x, 1) + grad_distance(M, priorx2, x, 1)

This yields exactly the same numbers as your gradient, but also works on manifolds other than R^2 (your manifold is essentially Euclidean(2) here).

Then, to see a little more of what is happening (though I saw the reason directly), you can do

 xMean = gradient_descent(M, F, gradF, x0; debug= [:Iteration, :Cost,"\n",10])

to print a line with the iteration and the cost (followed by a newline) every 10th iteration.

If you then look at the default values of gradient descent, you can see that it works with a constant step size by default (machine learning people call that a learning rate), which is not guaranteed to converge (ever).
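To make the constant-step-size issue concrete, here is a plain-Julia sketch of the same cost directly in R^2, with no Manopt involved. The constant step of 1.0 and the 1/k schedule are illustrative assumptions, not Manopt's actual internals; the cost here is a sum of two unsquared distances, which is convex but nonsmooth at the prior points (its minimal value is 0.2, the distance between the two priors):

```julia
using LinearAlgebra

# The cost from the issue, written directly in R^2.
p1, p2 = [1.1, 1.0], [0.9, 1.0]
f(x) = norm(x - p1) + norm(x - p2)
gradf(x) = (x - p1) / norm(x - p1) + (x - p2) / norm(x - p2)

# Generic (sub)gradient descent with a prescribed step-size sequence.
function descend(x, steps)
    for s in steps
        x = x - s * gradf(x)
    end
    return x
end

xc = descend([0.0, 0.0], fill(1.0, 200))            # constant step size
xd = descend([0.0, 0.0], [1.0 / k for k in 1:2000]) # diminishing step sizes

f(xc)  # typically ends up hopping around the kink, far above the optimum 0.2
f(xd)  # close to the optimal value 0.2, near the minimising segment
```

With a constant step the iterates settle into a cycle around the nonsmooth point (much like the cost of about 2.0 reported above), while the square-summable 1/k schedule is the textbook remedy for convex nonsmooth costs.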

But you can exchange that, see https://manoptjl.org/stable/plans/index.html#Stepsize-1, for a step size that might fit better, for example the Armijo step size rule; its defaults are chosen to be just fine, so

julia> xMean = gradient_descent(M, F, gradF, x0; stepsize=ArmijoLinesearch(),  debug= [:Iteration, :Cost,"\n",10])
InitialF(x): 2.831969279439222
# 10F(x): 0.2578005738010819
# 20F(x): 0.2000672051680747
# 30F(x): 0.20000006190528008
# 40F(x): 0.20000000005719792
# 50F(x): 0.2000000000000529
# 60F(x): 0.20000000000000012
2-element MVector{2, Float64} with indices SOneTo(2):
 1.0000197125126835
 0.9999999998505151

Is this close enough? If not, you can tweak the stopping criterion, see https://manoptjl.org/stable/solvers/index.html#StoppingCriteria-1, and note that you can combine stopping criteria just with & or |. The default is StopAfterIteration(200) | StopWhenGradientNormLess(10.0^-8), so it stops after 200 iterations or when the gradient norm is less than 10.0^-8, whichever hits first. Let's also increase the debug interval to 25 and we get for example

julia> xMean = gradient_descent(M, F, gradF, x0;
   stepsize=ArmijoLinesearch(),
   stopping_criterion=StopAfterIteration(200) | StopWhenGradientNormLess(10.0^-12),
   debug=[:Iteration, :Cost,"\n",25]
)
InitialF(x): 2.831969279439222
# 25F(x): 0.2000020367614238
# 50F(x): 0.2000000000000529
2-element MVector{2, Float64} with indices SOneTo(2):
 1.0000197125126835
 0.9999999999999954

Let me know if this solves your problem, and whether there is anything missing in the docs or could be rephrased therein.

One could for example change the default to use Armijo – though that by default needs quite a few function evaluations to determine the step size.

@kellertuer (Member) commented Jun 18, 2021

Addendum, after clicking on your link – it seems the allocate in ManifoldsBase does something strange in NelderMead, but if you provide the population yourself and use the mutating variant

NelderMead!(M.manifold, F, [random_point(M) for _=1:3])

it does something. Note that Nelder–Mead is not a very efficient algorithm and should only be used if you do not have any gradient.

edit: sorry, my bad :) I do not work with NelderMead usually.

You should also call NelderMead not with an initial point but with an initial population (manifold_dimension + 1 points), so

NelderMead(M.manifold, F, [random_point(M) for _=1:3])

just works fine, too as does

NelderMead(M.manifold, F)

But note that its convergence is veeeeeery slow (the default 2000 iterations are not enough).
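To make the population requirement concrete, here is a bare-bones Euclidean Nelder–Mead sketch (reflection and shrink steps only, plain Julia; this is not Manopt.jl's implementation, which generalizes the centroid and reflection to manifolds). The simplex needs dimension + 1 vertices, which is exactly the manifold_dimension + 1 points mentioned above:

```julia
# f is the cost; pop is the simplex: dimension + 1 vertices as vectors.
function nelder_mead(f, pop; iters = 200)
    pop = sort(pop; by = f)
    for _ in 1:iters
        c = sum(pop[1:end-1]) / (length(pop) - 1)  # centroid of all but the worst
        r = c + (c - pop[end])                     # reflect the worst vertex
        if f(r) < f(pop[end])
            pop[end] = r                           # keep the improvement
        else
            for i in 2:length(pop)                 # otherwise shrink towards best
                pop[i] = pop[1] + 0.5 * (pop[i] - pop[1])
            end
        end
        sort!(pop; by = f)
    end
    return pop[1]                                  # best vertex found
end

g(x) = (x[1] - 1)^2 + (x[2] - 2)^2
best = nelder_mead(g, [zeros(2), [1.0, 0.0], [0.0, 1.0]])
```

The three starting vectors play the role of the `[random_point(M) for _=1:3]` population above; `best` should end up near [1, 2] on this toy quadratic.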

@dehann (Contributor) commented Jun 18, 2021

Hi Ronny Johan, thanks this helps me follow better too.

it seems the allocate in ManifoldsBase does something strange in NelderMead

hmm, makes me curious about #64 again. I'll check when I get a chance and comment there if something shows up.

@kellertuer (Member)

Hi Ronny Johan, thanks this helps me follow better too.

it seems the allocate in ManifoldsBase does something strange in NelderMead

hmm, makes me curious about #64 again. I'll check when I get a chance and comment there if something shows up.

No, that was my mistake at first glance – the error is that you have to provide a population, not a starting point. Then everything works :)

@kellertuer (Member)

Oh and a final look, since you used BFGS in Optim. We have that on manifolds, too, just use

julia> quasi_Newton(M, F, gradF, x0)
2-element MVector{2, Float64} with indices SOneTo(2):
 1.0204551069445331
 1.000000000000014

whose default is the (inverse) BFGS here, too.

@dehann (Contributor) commented Jun 18, 2021

Ah, got it thanks!

@kellertuer (Member)

Perfect. Let me know when you think a default should be adapted or the docs should be changed (for example at first glance Stopping Criteria and Step Size rules could be made a little more prominent).

@dehann (Contributor) commented Jun 18, 2021

Stopping Criteria and Step Size rules could be made a little more prominent

Maybe just a !!! note :-)

@kellertuer (Member)

Where? The default step size is mentioned at every algorithm, as are the default stopping criteria.
I was planning to give them separate pages, maybe.

@dehann (Contributor) commented Jun 18, 2021

I made a PR for where I would add it... Feel free to ignore it if that does not fit/work.

Reason for putting it there is that when following a design template example for the first time, one expects the defaults to work (regardless of performance). When I get stuck on the first example, I find myself scrolling to the bottom of the "Getting Started" page hoping to find the obvious stumbling blocks. An FAQ is the second-best place (but can be a discouraging search). Once users see their own first example working (different from just copy-paste), they start to invest more time in reading deeper into the docs. Training wheels first, with the absolute minimum of detail to get going, then scale up the detail from there.

@kellertuer (Member)

Manopt.jl does not have a FAQ.
I will comment on the idea in the PR.

@Affie (Contributor, Author) commented Jun 18, 2021

Thanks a lot, I did find stopping criteria and played around with it. However, I expected the default to converge in such a simple example.

There are quite a few stumbling blocks for me, but I can already see the potential of Manopt. I know it's a relatively new package and perhaps we can help build out examples.

Reason for putting it there is that when following a design template example for the first time, one expects the defaults to work (regardless of performance). When I get stuck on the first example, I find myself scrolling to the bottom of the "Getting Started" page hoping to find the obvious stumbling blocks.

So I'm in agreement with this, and that defaults should rather just work (at the possible cost of performance, with a note or something), especially if the user comes from other packages like Optim or JuMP, where one can be a user without understanding the underlying algorithm.

Perhaps something like this will help a new user like me a lot, from the Optim.jl docs:

Summary
As a very crude heuristic:
For a low-dimensional problem with analytic gradients and Hessians, use the Newton method with trust region. For larger problems or when there is no analytic Hessian, use LBFGS, and tweak the parameter m if needed. If the function is non-differentiable, use Nelder-Mead. Use the HagerZhang linesearch for robustness and BackTracking for speed.

@Affie (Contributor, Author) commented Jun 18, 2021

It's perhaps not part of this issue, but let me list some problems I encountered.
I run into a lot of error messages and don't know whether it is something I'm doing wrong (in the NelderMead case):

Manopt.NelderMead(MP, F, xp)
ERROR: MethodError: no method matching NelderMeadOptions(::ProductRepr{Tuple{ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}}}, ::StopAfterIteration; α=1.0, γ=2.0, ρ=0.5, σ=0.5, retraction_method=ExponentialRetraction(), inverse_retraction_method=LogarithmicInverseRetraction())

or perhaps other errors, eg:

julia> xMean = quasi_Newton(M, F_prior, gradF_prior, x; debug= [:Iteration, :Cost,"\n",10])
InitialF(x): 1.4945569745366585
ERROR: vector_transport_to! not implemented on ProductManifold(TranslationGroup(2; field = ℝ), SpecialOrthogonal(2)) for arguments ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}} and ParallelTransport.
julia> particle_swarm(M, F_prior; n, x0)
ERROR: ArgumentError: broadcasting requires an assigned BroadcastStyle
gradF(MP, xp)
ERROR: MethodError: no method matching fill!(::ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ::Int64)

The other issues are more to do with a lack of knowledge coming from "I just want to optimize my cost function", such as

  • the choice of algorithm and parameters. (basically ran them and compared results)
  • how to write my gradient function (just used automatic diff)
  • how to work with covariances (previously used the Mahalanobis distance for measurements and the Hessian to estimate the answer)

I hope this helps improve the quality of this great package and let me know if there is anywhere I can help. (although I still have a lot to learn)

@kellertuer (Member)

For the first and simple example there is a certain tradeoff, especially on manifolds.
The constant step size might not converge, but other rules, which require function evaluations and/or exponential maps/retractions (Armijo needs quite a lot of both), might take a long computational time. One has the choice between a user "stumbling" over either one of these.

We can still, for sure, switch to Armijo; I don't have a strong preference there.
The main problem is not that Manopt is that new (I started in 2016, but only got a proper first version in 2018), but that I am the only developer. Sure, some algorithms were provided by others, some of them students of mine.
That's why some parts have not been used extensively yet, and extensive use is usually what strengthens notes and tutorials. I hope, though, that the documentation is complete (concerning optional arguments, for example).

The summary sounds like a good idea, and it holds here completely analogously for Newton with TR and for LBFGS (BFGS uses https://manoptjl.org/stable/solvers/quasi_Newton.html#Manopt.AbstractQuasiNewtonDirectionUpdate; switch to https://manoptjl.org/stable/solvers/quasi_Newton.html#Manopt.QuasiNewtonLimitedMemoryDirectionUpdate for the LBFGS one, which might also be good to have a shorter constructor for).

Additionally, for non-smooth functions, if you have proximal maps, use CyclicProximalPoint or DouglasRachford (Optim does not have those, but this package also has a slightly different focus, namely to include non-smooth optimisation on manifolds), or, if you are adventurous, the ChambollePock algorithm from the most recent paper.

For the summary – 

@Affie (Contributor, Author) commented Jun 18, 2021

For the summary –

Thanks! A summary like that is perfect. I was actually looking for LBFGS, but didn't recognize it until you made it obvious just now 🤦

@kellertuer (Member)

Manopt.NelderMead(MP, F, xp)
ERROR: MethodError: no method matching NelderMeadOptions(::ProductRepr{Tuple{ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}}}, ::StopAfterIteration; α=1.0, γ=2.0, ρ=0.5, σ=0.5, retraction_method=ExponentialRetraction(), inverse_retraction_method=LogarithmicInverseRetraction())

For this one, without more details I cannot say much – where was it thrown, how did you call NelderMead, what are the variables? What is the call stack? I sadly cannot just guess from a signature what is wrong.

or perhaps other errors, eg:

julia> xMean = quasi_Newton(M, F_prior, gradF_prior, x; debug= [:Iteration, :Cost,"\n",10])
InitialF(x): 1.4945569745366585
ERROR: vector_transport_to! not implemented on ProductManifold(TranslationGroup(2; field = ℝ), SpecialOrthogonal(2)) for arguments ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}} and ParallelTransport.

This seems more like an issue for Manifolds.jl, since vector_transport_to! is something I use from the ManifoldsBase.jl interface. At first glance (same comment as above: missing call stack / MWE / which variables are used?) I would say that vector_transport_to! seems to not pass through the group manifold correctly, i.e., it should "unpack" the group-manifold decorator and call vector_transport_to! on the underlying manifold (without the group structure).

julia> particle_swarm(M, F_prior; n, x0)
ERROR: ArgumentError: broadcasting requires an assigned BroadcastStyle

Again:

  • Which manifold?
  • Which x0? I think, again, you used a single point x0 and not a population (you have to provide manifold_dimension(M)+1 points if you set that parameter).
  • What is n?
gradF(MP, xp)
ERROR: MethodError: no method matching fill!(::ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ::Int64)

What is gradF here? That seems to be a function you defined yourself. Mutating or non-mutating? Mutating might not work 100% with ProductRepr, but Manopt.jl supports both types: functions of the form grad_F(M, p) that return the gradient, and grad_F!(M, X, p) that compute the result in place in X.
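The two signatures can be sketched in plain Julia with a Euclidean toy cost f(p) = |p|², so no manifold library is needed; the manifold argument M is just passed through, as in the Manopt calling convention described above:

```julia
# Allocating form: returns a freshly computed gradient vector.
grad_F(M, p) = 2 .* p

# Mutating form: writes the gradient into the preallocated tangent vector X.
function grad_F!(M, X, p)
    X .= 2 .* p
    return X
end

X = zeros(2)
grad_F!(nothing, X, [1.0, 2.0])   # X is now [2.0, 4.0]
grad_F(nothing, [1.0, 2.0])       # returns [2.0, 4.0]
```

The mutating form avoids allocations in inner loops, which is exactly why solvers offer both; with ProductRepr points the in-place broadcast is the part that can be fragile, as noted above.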

The other issues are more to do with a lack of knowledge coming from "I just want to optimize my cost function", such as

  • the choice of algorithm and parameters. (basically ran them and compared results)

That cannot be answered generically; it depends heavily on the algorithm and the problem.

  • how to write my gradient function (just used automatic diff)

AD is something that is not yet well covered (see my note above on how many developers there are – just me), so the only AD support is the one you already used in Manifolds.jl. AD is on my list, but that alone might still take quite some time.

  • how to work with covariances (previously used the Mahalanobis distance for measurements and the Hessian to estimate the answer)

I do not understand this. What is Mahalanobis? Where do you want to work with covariances? As a manifold?

@kellertuer (Member)

For the summary –

Thanks! A summary like that is perfect. I was actually looking for LBFGS, but didn't recognize it until you made it obvious just now 🤦

This might mean, as I said, that LBFGS should get a nice short constructor and a nice tutorial. This algorithm was the bachelor thesis of a student of mine, and a tutorial was not part of that ;)

@Affie (Contributor, Author) commented Jun 18, 2021

For this one, without more details I cannot say much

Yes, sorry, my intention was not to get answers to the questions (thanks for trying, though) but just to list general stumbling blocks.
I'll find MWEs for all the errors I encounter, if that helps. Perhaps some of them can make it into the tests.
Give me some time and I'll list them in a new issue.

@kellertuer (Member)

Sure, carefully craft MWEs or MnWEs and open issues – but also look at where the issue might come from, i.e., whether it is more a Manopt or a Manifolds issue.

@mateuszbaran (Member)

  • how to work with covariances (previously used the Mahalanobis distance for measurements and the Hessian to estimate the answer)

I do not understand this. What is Mahalanobis? Where do you want to work with covariances? As a manifold?

This probably refers to the way covariance matrix for a multivariate normal distribution can be obtained from the Hessian of logpdf like here: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470824566.app1 . I guess there should be a manifold equivalent for some generalization of the multivariate normal distribution? And this is the Mahalanobis distance: https://en.wikipedia.org/wiki/Mahalanobis_distance and it uses the covariance matrix.
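For concreteness, the Mahalanobis distance and the Hessian-to-covariance link in plain Julia, a sketch of the Euclidean case only (the function name is ad hoc):

```julia
using LinearAlgebra

# Mahalanobis distance: d(x, μ) = sqrt((x - μ)' Σ⁻¹ (x - μ)).
mahalanobis(x, μ, Σ) = sqrt(dot(x - μ, Σ \ (x - μ)))

# For a multivariate normal, logpdf(x) = -0.5 * d(x, μ)^2 + const, so the
# negative Hessian of the logpdf at the mode is exactly Σ⁻¹: that is the
# covariance-from-Hessian link referenced above.
Σ = [2.0 0.0; 0.0 0.5]
mahalanobis([1.0, 1.0], [0.0, 0.0], Σ)  # sqrt(0.5 + 2.0) = sqrt(2.5)
```

How this generalizes to tangent spaces on a manifold is exactly the open question in this thread; the sketch only pins down the flat-space definitions.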

@kellertuer (Member)

Ah, that was too long a leap for me to make from this short note – especially since I am more into optimization and less into distributions. Thanks for the information; sure, this could be used in tangent spaces to some extent, though I am not sure about the details.

@Affie (Contributor, Author) commented Jun 19, 2021

This probably refers to the way covariance matrix for a multivariate normal distribution can be obtained from the Hessian of logpdf like here: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470824566.app1 . I guess there should be a manifold equivalent for some generalization of the multivariate normal distribution? And this is the Mahalanobis distance: https://en.wikipedia.org/wiki/Mahalanobis_distance and it uses the covariance matrix.

Yes, that is spot on. If you know of any good sources on the subject it would be great. I still have a lot to learn.

@Affie (Contributor, Author) commented Jun 19, 2021

Unfortunately, I still don't get the results I expect in general; here is the basic example updated to SE(2).

I experimented with the step size, but it didn't seem to make a difference.

using Manopt
using Manifolds
using StaticArrays

# "factor"
function Prior_distance(M, meas, p)		
    return distance(M, meas, p)
end

M = SpecialEuclidean(2)

priorx1 = ProductRepr(SA[1.1, 1.0], SA[0.707107 -0.707107; 0.707107 0.707107])
priorx2 = ProductRepr(SA[0.9, 1.0], SA[0.707107 -0.707107; 0.707107 0.707107])

# Factor graph with 2 priors on variable x
function F(M,x) 
    return Prior_distance(M, priorx1, x) + Prior_distance(M, priorx2, x)
end

function gradF(M, x)
    Manifolds.gradient(M,(x)->F(M,x), x)
end

# start value
x0 = ProductRepr(SA[0.,0], SA[1. 0; 0 1])
# test cost function
F(M, x0)
gradF(M, x0)

# run
xMean = gradient_descent(M, F, gradF, x0; stepsize=ArmijoLinesearch(), debug= [:Iteration, :Cost,"\n",10])

InitialF(x): 3.600341492608454
# 10F(x): 3.526239029861806
# 20F(x): 3.5165928358291243
# 30F(x): 3.5165823494284916
# 40F(x): 3.516582337500095
# 50F(x): 3.516582337489366
# 60F(x): 3.5165823374868284
# 70F(x): 3.5165823374884537
# 80F(x): 3.5165823374881358
# 90F(x): 3.516582337486814
# 100F(x): 3.5165823374881002
# 110F(x): 3.516582337487348
# 120F(x): 3.516582337489555
# 130F(x): 3.5165823374886473
# 140F(x): 3.516582337487341
# 150F(x): 3.516582337488459
# 160F(x): 3.5165823374871947
# 170F(x): 3.5165823374887344
# 180F(x): 3.5165823374883733
# 190F(x): 3.5165823374862595
# 200F(x): 3.516582337488268
ProductRepr with 2 submanifold components:
 Component 1 =
  2-element MVector{2, Float64} with indices SOneTo(2):
   0.9999999999999774
   1.0000000000001388
 Component 2 =
  2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
   -0.440261  -0.89787
    0.89787   -0.440261

The same code errors with quasi_Newton, see JuliaManifolds/Manifolds.jl#384.

@kellertuer (Member)

What do you expect?

@Affie (Contributor, Author) commented Jun 19, 2021

The rotation part should be

Component 2 =
  2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
   0.707107  -0.707107
   0.707107   0.707107

If you look at the gradient and the cost function at that answer, it also shows it's not working:

julia> gradF(M, xMean)
ProductRepr with 2 submanifold components:
 Component 1 =
  2-element MVector{2, Float64} with indices SOneTo(2):
   -1.5261019344236747e-13
   -1.627555396419925e-14
 Component 2 =
  2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
    0.0      1.40057
   -1.40057  0.0


julia> F(M, xMean)
1.4430881464841778

@kellertuer (Member)

Then I think your gradient is wrong? Keep in mind that AD gradients on manifolds (in Manifolds.jl) are experimental only and currently rely heavily on the embedding. Please try with the gradient I proposed.

@Affie (Contributor, Author) commented Jun 19, 2021

I just tried, it's exactly the same.

@kellertuer (Member)

Hm, the gradient really looks strange; since the points are identical in the second component, I have no clue why the gradient is not zero from the start. I currently do not have the time to investigate, but it might be reasonable to just look at an example on Rotations for now.

@kellertuer (Member)

I think you are running into a problem with convexity. Keep in mind that all algorithms here (and in Optim) require convexity for convergence to a minimiser. On manifolds (not all, but for example the sphere or rotations) this is a huge problem, since there is no global convexity. So your starting point has to be close enough.

Just as a short check, take x0 = [0.707107 -0.707107; 0.707107 0.707107] (priorx1/2 also just their second component, Rotations(2) as the manifold, the rest unchanged).

Then surely the gradient is zero and gradient descent stops even before the first iterate. But your starting point might be too far away, I think. So this is, I think, inherently a problem on non-Hadamard manifolds, to use the terminology, i.e., manifolds with finite injectivity radius.
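The finite injectivity radius can already be seen on SO(2), identified with angles on the circle. The following is a plain-Julia sketch with ad-hoc helper names (not Manopt code): the geodesic distance is the wrapped angle difference, and steps are only unambiguous up to π:

```julia
wrap(θ) = mod(θ + π, 2π) - π          # shortest signed angle, in [-π, π)
circdist(a, b) = abs(wrap(b - a))      # geodesic distance on S¹ ≅ SO(2)

θstar = π / 4                          # the target rotation (45°, as above)
gradf(θ) = 2 * wrap(θ - θstar)         # gradient of circdist(θ, θstar)^2

function descend(θ, stepsize, iters)
    for _ in 1:iters
        θ = wrap(θ - stepsize * gradf(θ))
    end
    return θ
end

circdist(descend(0.0, 0.25, 50), θstar)  # ≈ 0: moderate steps converge
# Steps longer than the injectivity radius π can overshoot past the cut
# locus, after which the gradient points "the long way around" and the
# usual convergence guarantees no longer apply.
```

This is the 1-D picture behind the non-Hadamard remark: the squared distance is only convex within the injectivity radius.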

@Affie (Contributor, Author) commented Jun 19, 2021

Ok, thank you. I'll investigate some more at a later stage.

As a last example, I took your advice, tried with Rotations only, and moved the initial value closer to the solution, without success.
Here is the code:

using Manopt
using Manifolds
using StaticArrays

M = SpecialOrthogonal(2)

priorx1 = SA[0.707107 -0.707107; 0.707107 0.707107]

function F(M,x) 
    return distance(M, priorx1, x)
end

function gradF(M, x)
    grad_distance(M, priorx1, x, 1)
end

# start value
x0 = SA[1. 0; 0 1]
x0 = SA[0.632981  -0.774167; 0.774167 0.632981]
# x0 = SA[0.707107 -0.707107; 0.707107 0.707107]
# test cost function
F(M, x0)
gradF(M, x0)

# run
stopping_criterion = (StopWhenAny(StopAfterIteration(2000),StopWhenGradientNormLess(10.0^-8)))
stepsize=ArmijoLinesearch()
# stepsize = ConstantStepsize(0.1)
# stepsize = WolfePowellBinaryLinesearch()

debug= [:Iteration, :Cost,"\n",10]
xMean = gradient_descent(M, F, gradF, x0; stopping_criterion, stepsize, debug)

F(M, xMean)
gradF(M, xMean)

@kellertuer (Member)

Yeah, but again, look at the gradient; you can also use DebugStepsize() (within the debug array) or even

debug = [:Iteration, :Cost, DebugStepsize(), DebugGradientNorm(), "\n", 10]

Then you see: your gradient norm is 1, your step size ends up being 7, and your injectivity radius (in this situation maybe interpretable as the maximum step size that is actually reasonable) is less than 5, from the second iteration onwards. So I would maybe not recommend Armijo here, due to the nonconvexity.
But just now I am also out of ideas why this does not walk towards the minimiser.

@kellertuer (Member) commented Jun 19, 2021

Ah, there is one other major issue with your function.
The sum of two distances, or even a distance itself, is convex but not differentiable (if one point is fixed and the other is the argument). For non-differentiable functions gradient descent is not applicable. Compare the computation of the median in the docs. This will be the main issue here (sorry that I saw this a little late).

edit: Just noticed that your last example is just one point. I am confused that it does not finish in one step, since the gradient points directly at the minimiser anyway. Maybe there is some issue with Rotations, but I did not write that manifold, so I would first have to invest quite some time checking that code, for which I do not have time at the moment (moved internationally, started a new job, these things) – ah no, Rotations look ok themselves.

@mateuszbaran (Member)

This probably refers to the way covariance matrix for a multivariate normal distribution can be obtained from the Hessian of logpdf like here: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470824566.app1 . I guess there should be a manifold equivalent for some generalization of the multivariate normal distribution? And this is the Mahalanobis distance: https://en.wikipedia.org/wiki/Mahalanobis_distance and it uses the covariance matrix.

Yes, that is spot on. If you know of any good sources on the subject it would be great. I still have a lot to learn.

I'm still a beginner in manifold-valued statistics. Sources tend to be very heavy on math. Xavier Pennec does some really cool work on that topic (see for example https://hal.inria.fr/inria-00614994/PDF/Pennec.JMIV06.pdf ), but there may be better sources.

@sethaxen, could you recommend some other good sources for manifold statistics?

Regarding the minimization, I'll try to find some time tomorrow to check it. One quick note, @kellertuer: linesearch_backtrack seems to ignore the injectivity radius; shouldn't that be taken into account? In the example above the step size actually exceeds the injectivity radius of the manifold.

@dehann (Contributor) commented Jun 20, 2021

started new job

Congrats! No worries about the time delay, I know these discussions take up a lot of time; thanks for all the help!

Then also, thanks for the references. I am in need of an on-manifold covariance estimator myself, and will be building one at some point. I'll share back anything useful (xref JuliaRobotics/IncrementalInference.jl#1280).

@kellertuer (Member)

Thanks – I can just not recommend moving between countries in these times; border rules are a little complicated these days.

Xavier Pennec is for sure a perfect source for statistics on manifolds.

One quick note, @kellertuer , linesearch_backtrack seems to ignore injectivity radius, shouldn't that be taken into account? In the example above stepsize actually exceeds injectivity radius of the manifold.

The problem is that I am not aware of works that do take it into account; hence the line search here follows the same ideas as in Manopt and pymanopt. The reason is that even if you exceed the (global) injectivity radius, it might (a) be locally ok, because the radius at that point is larger, or (b) even be globally ok to do the step, though exp is then not injective. So there is no good rule for when to do or not to do a too-large step. Here, however, I am very sure this kills the example.

@mateuszbaran (Member)

Regarding parallel transport, this question: https://math.stackexchange.com/questions/2066207/how-can-one-define-parallel-transport-on-lie-groups relates parallel transport on Lie groups to differentials of left translation. Our metric on SO(n) is bi-invariant, so parallel transport should agree with left translate_diff, which is just the identity, and as far as I can see translate_diff is correct (I've differentiated left translation on paper).

And, regarding statistics on manifolds, one additional thing: I have this book https://www.amazon.pl/Riemannian-Geometric-Statistics-Medical-Analysis/dp/0128147253 on my to-read list; it seems to be not too dense in math.

@mateuszbaran (Member)

If I understand this correctly (see chapter 5.3 of that book, preprint available here: https://hal.inria.fr/hal-02342137/document ), the parallel transport of the (-) Cartan-Schouten connection is the identity on SO(n), but it's different from the Levi-Civita connection.

kellertuer added a commit that referenced this issue Jun 21, 2021
* Update src/tutorials/MeanAndMedian.jl

xref #86 (comment)

Co-authored-by: Ronny Bergmann <[email protected]>
@Affie (Contributor, Author) commented Jun 24, 2021

Ah, there is one other major issue with your function.
The sum of two distances, or even a distance itself, is convex but not differentiable (if one point is fixed and the other is the argument). For non-differentiable functions gradient descent is not applicable. Compare the computation of the median in the docs. This will be the main issue here (sorry that I saw this a little late).

Ahh, thanks. I was working from our factors and missed it, since in our case it's done with the Mahalanobis distance. It should be distance squared. I still can't get a satisfactory result, though.


Perhaps I should take a step back and rather ask how to solve on manifolds such as SE(2), SE(3), SO(2), SO(3), perhaps starting with SE(2) for a POC.
I have factors of this form so far for groups, but am open to any suggestions:

# M - manifold
# p and q are points on the manifold
# m is a measurement from point p
# meas is the prior measurement

function PointPoint(M, m, p, q)
    q̂ = compose(M, p, m)
    return distance(M, q, q̂)^2
end

function Prior(M, meas, p)
    return distance(M, meas, p)^2
end

The factors are combined into cost functions whose form depends on the structure of the factor graph.
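On the translation group these factors have a simple Euclidean reading; here is a library-free stand-in sketch (an assumption: compose becomes vector addition and distance becomes the Euclidean norm, which holds for the flat group but not for SE(2)/SO(2)):

```julia
using LinearAlgebra

# Euclidean stand-ins for the factors above on TranslationGroup(2):
point_point(m, p, q) = norm(q - (p + m))^2   # odometry-style factor
prior(meas, p) = norm(p - meas)^2            # prior factor

# A measurement m that takes p exactly to q gives zero residual:
point_point([1.0, 0.0], [0.0, 0.0], [1.0, 0.0])  # 0.0
prior([1.0, 1.0], [1.0, 2.0])                    # 1.0
```

The manifold versions above replace `+` with `compose` and the norm with the geodesic distance, which is where the convexity caveats discussed below come in.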

@kellertuer (Member)

Distance squared is better; that is geodesically convex (so locally it is convex).
I am not sure how compose (and the chain rule) act here, though.

For distance squared, keep in mind that globally the minimiser might still be non-unique. For example, on the sphere, if you minimise with respect to x the function

distance(M, n, x)^2 + distance(M, s, x)^2

where n and s are the North and South Pole, respectively. Then any x on the equator is a minimiser of the function.
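A quick numeric check of the equator claim, with the sphere distance written out in plain Julia (a sketch for the unit sphere S² embedded in R³):

```julia
using LinearAlgebra

# Geodesic distance on the unit sphere; clamp guards against rounding.
sphere_distance(p, q) = acos(clamp(dot(p, q), -1.0, 1.0))

n = [0.0, 0.0, 1.0]    # North Pole
s = [0.0, 0.0, -1.0]   # South Pole
f(x) = sphere_distance(n, x)^2 + sphere_distance(s, x)^2

# Every point on the equator attains the same value 2 * (π/2)^2 = π²/2:
f([1.0, 0.0, 0.0])              # π²/2
f([0.0, 1.0, 0.0])              # π²/2
f(normalize([1.0, 1.0, 0.0]))   # π²/2 again
```

So the sum of squared distances is constant on the whole equator, illustrating the non-unique minimiser.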

Your last sentence I again can't follow: what factors? What is combined here? Which graph? What is its structure?

@Affie (Contributor, Author) commented Jun 25, 2021

Hi, I found something interesting: if I set retraction_method=QRRetraction() in the basic Rotations example, it works as expected.
I looked at the exponential retraction on Rotations but can't find anything wrong with it.

@kellertuer (Member)

That is good to know :)
That was one of my concerns when I wrote this #86 (comment), but all tests thereon look ok. It might be good to take another look at exp/log on Rotations then, because I also could not find anything wrong with them. Or maybe they are at least not numerically stable currently; I am not sure, I usually do not work on that manifold.

Retractions should locally be as fine as exp (inverse retractions as fine as log), but are usually cheaper or numerically more stable. So good that you checked.

@Affie (Contributor, Author) commented Jun 25, 2021

For distance squared, keep in mind that globally the minimiser might still be non-unique. For example, on the sphere, if you minimise with respect to x the function
distance(M, n, x)^2 + distance(M, s, x)^2
where n and s are the North and South Pole, respectively. Then any x on the equator is a minimiser of the function.

That is where IncrementalInference.jl comes in: in short, you will get the ring around the equator as a belief if you set it up that way. For example, figure 1 in https://marinerobotics.mit.edu/sites/default/files/fourie_iros19_manifolds.pdf has a similar case on a circular manifold.

@Affie (Contributor, Author) commented Jun 25, 2021

Your last sentence I again can't follow: what factors? What is combined here? Which graph? What is its structure?

Sorry, an explanation would be too much for a comment. Perhaps a basic example:

# for a user-defined graph that looks like this:

p -- x1 -- f1 -- x2 -- f2 -- x3
      \                    /
        ------- f3 -------

# the cost would look like this (the factors above already square the distances):
cost(M, x) = Prior(M, p, x[1]) + PointPoint(M, f1, x[1], x[2]) +
             PointPoint(M, f2, x[2], x[3]) + PointPoint(M, f3, x[1], x[3])
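Putting the pieces together, the Euclidean/translation-group version of this little graph can be solved with plain gradient descent. This is a library-free sketch under two stated assumptions: compose is vector addition, and the measurements below are made-up, self-consistent numbers; the gradient uses finite differences to keep it short:

```julia
using LinearAlgebra

# Translation-group stand-ins for the factors (compose = +):
point_point(m, p, q) = norm(q - (p + m))^2
prior(meas, p) = norm(p - meas)^2

# Hypothetical, self-consistent measurements for the graph above.
p  = [0.0, 0.0]                  # prior on x1
f1 = [1.0, 0.0]                  # odometry x1 -> x2
f2 = [1.0, 0.0]                  # odometry x2 -> x3
f3 = [2.0, 0.0]                  # loop closure x1 -> x3

cost(x) = prior(p, x[1]) +
          point_point(f1, x[1], x[2]) +
          point_point(f2, x[2], x[3]) +
          point_point(f3, x[1], x[3])

# Coordinate-wise finite-difference gradient descent; crude but sufficient
# for this small quadratic problem.
function solve(x; γ = 0.1, iters = 500, h = 1e-6)
    x = deepcopy(x)
    for _ in 1:iters, i in eachindex(x), j in eachindex(x[i])
        xh = deepcopy(x)
        xh[i][j] += h
        x[i][j] -= γ * (cost(xh) - cost(x)) / h
    end
    return x
end

sol = solve([zeros(2), zeros(2), zeros(2)])
# sol should approach [[0, 0], [1, 0], [2, 0]] with cost near zero
```

The manifold version replaces `+` with `compose` and the finite-difference step with a Riemannian gradient and retraction, which is where the Manopt machinery discussed in this thread takes over.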

@JuliaManifolds JuliaManifolds locked and limited conversation to collaborators Jun 25, 2021
