This issue was moved to a discussion. You can continue the conversation there.

Basic gradient_descent optimization fails #86

Closed
Affie opened this issue Jun 18, 2021 · 43 comments

@Affie (Contributor) commented Jun 18, 2021

Hi, I'm new to Manopt and I'm trying to figure out how to integrate Manopt into IncrementalInference (see also JuliaRobotics/IncrementalInference.jl#1276 for more complex attempts).

The optimization is formulated as a factor graph, and this example represents a simple factor graph with one variable and two priors (so essentially finding the point that minimizes the combined distance to two points). I don't get the results I expect, so I made this MWE:

using Manopt
using Manifolds
using StaticArrays

# "factor"
function Prior_distance(M, meas, p)		
    return distance(M, meas, p)
end

## sanity check factors
M = TranslationGroup(2)

p = SA[1., 0]
q = SA[1., 1]
Prior_distance(M, p, q)

p = SA[0., 1]
q = SA[1., 1]
Prior_distance(M, p, q)

p = SA[0., -1]
q = SA[0., 0]
Prior_distance(M, p, q)

p = SA[0., -1]
q = SA[0., -1]
Prior_distance(M, p, q)

## Try  with Manopt

priorx1 = SA[1.1, 1.0]
priorx2 = SA[0.9, 1.0]

# Factor graph with 2 priors on variable x
function F(M,x) 
    return Prior_distance(M, priorx1, x) + Prior_distance(M, priorx2, x)
end

function gradF(M, x)
    Manifolds.gradient(M,(x)->F(M,x), x)
end

# start value
x0 = SA[0.,0]
# test cost function
F(M, x0)
gradF(M, x0)

# run
xMean = gradient_descent(M, F, gradF, x0)
F(M, xMean)
gradF(M, xMean)

#= results
# I expect [1,1]
julia> xMean = gradient_descent(M, F, gradF, x0)
2-element MVector{2, Float64} with indices SOneTo(2):
  0.9859635170635069
 -0.0031281100934246187

# the cost is still high
julia> F(M, xMean)
2.0163938998034947

# and the gradient also
julia> gradF(M, xMean)
2-element MVector{2, Float64} with indices SOneTo(2):
 -0.02757083871664094
 -1.9899485375183275
=#

cc @dehann

EDIT: I'm on
[1cead3c2] Manifolds v0.5.4
[0fc0a36d] Manopt v0.3.9

@kellertuer (Member)

Interesting.
First, you do not really have to use the AD tools (from Manifolds.gradient), since for distance there are gradients available as grad_distance(M, p, x, exponent=2) (that is, the default gradient is for the distance squared), so it is a little more efficient to use

gradF(M, x) = grad_distance(M, priorx1, x, 1) + grad_distance(M, priorx2, x, 1)

This yields exactly the same numbers as your gradient, but also works on manifolds other than R^2 (your manifold is essentially Euclidean(2) here).

Then, to see a little more of what is happening (though I saw the reason directly), you can do

 xMean = gradient_descent(M, F, gradF, x0; debug= [:Iteration, :Cost,"\n",10])

to print a line with the iteration and the cost (followed by a newline) every 10th iteration.

If you then look at the default values of gradient descent, you can see that it works with a constant step size by default (machine learning people call that a learning rate), which is not guaranteed to converge (ever).
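To make the constant-step-size issue concrete, here is a plain-Julia sketch of the same cost directly in R^2, with no Manopt involved. The constant step of 1.0 and the 1/k schedule are illustrative assumptions, not Manopt's actual internals; the cost here is a sum of two unsquared distances, which is convex but nonsmooth at the prior points (its minimal value is 0.2, the distance between the two priors):

```julia
using LinearAlgebra

# The cost from the issue, written directly in R^2.
p1, p2 = [1.1, 1.0], [0.9, 1.0]
f(x) = norm(x - p1) + norm(x - p2)
gradf(x) = (x - p1) / norm(x - p1) + (x - p2) / norm(x - p2)

# Generic (sub)gradient descent with a prescribed step-size sequence.
function descend(x, steps)
    for s in steps
        x = x - s * gradf(x)
    end
    return x
end

xc = descend([0.0, 0.0], fill(1.0, 200))            # constant step size
xd = descend([0.0, 0.0], [1.0 / k for k in 1:2000]) # diminishing step sizes

f(xc)  # typically ends up hopping around the kink, far above the optimum 0.2
f(xd)  # close to the optimal value 0.2, near the minimising segment
```

With a constant step the iterates settle into a cycle around the nonsmooth point (much like the cost of about 2.0 reported above), while the square-summable 1/k schedule is the textbook remedy for convex nonsmooth costs.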

But you can exchange that, see https://manoptjl.org/stable/plans/index.html#Stepsize-1, for a step size that might fit better, for example the Armijo step size rule; its defaults are chosen to be just fine, so

julia> xMean = gradient_descent(M, F, gradF, x0; stepsize=ArmijoLinesearch(),  debug= [:Iteration, :Cost,"\n",10])
InitialF(x): 2.831969279439222
# 10F(x): 0.2578005738010819
# 20F(x): 0.2000672051680747
# 30F(x): 0.20000006190528008
# 40F(x): 0.20000000005719792
# 50F(x): 0.2000000000000529
# 60F(x): 0.20000000000000012
2-element MVector{2, Float64} with indices SOneTo(2):
 1.0000197125126835
 0.9999999998505151

Is this close enough? If not, you can tweak the stopping criterion, see https://manoptjl.org/stable/solvers/index.html#StoppingCriteria-1, and note that you can combine stopping criteria just with & or |. The default is StopAfterIteration(200) | StopWhenGradientNormLess(10.0^-8), so it stops after 200 iterations or when the gradient norm is less than 10.0^-8, whichever hits first. Let's also increase the debug interval to 25 and we get for example

julia> xMean = gradient_descent(M, F, gradF, x0;
   stepsize=ArmijoLinesearch(),
   stopping_criterion=StopAfterIteration(200) | StopWhenGradientNormLess(10.0^-12),
   debug=[:Iteration, :Cost,"\n",25]
)
InitialF(x): 2.831969279439222
# 25F(x): 0.2000020367614238
# 50F(x): 0.2000000000000529
2-element MVector{2, Float64} with indices SOneTo(2):
 1.0000197125126835
 0.9999999999999954

Let me know if this solves your problem, and whether there is anything missing in the docs or could be rephrased therein.

One could for example change the default to use Armijo – though that by default needs quite a few function evaluations to determine the step size.

@kellertuer (Member) commented Jun 18, 2021

Addendum, after clicking on your link – it seems the allocate in ManifoldsBase does something strange in NelderMead, but if you provide the population yourself and use the mutating variant

NelderMead!(M.manifold, F, [random_point(M) for _=1:3])

it does something. Note that Nelder–Mead is not a very efficient algorithm and should only be used if you do not have any gradient.

edit: sorry, my bad :) I do not work with NelderMead usually.

You should also call NelderMead not with an initial point but with an initial population (manifold_dimension + 1 points), so

NelderMead(M.manifold, F, [random_point(M) for _=1:3])

just works fine, too as does

NelderMead(M.manifold, F)

But note that its convergence is veeeeeery slow (the default 2000 iterations are not enough).
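To make the population requirement concrete, here is a bare-bones Euclidean Nelder–Mead sketch (reflection and shrink steps only, plain Julia; this is not Manopt.jl's implementation, which generalizes the centroid and reflection to manifolds). The simplex needs dimension + 1 vertices, which is exactly the manifold_dimension + 1 points mentioned above:

```julia
# f is the cost; pop is the simplex: dimension + 1 vertices as vectors.
function nelder_mead(f, pop; iters = 200)
    pop = sort(pop; by = f)
    for _ in 1:iters
        c = sum(pop[1:end-1]) / (length(pop) - 1)  # centroid of all but the worst
        r = c + (c - pop[end])                     # reflect the worst vertex
        if f(r) < f(pop[end])
            pop[end] = r                           # keep the improvement
        else
            for i in 2:length(pop)                 # otherwise shrink towards best
                pop[i] = pop[1] + 0.5 * (pop[i] - pop[1])
            end
        end
        sort!(pop; by = f)
    end
    return pop[1]                                  # best vertex found
end

g(x) = (x[1] - 1)^2 + (x[2] - 2)^2
best = nelder_mead(g, [zeros(2), [1.0, 0.0], [0.0, 1.0]])
```

The three starting vectors play the role of the `[random_point(M) for _=1:3]` population above; `best` should end up near [1, 2] on this toy quadratic.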

@dehann (Contributor) commented Jun 18, 2021

Hi Ronny Johan, thanks this helps me follow better too.

it seems the allocate in ManifoldsBase does something strange in NelderMead

hmm, makes me curious about #64 again. I'll check when I get a chance and comment there if something shows up.

@kellertuer (Member)

Hi Ronny Johan, thanks this helps me follow better too.

it seems the allocate in ManifoldsBase does something strange in NelderMead

hmm, makes me curious about #64 again. I'll check when I get a chance and comment there if something shows up.

No, that was my mistake at first glance – the error is that you have to provide a population, not a starting point. Then everything works :)

@kellertuer (Member)

Oh and a final look, since you used BFGS in Optim. We have that on manifolds, too, just use

julia> quasi_Newton(M, F, gradF, x0)
2-element MVector{2, Float64} with indices SOneTo(2):
 1.0204551069445331
 1.000000000000014

whose default is the (inverse) BFGS here, too.

@dehann (Contributor) commented Jun 18, 2021

Ah, got it thanks!

@kellertuer (Member)

Perfect. Let me know when you think a default should be adapted or the docs should be changed (for example at first glance Stopping Criteria and Step Size rules could be made a little more prominent).

@dehann (Contributor) commented Jun 18, 2021

Stopping Criteria and Step Size rules could be made a little more prominent

Maybe just a !!! note :-)

@kellertuer (Member)

Where? The default step size is mentioned at every algorithm, as are the default stopping criteria.
I was planning to give them separate pages, maybe.

@dehann (Contributor) commented Jun 18, 2021

I made a PR for where I would add it... Feel free to ignore it if that does not fit/work.

Reason for putting it there is that when following a design template example for the first time, one expects the defaults to work (regardless of performance). When I get stuck on the first example, I find myself scrolling to the bottom of the "Getting Started" page hoping to find the obvious stumbling blocks. An FAQ is the second-best place (but can be a discouraging search). Once users see their own first example working (different from just copy-paste), they start to invest more time in reading deeper into the docs. Training wheels first, with the absolute minimum of detail to get going, then scale up the detail from there.

@kellertuer (Member)

Manopt.jl does not have a FAQ.
I will comment on the idea in the PR.

@Affie (Contributor, Author) commented Jun 18, 2021

Thanks a lot, I did find stopping criteria and played around with it. However, I expected the default to converge in such a simple example.

There are quite a few stumbling blocks for me, but I can already see the potential of Manopt. I know it's a relatively new package and perhaps we can help build out examples.

Reason for putting it there is that when following a design template example for the first time, one expects the defaults to work (regardless of performance). When I get stuck on the first example, I find myself scrolling to the bottom of the "Getting Started" page hoping to find the obvious stumbling blocks.

So I'm in agreement with this, and that defaults should rather just work (at the possible cost of performance, with a note or something), especially if the user comes from other packages like Optim or JuMP, where one can be a user without understanding the underlying algorithm.

Perhaps something like this will help a new user like me a lot, from the Optim.jl docs:

Summary
As a very crude heuristic:
For a low-dimensional problem with analytic gradients and Hessians, use the Newton method with trust region. For larger problems or when there is no analytic Hessian, use LBFGS, and tweak the parameter m if needed. If the function is non-differentiable, use Nelder-Mead. Use the HagerZhang linesearch for robustness and BackTracking for speed.

@Affie (Contributor, Author) commented Jun 18, 2021

It's perhaps not part of this issue, but let me list some problems I encountered.
I run into a lot of error messages and don't know whether it is something I'm doing wrong (in the NelderMead case):

Manopt.NelderMead(MP, F, xp)
ERROR: MethodError: no method matching NelderMeadOptions(::ProductRepr{Tuple{ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}}}, ::StopAfterIteration; α=1.0, γ=2.0, ρ=0.5, σ=0.5, retraction_method=ExponentialRetraction(), inverse_retraction_method=LogarithmicInverseRetraction())

or perhaps other errors, eg:

julia> xMean = quasi_Newton(M, F_prior, gradF_prior, x; debug= [:Iteration, :Cost,"\n",10])
InitialF(x): 1.4945569745366585
ERROR: vector_transport_to! not implemented on ProductManifold(TranslationGroup(2; field = ℝ), SpecialOrthogonal(2)) for arguments ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}} and ParallelTransport.
julia> particle_swarm(M, F_prior; n, x0)
ERROR: ArgumentError: broadcasting requires an assigned BroadcastStyle
gradF(MP, xp)
ERROR: MethodError: no method matching fill!(::ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ::Int64)

The other issues are more to do with a lack of knowledge coming from "I just want to optimize my cost function", such as

  • the choice of algorithm and parameters. (basically ran them and compared results)
  • how to write my gradient function (just used automatic diff)
  • how to work with covariances (previously used the Mahalanobis distance for measurements and the Hessian to estimate the answer)

I hope this helps improve the quality of this great package and let me know if there is anywhere I can help. (although I still have a lot to learn)

@kellertuer (Member)

For the first and simple example there is a certain tradeoff, especially on manifolds.
The constant step size might not converge, but other rules, which require function evaluations and/or exponential maps/retractions (Armijo needs quite a lot of both), might take a long computational time. One has the choice between a user "stumbling" over either one of these.

We can still, for sure, switch to Armijo; I don't have a strong preference there.
The main problem is not that Manopt is that new (I started in 2016, but only got a proper first version in 2018), but that I am the only developer. Sure, some algorithms were provided by others, some of them students of mine.
That's why some parts have not been used extensively yet, and extensive use is usually what strengthens notes and tutorials. I hope, though, that the documentation is complete (concerning optional arguments, for example).

The summary sounds like a good idea, and it holds here completely analogously for Newton with TR and for LBFGS (BFGS uses https://manoptjl.org/stable/solvers/quasi_Newton.html#Manopt.AbstractQuasiNewtonDirectionUpdate; switch to https://manoptjl.org/stable/solvers/quasi_Newton.html#Manopt.QuasiNewtonLimitedMemoryDirectionUpdate for the LBFGS one, which might also be good to have a shorter constructor for).

Additionally, for non-smooth functions, if you have proximal maps, use CyclicProximalPoint or DouglasRachford (Optim does not have those, but this package also has a slightly different focus, namely to include non-smooth optimisation on manifolds), or, if you are adventurous, the ChambollePock algorithm from the most recent paper.

For the summary – 

@Affie (Contributor, Author) commented Jun 18, 2021

For the summary –

Thanks! A summary like that is perfect. I was actually looking for LBFGS, but didn't recognize it until you made it obvious just now 🤦

@kellertuer (Member)

Manopt.NelderMead(MP, F, xp)
ERROR: MethodError: no method matching NelderMeadOptions(::ProductRepr{Tuple{ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}}}, ::StopAfterIteration; α=1.0, γ=2.0, ρ=0.5, σ=0.5, retraction_method=ExponentialRetraction(), inverse_retraction_method=LogarithmicInverseRetraction())

For this one, without more details I cannot say much – where was it thrown, how did you call NelderMead, what are the variables? What is the call stack? I sadly cannot just guess from a signature what is wrong.

or perhaps other errors, eg:

julia> xMean = quasi_Newton(M, F_prior, gradF_prior, x; debug= [:Iteration, :Cost,"\n",10])
InitialF(x): 1.4945569745366585
ERROR: vector_transport_to! not implemented on ProductManifold(TranslationGroup(2; field = ℝ), SpecialOrthogonal(2)) for arguments ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}} and ParallelTransport.

This seems more like an issue for Manifolds.jl, since vector_transport_to! is something I use from the ManifoldsBase.jl interface. At first glance (same comment as above: missing call stack / MWE / which variables are used?) I would say that vector_transport_to! seems to not pass through the group manifold correctly, i.e., it should "unpack" the group-manifold decorator and call vector_transport_to! on the underlying manifold (without the group structure).

julia> particle_swarm(M, F_prior; n, x0)
ERROR: ArgumentError: broadcasting requires an assigned BroadcastStyle

Again:

  • Which manifold?
  • Which x0? I think, again, you used a single point x0 and not a population (you have to provide manifold_dimension(M)+1 points if you set that parameter).
  • What is n?
gradF(MP, xp)
ERROR: MethodError: no method matching fill!(::ProductRepr{Tuple{MVector{2, Float64}, MMatrix{2, 2, Float64, 4}}}, ::Int64)

What is gradF here? That seems to be a function you defined yourself. Mutating or non-mutating? Mutating might not work 100% with ProductRepr, but Manopt.jl supports both types: functions of the form grad_F(M, p) that return the gradient, and grad_F!(M, X, p) that compute the result in place in X.
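The two signatures can be sketched in plain Julia with a Euclidean toy cost f(p) = |p|², so no manifold library is needed; the manifold argument M is just passed through, as in the Manopt calling convention described above:

```julia
# Allocating form: returns a freshly computed gradient vector.
grad_F(M, p) = 2 .* p

# Mutating form: writes the gradient into the preallocated tangent vector X.
function grad_F!(M, X, p)
    X .= 2 .* p
    return X
end

X = zeros(2)
grad_F!(nothing, X, [1.0, 2.0])   # X is now [2.0, 4.0]
grad_F(nothing, [1.0, 2.0])       # returns [2.0, 4.0]
```

The mutating form avoids allocations in inner loops, which is exactly why solvers offer both; with ProductRepr points the in-place broadcast is the part that can be fragile, as noted above.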

The other issues are more to do with a lack of knowledge coming from "I just want to optimize my cost function", such as

  • the choice of algorithm and parameters. (basically ran them and compared results)

That cannot be answered generically; it depends heavily on the algorithm and the problem.

  • how to write my gradient function (just used automatic diff)

AD is something that is not yet well covered (see my note above on how many developers there are – just me), so the only AD support is the one you already used in Manifolds.jl. AD is on my list, but that alone might still take quite some time.

  • how to work with covariances (previously used the Mahalanobis distance for measurements and the Hessian to estimate the answer)

I do not understand this. What is Mahalanobis? Where do you want to work with covariances? As a manifold?

@kellertuer (Member)

For the summary –

Thanks! A summary like that is perfect. I was actually looking for LBFGS, but didn't recognize it until you made it obvious just now 🤦

This might mean, as I said, that LBFGS should get a nice short constructor and a nice tutorial. This algorithm was the bachelor thesis of a student of mine, and a tutorial was not part of that ;)

@Affie (Contributor, Author) commented Jun 18, 2021

For this one, without more details I cannot say much

Yes, sorry, my intention was not to get answers to the questions (thanks for trying, though) but just to list general stumbling blocks.
I'll find MWEs for all the errors I encounter, if that helps. Perhaps some of them can make it into the tests.
Give me some time and I'll list them in a new issue.

@kellertuer (Member)

Sure, carefully craft MWEs or MnWEs and open issues – but also look at where the issue might come from, i.e., whether it is more a Manopt or a Manifolds issue.

@mateuszbaran (Member)

  • how to work with covariances (previously used the Mahalanobis distance for measurements and the Hessian to estimate the answer)

I do not understand this. What is Mahalanobis? Where do you want to work with covariances? As a manifold?

This probably refers to the way covariance matrix for a multivariate normal distribution can be obtained from the Hessian of logpdf like here: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470824566.app1 . I guess there should be a manifold equivalent for some generalization of the multivariate normal distribution? And this is the Mahalanobis distance: https://en.wikipedia.org/wiki/Mahalanobis_distance and it uses the covariance matrix.
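For concreteness, the Mahalanobis distance and the Hessian-to-covariance link in plain Julia, a sketch of the Euclidean case only (the function name is ad hoc):

```julia
using LinearAlgebra

# Mahalanobis distance: d(x, μ) = sqrt((x - μ)' Σ⁻¹ (x - μ)).
mahalanobis(x, μ, Σ) = sqrt(dot(x - μ, Σ \ (x - μ)))

# For a multivariate normal, logpdf(x) = -0.5 * d(x, μ)^2 + const, so the
# negative Hessian of the logpdf at the mode is exactly Σ⁻¹: that is the
# covariance-from-Hessian link referenced above.
Σ = [2.0 0.0; 0.0 0.5]
mahalanobis([1.0, 1.0], [0.0, 0.0], Σ)  # sqrt(0.5 + 2.0) = sqrt(2.5)
```

How this generalizes to tangent spaces on a manifold is exactly the open question in this thread; the sketch only pins down the flat-space definitions.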

@kellertuer (Member)

Ah, that was too long a leap for me to make from this short note – especially since I am more into optimization and less into distributions. Thanks for the information; sure, this could be used in tangent spaces to some extent, though I am not sure about the details.

@Affie (Contributor, Author) commented Jun 19, 2021

This probably refers to the way covariance matrix for a multivariate normal distribution can be obtained from the Hessian of logpdf like here: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470824566.app1 . I guess there should be a manifold equivalent for some generalization of the multivariate normal distribution? And this is the Mahalanobis distance: https://en.wikipedia.org/wiki/Mahalanobis_distance and it uses the covariance matrix.

Yes, that is spot on. If you know of any good sources on the subject it would be great. I still have a lot to learn.

@Affie (Contributor, Author) commented Jun 19, 2021

Unfortunately, I still don't get the results I expect in general; here is the basic example updated to SE(2).

I experimented with the step size, but it didn't seem to make a difference.

using Manopt
using Manifolds
using StaticArrays

# "factor"
function Prior_distance(M, meas, p)		
    return distance(M, meas, p)
end

M = SpecialEuclidean(2)

priorx1 = ProductRepr(SA[1.1, 1.0], SA[0.707107 -0.707107; 0.707107 0.707107])
priorx2 = ProductRepr(SA[0.9, 1.0], SA[0.707107 -0.707107; 0.707107 0.707107])

# Factor graph with 2 priors on variable x
function F(M,x) 
    return Prior_distance(M, priorx1, x) + Prior_distance(M, priorx2, x)
end

function gradF(M, x)
    Manifolds.gradient(M,(x)->F(M,x), x)
end

# start value
x0 = ProductRepr(SA[0.,0], SA[1. 0; 0 1])
# test cost function
F(M, x0)
gradF(M, x0)

# run
xMean = gradient_descent(M, F, gradF, x0; stepsize=ArmijoLinesearch(), debug= [:Iteration, :Cost,"\n",10])

InitialF(x): 3.600341492608454
# 10F(x): 3.526239029861806
# 20F(x): 3.5165928358291243
# 30F(x): 3.5165823494284916
# 40F(x): 3.516582337500095
# 50F(x): 3.516582337489366
# 60F(x): 3.5165823374868284
# 70F(x): 3.5165823374884537
# 80F(x): 3.5165823374881358
# 90F(x): 3.516582337486814
# 100F(x): 3.5165823374881002
# 110F(x): 3.516582337487348
# 120F(x): 3.516582337489555
# 130F(x): 3.5165823374886473
# 140F(x): 3.516582337487341
# 150F(x): 3.516582337488459
# 160F(x): 3.5165823374871947
# 170F(x): 3.5165823374887344
# 180F(x): 3.5165823374883733
# 190F(x): 3.5165823374862595
# 200F(x): 3.516582337488268
ProductRepr with 2 submanifold components:
 Component 1 =
  2-element MVector{2, Float64} with indices SOneTo(2):
   0.9999999999999774
   1.0000000000001388
 Component 2 =
  2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
   -0.440261  -0.89787
    0.89787   -0.440261

The same code errors with quasi_Newton, see JuliaManifolds/Manifolds.jl#384.

@kellertuer (Member)

What do you expect?

@Affie (Contributor, Author) commented Jun 19, 2021

The rotation part should be

Component 2 =
  2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
   0.707107  -0.707107
   0.707107   0.707107

If you look at the gradient and the cost function at that answer, it also shows it's not working:

julia> gradF(M, xMean)
ProductRepr with 2 submanifold components:
 Component 1 =
  2-element MVector{2, Float64} with indices SOneTo(2):
   -1.5261019344236747e-13
   -1.627555396419925e-14
 Component 2 =
  2×2 MMatrix{2, 2, Float64, 4} with indices SOneTo(2)×SOneTo(2):
    0.0      1.40057
   -1.40057  0.0


julia> F(M, xMean)
1.4430881464841778

@kellertuer (Member)

Then I think your gradient is wrong? Keep in mind that AD gradients on manifolds (in Manifolds.jl) are experimental only and currently rely heavily on the embedding. Please try with the gradient I proposed.

@Affie (Contributor, Author) commented Jun 19, 2021

I just tried, it's exactly the same.

@kellertuer (Member)

Hm, the gradient really looks strange; since the points are identical in the second component, I have no clue why the gradient is not zero from the start. I currently do not have the time to investigate, but it might be reasonable to just look at an example on Rotations for now.

@kellertuer (Member)

I think you are running into a problem with convexity. Keep in mind that all algorithms here (and in Optim) require convexity for convergence to a minimiser. On manifolds (not all, but for example the sphere or rotations) this is a huge problem, since there is no global convexity. So your starting point has to be close enough.

Just as a short check, take x0 = [0.707107 -0.707107; 0.707107 0.707107] (priorx1/2 also just their second component, Rotations(2) as the manifold, the rest unchanged).

Then surely the gradient is zero and gradient descent stops even before the first iterate. But your starting point might be too far away, I think. So this is, I think, inherently a problem on non-Hadamard manifolds, to use the terminology, i.e., manifolds with finite injectivity radius.
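The finite injectivity radius can already be seen on SO(2), identified with angles on the circle. The following is a plain-Julia sketch with ad-hoc helper names (not Manopt code): the geodesic distance is the wrapped angle difference, and steps are only unambiguous up to π:

```julia
wrap(θ) = mod(θ + π, 2π) - π          # shortest signed angle, in [-π, π)
circdist(a, b) = abs(wrap(b - a))      # geodesic distance on S¹ ≅ SO(2)

θstar = π / 4                          # the target rotation (45°, as above)
gradf(θ) = 2 * wrap(θ - θstar)         # gradient of circdist(θ, θstar)^2

function descend(θ, stepsize, iters)
    for _ in 1:iters
        θ = wrap(θ - stepsize * gradf(θ))
    end
    return θ
end

circdist(descend(0.0, 0.25, 50), θstar)  # ≈ 0: moderate steps converge
# Steps longer than the injectivity radius π can overshoot past the cut
# locus, after which the gradient points "the long way around" and the
# usual convergence guarantees no longer apply.
```

This is the 1-D picture behind the non-Hadamard remark: the squared distance is only convex within the injectivity radius.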

@Affie (Contributor, Author) commented Jun 19, 2021

Ok, thank you. I'll investigate some more at a later stage.

As a last example, I took your advice, tried with Rotations only, and moved the initial value closer to the solution, without success.
Here is the code:

using Manopt
using Manifolds
using StaticArrays

M = SpecialOrthogonal(2)

priorx1 = SA[0.707107 -0.707107; 0.707107 0.707107]

function F(M,x) 
    return distance(M, priorx1, x)
end

function gradF(M, x)
    grad_distance(M, priorx1, x, 1)
end

# start value
x0 = SA[1. 0; 0 1]
x0 = SA[0.632981  -0.774167; 0.774167 0.632981]
# x0 = SA[0.707107 -0.707107; 0.707107 0.707107]
# test cost function
F(M, x0)
gradF(M, x0)

# run
stopping_criterion = (StopWhenAny(StopAfterIteration(2000),StopWhenGradientNormLess(10.0^-8)))
stepsize=ArmijoLinesearch()
# stepsize = ConstantStepsize(0.1)
# stepsize = WolfePowellBinaryLinesearch()

debug= [:Iteration, :Cost,"\n",10]
xMean = gradient_descent(M, F, gradF, x0; stopping_criterion, stepsize, debug)

F(M, xMean)
gradF(M, xMean)

@kellertuer (Member)

Yeah, but again, look at the gradient; you can also use DebugStepsize() (within the debug array) or even

debug = [:Iteration, :Cost, DebugStepsize(), DebugGradientNorm(), "\n", 10]

Then you see: your gradient norm is 1, your step size ends up being 7, and your injectivity radius (in this situation maybe interpretable as the maximum step size that is actually reasonable) is less than 5, from the second iteration onwards. So I would maybe not recommend Armijo here, due to the nonconvexity.
But just now I am also out of ideas why this does not walk towards the minimiser.

@kellertuer (Member) commented Jun 19, 2021

Ah, there is one other major issue with your function.
The sum of two distances, or even a distance itself, is convex but not differentiable (if one point is fixed and the other is the argument). For non-differentiable functions gradient descent is not applicable. Compare the computation of the median in the docs. This will be the main issue here (sorry that I saw this a little late).

edit: Just noticed that your last example is just one point. I am confused that it does not finish in one step, since the gradient points directly at the minimiser anyway. Maybe there is some issue with Rotations, but I did not write that manifold, so I would first have to invest quite some time checking that code, for which I do not have time at the moment (moved internationally, started a new job, these things) – ah no, Rotations look ok themselves.

@mateuszbaran (Member)

This probably refers to the way covariance matrix for a multivariate normal distribution can be obtained from the Hessian of logpdf like here: https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470824566.app1 . I guess there should be a manifold equivalent for some generalization of the multivariate normal distribution? And this is the Mahalanobis distance: https://en.wikipedia.org/wiki/Mahalanobis_distance and it uses the covariance matrix.

Yes, that is spot on. If you know of any good sources on the subject it would be great. I still have a lot to learn.

I'm still a beginner in manifold-valued statistics. Sources tend to be very heavy on math. Xavier Pennec does some really cool work on that topic (see for example https://hal.inria.fr/inria-00614994/PDF/Pennec.JMIV06.pdf ), but there may be better sources.

@sethaxen, could you recommend some other good sources for manifold statistics?

Regarding the minimization, I'll try to find some time tomorrow to check it. One quick note, @kellertuer: linesearch_backtrack seems to ignore the injectivity radius; shouldn't that be taken into account? In the example above the step size actually exceeds the injectivity radius of the manifold.

@dehann (Contributor) commented Jun 20, 2021

started new job

Congrats! No worries about the time delay, I know these discussions take up a lot of time; thanks for all the help!

Then also, thanks for the references. I am in need of an on-manifold covariance estimator myself, and will be building one at some point. I'll share back anything useful (xref JuliaRobotics/IncrementalInference.jl#1280).

@kellertuer (Member)

Thanks – I can just not recommend moving between countries in these times; border rules are a little complicated these days.

Xavier Pennec is for sure a perfect source for statistics on manifolds.

One quick note, @kellertuer , linesearch_backtrack seems to ignore injectivity radius, shouldn't that be taken into account? In the example above stepsize actually exceeds injectivity radius of the manifold.

The problem is that I am not aware of works that do take it into account; hence the line search here follows the same ideas as in Manopt and pymanopt. The reason is that even if you exceed the (global) injectivity radius, it might (a) be locally ok, because the radius at that point is larger, or (b) even be globally ok to do the step, though exp is then not injective. So there is no good rule for when to do or not to do a too-large step. Here, however, I am very sure this kills the example.

@mateuszbaran (Member)

Regarding parallel transport, this question: https://math.stackexchange.com/questions/2066207/how-can-one-define-parallel-transport-on-lie-groups relates parallel transport on Lie groups to differentials of left translation. Our metric on SO(n) is bi-invariant, so parallel transport should agree with left translate_diff, which is just the identity, and as far as I can see translate_diff is correct (I've differentiated left translation on paper).

And, regarding statistics on manifolds, one additional thing: I have this book https://www.amazon.pl/Riemannian-Geometric-Statistics-Medical-Analysis/dp/0128147253 on my to-read list; it seems to be not too dense in math.

@mateuszbaran (Member)

If I understand this correctly (see chapter 5.3 of that book, preprint available here: https://hal.inria.fr/hal-02342137/document ), the parallel transport of the (-) Cartan-Schouten connection is the identity on SO(n), but it's different from the Levi-Civita connection.

kellertuer added a commit that referenced this issue Jun 21, 2021
* Update src/tutorials/MeanAndMedian.jl

xref #86 (comment)

Co-authored-by: Ronny Bergmann <[email protected]>
@Affie (Contributor, Author) commented Jun 24, 2021

Ah, there is one other major issue with your function.
The sum of two distances, or even a distance itself, is convex but not differentiable (if one point is fixed and the other is the argument). For non-differentiable functions gradient descent is not applicable. Compare the computation of the median in the docs. This will be the main issue here (sorry that I saw this a little late).

Ahh, thanks. I was working from our factors and missed it, since in our case it's done with the Mahalanobis distance. It should be distance squared. I still can't get a satisfactory result, though.


Perhaps I should take a step back and rather ask how to solve on manifolds such as SE(2), SE(3), SO(2), SO(3), perhaps starting with SE(2) for a POC.
I have factors of this form so far for groups, but am open to any suggestions:

# M - manifold
# p and q are points on the manifold
# m is a measurement from point p
# meas is the prior measurement

function PointPoint(M, m, p, q)
    q̂ = compose(M, p, m)
    return distance(M, q, q̂)^2
end

function Prior(M, meas, p)
    return distance(M, meas, p)^2
end

The factors are combined into cost functions whose form depends on the structure of the factor graph.
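On the translation group these factors have a simple Euclidean reading; here is a library-free stand-in sketch (an assumption: compose becomes vector addition and distance becomes the Euclidean norm, which holds for the flat group but not for SE(2)/SO(2)):

```julia
using LinearAlgebra

# Euclidean stand-ins for the factors above on TranslationGroup(2):
point_point(m, p, q) = norm(q - (p + m))^2   # odometry-style factor
prior(meas, p) = norm(p - meas)^2            # prior factor

# A measurement m that takes p exactly to q gives zero residual:
point_point([1.0, 0.0], [0.0, 0.0], [1.0, 0.0])  # 0.0
prior([1.0, 1.0], [1.0, 2.0])                    # 1.0
```

The manifold versions above replace `+` with `compose` and the norm with the geodesic distance, which is where the convexity caveats discussed below come in.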

@kellertuer (Member)

Distance squared is better; that is geodesically convex (so locally it is convex).
I am not sure how compose (and the chain rule) act here, though.

For distance squared, keep in mind that globally the minimiser might still be non-unique. For example, on the sphere, if you minimise with respect to x the function

distance(M, n, x)^2 + distance(M, s, x)^2

where n and s are the North and South Pole, respectively. Then any x on the equator is a minimiser of the function.
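A quick numeric check of the equator claim, with the sphere distance written out in plain Julia (a sketch for the unit sphere S² embedded in R³):

```julia
using LinearAlgebra

# Geodesic distance on the unit sphere; clamp guards against rounding.
sphere_distance(p, q) = acos(clamp(dot(p, q), -1.0, 1.0))

n = [0.0, 0.0, 1.0]    # North Pole
s = [0.0, 0.0, -1.0]   # South Pole
f(x) = sphere_distance(n, x)^2 + sphere_distance(s, x)^2

# Every point on the equator attains the same value 2 * (π/2)^2 = π²/2:
f([1.0, 0.0, 0.0])              # π²/2
f([0.0, 1.0, 0.0])              # π²/2
f(normalize([1.0, 1.0, 0.0]))   # π²/2 again
```

So the sum of squared distances is constant on the whole equator, illustrating the non-unique minimiser.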

Your last sentence I again can't follow: what factors? What is combined here? Which graph? What is its structure?

@Affie (Contributor, Author) commented Jun 25, 2021

Hi, I found something interesting: if I set retraction_method=QRRetraction() in the basic Rotations example, it works as expected.
I looked at the exponential retraction on Rotations but can't find anything wrong with it.

@kellertuer (Member)

That is good to know :)
That was one of my concerns when I wrote this #86 (comment), but all tests thereon look ok. It might be good to take another look at exp/log on Rotations then, because I also could not find anything wrong with them. Or maybe they are at least not numerically stable currently; I am not sure, I usually do not work on that manifold.

Retractions should locally be as fine as exp (inverse retractions as fine as log), but are usually cheaper or numerically more stable. So good that you checked.

@Affie (Contributor, Author) commented Jun 25, 2021

For distance squared, keep in mind that globally the minimiser might still be non-unique. For example, on the sphere, if you minimise with respect to x the function
distance(M, n, x)^2 + distance(M, s, x)^2
where n and s are the North and South Pole, respectively. Then any x on the equator is a minimiser of the function.

That is where IncrementalInference.jl comes in: in short, you will get the ring around the equator as a belief if you set it up that way. For example, figure 1 in https://marinerobotics.mit.edu/sites/default/files/fourie_iros19_manifolds.pdf has a similar case on a circular manifold.

@Affie (Contributor, Author) commented Jun 25, 2021

Your last sentence I again can't follow: what factors? What is combined here? Which graph? What is its structure?

Sorry, an explanation would be too much for a comment. Perhaps a basic example:

# for a user-defined graph that looks like this:

p -- x1 -- f1 -- x2 -- f2 -- x3
      \                    /
        ------- f3 -------

# the cost would look like this (the factors above already square the distances):
cost(M, x) = Prior(M, p, x[1]) + PointPoint(M, f1, x[1], x[2]) +
             PointPoint(M, f2, x[2], x[3]) + PointPoint(M, f3, x[1], x[3])
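Putting the pieces together, the Euclidean/translation-group version of this little graph can be solved with plain gradient descent. This is a library-free sketch under two stated assumptions: compose is vector addition, and the measurements below are made-up, self-consistent numbers; the gradient uses finite differences to keep it short:

```julia
using LinearAlgebra

# Translation-group stand-ins for the factors (compose = +):
point_point(m, p, q) = norm(q - (p + m))^2
prior(meas, p) = norm(p - meas)^2

# Hypothetical, self-consistent measurements for the graph above.
p  = [0.0, 0.0]                  # prior on x1
f1 = [1.0, 0.0]                  # odometry x1 -> x2
f2 = [1.0, 0.0]                  # odometry x2 -> x3
f3 = [2.0, 0.0]                  # loop closure x1 -> x3

cost(x) = prior(p, x[1]) +
          point_point(f1, x[1], x[2]) +
          point_point(f2, x[2], x[3]) +
          point_point(f3, x[1], x[3])

# Coordinate-wise finite-difference gradient descent; crude but sufficient
# for this small quadratic problem.
function solve(x; γ = 0.1, iters = 500, h = 1e-6)
    x = deepcopy(x)
    for _ in 1:iters, i in eachindex(x), j in eachindex(x[i])
        xh = deepcopy(x)
        xh[i][j] += h
        x[i][j] -= γ * (cost(xh) - cost(x)) / h
    end
    return x
end

sol = solve([zeros(2), zeros(2), zeros(2)])
# sol should approach [[0, 0], [1, 0], [2, 0]] with cost near zero
```

The manifold version replaces `+` with `compose` and the finite-difference step with a Riemannian gradient and retraction, which is where the Manopt machinery discussed in this thread takes over.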

@JuliaManifolds JuliaManifolds locked and limited conversation to collaborators Jun 25, 2021
