
Manually inject derivative information that can't be computed by AD #89

Open

KristofferC opened this issue Jan 27, 2016 · 17 comments
@KristofferC
Collaborator

I have a function that is computed in a "complicated enough" way for automatic differentiation to be unsuitable for it. I can, however, get the derivative in some other way.

I want to use this function as part of another function whose remainder can be differentiated with AD, so I need a way to overload the ForwardDiff call to this function and inject the externally computed derivative.

I have made a start, but I know it is wrong because I don't know how to correctly propagate my computed derivative into the Partials.

using ForwardDiff

# Function that in reality can't be differentiated with AD
function ff(x)
    @assert length(x) == 3
    y = 2*x
    return y
end

# Derivative to above function
function ff_diff(x)
    @assert length(x) == 3
    return 2*eye(3)
end

# Function that I want to use AD on, which calls ff
function g(x)
    a = 2*x
    b = ff(a)
    return 5*b
end

typealias GradNumFloat3 ForwardDiff.GradientNumber{3,Float64,Tuple{Float64,Float64,Float64}}
typealias PartialFloat3 ForwardDiff.Partials{Float64,Tuple{Float64,Float64,Float64}}

# Overload ff for GradientNumber input
function ff(x::Vector{GradNumFloat3})
    println("I got called")
    # Extract the values
    x_val = ForwardDiff.get_value(x)
    # Call f and get the derivative
    y = ff(x_val)
    dydx = ff_diff(x_val)
    # Insert back
    # THIS IS WRONG because it ignores the partials already stored in x
    nums = [GradNumFloat3(y[i], PartialFloat3((dydx[:, i]...))) for i in 1:length(x)]
    return nums
end

ForwardDiff.jacobian(g, rand(3))

# I got called
# 3x3 Array{Float64,2}:
#  10.0   0.0   0.0
#   0.0  10.0   0.0
#   0.0   0.0  10.0

# Above should be 20.0 on diagonal

Any tips on how to do this correctly? If there is a good way to do it currently, maybe we could put it in the documentation.

@KristofferC
Collaborator Author

From f(x + ε) = f(x) + εf'(x), I guess the computed derivative must be inserted by multiplying it with the already computed partials... componentwise?

@jrevels
Member

jrevels commented Jan 27, 2016

From f(x + ε) = f(x) + εf'(x), I guess the computed derivative must be inserted by multiplying it with the already computed partials... componentwise?

Yup! Multiplication is overloaded on Partials just for this purpose. You might've already seen this, but in case you haven't, this page of the docs might be useful to you.
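
To spell that out in the scalar case, here is a minimal sketch with plain numbers only (no ForwardDiff types; the helper names are made up for illustration): if the incoming number already carries a partial p, the injected derivative has to be scaled by p rather than stored as-is.

# Scalar chain rule: for an input arriving as x_val + ε*p (p = dx/dt already
# accumulated), injecting an external derivative f'(x) must produce
# f'(x) * p as the new partial, not f'(x) alone.

f(x)       = 2x       # stand-in for the "black box" value
f_deriv(x) = 2.0      # externally supplied derivative

x_val = 1.5
p     = 3.0           # partial already stored in the incoming number

out_val     = f(x_val)
out_partial = f_deriv(x_val) * p   # chain rule: d/dt f(x(t)) = f'(x) * dx/dt

println((out_val, out_partial))    # (3.0, 6.0)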

@jrevels
Member

jrevels commented Jan 27, 2016

Though it might be easier to use the *num_from_deriv functions, which are vaguely discussed in the Contribution guide.

@KristofferC
Collaborator Author

Ah, my bad, I should have read the documentation more carefully. I will make another attempt. Thanks a bunch.

@jrevels
Member

jrevels commented Jan 27, 2016

No problem. I never really thought about how *num_from_deriv might facilitate external derivative injection in cases where ForwardDiff can't compute derivatives - I just wrote those functions to make contribution easier, so I'm glad you brought this up! I'm interested to see how it goes.

@KristofferC
Collaborator Author

I guess what I am not sure about is how my computed Jacobian matrix should go into the deriv field.

My attempt was along:

function f{T <: ForwardDiff.GradientNumber}(x::Vector{T})
    x_val = ForwardDiff.get_value(x)
    y = f(x_val)
    J = f_diff(x_val)
    grad_vec = T[ForwardDiff.gradnum_from_deriv(x[i], y[i], J[:,i]) for i = 1:length(x)]
    return grad_vec 
end

Now, it seems like the deriv field should hold a scalar, but I have, for example, 9 components in the Jacobian and only 3 gradient numbers, so how will they all fit in there... :P

@jrevels
Member

jrevels commented Jan 27, 2016

I think you have to propagate the Jacobian information in accordance with the chain rule, i.e. do a matrix multiplication between your J and the matrix you can get from ForwardDiff.get_jacobian(x). This should result in a new Jacobian, whose components can be loaded into the GradientNumbers returned inside of grad_vec.
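
For the running example above, the chain rule in matrix form checks out like this (a plain-matrix sketch, no ForwardDiff types, just to confirm the expected 20.0 diagonal):

# g(x) = 5 * ff(2x) with ff(y) = 2y, so J_g = 5I * J_ff * 2I = 20I.

J_ff    = [2.0 0.0 0.0; 0.0 2.0 0.0; 0.0 0.0 2.0]   # Jacobian of ff
J_inner = [2.0 0.0 0.0; 0.0 2.0 0.0; 0.0 0.0 2.0]   # Jacobian of x -> 2x
J_outer = [5.0 0.0 0.0; 0.0 5.0 0.0; 0.0 0.0 5.0]   # Jacobian of b -> 5b

J_g = J_outer * J_ff * J_inner   # 20.0 on the diagonal, as it should be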

@KristofferC
Collaborator Author

Thank you. This seems to work (hard-coded for 3 components to avoid the splatting-into-tuple penalty). To make it general, I guess I need a @stagedfunction to create the expression that inserts the components into the partials tuple?

Does this look reasonable to you?

# Overload f
function f{T <: ForwardDiff.GradientNumber}(x::Vector{T})
    x_val = ForwardDiff.get_value(x)
    f_val, J = f(x_val), f_diff(x_val)
    J_new = J * ForwardDiff.get_jacobian(x)
    return [T(f_val[i], (J_new[1,i], J_new[2,i], J_new[3,i])) for i = 1:length(x)]
end

@jrevels
Member

jrevels commented Jan 28, 2016

Yup, looks good! This is pretty exciting. Like you said, generalizing this would require @generated functions, but I can see how this kind of thing might be exposed via the API to allow people to inject external derivatives in a variety of cases.

@KristofferC
Collaborator Author

I made a macro that implements the Jacobian overload for a function, given the function, its derivative and, for now, the number of components.

macro implement_jacobian(f, f_grad, n_components)
    Jexp = Expr(:tuple, [:(J[i, $k]) for k=1:n_components]...)

    return quote
        function $(esc(f)){T <: ForwardDiff.GradientNumber}(x::Vector{T})
            x_val = ForwardDiff.get_value(x)
            f_val, J = $f(x_val), $f_grad(x_val)
            J_new = J * ForwardDiff.get_jacobian(x)
            return [T(f_val[i], $Jexp) for i = 1:length(x)]
        end
    end
end

Using it:

foo(x) = 2*x

function foo_grad(x)
    println("I got called")
    return 2*eye(length(x))
end

julia> @implement_jacobian foo foo_grad 5
foo (generic function with 2 methods)

julia> ForwardDiff.jacobian(foo, rand(5))
I got called
5x5 Array{Float64,2}:
 2.0  0.0  0.0  0.0  0.0
 0.0  2.0  0.0  0.0  0.0
 0.0  0.0  2.0  0.0  0.0
 0.0  0.0  0.0  2.0  0.0
 0.0  0.0  0.0  0.0  2.0

I haven't tested it extensively. I guess the ForwardDiffCache should probably be used somehow for optimal performance, to store the Jacobian extracted from the partials and the returned vector, but I haven't figured that out yet.

@jrevels
Member

jrevels commented Jan 28, 2016

This is awesome. Here's a version of the above with some tweaks (using ForwardDiffCache/removing the need to specify n_components):

macro implement_jacobian(f, jacf)
    cachesym = esc(gensym("cache"))
    return quote
        const $(cachesym) = ForwardDiff.ForwardDiffCache()
        function $(esc(f)){G<:ForwardDiff.GradientNumber}(x::Vector{G})
            x_val = ForwardDiff.get_value(x)
            f_val, J = $(f)(x_val), $(jacf)(x_val)
            J_new = J * ForwardDiff.get_jacobian(x)
            result = ForwardDiff.get_workvec!($(cachesym), G, length(x))
            for i in eachindex(result)
                result[i] = G(f_val[i], getrowpartials(G, J_new, i))
            end
            return result
        end
    end
end

@generated function getrowpartials{G<:ForwardDiff.GradientNumber}(::Type{G}, J, i)
    return Expr(:tuple, [:(J[i, $k]) for k=1:ForwardDiff.npartials(G)]...)
end

@implement_jacobian f jacf

@KristofferC
Collaborator Author

Thanks so much. That works really well.

I experimented with implementing the analytical Jacobian of a matrix-vector multiplication:

const N = 8
const A = rand(N, N)

foo(x) = A * x
foo_jac(x) = A
@implement_jacobian foo foo_jac

foo2(x) = A * x

jac_overload = ForwardDiff.jacobian(foo)
jac_normal = ForwardDiff.jacobian(foo2)

@time for i = 1:10^3 jac_normal(rand(N)) end

@time for i = 1:10^3 jac_overload(rand(N)) end

julia> @time for i = 1:10^3 jac_normal(rand(N)) end
  0.545726 seconds (11.69 M allocations: 1.439 GB, 29.32% gc time)

julia> @time for i = 1:10^3 jac_overload(rand(N)) end
  0.042785 seconds (214.00 k allocations: 71.808 MB, 17.93% gc time)

Maybe I am doing something wrong in the benchmarking, but it seems that it could be useful to overload some of the BLAS operations that have simple analytical derivatives? I guess this is a similar case to the manual optimization of unary functions, except at a higher level.
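
For reference, a sketch of the kind of operations this could target, with their analytic Jacobians written out (plain Julia, nothing ForwardDiff-specific; the names are made up):

# Linear operations with trivial analytic Jacobians that could in principle
# be hooked up through an overload like @implement_jacobian.
# (A scalar-valued function such as sum would need a slightly different
# wrapper than the vector-valued macro above.)

A = rand(4, 4)
a = rand(4)

matvec(x)      = A * x
matvec_jac(x)  = A                   # d(A*x)/dx = A

sumall(x)      = sum(x)
sumall_jac(x)  = ones(1, length(x))  # d(sum(x))/dx = row of ones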

@jrevels changed the title from "Overloading 'black box' function" to "Manually inject derivative information that can't be computed by AD" on Jan 30, 2016
@jrevels
Member

jrevels commented Jan 30, 2016

Wow, that's quite a significant improvement. I'll have to play around with an implementation that tests performance in a more general case.

I'm planning on doing a ForwardDiff sprint next week to resolve some issues/implement some stuff I've been talking about with @mlubin, and this issue is now definitely on the TODO list.

I wonder how @implement_jacobian could be generalized to the case where the input dimension doesn't equal the chunk size, such that x only carries a few columns of the Jacobian at a time. You'd have to somehow pass slice information through to the point of propagation, i.e. where the matmul occurs, so that it knows which slices of J are relevant for the chunk you're computing. Hmm...

@mlubin
Contributor

mlubin commented Jan 30, 2016

There's definitely something to gain from overloading BLAS operations. This makes a lot of sense if you're doing vector mode. I don't have a good solution for chunk mode...

@KristofferC
Collaborator Author

@mlubin Why do you think that there would be extra difficulty with chunk mode?

Assume a function R^M -> R^N and a chunk length of k.

The Jacobian of the function will be an N x M matrix (J_f).
The Jacobian that get_jacobian(x) gives will be an M x k matrix (J_x).

The new "partial matrix" is computed as J_f * J_x, which will be an N x k matrix.

These are then inserted into new gradient numbers and returned.
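
A quick dimension check of that with plain matrices (M, N and k here are just example sizes, no ForwardDiff types involved):

# R^M -> R^N with chunk length k:
#   J_f       is N x M  (full Jacobian of the overloaded function)
#   J_x       is M x k  (the chunk of input partials from get_jacobian(x))
#   J_f * J_x is N x k, exactly the partials for this chunk.

M, N, k = 12, 8, 6

J_f = rand(N, M)     # stands in for f_diff(x_val)
J_x = rand(M, k)     # stands in for ForwardDiff.get_jacobian(x) in chunk mode

J_new = J_f * J_x
@assert size(J_new) == (N, k)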

I made a gist here which implements an overloaded dot operator (input_length = 12 and chunk_size = 6) and calls it with a vector of gradient numbers: https://gist.github.com/KristofferC/b66b8b4b78e54f71124f

The result is that the overloaded dot operator runs in about two-thirds of the time of the normal one.

julia> @time for i = 1:10^5 ddot_inject(Gs) end
  0.110522 seconds

julia> @time for i = 1:10^5 ddot(Gs) end
  0.164776 seconds

@mlubin
Contributor

mlubin commented Feb 17, 2016

If it works well, I'm all for it.

@KristofferC
Collaborator Author

The problem is that without preallocating the needed temporary arrays, the cost model becomes quite complex: which version is faster depends on which BLAS function is being overloaded, the size of the input vector, and so on.

Doing this would make a lot of sense for someone who knows a priori exactly which input sizes and chunk sizes will be used, but making it faster in the general case seems hard.
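
One possible shape for the preallocation, purely as a sketch of the idea (the cache and helper below are hypothetical, not something ForwardDiff provides): keep temporaries keyed on the sizes involved, so repeated calls with the same input size and chunk size reuse the same buffer for the J_f * J_x product.

# Hypothetical buffer cache keyed on (output_length, chunk_size); repeated
# calls with the same sizes reuse the same temporary matrix.
const JAC_BUFFERS = Dict{Tuple{Int,Int}, Matrix{Float64}}()

get_jac_buffer!(n_out::Int, chunk::Int) =
    get!(() -> zeros(n_out, chunk), JAC_BUFFERS, (n_out, chunk))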
