Saving Gradient in Backward Pass #5336

clee1994 · 2021-01-06T21:56:38Z

clee1994
Jan 6, 2021

I am using JAX to implement a simple neural network (NN) and I want to access and save the gradients from the backward pass for further analysis after the NN ran. I can access and look at the gradients temporarily with the python debugger (as long as I am not using jit). But I want to save all gradients over the whole training process and analyze them after the training is done. I have come up with a rather hacky solution for this using id_tap and a global variable (see the code below). But I was wondering whether there is a better solution which does not violate the functional principles of JAX.

Many thanks!

import jax.numpy as jnp
from jax import grad, jit, vmap, random, custom_vjp
from jax.experimental.host_callback import id_tap

# experimental solution
global_save_list = {'x':[],'w':[],'g':[],'des':[]}
def global_save_func(ctx, des):
    x, w, g = ctx
    global_save_list['x'].append(x)
    global_save_list['w'].append(w)
    global_save_list['g'].append(g)
    global_save_list['des'].append(des)


@custom_vjp
def qmvm(x, w):
    return jnp.dot(x, w)

def qmvm_fwd(x, w):
    return qmvm(x, w), (x, w)

def qmvm_bwd(ctx, g):
    x, w = ctx

    # here I would like to save gradients g - or at least running statistics of them

    # experimental solution with id_tap
    id_tap(global_save_func, ((x, w, g)))

    fwd_grad = jnp.dot(g, w.transpose())
    w_grad = jnp.dot(x, g.transpose())
    
    return fwd_grad, w_grad

qmvm.defvjp(qmvm_fwd, qmvm_bwd)

def run_nn(x, w):
    out = qmvm(x, w)   # 1st MVM
    out = qmvm(out, w) # 2nd MVM
    return out

run_nn_batched = vmap(run_nn)

@jit
def loss(x, w, target):
    out = run_nn_batched(x, w)
    return jnp.sum((out - target)**2)

key = random.PRNGKey(42)
subkey1, subkey2, subkey3 = random.split(key, 3)

A = random.uniform(subkey1, (10, 10, 10), minval = -10, maxval = 10)
B = random.uniform(subkey2, (10, 10, 10), minval = -10, maxval = 10)
C = random.uniform(subkey3, (10, 10, 10), minval = -10, maxval = 10)

for e in range(10):
    gval = grad(loss, argnums = 0)(A, B, C)
    # some type of update rule

# here I would like to access gradients, preferably knowing to which MVM (1st or 2nd) and example they belong

# experimental solution:
print(global_save_list)

lukaszlew · 2021-01-06T22:17:16Z

lukaszlew
Jan 6, 2021

Problem in one sentence: how to compute the mean of the some of the the gradients of an intermediate activation tensor over multiple training steps. Can we use custom gradient function for that?

0 replies

mattjj · 2021-01-08T19:17:08Z

mattjj
Jan 8, 2021
Maintainer

One way to do it would be to define a 'perturbation function' (related to this idea). As a simpler example, let's say you want access to gradients with respect to the intermediate y of the function

import jax.numpy as jnp

def f(x):
  y = jnp.sin(x)
  return y ** 2

print(grad(f)(3.))  # -0.2794155

We can write a perturbation function as

def f_perturbed(x, delta_y):
  y = jnp.sin(x)
  y = y + delta_y
  return y ** 2

Notice that f(x) == f_perturbed(x, 0) for all x. But we can differentiate f_perturbed to get the gradient with respect to y:

print(grad(f_perturbed, (0, 1))(3., 0.)  # (DeviceArray(-0.2794155, dtype=float32), DeviceArray(0.28224, dtype=float32))

Here's an example implementation of K-FAC on fully-connected networks using this trick (in Autograd, but the differentiation API is the same as JAX's).

Something annoying about this is that you need to know the shapes of all the intermediates, so that you can construct appropriately-shaped zeros arrays for the perturbation arguments. The zeros could have some runtime cost too, but I wouldn't worry about that; under a jit those shouldn't cost anything.

One could automate the construction of both the perturbation function itself and the appropriately-shaped perturbation values by writing a custom jaxpr interpreter.

I don't think a custom gradient function would help on its own because it doesn't give you a way to plumb out additional outputs.

Another way to do it if you're willing to go beyond "core JAX" would be to use a library that adds "state management" features on top of JAX. I'm sure Oryx can do this for you (I believe with its Harvest API), and it's possible libraries like Flax and Haiku can also do this for you, but I'm less familiar with how it would work with those. Maybe folks on those three respective libraries could help answer questions about their APIs. (Oryx's implementation basically has custom jaxpr interpreters for this kind of thing.)

With these other libraries, you may be able to simply save the values in a custom_vjp rule. They also might give you a more convenient way to implement the perturbation function approach, though if you can just save values inside a custom_vjp rule that approach seems the simplest.

The difference between these and the id_tap approach has to do with where values end up getting stored. With id_tap values are sent from the device to the host. With these other state management libraries, values would stay on the device, which seems appropriate for this situation. In other words, while both id_tap and things like Oryx's Harvest API are semantically about stashing values away from inside a computation, id_tap additionally affects where buffers are stored.

6 replies

sharadmv Jan 8, 2021
Collaborator

Here's a my attempt to record the values using a custom_vjp using Oryx (I haven't tested it extensively, but might be a good starting point).

import jax
import jax.numpy as jnp
from jax import grad
import oryx
core = oryx.core

def record_vjp(f, name):
  @jax.custom_vjp
  def custom_f(*args):
    return f(*args)
  def custom_fwd(*args, **kwargs):
    return f(*args, **kwargs), args
  def custom_bwd(res, dy):
    _, vjp = jax.vjp(f, *res)
    return core.sow(vjp(dy), tag='vjp', name=name)
  custom_f.defvjp(custom_fwd, custom_bwd)
  return custom_f

Now we change the function Matt provided:

def f(x):
  y = jnp.sin(x)
  return record_vjp(lambda x: x ** 2., name='square_vjp')(y)
core.reap(grad(f), tag='vjp')(3.) # ==> {'square_vjp': (DeviceArray(0.28224, dtype=float32),)}

clee1994 Mar 14, 2021
Author

@sharadmv I just saw that Flax also has a sow function now (https://flax.readthedocs.io/en/latest/howtos/extracting_intermediates.html?highlight=sow). Do you have any idea whether oryx or flax sow might be better when dealing with flax models?

sharadmv Apr 19, 2021
Collaborator

I'm not familiar with the details of Flax's sow but it seems similar to Oryx's sow in spirit. I'm not sure which would be more convenient for you, but you should be able to embed Oryx sows into Flax functions and reap the module.apply to get out intermediates. Perhaps someone more familiar with Flax's internals (@levskaya @jheek ?) can help us understand the more subtle differences?

I hope that's helpful!

jheek Apr 20, 2021

My guess is both sow methods will work but I wouldn't recommend using the two interchangeably because now you have to different systems for supporting state with JAX transforms. This could get really confusing especially if you want to do things like vmap your model because oryx will not now how to deal with Flax state and Flax will not now how to deal with oryx state.

The custom_vjp trick is really cool though! We should have a similar pattern for Flax using sow

siddharth-joshi Apr 23, 2021

I think the reason @clee1994 was asking is because we're likely going to need to log a lot of tensors and were wondering if there was any specific recommendation re: sow in either Oryx or Flax. But we can benchmark that and let you know :)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Saving Gradient in Backward Pass #5336

{{title}}

Replies: 2 comments 6 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Saving Gradient in Backward Pass #5336

clee1994 Jan 6, 2021

Replies: 2 comments · 6 replies

lukaszlew Jan 6, 2021

mattjj Jan 8, 2021 Maintainer

sharadmv Jan 8, 2021 Collaborator

clee1994 Mar 14, 2021 Author

sharadmv Apr 19, 2021 Collaborator

jheek Apr 20, 2021

siddharth-joshi Apr 23, 2021

clee1994
Jan 6, 2021

Replies: 2 comments 6 replies

lukaszlew
Jan 6, 2021

mattjj
Jan 8, 2021
Maintainer

sharadmv Jan 8, 2021
Collaborator

clee1994 Mar 14, 2021
Author

sharadmv Apr 19, 2021
Collaborator