TensorFlow Gradients are NaN

(from stack overflow)

https://stackoverflow.com/questions/41918795/minimize-a-function-of-one-variable-in-tensorflow

Many of the other solutions use clipping to avoid an undefined gradient. Depending on your problem, clipping introduces bias and may not be acceptable in all cases. As the following code demonstrates, we need only handle the point of discontinuity--not the region near it.

Specific Answer

def cross_entropy(x, y, axis=-1):
  safe_y = tf.where(tf.equal(x, 0.), tf.ones_like(y), y)
  return -tf.reduce_sum(x * tf.log(safe_y), axis)

def entropy(x, axis=-1):
  return cross_entropy(x, x, axis)

But did it work?

x = tf.constant([0.1, 0.2, 0., 0.7])
e = entropy(x)
# ==> 0.80181855
g = tf.gradients(e, x)[0]
# ==> array([ 1.30258512,  0.60943794,  0., -0.64332503], dtype=float32)
# Yay! No NaN.
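As a sanity check (not part of the original answer), the analytic gradient of H(x) = -sum(x * log(x)) is -(log(x) + 1) per element, with the x = 0 term contributing 0. A quick NumPy verification that this matches the values above:

```python
import numpy as np

x = np.array([0.1, 0.2, 0., 0.7], dtype=np.float32)

# Analytic gradient of H(x) = -sum(x * log(x)) is -(log(x) + 1),
# with the convention that the x = 0 term contributes 0.
# The inner where keeps log() away from 0, mirroring the safe_y trick.
grad = np.where(x == 0., 0., -(np.log(np.where(x == 0., 1., x)) + 1.))
# grad is approximately [1.3025851, 0.60943794, 0., -0.64332503]
```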

(Note: deleted the duplicate cross-post.)

General Recipe

Use an inner tf.where to ensure the function has no asymptote. That is, alter the input to the inf-generating function such that no inf can be created. Then use a second tf.where to always select the valid code path. That is, implement the mathematical condition as you would "normally", i.e., the "naive" implementation.

In Python code, the recipe is:

Instead of this:

tf.where(x_ok, f(x), safe_f(x))

Do this:

safe_x = tf.where(x_ok, x, tf.ones_like(x))  # any value at which f is finite
tf.where(x_ok, f(safe_x), safe_f(x))

Example

Suppose you wish to compute:

f(x) = { 1/x,  x != 0
       { 0,    x == 0

A naive implementation results in NaNs in the gradient, i.e.,

def f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  return tf.where(x_ok, f(x), safe_f(x))

Does it work?

x = tf.constant([-1., 0, 1])
tf.gradients(f(x), x)[0].eval()
# ==> array([ -1.,  nan,  -1.], dtype=float32)
# ...bah! We have a NaN at the asymptote despite not having
# an asymptote in the non-differentiated result.
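Why does the NaN appear? The gradient of tf.where is itself a where over the branch gradients, so the unselected 1/x branch still gets differentiated, producing -1/x**2 = -inf at x = 0, and the chain rule then computes 0 * -inf = NaN. A NumPy sketch of that multiplication (an illustration of the mechanism, not TensorFlow's actual autodiff code):

```python
import numpy as np

x = np.array([-1., 0., 1.], dtype=np.float32)
x_ok = x != 0.

with np.errstate(divide='ignore'):
    branch_grad = -1. / x**2          # gradient of the 1/x branch: [-1., -inf, -1.]

outer_grad = np.where(x_ok, 1., 0.)   # gradient the outer where passes through
result = outer_grad * branch_grad     # 0 * -inf = nan at x == 0
# result is [-1., nan, -1.]
```

Feeding the branch a safe input (the inner where) keeps branch_grad finite everywhere, so the multiplication never sees inf.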

The basic pattern for avoiding NaN gradients when using tf.where is to call tf.where twice. The innermost tf.where ensures that the result f(x) is always finite. The outermost tf.where ensures the correct result is chosen. For the running example, the trick plays out like this:

def safe_f(x):
  x_ok = tf.not_equal(x, 0.)
  f = lambda x: 1. / x
  safe_f = tf.zeros_like
  safe_x = tf.where(x_ok, x, tf.ones_like(x))
  return tf.where(x_ok, f(safe_x), safe_f(x))

But did it work?

x = tf.constant([-1., 0, 1])
tf.gradients(safe_f(x), x)[0].eval()
# ==> array([-1.,  0., -1.], dtype=float32)
# ...yay! double-where trick worked. Notice that the gradient
# is now a constant at the asymptote (as opposed to being NaN).
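The answer above uses the TF1 graph API (tf.gradients, .eval()). Assuming TensorFlow 2, the same double-where trick carries over unchanged in eager mode with tf.GradientTape; a sketch, not from the original answer:

```python
import tensorflow as tf

def safe_f(x):
    # Same double-where trick, TF2 eager style.
    x_ok = tf.not_equal(x, 0.)
    safe_x = tf.where(x_ok, x, tf.ones_like(x))            # inner where: no asymptote
    return tf.where(x_ok, 1. / safe_x, tf.zeros_like(x))   # outer where: correct value

x = tf.constant([-1., 0., 1.])
with tf.GradientTape() as tape:
    tape.watch(x)
    y = safe_f(x)
g = tape.gradient(y, x)
# g is [-1., 0., -1.]; no NaN
```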
