Root-finding algorithms share a very straightforward and intuitive approach to approximating roots. The general structure goes something like: a) start with an initial guess, b) calculate the result of the guess, c) update the guess based on the result and some further conditions, d) repeat until you’re satisfied with the result.
Let’s start with a simple problem: finding the root of $f(x) = x^2 - 20$.
A naive approach: begin with something like 4.5 (you have the luxury of knowing what a good first guess is – this is not always the case), square it to get 20.25, reduce 4.5 by some small amount since the result is greater than 20, recalculate, and so on until you get an answer you’re satisfied with. Our naive root-finding algorithm looks something like this:
def naive_root(f, x0, tol, step):
    steps_taken = 0
    while abs(f(x0)) > tol:
        if f(x0) > 0:
            x0 -= step   # overshot (assuming f is increasing near the root), step down
        elif f(x0) < 0:
            x0 += step   # undershot, step up
        else:
            return x0, steps_taken
        steps_taken += 1
    return x0, steps_taken
f = lambda x: x**2 - 20

%timeit naive_root(f, x0=4.5, tol=.01, step=.001)

# %timeit doesn't keep names assigned inside it, so call once more to capture the result
root, steps = naive_root(f, x0=4.5, tol=.01, step=.001)
print("root is:", root)
print("steps taken:", steps)
The bisection method introduces a simple idea to home in on the root. Start with two guesses such that f(guess_1) and f(guess_2) are of opposite sign. This means we have one guess that’s too large and another that’s too small. The idea is to gradually squeeze the too-high guess and the too-low guess towards each other until the range between them is small enough, or until we hit the root exactly.
At each step we take the midpoint of the two guesses. If f(midpoint) is zero (we’ve landed exactly on the root) or if the guesses are sufficiently close to one another, we return the midpoint. Otherwise, if f(midpoint) is negative we replace the lower bound with the midpoint, and if f(midpoint) is positive we replace the upper bound with the midpoint. And, of course, repeat.
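For example, with $f(x) = x^2 - 20$ bracketed by the guesses 1 and 8 (the bracket used below), the first two iterations look like this:

$$
\begin{aligned}
m &= \tfrac{1 + 8}{2} = 4.5, & f(4.5) &= 0.25 > 0 \;\Rightarrow\; \text{upper} = 4.5,\\
m &= \tfrac{1 + 4.5}{2} = 2.75, & f(2.75) &= -12.4375 < 0 \;\Rightarrow\; \text{lower} = 2.75.
\end{aligned}
$$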
def bisection(f, lower, upper, max_iters=50, tol=1e-5):
    # assumes the root is bracketed with f(lower) < 0 < f(upper)
    steps_taken = 0
    while steps_taken < max_iters:
        m = (lower + upper) / 2.0
        if f(m) == 0 or abs(upper - lower) < tol:
            return m, steps_taken
        if f(m) > 0:
            upper = m   # root lies in [lower, m]
        else:
            lower = m   # root lies in [m, upper]
        steps_taken += 1
    final_estimate = (lower + upper) / 2.0
    return final_estimate, steps_taken
f = lambda x: x**2 - 20

%timeit bisection(f, 1, 8)

root, steps = bisection(f, 1, 8)
print("root is:", root)
print("steps taken:", steps)
You’ve probably guessed that the derivative is an obvious candidate for improving step sizes: on reasonably convex, continuous, well-behaved functions, the derivative tells us both which direction to step and how far. Rather than blindly nudging the guess by a fixed amount, we can follow the tangent line toward the root.
For Newton-Raphson, we supply an initial guess, calculate the derivative of f at that guess, calculate where the tangent line at that guess intersects the x-axis, and use that intersection as our next guess.
Setting the tangent line at $x_n$ to zero gives $f'(x_n)\,(x_{n+1} - x_n) = 0 - f(x_n)$, that is, the step $\delta x = x_{n+1} - x_n = -f(x_n)/f'(x_n)$.
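For example, with $f(x) = x^2 - 20$ and the initial guess $x_0 = 8$ used below, the first update is

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = 8 - \frac{64 - 20}{2 \cdot 8} = 8 - 2.75 = 5.25,$$

and the next step already lands at $x_2 \approx 4.5298$, close to the true root.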
For the code below we approximate the derivative of the univariate f at x numerically, so that you can play around with the function without having to work out derivatives by hand, but you can easily substitute the actual derivative function and get similar results. The method can be extended to n dimensions with the Jacobian and/or higher orders of the Taylor series expansion, but for now we’ll keep it simple.
Deep learning tools such as PyTorch can also compute derivatives automatically and quickly, so it is worth seeing what this algorithm looks like with automatic differentiation (a sketch follows the example below).
def discrete_method_approx(f, x, h=.00000001):
    # forward-difference approximation of f'(x)
    return (f(x + h) - f(x)) / h

def newton_raphson(f, x, tol=.001):
    steps_taken = 0
    while abs(f(x)) > tol:
        df = discrete_method_approx(f, x)   # approximate f'(x)
        x = x - f(x) / df                   # Newton update
        steps_taken += 1
    return x, steps_taken
f = lambda x: x**2 - 20

%timeit newton_raphson(f, 8)

root, steps = newton_raphson(f, 8)
print("root is:", root)
print("steps taken:", steps)
As we can see, this method takes far fewer iterations than the bisection method, and returns an estimate far more accurate than our imposed tolerance (Python gives the square root of 20 as 4.47213595499958).
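As mentioned above, here is a minimal sketch of the same Newton loop with the finite-difference approximation replaced by PyTorch’s autograd. It assumes torch is installed; newton_raphson_autograd and its parameters are illustrative names, not part of the code above.

import torch

def newton_raphson_autograd(f, x0, tol=1e-3, max_iter=50):
    # Illustrative sketch: the same Newton loop, but with f'(x) computed by autograd.
    x = torch.tensor(float(x0), requires_grad=True)
    for steps_taken in range(max_iter):
        y = f(x)
        if abs(y.item()) <= tol:
            return x.item(), steps_taken
        y.backward()                 # x.grad now holds f'(x)
        with torch.no_grad():
            x -= y / x.grad          # Newton update: x - f(x) / f'(x)
        x.grad.zero_()               # reset the gradient for the next step
    return x.item(), max_iter

root, steps = newton_raphson_autograd(lambda x: x**2 - 20, 8)
print("root is:", root)
print("steps taken:", steps)

Autograd gives the exact derivative rather than an approximation, but as noted next, computing derivatives at every step can still be expensive for large systems.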
The drawback with Newton’s Method is that we need to compute the derivative at each iteration. While it wasn’t a big deal for our toy problem, computing gradients and higher order derivatives for large, complex systems of equations can be very expensive. So while Newton’s Method may find a root in fewer iterations than Algorithm B, if each of those iterations takes ten times as long as iterations in Algorithm B then we have a problem. To remedy this, let’s look at some Quasi-Newtonian methods.
The idea of the Quasi-Newtonian Secant Method and other Quasi-Newtonian methods is to substitute the expensive calculation of the derivative/Jacobian/Hessian at each step with an inexpensive but good-enough approximation. All of the more popular methods (BFGS, Secant, Broyden, etc.) have the same basic form, save for different rules about how best to approximate the gradient and update the guess.
You can think of the Secant Method as derivative-lite. Instead of computing the derivative, we approximate it using two points (x0, f(x0)) and (x1, f(x1)), calculate where the line intercepts the x-axis, and use that as one of our new points. This provides a fast way to generate a line that should approximate the tangent line to the function somewhere between these two points.
Given our two initial guesses $x_0$ and $x_1$, the secant line in point-slope form is:
$$y = \frac{f(x_{n-1}) - f(x_{n-2})}{x_{n-1} - x_{n-2}}\,(x - x_{n-1}) + f(x_{n-1})$$
Setting y to 0 (our new guess is where the line intercepts the x-axis), we get:
$$x = x_{n-1} - f(x_{n-1})\,\frac{x_{n-1} - x_{n-2}}{f(x_{n-1}) - f(x_{n-2})}.$$
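With the starting points used below, $x_0 = 2$ and $x_1 = 8$, we have $f(x_0) = -16$ and $f(x_1) = 44$, so the first secant update is

$$x_2 = 8 - 44 \cdot \frac{8 - 2}{44 - (-16)} = 8 - 4.4 = 3.6.$$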
def secant_method(f, x0, x1, max_iter=100, tol=1e-5):
    steps_taken = 1
    x2 = x1
    while steps_taken < max_iter and abs(x1 - x0) > tol:
        # intersection of the line through (x0, f(x0)) and (x1, f(x1)) with the x-axis
        x2 = x1 - ((f(x1) * (x1 - x0)) / (f(x1) - f(x0)))
        x1, x0 = x2, x1
        steps_taken += 1
    return x2, steps_taken
f = lambda x: x**2 - 20

%timeit secant_method(f, 2, 8)

root, steps = secant_method(f, 2, 8)
print("root is:", root)
print("steps taken:", steps)
This method, inverse quadratic interpolation, is similar to the secant method but is instead initialized with three points: it interpolates a polynomial curve based on those points, calculates where the curve crosses the x-axis, and uses this point as the new guess in the next iteration.
The quadratic interpolation is based on the Lagrange polynomial:
$$L(x) = \sum_{j=0}^{k} y_j\, l_j(x)$$
where
$$l_j(x) = \prod_{0 \leq m \leq k,\; m \neq j} \frac{x - x_m}{x_j - x_m}.$$
Carried out with three points, this gives a second-degree polynomial curve. Each basis polynomial $l_j(x)$ is constructed to be one at $x_j$ and zero at the other interpolation points $x_m$ ($m \neq j$), so each term $y_j\,l_j(x)$ passes through one of the points to be interpolated and vanishes at the others. The Lagrange polynomial is designed to do exactly this: take the linear combination of those three curves and you get an interpolation that passes through all of the desired points.
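The code below actually interpolates the inverse: swapping the roles of $x$ and $y$ (treating $x$ as a function of $y$) and evaluating the resulting Lagrange polynomial at $y = 0$ gives the update

$$x_{\text{new}} = \frac{x_0\, f(x_1) f(x_2)}{(f(x_0) - f(x_1))(f(x_0) - f(x_2))} + \frac{x_1\, f(x_0) f(x_2)}{(f(x_1) - f(x_0))(f(x_1) - f(x_2))} + \frac{x_2\, f(x_0) f(x_1)}{(f(x_2) - f(x_0))(f(x_2) - f(x_1))},$$

which is exactly the $L_0 + L_1 + L_2$ sum computed in the code, and is why the method is called inverse quadratic interpolation.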
def inverse_quadratic_interpolation(f, x0, x1, x2, max_iter=20, tol=1e-5):
    steps_taken = 0
    while steps_taken < max_iter and abs(x1 - x0) > tol:  # stop when the last two guesses are very close
        fx0 = f(x0)
        fx1 = f(x1)
        fx2 = f(x2)
        # inverse quadratic interpolation evaluated at y = 0
        L0 = (x0 * fx1 * fx2) / ((fx0 - fx1) * (fx0 - fx2))
        L1 = (x1 * fx0 * fx2) / ((fx1 - fx0) * (fx1 - fx2))
        L2 = (x2 * fx1 * fx0) / ((fx2 - fx0) * (fx2 - fx1))
        new = L0 + L1 + L2
        x0, x1, x2 = new, x0, x1
        steps_taken += 1
    return x0, steps_taken
f = lambda x: x**2 - 20

%timeit inverse_quadratic_interpolation(f, 4.3, 4.4, 4.5)

root, steps = inverse_quadratic_interpolation(f, 4.3, 4.4, 4.5)
print("root is:", root)
print("steps taken:", steps)
Brent’s Method seeks to combine the robustness of the bisection method with the fast convergence of inverse quadratic interpolation. The basic idea is to switch between inverse quadratic interpolation (falling back to a secant step when the interpolation isn’t possible) and bisection, based on the step performed in the previous iteration and on inequalities gauging the differences between the guesses:
def brents(f, x0, x1, max_iter=50, tol=10e-5):
    fx0 = f(x0)
    fx1 = f(x1)

    assert (fx0 * fx1) <= 0, "Root not bracketed"

    # make x1 the better of the two initial guesses
    if abs(fx0) < abs(fx1):
        x0, x1 = x1, x0
        fx0, fx1 = fx1, fx0

    x2, fx2 = x0, fx0
    d = x2          # step-before-last estimate; only consulted once mflag is False
    mflag = True    # True if the last step was a bisection step
    steps_taken = 0

    while steps_taken < max_iter and abs(x1 - x0) > tol:
        fx0 = f(x0)
        fx1 = f(x1)
        fx2 = f(x2)

        if fx0 != fx2 and fx1 != fx2:
            # inverse quadratic interpolation
            L0 = (x0 * fx1 * fx2) / ((fx0 - fx1) * (fx0 - fx2))
            L1 = (x1 * fx0 * fx2) / ((fx1 - fx0) * (fx1 - fx2))
            L2 = (x2 * fx1 * fx0) / ((fx2 - fx0) * (fx2 - fx1))
            new = L0 + L1 + L2
        else:
            # secant step
            new = x1 - ((fx1 * (x1 - x0)) / (fx1 - fx0))

        # fall back to bisection if the interpolated point is unacceptable
        if ((new < ((3 * x0 + x1) / 4) or new > x1) or
                (mflag == True and (abs(new - x1)) >= (abs(x1 - x2) / 2)) or
                (mflag == False and (abs(new - x1)) >= (abs(x2 - d) / 2)) or
                (mflag == True and (abs(x1 - x2)) < tol) or
                (mflag == False and (abs(x2 - d)) < tol)):
            new = (x0 + x1) / 2
            mflag = True
        else:
            mflag = False

        fnew = f(new)
        d, x2 = x2, x1

        # keep the root bracketed between x0 and x1
        if (fx0 * fnew) < 0:
            x1 = new
        else:
            x0 = new

        # keep x1 as the better of the two estimates
        if abs(fx0) < abs(fx1):
            x0, x1 = x1, x0

        steps_taken += 1

    return x1, steps_taken
f = lambda x: x**2 - 20

%timeit brents(f, 2, 5, tol=10e-5)

root, steps = brents(f, 2, 5, tol=10e-5)
print("root is:", root)
print("steps taken:", steps)