An Introduction to Dual Numbers and Automatic Differentiation

#Introduction

Derivatives crop up everywhere, from biology to machine learning, so being able to calculate them matters. You probably learnt how to do this by hand in secondary school or college.

But have you ever considered how to compute them in code?

The naive approach is to implement the rules of the symbolic method you already know, and if you’ve tried this, you’ll know it quickly gets messy due to the huge number of rules. Another option you might have considered is finite differences, but that only gives an approximation. What happens when you need exact derivatives, without the mess?

Enter dual numbers: a surprisingly simple number system that can differentiate functions automatically. In this post, I’ll explain dual numbers from scratch, covering what they are, how algebra works with them, and why they compute derivatives automatically.

#Definition

If you are familiar with complex numbers, dual numbers follow a simple pattern. Complex numbers are defined using set-builder notation as:

$$\mathbb{C}=\{\,a+bi \mid a,b \in \mathbb{R},\ i^2=-1\,\}$$

or equivalently using quotient rings:

$$\mathbb{C}=\mathbb{R}[i]/\langle i^2+1\rangle$$

Dual numbers use the same structure, but instead of the imaginary unit $i$, they use the symbol $\epsilon$. To give dual numbers their special properties, $\epsilon$ is defined to satisfy two equations:

$$\begin{aligned} \epsilon^2 &= 0 \\ \epsilon &\ne 0 \end{aligned}$$

Using the same pattern given to us by the complex numbers, we get:

$$\mathbb{D} = \{\,a+b\epsilon \mid a,b \in \mathbb{R},\ \epsilon^2=0,\ \epsilon \ne 0\,\}$$

or with quotient rings:

$$\mathbb{D} = \mathbb{R}[\epsilon]/\langle \epsilon^2 \rangle$$

If the quotient ring construction is unfamiliar, don’t worry. The set-builder definition describes everything you need to know: dual numbers are pairs of numbers, written as $a + b\epsilon$, where $\epsilon^2=0$.

These constraints cannot be satisfied by any real number, so $\epsilon$ lies outside $\mathbb{R}$. At first glance, $\epsilon^2 = 0$ isn’t that interesting, but as we’ll see later, it’s this property that enables automatic differentiation.
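The pair structure translates directly into code. Here is a minimal sketch in Rust (the language planned for the follow-up post); the `Dual` name and its fields are illustrative choices, not an established API:

```rust
// A dual number a + b*eps, stored as its two real coefficients.
// eps itself is never stored: it only exists through the arithmetic
// rules defined on this type, which drop every eps^2 term.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dual {
    re: f64,  // real part a
    eps: f64, // coefficient b of eps
}

fn main() {
    let d = Dual { re: 3.0, eps: 1.0 }; // represents 3 + eps
    println!("{:?}", d);
}
```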

#Algebra with Dual Numbers

Before we can see how automatic differentiation works, we need to know how algebraic operations work with dual numbers. Don’t worry too much if you don’t understand the derivations of these operations, as the final rule is what matters.

#Addition/Subtraction

Similar to complex numbers, addition and subtraction are done component-wise:

$$(a+b\epsilon)\pm(c+d\epsilon)=(a\pm c)+(b\pm d)\epsilon$$

#Multiplication

Multiplication is also relatively straightforward.

$$\begin{align*} (a+b\epsilon)(c+d\epsilon) &= ac + ad\epsilon + bc\epsilon + bd\epsilon^2 && \text{multiply out the brackets}\\ &= ac + ad\epsilon + bc\epsilon && \text{apply } \epsilon^2=0 \\ &= ac+(ad+bc)\epsilon && \text{collect like terms} \end{align*}$$

Notice the $\epsilon^2$ term vanishes. This truncation of higher-order terms is the reason dual numbers work for automatic differentiation.
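Both rules are one-liners in code. A minimal Rust sketch, assuming the illustrative `Dual` pair-of-`f64`s representation:

```rust
use std::ops::{Add, Mul};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Dual {
    re: f64,  // real part a
    eps: f64, // coefficient b of eps
}

impl Add for Dual {
    type Output = Dual;
    // (a + b eps) + (c + d eps) = (a + c) + (b + d) eps
    fn add(self, rhs: Dual) -> Dual {
        Dual { re: self.re + rhs.re, eps: self.eps + rhs.eps }
    }
}

impl Mul for Dual {
    type Output = Dual;
    // (a + b eps)(c + d eps) = ac + (ad + bc) eps, since eps^2 = 0
    fn mul(self, rhs: Dual) -> Dual {
        Dual { re: self.re * rhs.re, eps: self.re * rhs.eps + self.eps * rhs.re }
    }
}

fn main() {
    let x = Dual { re: 2.0, eps: 3.0 };
    let y = Dual { re: 4.0, eps: 5.0 };
    println!("{:?}", x + y); // (2+4) + (3+5)eps
    println!("{:?}", x * y); // 8 + (2*5 + 3*4)eps
}
```

Note that `mul` never computes the $bd$ product at all: the $\epsilon^2$ term is dropped by construction rather than cancelled afterwards.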

#Division

Division is slightly more involved, but nothing too abnormal. Since dividing two dual numbers should yield another dual number, we can write $\frac{a + b\epsilon}{c + d\epsilon} = p + q\epsilon$ (assuming $c \ne 0$), then multiply through by the denominator and solve:

$$\begin{aligned} a + b\epsilon &= (p + q\epsilon)(c + d\epsilon) \\ &= pc + (pd + qc)\epsilon \end{aligned}$$

Two dual numbers are equal exactly when their real parts and their $\epsilon$ coefficients agree, so we can equate the coefficients of each part:

$$\begin{aligned} a &= pc \\ b &= pd + qc \end{aligned}$$

Solving yields $p = \frac{a}{c}$ and $q = \frac{b - pd}{c} = \frac{bc - ad}{c^2}$, which gives us the rule for division:

$$\frac{a + b\epsilon}{c + d\epsilon} = \frac{a}{c} + \frac{bc - ad}{c^2}\epsilon$$
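The division rule is also a direct translation into code. Another hedged sketch, with the same illustrative `Dual` struct as above:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dual {
    re: f64,
    eps: f64,
}

impl std::ops::Div for Dual {
    type Output = Dual;
    // (a + b eps) / (c + d eps) = a/c + ((bc - ad)/c^2) eps, requiring c != 0
    fn div(self, rhs: Dual) -> Dual {
        assert!(rhs.re != 0.0, "division requires a divisor with nonzero real part");
        Dual {
            re: self.re / rhs.re,
            eps: (self.eps * rhs.re - self.re * rhs.eps) / (rhs.re * rhs.re),
        }
    }
}

fn main() {
    let x = Dual { re: 6.0, eps: 2.0 };
    let y = Dual { re: 3.0, eps: 1.0 };
    println!("{:?}", x / y); // 6/3 + ((2*3 - 6*1)/9)eps = 2 + 0eps
}
```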

#Automatic Differentiation

Now that we know how to do basic algebra with dual numbers, let’s see what makes them useful. For any differentiable function $f(x)$:

$$f(x+\epsilon) = f(x) + f'(x)\epsilon$$

This means that the coefficient of $\epsilon$ in our result is actually equal to the derivative. This isn’t just a coincidence.

#The Taylor Series

To understand why this works, let’s express our function as the Taylor series expansion of $f(x)$:

$$f(x+\epsilon) = \sum_{n=0}^\infty \frac{f^{(n)}(x)}{n!}\epsilon^n$$

When shown using the summation, it’s not immediately clear what the relevance is, but watch what happens when we expand this for the first few terms:

$$f(x+\epsilon) = f(x) + f'(x)\epsilon + \frac{f''(x)}{2!}\epsilon^2 + \frac{f'''(x)}{3!}\epsilon^3 + \cdots$$

This is where the magic happens. Since $\epsilon^2 = 0$, all terms with $n \ge 2$ vanish, leaving just:

$$f(x+\epsilon) = f(x) + f'(x)\epsilon$$

#An Example

Let’s verify this by trying a couple of examples. I’ll assume you can do basic differentiation to save writing out the steps.

$$\begin{aligned} f(x) &= x^2 \\ f(x+\epsilon) &= (x+\epsilon)^2 \\ &= x^2 + 2x\epsilon + \epsilon^2 \\ &= x^2 + 2x\epsilon \\ \therefore f'(x)\epsilon &= 2x\epsilon \end{aligned}$$

The coefficient of $\epsilon$ is $2x$, which is exactly $f'(x)$. The derivative appears automatically without needing to use symbolic methods of differentiation. Let’s try another:

$$\begin{aligned} f(x) &= 2x^2+4x+7 \\ f(x+\epsilon) &= 2(x+\epsilon)^2 + 4(x+\epsilon) + 7 \\ &= 2x^2 + 4x\epsilon + 2\epsilon^2 + 4x + 4\epsilon + 7 \\ &= 2x^2 + 4x + 7 + (4x + 4)\epsilon && \text{since } \epsilon^2=0 \\ \therefore f'(x)\epsilon &= (4x+4)\epsilon \end{aligned}$$

Again, the derivative appears in the $\epsilon$ coefficient.
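This second example can be checked mechanically: seed the $\epsilon$ coefficient with 1, evaluate $f$ with ordinary arithmetic, and read the derivative off the result. A self-contained sketch (the `Dual` type and the `real` helper are illustrative, built from nothing but the multiplication and addition rules above):

```rust
use std::ops::{Add, Mul};

#[derive(Debug, Clone, Copy, PartialEq)]
struct Dual {
    re: f64,
    eps: f64,
}

impl Add for Dual {
    type Output = Dual;
    fn add(self, rhs: Dual) -> Dual {
        Dual { re: self.re + rhs.re, eps: self.eps + rhs.eps }
    }
}

impl Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // eps^2 terms are dropped
        Dual { re: self.re * rhs.re, eps: self.re * rhs.eps + self.eps * rhs.re }
    }
}

// Lift a plain real constant into the duals: k = k + 0eps.
fn real(k: f64) -> Dual {
    Dual { re: k, eps: 0.0 }
}

// f(x) = 2x^2 + 4x + 7, written with ordinary arithmetic.
fn f(x: Dual) -> Dual {
    real(2.0) * x * x + real(4.0) * x + real(7.0)
}

fn main() {
    // Seeding eps = 1 makes the eps coefficient of f(x + eps) equal f'(x).
    let y = f(Dual { re: 3.0, eps: 1.0 });
    println!("f(3) = {}, f'(3) = {}", y.re, y.eps); // f(3) = 37, f'(3) = 16
}
```

The function body never mentions derivatives; $f'(3) = 4 \cdot 3 + 4 = 16$ simply falls out of the arithmetic.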

#Beyond Polynomials

The examples with polynomials showed the basics, but dual numbers work for any differentiable function. Let’s see how they handle exponentials and trigonometric functions.

Like earlier, let’s start with a Taylor series expansion of $\exp(x)$:

$$\begin{aligned} \exp(x+\epsilon) &= \exp(x) + \exp(x)\epsilon + \frac{\exp(x)}{2!}\epsilon^2 + \frac{\exp(x)}{3!}\epsilon^3 + \cdots \\ &= \exp(x) + \exp(x)\epsilon \end{aligned}$$

Just like before, we write out the expansion and cancel every term containing $\epsilon^2$ (or higher), leaving $\exp(x)$ as the coefficient of $\epsilon$, the correct derivative. The same pattern applies to all elementary functions:

$$\sin(x + \epsilon) = \sin(x) + \cos(x)\epsilon$$

$$\cos(x + \epsilon) = \cos(x) - \sin(x)\epsilon$$

$$\log(x + \epsilon) = \log(x) + \frac{1}{x}\epsilon$$
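In code, each elementary function becomes a short method that pairs its value with its known derivative. One detail worth noting: applied to a general dual $a + b\epsilon$ (not just $x + \epsilon$), the rule reads $f(a + b\epsilon) = f(a) + f'(a)\,b\epsilon$, since $(b\epsilon)^2 = 0$ too, so the existing $\epsilon$ coefficient $b$ is multiplied through. A sketch with the illustrative `Dual` type:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Dual {
    re: f64,
    eps: f64,
}

impl Dual {
    // Each method implements f(a + b eps) = f(a) + f'(a) * b * eps.
    fn sin(self) -> Dual {
        Dual { re: self.re.sin(), eps: self.re.cos() * self.eps }
    }
    fn cos(self) -> Dual {
        Dual { re: self.re.cos(), eps: -self.re.sin() * self.eps }
    }
    fn exp(self) -> Dual {
        let e = self.re.exp();
        Dual { re: e, eps: e * self.eps }
    }
    fn ln(self) -> Dual {
        Dual { re: self.re.ln(), eps: self.eps / self.re }
    }
}

fn main() {
    let x = Dual { re: 1.0, eps: 1.0 };
    // Derivative of exp at 1 is e; derivative of ln at 1 is 1.
    println!("exp'(1) = {}", x.exp().eps);
    println!("ln'(1)  = {}", x.ln().eps);
}
```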

While this does require knowing some derivatives up front, we only need those of the elementary functions, and these are well known.

#Composition and the Chain Rule

The real power of dual numbers becomes evident when we compose functions. Let’s compute the derivative of $\sin(x^2)$.

If we were doing this symbolically, we’d need to use the chain rule, but with dual numbers we just have to evaluate it like we did before:

$$\begin{aligned} (x+\epsilon)^2 &= x^2 + 2x\epsilon \\ \sin((x+\epsilon)^2) &= \sin(x^2 + 2x\epsilon) \\ &= \sin(x^2) + \cos(x^2)\cdot 2x\epsilon \end{aligned}$$

We can see that $\cos(x^2) \cdot 2x$ appears automatically; the chain rule happened through plain algebra. (The last step uses the general rule $\sin(a + b\epsilon) = \sin(a) + \cos(a)\,b\epsilon$, which follows from the same Taylor argument because $(b\epsilon)^2 = 0$.) This also applies to arbitrarily complex compositions: $\sin(\exp(x^2+6x))$ would work in exactly the same way.
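This chain-rule-for-free behaviour is easy to check numerically: build $\sin(x^2)$ from multiplication and a dual-aware `sin`, then compare the $\epsilon$ coefficient against the symbolic answer $2x\cos(x^2)$. A self-contained sketch (again using the illustrative `Dual` type rather than any existing library):

```rust
#[derive(Debug, Clone, Copy)]
struct Dual {
    re: f64,
    eps: f64,
}

impl std::ops::Mul for Dual {
    type Output = Dual;
    fn mul(self, rhs: Dual) -> Dual {
        // eps^2 terms are dropped
        Dual { re: self.re * rhs.re, eps: self.re * rhs.eps + self.eps * rhs.re }
    }
}

impl Dual {
    fn sin(self) -> Dual {
        // sin(a + b eps) = sin(a) + cos(a) * b * eps;
        // the factor b is where the chain rule sneaks in.
        Dual { re: self.re.sin(), eps: self.re.cos() * self.eps }
    }
}

fn main() {
    let x = Dual { re: 2.0, eps: 1.0 };
    let y = (x * x).sin(); // sin(x^2) evaluated at x = 2
    let symbolic = 2.0 * 2.0 * (2.0_f64 * 2.0).cos(); // 2x * cos(x^2)
    println!("dual: {}, symbolic: {}", y.eps, symbolic);
}
```

No chain-rule logic appears anywhere in the code; composing the operations is enough.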

#Conclusion

Dual numbers give a very elegant solution for computing derivatives through a single constraint: $\epsilon^2 = 0$. This causes the Taylor series to be truncated, leaving the derivative in the $\epsilon$ coefficient. All of this happens without any manipulation of symbols or approximation. Just simple algebra!

In a future post, I’ll cover how this can be implemented in Rust and explore the applications of Forward-Mode Automatic Differentiation. We’ll also see how this method can become inefficient, and when alternative approaches such as Reverse-Mode Automatic Differentiation become necessary.