Proof of the Multivariable Chain Rule

We prove the chain rule for compositions of differentiable maps between Euclidean spaces: the derivative of a composition is the composition of the derivatives.

Proof

Theorem: Let $\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^n$ be differentiable at $\mathbf{a}$ and $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^p$ be differentiable at $\mathbf{b} = \mathbf{g}(\mathbf{a})$. Then $\mathbf{f} \circ \mathbf{g}$ is differentiable at $\mathbf{a}$ and $D(\mathbf{f} \circ \mathbf{g})(\mathbf{a}) = D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})$.

Step 1: Setup.

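Since $\mathbf{f}$ is differentiable at $\mathbf{b}$, we can write $\mathbf{f}(\mathbf{b} + \mathbf{k}) = \mathbf{f}(\mathbf{b}) + D\mathbf{f}(\mathbf{b})\mathbf{k} + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})$ where $\boldsymbol{\epsilon}_1(\mathbf{k}) \to \mathbf{0}$ as $\mathbf{k} \to \mathbf{0}$. We set $\boldsymbol{\epsilon}_1(\mathbf{0}) = \mathbf{0}$, so that the identity also holds at $\mathbf{k} = \mathbf{0}$ and $\boldsymbol{\epsilon}_1$ is continuous there; this matters below, where $\mathbf{k}$ may vanish for some $\mathbf{h} \neq \mathbf{0}$.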

Since $\mathbf{g}$ is differentiable at $\mathbf{a}$ : $\mathbf{g}(\mathbf{a} + \mathbf{h}) = \mathbf{g}(\mathbf{a}) + D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})$ where $\boldsymbol{\epsilon}_2(\mathbf{h}) \to \mathbf{0}$ as $\mathbf{h} \to \mathbf{0}$ .

Step 2: Compose.

Set $\mathbf{k} = \mathbf{g}(\mathbf{a} + \mathbf{h}) - \mathbf{g}(\mathbf{a}) = D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})$ . Then:

$(\mathbf{f} \circ \mathbf{g})(\mathbf{a} + \mathbf{h}) = \mathbf{f}(\mathbf{b} + \mathbf{k}) = \mathbf{f}(\mathbf{b}) + D\mathbf{f}(\mathbf{b})\mathbf{k} + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})$

Substituting the expression for $\mathbf{k}$ and using $\mathbf{f}(\mathbf{b}) = (\mathbf{f} \circ \mathbf{g})(\mathbf{a})$, the right-hand side becomes

$(\mathbf{f} \circ \mathbf{g})(\mathbf{a}) + D\mathbf{f}(\mathbf{b})[D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})] + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})$

and, by linearity of $D\mathbf{f}(\mathbf{b})$,

$= (\mathbf{f} \circ \mathbf{g})(\mathbf{a}) + [D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})]\mathbf{h} + \underbrace{\|\mathbf{h}\| D\mathbf{f}(\mathbf{b})\boldsymbol{\epsilon}_2(\mathbf{h}) + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})}_{\text{error term}}$

Step 3: Show the error term is $o(\|\mathbf{h}\|)$ .

For the first error term, with $\|D\mathbf{f}(\mathbf{b})\|$ denoting the operator norm: $\frac{\|\mathbf{h}\| \cdot \|D\mathbf{f}(\mathbf{b})\boldsymbol{\epsilon}_2(\mathbf{h})\|}{\|\mathbf{h}\|} \leq \|D\mathbf{f}(\mathbf{b})\| \cdot \|\boldsymbol{\epsilon}_2(\mathbf{h})\| \to 0$ as $\mathbf{h} \to \mathbf{0}$.

For the second error term: we need $\|\mathbf{k}\| / \|\mathbf{h}\|$ to remain bounded for small $\mathbf{h} \neq \mathbf{0}$. Indeed, by the triangle inequality: $\frac{\|\mathbf{k}\|}{\|\mathbf{h}\|} = \frac{\|D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})\|}{\|\mathbf{h}\|} \leq \|D\mathbf{g}(\mathbf{a})\| + \|\boldsymbol{\epsilon}_2(\mathbf{h})\|$. Since $\boldsymbol{\epsilon}_2(\mathbf{h}) \to \mathbf{0}$, there is a $\delta > 0$ with $\|\boldsymbol{\epsilon}_2(\mathbf{h})\| \leq 1$ whenever $0 < \|\mathbf{h}\| < \delta$, so for such $\mathbf{h}$ the ratio is bounded by $C = \|D\mathbf{g}(\mathbf{a})\| + 1$.

Also, $\mathbf{k} \to \mathbf{0}$ as $\mathbf{h} \to \mathbf{0}$ (since $\mathbf{g}$ is differentiable, hence continuous, at $\mathbf{a}$), so $\boldsymbol{\epsilon}_1(\mathbf{k}) \to \mathbf{0}$.

Therefore: $\frac{\|\mathbf{k}\| \cdot \|\boldsymbol{\epsilon}_1(\mathbf{k})\|}{\|\mathbf{h}\|} \leq C \|\boldsymbol{\epsilon}_1(\mathbf{k})\| \to 0$ .
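The two estimates of Step 3 can be checked numerically on a concrete pair of maps. The sketch below is only an illustration under stated assumptions: the maps `g(x, y) = (xy, x + y)` and `f(u, v) = u^2 + sin(v)`, the base point `a = (1, 2)`, and the helper name `ratios` are all hypothetical choices, not part of the proof. It evaluates $\|\mathbf{k}\| / \|\mathbf{h}\|$ and the full error term divided by $\|\mathbf{h}\|$ along a fixed direction as $\mathbf{h}$ shrinks.

```python
import math

# Hypothetical example maps (not from the text): g: R^2 -> R^2, f: R^2 -> R.
def g(x, y):
    return (x * y, x + y)

def f(u, v):
    return u * u + math.sin(v)

a = (1.0, 2.0)
b = g(*a)  # b = g(a) = (2.0, 3.0)

# Hand-computed derivatives at the base points.
Dg_a = [[a[1], a[0]],   # partials of x*y are (y, x)
        [1.0, 1.0]]     # partials of x+y are (1, 1)
Df_b = [2.0 * b[0], math.cos(b[1])]  # partials of u^2 + sin(v) are (2u, cos v)

def ratios(t, direction=(0.6, -0.8)):
    """For h = t * direction (a unit vector), return
    (||k|| / ||h||, |error term| / ||h||)."""
    h = (t * direction[0], t * direction[1])
    g_shifted = g(a[0] + h[0], a[1] + h[1])
    k = (g_shifted[0] - b[0], g_shifted[1] - b[1])
    # Linear part [Df(b) o Dg(a)] h, computed as Df(b) applied to Dg(a) h.
    Dg_h = [row[0] * h[0] + row[1] * h[1] for row in Dg_a]
    linear = Df_b[0] * Dg_h[0] + Df_b[1] * Dg_h[1]
    error = f(*g_shifted) - f(*b) - linear
    norm_h = math.hypot(h[0], h[1])
    return math.hypot(k[0], k[1]) / norm_h, abs(error) / norm_h

results = [ratios(10.0 ** -n) for n in range(1, 6)]
for n, (k_ratio, err_ratio) in enumerate(results, start=1):
    print(f"||h|| = 1e-{n}: ||k||/||h|| = {k_ratio:.4f}, error/||h|| = {err_ratio:.3e}")
```

For this example the first column should stay bounded while the second shrinks with $\|\mathbf{h}\|$, which is the behaviour the proof establishes in general. A single direction is of course not a proof; it only makes the $o(\|\mathbf{h}\|)$ claim tangible.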

Step 4: Conclusion.

The entire error term divided by $\|\mathbf{h}\|$ tends to $\mathbf{0}$ , establishing: $(\mathbf{f} \circ \mathbf{g})(\mathbf{a} + \mathbf{h}) = (\mathbf{f} \circ \mathbf{g})(\mathbf{a}) + [D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})]\mathbf{h} + o(\|\mathbf{h}\|)$

By definition of differentiability, $\mathbf{f} \circ \mathbf{g}$ is differentiable at $\mathbf{a}$ with derivative $D(\mathbf{f} \circ \mathbf{g})(\mathbf{a}) = D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})$ . $\square$

Remark: Matrix multiplication and the chain rule

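In matrix form, the chain rule says $J_{\mathbf{f} \circ \mathbf{g}}(\mathbf{a}) = J_\mathbf{f}(\mathbf{g}(\mathbf{a})) \cdot J_\mathbf{g}(\mathbf{a})$, the product of a $p \times n$ matrix with an $n \times m$ matrix: composing the linear maps $D\mathbf{f}(\mathbf{b})$ and $D\mathbf{g}(\mathbf{a})$ corresponds to multiplying their Jacobian matrices. This is why the derivative of a composition is the product (not sum) of derivatives, and why the order of the factors matters.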
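The matrix form can be sanity-checked by comparing a finite-difference Jacobian of the composition against the product of finite-difference Jacobians of the two factors. This is a minimal sketch, assuming hypothetical example maps `g: R^2 -> R^3` and `f: R^3 -> R^2` and hypothetical helpers `jacobian` and `matmul` written with the standard library only; the central-difference step `1e-6` is likewise an arbitrary choice.

```python
import math

# Hypothetical example maps: g: R^2 -> R^3 and f: R^3 -> R^2.
def g(x):
    return [x[0] * x[1], x[0] + x[1], math.sin(x[0])]

def f(u):
    return [u[0] * u[1], u[1] + u[2] ** 2]

def jacobian(func, point, step=1e-6):
    """Central-difference approximation of the Jacobian of func at point."""
    rows = len(func(point))
    cols = len(point)
    J = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        plus = list(point)
        plus[j] += step
        minus = list(point)
        minus[j] -= step
        f_plus, f_minus = func(plus), func(minus)
        for i in range(rows):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * step)
    return J

def matmul(A, B):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

a = [1.0, 2.0]
b = g(a)  # Jacobian of f must be evaluated at b = g(a), not at a

J_composite = jacobian(lambda x: f(g(x)), a)         # 2 x 2
J_product = matmul(jacobian(f, b), jacobian(g, a))   # (2 x 3) times (3 x 2)

max_gap = max(abs(J_composite[i][j] - J_product[i][j])
              for i in range(2) for j in range(2))
print("largest entrywise gap:", max_gap)
```

The shapes illustrate why the order of the factors is forced: `jacobian(f, b)` is 2 x 3 and `jacobian(g, a)` is 3 x 2, so only the product in this order yields the 2 x 2 Jacobian of the composition. The two results agree only up to finite-difference and rounding error, so the gap is small but not exactly zero.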