
Proof of the Multivariable Chain Rule

We prove the chain rule for compositions of differentiable multivariable functions, the result that underlies most computations with multivariable derivatives.


Proof

Theorem: Let $\mathbf{g} : \mathbb{R}^m \to \mathbb{R}^n$ be differentiable at $\mathbf{a}$ and $\mathbf{f} : \mathbb{R}^n \to \mathbb{R}^p$ be differentiable at $\mathbf{b} = \mathbf{g}(\mathbf{a})$. Then $\mathbf{f} \circ \mathbf{g}$ is differentiable at $\mathbf{a}$ and

$$D(\mathbf{f} \circ \mathbf{g})(\mathbf{a}) = D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a}).$$
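As a quick numerical sanity check of the statement (not part of the proof), the sketch below compares the product of hand-computed Jacobians with a central finite-difference Jacobian of the composition. The maps `g` and `f`, the point `a`, and all helper names are hypothetical choices made for illustration; only the Python standard library is assumed.

```python
import math


def g(x):
    """Hypothetical inner map g: R^2 -> R^2."""
    x1, x2 = x
    return [x1 * x2, x1 + x2]


def f(y):
    """Hypothetical outer map f: R^2 -> R^2."""
    u, v = y
    return [math.sin(u), u * v]


def jac_g(x):
    """Jacobian of g, computed by hand."""
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0]]


def jac_f(y):
    """Jacobian of f, computed by hand."""
    u, v = y
    return [[math.cos(u), 0.0], [v, u]]


def matmul(A, B):
    """Plain nested-list matrix product A @ B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def numerical_jacobian(func, x, eps=1e-6):
    """Central finite differences; column j approximates d func / d x_j."""
    m = len(func(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(m):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J


a = [0.5, -1.2]          # arbitrary evaluation point
b = g(a)

chain = matmul(jac_f(b), jac_g(a))                     # Df(b) Dg(a)
numeric = numerical_jacobian(lambda x: f(g(x)), a)     # D(f o g)(a), approximately

max_diff = max(abs(chain[i][j] - numeric[i][j])
               for i in range(2) for j in range(2))
print("largest entrywise difference:", max_diff)
```

The two matrices should agree up to finite-difference error, which is consistent with the theorem but of course proves nothing on its own.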

Step 1: Setup.

Since $\mathbf{f}$ is differentiable at $\mathbf{b}$, we can write

$$\mathbf{f}(\mathbf{b} + \mathbf{k}) = \mathbf{f}(\mathbf{b}) + D\mathbf{f}(\mathbf{b})\mathbf{k} + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})$$

where $\boldsymbol{\epsilon}_1(\mathbf{k}) \to \mathbf{0}$ as $\mathbf{k} \to \mathbf{0}$.

Since $\mathbf{g}$ is differentiable at $\mathbf{a}$:

$$\mathbf{g}(\mathbf{a} + \mathbf{h}) = \mathbf{g}(\mathbf{a}) + D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})$$

where $\boldsymbol{\epsilon}_2(\mathbf{h}) \to \mathbf{0}$ as $\mathbf{h} \to \mathbf{0}$.
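For a concrete instance of this kind of expansion, consider the scalar function $g(x) = x^2$ (an illustrative choice, not part of the proof). Expanding directly,

$$g(a + h) = a^2 + 2ah + h^2 = g(a) + Dg(a)h + |h|\,\epsilon_2(h), \qquad Dg(a)h = 2ah, \quad \epsilon_2(h) = \frac{h^2}{|h|} = |h| \quad (h \neq 0),$$

so $\epsilon_2(h) \to 0$ as $h \to 0$, as the definition requires.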

Step 2: Compose.

Set $\mathbf{k} = \mathbf{g}(\mathbf{a} + \mathbf{h}) - \mathbf{g}(\mathbf{a}) = D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})$. Then:

$$(\mathbf{f} \circ \mathbf{g})(\mathbf{a} + \mathbf{h}) = \mathbf{f}(\mathbf{b} + \mathbf{k}) = \mathbf{f}(\mathbf{b}) + D\mathbf{f}(\mathbf{b})\mathbf{k} + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})$$

Substituting $\mathbf{k}$:

$$= (\mathbf{f} \circ \mathbf{g})(\mathbf{a}) + D\mathbf{f}(\mathbf{b})[D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})] + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})$$

$$= (\mathbf{f} \circ \mathbf{g})(\mathbf{a}) + [D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})]\mathbf{h} + \underbrace{\|\mathbf{h}\| D\mathbf{f}(\mathbf{b})\boldsymbol{\epsilon}_2(\mathbf{h}) + \|\mathbf{k}\|\boldsymbol{\epsilon}_1(\mathbf{k})}_{\text{error term}}$$

Step 3: Show the error term is $o(\|\mathbf{h}\|)$.

For the first error term:

$$\frac{\|\mathbf{h}\| \cdot \|D\mathbf{f}(\mathbf{b})\boldsymbol{\epsilon}_2(\mathbf{h})\|}{\|\mathbf{h}\|} \leq \|D\mathbf{f}(\mathbf{b})\| \cdot \|\boldsymbol{\epsilon}_2(\mathbf{h})\| \to 0.$$

For the second error term, we need $\|\mathbf{k}\| / \|\mathbf{h}\|$ to remain bounded. Indeed:

$$\frac{\|\mathbf{k}\|}{\|\mathbf{h}\|} = \frac{\|D\mathbf{g}(\mathbf{a})\mathbf{h} + \|\mathbf{h}\|\boldsymbol{\epsilon}_2(\mathbf{h})\|}{\|\mathbf{h}\|} \leq \|D\mathbf{g}(\mathbf{a})\| + \|\boldsymbol{\epsilon}_2(\mathbf{h})\|$$

which is bounded as $\mathbf{h} \to \mathbf{0}$ (say by $C = \|D\mathbf{g}(\mathbf{a})\| + 1$ for small $\mathbf{h}$).

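Also, $\mathbf{k} \to \mathbf{0}$ as $\mathbf{h} \to \mathbf{0}$ (since $\mathbf{g}$ is differentiable, hence continuous, at $\mathbf{a}$), so $\boldsymbol{\epsilon}_1(\mathbf{k}) \to \mathbf{0}$. Here we use the convention $\boldsymbol{\epsilon}_1(\mathbf{0}) = \mathbf{0}$, which covers any $\mathbf{h}$ for which $\mathbf{k} = \mathbf{0}$.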

Therefore:

$$\frac{\|\mathbf{k}\| \cdot \|\boldsymbol{\epsilon}_1(\mathbf{k})\|}{\|\mathbf{h}\|} \leq C \|\boldsymbol{\epsilon}_1(\mathbf{k})\| \to 0.$$
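The two estimates in this step can be observed numerically. The sketch below shrinks $\mathbf{h}$ along a fixed direction and records both $\|\mathbf{k}\| / \|\mathbf{h}\|$ and the full error term divided by $\|\mathbf{h}\|$. The maps `g` and `f`, the point `a`, the direction, and the helper names are all hypothetical choices; the Frobenius norm is used as a convenient upper bound for the operator norm $\|D\mathbf{g}(\mathbf{a})\|$ rather than computing the operator norm itself.

```python
import math


def g(x):
    """Hypothetical inner map g: R^2 -> R^2."""
    x1, x2 = x
    return [x1 * x2, x1 + x2]


def f(y):
    """Hypothetical outer map f: R^2 -> R^2."""
    u, v = y
    return [math.sin(u), u * v]


def jac_g(x):
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0]]


def jac_f(y):
    u, v = y
    return [[math.cos(u), 0.0], [v, u]]


def matvec(A, v):
    """Matrix-vector product for nested lists."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in A]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(c * c for c in v))


a = [0.5, -1.2]
b = g(a)
Dg, Df = jac_g(a), jac_f(b)

# Frobenius norm of Dg(a) plus 1: an upper bound for the constant C in the proof.
C = math.sqrt(sum(c * c for row in Dg for c in row)) + 1.0

direction = [0.6, 0.8]   # unit vector, arbitrary choice
k_ratios, error_ratios = [], []
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = [t * d for d in direction]
    a_plus_h = [ai + hi for ai, hi in zip(a, h)]
    k = [gi - bi for gi, bi in zip(g(a_plus_h), b)]       # k = g(a + h) - g(a)
    linear = matvec(Df, matvec(Dg, h))                    # Df(b) Dg(a) h
    error = [Fi - F0 - Li
             for Fi, F0, Li in zip(f(g(a_plus_h)), f(b), linear)]
    k_ratios.append(norm(k) / norm(h))
    error_ratios.append(norm(error) / norm(h))
    print(t, k_ratios[-1], error_ratios[-1])
```

In this example the first ratio stays below `C` while the second shrinks along with `t`, which is the behaviour the argument predicts.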

Step 4: Conclusion.

The entire error term divided by $\|\mathbf{h}\|$ tends to $\mathbf{0}$ as $\mathbf{h} \to \mathbf{0}$, establishing:

$$(\mathbf{f} \circ \mathbf{g})(\mathbf{a} + \mathbf{h}) = (\mathbf{f} \circ \mathbf{g})(\mathbf{a}) + [D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})]\mathbf{h} + o(\|\mathbf{h}\|)$$

By definition of differentiability, $\mathbf{f} \circ \mathbf{g}$ is differentiable at $\mathbf{a}$ with derivative $D(\mathbf{f} \circ \mathbf{g})(\mathbf{a}) = D\mathbf{f}(\mathbf{b}) \circ D\mathbf{g}(\mathbf{a})$. $\square$


Remark: Matrix multiplication and the chain rule

In matrix form, the chain rule says $J_{\mathbf{f} \circ \mathbf{g}} = J_\mathbf{f} \cdot J_\mathbf{g}$, so the chain rule for derivatives translates to matrix multiplication for Jacobians. This is why the derivative of a composition is the product (not sum) of derivatives.
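A small worked example may make the matrix form concrete. The maps below are illustrative choices, not taken from the text above: let $\mathbf{g}(x, y) = (xy,\ x + y)$ and $f(u, v) = u^2 + v$. Then

$$J_{\mathbf{g}}(x, y) = \begin{pmatrix} y & x \\ 1 & 1 \end{pmatrix}, \qquad J_f(u, v) = \begin{pmatrix} 2u & 1 \end{pmatrix},$$

and evaluating $J_f$ at $(u, v) = \mathbf{g}(x, y)$ gives

$$J_f(\mathbf{g}(x, y)) \cdot J_{\mathbf{g}}(x, y) = \begin{pmatrix} 2xy & 1 \end{pmatrix} \begin{pmatrix} y & x \\ 1 & 1 \end{pmatrix} = \begin{pmatrix} 2xy^2 + 1 & 2x^2 y + 1 \end{pmatrix}.$$

Differentiating $(f \circ \mathbf{g})(x, y) = x^2 y^2 + x + y$ directly yields the same row vector.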