Chain rule for single variable functions
The chain rule is, intuitively, a product of two derivatives. Suppose we have three moving things, A, B and C, whose speeds satisfy A > B > C. If we know how many times faster A is than B, and how many times faster B is than C, then we also know how many times faster A is than C: we just multiply the ratio between A and B by the ratio between B and C. This is the example in Wikipedia's article.
Another example. In meteorology there is the lapse rate, which is the variation in temperature with height in the atmosphere. It's a ratio °C / km. If we fly up or down we experience changes in temperature because we are moving with respect to each level of temperature in the atmosphere. The other way around doesn't happen: if we stay put, the atmosphere won't move up or down with respect to us. Flying faster naturally yields a faster change in temperature. Our speed is a ratio km / time. If we want the ratio °C / time we have to take the product [math]\displaystyle{ \frac{^{\circ}\text{C}}{\text{km}} \cdot \frac{\text{km}}{\text{time}} = \frac{^{\circ}\text{C}}{\text{time}} }[/math].
[math]\displaystyle{ \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} }[/math]
It's important to highlight one thing: the above example of the atmosphere's temperature is a linear case (assuming we're flying at constant speed), which translates to [math]\displaystyle{ T_1'(t) = T_2'(h(t)) \cdot h'(t) }[/math], where [math]\displaystyle{ T_1(t) = T_2(h(t)) }[/math]. Here [math]\displaystyle{ T_1'(t) }[/math] is the variation in temperature over time, and the right side is the product of the variation of temperature over height by the variation of height over time. Notice that the function nested inside [math]\displaystyle{ T_2' }[/math] is [math]\displaystyle{ h(t) }[/math], which gives the height as a function of time; its derivative [math]\displaystyle{ h'(t) }[/math] is the ratio height / time. Notice also that there are two different rates of change.
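To make the units argument concrete, here is a small numerical sketch. The lapse rate, the ground temperature and the climb rate are values I assumed for illustration; they are not part of the example above. It compares the product of the two rates with the temperature change we would actually measure over one minute of flight.

```python
# Assumed, illustrative values (not taken from the text above).
LAPSE_RATE_C_PER_KM = -6.5    # dT/dh: assumed drop of 6.5 °C per km climbed
CLIMB_RATE_KM_PER_MIN = 0.3   # dh/dt: hypothetical constant climb rate
GROUND_TEMP_C = 15.0          # hypothetical temperature at height 0


def height(t_min):
    """Height in km after t_min minutes of climbing from the ground."""
    return CLIMB_RATE_KM_PER_MIN * t_min


def temp_at_height(h_km):
    """Temperature in °C at height h_km (the function T_2)."""
    return GROUND_TEMP_C + LAPSE_RATE_C_PER_KM * h_km


def temp_at_time(t_min):
    """Temperature we experience at time t_min (the composition T_1 = T_2(h(t)))."""
    return temp_at_height(height(t_min))


# Chain rule: dT/dt = dT/dh * dh/dt, in °C per minute.
chain_rule_rate = LAPSE_RATE_C_PER_KM * CLIMB_RATE_KM_PER_MIN

# Direct measurement: temperature change over one minute of flight.
measured_rate = (temp_at_time(11.0) - temp_at_time(10.0)) / 1.0

print(chain_rule_rate, measured_rate)
```

Because both functions are linear, the two numbers agree up to floating-point rounding, whichever minute we pick.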
In general, for a composite function [math]\displaystyle{ h(x) = f(g(x)) }[/math]:
[math]\displaystyle{ h'(x) = g'(x) \cdot f'(g(x)) }[/math]
We can have any number of functions nested within another. The rule still holds and the name comes from the fact that we have a chain of operations, a chain of derivatives.
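As a sketch of a longer chain, take three nested functions. The functions and the point x = 0.5 are my own choices for illustration. The chain rule gives one factor per link, and a finite-difference estimate of the slope should agree with the product.

```python
import math


def inner(x):
    return x ** 2          # innermost link: u = x^2


def middle(u):
    return math.exp(u)     # middle link: v = e^u


def outer(v):
    return math.sin(v)     # outermost link: y = sin(v)


def composite(x):
    """y = sin(exp(x^2)), three functions nested in one another."""
    return outer(middle(inner(x)))


def chain_derivative(x):
    """dy/dx = dy/dv * dv/du * du/dx, one factor per link of the chain."""
    u = inner(x)
    v = middle(u)
    return math.cos(v) * math.exp(u) * (2 * x)


def central_difference(func, x, step=1e-6):
    """Numerical estimate of func'(x), used only as a sanity check."""
    return (func(x + step) - func(x - step)) / (2 * step)


x = 0.5  # hypothetical evaluation point
print(chain_derivative(x), central_difference(composite, x))
```

The two printed values should match to several decimal places; the small difference comes from the finite step in the numerical estimate.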
Note: sometimes we have composite functions but we don't see them clearly. For example: [math]\displaystyle{ y = \sin^2(x) }[/math]. We can see it as a product [math]\displaystyle{ y = \sin(x) \sin(x) }[/math], but we could also see it as [math]\displaystyle{ y = u^2 }[/math] with [math]\displaystyle{ u = \sin(x) }[/math]. In a more conventional notation: [math]\displaystyle{ f(x) = x^2 }[/math], [math]\displaystyle{ g(x) = \sin(x) }[/math] and [math]\displaystyle{ (f \circ g)(x) = \sin^2(x) }[/math]. This is especially common with implicit differentiation.
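A quick check that both readings of the squared sine lead to the same derivative. The function names and sample points below are mine, chosen for illustration.

```python
import math


def product_rule_derivative(x):
    """Treat y = sin(x) * sin(x) as a product: (uv)' = u'v + uv'."""
    return math.cos(x) * math.sin(x) + math.sin(x) * math.cos(x)


def chain_rule_derivative(x):
    """Treat y = f(g(x)) with f(u) = u^2 and g(x) = sin(x): f'(g(x)) * g'(x)."""
    return 2 * math.sin(x) * math.cos(x)


# Arbitrary sample points, including a negative one.
for x in (0.0, 0.7, 2.0, -1.3):
    print(x, product_rule_derivative(x), chain_rule_derivative(x))
```

Both columns agree at every point, as expected, since both expressions simplify to 2 sin(x) cos(x).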
Graphical reasoning for the chain rule
I don't know of any textbooks that show a graphical interpretation of the chain rule. Let's consider [math]\displaystyle{ f(x) = 3x }[/math] and [math]\displaystyle{ g(x) = x^2 }[/math]. The graph of the former is a straight line and the constant factor is the angular coefficient (the slope), greater meaning a steeper inclination. The latter is a parabola. The first has a constant rate of change, the second does not.
The graph of [math]\displaystyle{ g(f(x)) = (3x)^2 }[/math] has a greater rate of change than the graph of [math]\displaystyle{ g(x) = x^2 }[/math]. Think about this: if we choose [math]\displaystyle{ x = 2 }[/math] the rates of change are, at that point and for each function, [math]\displaystyle{ f'(2) = 3 }[/math] and [math]\displaystyle{ g'(2) = 4 }[/math]. For the composite function, [math]\displaystyle{ g' }[/math] has to be evaluated at [math]\displaystyle{ f(2) = 6 }[/math], so we have [math]\displaystyle{ (g \circ f)'(2) = f'(2) \cdot g'(f(2)) = 3 \cdot (2 \cdot 6) = 36 }[/math]. I did this simple example with positive numbers but the chain rule holds for negative numbers and for more complicated functions.
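The arithmetic for this example can be checked in a few lines. This is only a sketch with names of my choosing; it compares the chain rule value at x = 2 with the derivative of the expanded form 9x², which is 18x, and with a numerical slope.

```python
def f(x):
    return 3 * x          # straight line with slope 3


def f_prime(x):
    return 3              # constant rate of change


def g(u):
    return u ** 2         # parabola


def g_prime(u):
    return 2 * u          # rate of change depends on the point


x = 2

# Chain rule: f'(2) * g'(f(2)) = 3 * g'(6) = 3 * 12
chain_rule_value = f_prime(x) * g_prime(f(x))

# Expanding first: (3x)^2 = 9x^2, whose derivative is 18x
power_rule_value = 18 * x

# Numerical slope of the composite, as an independent check
step = 1e-6
numerical_value = (g(f(x + step)) - g(f(x - step))) / (2 * step)

print(chain_rule_value, power_rule_value, numerical_value)
```

The first two values are exactly 36, and the numerical slope is 36 up to rounding.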
Note: in this specific case we could have used the product rule. Or even faster, the power rule.
Proof of the chain rule
It's natural to think that the derivative of a composite function is the composition of the derivatives. The same kind of misleading intuition commonly shows up with the product and quotient rules. When we have a composition, the value of one function depends on the value of the other, so we can easily be fooled into thinking that the derivative of [math]\displaystyle{ f(g(x)) }[/math] is [math]\displaystyle{ f'(g'(x)) }[/math]. Mathematically this doesn't make sense, because we just swapped a function for its derivative. Who said that it's right to replace a function with its derivative and expect the result of this operation to be meaningful? Who said that the rate of change of [math]\displaystyle{ f }[/math] should take the rate of change of [math]\displaystyle{ g }[/math] as its input?
The problem of finding the tangent line describes how a differentiable function can be seen as a linear function if we consider a small enough interval around a point. Let's begin by defining two affine functions:
[math]\displaystyle{ f(x) = ax + b }[/math]
[math]\displaystyle{ g(x) = cx + d }[/math]
Let's take a look at:
[math]\displaystyle{ f(g(x)) = ag(x) + b }[/math]
[math]\displaystyle{ f(cx + d) = a(cx + d) + b }[/math]
[math]\displaystyle{ (f \circ g)(x) = acx + ad + b }[/math]
Did you notice the product of the angular coefficients (the slopes), [math]\displaystyle{ a \cdot c }[/math]? If we differentiate the expression [math]\displaystyle{ acx + ad + b }[/math] with respect to [math]\displaystyle{ x }[/math], the operation yields [math]\displaystyle{ ac }[/math]! Surprise! That's not a formal proof though. The fundamental idea behind it is that if a function is differentiable, then near any given point we can treat it as a linear function.
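Here is a sketch of that idea with functions that are not affine. The functions and the point x0 are my own choices, for illustration only. We replace each function by its tangent line at the relevant point, compose the two tangent lines, and compare the resulting slope a·c with a numerical slope of the true composite.

```python
import math


def g(x):
    return x ** 2


def g_prime(x):
    return 2 * x


def f(u):
    return math.sin(u)


def f_prime(u):
    return math.cos(u)


x0 = 1.2  # hypothetical point around which we linearize

c = g_prime(x0)        # slope of the tangent line to g at x0
a = f_prime(g(x0))     # slope of the tangent line to f at g(x0)


def g_tangent(x):
    """Affine approximation of g near x0."""
    return g(x0) + c * (x - x0)


def f_tangent(u):
    """Affine approximation of f near g(x0)."""
    return f(g(x0)) + a * (u - g(x0))


def composed_tangents(x):
    """Composition of the two affine approximations; its slope is a * c."""
    return f_tangent(g_tangent(x))


step = 1e-4
slope_of_composed_tangents = (
    composed_tangents(x0 + step) - composed_tangents(x0 - step)
) / (2 * step)
slope_of_composite = (f(g(x0 + step)) - f(g(x0 - step))) / (2 * step)

print(a * c, slope_of_composed_tangents, slope_of_composite)
```

The three values agree closely. This is a numerical illustration of the idea, not a proof.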
Links for the proof: