Derivatives, Gradients, Jacobians and Hessians

blog.demofox.org

278 points by ibobev 4 days ago


GistNoesis - 4 days ago

The way that really made me understand gradients and derivatives was visualizing them as arrow maps. I even made a small tool: https://github.com/GistNoesis/VisualizeGradient . This visualization helps in understanding optimization algorithms.

Jacobians can be understood as a collection of gradients, one for each coordinate of the output considered independently.

My mental picture for the Hessian is to associate each point with the shape of the parabola (or saddle) that best matches the function locally. It's easy to visualize once you realize it's the shape of what you see when you zoom in on the point. (Technically this mental picture is more of a second-order multivariate Taylor expansion, i.e. the gradient's tangent plane plus the Hessian's curvature, but I find it hard to mentally separate the slope from the curvature.)
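A minimal sketch of the arrow-map idea (just an illustration, not the linked tool): evaluate the gradient of a simple f(x, y) = x² + y² on a grid and draw it as arrows with matplotlib.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scalar field: f(x, y) = x^2 + y^2
# Its gradient is (df/dx, df/dy) = (2x, 2y)
x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
gx, gy = 2 * x, 2 * y

# Each arrow points in the direction of steepest ascent;
# its length is the local rate of change.
plt.quiver(x, y, gx, gy)
plt.title("Gradient field of f(x, y) = x^2 + y^2")
plt.gca().set_aspect("equal")
plt.show()
```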

fouronnes3 - 4 days ago

There's something that's always been deeply confusing to me about comparing the Jacobian and the Hessian, because their natures are very different.

The Hessian shouldn't have been called a matrix.

The Jacobian describes all the first-order derivatives of a vector-valued function (of multiple inputs), while the Hessian is all the second-order derivatives of a scalar-valued function (of multiple inputs). Why doesn't the number of dimensions of the array increase by one as the order of differentiation increases? It does! The object that fully describes the second-order derivatives of a vector-valued function of multiple inputs is actually a 3-dimensional tensor: one dimension for the original vector-valued output, and one for each differentiation. Mathematicians are afraid of tensors of more than 2 dimensions for some reason and want everything to be a matrix.

In other words, given a function R^n -> R^m:

Order 0: Output value: 1d array of shape (m) (a vector)

Order 1: First order derivative: 2d array of shape (m, n) (Jacobian matrix)

Order 2: Second order derivative: 3d array of shape (m, n, n) (array of Hessian matrices)

It all makes sense!

Talking about "Jacobian and Hessian" matrices as if they are both naturally matrices is highly misleading.
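Those shapes are easy to check numerically; here is a small sketch using jax, with a made-up g: R^3 -> R^2:

```python
import jax
import jax.numpy as jnp

# Made-up function g: R^3 -> R^2
def g(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2]) + x[0] ** 2])

x = jnp.array([1.0, 2.0, 3.0])

print(g(x).shape)                 # (2,)       order 0: output vector (m,)
print(jax.jacfwd(g)(x).shape)     # (2, 3)     order 1: Jacobian (m, n)
print(jax.hessian(g)(x).shape)    # (2, 3, 3)  order 2: one Hessian per output (m, n, n)
```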

sestep - 4 days ago

A bit more advanced than this post, but for calculating Jacobians and Hessians, the Julia folks have done some cool work recently building on classical automatic differentiation research: https://iclr-blogposts.github.io/2025/blog/sparse-autodiff/

tired_and_awake - 4 days ago

About a decade ago I was interviewed for Apple's self driving car project and an exec on the project asked me to define these exact 4 things in great detail and provide examples. Shrugs.

leopoldj - 3 days ago

Thank you so much for posting. I finally understand the Jacobian matrix. The key is to know that it applies to a function that returns multiple values. The Wikipedia article was difficult to understand, until now! Note: technically, a function can map an input to only a single output. Here, when we say a function returns multiple values, we mean a single tuple of multiple values. For example, a function that outputs the heating and cooling costs of a building. Whereas a circle is not a function, because it has two y values for a single x.
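To make the heating/cooling example concrete, here is a sketch with an invented cost model (all numbers and formulas are made up): two inputs (outdoor temperature, floor area) map to two outputs (heating cost, cooling cost), so the Jacobian is a 2×2 matrix of partial derivatives, estimated here by finite differences.

```python
import numpy as np

# Invented cost model: inputs (temp_c, area_m2) -> outputs (heating_cost, cooling_cost)
def costs(temp_c, area_m2):
    heating = area_m2 * max(18.0 - temp_c, 0.0) * 0.5
    cooling = area_m2 * max(temp_c - 24.0, 0.0) * 0.7
    return np.array([heating, cooling])

# Finite-difference Jacobian: J[i, j] = d(output_i) / d(input_j)
def jacobian(temp_c, area_m2, eps=1e-5):
    base = costs(temp_c, area_m2)
    d_temp = (costs(temp_c + eps, area_m2) - base) / eps
    d_area = (costs(temp_c, area_m2 + eps) - base) / eps
    return np.column_stack([d_temp, d_area])   # shape (2, 2)

print(jacobian(5.0, 100.0))
```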

nickpsecurity - 4 days ago

"What I just described is an iterative optimization method that is similar to gradient descent. Gradient descent simulates a ball rolling down hill to find the lowest point that we can, adjusting step size, and even adding momentum to try and not get stuck in places that are not the true minimum."

That is so much easier to understand than most descriptions. The whole opening was.
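The quoted description maps almost directly onto code. A minimal sketch of gradient descent with momentum on a made-up bumpy function (the step size and momentum coefficient are arbitrary choices, not from the article):

```python
import numpy as np

# Made-up bumpy function and its gradient
def f(p):
    x, y = p
    return x ** 2 + y ** 2 + 0.5 * np.sin(3 * x)

def grad(p):
    x, y = p
    return np.array([2 * x + 1.5 * np.cos(3 * x), 2 * y])

p = np.array([2.0, 2.0])      # starting point ("the ball")
velocity = np.zeros(2)
step_size, momentum = 0.05, 0.9

for _ in range(200):
    # Momentum keeps the ball rolling through shallow bumps
    velocity = momentum * velocity - step_size * grad(p)
    p = p + velocity

print(p, f(p))   # should end up near a minimum
```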

ziofill - 4 days ago

Mmh, this is a bit sloppy. The derivative of a function f::a -> b is a function Df::a -> a -o b where the second funny arrow indicates a linear function. I.e. the derivative Df takes a point in the domain and returns a linear approximation of f (the jacobian) at that point. And it’s always the jacobian, it’s just that when f is R -> R we conflate the jacobian (a 1x1 matrix in this case) with the number inside of it.
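A quick numeric illustration of that view, with a made-up f: R^2 -> R^2 and its hand-written Jacobian: the linear map Df(v) reproduces how f changes for small displacements around v.

```python
import numpy as np

# Example f: R^2 -> R^2
def f(v):
    x, y = v
    return np.array([x * y, np.sin(x) + y ** 2])

# Its Jacobian at v, written out by hand: the linear map Df(v)
def Df(v):
    x, y = v
    return np.array([[y,         x],
                     [np.cos(x), 2 * y]])

v = np.array([1.0, 2.0])
dx = np.array([1e-3, -2e-3])

lhs = f(v + dx)
rhs = f(v) + Df(v) @ dx            # linear approximation at v
print(np.max(np.abs(lhs - rhs)))   # tiny: the Jacobian is the local linear map
```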

flufluflufluffy - 4 days ago

Fantastic post! As short as it needs to be while still communicating its points effectively. I love walking up the generalization levels in math.

vismit2000 - 3 days ago

This is a fantastic video on Jacobian [Mathemaniac]: https://www.youtube.com/watch?v=wCZ1VEmVjVo

whatever1 - 4 days ago

I can look around me and find the minimum of anything without tracing its surface and following the gradient. I can also immediately identify global minima instead of local ones.

We all can do it in 2-3D. But our algorithms don’t do it. Even in 2D.

Sure, if I were blindfolded, feeling the surface and searching for a direction of descent would be the way to go. But when I can see, I don't have to.

What are we missing?

divbzero - 4 days ago

Would love to see div and curl added to this post.

amelius - 4 days ago

> (...) The derivative of w with respect to x. Another way of saying that is “If you added 1 to x before plugging it into the function, this is how much w would change”

Incorrect!
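The likely objection: the derivative is the instantaneous rate of change, so "add 1 to x" only matches the actual change when the function is (locally) linear. A tiny worked check with a hypothetical f(x) = x²:

```python
def f(x):
    return x ** 2          # hypothetical example function

x = 3.0
derivative = 2 * x                 # f'(x) = 2x, so f'(3) = 6
actual_change = f(x + 1) - f(x)    # f(4) - f(3) = 16 - 9 = 7

print(derivative)      # 6.0  (what the quoted claim predicts)
print(actual_change)   # 7.0  (what actually happens for a step of 1)
```

The derivative predicts a change of 6, but stepping x by a whole unit changes w by 7; the two only agree in the limit of small steps.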