Convolutions, Polynomials and Flipped Kernels
eli.thegreenplace.net | 105 points by mfrw 4 days ago
My favourite use case for this: by the same derivation as this blog, one can prove that if you have any two independent random variables X and Y (their distributions can be different), the probability distribution of X+Y is the convolution of the PMFs/PDFs of X and Y.
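Here's a minimal numeric sketch of that claim (my own toy example with fair dice, not from the blog): the PMF of the sum of two independent dice is the discrete convolution of their PMFs, and a Monte Carlo estimate agrees.

```python
import numpy as np

rng = np.random.default_rng(0)

die = np.full(6, 1 / 6)               # PMF of a fair die on the values 1..6
pmf_sum = np.convolve(die, die)       # PMF of the sum, on the values 2..12

# Monte Carlo estimate of the same PMF for comparison.
samples = rng.integers(1, 7, size=1_000_000) + rng.integers(1, 7, size=1_000_000)
empirical = np.bincount(samples, minlength=13)[2:] / samples.size

print(np.max(np.abs(pmf_sum - empirical)))   # small, on the order of 1e-3
```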
Along similar lines, the MAX operator on random variables becomes the PRODUCT operator on their distribution functions.
It's fun to play with the (Max, +) algebra of random variables and infer its distribution.
This turns out to be quite useful in estimating the completion time of dependent parallel jobs.
Spawning multiple parallel jobs becomes a Max operation and chaining sequential jobs becomes a '+' operation on the completion times. This expression tree of Maxes and Pluses can be processed algebraically to obtain bounds on the completion time distribution.
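A small sketch of that idea (my own toy example, assuming three independent exponential job times A, B, C): the tree max(A, B) + C is evaluated directly on the distributions, discretised on a grid, rather than by sampling.

```python
import numpy as np

dt = 0.01
t = np.arange(0, 50, dt)            # time grid; the tails beyond 50 are negligible here

def exp_pdf(rate):
    return rate * np.exp(-rate * t)  # pdf of an exponential job time

pdf_a, pdf_b, pdf_c = exp_pdf(1.0), exp_pdf(1.5), exp_pdf(0.5)

# Parallel branch max(A, B): multiply the CDFs.
cdf_a = np.cumsum(pdf_a) * dt
cdf_b = np.cumsum(pdf_b) * dt
cdf_max = cdf_a * cdf_b

# Sequential step + C: convolve the pdf of the max with the pdf of C.
pdf_max = np.gradient(cdf_max, dt)
pdf_total = np.convolve(pdf_max, pdf_c)[: len(t)] * dt

# Exact answer for these rates is 1 + 2/3 - 2/5 + 2, roughly 3.27.
print("mean completion time:", np.sum(t * pdf_total) * dt)
```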
One example is the straggler problem in MapReduce/Hadoop. In the naive case, the completion time is the max of the parallel subtasks' completion times. (*)
If the task times have a heavy tail, which they sometimes do, the straggler's completion time can be really bad. This can be mitigated by a k-out-of-n setup, where you encode the problem in such a way that only k out of n jobs need to finish to obtain the final result. One can play with the trade-off between potentially wasted computation and expected completion time.
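A quick simulation of that trade-off (my own illustrative numbers, assuming i.i.d. Pareto task durations with shape 1.5):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 20, 16, 100_000

# Heavy-tailed task durations: classical Pareto with shape 1.5, minimum 1.
tasks = rng.pareto(1.5, size=(trials, n)) + 1.0

need_all = tasks.max(axis=1)                   # naive: wait for every task
need_k = np.sort(tasks, axis=1)[:, k - 1]      # coded: wait for the k-th finisher

for name, dur in (("all n", need_all), ("k of n", need_k)):
    print(name, "p50:", np.percentile(dur, 50), "p99:", np.percentile(dur, 99))
```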
For heavy-tailed distributions another simplification is possible: the tails of Max and + become of the same order, so one can switch between convolutions and products.
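A rough illustration of that equivalence (again my own example, with Pareto samples): the ratio of the two tail probabilities drifts toward 1 as the threshold grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.pareto(1.5, 2_000_000) + 1.0
y = rng.pareto(1.5, 2_000_000) + 1.0

for thresh in (10, 50, 200):
    p_sum = np.mean(x + y > thresh)              # tail of the sum
    p_max = np.mean(np.maximum(x, y) > thresh)   # tail of the max
    print(thresh, p_sum / p_max)                 # ratio approaches 1 for large thresholds
```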
(*) This shows up in microservices architectures too. The owner/maintainer of a microservice endpoint might be very happy with its tail latency. However, an end user who consumes and composes the results of those endpoints can experience really bad tail latencies for the final output.
Thanks for sharing the name of that problem! I've encountered it before while optimizing batched LLM inference. The whole batch would last until all queries in the batch were done, and by changing the batch size you'd trade off per-query speed (better in a larger batch) against overall performance (worse with a larger batch).
Nowadays I think this is solved in an entirely different way, though.
It gets more entertaining.
It's common to wrap API calls with:
- retry on failure, or
- spawn an identical request if taking longer than x, or
- recursively spawn an identical request if taking longer than x, or
- retry on failure but no more than k times.
All of these and similar patterns/decorators can be analysed using the same idea.
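As a small sketch of the second pattern (hedging: spawn a duplicate if the first attempt takes longer than x), here's how the completion time composes out of min, +, and the original latency; the lognormal latencies and the cutoff are just illustrative assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
trials, hedge_after = 1_000_000, 2.0

t1 = rng.lognormal(mean=0.0, sigma=1.0, size=trials)   # original request latency
t2 = rng.lognormal(mean=0.0, sigma=1.0, size=trials)   # duplicate spawned at time hedge_after

# Completion time: t1 if it beats the cutoff, otherwise the faster of
# the original and the (delayed) duplicate.
hedged = np.where(t1 <= hedge_after, t1, np.minimum(t1, hedge_after + t2))

print("p99 without hedging:", np.percentile(t1, 99))
print("p99 with hedging:   ", np.percentile(hedged, 99))
```

(Min on the random variables corresponds to a product of the survival functions, so the same distribution-level algebra applies.)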
Oh wow, pretty cool stuff! If you have more to share, you can always dump it in my mail inbox
You should be careful with your estimation. The events should be independent to apply those properties, but it is very common that one cause influences many factors, so they are not independent and all the beautiful math does not work as it does under independence. On the worst day, everything fails, because one resource can block others and the system gets strangled. There is the Black Swan book, where one rare event makes the financial market realize what risk is.
A load balancer transparently sitting in front of the API endpoint (not an uncommon scenario) usually decouples things well enough to be practically independent.
That said, silloncito's warning does need to be paid heed.
While independence is essential for the proof to go through, the relationships need not break catastrophically when independence is violated; usually the degradation is graceful in the degree of dependence. There are, however, specific, often degenerate, theoretical edge cases where the degradation is rapid.
Almost every probability theorem starts with "let's take independent random variables". But in reality almost nothing is independent. Superdeterminism even claims that exactly nothing is independent.
Almost every elementary probability theorem. There are plenty of theorems conditional on bounds on how weak the dependency should be (or how fast correlations decrease with order, etc.), including CLT-like theorems, it’s just that they are difficult to state, very difficult to prove, and almost impossible to verify the applicability of. In practice you are anyway going to instead use the simpler version and check if the results make sense after the fact.
You are right.
The degree of dependence matters though. Mutual information is one way to measure that.
Thankfully, some theorems remain valid even when independence is violated. The next stop after independence is martingale criteria. Martingale difference sequences can be quite strongly dependent yet allow some of the usual theorems to go through but with worse convergence rates.
What? I've never heard of this math. Do you mean those are literally the resulting operations in the general case, or is that an approximate explanation where we need to find more specific cases to make it true?
The math is general and exact.
Max and Plus in the space of random variables become product and convolution in the space of their distribution functions.
Distr(X + Y) = Distr(X) ° Distr(Y)
Distr(X ^ Y) = Distr(X) * Distr(Y)
where '^' denotes Max and '°' denotes convolution.
Note that *, +, ° and ^ are commutative and associative, so they can be chained. One can also use their distributive properties. This is really the math of groups and rings. However, one can, and one does, resort to approximations to compute the desired end results.
More specifically, people are often interested not in the distribution itself but in some statistic: the mean, the standard deviation, a tail percentile, etc. To compute those stats from the exact distributions, approximations can be employed.
Surely this isn't quite right.
Max of variables = product of cumulative distribution functions.
Sum of variables = convolution of probability density functions.
So both of the equations you write down are correct, but only if you interpret "Distr" as meaning different things in the two cases.
[EDITED to add:] Provided the random variables in question are independent, as mentioned elsewhere in the discussion; if they aren't then none of this works.
The original post, to which I replied, is about the correspondence between summation of random variables and convolution of their distributions. Independence is sufficient for that.
I just carried through that assumption of independence in my own comment, thinking it was obvious to do that (carry over the assumptions).
But is he right about the different meanings of Distr in your equations?
No he is not.
Both ideas work for the cumulative distribution function, which in math is just called the distribution function. I think he got confused by the fact that the convolution relation also works with densities (so he might have assumed that it works with densities only and not with distribution functions).
I'm sorry, but I think you are just wrong about convolutions and cumulative distribution functions.
Let's take the simplest possible case: a "random" variable that's always equal to 0. Its cdf is a step function: 0 for negative values, 1 for positive values. (Use whatever convention you prefer for the value at 0.)
The sum of two such random variables is another with the same distribution, of course. So, what's the convolution of the cdfs?
Answer: it's not even well defined.
The convolution of functions f and g is the function h such that h(x) = integral over t of f(t) g(x-t) dt. The integral is over the whole of (in this case) the real numbers.
In this case f and g are both step functions as described above, so (using the convenient Iverson bracket notation for indicator functions) this is the integral over t of [t>0] [x-t>0] dt, i.e., of [0<t<x] dt, whose value is 0 for negative x and x for positive x.
This is not the cdf of any probability distribution since it doesn't -> 1 as x -> oo. In particular, it isn't the cdf of the right probability distribution which, as mentioned above, would be the same step function as f and g.
If X, Y are independent with pdfs f, g and cdfs F, G, then the cdf of X+Y is (not F conv G but) f conv G = F conv g.
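A quick numeric check of that statement, with my own choice of X, Y i.i.d. Uniform(0, 1) on a discrete grid (so g = f and G = F): f conv G reproduces the cdf of X+Y, while F conv G does not even stay bounded by 1.

```python
import numpy as np

dx = 0.001
x = np.arange(0, 3, dx)

f = ((x >= 0) & (x <= 1)).astype(float)   # pdf of Uniform(0, 1) on the grid
F = np.clip(x, 0.0, 1.0)                  # cdf of Uniform(0, 1)

cdf_correct = np.convolve(f, F)[: len(x)] * dx   # f conv G
cdf_wrong = np.convolve(F, F)[: len(x)] * dx     # F conv G

print(cdf_correct[-1])   # ~1.0, as a cdf should be at the right end of the grid
print(cdf_wrong[-1])     # ~2.0, so F conv G is not a cdf
```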