Super fast aggregations in PostgreSQL 19
cybertec-postgresql.com | 116 points by jnord 7 days ago
Interestingly, "aggregate first, join later" has been the standard way of joining fact tables in BI tools for a long time. Since fact tables are typically big and also share common dimensions, multi-fact joins for drill-across are best done by first aggregating on those common dimensions, then joining on them.
Makes you wonder how many optimizations out there feel almost second nature in one domain but have never been applied to another, simply because no one thought to carry them over.
It's not that nobody thought of it. Group pushdown has been a thing in papers for at least ~10 years, but it's hard to plan: the search space (already large) explodes, and it's always tricky to estimate how many rows come out of a given grouping. I have no idea how Postgres deals with these. Hopefully, they're doing something good (enough) :-)
Next up would hopefully be groupjoin, where grouping and hash join are combined into one operation when they are on the same or compatible keys (which happens surprisingly often).
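For example (invented schema), in a query like this the grouping key and the hash-join key coincide, so in principle one operator could build the hash table and aggregate into it as it probes:

    -- Join key (o.customer_id = c.id) and grouping key (c.id) are the same.
    SELECT c.id, c.name, sum(o.amount)
    FROM customer c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name;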
Probably quite a lot; being a specialist in multiple domains is getting more difficult.
The key idea here seems to be that if you're grouping on a column of a related table, you can do your main aggregation by grouping on the foreign key id on the primary table, using it as a proxy for the related-table data you're actually grouping by.
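Roughly (hypothetical schema, loosely modeled on the post's colour example), instead of joining first and grouping on the dimension column:

    SELECT c.color, count(*)
    FROM product p
    JOIN color c ON p.color_id = c.id
    GROUP BY c.color;

the idea is to aggregate on the foreign key first and join the tiny result:

    -- Group on product.color_id (local to the big table), then join.
    SELECT c.color, sum(g.cnt)
    FROM (SELECT color_id, count(*) AS cnt
          FROM product
          GROUP BY color_id) g
    JOIN color c ON g.color_id = c.id
    GROUP BY c.color;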
In the examples given, it’s much faster, but is that mostly due to the missing indexes? I’d have thought that an optimal approach in the colour example would be to look at the product.color_id index, get the counts directly from there and you’re pretty much done.
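Something like this, assuming an index on the foreign key (name invented), could in principle be answered by an index-only scan, without touching the heap except for visibility checks:

    CREATE INDEX product_color_id_idx ON product (color_id);
    -- Counts per color_id can come straight from the index.
    SELECT color_id, count(*) FROM product GROUP BY color_id;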
I have a feeling that Postgres doesn’t make that optimisation (I’ve looked before, but it was older Postgres). And I guess depending on the aggregation maybe it’s not useful in the general case. Maybe in this new world it _can_ make that optimisation?
Anyway, as ever, pg just getting faster is always good.
> In the examples given, it’s much faster, but is that mostly due to the missing indexes? I’d have thought that an optimal approach in the colour example would be to look at the product.color_id index, get the counts directly from there and you’re pretty much done.
So I tried to test this (my intuition being that indexes wouldn't change much; at best you'd do an index scan instead of a seq scan), and I couldn't understand the plans I was getting, until I realized that the query in the blog post has a small error:
> AND c1.category_id = c1.category_id
should really be
> AND p.category_id = c1.category_id
otherwise we're doing a cross-product on the category. Probably doesn't really change much, but still a bit of an oopsie. Anyway, even with the right join condition, an index reduces execution time by only about 20% in my tests, via an index scan.
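For reference, a minimal reconstruction of the corrected join (the aliases come from the quoted fragments; the rest is my guess at the post's schema):

    -- Only the join condition is taken from the post; everything else is assumed.
    SELECT c1.name, count(*)
    FROM product p
    JOIN category c1 ON p.category_id = c1.category_id
    GROUP BY c1.name;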
Neat, I can see how this could prevent a lot of frustration when it comes to keeping certain queries quick as complexity grows. I wonder if this means I can forget the trick of using LATERAL queries all over the place for performance reasons.
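For context, the trick in question (sketch with invented names) is to pre-aggregate per outer row in a LATERAL subquery, so the aggregate never has to run over the full join:

    SELECT c.name, o.total
    FROM customer c,
         LATERAL (SELECT sum(amount) AS total
                  FROM orders
                  WHERE customer_id = c.id) o;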