Hashed sorting is typically faster than hash tables

reiner.org

215 points by Bogdanp 5 days ago


tialaramex - 2 days ago

Though big enough to be worthwhile at scale, it's notable how relatively small the difference is between the out-of-the-box Rust unstable sort and the tuned radix sort.

Lukas Bergdoll and Orson Peters did some really important heavy lifting to get this stuff from "If you're an expert you could probably tune a general purpose sort to have these nice properties" to "Rust's general purpose sort has these properties out of the box" and that's going to make a real difference to a lot of software that would never get hand tuned.

markisus - 2 days ago

Interesting article. It’s actually very strange that the dataset needs to be “big” for the O(n log n) algorithm to beat the O(n) one. Usually you’d expect the big-O analysis to be “wrong” for small datasets.

I expect that in this case, like in all cases, as the datasets become galactically large, the O(n) algorithm will start winning again.

dooglius - 2 days ago

For the sparse-matrix-multiplication example given, wouldn't it be better to pre-sort each vector and do n^2 walks on pairs of presorted vectors, rather than (as I think was implied) doing n^2 hashed sorting operations? (For that matter, I wonder why sparse matrices wouldn't already be represented in sorted-adjacency-list form in the first place)

EDIT: ah no, I'm being dense. You'd aggregate the union of all the set-column indices across rows and the union of the set-row indices across columns, keeping track of the source locations, and do the hashed sorting on those union vectors to find all the collision points. You could still get a small win, I think, by sorting the row aggregation and column aggregation separately though?

gus_massa - 2 days ago

> Hash tables win the interview O(n) vs O(n log n),

If the original table has n entries, the hash must have at least log(n) bits so that distinct entries get different hashes most of the time. I don't know the current state of the art, but I think the recommendation was to keep the array at most ~95% full, so there are not too many collisions. So both methods are O(n log n), one under the hood and the other explicitly.
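
To put rough numbers on it (an untested Rust sketch, purely illustrative):

    // Minimum hash width: below ceil(log2(n)) bits, distinct entries
    // cannot even in principle receive distinct hashes.
    fn min_hash_bits(n: u64) -> u32 {
        assert!(n >= 2);
        64 - (n - 1).leading_zeros()
    }

    // Table size needed to stay at or below a given load factor (e.g. 0.95),
    // which is what keeps the number of collisions manageable.
    fn min_table_slots(n: u64, max_load: f64) -> u64 {
        (n as f64 / max_load).ceil() as u64
    }

    fn main() {
        let n = 1u64 << 30; // ~1e9 entries
        println!("{} hash bits, {} slots", min_hash_bits(n), min_table_slots(n, 0.95));
    }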

DennisL123 - 2 days ago

A way to give radix sort a performance boost is to allocate an uninitialized block of memory of size n for each bucket. It exploits the MMU, and you don't have to worry about buckets potentially overwriting each other. The MMU does all the bookkeeping for you, and the OS only allocates physical memory on page access.
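
A minimal Rust sketch of that layout (untested; it assumes the allocator hands large reservations straight to the OS and that pages are committed on first touch, which is typical on Linux but not guaranteed):

    // One worst-case-sized spill area per radix bucket. Only virtual address
    // space is reserved up front; physical pages appear when a bucket is
    // actually written to, so sizing every bucket to n costs little.
    fn make_buckets(n: usize) -> Vec<Vec<u64>> {
        const RADIX: usize = 256;
        (0..RADIX).map(|_| Vec::with_capacity(n)).collect()
    }

    fn scatter(input: &[u64], buckets: &mut [Vec<u64>]) {
        for &x in input {
            // Lowest byte as the radix digit for this pass.
            buckets[(x & 0xFF) as usize].push(x);
        }
    }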

tomp - 2 days ago

Very interesting and cool article, if you love low-level optimisation (like myself!)

Interestingly, I've recently been thinking that Big-O notation is essentially a scam, in particular the log(N) part.

For small values of N, log(N) is essentially a constant, <= 32, so we can just disregard it, making sorting simply O(N).

For large values, even so-called linear algorithms (e.g. linear search) are actually O(N log(N)), as the storage requirement for a single element grows with log(N) (i.e. to store N=2^32 distinct elements you need N log(N) = 2^32 * 32 bits, but to store N=2^64 distinct elements you need 2^64 * 64 bits).

Cache locality considerations make this effect even more pronounced.

tmyklebu - 2 days ago

Seems like the author is missing a trick?

You can use a simple hash table without probing to classify each number as either (1) the first instance of a number in the table, (2) a known duplicate of a number in the table, or (3) one that didn't fit in the table.

If you use a hash table with n buckets and one item per bucket and the input is random, then at least 63% of the distinct numbers in the input will fall into cases 1 or 2 in expectation. Increasing table size or bucket size improves this part of the math.

On really big inputs, it's probably still healthy to do a pass or two of radix sort so that the hash table accesses are cheap.
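
A sketch of that classification in Rust (untested, my reading of the scheme; the multiplier is just an arbitrary mixing constant):

    // One value per slot, no probing. Values that collide with a different
    // value are set aside; everything else is classified on the spot.
    fn classify(input: &[u64]) -> (u64, Vec<u64>) {
        let slots = input.len().next_power_of_two();
        let mask = slots as u64 - 1;
        let mut table: Vec<Option<u64>> = vec![None; slots];
        let mut distinct = 0u64;        // case 1: first instances seen
        let mut leftovers = Vec::new(); // case 3: didn't fit

        for &x in input {
            let h = (x.wrapping_mul(0x9E37_79B9_7F4A_7C15) & mask) as usize;
            match table[h] {
                None => { table[h] = Some(x); distinct += 1; } // case 1
                Some(y) if y == x => {}                        // case 2: known duplicate
                Some(_) => leftovers.push(x),                  // case 3
            }
        }
        // Leftover values never overlap the stored ones, but may repeat among
        // themselves, so they still need their own pass (e.g. a radix sort).
        (distinct, leftovers)
    }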

Voultapher - 2 days ago

Interesting article, I particularly like the reference to real-world workloads.

Do I understand correctly that the data being tested is fully random? If so, I'd be curious to see how the results change if a Zipfian distribution is used. I don't have hard data for sorting specifically, but I suspect the majority of real-world data being sorted - that isn't low cardinality or already nearly or fully sorted - follows a Zipfian distribution rather than true randomness. The Rust std lib sort_unstable efficiently filters out common values using the pdqsort repeat-pivot flip-partition technique.

gchamonlive - 2 days ago

It's nice that we can write relatively performant code with the batteries included in these languages and no hand tuning. I only wish it were as easy as that to write code that is secure by default.

scrubs - 4 days ago

Great article. Informative and well written.

smallpipe - 2 days ago

Could you reduce the BW requirement of the hash table by using an atomic increment instruction instead of a read-modify-write? It still performs a read-modify-write, but further down in the hierarchy, where more parallelism is available.
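
In Rust the counting loop would look roughly like this (a sketch; on most current CPUs fetch_add still pulls the line into the core, so the bandwidth win needs hardware that can execute the increment nearer to memory, e.g. far/remote atomics):

    use std::sync::atomic::{AtomicU32, Ordering};

    // Per-bucket counters bumped with an atomic RMW instead of an explicit
    // load/add/store sequence. `counts.len()` is assumed to be a power of two.
    fn count_hashes(keys: &[u64], counts: &[AtomicU32]) {
        let mask = counts.len() as u64 - 1;
        for &k in keys {
            let h = (k.wrapping_mul(0x9E37_79B9_7F4A_7C15) & mask) as usize;
            counts[h].fetch_add(1, Ordering::Relaxed);
        }
    }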

zahlman - 2 days ago

> First, extremely sparse unstructured matrix multiplication with e.g. sparsity of one in a billion and one scalar per nonzero.

Why does this have an equivalent inner loop, and why is it an important task?

nasretdinov - 2 days ago

I imagine it must make a bigger difference in languages where allocations are more costly, too. E.g. in Go, sorting to count uniques is most certainly much faster than using a map, and it saves memory too.
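
For reference, the two approaches side by side in Rust (the article's language; a Go version would be analogous, just with different allocation behaviour):

    use std::collections::HashSet;

    // Count distinct values by sorting: O(n log n) comparisons, but a single
    // flat buffer and cache-friendly access.
    fn count_unique_by_sort(mut v: Vec<u64>) -> usize {
        v.sort_unstable();
        v.dedup();
        v.len()
    }

    // Count distinct values with a hash set: O(n) expected, but random memory
    // access plus rehashing and reallocation as the table grows.
    fn count_unique_by_hash(v: &[u64]) -> usize {
        v.iter().copied().collect::<HashSet<u64>>().len()
    }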

kazinator - a day ago

This sounds like a spherical cow problem.

The one time in your career you run into it, the next day your boss will add the requirement that entries are going to be inserted and deleted all the time, and your sorting approach is fucked.

If the entries can change all the time, we can use two hash tables, U and D. U maintains the set of unique items at all times, D maintains duplicates. An item is never in both at the same time. In D, it is associated with a count that is at least 2.

A new item is inserted into U. The first duplicate insertion removes the item from U and adds it into D with a count of 2. Subsequent insertions increment the count for that item in D.

A deletion first tries to decrement a count in D; when that drops below 2, the item is removed from D and put back into U, since one copy still remains and it is unique again. If it's not in D, it is removed from U.

At any time we can walk through U to visit all the unique items, without having to wade through the non-unique ones.

That has implications for complexity. For instance, suppose that for whatever reason the number of unique items is bounded by sqrt(N). Then iterating over them is O(sqrt(N)), whereas if we just had one hash table, it would be O(N) to iterate over all items and skip the non-unique ones.
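
Sketched in Rust (untested), the whole scheme is just the two tables and a few branches:

    use std::collections::{HashMap, HashSet};

    struct UniqueTracker {
        u: HashSet<u64>,      // items currently present exactly once
        d: HashMap<u64, u64>, // items present >= 2 times, with their count
    }

    impl UniqueTracker {
        fn new() -> Self {
            Self { u: HashSet::new(), d: HashMap::new() }
        }

        fn insert(&mut self, x: u64) {
            if let Some(c) = self.d.get_mut(&x) {
                *c += 1;             // already a known duplicate
            } else if self.u.remove(&x) {
                self.d.insert(x, 2); // first duplicate: move U -> D
            } else {
                self.u.insert(x);    // brand new item
            }
        }

        fn delete(&mut self, x: u64) {
            if let Some(c) = self.d.get_mut(&x) {
                *c -= 1;
                if *c < 2 {
                    self.d.remove(&x);
                    self.u.insert(x); // back down to a single copy: unique again
                }
            } else {
                self.u.remove(&x);
            }
        }

        fn uniques(&self) -> impl Iterator<Item = &u64> + '_ {
            self.u.iter()
        }
    }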

pklausler - 2 days ago

The author discusses finding “unique” values, but I think they meant “distinct”.

o11c - 2 days ago

Note that "fixes bad distributions" is the same thing as "causes bad distributions" if your input is untrusted (or even if you're just unlucky). Especially when you are deliberately choosing a bijective function and it is trivial for an attacker to generate "unfortunate" hashes.

The usual trie tricks can avoid this problem without letting the worst case happen. But, as is often the case, adding the extra logic can mean worse performance for non-worst-case input.

tgv - 2 days ago

Isn't radix sorting somewhat like hashing?

bluecalm - 2 days ago

While the main conclusion of the article isn't unexpected to me, I am very surprised a well-optimized quicksort isn't faster than radix sort. My experience in C is that you can significantly speed up quicksort by removing function calls and finishing small arrays with insertion sort. I was never able to get radix sort even close to that performance-wise. I don't know Rust at all, so it would be pointless for me to try to code an optimized quicksort in it. Hopefully the author knows how to optimize quicksort and is not missing out on huge gains there.

Just in case someone feels like porting it to Rust, here is a link to a good and simple quicksort implementation that significantly outperforms the standard library one: https://www.ucw.cz/libucw/doc/ucw/sort.html
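
As a starting point, the two optimisations mentioned above look roughly like this in Rust (not a port of the libucw code; equal-key handling and worst-case protection, which pdqsort adds, are omitted):

    // Small sub-arrays are finished with insertion sort; the hot loop compares
    // machine integers directly with no comparator callback, which is the Rust
    // analogue of what libucw's macro expansion buys in C.
    const CUTOFF: usize = 24; // rough threshold, tune per element type and CPU

    fn insertion_sort(v: &mut [u64]) {
        for i in 1..v.len() {
            let mut j = i;
            while j > 0 && v[j - 1] > v[j] {
                v.swap(j - 1, j);
                j -= 1;
            }
        }
    }

    fn quicksort(v: &mut [u64]) {
        if v.len() <= CUTOFF {
            insertion_sort(v);
            return;
        }
        // Median-of-three pivot, parked at the last index (Lomuto partition).
        let (mid, last) = (v.len() / 2, v.len() - 1);
        if v[0] > v[mid] { v.swap(0, mid); }
        if v[mid] > v[last] { v.swap(mid, last); }
        if v[0] > v[mid] { v.swap(0, mid); }
        v.swap(mid, last);
        let pivot = v[last];

        let mut store = 0;
        for i in 0..last {
            if v[i] < pivot {
                v.swap(i, store);
                store += 1;
            }
        }
        v.swap(store, last); // pivot into its final position

        let (lo, rest) = v.split_at_mut(store);
        quicksort(lo);
        quicksort(&mut rest[1..]); // rest[0] is the pivot, already placed
    }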
