Branchless Quicksort faster than std:sort and pdqsort with C and C++ API

tiki.li

176 points by birdculture 3 days ago


orlp - 11 hours ago

Since pdqsort (an older project of mine) was mentioned, I felt it wouldn't be entirely inappropriate to mention that I've since then collaborated with Lukas Bergdoll to provide two high-quality sort implementations for the Rust standard library, ipnsort (unstable) and driftsort (stable).

So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)

On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:

    Sorting 50 million doubles:
    ipnsort             0.79s
    blqs                0.90s
    driftsort           1.13s   (stable)
    std::sort           1.22s
    std::stable_sort    4.64s   (stable)

    Sorting 50 million (i32, i32) structs:
    ipnsort             0.82s
    blqs                0.89s
    driftsort           1.07s   (stable)
    std::sort           3.09s
    std::stable_sort    3.15s   (stable)

And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random:

    driftsort           0.29s   (stable)
    ipnsort             0.81s
    std::sort           1.15s
    std::stable_sort    1.63s   (stable)
    blqs                1.89s
kleiba2 - 4 hours ago

    for (int i = 0; i < 1000; i++) {
        small_numbers[smlen] = numbers[i];
        smlen += (numbers[i] < 500);
    }
 
is much faster than the conventional version with a conditional branch:

    for (int i = 0; i < 1000; i++) {
        if (numbers[i] < 500) {
            small_numbers[smlen] = numbers[i];
            smlen += 1;
        }
    }
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?
quuxplusone - 8 hours ago

It's unfortunate that the C++ version of the code assumes the type T is default-constructible (and that the default constructor is cheap). It also assumes that the type T is copy-constructible; at a glance I can't tell if the algorithm depends on making a copy in every place that it does make a copy. E.g. in the `heap_sort` helper we have

    T k;                       // default-construct
    if (i > 0) k = left[--i];  // copy-assign
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know. Again, in `partition_small`, we have

    T swbuf[SMALLPART];
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.

All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.

None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".

teo_zero - 4 hours ago

Nitpicking the C variant:

> #define BLQS_CMP(a, b) ((a) < (b))

A function that returns true when one operand is Less Than the other, should be called BLQS_LT. The CMP abbreviation is idiomatic for a function that returns -1,0, or 1.

Aardwolf - an hour ago

On what datatype though, e.g. for sorting arbitrary length strings? I think that is if the comparator is expensive, quicksort and variants do not win because they do a constant factor more comparisons

mgaunard - 12 hours ago

Aren't there several bitonic sort network implementations that are vectorized, Intel's in particular?

Why not compare against that?

Tomte - 3 hours ago

I‘m always a bit envious when I see those branchless styles. In my day job I have the obligation to hit 100% modified condition/decision coverage, and I‘m daydreaming about having just one control flow through everything, in order to save module tests that only test the umpteenth condition combination.

Obviously, readable code wins, but at least once I had the computing time budget to be able to have a central function go straight through by calculating all five or so variations (it was about several kinds of encodings of the output values) and just pick the correct one in the end. That felt good.

bagxrvxpepzn - 3 hours ago

> On modern CPUs, avoiding branch misprediction is a key technique to speed up programs.

This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html

Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.

- 4 hours ago
[deleted]
davidkwast - 12 hours ago

It is so simple that I had to look very slowly to understand. Nicely done.

kvuj - 9 hours ago

>On modern CPUs, avoiding branch misprediction is a key technique to speed up programs. This branchless approach:

>

>for (int i = 0; i < 1000; i++) {

> small_numbers[smlen] = numbers[i];

> smlen += (numbers[i] < 500);

>}

Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?

globular-toast - 2 hours ago

Anyone interested in branch-free code might like the book Hacker's Delight. Lots of examples of stuff like this in there.

- 9 hours ago
[deleted]
dekdrop - 8 hours ago

If it's branch predicting, why would if statement run slow? How come unnecessary memory write is fine?