Optimistic Locking in B-Trees

cedardb.com

184 points by uds5501 3 months ago


ianbutler - 3 months ago

Ah, optimistic locking. I implemented that on top of a radix tree to make a concurrent, disk-backed adaptive radix tree for text search in Rust.

I was looking at a blog post talking about the early days of Algolia’s search engine when I decided I wanted to try my own implementation of their algorithm.

https://github.com/iantbutler01/dart

dart = disk-backed adaptive radix tree

The idea is that hot paths stay in memory and cold paths get shunted to disk; since text search tends to have a pretty regular set of queries people make, you get a really nice trade-off between speed and storage.
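
As a rough illustration of that layout (hypothetical types, not the actual dart code), each child edge is either a resident node or just an offset into the backing file, so lookups on hot paths never touch disk:

    use std::collections::BTreeMap;

    // Hypothetical node layout: a child edge is either resident in memory
    // ("hot") or only known by its offset in the backing file ("cold").
    enum Child {
        Hot(Box<Node>),
        Cold { file_offset: u64 },
    }

    struct Node {
        children: BTreeMap<u8, Child>, // one edge per next key byte (simplified)
    }

    // Descend one level, faulting the child in from disk only when it is cold.
    fn descend(node: &mut Node, byte: u8, load: impl Fn(u64) -> Node) -> Option<&mut Node> {
        let child = node.children.get_mut(&byte)?;
        if let Child::Cold { file_offset } = *child {
            // Promote the cold child to hot by loading it from disk.
            *child = Child::Hot(Box::new(load(file_offset)));
        }
        match child {
            Child::Hot(n) => Some(n.as_mut()),
            Child::Cold { .. } => unreachable!(),
        }
    }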

cryptonector - 3 months ago

  > B-Trees won’t get obsolete
  > 
  > <reasons>
Among the many reasons that B-trees aren't going away is that they are prefix indices. This means that an index on multiple columns (or on one column whose values can have prefixes, like strings) can be searched with just a prefix of the key, which enables skip-scan-type optimizations and makes the index applicable to many more queries than a hash index. It also means that you can have covering indices as the actual table, where the primary key columns are the prefix for the rest -- look ma'! no rowids!
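
To make the prefix point concrete, here's a small illustration (made-up keys, not tied to any particular database): an ordered index over a composite key can answer a query that only constrains the leading column with a cheap range scan, which a hash index cannot.

    use std::collections::BTreeMap;
    use std::ops::Bound;

    fn main() {
        // Composite key (last_name, first_name) in one ordered index, the way
        // a multi-column B-tree index orders its entries.
        let mut idx: BTreeMap<(String, String), u64> = BTreeMap::new();
        idx.insert(("smith".into(), "alice".into()), 1);
        idx.insert(("smith".into(), "bob".into()), 2);
        idx.insert(("stone".into(), "carol".into()), 3);

        // "WHERE last_name = 'smith'" only constrains the key prefix, but it
        // is still a range scan over the ordered index ...
        let lo = ("smith".to_string(), String::new());
        for ((last, first), row) in idx.range((Bound::Included(lo), Bound::Unbounded)) {
            if last != "smith" {
                break; // left the prefix range
            }
            println!("{last}, {first} -> row {row}");
        }
        // ... whereas a hash index on (last_name, first_name) could only answer
        // lookups that supply both columns exactly.
    }
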
msanlop - 3 months ago

Nice! I actually worked on this recently for my bachelor's project. We got some promising preliminary results that showed performance gains on B-trees by using "delegation" on top of optimistic lock coupling.

The way delegation locks work is by having threads delegate their critical sections instead of getting exclusive access to shared memory. Think of a client/server model, where the server executes other cores' critical sections. For machines with very high core counts, the result is that the shared memory remains valid in the delegate's cache, instead of each new lock holder getting cache misses and fetching memory from other cores (or worse, other sockets).
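
A very rough sketch of the delegation idea (not the project's implementation): instead of each thread acquiring the lock, threads send their critical sections as closures to one server thread that owns the shared state, so the state stays hot in that thread's cache.

    use std::sync::mpsc;
    use std::thread;

    // A delegated critical section: a closure over the protected state.
    type Request = Box<dyn FnOnce(&mut Vec<u64>) + Send>;

    fn main() {
        let (tx, rx) = mpsc::channel::<Request>();

        // The "server" thread owns the shared state and runs everyone's critical
        // sections, so the data never migrates between core caches the way it
        // does when each new lock holder pulls it into its own cache.
        let server = thread::spawn(move || {
            let mut shared_state: Vec<u64> = Vec::new();
            for critical_section in rx {
                critical_section(&mut shared_state);
            }
            shared_state.len()
        });

        // "Client" threads delegate their critical sections instead of locking.
        let clients: Vec<_> = (0..4u64)
            .map(|i| {
                let tx = tx.clone();
                thread::spawn(move || {
                    let req: Request = Box::new(move |state: &mut Vec<u64>| state.push(i));
                    tx.send(req).unwrap();
                })
            })
            .collect();

        for c in clients {
            c.join().unwrap();
        }
        drop(tx); // closing the channel lets the server thread exit its loop
        println!("server ran {} delegated sections", server.join().unwrap());
    }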

foota - 3 months ago

Funny timing, I've been looking at this recently. Some other well-performing concurrent ordered data structures are ART (adaptive radix trees) and masstree. Skiplists are common since they're easier to implement lock-free, but they don't seem to perform well relative to these alternatives.

jrockway - 3 months ago

Something I always wonder about is figuring out the optimal page size. The author seems to use 64KB. This is neither the size of a CPU cache line (often 64 bytes) nor the size of a block on an SSD (often 512KB). So why 64KB?

agent281 - 3 months ago

I think there are a couple of bugs in the pseudo-code:

traverse_node(page, version, key):
retry:
    # Read the page
    pageCopy = copy(page)

    # BUG1: version is an argument so it likely won't be correct on retry
    if version != atomic_load(page.version):
        goto retry;
    
    # Go to the next page
    childPtr = binary_search(pageCopy, key)
    nextPage = load(childPtr)
    nextVersion = atomic_load(nextPage.version)

    # BUG2: seems like we should be using nextVersion instead of version
    # Validate that no writer overtook the reader
    if version != atomic_load(page.version):
        goto retry;
    
    # Safely traverse to the next node
    page = nextPage
Or am I misunderstanding something here?
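
For reference, here's how the optimistic lock coupling papers usually structure a single optimistic read (a sketch of the general technique, not the article's code): the version snapshot is taken fresh at the top of every attempt, which is what BUG1 is getting at.

    use std::sync::atomic::{AtomicU64, Ordering};

    struct Page {
        version: AtomicU64, // bumped by writers on every modification
        // keys and child pointers elided
    }

    // One optimistic read of `page`, retried until no writer interleaved. The
    // page contents would have to be read with race-tolerant loads; this
    // sketch only shows the version protocol.
    fn optimistic_read<T>(page: &Page, read_contents: impl Fn(&Page) -> T) -> T {
        loop {
            let before = page.version.load(Ordering::Acquire); // fresh on every retry
            let value = read_contents(page);
            let after = page.version.load(Ordering::Acquire);
            if before == after {
                return value; // no writer overtook the reader
            }
            // A writer bumped the version mid-read: discard the result and retry.
        }
    }
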
koverstreet - 3 months ago

kids stuff :)

bcachefs uses shared/intent/exclusive locks on nodes, so we only have to take write locks during transaction commit when we're doing the final update, and percpu read locks for interior nodes (and others).

this means that there's basically zero lock contention on interior nodes, or even cacheline bouncing. lock contention only comes up when you've got different keys in the same leaf nodes.
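
for anyone unfamiliar with intent locks, the usual shared/intent/exclusive compatibility looks roughly like this (my sketch of the general technique, not bcachefs's actual code): intent holders admit readers but exclude each other, so an updater can hold intent on the path it walked and only upgrade to exclusive at commit.

    #[derive(Clone, Copy)]
    enum Mode {
        Shared,    // readers
        Intent,    // "will probably write later"; still admits readers
        Exclusive, // actual modification
    }

    // Two intent holders would both expect to upgrade, so they exclude each
    // other, but neither blocks ordinary readers.
    fn compatible(held: Mode, requested: Mode) -> bool {
        use Mode::*;
        matches!(
            (held, requested),
            (Shared, Shared) | (Shared, Intent) | (Intent, Shared)
        )
    }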

emmanueloga_ - 3 months ago

I wonder about the angle of the article, starting with "B-Trees stand the test of time" and ending with "everything else has seriously diminishing returns".

I never ask myself why we still use hashmaps or heaps or whatnot, so it makes me wonder if this is really an article about why Cedar does not use something else (LSM-trees being the elephant in the room?).

NightMKoder - 3 months ago

Talking of trees and caches, back in school I remember learning about splay trees. I’ve never actually seen one used in a production system though, I assume because of the poor concurrency. Has anyone heard of any systems with tree rebalancing based on the workload (i.e. reads too, not just writes)?

uds5501 - 3 months ago

[not-author] but found it really fascinating. My open questions:

- Is the version monotonically increasing though?
- How would one handle wraparound once the versions don't fit the datatype?
- Who decides the version to be assigned to a thread's attempt?
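
For context, a common scheme in the optimistic lock coupling literature (not necessarily what CedarDB does) packs a lock bit and an update counter into one atomic word per node. Readers aren't assigned a version at all; they just snapshot the node's current word and re-check it later, and wraparound is handled with plain wrapping arithmetic, since a reader would only be fooled if the counter wrapped all the way around between its two checks.

    use std::sync::atomic::{AtomicU64, Ordering};

    // One word per node: bit 0 = "a writer holds the lock", bits 1..64 = update count.
    struct VersionLock(AtomicU64);

    impl VersionLock {
        const LOCKED: u64 = 1;

        // Reader: snapshot for later validation, or None if a writer holds the node.
        fn read_begin(&self) -> Option<u64> {
            let v = self.0.load(Ordering::Acquire);
            if v & Self::LOCKED == 0 { Some(v) } else { None }
        }

        // Reader: true if no writer ran since `read_begin`.
        fn read_validate(&self, snapshot: u64) -> bool {
            self.0.load(Ordering::Acquire) == snapshot
        }

        // Writer: clear the lock bit and bump the counter. `wrapping_add` keeps
        // this well-defined if the counter ever overflows.
        fn write_unlock(&self) {
            let v = self.0.load(Ordering::Relaxed);
            self.0.store((v & !Self::LOCKED).wrapping_add(2), Ordering::Release);
        }
    }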

vander_elst - 3 months ago

A bit of a tangential question. Does anyone have any experience with cedars?

kikimora - 3 months ago

How do you implement a concurrency-safe copy operation? Especially if the data structure you copy is 64 KB.
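
My understanding of the optimistic answer is that the copy isn't made safe so much as made checkable: copy the bytes, re-read the version, and throw the copy away if a writer interleaved. A rough sketch (glossing over the fact that real code must read the page with race-tolerant loads, e.g. per-word atomics, rather than a plain memcpy):

    use std::sync::atomic::{AtomicU64, AtomicU8, Ordering};

    const PAGE_SIZE: usize = 64 * 1024;

    struct Page {
        version: AtomicU64,          // odd = a writer holds the page; bumped on unlock
        data: [AtomicU8; PAGE_SIZE], // page payload, readable concurrently with writers
    }

    // Optimistically copy a 64 KB page: the copy may observe a torn state, but
    // the version re-check detects that and we simply retry with a fresh copy.
    fn copy_page(page: &Page) -> Box<[u8; PAGE_SIZE]> {
        loop {
            let before = page.version.load(Ordering::Acquire);
            if before & 1 == 1 {
                continue; // a writer holds the page right now; try again
            }
            let mut copy = Box::new([0u8; PAGE_SIZE]);
            for (dst, src) in copy.iter_mut().zip(page.data.iter()) {
                *dst = src.load(Ordering::Relaxed);
            }
            if page.version.load(Ordering::Acquire) == before {
                return copy; // no writer interleaved: the copy is consistent
            }
            // Version changed mid-copy: the copy may be torn, discard and retry.
        }
    }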