Optimistic Locking in B-Trees

184 points by uds5501 3 months ago

Ah optimistic locking. I implemented that on top of a radix tree to make a concurrent disk backed adaptive radix tree for text search in Rust.

I was looking at a blog post talking about the early days of Algolia’s search engine when I decided I wanted to try my own implementation of their algorithm.

https://github.com/iantbutler01/dart

dart = disk backed adaptive radix tree

The idea being hot paths stay in memory and cold paths get shunted to disk and since text search tends to have a pretty regular set of queries people make you get a really nice trade off for speed and storage.

cryptonector - 3 months ago

  > B-Trees won’t get obsolete
  > 
  > <reaons>

Among the many reasons that b-trees aren't going away is that they are prefix indices. This means that an index on multiple columns -or on one column the values of whose type can have prefixes, like strings- can be searched with just a prefix of the key, which enables skip-scan type optimizations, and allows the index to be applicable to many more queries than a hash index. It also means that you can have covering indices as the actual table where the primary key columns are the prefix for the rest -- look ma'! no rowids!

hinkley - 3 months ago

Software architecture is dominated by a few inequalities that tend to remain in the same order most of the time, get closer or farther away from each other, and only occasionally reorder themselves because some big breakthrough in one kind of hardware has no equivalence in another and things get out of whack.
When they get out of whack we try some really crazy algorithms that make sense when the inequalities invert, and then disappear when order is restored.
One of the first times networking got faster than local storage, John Ousterhout (of Raft fame) and his group at Berkeley introduced a distributed operating system called Sprite (which featured heavily in the Distributed Computing class I took) that could not only run processes on different machines but could also do live process migration - in the 1980's. NICs getting faster than disk also brought us in-memory and disk-backed, in-memory KV stores, which haven't really gone away even with the introduction of NVMe storage.
B-trees fit well with the dominant equation, so they never really go away.
- twotwotwo - 3 months ago
  
  > in-memory and disk-backed, in-memory KV stores, which haven't really gone away even with the introduction of NVMe storage
  If we're talking caches and Redis, one thought about why they haven't gone away--
  In practice I think sharing is part of what keeps them network services, not only access to a cluster's RAM. As a dev you reach for something shared and 'shaped' like a cache, and the in-RAM aspect may be vestigial but still comes with the package.
  We use Redis when e.g. tracking stats on active client IPs to see which are up to shenanigans, a maybe common use where you want cluster-wide stats. When caching, we may still want one DB query per expiry period app-wide, not one per webserver.
  We would probably be fine 'caching' some stuff in the DB; Django's cache package has supported it for a while! I'm sure one of the disk-using successors of an in-RAM store could hit perf targets (DragonflyDB looks cool!). Just not a priority to change stuff that's working. (Corollary: I bet some folks unhappy with their Redis RAM needs are more than happy to try Dragonfly.)
hot_gril - 3 months ago

Also if you have roughly increasing row IDs, theoretically a btree is faster to insert into at a large scale, even if you don't care for prefix lookups or inequalities. And I think faster to read from.
I'm not sure if that's why, but in practice, Postgres doesn't care too much for hash indexes. It technically has them, but they're non-default, not recommended, previously poorly supported, etc. The default is btree (and serial pk).
- cryptonector - 3 months ago
  
  But tables in PG are heaps, so appending to them is even faster than in b-trees.
  - dspillett - 3 months ago
    
    This is specifically taking about indexed structures though, so inserting data in a way that means it can be efficiently searched afterwards. Comparing heaps to b-trees, hash tables, and other ones structures in this context isn't really a Granny Smith Vs Golden Delicious style comparison.
    Many (though not all) databases allow the base data to be a heap for those situations where a pure heap or a heap with small supporting indexes is more efficient (very wide data for instance, particularly in intermediate structures for ETL processes (a longer process acronym might better describe this, say, ELTEL)).
    
    cryptonector - 3 months ago
    
    A SQL table that has a `PRIMARY KEY` is (or can be) an index. Indeed, in SQLite3 if you use the `WITHOUT ROWID` feature you get exactly that: a covering b-tree index where the `PRIMARY KEY` columns are a prefix of the index key.
    Also in SQLite3 tables w/o `WITHOUT ROWID` are still b-tree indices, but they are indexed by the row ID, or the `INTEGER PRIMARY KEY` column (which if there is one, is the same thing as a row ID).
    I.e., tables can be indices, and therefore covering indices can be tables.
    If you have a table that has no `PRIMARY KEY` then a heap makes some sense, but then again so does a b-tree indexing by row ID. But then again a table without a `PRIMARY KEY` actually makes little sense in a world of relational algebra where tables are sets (as they are supposed to be in SQL) because a table without a `PRIMARY KEY` implies allowing duplicate rows, and then you need a row ID to deal with those... A table without a `PRIMARY KEY` is always a table that has an implied `PRIMARY KEY` that is just a row ID.
    
    dspillett - 3 months ago
    
    > A table without a `PRIMARY KEY` is always a table that has an implied `PRIMARY KEY` that is just a row ID.
    While it is true that there will be a surrogate key (the row ID), that doesn't imply there is any sort of index as there is in SQLite.
    In SQL Server a heap table is not represented by a tree indexed by rowID, there will be one or more IAM (Index Allocation Map) pages which just list the pages used by that file. These are not even ordered lists. The internal RowID is actually a 8-byte encoding of the data's location in the files (two bytes for file, four bytes for page, two bytes for slot) for instance “1:560:0” which is file 1, page 560, first slot. This is what is stored in the IAM page and what any indexes point to for finding data that is not part of the index (i.e. not part of the index key and also not “included”). That isn't even necessarily where the data is, just where it was first put: if the row is updated in a way that it grows beyond what will fit in the page it shares with other rows, it goes elsewhere and a forwarding record in the original slot will point to the new location (this can become a long list of jumps to find the actual data, if a row grows many times in its existence, and you are unlucky wrt what free space is available on the current page each time it does).
    Yes, this is more than a bit nasty for most uses. That is why most (almost all) tables in SQL Server should have a clustered index (at which point it is no longer a heap). Heaps have their uses where they are better than tables with clustered indexes, but those are few and far between (off the top of my head, some intermediate tables for ETL processes is all I can think of, technically very small (one-or-few page) tables too, but any measured efficiency difference there is so small either way that it is, or may as well be, statistical noise).
    Having a primary key does not stop a heap being a head (in SQL Server) as the primary key does not have to be a clustering key. It usually is, but there are times when another key is a better choice and there can only be one clustered key (ignoring the wide index hack that makes things more-or-less behave like there are multiple clustering keys, at the expense of extra storage space and much slower insert/update operations).
    
    cryptonector - 3 months ago
    
    > While it is true that there will be a surrogate key (the row ID), that doesn't imply there is any sort of index as there is in SQLite.
    In PG heaps it does, though it's not stable across `VACUUM`. From what you say the same is true of SQL Server. The point being that two duplicate rows in a table w/o a `PRIMARY KEY` will have different "row IDs" or "ctids" or whatever even though that row ID or ctid will not be stable across vacuuming though it should be stable across a single query.
    (In PG all tables are heaps, always. You can have indices, and even covering indices, and your read queries might never hit the heap, but a heap will be there. IMO PG ought to support a b-tree instead of a heap for tables with `PRIMARY KEY`s -- it would be an improvement, and a performance optimization.)
    
    hot_gril - 3 months ago
    
    It was unclear of me to say "row IDs." I meant primary keys, not something like ctid https://www.postgresql.org/docs/current/ddl-system-columns.h...
    
    cryptonector - 3 months ago
    
    If you declare a table that has no `PRIMARY KEY` (and no `UNIQUE` constraints on `NOT NULL` columns) you're essentially allowing duplicate rows. Either the RDBMS treats such a table as having something of a primary key on all columns (but NULLs sneak duplicate rows back in!), or the RDBMS gives you a row ID like concept (ctid counts) to truly disambiguate duplicate rows.
  - avinassh - 3 months ago
    
    > But tables in PG are heaps, so appending to them is even faster than in b-trees.
    elaborate? how it is different, than say, MySQL?
    
    cryptonector - 3 months ago
    
    In SQLite3 tables are b-trees.

msanlop - 3 months ago

Nice! I actually worked on this recently for my bachelor project. We got some promising preliminary results that showed performance gains on B-trees by using "delegation" on top of optimistic lock coupling.

The way delegation locks work is by having threads delegate their critical function instead of getting exclusive access to shared memory. Think of a client/server model, where the server executes other core's critical sections. For machines with very high core counts, the result is that the shared memory remains valid in the delegate's cache, instead of each new lock holder getting cache misses and fetching memory from other cores (or worse, other sockets).

senderista - 3 months ago

Sounds a bit like flat combining?
https://people.csail.mit.edu/shanir/publications/Flat%20Comb...
- msanlop - 3 months ago
  
  Indeed, generally delegation has become to mean that there is a dedicated server thread that executes critical sections. Where as combining indicates that any thread can become a combiner while acquiring a lock, for a certain period of time. For the project, we used queue-based combining, which seems to be similar to the the lock in the paper.

foota - 3 months ago

Funny timing, I've been looking at this recently. Some other well performing concurrent ordered data structures are ART (adaptive radix trees) and masstree. Skiplists are common since they're easier to implement lock free, but they don't seem to perform well relative to these alternatives.

judofyr - 3 months ago

You should check out Height-Oriented Tries as well. Their benchmarks shows that it outperforms both ART and masstree.
EDIT: Oops, yes, I meant Height-Optimized Trie! Sorry about that.
- foota - 3 months ago
  
  Will take a look, thanks! I think you meant height-optimized tries, since I had a tough time giving it at first :)
  Ideally I'd like to find a way to store a non overlapping ordered list of intervals while being able to atomically (wrt other insertions) avoid inserting overlaps. This is part of a data structure tracking locked ranges, so being able to update them in parallel would be nice, since it today takes a lock over the whole tree while inserting new locked ranges.
  Probably, this would require something with next value pointers. Maybe I could do some kind of scheme where to insert you have to first optimistically lock the predecessor with the to be inserted node, and then insert the new node where it belongs?
  - mad - 3 months ago
    
    You should also check out the Maple Tree (https://docs.kernel.org/core-api/maple_tree.html).
    It's the data structure used to track non-overlapping intervals in the Linux kernel's virtual memory subsystem.
    If you don't mind sharing, what's your use case for such a data structure?

jrockway - 3 months ago

Something I always wonder about is figuring out the optimal page size. The author seems to use 64KB. This is neither the size of a CPU cache line (often 64 bytes) nor the size of a block on an SSD (often 512KB). So why 64KB?

magnat - 3 months ago

If your data/leaf/heap pages hold indexed cells PostgreSQL-style [1], 16-bit in-page offset limits your page size to 64KB. On the other hand, physical memory frame size usually is 4KB or larger, so going below that is not really productive.
Having said that, both PostgreSQL and Microsoft SQL has 8KB page size, while original SQLite had 1KB pages, before switching to 4KB, while still supporting 64KB pages. On top of that, Microsoft SQL operates internally on extents of 8 pages (64 KB) instead of a single page, but still pretends to use 4 KB pages [2].
In other words - no idea.
[1] https://www.postgresql.org/docs/current/storage-page-layout....
[2] https://learn.microsoft.com/en-us/sql/relational-databases/p...
- msanlop - 3 months ago
  
  I tried looking for papers on the subject but couldn't find much. There recent research about in-memory B-tree, and there is has been shown that smaller pages (sizes as low as 256/512B) are more performant due to better cache behavior[0]. The general wisdom seems to be that disk-based databases' IO performance is the main bottleneck—but again I couldn't find any concrete data/benchmarks about why those higher sizes were chosen.
  [0] There's even some cool research about B-trees that have the layout depend on the probabilistic distribution of the lookups: https://dl.acm.org/doi/10.1145/3592980.3595316
- sroussey - 3 months ago
  
  The default macOS page size is currently 16KB, and various (Apple) stuff is optimized for it, and I think SQLite is among them (not 100%). I think their server optimized builds for internal use are 64.
- o11c - 3 months ago
  
  16-bit offsets doesn't limit you to 64K unless you require byte-alignment. 128K or 256K is usually trivial and even 512K or 1M is often viable.
tomnipotent - 3 months ago

Cedar uses a PAX storage layout (hybrid row/column) so I imagine 64KB optimizes well for OLAP workloads while keeping excess disk I/O for point look-ups manageable. Depending on your OLAP load that "excess" I/O could either be a rounding error or a bottleneck.
shinycode - 3 months ago

From the article : « The 64KB root node of our B-Tree fits nicely into the L1/L2 cache of a modern CPU »
Maybe that’s why ?
- jrockway - 3 months ago
  
  I'm just looking for more depth. Like, is that all loaded in 1 load operation when the page is touched and is therefore optimal? Does disk access / writing dirty pages not matter?
  - menaerus - 3 months ago
    
    No, 64K is definitely not loaded in 1 load operation. Maximum amount of load operations modern CPUs can do is 2x512-bit or 128 bytes per cycle or around ~600 GB of L1 bandwidth.
    That said, there is no universal page size one can choose from. Some workloads will benefit from smaller page size while others will benefit from larger page size but 512K is not the size you will want to choose. Read-, write-, space-amplification, CPU cache thrashing etc.
harrison_clarke - 3 months ago

a u16 can address it, which is nice. it's also the minimum virtual address allocation on windows. not sure if either of those are the reason, in this case
remexre - 3 months ago

64KiB also happens to be the largest minimum page size you can configure reasonably common hardware to have (not that this is a common configuration).

agent281 - 3 months ago

I think there are a couple of bugs in the pseudo-code:

traverse_node(page, version, key): retry: # Read the page pageCopy = copy(page)

    # BUG1: version is an argument so it likely won't be correct on retry
    if version != atomic_load(page.version):
        goto retry;
    
    # Go to the next page
    childPtr = binary_search(pageCopy, key)
    nextPage = load(childPtr)
    nextVersion = atomic_load(nextPage.version)

    # BUG2: seems like we should be using nextVersion instead of version
    # Validate that no writer overtook the reader
    if version != atomic_load(page.version):
        goto retry;
    
    # Safely traverse to the next node
    page = nextPage

Or am I misunderstanding something here?

koverstreet - 3 months ago

kids stuff :)

bcachefs uses shared/intent/exclusive locks on nodes, so we only have to take write locks during transaction commit when we're doing the final update, and percpu read locks for interior nodes (and others).

this means that there's basically zero lock contention on interior nodes, or even cacheline bouncing. lock contention only comes up when you've got different keys in the same leaf nodes.

foota - 3 months ago

I'm not familiar with bcachefs, so maybe I'm misunderstanding, but the b-tree discussed here also doesn't need to talk locks except for when data is written?
I'm curious about the per-CPU locks you mention though, do you mean that e.g., nodes are owned by certain CPUs and can only be written from those CPUs (and hence can get away with a CPU local lock?), or something else?
- koverstreet - 3 months ago
  
  > I'm not familiar with bcachefs, so maybe I'm misunderstanding, but the b-tree discussed here also doesn't need to talk locks except for when data is written?
  Updates to the b-tree in memory require locks too :)
  > I'm curious about the per-CPU locks you mention though, do you mean that e.g., nodes are owned by certain CPUs and can only be written from those CPUs (and hence can get away with a CPU local lock?), or something else?
  https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/...
  This is the code for btree node locks, as you can see there's a mode where the reader count is stored in a percpu counter. It makes it more expensive to take a write lock, but it means there's zero cacheline contention when just taking read locks (which is by far the common case for interior nodes!).
  - foota - 3 months ago
    
    Interesting, I'm not certain, but I think the same thing (e.g., no contention on cache lines) would be achieved through the versions used in optimistic lock coupling, since reads don't need to take a lock?
nullpoint420 - 3 months ago

bcachefs may use zero lock contention, but does anyone use bcachefs?
(kidding of course, i do)
edit: just realized this is _the_ Kent Overstreet!
Although my original post was in jest, thank you so much for all your hard work on bcachefs.
bcachefs has personally saved me so much time with its copy-on-write functionality, and for that i am very grateful
- foota - 3 months ago
  
  My new hobby, labelling my unused code as zero contention, because if it's not running there's zero contention :-)

emmanueloga_ - 3 months ago

I wonder about the angle of the article, starting with "B-Trees stand the test of time" ending with "everything else has seriously diminishing returns".

I never ask myself why we still use hashmaps or heaps or whatnot, so makes me wonder if this is really an article about why Cedar does not use something else? (LSM-trees the elephant in the room?)

foota - 3 months ago

This is primarily for in-memory data. You need something for the base layer of a LSM-tree no matter how you do it.

NightMKoder - 3 months ago

Talking of trees and caches, back in school I remember learning about splay trees. I’ve never actually seen one used in a production system though, I assume because of the poor concurrency. Has anyone heard of any systems with tree rebalancing based on the workload (ie reads too not just writes)?

selamtux - 3 months ago

maybe weighted trees? hot paths are lower weight so it will be more close the root.
splay tree are good if you are not accessing concurrently and ordered items. next item always be in root

uds5501 - 3 months ago

[not-author] but found it really fascinating. My open questions - is the version monotonically increasing though? - how would one handle wrap arounds once the versions don't fit a datatype? - who decides the version to be assigned to a thread's attempt?

lvogel - 3 months ago

The version is just an atomic integer assigned to that btree node which is monotonically increasing. Each writer increases the version when it releases the lock IF it has modified the node.
Wraparounds are only a theoretical issue. There would have to be exactly UINT64_MAX writers between a reader first checking the version and verifying the version.
- adrian_b - 3 months ago
  
  "Optimistic locking" is a bad choice of words because no locking is optimistic.
  In the well-known access method described here, the writers access the shared data with mutual exclusion, i.e. they use locking.
  The readers use no locking, but they access the shared data concurrently and optimistically, hoping that the access will succeed at the first attempt.
  When the readers are unlucky, they must retry the access.
  So there is locking used by writers for mutual exclusion and there is optimistic access used by the readers.
  There is no "optimistic locking", which is a contradiction in terms (locking is pessimistic).
  In general, there are only 3 methods for accessing shared data: mutual exclusion (a.k.a. pessimistic access), where locking forces the accesses to be sequential, and 2 methods where accesses may be concurrent, optimistic access (a.k.a. lock-free), where retries may be necessary, and dynamic partitioning of the shared data (typically used for shared arrays or for shared buffers/queues), where neither locking nor retries are needed.
  The method described here for accessing B-trees employs a combination of all 3 methods, because the release of the locks at higher levels is a consequence of restricting the future accesses to only a part of the shared data, i.e. the writers that access the shared B-tree start by accessing sequentially the root, but then they partition the tree between themselves, so the next accesses that fall in distinct subtrees may proceed concurrently.
  - layer8 - 3 months ago
    
    “Optimistic locking” is a well-established and widespread terminology though, not the least due to its catchiness. The more factual, but unwieldy term is “optimistic concurrency control”.
    
    adrian_b - 3 months ago
    
    I agree that it is widespread, but whenever you see such illogical terms you have to wonder whether the authors who use them do not understand what they are really doing or they understand, but they succumb to conforming with a widespread inappropriate usage in the hope to be better understood by naive readers.
    Understanding the difference between pessimistic access (mutual exclusion implemented by locking) and optimistic access (concurrent accesses with retries when necessary) is absolutely critical for the correct and efficient implementation of algorithms that use shared data structures.
    It is frequent to combine both methods in an algorithm, like here, but in English that is not "optimistic locking", but at most "locking and optimistic access" or "optimism and locking".
    Pessimistic access means that you expect that another process will attempt to access concurrently the shared data, so you must use a lock to prevent this. Optimistic access means that you expect that no other process will attempt to access concurrently the shared data, so you may proceed to access it immediately, but then you must have some means to detect that your assumption has been wrong and another process has interfered, when the transaction must be retried.
    Depending on the application, either pessimistic access or optimistic access results in a better performance, neither is always better than the other. Optimistic access (lock-free access) makes better the best case, but it makes much worse the worst case. Depending on the frequency distribution of such cases optimistic access increases or decreases the performance.
    Pessimistic access and optimistic access have the advantage of being applicable to any kind of shared data structure, but dynamic data partitioning, where applicable, like for this shared B-tree example, which can be partitioned in sub-trees accessed concurrently, normally results in better performance than both pessimistic access and optimistic access, by being deterministic and avoiding both locking and retries. Dynamic data partitioning may require locking for a very short time in order to partition the shared resource before accessing the allocated part, though the mutual exclusion provided by atomic instructions may be sufficient for this purpose. It is frequent that an atomic fetch-and-add instruction is enough to partition a shared data structure, like a shared array or a shared message queue, between concurrent processes attempting to access it.
- xxs - 3 months ago
  
  I chuckle every time anyone has tried to handle long(64bit) overflows while incrementing by one.
  Other than that, the version checks + retries have been a thing since forever (they are the most bog standard way to do any lock-free datastructures, along with database updates in the same manner). They do need a back off, though.
- pfent - 3 months ago
  
  When incrementing the version with 4 GHz, it takes over 100 years non-stop for a 64bit wraparound.
  - xxs - 3 months ago
    
    likely even more as it has to be an atomic operation which causes extra latency and coherency traffic

vander_elst - 3 months ago

A bit of a tangential question. Does anyone have any experience with cedars?

kikimora - 3 months ago

How do you implement a concurrency-safe copy operation? Especially if data structure you copy is 64Kb.

mrkeen - 3 months ago

The first question I always ask about concurrency is: can I exploit immutability?
Once something doesn't change, your copy is a no-op, so size doesn't really matter.