Case study: recovery of a corrupted 12 TB multi-device pool
github.com
79 points by salt4034 10 hours ago
> This is not a bug report. [...] The goal is constructive, not a complaint.
Er, I appreciate trying to be constructive, but in what possible situation is it not a bug that a power cycle can lose the pool? And if it's not technically a "bug" because BTRFS officially specifies that it can fail like that, why is that not in big bold text at the start of any docs on it? 'Cuz that's kind of a big deal for users to know.
EDIT: From the longer write-up:
> Initial damage. A hard power cycle interrupted a commit at generation 18958 to 18959. Both DUP copies of several metadata blocks were written with inconsistent parent and child generations.
Did the author disable safety mechanisms for that to happen? I'm coming from being more familiar with ZFS, but I would have expected BTRFS to also use a CoW model where it wasn't possible to have multiple inconsistent metadata blocks in a way that didn't just revert you to the last fully-good commit. If it does that by default but there's a way to disable that protection in the name of improving performance, that would significantly change my view of this whole thing.
As far as I can see, no, the author disabled nothing of the sort that he documented.
I suspect that the author's intent is less "I do not view this as a bug" and more "I do not think it's useful to get into angry debates over whether something is a bug". I do not know whether this is a common thing on btrfs discussions, but I have certainly seen debates to that effect elsewhere.
(My personal favorite remains "it's not a data loss bug if someone could technically theoretically write something to recover the data". Perhaps, technically, that's true, but if nobody is writing such a tool, nobody is going to care about the semantics there.)
> I suspect that the author's intent is less "I do not view this as a bug" and more "I do not think it's useful to get into angry debates over whether something is a bug".
Agreed, and I appreciate the attempt to channel things into a productive conversation.
btrfs's reputation is not great in this regard.
As far as I understand, single device and RAID1 are solid, but as soon as you want to do RAID1+0 or RAID5/6 you’re entering dangerous territory with BTRFS.
Unless I missed it, the writeup never identifies a causal bug, only things that made recovery harder.
Welp. Guess I need to figure out another fs to use for a few drives in a nonraid pool I haven't gotten around to setting up yet. I forget why zfs seemed out. xfs?
Added to my list of reasons to never use btrfs in production.
Using DUP as the metadata profile sounds insane.
Changing the metadata profile to at least raid1 (raid1, raid1c3, raid1c4) is a good idea, especially for anyone, against recommendations, using raid5 or raid6 for a btrfs array (raid1c3 is more appropriate for raid6). That would make it very difficult for metadata to get corrupted, which is the lion's share of the higher-impact problems with raid5/6 btrfs.
check:
btrfs fi df <mountpoint>
convert metadata: btrfs balance start -mconvert=raid1c3,soft <mountpoint>
(make sure it's -mconvert, where m is for metadata, not -dconvert, which would switch profiles for data, messing up your array)
People swear btrfs is "safe" now, but I've personally been bitten by data corruption more than once, so I stay away from it now.
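For anyone who wants to script the `btrfs fi df` check mentioned above, here is a minimal sketch in Python. It only parses text that has already been captured from `btrfs fi df <mountpoint>`; it does not run btrfs itself. The function names and the sample output are made up for illustration, and the parsing assumes the usual `Type, PROFILE: total=..., used=...` line shape.

```python
import re

# Profiles that keep every metadata copy on one device (DUP) or keep only one copy (single).
SINGLE_DEVICE_PROFILES = {"single", "dup"}

# Matches lines such as "Metadata, DUP: total=20.00GiB, used=15.20GiB".
LINE_RE = re.compile(r"^([^,]+),\s*(\w+):")


def block_group_profiles(fi_df_output: str) -> dict[str, set[str]]:
    """Map each block-group type to the set of profiles reported for it (lowercased).

    A type can appear with more than one profile while a balance is in progress,
    which is why the values are sets.
    """
    profiles: dict[str, set[str]] = {}
    for line in fi_df_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            kind, profile = match.group(1), match.group(2).lower()
            profiles.setdefault(kind, set()).add(profile)
    return profiles


def metadata_needs_conversion(fi_df_output: str) -> bool:
    """True if any metadata block group still uses a single-device profile."""
    metadata = block_group_profiles(fi_df_output).get("Metadata", set())
    return bool(metadata & SINGLE_DEVICE_PROFILES)


# Hypothetical sample, not taken from the write-up.
SAMPLE = """\
Data, RAID5: total=10.00TiB, used=9.50TiB
System, DUP: total=32.00MiB, used=704.00KiB
Metadata, DUP: total=20.00GiB, used=15.20GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

if __name__ == "__main__":
    print(metadata_needs_conversion(SAMPLE))
```

This does not handle mixed block groups (a `Data+Metadata` line is treated as its own type), so treat it as a starting point rather than a complete check.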
This is obviously LLM output, but perhaps LLM output that corresponds to a real scenario. It's plausible that Claude was able to autonomously recover a corrupted fs, but I would not trust its "insights" by default. I'd love to see a btrfs dev's take on this!
This was also my first impulse. The second was, if this happened to me, I would not be able to recover it. All the custom C tool talk... If you ask Claude Code it will code something up.
Well that he recovered the disks is amazing in itself. I would have given up and just pulled a backup.
However, I would like to see a dev saying: why didn't you use the --<flag> which we created for this use case?
I was assuming real scenario with heavy LLM help to recover. Would be nice for the author to clarify. And, separately, for BTRFS devs to weigh in, though I'd somewhat prefer to get some indication that it's real before spending their time.
> Case study: recovery of a severely corrupted 12 TB multi-device pool, plus constructive gap analysis and reference tool set #1107
Please don't be btrfs please don't be btrfs please don't be btrfs...
Where are all of the ZFS corruption stories? Or are there simply fewer of those?
Most of them are from new features that didn't get a proper shakedown test, like encryption.
Not sure about the stats, but it does feel like there are fewer. So from what I know encryption and sending fs state had bugs in ZFS.
And on btrfs anything above raid1 (5,6 etc) has had very serious bugs. Actually read an opinion somewhere (don't remember where) raid5,6 on btrfs cannot work due to on-disk format being just bad for the case. I guess this is why raid1c3/c4 is being promoted and worked on now?
I mean, the only other option was bcachefs, which might have been funny if this LLM-generated blogpost were written by the OpenClaw instance the developer has decided is sentient:
https://www.reddit.com/r/bcachefs/comments/1rblll1/the_blog_...
But no. It was btrfs.
As a side note, it's somewhat impressive that an LLM agent was able to produce a suite of custom tools that were apparently successfully used to recover some data from a corrupted btrfs array, even ad-hoc.
It could be ZFS. I'd be much more surprised, but it can still have bugs.
ZFS on Linux has had many bugs over the years, notably with ZFS-native encryption and especially sending/receiving encrypted volumes. Another issue is that using swap on ZFS is still guaranteed to hang the kernel in low memory scenarios, because ZFS needs to allocate memory to write to swap.
The zero copy that zero copied unencrypted blocks onto encrypted file systems was genius. It’s almost like they don’t test.
To the author: did you continue using btrfs after this ordeal? An FS that will not eat (all) your data upon a hard powercycle only at the cost of 14 custom C tools is a hard pass from me no matter how many distros try to push it down my throat as 'production-ready'...
Also, impressive work!
What are the alternatives to btrfs? At 12 TB data checksums are a must unless the data tolerate bit-rot. And if one wants to stick with the official kernel without out-of-tree modules, btrfs is the only choice.
I tried btrfs on three different occasions. Three times it managed to corrupt itself. I'll admit I was too enthusiastic the first time, trying it less than a year after it appeared in major distros. But the latter two are unforgivable (I had to reinstall my mom's laptop).
I've been using ZFS for my NAS-like thing since then. It's been rock solid (*).
(*): I know about the block cloning bug, and the encryption bug. Luckily I avoided those (I don't tend to enable new features like block cloning, and I didn't have an encrypted dataset at the time). Still, all in all it's been really good in comparison to btrfs.
Good thing all disks these days have data checksums, then!
(50TB+ on ext4 and xfs, and no, no bit rot. Yes, I've checked most of it against separate sha256sum files now and then. As long as you have ECC RAM, disks just magically corrupting your data is largely a myth.)
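The "check it against separate sha256sum files" routine described above can be sketched roughly as follows. This is an illustrative Python version, not the commenter's actual script: `sha256_of` and `verify_manifest` are hypothetical names, and the manifest is assumed to use the plain `sha256sum` layout (`<hex digest>  <filename>`, with an optional `*` binary-mode marker). Escaped filenames and names with leading spaces are not handled.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose hash no longer matches.

    Paths in the manifest are resolved relative to the manifest's own directory.
    """
    mismatched = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.removeprefix("*")  # drop the binary-mode marker if present
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            mismatched.append(name)
    return mismatched
```

An empty list means every file listed in the manifest still hashes to its recorded value. Note that this only detects corruption after the fact; unlike a checksumming filesystem with redundancy, it cannot repair anything.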
> if one wants to stick with the official kernel without out-of-tree modules
I wonder how a requirement like that could possibly arise. Especially with an obvious exception for zfs.
Bcachefs also fulfills the requirement of checksums (and multi device support).
Also out of tree.
Isn't bcachefs even younger and less polished than btrfs? It does show more promise as btrfs seems to have fundamental design issues... but still I wouldn't use that for my important data.
I don't disagree. Gotta have backups for important data either way too!
Just talking about filesystems with checksumming (and multidevice). Any new filesystem to support these features is going to be newer.
I've had both btrfs and bcachefs multidevice filesystems lock up read-only on me. So no real data loss, just a pain to get the data into a new file system, especially the time it was an 8-drive array on btrfs.
lvm offers lvmraid, integrity, and snapshots as one example. It's old unsexy tech, but losing data is not to my taste lately...
lvm only supports checksums for metadata. It does not checksum the data itself. For checksums with arbitrary filesystems one can use a dm-integrity device rather than LVM, but performance suffers because the device writes its own separate journal.
What devices are you talking about, what's the UBER, over what period of time?
RAID and logical block redundancy have scaled to petabytes for years in serious production use, since before btrfs was even developed.
Could try ZFS or CephFS... even if several host roles are in VM containers (45Drives has a product setup that way.)
The btrfs solution has a mixed history, and had a lot of the same issues DRBD could get. They are great until some hardware/kernel-mod eventually goes sideways, and then the auto-heal cluster filesystems start to make a lot more sense. Note, with cluster based complete-file copy/repair object features the damage is localized to single files at worst, and folks don't have to wait 3 days to bring up the cluster on a crash.
Best of luck, =3