Jennifer Aniston and Friends Cost Us 377GB and Broke Ext4 Hardlinks
blog.discourse.org
22 points by speckx 2 hours ago
In short: deduplication efforts were frustrated by ext4's per-inode hardlink limit, and the fix had to stay compatible with different filesystems.
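For context, a minimal sketch of what's at stake: hardlinks are extra directory names for the same inode, and `st_nlink` counts them. ext4 caps that count at 65,000 per inode, which is the limit the story is about. File names here are illustrative.

```python
import os
import tempfile

# Sketch: hardlinks are extra names for one inode; st_nlink counts them.
# ext4 caps st_nlink at 65,000 per inode, after which os.link fails
# with EMLINK -- the limit the headline refers to.
d = tempfile.mkdtemp()
orig = os.path.join(d, "avatar.png")
with open(orig, "wb") as f:
    f.write(b"same bytes everywhere")

link1 = os.path.join(d, "copy1.png")
link2 = os.path.join(d, "copy2.png")
os.link(orig, link1)
os.link(orig, link2)

st = os.stat(orig)
print(st.st_nlink)  # 3: the original name plus two links
print(os.stat(link1).st_ino == st.st_ino)  # True: same inode
```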
The real problem is they aren't deduplicating at the filesystem level like sane people do.
And I thought this was a reference to a Win95 problem https://www.slashgear.com/1414245/jennifer-aniston-matthew-p...
Yeah, block-level dedupe has been an industry standard for decades. Tracking file hashes? Why?
And I see above that this is a self-hosted platform, and I still don't get it. I was running terabytes of ZFS with dedup=on on cheap Supermicro gear in 2012.
We were on a break...of your filesystem!
The Problem. The fix. The Limit.
Is it just me, or is everybody else just as fed up with the same AI tropes over and over?
I've reached a point where I just close the tab the moment I read "The problem" in a headline. At least use tropes.fyi, please.
This makes them look rather incompetent. Storing the exact same file 246,173 times is just stupid. Dedupe at the filesystem level and make your life easier.
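For reference, the scheme being criticized looks roughly like this: hash whole files and replace duplicates with hardlinks to the first copy seen. A minimal sketch; the function and file names are illustrative, not Discourse's actual code.

```python
import hashlib
import os
import tempfile

def dedupe_files(paths):
    """Toy file-level dedup: hash each file, hardlink duplicates to the
    first copy seen. On ext4, os.link raises EMLINK once the canonical
    inode already has ~65,000 names -- the failure from the article."""
    seen = {}  # digest -> canonical path
    for p in paths:
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen:
            os.remove(p)
            os.link(seen[digest], p)  # can hit the per-inode link cap
        else:
            seen[digest] = p

# Demo: two identical "avatars" collapse onto one inode.
d = tempfile.mkdtemp()
files = [os.path.join(d, n) for n in ("a.png", "b.png")]
for p in files:
    with open(p, "wb") as f:
        f.write(b"same avatar bytes")
dedupe_files(files)
print(os.stat(files[0]).st_ino == os.stat(files[1]).st_ino)  # True
```

Storing the same file 246,173 times exceeds that cap several times over, hence the thread's argument for pushing dedup down to the filesystem.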
As is always the case, short vs long term... but I think I'd put effort into migrating to a filesystem that is aware of duplication instead of trying to recreate one with links [while retaining duplicates, just fewer]. For backups and the live data. Users are wild.
Effectiveness is debatable; I'd say this approach still has duplication. Absolutely an 'insignificant' amount... in this instance! The filesystem handling this at the block level is probably less problematic or prone to rework.
edit: Eh, ignore me. I see this is preparing for [whatever filesystem hosts chose] thanks to 'ameliaquining' below. Originally thought this was all Discourse-proper, processing data they had.
Discourse is self-hostable; they can't require their users to use a filesystem that supports deduplication. (Or, well, they could, but it would greatly complicate installation and maintenance and whatnot, and also there would need to be some kind of story for existing installations.)
Fair, I'm confused by the model/presentation. This is a nice User-preparation/consideration, I guess. I still maintain using a backup store/filesystem unaware of duplication at the block level is a mistake. If nothing else, it will strengthen this approach and the live data.
Completely missed the shipping-of-tarballs. Links make sense here; I had 'unpacked' data in mind. Absolutely would not go so far as to suggest their scheme pick up 'zfs {send,receive}'/equivalent, lol.