Internet Archive's Storage

blog.dshr.org

224 points by zdw 4 days ago


dr_dshiv - 7 hours ago

> Li correctly points out that the Archive's budget, in the range of $25-30M/year, is vastly lower than any comparable website: By owning its hardware, using the PetaBox high-density architecture, avoiding air conditioning costs, and using open-source software, the Archive achieves a storage cost efficiency that is orders of magnitude better than commercial cloud rates.

That’s impressive. Wikipedia spends $185M per year and the Seattle Public Library spends $102M. Maybe not exactly comparable, but $30M per year seems inexpensive for the memory of the world…

mrexroad - 8 hours ago

> This "waste heat" system is a closed loop of efficiency. The 60+ kilowatts of heat energy produced by a storage cluster is not a byproduct to be eliminated but a resource to be harvested.

Are there any other data centers harvesting waste heat for benefit?

arjie - 10 hours ago

This is very cool. One thing I'm curious about is the software side of things and the details of the hardware. What are the filesystem and RAID (or lack thereof) layers used to deal with this optimally? Looking into it a little:

* power budget dominates everything: I have access to a lot of rack hardware from old connections, but I don't want to put that army of old gear in my cabinet because it would blow my power budget for not much performance compared to my 9755. What disks does the IA use? Any specific model, or a wide mix like Backblaze?

* magnetic is bloody slow: I'm not the Internet Archive, so I'll just have a couple of machines with a few hundred TiB. I'm planning on making them one big ZFS pool so I can deduplicate, but it seems like a single disk failure would doom me to a massive rebuild

I'm sure I can work it out with a modern LLM, but maybe someone here has experience actually running massive storage for the use case where tomorrow's data is almost the same as today's - as with the Internet Archive, where tomorrow's copy of wiki.roshangeorge.dev will look, even at the block level, like yesterday's copy.
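
For what it's worth, here's the mental model I'm working from: fixed-block, content-hash dedup. ZFS applies the same idea per record via its block checksums; the block size and in-memory "store" below are purely illustrative, not how ZFS lays anything out on disk.

```python
# Minimal sketch of fixed-block, content-hash dedup. Illustrative only.
import hashlib

BLOCK_SIZE = 128 * 1024  # 128 KiB, same as ZFS's default recordsize

block_store: dict[str, bytes] = {}  # digest -> block (stands in for disk)

def ingest(data: bytes) -> list[str]:
    """Split data into fixed blocks, store only previously unseen blocks,
    and return the digest list needed to reconstruct the data."""
    recipe = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # only new blocks cost space
        recipe.append(digest)
    return recipe

def reconstruct(recipe: list[str]) -> bytes:
    return b"".join(block_store[d] for d in recipe)

# Two crawls of the same site that differ in one block: the second copy
# only adds the changed block to the store.
yesterday = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
today = yesterday[:3 * BLOCK_SIZE] + b"X" * BLOCK_SIZE

ingest(yesterday)
stored_before = len(block_store)
recipe_new = ingest(today)
print("extra blocks for the second copy:", len(block_store) - stored_before)  # 1
assert reconstruct(recipe_new) == today
```

The catch, as I understand it, is that the digest table has to stay fast to reach (which is a real memory cost at a few hundred TiB) and that dedup only pays off when the blocks actually line up.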

The last time I built with multi-petabyte datasets we were still using Hadoop on HDFS, haha!

arcade79 - 6 hours ago

While reading articles like this, I'm always surprised by how small the storage described is, given that Microsoft published their paper on LRCs in 2012, Google patented a bunch in 2010, and Facebook talked about their approach in the 2010-2014 era too. Ceph started getting good erasure coding around 2016-2020.

Have any of the big players published articles on their storage systems in the last 5-10 years?
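
For anyone who hasn't bumped into erasure codes: the point is surviving lost disks with far less overhead than replication. Below is a toy single-parity example, the simplest possible case; Reed-Solomon and Microsoft's LRCs generalize it to tolerate several losses.

```python
# Toy single-parity erasure code: k data blocks + 1 parity block, any single
# lost block is recoverable by XORing the survivors. Storage overhead is
# (k+1)/k instead of the 3x of plain replication.
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8]  # k = 3 data blocks
parity = xor_blocks(data)                        # 1 parity block

# Lose any one block and rebuild it from the rest:
lost = 1
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[lost]
print("recovered block", lost, ":", rebuilt.hex())
```

The LRC twist is adding extra "local" parities per group of disks, so a single failure can be rebuilt by reading only its group rather than the whole stripe.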

ranger_danger - 11 hours ago

I was hoping an article about IA's storage would go into detail about how their storage currently works, what kind of devices they use, how much they store, how quickly they add new data, the costs, etc., but this seems to only discuss quite old stats.

tylerchilds - 11 hours ago

Why’s Wendy’s Terracotta moved?

JavohirXR - 5 hours ago

I saw the word "delve" and already knew it was either AI-edited or written by AI.