JuiceFS is a distributed POSIX file system built on top of Redis and S3

github.com

153 points by tosh 17 hours ago


wgjordan - 16 hours ago

Related, "The Design & Implementation of Sprites" [1] (also currently on the front page) mentioned JuiceFS in its stack:

> The Sprite storage stack is organized around the JuiceFS model (in fact, we currently use a very hacked-up JuiceFS, with a rewritten SQLite metadata backend). It works by splitting storage into data (“chunks”) and metadata (a map of where the “chunks” are). Data chunks live on object stores; metadata lives in fast local storage. In our case, that metadata store is kept durable with Litestream. Nothing depends on local storage.

[1] https://news.ycombinator.com/item?id=46634450
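
A toy sketch of that data/metadata split (bucket, schema, and chunk size here are invented for illustration, not JuiceFS internals):

```python
import sqlite3
import uuid

import boto3

CHUNK_SIZE = 4 * 1024 * 1024   # illustrative 4 MiB chunks
BUCKET = "chunk-store"         # hypothetical bucket name

s3 = boto3.client("s3")
meta = sqlite3.connect("metadata.db")  # the piece Litestream would keep durable
meta.execute("CREATE TABLE IF NOT EXISTS chunks (path TEXT, idx INTEGER, key TEXT)")

def write_file(path: str, data: bytes) -> None:
    # Data goes to the object store in fixed-size chunks; only the map of
    # where those chunks live is written to the local metadata store.
    for offset in range(0, len(data), CHUNK_SIZE):
        key = uuid.uuid4().hex
        s3.put_object(Bucket=BUCKET, Key=key, Body=data[offset:offset + CHUNK_SIZE])
        meta.execute(
            "INSERT INTO chunks VALUES (?, ?, ?)",
            (path, offset // CHUNK_SIZE, key),
        )
    meta.commit()
```

Lose the metadata store and the chunks are still in the bucket, but nothing tells you how to stitch them back into files.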

staticassertion - 13 hours ago

Do people really trust Redis for something like this? I feel like it's sort of pointless to pair Redis with S3 like this, and it'd be better to see benchmarks with metadata stores that can provide actual guarantees for durability/availability.

Unfortunately, the benchmarks use Redis. Why would I care about distributed storage on a system like S3, which is all about consistency/durability/availability guarantees, just to put my metadata into Redis?

It would be nice to see benchmarks with another metadata store.

willbeddow - 16 hours ago

Juice is cool, but tradeoffs around which metadata store you choose end up being very important. It also writes files in its own uninterpretable format to object storage, so if you lose the metadata store, you lose your data.

When we tried it at Krea we ended up moving on because we couldn't get sufficient performance to train on, and having to choose which datacenter to deploy our metadata store in essentially forced us to use it in only one location at a time.

eerikkivistik - 3 hours ago

I've had to test out various networked filesystems this year for a few use cases (satellite/geo) on a multi petabyte scale. Some of my thoughts:

* JuiceFS - Works well, but for high performance its use is limited where privacy concerns matter. There is the open source version, which is slower. The metadata backend selection really matters if you are tuning for latency.

* Lustre - Heavily optimised for latency. Gets very expensive if you need more bandwidth, as it is tiered and tied to volume sizes. Managed solutions available pretty much everywhere.

* EFS - Surprisingly good these days, still insanely expensive. Useful for small amounts of data (few terabytes).

* FlexFS - An interesting beast. It murders on bandwidth/cost, but slightly loses on latency-sensitive operations. Great if you have petabyte-scale data and need to parallel process it, but it struggles when you have tooling that does many small unbuffered writes.

weinzierl - 7 hours ago

The consistency guarantees are what makes this interesting in my opinion.

> Close-to-open consistency. Once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client. Within the same mount point, all the written data can be read immediately.

> Rename and all other metadata operations are atomic, which are guaranteed by supported metadata engine transaction.

This is a lot more than other "POSIX compatible" overlays claim, and I think similar to what NFSv4 promises. There are lots of subtleties there, though, and I doubt you could safely run a database on it.
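
For concreteness, a minimal sketch of what that close-to-open contract buys you (the mount point and path are hypothetical):

```python
import os

PATH = "/jfs/shared/result.txt"  # hypothetical file on a JuiceFS mount

# Client A: write, flush, close. The guarantee only starts once close() returns.
with open(PATH, "w") as f:
    f.write("done\n")
    f.flush()
    os.fsync(f.fileno())

# Client B (another machine mounting the same volume): any open() issued
# after A's close() must observe the written data.
with open(PATH) as f:
    assert f.read() == "done\n"

# Not promised: a client that already had the file open before A closed it
# may keep serving stale cached data until it reopens the file.
```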

tuhgdetzhh - 13 hours ago

I've tested various POSIX FS projects over the years and every one has its shortcomings in one way or another.

Although the maintainers of these projects disagree, I mostly consider them a workaround for smaller projects. For big data (PB range) and critical production workloads I recommend biting the bullet and making your software natively S3 compatible instead of going through a POSIX-mounted S3 proxy.
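
A rough illustration of the difference (bucket and key are made up; uses boto3):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake"                  # hypothetical bucket
KEY = "datasets/2024/part-0001.parquet"  # hypothetical key

# Rather than open("/mnt/jfs/datasets/2024/part-0001.parquet") through a
# FUSE mount, talk to S3 directly: no metadata store, no POSIX emulation,
# and you get S3's own durability and consistency guarantees end to end.
s3.upload_file("part-0001.parquet", BUCKET, KEY)
body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
```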

mattbillenstein - 12 hours ago

The key with s3, I think, is using it mostly as a blobstore. We put the important metadata into postgres so we can quickly select stuff that needs to be updated based on other things being newer. So we don't need to touch s3 that often if we don't need the actual data.

When we actually need to manipulate or generate something in Python, we download/upload to S3 and wrap it all in a tempfile.TemporaryDirectory() to clean up the local disk when we're done. If you don't do this, you eventually end up with a bunch of garbage in /tmp/ that you need to deal with.
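
Something like this, presumably (bucket and key names are invented):

```python
import os
import tempfile

import boto3

s3 = boto3.client("s3")
BUCKET = "my-blobstore"  # hypothetical bucket

def process_blob(key: str) -> None:
    # Everything under tmp is deleted when the with-block exits, so no
    # garbage accumulates in /tmp/ even if processing fails.
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, os.path.basename(key))
        s3.download_file(BUCKET, key, local)
        # ... manipulate or generate files locally ...
        s3.upload_file(local, BUCKET, key)
```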

We also have some longer-lived disk caches; using the data in the db and an os.stat() on the file we can easily know whether the cache is up to date without hitting s3. And for this cache we can just delete whatever is old wrt os.stat() to manage its size, since we can always get it from s3 again if needed in the future.
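
A hedged sketch of that check (column names and helpers are made up; assumes the db timestamp is timezone-aware UTC):

```python
import os
import time
from datetime import datetime, timezone

def cache_is_fresh(local_path: str, db_updated_at: datetime) -> bool:
    # Compare the cached file's mtime against the row's timestamp from
    # postgres; only go back to s3 when the cache is missing or stale.
    try:
        st = os.stat(local_path)
    except FileNotFoundError:
        return False
    return datetime.fromtimestamp(st.st_mtime, tz=timezone.utc) >= db_updated_at

def evict_old(cache_dir: str, max_age_days: int = 30) -> None:
    # Anything not touched recently can be deleted; it is always
    # re-fetchable from s3.
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and os.stat(path).st_mtime < cutoff:
            os.remove(path)
```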

hsn915 - 9 hours ago

This is upside down.

We need a kernel native distributed file system so that we can build distributed storage/databases on top of it.

This is like building an operating system on top of a browser.

sabslikesobs - 14 hours ago

See also their User Stories: https://juicefs.com/en/blog/user-stories

I'm not an enterprise-storage guy (just sqlite on a local volume for me so far!) so those really helped de-abstractify what JuiceFS is for.

eru - 10 hours ago

Distributed filesystem and POSIX don't go together well.

Plasmoid - 16 hours ago

I was actually looking at using this to replace our mongo disks so we could easily cold-store our data.

jeffbee - 14 hours ago

It is not clear that pjdfstest establishes full POSIX semantic compliance. After a short search of the repo I did not see anything that exercises multiple unrelated processes atomically writing with O_APPEND, for example. And the fact that their graphic shows applications interfacing with JuiceFS over NFS and SMB casts further doubt, since both of those lack many POSIX semantic properties.
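
A sketch of the sort of check that appears to be missing: two independent processes append fixed-size records through separate file descriptors, then the file is scanned for torn or interleaved records (the path is hypothetical):

```python
import os
from multiprocessing import Process

PATH = "/jfs/append_test.bin"  # hypothetical file on the mount under test
RECORD = 4096
COUNT = 1000

def writer(tag: bytes) -> None:
    # Each process opens its own fd; O_APPEND requires every write to land
    # atomically at the current end of file.
    fd = os.open(PATH, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        for _ in range(COUNT):
            os.write(fd, tag * RECORD)
    finally:
        os.close(fd)

if __name__ == "__main__":
    procs = [Process(target=writer, args=(t,)) for t in (b"A", b"B")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    with open(PATH, "rb") as f:
        data = f.read()
    assert len(data) == 2 * COUNT * RECORD, "lost or short writes"
    for i in range(0, len(data), RECORD):
        chunk = data[i:i + RECORD]
        assert chunk in (b"A" * RECORD, b"B" * RECORD), f"interleaved write at offset {i}"
```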

Over the decades I have written test harnesses for many distributed filesystems and the only one that seemed to actually offer POSIX semantics was LustreFS, which, for related reasons, is also an operability nightmare.

Eikon - 16 hours ago

ZeroFS [0] outperforms JuiceFS on common small file workloads [1] while only requiring S3 and no 3rd party database.

[0] https://github.com/Barre/ZeroFS

[1] https://www.zerofs.net/zerofs-vs-juicefs

IshKebab - 16 hours ago

Interesting. Would this be suitable as a replacement for NFS? In my experience literally everyone in the silicon design industry uses NFS on their compute grid and it sucks in numerous ways:

* poor locking support (this sounds like it works better)

* it's slow

* no manual fence support; a bad but common way of distributing workloads is e.g. to compile a test on one machine (on an NFS mount), and then use SLURM or SGE to run the test on other machines. You use NFS to let the other machines access the data... and this works... except that you either have to disable write caches or have horrible hacks to make the output of the first machine visible to the others. What you really want is a manual fence: "make all changes to this directory visible on the server" (see the sketch after this list).

* The bloody .nfs000000 files. I think this might be fixed by NFSv4 but it seems like nobody actually uses that. (Not helped by the fact that CentOS 7 is considered "modern" to EDA people.)
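
On the fence point above: the closest portable approximation I know of is to fsync every output file (and each directory) on the producing machine before handing the job off, and rely on close-to-open revalidation when the consumers open the files. A hedged sketch, not an NFS or JuiceFS feature:

```python
import os

def fence(directory: str) -> None:
    # Best-effort "make my writes visible": flush every regular file and
    # each directory to the server before another machine reads them.
    # Consumers must still open() the files fresh afterwards.
    for root, _dirs, files in os.walk(directory):
        for name in files:
            fd = os.open(os.path.join(root, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
        dfd = os.open(root, os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)
```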