S3 Files

allthingsdistributed.com

269 points by werner 11 hours ago


https://aws.amazon.com/blogs/aws/launching-s3-files-making-s...

MontyCarloHall - 9 hours ago

This is essentially S3FS using EFS (AWS's managed NFS service) as a cache layer for active data and small random accesses. Unfortunately, this also means that it comes with some of EFS's eye-watering pricing:

— All writes cost $0.06/GB, since everything is first written to the EFS cache. For write-heavy applications, this could be a dealbreaker.

— Reads hitting the cache get billed at $0.03/GB. Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.

— Cache is charged at $0.30/GB/month. Even though everything is written to the cache (for consistency purposes), it seems like it's only used for persistent storage of small files (<128kB), so this shouldn't cost too much.
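The rates above can be turned into a rough back-of-envelope estimator. This is a sketch using the numbers quoted in this comment, not official AWS pricing, and the workload figures in the example are made up:

```python
# Back-of-envelope monthly cost for the S3 Files cache layer, using the
# (unofficial) per-GB rates quoted above. Large (>128 kB) reads streamed
# directly from S3 are treated as free, per the comment.
WRITE_PER_GB = 0.06                # every write passes through the EFS cache
CACHED_READ_PER_GB = 0.03          # reads served from the cache
CACHE_STORAGE_PER_GB_MONTH = 0.30  # persistent cache storage (small files)

def monthly_cost(gb_written, gb_read_cached, gb_cache_resident):
    """Estimate the cache-layer portion of the monthly bill."""
    return (gb_written * WRITE_PER_GB
            + gb_read_cached * CACHED_READ_PER_GB
            + gb_cache_resident * CACHE_STORAGE_PER_GB_MONTH)

# e.g. 500 GB written, 200 GB of cache-hit reads, 50 GB resident in cache:
print(monthly_cost(500, 200, 50))  # → 51.0
```

For a write-heavy workload the first term dominates quickly, which is the dealbreaker the comment points at.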

- a minute ago
[deleted]
tao_oat - 2 minutes ago

See also: https://github.com/Barre/ZeroFS

jamesblonde - 2 hours ago

S3 Files was launched today without support for atomic rename. This is not something you can bolt on. Can you imagine running Claude Code on your S3 Files and it just wants to do a little house cleaning, renaming a directory and suddenly a full copy is needed for every file in that directory?

The hardest part in building a distributed filesystem is atomic rename. It's always rename. Scalable metadata file systems, like Colossus/Tectonic/ADLSv2/HopsFS, are either designed around how to make rename work at scale* or how to work around it at higher levels in the stack.

* https://www.hopsworks.ai/post/scalable-metadata-the-new-bree...
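The copy-per-file cost described above can be made concrete. S3 has no rename primitive, so renaming a "directory" (prefix) degenerates into a copy plus a delete for every object under it. The bucket/prefix names below are hypothetical; `copy_object`, `delete_object`, and the `list_objects_v2` paginator are real boto3 calls:

```python
def renamed_key(key, old_prefix, new_prefix):
    """Map an object key under old_prefix to its key under new_prefix."""
    assert key.startswith(old_prefix)
    return new_prefix + key[len(old_prefix):]

def rename_prefix(bucket, old_prefix, new_prefix):
    import boto3  # deferred import so the sketch loads without AWS creds
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # One server-side copy plus one delete per object: O(n) requests,
            # every byte re-copied, and no atomicity across objects. A crash
            # midway leaves the "directory" half renamed.
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": key},
                           Key=renamed_key(key, old_prefix, new_prefix))
            s3.delete_object(Bucket=bucket, Key=key)
```

The non-atomicity in the inner loop is exactly why rename can't be bolted on afterwards: readers can observe the intermediate states.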

everfrustrated - 4 hours ago

The best way to think of the architecture of this is it's EFS with a bidirectional sync to S3.

You can write into one and read out from the other and vice versa. Consistency guarantees kept within each but not between.

wbl - 8 hours ago

"NFS provides the semantics your applications expect" is one of the funniest things I have ever read.

rdtsc - 9 hours ago

Synchronization bits is what I was wondering about: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-fil...

> For example, suppose you edit /mnt/s3files/report.csv through the file system. Before S3 Files synchronizes your changes back to the S3 bucket, another application uploads a new version of report.csv directly to the S3 bucket. When S3 Files detects the conflict, it moves your version of report.csv to the lost and found directory and replaces it with the version from the S3 bucket.

> The lost and found directory is located in your file system's root directory under the name .s3files-lost+found-file-system-id.
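Given that behavior, an application that writes through the mount probably wants to check whether any of its edits lost a conflict. A minimal sketch, assuming the directory naming pattern quoted above (the mount point and file-system id here are hypothetical):

```python
from pathlib import Path

def conflicted_files(mount_point, file_system_id):
    """List files that S3 Files moved aside after losing a sync conflict.

    Assumes the lost-and-found directory sits at the filesystem root under
    the name .s3files-lost+found-<file-system-id>, per the docs quoted above.
    """
    lost_found = Path(mount_point) / f".s3files-lost+found-{file_system_id}"
    if not lost_found.is_dir():
        return []  # no conflicts recorded (or not an S3 Files mount)
    return sorted(p.relative_to(lost_found)
                  for p in lost_found.rglob("*") if p.is_file())
```

Polling this after each sync window would at least surface silently-displaced writes instead of discovering them much later.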

dabinat - 5 hours ago

The problem with using S3 as a filesystem is that it’s immutable, and that hasn’t changed with S3 Files. So if I have a large file and change 1 byte of it, or even just rename it, it needs to upload the entire file all over again. This seems most useful for read-heavy workflows of files that are small enough to fit in the cache.
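The write amplification described here can be quantified. A sketch (pure Python, no AWS calls): an object store only accepts whole-object PUTs, with no byte-range writes, so the upload cost of an edit is the object size rather than the edit size:

```python
def bytes_uploaded_for_edit(object_size, edited_bytes):
    """Compare bytes sent over the wire for an in-place edit.

    On a POSIX filesystem you write only the changed bytes; against an
    immutable object store the whole object must be PUT again.
    """
    posix_write = edited_bytes
    object_store_put = object_size  # whole object, regardless of edit size
    return posix_write, object_store_put

# Editing 1 byte of a 5 GiB file: 1 byte vs. the full 5 GiB re-upload.
posix, s3_put = bytes_uploaded_for_edit(5 * 2**30, 1)
```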

abidlabs - 7 hours ago

Hugging Face Buckets also recently added support for mounting Buckets as a filesystem: https://huggingface.co/changelog/hf-mount

jitl - 9 hours ago

I wish they offered some managed bridging to local NVMe storage. AWS NVMe is super fast compared to EBS, and EBS (node-exclusive access as block device) is faster than EFS (multi-node access). I imagine this can go fast if you put some kind of further-cache-to-NVMe FS on top, but a completely vertically integrated option would be much better.

koolba - 9 hours ago

If you thought locking semantics over NFS were wonky, just wait till we throw a remote S3 backend into the mix!

nyc_pizzadev - 9 hours ago

This is very close to its first official release: https://fiberfs.io/

Built in cache, CDN compatible, JSON metadata, concurrency safe and it targets all S3 compatible storage systems.

gonzalohm - 10 hours ago

I cannot 100% confirm this, but I believe AWS insisted a lot on NOT using S3 as a file system. Why the change now?

curt15 - 5 hours ago

How does this compare with ZFS's object storage backend? https://news.ycombinator.com/item?id=46620673

znpy - 2 hours ago

As usual, everything except pricing is very well explained.

hk1337 - 4 hours ago

This could be useful. We use EFS, I like the benefits but I think it’s overkill for what we need. I’ve been thinking of switching to s3 but not looking forward to completely changing how we upload and download.

nvartolomei - 10 hours ago

> changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT

Single PUT per file I assume?

miguel_martin - 8 hours ago

Dumb Q: what would happen if you used this to store a SQLite database? Would it just... work?

My guess is this would only enable a read-replica and not backups as Litestream currently does?

mgaunard - 10 hours ago

Zero mention of s3fs, which has already done this for decades.

PunchyHamster - 10 hours ago

Eagerly awaiting the first blog post where developers didn't read the "eventually consistent" part, lost their data, and made some "genius" workaround with help from the LLM that got them into that spot in the first place.

dang - 8 hours ago

Since this is the thread that got attention, I've added the announcement link to the toptext and made the title work for both.

themafia - 11 hours ago

> we locked a bunch of our most senior engineers in a room and said we weren’t going to let them out till they had a plan that they all liked.

That's one way to do it.

> When you create or modify files, changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT. Sync runs in both directions, so when other applications modify objects in the bucket, S3 Files automatically spots those modifications and reflects them in the filesystem view automatically.

That sounds about right given the above. I have trouble seeing this as something other than a giant "hack." I already don't enjoy projecting costs for new types of S3 access patterns, and I feel like this has the potential to double the complication I already experience here.

Maybe I'm too frugal, but I've been in the cloud for a decade now, and I've worked very hard to prevent any "surprise" bills from showing up. This seems like a great feature, if you don't care what your AWS bill is each month.

mbana - 9 hours ago

Werner Vogels is awesome. I first discovered his writing when I learned about DynamoDB.

up2isomorphism - 9 hours ago

This is why today's sales pitches are often disguised as tech blogs.

goekjclo - 11 hours ago

the "under the hood uses EFS" part is the most interesting bit here

gervwyk - 10 hours ago

any recommendations for a lambda-based sftp server setup?

Centigonal - 8 hours ago

Terrible day for people who sloppily use filesystem vocabulary when referring to S3 objects and prefixes.

minutesmith - 9 hours ago

[flagged]

devnotes77 - 7 hours ago

[dead]

ovaistariq - 10 hours ago

TLDR: EFS as an eventually consistent cache in front of S3.

mritchie712 - 9 hours ago

tldr: this caches your S3 data in EFS.

we run datalakes using DuckLake and this sounds really useful. GCP should follow suit quickly.

DenisM - 11 hours ago

TLDR: Eventually consistent file system view on top of s3 with read/write cache.

CrzyLngPwd - 10 hours ago

If there is ever a post that needs a TLDR or an AI summary it is that one.

Sell the benefits.

I have around 9 TB in 21m files on S3. How does this change benefit me?

- 10 hours ago
[deleted]