We replaced H.264 streaming with JPEG screenshots (and it worked better)
blog.helix.ml
523 points by quesobob 3 months ago
Setting aside the various formatting problems and the LLM writing style, this just seems all kinds of wrong throughout.
> “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10Mbps should be way more than enough for a mostly static image with some scrolling text. (And 40Mbps are ridiculous.) This is very likely to be caused by bad encoding settings and/or a bad encoder.
> “What if we only send keyframes?”

The post goes on to explain how this does not work because some other component needs to see P-frames. If that is the case, just configure your encoder to have very short keyframe intervals.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
A single H.264 keyframe can be whatever size you want, *depending on how you configure your encoder*, which was apparently never seriously attempted. Why are we badly reinventing MJPEG instead of configuring the tools we already have? Lower the bitrate and keyint, use a better encoder for higher quality, lower the frame rate if you need to. (If 10 fps JPEGs are acceptable, surely you should try 10 fps H.264 too?)
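For a rough sanity check, the article's own numbers already make the point (JPEG sizes are from the post; the 10 fps polling rate is an assumption):

```python
# Back-of-the-envelope bandwidth math using the article's quoted figures.
# 10 fps is an assumed polling rate, not a number from the post.
jpeg_kb = 150        # upper end of the quoted 100-150 KB per 70%-quality 1080p JPEG
fps = 10

mjpeg_mbps = jpeg_kb * 8 * fps / 1000   # KB/frame -> kilobits/frame -> megabits/s

# JPEG polling at these figures costs about 12 Mbps -- already more than the
# 10 Mbps H.264 stream the post dismisses as "blocky garbage". At equal
# bitrate, H.264 with sane keyint/rate-control settings should look strictly
# better than per-frame JPEGs, since it also gets to exploit inter-frame
# redundancy on a mostly static desktop.
```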
But all in all the main problem seems to be squeezing an entire video stream through a single TCP connection. There are plenty of existing solutions for this. For example, this article never mentions DASH, which is made for these exact purposes.
> Why are we badly reinventing MJPEG instead of configuring the tools we already have?
Is it much of a stretch to assume that in the AI gold rush, there will be products made by people who are not very experienced engineers, but just push forward and assume the LLM will fix all their problems? :-)
I built a little tool using AI recently and it worked great but it was brittle as hell and I was constantly waiting for it to fail. A few days later I realized there was a much better way of writing it. I'd boxed the LLM in by proposing the way to code it.
I've changed my AGENTS.md now so it basically says "Assume user is ignorant to other better solutions to the problem they are asking. Don't assume their given solution to the problem is the best one, look at the problem itself and propose other ways to solve it."
*Why are we badly reinventing MJPEG instead of configuring the tools we already have?*
Getting to know and understand existing tools costs time/money. Whether that is less or more expensive than reinventing something badly is very complicated to judge and depends on loads of factors.
It might be that reinventing something badly - but well enough for the use case - is the best use of resources.
From TFA:
Implementation complexity:
h264 Stream: 3 months of rust
JPEG Spam: fetch() in a loop
I don't see how it could have taken 3 months to read up on existing technologies. And that "3 month" number is before we start factoring in time spent on:

* Writing code for JPEG Spam / "fetch() in a loop" method
* Mechanisms to switch between h264 / jpeg modes
* Debugging implementation of 2 modes
* Debugging switching back and forth between the 2 modes
* Maintenance of 2 modes into the future
>Setting aside...the LLM writing style
I don't want to set that aside either. Why is AI generated slop getting voted to the top of HN? If you can't be bothered to spend the time writing a blog post, why should I be bothered spending my time reading it? It's frankly a little bit insulting.
Don’t assume something you cannot prove. It was great writing.
Normally the 1 sentence per para LinkedIn post for dummies writing style bugs me to no end, but for a technical article that's continually hopping between questions, results, code, and explanations, it fits really well and was a very easy article to skim and understand.
It's action thriller writing for something that in reality is super dull (my question is loaded with outdated cliches, but would you be telling a girl you're trying to impress at a party about this problem you faced of trying to push some data over a network?). I had to skim over it, like watching a YouTube video at 2x so I don't start evaluating how obnoxious the narrator is.
>Don’t assume something you cannot prove.
Well it's an inherently unprovable accusation, so assumption will have to do. It reeks of LLM-ese in certain word choices, phrases, and structure, though. I thought it was quite clear.
>It was great writing
Err... no accounting for taste, I suppose.
Just saying, but with LLM-ese being the common denominator of how people write, it is likely the writing style of a lot of people.
How many people do you know who use em dashes — none.
Em-dash has a trivial and mnemonic shortcut on Macs (Option-shift-hyphen), so I've been an em-dash user for as long as I've had one.
I committed typing en-dashes and ellipses on Windows to muscle memory. Alt+0150, Alt+0133. Bam!
I'm sure there are easier ways this can be set up. But, as I said, muscle memory.
Although I'll have to admit that wanting to use proper typography in the first place probably started when I was typesetting a print magazine on a Mac, where it's super easy to do it the proper way.
(I'm also never going to let AI slop discourage me from trying to use proper punctuation.)
Do yourself a favour and use a typographic keyboard layout.
You mean other than this being an AI slop company whose use case is monitoring AI slop output, and the author confirming the blog is AI slop? https://news.ycombinator.com/item?id=46372060
Looked like typical medium.com slop but with a bit more technical detail. Not sure where you see greatness
> For example, this article never mentions DASH, which is made for these exact purposes.
DASH isn't supported on Apple AFAIK. HLS would be an idea, yes...
But in either case: you need ffmpeg somewhere in your pipeline for that experience to be even remotely enjoyable. No ffmpeg? No luck, good luck implementing all of that shit yourself.
Or Gstreamer, which the article says they were using.
> DASH isn't supported on Apple AFAIK. HLS would be an idea, yes...
They said they implemented a WebCodecs websocket custom implementation, surely they can use Dash.js here. Or rather, their LLM can since it's doubtful they are writing any actual code.
They would need to use LL-DASH or HLS low latency but it's quite achievable.
Huh? This is the least LLM writing style I've encountered. Extraordinary claims require extraordinary proof.
It's not an extraordinary claim, it's a mundane and plausible one. This is exactly what you get when you ask an LLM to write in a "engaging conversational" style, and skip any editing after the fact. You could never prove it but there are a LOT of tells.
"The key insight" - llms love key insights! "self-contained corruption-free" - they also love over-hyphenating, as much as they love em-dashing. Both abundant here. "X like it's 2005" and also "Y like it's 2009" - what a cool casual turn of phrase, so natural! The architecture diagram is definitely unedited AI; Claude always messes up the border alignment on ascii boxes.
I wouldn't mind except the end result is imprecise and sloppy, as pointed out by the GP comment. And the tone is so predictable/boring at this point, I'd MUCH rather read poorly written human output with some actual personality.
ai detectors are never totally accurate but this one is quite good and it suggests something like 80% of this article is llm generated. honestly idk how you didn't get that just by reading it tho, maybe you haven't been exposed to much modern llm-generated content?
https://www.pangram.com/history/5cec2f02-6fd6-4c97-8e71-d509...
> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
This would make sense... if they were using UDP, but they are using TCP. All the JPEGs they send will get there eventually (unless the connection drops). JPEG does not fix your buffering and congestion control problems. What presumably happened here is the way they implemented their JPEG screenshots, they have some mechanism that minimizes the number of frames that are in-flight. This is not some inherent property of JPEG though.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB. We’re sending LESS data per frame AND getting better reliability.
h.264 has better coding efficiency than JPEG. For a given target size, you should be able to get better quality from an h.264 IDR frame than a JPEG. There is no fixed size to an IDR frame.
Ultimately, the problem here is a lack of bandwidth estimation (apart from the sort of binary "good network"/"cafe mode" thing they ultimately implemented). To be fair, this is difficult to do and being stuck with TCP makes it a bit more difficult. Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.
WebRTC will do this for you if you can use it, which actually suggests a different solution to this problem: use websockets for dumb corporate network firewall rules and just use WebRTC everything else
They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading. UDP is not necessary to write a loop.
> They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading.
You're right, I don't know how I managed to skip over that.
> UDP is not necessary to write a loop.
True, but this doesn't really have anything to do with using JPEG either. They basically implemented a primitive form of rate control by only allowing a single frame to be in flight at once. It was easier for them to do that using JPEG because they (to their own admission) seem to have limited control over their encode pipeline.
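That "one frame in flight" behaviour is purely the loop structure, not a property of JPEG — a sketch with a simulated blocking fetch (all names hypothetical):

```python
import time

in_flight = 0
max_in_flight = 0

def fetch_frame():
    """Stand-in for an awaited HTTP GET of a screenshot endpoint;
    returns only once the whole image has arrived, like fetch() + blob()."""
    global in_flight, max_in_flight
    in_flight += 1
    max_in_flight = max(max_in_flight, in_flight)
    time.sleep(0.01)                 # simulate download time
    frame = b"\xff\xd8 ...jpeg bytes... \xff\xd9"
    in_flight -= 1
    return frame

# The polling loop from the article, in essence: the next request is only
# issued after the previous frame has fully downloaded, so at most one
# frame is ever in flight. On a slow network the loop simply completes
# fewer iterations per second -- backpressure for free, independent of
# the image format being fetched.
frames = [fetch_frame() for _ in range(5)]
```

The same one-outstanding-request discipline would work identically if each fetch returned an H.264 IDR frame instead of a JPEG.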
> have limited control over their encode pipeline.
Frustratingly this seems common in many video encoding technologies. The code is opaque, often has special kernel, GPU and hardware interfaces which are often closed source, and by the time you get to the user API (native or browser) it seems all knobs have been abstracted away and simple things like choosing which frame to use as a keyframe are impossible to do.
I had what I thought was a simple usecase for a video codec - I needed to encode two 30 frame videos as small as possible, and I knew the first 15 frames were common between the videos so I wouldn't need to encode that twice.
I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.
A 15 frame min and max GOP size would do the trick; then you'd get two 15-frame GOPs. Each GOP can be concatenated with another GOP with the same properties (resolution, format, etc.) as if they were independent streams. So there is actually a way to do this. This is how video splitting and joining without re-encoding works: at GOP boundaries.
In my case, bandwidth really mattered, so I wanted all one GOP.
Ended up making a bunch of patches to libx264 to do it, but the compute cost of all the encoding on CPU is crazy high. On the decode side (which runs on consumer devices), we just make the user decode the prefix many times.
> I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.
fork()? :-)
But most software, video codec or not, simply isn't written to serialize its state at arbitrary points. Why would it?
A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!
In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.
However this is not the case with video codecs - but this is just one of many examples of where the video codec landscape is limiting.
Another for example is that on the internet lots of videos have a 'poster frame' - often the first frame of the video. That frame for nearly all usecases ends up downloaded twice - once as a jpeg, and again inside the video content. There is no reasonable way to avoid that - but doing so would reduce the latency to play videos by quite a lot!
> A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!
No, they generally can't save their whole internal state to be resumed later, and definitely not in the document you were editing. For example, when you save a document in vim it doesn't store the mode you were in, or the keyboard macro step that was executing, or the search buffer, or anything like that.
> In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.
Serializable in principle, maybe. Actually serializable in the sense that the code contains a way to dump to a file and back, absolutely not. It's extremely rare for programs to expose a way to save and restore from a mid-state in the algorithm they're implementing.
> Another for example is that on the internet lots of videos have a 'poster frame' - often the first frame of the video. That frame for nearly all usecases ends up downloaded twice - once as a jpeg, and again inside the video content.
Actually, it's extremely common for a video thumbnail to contain extra edits such as overlayed text and other graphics that don't end up in the video itself. It's also very common for the thumbnail to not be the first frame in the video.
> Serializable in principle, maybe. Actually serializable in the sense that the code contains a way to dump to a file and back, absolutely not. It's extremely rare for programs to expose a way to save and restore from a mid-state in the algorithm they're implementing.
If you should ever look for an actual example; Cubemap, my video reflector (https://manpages.debian.org/testing/cubemap/cubemap.1.en.htm...), works like that. It supports both config change and binary upgrade by serializing its entire state down to a file and then re-execing itself.
It's very satisfying; you can have long-running HTTP connections and upgrade everything mid-flight without a hitch (serialization, exec and deserialization typically takes 20–30 ms or so). But it means that I can hardly use any libraries at all; I have to use a library for TLS setup (the actual bytes are sent through kTLS, but someone needs to do the asymmetric crypto and I'm not stupid enough to do that myself), but it was a pain to find one that could serialize its state. TLSe, which I use, does, but not if you're at certain points in the middle of the key exchange.
So yes, it's extremely rare.
Why not hand off the fd to the new process spawned as a child? That’s how a lot of professional 0 downtime upgrades work: spawn a process, hand off fd & state, exit.