Apple Releases Open Weights Video Model
starflow-v.github.io
427 points by vessenes 20 hours ago
Apple has a video understanding model too. I can't wait to find out what accessibility stuff they'll do with the models. As a blind person, AI has changed my life.
> As a blind person, AI has changed my life.
Something one doesn't see in news headlines. Happy to see this comment.
Like many others, I too would very much like to hear about this.
I taught our entry-level calculus course a few years ago and had two blind students in the class. The technology available for supporting them was abysmal then -- the toolchain for typesetting math for screen readers was unreliable (and anyway very slow), for braille was non-existent, and translating figures into braille involved sending material out to a vendor and waiting weeks. I would love to hear how we may better support our students in subjects like math, chemistry, physics, etc, that depend so much on visualization.
For a physical view on this see:
https://www.reddit.com/r/openscad/comments/1p6iv5y/christmas...
The creator, https://www.reddit.com/user/Mrblindguardian/ has asked for help a few times in the past (I provided feedback when I could), but hasn't needed to as often of late, presumably due to using one or more LLMs.
+1 and I would be curious to read and learn more about it.
A blind comedian / TV personality in the UK has just done a TV show on this subject - I haven't seen it, but here's a recent article about it: https://www.theguardian.com/tv-and-radio/2025/nov/23/chris-m...
Chris McCausland is great. A fair bit of his material _does_ reference his visual impairment, but it's genuinely witty and sharp, and it never feels like he's leaning on it for laughs/relying on sympathy.
He did a great skit with Lee Mack at the BAFTAs 2022[0], riffing on the autocue the speakers use for announcing awards.
Hilariously, he beat the other teams in the “Say What You See” round (yes, really) of last year’s Big Fat Quiz. No AI involved.
Haha that's great!
I'm not a fan of his (nothing against him, he's just not my cup of tea when it comes to comedy, and I've mostly not been interested in other stuff he's done), but the few times I have seen him as a guest on shows it's been clear that he's a generally clever person.
I remembered he was once a techie, and Wikipedia confirms that he (Chris McCausland) has a BSc Honours in Software Engineering.
If you want to see more on this topic, check out (google) the podcast I co-host called Accessibility and Gen. AI.
Honestly, that’s such a great example of how to share what you do on the interwebs. Right timing, helpful and on topic. Since I’ve listened to several episodes of the podcast, I can confirm it definitely delivers.
Same! @devinprater, have you written about your experiences? You have an eager audience...
What other accessibility features do you wish existed in video AI models? Real-time vs post-processing?
Mainly realtime processing. I play video games, and would love to play something like Legend of Zelda and just have the AI going, then ask it "read the menu options as I move between them," and it would speak each menu option as the cursor moves to it. Or when navigating a 3D environment, ask it to describe the surroundings, then ask it to tell me how to get to a place or object, and have it guide me there. That could be useful in real-world scenarios too.
Weird question, but have you ever tried text adventures? It seems like it's inherently the ideal option, if you can get your screen reader going.
> Something one doesn't see in news headlines.
I hope this wasn't a terrible pun
No pun intended but it's indeed an unfortunate choice of words on my part.
My blind friends have gotten used to it and no longer hear it as a literal "see". They would not feel offended by your usage.
One cool feature they added for deaf parents a few years ago was a notification when it detects a baby crying.
My wife is deaf, and we had one kid in 2023 and twins in 2025. There's been a noticeable improvement in baby cry detection! In 2023, the best we could find was a specialized device that cost over $1,000 and had all sorts of flakiness/issues. Today, the built-in detection on her (Android) phone + watch is better than that device, and a lot more convenient.
I also got a notification on my Apple Watch, while being away from the house, that the HomePod mini heard our fire alarm going off.
A call home let us know that our son had set it off learning to reverse-sear his steak.
I live across the street from a fire station. Thank you for your diligence, little HomePod Mini, but I'm turning your notifications off now.
Is that something you actually need AI for though? A device with a sound sensor and something that shines/vibrates a remote device when it detects sound above some threshold would be cheaper, faster at detection, more reliable, easier to maintain, and more.
But your solution costs money in addition to the phone they already own for other purposes. And multiple things can make loud noises in your environment besides babies; differentiating between a police siren going by outside and your baby crying is useful, especially if the baby slept through the siren.
The same arguments were said for blind people and the multitude of one-off devices that smartphones replaced, OCR to TTS, color detection, object detection in photos/camera feeds, detecting what denomination US bills are, analyzing what's on screen semantically vs what was provided as accessible text (if any was at all), etc. Sure, services for the blind would come by and help arrange outfits for people, and audiobook narrators or braille translator services existed, and standalone devices to detect money denominations were sold, but a phone can just do all of that now for much cheaper.
All of these accessibility AI/ML features run on-device, so the knee-jerk anti-AI crowd's chief complaints are mostly baseless anyways. And for the blind and the deaf, carrying all the potential extra devices with you everywhere is burdensome. The smartphone is a minimal and common social and physical burden.
> more reliable
I've worked on some audio/video alert systems. Basic threshold detectors produce a lot of false positives. It's common for parents to put white noise machines in the room to help the baby sleep. When you have a noise generating machine in the same room, you need more sophisticated detection.
False positives are the fastest way to frustrate users.
You are talking about a device of smart phone complexity. You need enough compute power to run a model that can distinguish noises. You need a TCP/IP stack and a wireless radio to communicate the information. At that point you have a smart phone. A simple sound threshold device would have too many false positives/negatives to be useful.
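To make that concrete, here's a minimal sketch (Python; `classify` is a hypothetical stand-in for an on-device audio model, not a real API) of why a bare energy threshold can't separate a white-noise machine from a cry:

    import numpy as np

    def rms(frame: np.ndarray) -> float:
        # Root-mean-square energy of one audio frame (samples in [-1, 1]).
        return float(np.sqrt(np.mean(np.square(frame))))

    def threshold_alerts(frames, threshold=0.1):
        # Naive detector: fires on ANY frame above the energy threshold.
        # A white-noise machine sitting just above `threshold` alerts all
        # night, while a quiet cry just below it is missed entirely.
        return [i for i, f in enumerate(frames) if rms(f) > threshold]

    def classifier_alerts(frames, classify):
        # What the phone actually needs: a model that labels the sound
        # rather than measuring its loudness. `classify` is assumed to
        # return labels like "baby_crying" or "white_noise" per frame.
        return [i for i, f in enumerate(frames)
                if classify(f) == "baby_crying"]

The loudness of a cry and of a noise machine overlap, so no threshold value separates them; only a model of what the sound is can.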
>Is that something you actually need AI for though?
Need? Probably not. I bet it helps though (false positives, etc.)
>would be cheaper, faster detection, more reliable, easier to maintain, and more.
Cheaper than the phone I already own? Easier to maintain than the phone that I don't need to do maintenance on?
From a fun hacking perspective, a different sensor & device is cool. But I don't think it's any of the things you mentioned for the majority of people.
> As a blind person, AI has changed my life.
I know this is a low quality comment, but I'm genuinely happy for you.
Can you share some ways AI has changed your life?
I guess that auto-generated audio descriptions for (almost?) any video you want is a very, very nice feature for a blind person.
My two cents, this seems like a case where it’s better to wait for the person’s response instead of guessing.
Fair enough. Anyway I wasn't trying to say what actually changed GP's life, I was just expressing my opinion on what video models could potentially bring as an improvement to a blind person.
My two cents, this seems like a comment it should be up to the OP to make instead of virtue signaling.
Y'all could have gotten a serviceable answer about this topic out of ChatGPT. The 2025 version of "let me google that for you".
> Can you share some ways AI has changed your life?
A question directed at the GP, directly asking about their life, and pointing this out is somehow virtue signalling. OK.
You can safely assume that anyone who uses “virtue signaling” unironically has nothing substantive to say.
>[People who call out performative bullshit should be ignored because they’re totally wrong and I totally mean it.]
Maybe you’re just being defensive? I’m sure he didn’t mean an attack at you personally.
It’s presumptuous of you to assume I was offended.
Accusing someone of “virtue signaling” is itself virtue signaling, just for a different in-group to use as a thought terminating cliche. It has been for decades. “Performative bullshit” is a great way to put it, just not in the way you intended.
If the OP had a substantive point to make they would have made it instead of using vague ad hominem that’s so 2008 it could be the opening track on a Best of Glenn Beck album (that’s roughly when I remember “virtue signaling” becoming a cliche).
...you know, people can have opinions about the best way to behave outside of self-aggrandizement, even if your brain can't grasp this concept.
From the list of virtues, which one was this signaling?
I’d guess: Respect, consideration, authenticity, fairness.
Or should I too perhaps wait for OP to respond.
That list needs updating. Lots of things become virtuous depending on the scenario. During Covid, fear was a virtue. You had to prove how scared you were of it, with all the masks you wore, because it made you "one of the good ones" to be fearful.
[flagged]
The two cents are not literally monetary - your opinion is literally the two cents. You're contributing your understanding to the shared pot of understanding and that's represented by putting money into the pot, showing you have skin in the game. It's contributing to a larger body of knowledge by putting your small piece in - the phrases you suggest don't have that context behind them and in my opinion are worse for it. The beauty of the phrase is because the two cents are your opinion, everyone has enough, because everyone can have an opinion.
The lens through which you're analyzing the phrase is coloring how you see it negatively, and the one I'm using is doing the opposite. There is no need to change the phrase, just how it's viewed, I think.
People put too much weight on words. The first lesson I learned on the internet is that words are harmless. They might be deeply painful for some, but because people like myself put no weight behind them, we don't even have a concept of keeping such things in mind; it never crosses our minds, and it's really difficult to see it any other way even if we try, since it just seems like a bad joke.
And when I say "it never crosses our minds" I really mean it: there are zero thoughts between thinking about a message and having it show up in a text box.
A really good example is slurs. A lot of people have to do a double take, but zero extra neurons fire when I read them. I guess early internet culture is to blame, since all kinds of language were completely uncensored and it was very common to run into very hostile people/content.
> The metaphor of assigning a literal monetary value to one's opinion reinforces the idea that contributions are transactional and that their "worth" is measured through an economic lens. That framing can be exclusionary, especially for people who have been historically marginalized by economic systems. It subtly normalizes a worldview where only those with enough "currency" - social, financial, or otherwise - deserve to be heard.
No. It’s acknowledging that perhaps one’s opinion may not be as useful as somebody else’s in that moment. Which is often true!
Your first and third paragraphs are true, but they don’t apply to every bloody phrase.
Guessing that being able to hear a description of what the camera is seeing (basically a special case of a video) in any circumstances is indeed life changing if you're blind...? Take a picture through the window and ask what's the commotion? Door closed outside that's normally open: take a picture, tell me if there's a sign on it? Etc.
Not the GP, but I'm currently reading a web novel with a card game where the author didn't include alt text in the card images. I contacted them about it and they started, but in the meantime AI was a big help. The same goes for all kinds of other images on the internet when they're significant to understanding the surrounding text. It also means a better search experience when Google, DDG, and the like make finding answers difficult. I might use smart glasses for better outdoor orientation, though a good solution might take some time. Phone camera plus AI is also situationally useful.
As a (web app) developer I'm never quite sure what to put in alt. Figured you might have some advice here?
> As a (web app) developer I'm never quite sure what to put in alt.
Are you making these five mistakes when writing alt text? [1] Images tutorial [2] Alternative Text [3]
[1]: https://www.a11yproject.com/posts/are-you-making-these-five-...
I'm gonna flip this around... have you tried pasting the image (and the relevant paragraph of text) and asking ChatGPT (or another LLM) to generate the alt text for the image and see what it produces?
For example... https://chatgpt.com/share/692f1578-2bcc-8011-ac8f-a57f2ab6a7...
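If you want to script that, here's one way it might look (a minimal sketch with the OpenAI Python SDK; the model name and prompt are illustrative, and the output is a first draft for a human to review, not finished alt text):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def draft_alt_text(image_url: str, context: str) -> str:
        # Ask for one-sentence alt text, given the image and the
        # paragraph that surrounds it on the page.
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative; any vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Write concise one-sentence alt text for this "
                             "image as used in the following paragraph:\n"
                             + context},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
        return response.choices[0].message.content

You still have to check the draft against what the image is actually doing on the page, but it beats a blank alt attribute.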
> I'm gonna flip this around... have you tried pasting the image (and the relevant paragraph of text) and asking ChatGPT (or another LLM) to generate the alt text for the image and see what it produces?
There's a great app by an indie developer that uses ML to identify objects in images. Totally scriptable via JavaScript, shell script and AppleScript. macOS only.
Could be 10, 100 or 1,000 images [1].
The question to ask is: what does a sighted person learn after looking at the image? The answer is the alt text. E.g. if the image is a floppy disk, maybe you communicate that this is the save button. If it shows a cat sleeping on the windowsill, the alt text is, yep: "my cat looking cute while sleeping on the windowsill".
I really like how you framed this: the alt text should be the takeaway, the learning that needs to happen, not a recitation of the image. Where I've often had issues is with things like business charts and illustrations rather than cute cat photos.
"A meaningless image of a chart, from which nevertheless emanates a feeling of stonks going up"
It might be that you’re not perfectly clear on what exactly you’re trying to convey with the image and why it’s there.
What would you put for this? "Graph of All-Transactions House Price Index for the United States 1975-2025"?
Charts are one I've wondered about, do I need to try to describe the trend of the data, or provide several conclusions that a person seeing the chart might draw?
Just saying "It's a chart" doesn't feel like it'd be useful to someone who can't see the chart. But if the other text on the page talks about the chart, then maybe identifying it as the chart is enough?