Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

the-decoder.com

84 points by thm 3 days ago


djmips - 3 hours ago

Does anyone else worry about this technology being used for Big Brother-type surveillance?

visioninmyblood - 3 hours ago

I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it (a sketch of that step after the links). Have to see how the multimodal space progresses.

link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52

Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo
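For the crop-into-a-segment step, a minimal sketch of what that could look like outside any particular infra, just plain ffmpeg via subprocess (the paths and timestamps are made up, not from the linked results):

    import subprocess

    def crop_segment(src, start, end, dst):
        # Stream copy without re-encoding: fast, but cut points snap to keyframes.
        subprocess.run(
            ["ffmpeg", "-ss", start, "-to", end, "-i", src, "-c", "copy", dst],
            check=True,
        )

    # e.g. pull the 01:10:00-01:12:30 window out for a closer pass
    crop_segment("talk.mp4", "01:10:00", "01:12:30", "segment.mp4")

The cropped file can then be fed back to the model for a second, more detailed pass over just that window.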

eurekin - 2 hours ago

Insane if true... now I wonder if I could use it to go through some old dance routine video catalogue to recognize and write out individual move lists.

thot_experiment - 3 days ago

anyone have a tl;dr on the best way to get the video comprehension stuff going? i use qwen-30b-vl all the time locally as my goto model because it's just so insanely fast. the vision comprehension works great and i use it for OCR and classification all the time, so i'm curious to mess with the video stuff.
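A minimal sketch of the standard local route, assuming the published Qwen2.5-VL transformers + qwen-vl-utils recipe carries over to the 3-series (the model id and class name here are stand-ins; swap in whatever your transformers version actually ships for Qwen3-VL):

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # stand-in; use your Qwen3-VL checkpoint
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            # fps controls how densely frames are sampled from the clip
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe what happens in this clip."},
        ],
    }]

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(
        out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )[0])

The fps knob is the main lever for long videos: lower it and more wall-clock time fits in the context window, at the cost of missing fast events.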

moralestapia - 3 hours ago

To me, this already qualifies as some sort of ASI.

spwa4 - 3 hours ago

It's so weird how that works with transformers.

Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction-tuned LLM, usually a small one because it's students doing the work) with OCR tokens beats just about every dedicated OCR network out there.

And it's not just OCR: describing images, bounding boxes, audio (both ASR and TTS) all work better that way. Many research papers now are really only about how to encode image/audio/video well enough to feed into a Llama or Qwen model.
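To make the "encode it and feed it in" point concrete, a toy sketch of the usual projector pattern, with made-up dims standing in for a ViT encoder and a 7B-class LLM (not any specific paper's numbers):

    import torch
    import torch.nn as nn

    class VisionToLLMProjector(nn.Module):
        # The common recipe: a (usually frozen) vision encoder emits patch
        # features, a small MLP projects them into the LLM's embedding space,
        # and the result is spliced in front of the text token embeddings.
        def __init__(self, vision_dim=1024, llm_dim=4096):
            super().__init__()
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, patch_feats, text_embeds):
            # patch_feats: (batch, n_patches, vision_dim) from the vision encoder
            # text_embeds: (batch, n_text, llm_dim) from the LLM's embedding table
            vis_embeds = self.proj(patch_feats)
            return torch.cat([vis_embeds, text_embeds], dim=1)

    fused = VisionToLLMProjector()(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
    print(fused.shape)  # torch.Size([1, 288, 4096])

The LLM backbone then just sees a longer embedding sequence; the finetuning mostly teaches it what those prepended "image tokens" mean.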