Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

84 points by thm 3 days ago

Does anyone else worry about this technology used for Big Brother type surveillance?

reactordev - 3 hours ago

Where have you been the last decade? It’s already in use, or models like it, by companies selling access to The State
https://deflock.me
Not to mention cloud platforms that collect evidence and process it with all the models and store that information for searching…
https://www.revir.ai
- eurekin - 2 hours ago
  
  No mention of palantir?
  - bilbo0s - 2 hours ago
    
    >It’s already in use, or models like it, by companies selling access to The State
    Doesn't that pretty much cover Palantir as well?
- mptest - 2 hours ago
  
  or if you prefer your depression in book format: surveillance capitalism by zuboff, or pegasus: a spy in your pocket by laurent richard
basilgohar - an hour ago

How do you think this tech was developed in the first place? It's probably trained and used in the surveillance bid for a decade before it comes to consumers, and this probably isn't the SotA stuff that governments have access to. We're probably 5-10 years behind what's on the cutting edge.
protocolture - an hour ago

We got Facial Rec and LPR first, those are more dangerous for surveillance.
g-mork - 2 hours ago

warmly encourage you avoid reading the header files of the dahua camera SDK
ants_everywhere - 25 minutes ago

Big Brother is a reference to George Orwell's critique of Communism in Nineteen Eighty-Four.
Qwen is a video model trained by a Communist government, or technically by a company with very close ties to the Chinese government. The Chinese government also has laws requiring AI be used to further the political goals of China in particular and authoritarian socialism in general.
In the light of all this, I think it's reasonable to conclude that this technology will be used for Big Brother type surveillance and quite possible that it was created explicitly for that purpose.

I was using this for video understanding with inference from vlm.run infra. It has definitely outperformed Gemini, which is generally much better than OpenAI or Claude on videos. The detailed extraction is pretty good. With agents you can also crop into a segment and do more operations on it. Have to see how the multimodal space progresses:

link to results: https://chat.vlm.run/c/82a33ebb-65f9-40f3-9691-bc674ef28b52

Quick demo: https://www.youtube.com/watch?v=78ErDBuqBEo

colechristensen - an hour ago

I found it pretty funny how bad Claude was at cropping an image. It was a cute little character with some text off to the side on a white background, all very clean cartoon vibes and it COULD NOT just select the character. I pursued it for 20 minutes because I thought it was funny. Of course it was 45 seconds to do it myself.
A lot of my side projects involve UIs and almost all of my problems with getting LLMs to write them for me involve "The UI isn't doing what you say it's doing" and struggling to get A) a reliable way to get it to look at the UI so it can continue its loop and B) getting it to understand what it's looking at well enough to do something about it
- visioninmyblood - 17 minutes ago
  
  I agree: Claude, ChatGPT, and even Gemini do a poor job of detecting and cropping into a region, which is one of the simplest tasks. Qwen is also great at summarization but not at solving simple vision tasks like cropping, segmentation, and detection. Here is an example where we compared Claude, Gemini, ChatGPT, and other frontier models on simple (and complicated) visual tasks: https://chat.vlm.run/showdown#:~:text=Crop%20into%20the%20cl...

eurekin - 2 hours ago

Insane if true... now I wonder if I could use it to go through some old dance routine video catalogue to recognize and write individual move lists

thot_experiment - 3 days ago

anyone have a tl;dr for me on what the best way to get the video comprehension stuff going is? i use qwen-30b-vl all the time locally as my goto model because it's just so insanely fast, curious to mess with the video stuff, the vision comprehension works great and i use it for OCR and classification all the time

xrd - 3 hours ago

How much VRAM do you need for local usage may I ask?

moralestapia - 3 hours ago

To me, this qualifies as some sort of ASI already.

spwa4 - 3 hours ago

It's so weird how that works with transformers.

Finetuning an LLM "backbone" (if I understand correctly: a fully trained but not instruction tuned LLM, usually small because students) with OCR tokens bests just about every OCR network out there.

And it's not just OCR. Describing images. Bounding boxes. Audio, both ASR and TTS, all works better that way. Now many research papers are only really about how to encode image/audio/video to feed it into a Llama or Qwen model.

zmmmmm - 2 hours ago

It is fascinating. Vision language models are unreasonably good compared to dedicated OCR and even the language tasks to some extent.
My take is it fits into the general concept that generalist models have significant advantages because so much more latent structure maps across domains than we expect. People still talk about fine tuning dedicated models being effective but my personal experience is it's still always better to use a larger generalist model than a smaller fine tuned one.
- kgeist - an hour ago
  
  >People still talk about fine tuning dedicated models being effective
  >it's still always better to use a larger generalist model than a smaller fine tuned one
  Smaller fine-tuned models are still a good fit if they need to run on-premises cheaply and are already good enough. Isn't it their main use case?
  - bangaladore - 41 minutes ago
    
    Latency and size. Otherwise pretty much useless.
- jepj57 - an hour ago
  
  Now apply that thinking to human-based neural nets...