Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed
static.stepfun.com
130 points by kristianp 11 hours ago
> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability
I don't know anything about TerminalBench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks.
51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.
Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, meaning 51% is about ⅔ of SOTA.
That score is on par with Gemini 3 Flash, but from scrolling through the results, these scores look much more affected by the agent used than by the model.
Gemini 3 Flash is pure rubbish. It can easily get into loop mode and spout output no better than a Markov chain, repeating it over and over.
TerminalBench is about the worst-named benchmark. It has almost nothing to do with the terminal; it's mostly about the syntax of random tools. And most tasks stop being agentic if the model has simply memorized some random tool's command-line flags.
This is probably one of the most underrated LLM releases in the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I was able to run locally, including Minimax 2.5 and GLM-4.7, though I was only able to run GLM with a 2-bit quant. Some highlights:
- Very context efficient: SWA by default; on a 128GB Mac I can run the full 256k context or two 128k context streams (rough llama-server sketch below).
- Good speeds on Macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp. These speeds also degrade very slowly as context increases: at 100k prefill, it still does 20 t/s tg and 129 t/s pp.
- Trained for agentic coding. I think it is trained to be compatible with Claude Code, but it works fine with other CLI harnesses except for Codex (due to the patch edit tool, which can confuse it).
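For reference, here's roughly what that setup looks like as a llama-server invocation. The flag values are my guesses for a ~128GB machine and the GGUF filename is illustrative; adjust the path and offload settings to your hardware:

  # 256k total context split across two parallel slots (128k each);
  # -ngl 99 offloads all layers to the GPU/Metal backend.
  llama-server -m Step-3.5-Flash-IQ4_NL.gguf \
    -c 262144 -np 2 -ngl 99 --jinja \
    --host 127.0.0.1 --port 8080

Any OpenAI-compatible CLI harness can then be pointed at http://127.0.0.1:8080/v1.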
This is the first local LLM in the 200B parameter range that I find usable with a CLI harness. I've been using it a lot with pi.dev, and it has been the best experience I've had with a local LLM doing agentic coding.
There are a few drawbacks though:
- It can generate some very long reasoning chains.
- Current release has a bug where sometimes it goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
Hopefully StepFun will do a new release which addresses these issues.
BTW StepFun seems to be the same company that released ACEStep (very good music generation model). At least StepFun is mentioned in ComfyUI docs https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1
It's nice to see more focus on efficiency. All the recent model releases have come with massive jumps on certain benchmarks, but when you dig into it, it's almost always paired with a massive increase in token usage to achieve those results (ahem, Google Deep Think, ahem). For AI to truly be transformational, it needs to solve the electricity problem.
And not just token usage, expensive token usage; when it comes to tokens/joule not all tokens are equal. Efficient use of MoE architectures does have an impact on tokens/joule and tokens/sec.
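Toy numbers (made up, not measurements of any real model) just to show the unit: since 1 W = 1 J/s, tokens/joule is throughput divided by average power draw:

  # hypothetical dense model: 25 tok/s at 700 W
  echo "scale=3; 25 / 700" | bc   # ≈ 0.035 tok/J
  # hypothetical sparse MoE at the same power: 90 tok/s at 700 W
  echo "scale=3; 90 / 700" | bc   # ≈ 0.128 tok/J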
I like the intelligence-per-watt and intelligence-per-joule framing in https://arxiv.org/abs/2511.07885 - it feels like a very useful measure for thinking about long-term sustainable variants of AI build-outs.
Hallucinates like crazy; use with caution. Tested it with a simple "Find me championship decks for X Pokemon" and "How does Y deck work". Opus 4.6, DeepSeek, and Kimi all performed well, as expected.
I mean, is it possible the latter models used search? Not saying StepFun's perfect (it is not). Gemini especially, and unsurprisingly, uses search a lot, and it is ridiculously fast, too.
Recent model released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.
Edit: there are 4-bit quants that can be run on a 128GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.
[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
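Rough, assumption-heavy arithmetic for why that low active-parameter count matters for local decode speed. My own guesses: ~0.55 bytes/weight for a 4-bit quant (roughly consistent with the ~113-116 GB file sizes below) and ~800 GB/s memory bandwidth for an M-series Ultra; KV cache reads and overhead ignored:

  # per-token weight reads ≈ 11B active params * 0.55 bytes ≈ 6 GB,
  # so bandwidth alone caps decode at roughly 800 GB/s / 6 GB per token
  echo "scale=0; 800 / (11 * 0.55)" | bc   # ≈ 132 tok/s ceiling

Real numbers land well below that once attention and scheduling overhead are counted, but it's why 11B active params decode so much faster than a dense model of the same total size.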
Q4_K_S @ 116 GB
IQ4_NL @ 112 GB
Q4_0 @ 113 GB
Which of these would be technically better?
[1] https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-G...
> Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.
Does this really mean anything? I, for example, tend to ignore certain benchmarks that are focused on agentic tasks because that is not my use case. Instruction following, long-context reasoning, and low hallucination rates carry more weight for me.
SWE-bench Verified is nice, but we need better SWE benchmarks. Making a fair benchmark is a lot of work, and a lot of money is needed to run it continuously.
Most of "live" benchmarks are not running enough with recent models to give you a good picture of which models win.
The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.
Help us out with Terminal Bench 3.0!
https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...
I've been using this model for a while, and it's very fast. It spends some time thinking but makes fewer calls. For example, yesterday I asked the agent to find the Gemini quota limit for their API, and it took 27 seconds and just 2 calls; Opus 4.6 took 33 seconds with 5 calls and less thinking.
Number of params isn’t really the relevant metric imo. Top models don’t support local inference. More relevant is tokens per dollar or per second.
It's an open-source model; why wouldn't it be relevant for people who want to self-host?
Holy moly, I made a simple coding prompt and the amount of reasoning output could fill a small book.
> create a single html file with a voxel car that drives in a circle.
Compared to GLM 4.7 / 5 and Kimi 2.5 it took a while. The output was fast, but because it wrote so much I had to wait longer. Also, the output was... more bare-bones compared to the others.
That's been my experience as well. Huge amounts of reasoning. The model itself is good, but even if you get twice as many tokens per second as with another model, the added amount of reasoning may make it slower in the end.
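Made-up numbers to illustrate: a model that decodes twice as fast can still finish later if it burns far more reasoning tokens before the answer:

  # model A: 500 reasoning + 500 answer tokens at 40 tok/s
  echo "scale=1; (500 + 500) / 40" | bc    # 25.0 s
  # model B: 4000 reasoning + 500 answer tokens at 80 tok/s
  echo "scale=1; (500 + 4000) / 80" | bc   # 56.2 s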
That reverse x axis sure is confusing.
Does it pass the carwash test?
Yes, it did well. It also did well on some other word problems. Reasoning seems good, but maybe not a great code model.
Interesting.
Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?
You would be surprised to see how far behind the times the Japanese IT industry is (a decade at least, IMO). There is only a very limited startup culture here (in size, talent pool, and business ideas), there is no real risk-taking venture capital market (maybe Masayoshi Son is the exception, but he tends to invest mostly in the US), and most software companies use very, very outdated management practices. On top of that, most software development has been outsourced to India, Vietnam, China, etc., so management sees no value in software talent... Software engineers' social recognition here is mostly on the level of accountants'. Under such circumstances, Japan will never have a chance to contribute to AI meaningfully (other than niche academic research).
1. The US and China are the two biggest economies by GDP. 2. The US is the default destination for worldwide investors (because of historically good returns). China has a huge state economy, and the state can direct investment into this area.
Have you heard of Mistral? I would consider Mistral major, albeit not frontier.
The Koreans have released some good models lately. And Mistral also releases open-weights models that aren't too shabby.
Have you heard of Pleias? Their SLM Baguettotron is blazingly fast and surprisingly good at reasoning (but it's not programming-oriented).
Works impressively well with pi.dev minimal agent.
So who exactly is StepFun? What is their business (how do they make money)? Each time I click “About Stepfun” somewhere on their website, it sends me to a generic landing page in a loop.
They've been around a couple years. This is the first model that has really broken into the anglosphere.
Keep tabs on aihubmix, the Chinese OpenRouter, if you want to stay on top of the latest models. They track the likes of Baichuan, Doubao, BAAI (Beijing Academy), Meituan, 01.AI (Yi), Xiaomi, etc.
Much larger Chinese coverage than OpenRouter.
> This is the first model that has really broken into the anglosphere.
Before Step 3.5 Flash, I'd been hearing a lot about ACEStep as the only open-weights competitor to Suno.
>first model that has really broken into the anglosphere.
Do you know of a couple of interesting ones that haven't yet?
Doubao (ByteDance) Seed models are interesting.
Keep your eye on Baidu's Ernie https://ernie.baidu.com/
Artificial Analysis is generally on top of everything:
https://artificialanalysis.ai/leaderboards/models
Those two are really the new players
Nanbeige, which they haven't benchmarked, just put out a shockingly good 3B model: https://huggingface.co/Nanbeige - specifically https://huggingface.co/Nanbeige/Nanbeige4.1-3B
You have to tweak the hyperparameters like they say, but I'm getting quality output, commensurate with maybe a 32B model, in exchange for a huge thinking lag.
It's the new LFM 2.5
Never heard of Nanbeige, thanks for sharing. "Good" is subjective though; for which tasks can I use it, and where should I avoid it?
It's a 3B model. Fire it up. If you have Ollama, just do this:
ollama create nanbeige-custom -f <(curl day50.dev/Nanbeige4.1-params.Modelfile)
That has the hyperparameters already in there. Then you can try it out
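e.g., using the model name from the create step above:

  ollama run nanbeige-custom "compare rust and go with code samples"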
It's taking up like 2GB of RAM on my Mac mini.
my test query is always "compare rust and go with code samples". I'm telling you, the thinking token count is ... high...
Here's what I got https://day50.dev/rust_v_go.md
I'm going to go try this on a pi. I'll report back.
They seem to be the same company that released the ACEStep music generation model: https://acestep.io/
Though the only mention I found was in ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1
Thanks. Do they sell any of these products today or is it more like research? I am not able to find anything relating to pricing on their website. Just a chatbot.
Pricing can be found on their docs website: https://platform.stepfun.ai/docs/en/pricing/details
What country is behind this one?
Step 3.5 Flash was made by Chinese company StepFun - https://en.wikipedia.org/wiki/StepFun