Phi 4 available on Ollama
ollama.com
271 points by eadz 4 days ago
Over the holidays, we published a post[1] on using high-precision few-shot examples to get `gpt-4o-mini` to perform similarly to `gpt-4o`. I just re-ran that same experiment, but swapped out `gpt-4o-mini` for `phi-4`.
`phi-4` really blew me away in terms of learning from few-shots. It measured as being 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!
By contrast, with few-shots it performs as well as `gpt-4o-mini` (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).
[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
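For anyone wanting to reproduce the setup, the mechanical part is just assembling a chat payload with the few-shot pairs ahead of the new task. A rough sketch, where the task, example pairs, and helper are hypothetical stand-ins (only the message layout is the point):

```python
# Hypothetical sketch of a high-precision few-shot payload; the moderation
# task and example pairs are made up for illustration.

def build_messages(system_prompt, few_shots, task_input):
    """System prompt, then input/output example pairs, then the new task."""
    messages = [{"role": "system", "content": system_prompt}]
    for shot in few_shots:
        messages.append({"role": "user", "content": shot["input"]})
        messages.append({"role": "assistant", "content": shot["output"]})
    messages.append({"role": "user", "content": task_input})
    return messages

few_shots = [
    {"input": "Item: 'arrived broken, want refund'", "output": "label: complaint"},
    {"input": "Item: 'love it, ten stars'", "output": "label: praise"},
]
messages = build_messages(
    "Classify each item per the moderation SOP.",
    few_shots,
    "Item: 'box was crushed in transit'",
)

# With a local Ollama server, the same payload can be sent to the model,
# e.g. ollama.chat(model="phi4", messages=messages) with the Python client;
# swapping the model tag is all it takes to compare models.
```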
I like the direction, but have a pretty different experience in practice. This spans legal analytics, social media analytics, code synthesis, news analysis, cyber security LLMs, etc:
1. The only absolute quality metric I saw in that blog post, afaict, was expert agreement... at 90%. All of our customers would fire us at that level across all of the different B2B domains we work in. I'm surprised 90% is considered acceptable quality in a paying business context like retail.
2. Gpt-4o-mini is great. I find we can get, for these kind of simple tasks you describe, gpt-4o-mini to achieve about 95-98% agreement with gpt-4o by iteratively manually improving prompts over increasingly large synthetic evals. Given data and a good dev, we do this basically same-day for a lot of simple tasks, which is astounding.
I do expect automatic prompt optimizers to win here long-term, and I keep hopefully revisiting dspy et al. For now, they fall short of standard prompt engineering. Likewise, I do believe in example learning over time for areas like personalization... but doing semantic-search recall of high-rated answers was a V1 thing we had to rethink due to too many issues.
re: 90% – this particular case is a fairly subjective and creative task, where humans (and the LLM) are asked to follow a 22 page SOP. They've had a team of humans doing the task for 9 years, with exceptionally high variance in performance. The blended performance of the human team is meaningfully below this 90% threshold (~76%) – which speaks to the difficulty of the task.
It's, admittedly, a tough task to measure objectively though, in that it's like a code review. If a Principal Engineer pointed out 20 deficiencies in a code change and another Principal Engineer pointed out 18 of the same 20 things, but also pointed out 3 other things that the first reviewer didn't, it doesn't necessarily mean either review is wrong – they just meaningfully deviate from each other.
In this case, we chose an expert that we treat as an objective "source of truth".
re: simple tasks – We run hundreds of thousands of tasks every month with more-or-less deterministic behavior (in that, we'll reliably do it correctly a million out of a million times). We chose a particularly challenging task for the case-study though.
re: in a paying business context – FWIW, most industries are filled with humans doing tasks where the rate of perfection is far below 90%.
I'm more confused now. If this is a tough and high-value task, we would not use gpt-4o-mini on its own, eg, add more steps like a verifier & retry, or just do gpt-4o to begin with, and would more seriously consider fine-tuning in addition to the prompt engineering. The blog argued against that, but maybe I read too quickly.
And agreed, people expect $ they invest into computer systems to do much better than their bad & avg employees. AI systems get the added challenge where they must do ~100% on what non-AI rules would catch ("why are you using AI?") + extra lift from AI ("what did this add?"). We generally get evaluated on matching experts (low bar), and exceeding them (high bar). Comparing to average staff is, frustratingly, a breakout.
Each scenario is different, obviously.
One point of confusion might be that this is a tough but relatively low-value task (on a per-unit basis). The budget per item moderated is measured in small double-digit cents, but there's hundreds of thousands of items regularly being ingested.
FWIW – across all of these, we already do automated prompt rewriting, self-reflection, verification, and a suite of other things that help maximize reliability / quality, but those tokens add up quickly and being able to dynamically switch over to a smaller model without degrading performance improves margin substantially.
Fine-tuning is a non-starter for a number of reasons, but that's a much longer post.
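The "verify, then dynamically switch to a smaller model" idea above can be sketched as a simple escalation loop. This is a generic sketch, not the team's actual pipeline; the model names, `generate`, and `verify` are hypothetical stand-ins:

```python
# Hypothetical sketch: try the cheapest model first, escalate only when
# the verifier rejects the draft. This is what lets a smaller model carry
# most of the volume without degrading quality.

def run_task(task, generate, verify, models=("phi-4", "gpt-4o")):
    """Route a task through models from cheapest to most capable."""
    for model in models:
        draft = generate(model, task)
        if verify(task, draft):
            return model, draft
    # Every model failed verification: keep the largest model's answer.
    return models[-1], draft
```

In practice `verify` would itself be an LLM call (self-reflection, checklist audit, etc.), which is exactly where the token costs the comment mentions add up.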
I feel like using LLMs is going to be a skill to have, similar to the ability to google or type, since they can give good answers pretty reliably but bad answers when you don't know the subject matter.
Agreed, and that's where teams like the OP come in
OpenAI does great at training for general tasks, and we should not be disappointed when specialized tasks fail. Interestingly, openai advertises increasingly many subjects they are special casing like math, code, & law, and so holding them to standards is fair there IMO.
For specialized contexts openai doesn't eval on, these merit hiring consultants / product to add the last-mile LLM data & tuning for the specific task. And at least in my experience, people paying money for AI experts & tech expect expert-level performance to be met, and ultimately, exceeded..
What's your loop for prompt engineering with GPT-4o? Do you feed the meta-prompter the misclassified examples? Also does the evaluation drive the synthetic data production almost like boosting?
'it varies', b/c we do everything from an interactive analytics chat agent (louie.ai UI) to data-intensive continuous monitoring (louie.ai pipelines) to one-off customer assists like $B court cases
1. Common themes in our development-time loop:
* We don't do synthetic data. We do real data or anonymized data. When we lack data, we go and get some. That may mean paying people, doing it ourselves, setting up simulation environments, etc.
* We start with synthetic judges, esp for scale tasks that are simple and thus considering smaller models like gpt-4o-mini (the topic here). Before we worry about expert agreement, we worry about gpt-4o agreement, and make evals that cover concerns like sample size and class imbalance...
* ... When the task is high value, e.g., tied closely to a paying customer deliverable or core product workflow, we invest more in expert evals, making calls on how many experts and of what caliber. Informally, we've learned that several of our teammates, despite being good at what they do, can be lousy experts, while others are known for precision even if they're not data people (ex: our field staff can be great!). Likewise, we hire subject matter experts as full-timers (ex: former europol/fbi equivalents!), source them as contractors, and partner with our customers here.
* After a year+ of prompt engineering with different tasks, models, data, and prompt styles, there's a lot of rote tricks & standard practices we know. Most are 'static' -- you can audit a prompt for gotchas & top examples to fill in -- and a smaller number are like in the OP's suggestion of dynamic prompts where we include elements like RAG.
On the last point, it seems incredibly automatable, so I keep trying tools. I've found automatic prompt optimizers like dspy disappointing in being unable to match what our prompt engineers can do here: they did not do better than prompts we wrote as experts with bare-bones iteration, and leaning into the tools failed to get noticeable lift. I don't think this is inherent; they're probably eval'ing against people we would consider trainees. Ex: I see what Stanford medical fellows + PhDs are doing for their genai publications, and they would probably benefit from dspy if it were easier, but again, we would classify them as 'interns' wrt the quality of prompt engineering I see them doing behind the scenes. I'm optimistic that by 2026, tools here will be useful for skilled AI engineers too; they're just not there yet.
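The small-model-vs-judge agreement evals described in point 1 reduce to something like the following. This is a generic sketch, not the team's actual harness; the Wilson interval is one standard way to keep sample size honest:

```python
import math

def agreement_rate(small_labels, judge_labels, z=1.96):
    """Fraction of items where the small model matches the judge model
    (e.g. gpt-4o), with a 95% Wilson interval so small eval sets
    aren't over-read."""
    n = len(small_labels)
    agree = sum(a == b for a, b in zip(small_labels, judge_labels))
    p = agree / n
    # Wilson score interval: better behaved than p +/- z*sqrt(p(1-p)/n)
    # for small n or extreme p.
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, (center - half, center + half)
```

Handling class imbalance then just means computing this per label slice before trusting the headline number.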
2. It's a lot more murky when we get into online+active learning loops for LLMs & agentic pipelines.
E.g., louie.ai works with live operational databases, where there is a lot wrt people + systems you can learn from, and issues like databases changing, differences in role & expertise, data privacy, adversarial data, and even workflows changing. Another area we deal with is data streams where the physical realities they describe change (questions + answers about logs, news, social, etc.).
IMO these are a lot harder and one of the areas a lot of our 2025 energy is going. Conversely, 'automatic prompt engineering' seems like something PhDs can make big strides in a vacuum...
Thanks! I love your focus on evaluation, it's missing in a lot of LLM products. I worked in the medical field and we valued model validation with similar importance. Our processes sound similar, too. One difference is that our customers still saw utility in models with much lower F1 than 90%. Rare events are hard to predict.
This is really nice. I loved the detailed process and I'm definitely gonna use it. One nit though: I didn't understand what the graphs mean, maybe you should add the axes names.
Thanks! Great suggestion for improving the graphs – I just updated the post with axis labels.
Have you also tried using the large model as the FSKD model?
We have, and it works great! We currently do this in production, though we use it to help us optimize for consistency between task executions (vs the linked post, which is about improving the capabilities of a model).
Phrased differently, when a task has many valid and correct conclusions, this technique lets the LLM see "How did I do similar tasks before?" and it'll tend to solve new tasks by making decisions similar to those it made for previous, similar tasks.
Two things to note:
- You'll typically still want to have some small epsilon where you choose to run the task without few-shots. This will help prevent mistakes from propagating forward indefinitely.
- You can have humans correct historical examples, and use their feedback to improve the large model dynamically in real-time. This is basically FSKD where the human is the "large model" and the large foundation model is the "small model".
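Both notes can be folded into the example-selection step. A minimal sketch, assuming `memory` is already sorted by similarity to the current task and contains any human-corrected outputs (all names here are hypothetical):

```python
import random

def select_few_shots(memory, epsilon=0.05, k=3):
    """With probability epsilon, run with no few-shots so past mistakes
    can't propagate forward indefinitely; otherwise reuse the k most
    similar prior (optionally human-corrected) examples."""
    if random.random() < epsilon:
        return []        # exploration: a fresh, example-free run
    return memory[:k]    # exploitation: stay consistent with past decisions
```

The epsilon runs double as a drift check: if example-free outputs start disagreeing with the few-shot ones, the memory has probably accumulated a bad example.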
Is anyone blown away by how fast we got to running something this powerful locally? I know it's easy to get burnt out on llms but this is pretty incredible.
I genuinely think we're only 2 years away from full custom local voice to voice llm assistants that grow with you like JOI in BR2049 and it's going to change how we think about being human and being social, and how we grow up.
It's incredible.
I've been experimenting with running local LLMs for nearly two years now, ever since the first LLaMA release back in March 2023.
About six months ago I had mostly lost interest in them. They were fun to play around with but the quality difference between the ones I could run on my MacBook and the ones I could access via an online API felt insurmountable.
This has completely changed in the second half of 2024. The models I can run locally had a leap in quality - they feel genuinely GPT-4 class now.
They're not as good as the best hosted models (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) but they're definitely good enough to be extremely useful.
This started with the Qwen 2 and 2.5 series, but I also rate Llama 3.3 70B and now Phi-4 as GPT-4 class models that run on my laptop.
I wrote more about this here: https://simonwillison.net/2024/Dec/31/llms-in-2024/#some-of-...
I'm in complete agreement with your more recent timeline piece (the negative one), and as a younger user (22 year old student) I'm actively relocating this year to somewhere slightly more rural with a focus on physical/knowledge combined work to secure a good quality of life nearly solely because of how fast our timelines are.
A 'word calculator' this effective is the best substitute we have for a logic calculator. And the fact that it's enough in 90% of situations is as terrifying as it is transformative, as is the fact that no one is awake to it.
Exponential power scaling in an unstable world feels like it only makes it exponentially more unstable though.
I should emphasize that I really don't think the dystopian version of this is likely to happen - the one where "AGI/ASI" puts every human out of work and society collapses.
Human beings have agency, and we are very good at rolling with the punches. We've survived waves of automation for hundreds of years. I'm much more confident that we will continue to find ways to use these things as tools that elevate us, not replace us.
I really hope the dystopian version doesn't come to pass!
I agree people are way more agentic than we give them credit for in these situations. When analysing large situations like this top-down, we tend to 'petri-dish' ourselves and act like we're just the products of our environments, being swept along, when that really isn't the case.
That being said, I can't see any world where there isn't mass ontological shock/hysteria, mass unemployment, and unrest at least for a few years, and I feel like it is definitely the kind of event you take active measures and preparations for beforehand.
And so do I, but like the golden rule of camping you should prepare for the worst and hope for the best!
> We've survived waves of automation for hundreds of years. I'm much more confident that we will continue to find ways to use these things as tools that elevate us, not replace us.
The difference with past technological breakthroughs is that they augmented what humans could do, but didn't have the potential to replace human labor altogether as AI does. They were disruptive, but humans were able to adapt to new career paths as they became available. Lamplighters were replaced by electrical lighting, but that created jobs for electrical engineers. Carriage drivers were replaced by car drivers; human computers by programmers, and so on.
The invention of AI is a tipping point for technology. The only jobs where human labor will be valued over machine labor are those that machines are not good at yet, and those where human creativity is a key component. And the doors are quickly closing on both of those as well.
Realistically, the only jobs that will have some form of longevity (~20 years?) are those of humans that build and program AI machines. But eventually even those will be better accomplished by other AI machines.
So, I'm really curious why you see AI as the same kind of technology we've invented before, and why you're so confident that humanity will be able to overcome the key existential problems AI introduces, which we haven't even begun to address. I don't see myself as a pessimist, but can't help noticing that we're careening towards a future we're not prepared to handle.
As many forums say, other tech inventions replaced the horse, not the rider. With AI, they are replacing the rider, and that makes it a unique technology that doesn't compare to previous technology being introduced. Other forms of technology typically enabled use cases that didn't seem possible (e.g. electricity, cooking food faster, flying, etc.); this one, at present, is just about making existing cases more efficient / removing the need for labor. As many non-techies put it: other than doing my assignment/email/etc., what benefit does it have on my daily life, other than threatening some jobs and generating some worthless online content?
The cost/benefit for the labor/middle/low classes is at best low right now. I define that as someone who needs to trade time to continue surviving as an ongoing concern even if they have some wealth behind them.
I think the outcome where any form of meritocratic society gives way to old fashioned resource acquisition based societies is definitely one believable outcome. Warfare, land and resource ownership - the old will become the new again.
You truly believe we’re on a timeline that involves the replacement of anaesthesiologists, emergency medicine physicians, trauma surgeons, and so on, within a 20 year timeframe? AI progress in the last few years has been astounding, but the gaps between where we are and a true all-human-labour-is-inferior scenario is almost unfathomable.
I could be wrong on the timeline. But are we not moving towards a future where even those professions are replaced by AI? The current wave of ML might not be the one to get us there, but there is an unprecedented level of interest and resources working to make that a reality. Regardless if they succeed or not, there is still a mountain of societal problems we need to address with even the current generation of this technology.
But my main argument is against the notion that this technology is the same as the ones that came before it, and that it will undoubtedly lead to a net better future. I think that is far from certain, and the way things are developing only leads me to believe that we're not ready for what we're building.
I’m on the opposite end of the spectrum. I’m almost certain that this is going to end extremely badly for the majority of humanity, and for programmers in particular.
I think there’s a less than 5% chance that this goes well, and that’s only if we get a series of things to go extremely well. And frankly, we’re tracking along the extremely bad path so far.
We barely survived one nuclear arms race, and this could give every nation state a new type of power weapon every 5ish years through the inevitable scaling in energy and weapons. I agree we're on one of the worst timelines for AI/AGI/ASI with the world actively being run into the ground by short sighted 'dementia-ocracies' and every security risk about to increase dramatically.
I am blown away: a year ago I bought a M2 32G Mac to run local models. It seems like what I can run locally now just one year later is 10x more useful for NLP, data wrangling, RAG, experimenting with agents, etc.
BTW, a few days ago I published a book on using Ollama. Here is a link to read it online https://leanpub.com/ollama/read
Which models do you recommend for that amount of memory?
I asked the same question a few days back and I'm keeping the responses here: https://bsky.app/profile/potato.horse/post/3lejngewfmc2n
Not related to local LLMs, but JOI from BR2049 is essentially what Replika is striving for: https://replika.com/
In fact, during the onboarding process they ask the user to choose which AI companion movie they related to the most: Her, BR2049, or Ex Machina. The experience is then tailored to align closer to the movie chosen.
It's quite a terrible app from a product design perspective: filled with dark patterns (like sending the user blurred images to "unlock") and upsells, but it's become successful amongst the masses that have adopted it, which I find fascinating. 30m+ users https://en.wikipedia.org/wiki/Replika#:~:text=Replika%20beca....
How can a model "grow with you"? Do current models do this other than adding the full conversation to the context window?
Yes, and for image and video generation too.
Hunyuan (open source video) has been remarkable. Flux dev makes some incredible images.
The fact that it's still only going to get better from here is hard to imagine.
I’ve thought for a while that Joi in BR2049 was less dystopian than what we will probably do with AI. She doesn’t constantly prompt K to buy more credits (like a mobile game) to continue engaging with her or deepen their relationship. (“If you really love me…”) I’ve been expecting that this is how our industry would operate given the customer hostile psychologically abusive hellscape of social and mobile. Of course there’s still time.
She appears to be a local model runnable on a small device without cloud.
I expect AI to be like any other tech: some fantastic uses that advance humanity and improve the world, some terrible uses that abuse, manipulate, oppress.
I don’t see anything in the tech that indicates a singular pattern that will be “good” or “bad”.
The enshitification trend seems pretty dominant, and pretty bad for users / good for investors
It’s odd that MS is releasing models that are competitors to OA. This reinforces the idea that there is no real strategic advantage in owning a model. I think the strategy is now to offer cheap and performant infra to run the models.
> This reinforces the idea that there is no real strategic advantage in owning a model
For these models probably no. But for proprietary things that are mission critical and purpose-built (think Adobe Creative Suite) the calculus is very different.
MS, Google, Amazon all win from infra for open source models. I have no idea what game Meta is playing
> I have no idea what game Meta is playing
Based on their business moves in recent history, I’d guess most of them are playing Farmville.
Meta's entire business model is to own users and their content.
Whether it be Facebook, Instagram, Threads, Messenger, WhatsApp, etc. their focus is to acquire users, keep them in their platforms, and own their content - because /human attention is fundamentally valuable/.
Meta owns 40% of the most popular social media platforms today, but their attention economies face great threats: YouTube, TikTok, Telegram, WeChat, and many more threaten to unseat them every year.
Most importantly, the quality of content on these platforms greatly influences their popularity. If Meta can accelerate AI development in all forms, then content quality across all apps/platforms can be equalized - video on YouTube or TikTok will be no higher quality than on Facebook or Instagram, and messages on Threads will be no more engaging than those on Twitter. Their recent experiments with AI-generated profiles[0] signal this is the case.
Once content quality - and luring creators to your platform - is neutralized as a business challenge affecting how effectively end users lurking on the platform can be retained, it becomes easier for Meta to retain any user that enters their platforms and gain an effective attention monopoly, without needing to keep buying apps that could otherwise supplant theirs.
And so, it is in their benefit to give away their models 'for free', 'speed up' the industry's development efforts in general, de-risk other companies surpassing their efforts, etc.
[0] https://thebaynet.com/meta-faces-backlash-over-ai-generated-...
Or, to put it another way:
Meta makes money from ads. To make more money, they either need to capture more of their users' time and show more ads, or show better ads that users click more often. Meta is betting on AI models making it easier to do both.
Better generative AI means you can make more ads faster, which means there are more ad variants to a/b test across, which means it's easier to find an ad that users will click.
To make users stay on their platforms, Meta figures out what content will keep them there, and then shows them that content. Before gen AI, they were only able to show existing content from real users, but sometimes the "ideal" thing for you hasn't been created yet. They bet on the fact that they'll be able to use AI to create hyper-personalized content for their users that engages them better than human-made content.
Word. I was mostly just making a joke about FarmVille— the classic engagement-vampire facebook game.
Can you explain how development of better generative AI (which I assume is what you mean when you say AI) will mean that “content quality across all apps/platforms can be equalized”? Unless you mean the content quality will go to shit equally everywhere (as it did in their AI profile experiment) I’m not sure I understand what you’re saying.
Meta’s definition of quality is not the same as your definition of quality. For them, quality is (within reason) what drives “engagement” (aka time spent in their apps).
It might be that many people’s aesthetic sensibility is that AI-generated content is slop, but I’d still bet that tailored-perfectly-to-you content (and ads) will be highly engaging
> I have no idea what game Meta is playing
I think they're commoditizing their complement [1]. Engaging content helps Meta, and LLMs make it easier to create that content. Their business model has never been selling API access and releasing the model enables the community to improve it for them.
> It’s odd that MS is releasing models that are competitors to OA.
> I think the strategy is now to offer cheap and performant infra to run the models.
Is this not what microsoft is doing? What can microsoft possibly lose by releasing a model?
That's exactly what they're saying: it's interesting that Microsoft came to the same conclusion that Meta did, that models are generally not worth keeping locked down. It suggests that OpenAI has a very fragile business model, given that they're wholly dependent on large providers for the infra, which is apparently the valuable part of the equation.
To be fair, OpenAI's products are not really models, they are... products. So it's debatable if they really do have anything special.
I don't really think they do, because to me it seemed, pretty much since GPT-1, that having callbacks to run Python and query Google, having an "inner dialog" before summarizing an answer, and a dozen more simple improvements like this are quite obvious things to do that nobody had actually implemented (yet). And if some of them are not obvious per se, they are pretty obvious in hindsight. But, yeah, it's debatable.
I must admit, though, that I doubt this obvious weakness is not obvious to the stakeholders. I have no idea what the plan is; maybe what they're gonna have that Anthropic doesn't is a nuclear reactor. Honestly, we're all pretending to be forward-thinking analysts here, but in reality I couldn't figure out that Musk's "investment" into Twitter was literally politics at the time it happened. Even though I was sure there was some plan, I couldn't say what it was, and I don't remember anybody in these threads expressing clearly enough what is quite obvious in hindsight. Neither did all those people like Matt Levine, who are actually paid for their shitposting: I mostly remember them making fun of Musk "doing stupid stuff and finding out" and calling it a "toy".
> To be fair, OpenAI's products are not really models, they are... products
What's the distinction? What kind of functionality do they offer that other models don't?
A model is an ingredient in an AI product. The product includes the UI, tools / RAG, apps on various platforms, system prompts and personality, and so on.
Lots of products have been successful without a technical moat. Facebook has network effects, Apple has UX (though silicon has become a technical advantage if not moat), Adobe has “everyone knows how to use these tools” switching costs, Google has brand synonymous with search.
Companies are betting that models will be commodities but AI products will be sticky.
ChatGPT.com does much more than, for example, Llama3.2-vision. It can search the web automatically, and write code and run it just to answer you; much more agency.
What product? A chat window? I'm not trying to be rude btw, but if the product isn't the LLM itself, that's all they have.
I regularly use several features within ChatGPT that are well beyond a chat window. Advanced Voice, DALL-E integration, Projects, and GPTs (mostly a couple private ones I created for my own use). There are other features that I don't use, like Canvas. Perhaps the sum of these still isn't an impressive product in your eyes, but it's surely more than just a chat window.
Interesting, so apparently UI and UX and responsiveness and polish all don’t matter for products? We can just ship shittily drawn interfaces now?
They aren’t that good. It’s mostly well rounded now, but on macOS it’s often impossible to select parts of code sections.
That’s not just on macOS, and I’m pretty sure that’s a deliberate dark pattern to prevent users from taking their query to claude or gemini after gpt shits the bed.
I use OpenAI not just because it has decent models that work decently by default, but because I don't need to care about how to set up a model on a cloud provider, and their API is straightforward. They are quite affordable too (e.g. TTS is one of the cheapest I found for its quality).
I could switch to a different provider if I needed to maybe with cheaper pricing or better models but that doesn't mean OpenAI doesn't offer a "product".
OpenAI is the only company that really matters in the consumer conversational AI space.
Their unique value-adds are the Chat GPT brand, being the "default destination" when people want AI, as well as all the "extra features" they add on top of raw LLMs, like the ability to do internet searches, recall facts about you from previous conversations, present data in a nice, interactive way by writing a react app, call down to Python or Wolfram Alpha for arithmetic etc.
I wouldn't be surprised if they eventually stop developing their own models and start using the best ones available at any given time.
> in the consumer conversational AI space.
The "consumer conversational AI space" only exists right now as a novelty, not a long-term market segment. In the not too distant future that space will be covered for most users for free by their hardware manufacturers, and the number of people willing to pay a monthly subscription to a third party will drop even further than it already has.
I think it will be at least a few years until your average Joe can run a speech to speech model on their phone.
I mean they have name recognition and a userbase, but they're hardly the best at doing any of those features.
Default destination for many is still just Google, and they've added AI to their searches. AI chat boxes are shoehorned into a ton of applications and at the end of the day it'll go to the most accessible one for people. This is why AI in Windows or in your Web Browser or on your phone is a huge goal.
As far as extra features go, ChatGPT is a good default, but they're severely lacking compared to most other solutions out there.
> It suggests that OpenAI has a very fragile business model
That is the reason they are making products so that people stay on the platform.
Their big risk there as I see it is that the market for "I need an AI" is much much smaller than they thought it would be. People don't generally need or want to pay for "AI", they want to pay for solutions to specific problems.
This means that in a world where AWS/Azure/GCP all compete in the compute and the models themselves are commodities, AI isn't a product, it's a feature of every product. In that world, what is OpenAI doing besides being an unnecessary middleman to Azure?
The ones at the forefront of the "I need an AI" hype are selling agents, or tools that integrate in your email workflow, or some other tool with AI in the name. OpenAI is selling the shovels, the backend API those services are using. AWS/Azure/GCP are selling factory space and are providing blueprints for shovels. Which is compelling at scale, but if you are busy selling AI tools to people who don't know better, it's faster to just use an API to whatever OpenAI offering is SOTA or close to SOTA.
I'd agree there isn't much money in it. OpenAI should probably milk the revenue they get now and make hay while the sun is shining. But their apparent strategy is to bet it all on finding another breakthrough similar to the switch from text completion to a chat interface
Yeah, the problem with selling shovels where shovels=APIs is that APIs cost almost nothing to replicate and are not copyrightable. Tools like Ollama and LiteLLM already offer APIs that are drop-in replacements for OpenAI.
OpenAI isn't losing yet because their models are still marginally better and they have a lot of inertia, but their API isn't going to save them.
> But their apparent strategy is to bet it all on finding another breakthrough similar to the switch from text completion to a chat interface
I'm still convinced that their strategy is to find an exit ASAP and let Altman cash out. He's playing up AGI because it's the only possible way that "AI" becomes a product in its own right so investors need to hear that that's the goal, but I think he knows full well it's not in reach and he can only keep the con going so long. An exit is the most profitable way out for him.
They were the useful idiots who attracted the funding to take the risks and make the technology emerge, but they didn't have the right marketing and political power. They will disappear as fast as they appeared. It's a common tale in technology: plenty of companies that invented and/or developed something and did all the hard work just couldn't compete once it got commoditized.
OpenAI has infrastructure and a product around serving people, plus they have SOTA models. Joe Blow can't just take Qwen or whatever and start making money at scale.
According to many press stories in the past year, the relationship between Microsoft and OpenAI has been very strained. It looks more and more like both sides are looking for an opportunity to jump ship.
This is a very clever move by Microsoft. OpenAI has no technological moat and a very unreliable partner.
> This reinforces the idea that there is no real strategic advantage in owning a model.
Yes, because you can't build a moat. Open source will very quickly catch up.
I think they want/need a plan b in case OpenAI falls apart like it almost did when Sam got fired.
I was going to ask if this or other Ollama models support structured output (like JSON).
Then a quick search revealed you can, as of a few weeks ago.
For structured output from anywhere, I'm finding https://github.com/BoundaryML/baml good. It's more accurate than what gpt-4o-mini will do on its own, and than any of the other JSON schema approaches I've tried.
Yeah, it's not as strong as the constrained beam search OpenAI uses (at least AFAIK), but it works on any model that supports tool calling. Just keep it simple: don't use a lot of deeply nested structures or complicated rules.
Lots of other models will work nearly as well, though, if you just give them a clear schema to follow and ask them to output JSON only, then parse it yourself. I've been using gemma2:9b to analyze text and output a JSON structure, and it's nearly 100% reliable despite being a tiny model that doesn't officially support tools or structured output.
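For models without official structured-output support, the "parse it yourself" step can be as simple as fishing the JSON object out of the reply. A minimal Python sketch (the fence-stripping regex and the sample reply are illustrative assumptions, not anything specific to gemma2):

```python
import json
import re

def extract_json(text: str) -> dict:
    """Pull a JSON object out of a model reply that may wrap it in
    markdown code fences or surrounding chatter."""
    fence = "`" * 3  # built programmatically so this pastes cleanly
    fenced = re.search(fence + r"(?:json)?\s*(\{.*?\})\s*" + fence,
                       text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Fall back to the first '{' .. last '}' span in the text
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(candidate[start:end + 1])

reply = 'Here you go: {"sentiment": "positive", "score": 0.9} Hope that helps!'
print(extract_json(reply)["sentiment"])  # → positive
```

In practice you'd pair this with a retry loop that re-prompts the model when `json.loads` fails, which is usually enough to get near-100% reliability.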
I was disappointed in all the Phi models before this, whose benchmark scores were way better than their performance in practice, but I've been really impressed with how good Phi-4 is at just 14B. We ran it against the top 1,000 most popular StackOverflow questions and it came in 3rd, beating GPT-4 and Sonnet 3.5 in our benchmarks, behind only DeepSeek v3 and WizardLM 8x22B [1]. We're using Mixtral 8x7B to grade the quality of the answers, which could explain how WizardLM (based on Mixtral 8x22B) took 2nd place.
Unfortunately I'm only getting 6 tok/s on an NVIDIA A4000, so it's still not great for real-time queries. But luckily, now that it's MIT licensed, it's available on OpenRouter [2] for a great price of $0.07/$0.14M at a fast 78 tok/s.
Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].
[1] https://pvq.app/leaderboard
Interesting eval but my first reaction is "using Mixtral as a judge doesn't sound like a good idea". Have you tested how different its results are from GPT-4 as a judge (on a small scale) or how stuff like style and order can affect its judgements?
Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper
Yeah, we evaluated several models for grading ~1 year ago and concluded Mixtral was the best choice for us: it was the best-performing model we could self-host while distributing the load of grading 1.2M+ answers over several GPU servers.
We would have liked to pick a neutral model like Gemini, which was fast, reliable, and low cost; unfortunately it gave too many poor answers good grades [1]. If we had to pick a new grading model now, hopefully the much-improved Gemini Flash 2.0 would yield better results.
[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...
There are a lot of interesting options. Gemini 2 Flash isn't ready yet (the current limits are 10 RPM and 1500 RPD), but it could definitely work. An alternative might be using a fine-tuned model; I've heard good things about OpenAI fine-tuning with even a few examples.
Honestly, the fact that you used an LLM to grade the answers at all is enough to make me discount your results entirely. That it showed obvious preference to the model with which it shares weights is just a symptom of the core problem, which is that you had to pick a model to trust before you even ran the benchmarks.
The only judges that matter at this stage are humans. Maybe someday when we have models that humans agree are reliably good you could use them to judge lesser-but-cheaper models.
Yup. I did an experiment a long time ago where I wanted to pick the best response. I had Wizard, Mistral & Llama each generate a response, then passed each response to all three models to vote, in a fresh prompt with no reference to the previous one. 95%+ of the time, they all voted for their own response, even when there was clearly a better one. LLM-as-a-judge is a joke.
The Mixtral grading model calculates the original starting votes, which can then be influenced by users voting on their preferred answers, which in turn affects the leaderboard standings.
It should be noted that Mixtral 8x7B didn't grade its own model very highly (11th); its standout was grading Microsoft's WizardLM2 model pretty high, at #2. That's not entirely without merit: at the time of release, WizardLM2 was Microsoft's most advanced model and the best open-source LLM available [1]. We also found it generated high-quality answers, and I'm surprised it's not used more; it's only OpenRouter's 15th most-used model this month [2], although it received very little marketing, essentially just an announcement blog post.
Whilst nothing is perfect, we're happy with the grading system, as it's still able to distinguish good answers from bad ones, good models from bad ones, and which topics models perform poorly on. Some of the grades are surprising, since we hold preconceptions about where models should rank before the results come in. That's also why it's important to have multiple independent benchmarks, especially benchmarks LLMs aren't optimized for; I've often been disappointed by how some models perform in practice versus how well they score in benchmarks.
Either way you can inspect the different answers from the different models yourself by paging through the popular questions [3]:
[1] https://wizardlm.github.io/WizardLM2/
I tested Phi-4 with a Japanese functional test suite and it scored much better than prior Phis (and comparable to much larger models, basically in the top tier atm). [1]
The one red flag with Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], but it's one area especially worth keeping an eye on for those testing Phi-4 themselves...
[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...
[2] https://github.com/google-research/google-research/blob/mast...
IMO SO questions are not a good evaluation. These models were likely trained on the top 1000 most popular StackOverflow questions, so you'd expect them to perform well and produce results similar to the original answers.
> but luckily now that it's MIT licensed it's available on OpenRouter
Did it have a different license before? If so, why did they change it?
FWIW, Phi-4 was converted to Ollama by the community last month:
And adopted unsloth's bug fixes a few days ago. https://ollama.com/vanilj/phi-4-unsloth
The template doesn't match Unsloth's recommendation: https://news.ycombinator.com/item?id=42662106
We ended up not publishing it as a library model just because it was leaked and not the official weights.
How come models can be so small now? I don't know a lot about AI, but is there an ELI5 for a software engineer that knows a bit about AI?
For context: I've made some simple neural nets with backprop. I read [1].
I’ve seen on the localllama subreddit that some GGUFs have bugs in them. The one recommended was by unsloth. However, I don’t know how the Ollama GGUF holds up.
Ollama can pull directly from HF: you just provide the URL and append `:Q8_0` (or whatever) to specify your desired quant. Bonus: use the short-form URL `hf` instead of `huggingface` to shorten the model name a little in the `ollama list` table.
Edit: so, for example, if you want the Unsloth "debugged" version of Phi-4, you would run:
`$ ollama pull hf.co/unsloth/phi-4-GGUF:Q8_0`
(check on the right side of the hf.co/unsloth/phi-4-GGUF page for the available quants)
Is it true that non-gguf models are basically all Q4-equivalent? I'm always not sure which one to download to get the "default score".
You still need to make sure the Modelfile works, so this method will not run out of the box for a vision GGUF or anything with a special schema. That's why it's mostly a good idea to pull from Ollama directly.
Phi-4's architecture changed slightly from Phi-3.5 (it no longer uses a sliding window of 2,048 tokens [1]), causing a change in the hyperparameters (and ultimately an error at inference time for some published GGUF files on Hugging Face, since the same architecture name/identifier was re-used between the two models).
For the Phi-4 uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well
In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".
Here's[1] a recent submission on that.
[1]: https://news.ycombinator.com/item?id=42660335 Phi-4 Bug Fixes
Related Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning (439 points, 24 days ago, 144 comments) https://news.ycombinator.com/item?id=42405323
Also on hugging face https://huggingface.co/microsoft/phi-4
Does it include the unsloth fixes yet?
I’ve pulled and ran it. It launches fine, but when I actually ask it anything I constantly get just a blank line. Does anyone else experience this?
I would guess on your hardware you're getting <1 token/time-you've-bothered-waiting?
Not sure what that means. I've got a MacBook Pro M1 Max with 64GB. Every other model runs perfectly fine; only Phi-4 blanks on me.
"built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets"
Does this mean the model was trained without copyright infringements?
This is a presumptive question, as training AI models may fall under fair use.
Can this run on a macbook m1? What is the performance like? Or would I need an m3? Thanks
Yeah, as long as it has 16GB+ RAM. I've got a newer chip and it's very fast, so I expect it would be at least bearable on an M1.
Does this include some of the config fixes that the sloth guys pointed out?
I have unfortunately been disappointed with the llama.cpp/Ollama ecosystem of late, and I'm thinking about moving my things to vLLM instead.
llama.cpp basically dropped support for multimodal vision models. Ollama still supports them, but only a handful. Also, Ollama still does not support Vulkan, even though llama.cpp has had Vulkan support for a long, long time now.
This has been very sad to watch. I'm more and more convinced that vLLM is the way to go, not Ollama.
But can you run LLMs that easily with vLLM? Do you have to fiddle with formats to get them to run?
I'm still in the early stages of exploration, but vLLM seems to be compatible with most models on Hugging Face.
[flagged]
The Ollama application itself has zero value; it's just an easy-to-use front end to their model hosting, which is both what this is and why they're important.
Only having one model host (hugging face) is bad for obvious reasons (and good in others, yes, but still)
Ollama offering an alternative as a model host seems quite reasonable and quite well implemented.
The front end really is nothing; it's just llama.cpp in a Go wrapper. It has no value and it's not really interesting; it's simple, stable technology that is perfectly fine to rely on and be totally unexcited about, technically.
…but, they do a lot more than that; and I think it’s a little unfair to imply that trivial piece of their stack is all they do.
The software that controls the front end has enormous value: it becomes the central point and brand for managing and self-hosting LLMs, used to manage a 100 GB catalog of models, which acts as a moat inhibiting switching to alternatives. Awareness and a user base are the hardest things to obtain with new software products, and it has both. Right now it doesn't look like it's monetizing that user base, but it could easily attract millions in VC funding to spin off a company selling support contracts and "higher value" SaaS hosting or enterprise management features.
Whilst it's now a UX friendly front-end for llama.cpp, it's also working on adding support for other backends like MLX [1].
I’m not seeing what the issue is with ollama. Can you elaborate? There are tons of open source projects that other stuff gets built upon: that’s part of the point of open source.
I might be wrong about this, but doesn't Ollama do some work to ensure the model runs efficiently on your hardware? Like choosing how much GPU memory to consume so you don't OOM. Does llama.cpp do that for you with zero config?
Yes, Ollama automatically determines the number of layers to offload based on available VRAM.
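The offloading decision is essentially a capacity calculation. A back-of-the-envelope sketch in Python (the even-split assumption, the 1 GB overhead figure, and the example numbers are all illustrative; Ollama's real accounting also considers context length, KV cache size, and per-tensor layout):

```python
def layers_to_offload(model_bytes: int, n_layers: int,
                      free_vram_bytes: int, overhead_bytes: int = 1 << 30) -> int:
    """Rough capacity check: split the model weights evenly across layers
    and offload as many layers as fit in free VRAM, after reserving some
    overhead for the KV cache and scratch buffers."""
    per_layer = model_bytes / n_layers
    usable = max(0, free_vram_bytes - overhead_bytes)
    return min(n_layers, int(usable / per_layer))

# e.g. a ~9 GB quant with 40 layers on a card with 8 GB free:
print(layers_to_offload(9 << 30, 40, 8 << 30))  # → 31
```

With llama.cpp directly, you'd pass a number like this yourself via its GPU-layers flag; Ollama's convenience is computing it for you per model and per machine.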
I would even say that Ollama is a step back. For example, llama.cpp supports Vulkan, which is a huge game-changer for consumer-grade hardware. Ollama does not support Vulkan, even though it would probably be fairly easy to add.
If you care about running efficiently on your hardware, then llama.cpp is the way to go, not Ollama.
I thought ollama was just a convenience wrapper around llama.cpp?
That might be how it started, but there are differences. For example, support for Llama 3.2 Vision was added to Ollama[1] but not upstreamed[2] to llama.cpp, due to image-processing requirements AFAIK.
Looks like you’re being downvoted. It’d be nice if somebody could explain the difference, cause I’m also kinda out of the loop on this
It's very hard to put into words without coming off as being unfair to one side or the other, but the ollama project really does provide little-to-no _innovative_ value over simply running components of llama.cpp directly from the command line. 100% of the heavy lifting (from an LLM perspective) is in the llama.cpp codebase. The ollama parts are all simple, well understood, commodity components that most any developer could have produced.
Now, applications like ollama obviously need to exist, as not everyone can run CLI utilities, let alone clone a git repo and compile themselves. Easy to use GUIs are essential for the adoption of new tech (much like how there are many apps that wrap ffmpeg and are mostly UI).
However, if Ollama is mostly doing commodity GUI things on top of a fully fleshed-out, _unique_ codebase to which its very existence is owed, they should do everything in their power to point that out. I'm sure they're legally within their rights because of the licensing; this is just from an ethical perspective.
I think there is a lot of ill-will towards ollama in some hard-core OG LLM communities because ollama appears to be attempting to capture the value that ggerganov has provided to the world in this tool without adequate attribution (although there is a small footnote, iirc). Basically, the debt that ollama owes to llama.cpp is so immense that they need to do a much better job recognizing it imo.
Does Ollama offer a GUI? I don't think they do.
I use them because they run as a systemd service with a convenient HTTP API. That's been extremely helpful for switching between GUIs.
I also like their model organization scheme and the Modelfile paradigm. It's also really handy that it loads and unloads models when called, which helps with experimentation and some complex workflows, e.g. embedding followed by inference.
Is llama.cpp doing 100% of the "heavy lifting"? Sure, but some light lifting is also needed to lower the activation threshold and bring the value to life.
I would not use llama.cpp, it's simply too cumbersome.
If Ollama did not exist, I would have to invent it.
Is it not "innovative"? Who cares! I want it. Commodity GUI? Again, I don't think they have a GUI at all. Are you maybe thinking of OpenWebUI?
I think we agree on almost all points, but I thought ollama-gui was an official GUI, so I'm even more baffled as to what the draw is. Running llama.cpp as a service/API endpoint is trivial (I do just that). Maybe you can outline for me what the value proposition of Ollama is, so I can better understand what it does that plain llama.cpp doesn't.
Ollama allows me to use a single podman command, which uses the latest version of ollama, downloads a model of my choosing, and starts a local http endpoint widely supported by different clients. I can just run this one command to chat with a local model through a web interface, get code completions in VSCode, ask about the content of my local Markdown notes.
Now, I don't use AI that much, I could totally live without this. But if it weren't for the robust one-liner I probably wouldn't use local LLMs at all.
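That local HTTP endpoint is what most clients talk to. A minimal Python sketch against Ollama's documented `/api/chat` endpoint (the model name is a placeholder, and it assumes Ollama is listening on its default port 11434):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete response instead of a token stream
    }

def chat(model: str, prompt: str) -> str:
    """Send a single-turn chat request and return the assistant's reply."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Usage (requires a running Ollama instance with the model pulled):
# print(chat("phi4", "Why is the sky blue?"))
```

Code editors, web UIs, and note-taking plugins are mostly just variations on this one request.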
My experience of ollama is that it makes it super easy to pull various models and use them locally. Sure, I could do this myself but it's helpful not to have to.
I'm not going to pretend to know everything about Ollama, llama.cpp or llamafile, but my experiences using llama.cpp and llamafile (llama.cpp based) were both negative. Web UI frontends aren't relevant here, this is just purely about whether I can load a model and get it to produce coherent results that are in the realm of what the people who created the model intended.
With llama.cpp or llamafile, I was constantly having to look up a model's paper, documentation, or other pages to see what the recommended parameters and templates were. My understanding is that GGUFs were supposed to solve that, yet I was still getting poor results.
You know, I don't know all the details, or whether there's any difference between what Modelfiles are for versus what GGUF metadata is for, but my experience with Ollama has been that it just works. It took me a while to even try Ollama, because my expectation was that it would simply be another interface on top of the same issues.
There are things I don't like about Ollama, but mostly they were easy to work around by writing a few scripts. Not using any web UI with it at all.
Translation: Phi-4 available on llama.cpp.