GPT-4.1 in the API
openai.com
678 points by maheshrijal 5 days ago
As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
Why do I have to figure all of this out myself?
> - Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
Same here, which is a real shame. I've switched to DeepResearch with Gemini 2.5 Pro over the last few days where paid users have a 20/day limit instead of 10/month and it's been great, especially since now Gemini seems to browse 10x more pages than OpenAI Deep Research (on the order of 200-400 pages versus 20-40).
The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.
Wow, I wondered what the limit was. I never checked, but I've been using it hesitantly since I burn up OpenAI's limit as soon as it resets. Thanks for the clarity.
I'm all-in on Deep Research. It can conduct research on niche historical topics that have no central articles in minutes, which typically were taking me days or weeks to delve into.
I like Deep Research, but as a historian I have to tell you: I've used it on history topics to calibrate my expectations, and it is a nice tool, but... it can easily brush over nuanced discussions and just return folk wisdom from blogs.
What I love most about history is it has lots of irreducible complexity and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.
I read Being and Time recently and it has a load of concepts that are defined iteratively. There's a lot wrong with how it's written, but it's an unfinished book written a hundred years ago, so I can't complain too much.
Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful. But, to be fair, I can't really fault it for being a bit useless with a very difficult-to-comprehend text, where there are several competing styles of reading, many of which are convinced they are correct.
But I started to notice a pattern of it pulling answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a university's server that uses concepts in the book to ground qualitative research, which is fine, and practical explications are often useful ways into a dense concept, but it's a really weird place to be the first academic source. It'll draw on Reddit a weird amount too, or it'll somehow pull a page of definitions from a handout for some university tutorial. And it won't default to the peer-reviewed free philosophy encyclopedias that are online and well known.
It's just weird. I was just using it to try and reinforce my actual reading of the text, but I came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information with learning about something.
*it's just what I have access to.
If you're asking an LLM about a particular text, even if it's a well-known text, you might get significantly better results if you provide said text as part of your prompt (context) instead of asking a model to "recall it from memory".
So something like this: "Here's a PDF file containing Being and Time. Please explain the significance of anxiety (Angst) in the uncovering of Being."
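Concretely, via the API that just means pasting the full text into the prompt. A rough sketch with the OpenAI Python SDK (the file name, model choice, and system prompt here are mine, purely illustrative):

    # Minimal sketch: supply the text as context instead of relying on the
    # model's memory of it. File path and model name are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("being_and_time.txt", encoding="utf-8") as f:
        book = f.read()  # has to fit in the model's context window

    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Answer only from the provided text."},
            {
                "role": "user",
                "content": (
                    "<book>\n" + book + "\n</book>\n\n"
                    "Explain the significance of anxiety (Angst) "
                    "in the uncovering of Being."
                ),
            },
        ],
    )
    print(resp.choices[0].message.content)

In a chat UI, attaching the PDF accomplishes the same thing; the point is that the model answers from the supplied text rather than from whatever it absorbed in training.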
When I've wanted it to not do things like this, I've had good luck directing it to... not look at those sources.
For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)
Deep Search is pretty good for current news stories. I've had it analyze some legal developments in a European nation recently and it gave me a great overview.
That use case seems pretty self-defeating, when a good news source will usually try to at least validate first-party materials, which an LLM cannot do.
LLMs seem fantastic at generalizing broad thought and not great at outliers. They sort of smooth over the knowledge curve confidently, which is a bit like in psychology, where only CBT therapy is accepted even though there are many much more effective methodologies for individuals, just not at the population level.
Interesting use case. My problem is that for niche subjects the crawled pages probably haven't captured the information and the response becomes irrelevant. Perhaps Gemini will produce better results just because it takes into account many more pages.
I also like Perplexity’s 3/day limit! If I use them up (which I almost never do) I can just refresh the next day
I've only ever had to use DeepResearch for academic literature review. What do you guys use it for which hits your quotas so quickly?
I use it for mundane shit that I don’t want to spend hours doing.
My son and I go to a lot of concerts and collect patches. Unfortunately we started collecting long after we started going to concerts.
I had a list of about 30 bands I wanted patches for.
I was able to give precise instructions on what I wanted. Deep research came back with direct links for every patch I wanted.
It took me two minutes to write up the prompt and it did all the heavy lifting.
I use them as follows:
o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.
Phind was fine-tuned specifically to produce inline Mermaid diagrams for technical questions (I'm the founder).
I really loved Phind and always think of it as the OG perplexity / RAG search engine.
Sadly I stopped my subscription when you removed the ability to weight my own domains...
Otherwise the fine-tune of your output format for technical questions is great, with the options, the pros/cons, and the Mermaid diagrams. Just way better for technical searches than what all the generic services can provide.
Gemini 2.5 Pro is quite good at code.
Has become my go to for use in Cursor. Claude 3.7 needs to be restrained too much.
Same here, 2.5 Pro is very good at coding. But it's also cocky and blames everything but itself when something isn't working. E.g. "the linter must be wrong, you should reinstall it", "looks to be a problem with the Go compiler", "this function HAS to exist, that's weird that we're getting an error"
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
Weird. For me, Sonnet 3.7 is much more focused and in particular works much better at finding the places that need changes and at using other tooling. I guess the integration in Cursor is just much better and more mature.
This. sonnet 3.7 is a wild horse. Gemini 2.5 Pro is like a 33 yo expert. o1 feels like a mature, senior colleague.
I find that Gemini 2.5 Pro tends to produce working but over-complicated code more often than Claude 3.7.
Which might be a side-effect of the reasoning.
In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.
In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.
You probably know this but it can already generate accurate diagrams. Just ask for the output in a diagram language like mermaid or graphviz
My experience is it often produces terrible diagrams. Things clearly overlap, lines make no sense. I'm not surprised, as if you told me to lay out a diagram in XML/YAML there would be obvious mistakes and layout issues.
I'm not really certain a text output model can ever do well here.
FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.
Agreed. Even with a simple setup, I'm sure a service like this already exists (or could easily exist) where the workflow is something like:
1. User provides information
2. LLM generates structured output for whatever modeling language
3. The same or another multimodal LLM reviews the generated graph for styling/positioning issues and ensures it matches the user request.
4. LLM generates structured output based on the feedback.
5. etc...
But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.
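That loop is easy enough to sketch even without fine-tuning. Purely illustrative, assuming mermaid-cli ("mmdc") for rendering and a vision-capable model via the OpenAI Python SDK; the prompts and model name are mine, not an existing service:

    # Sketch of the generate -> render -> review -> regenerate loop above.
    # Assumes mermaid-cli ("mmdc") is installed; model name and prompts are
    # illustrative placeholders.
    import base64
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4.1"  # placeholder; any vision-capable model would do

    def generate_mermaid(request: str, feedback: str = "") -> str:
        prompt = request + "\nReply with only a Mermaid diagram, no prose."
        if feedback:
            prompt += "\nFix these issues from the previous attempt: " + feedback
        resp = client.chat.completions.create(
            model=MODEL, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

    def review_render(png_path: str, request: str) -> str:
        with open(png_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Does this diagram match the request '" + request +
                             "'? List overlap/layout problems, or reply OK."},
                    {"type": "image_url",
                     "image_url": {"url": "data:image/png;base64," + b64}},
                ],
            }],
        )
        return resp.choices[0].message.content

    request = "Request flow: client -> load balancer -> two app servers -> Postgres"
    feedback = ""
    for _ in range(3):  # bounded number of refinement rounds
        with open("diagram.mmd", "w") as f:
            f.write(generate_mermaid(request, feedback))
        subprocess.run(["mmdc", "-i", "diagram.mmd", "-o", "diagram.png"], check=True)
        feedback = review_render("diagram.png", request)
        if feedback.strip().upper().startswith("OK"):
            break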
I had a latex tikz diagram problem which sonnet 3.7 couldn't handle even after 10 attempts. Gemini 2.5 Pro solved it on the second try.
Had the same experience. o3-mini failed miserably, Claude 3.7 as well, but Gemini 2.5 Pro solved it perfectly. (The task: going from an image of a diagram, with no source, to a TikZ diagram.)
I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.
You probably know this and are looking for consistency, but a little trick I use is to feed in the original data of what I need as a diagram and ask it to re-imagine it as an image "ready for print". Not native, but still a time saver, and it handles unstructured data surprisingly well. Again, not native... naive, yes; native, not yet. Be sure to double-check, triple-check as always; give it the ol' OCD treatment.
Gemini 2.5 is very good. Since you have to wait for reasoning tokens, it takes longer to come back, but the responses are high quality IME.
re: "grok-3 is r1 with mods" -- do you mean you believe they distilled deepseek r1? that was my assumption as well, though i thought it more jokingly at first it would make a lot of sense. i actually enjoy grok 3 quite a lot, it has some of the most entertaining thinking traces.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers
Ha! That's the funniest and best description of 4.5 I've seen.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
Is that an LLM hallucination?
It’s a tongue-in-cheek reference to how audiophiles claim to hear differences in audio quality.
Pretty dark times on HN, when a silly (and obvious) joke gets someone labeled as AI.
Switch to Gemini 2.5 Pro, and be happy. It's better in every aspect.
It's somehow not, I've been asking it the same questions as ChatGPT and the answers feel off.
Warning to potential users: it's Google.
Not sure how or why OpenAI would be any better?
It's not. It's closed source. But Google is still the worst when it comes to privacy.
I prefer to use only open source models that don't have the possibility to share my data with a third party.
The notion that Google is worse at carefully managing PII than a Wild West place like OpenAI (or Meta, or almost any major alternative) is…not an accurate characterization, in my experience. Ad tech companies (and AI companies) obsessively capture data, but Google internally has always been equally obsessive about isolating and protecting that data. Almost no one can touch it; access is highly restricted and carefully managed; anything that even smells adjacent to ML on personal data has gotten high-level employees fired.
Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.
For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.
My thought is that this model release is driven by this year's agentic app push. To my knowledge, all the big agentic apps (Cursor, Bolt, Shortwave) use Claude 3.7 because it's so much better at instruction following and tool calling than GPT-4o, so this model feels like GPT-4o (or a distilled 4.5?) with some post-training focused on what these agentic workloads need most.
Hey, also try out Monday, it did something pretty cool. It's a version of 4o which switches between reasoning and plain token generation on the fly. My guess is that's what GPT V will be.
I'm also very curious about the limits for each model. I never thought about limits before upgrading my plan.
Disagree. It's really not complicated at all to me. Not sure why people make a big fuss over this. I don't want an AI automating which AI it chooses for me. I already know through lots of testing intuitively which one I want.
If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.
I do like the vinyl and analog amplifiers. I certainly hear the warmth in this case.
What's hilarious to me is that I asked ChatGPT about the model names and approaches, and it did a better job than they have.
Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:
             SWE  Aider  Cost  Fast  Fresh
Claude 3.7   70%   65%   $15    77    8/24
Gemini 2.5   64%   69%   $10   200    1/25
GPT-4.1      55%   53%   $8    169    6/24
DeepSeek R1  49%   57%   $2.2   22    7/24
Grok 3 Beta   ?    53%   $15     ?   11/24
I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.
Is it available in Cursor yet?
I just finished updating the aider polyglot leaderboard [0] with GPT-4.1, mini and nano. My results basically agree with OpenAI's published numbers.
Results, with other models for comparison:
Model                         Score    Cost
Gemini 2.5 Pro Preview 03-25  72.9%  $ 6.32
claude-3-7-sonnet-20250219    64.9%  $36.83
o3-mini (high)                60.4%  $18.16
Grok 3 Beta                   53.3%  $11.03
* gpt-4.1                     52.4%  $ 9.86
Grok 3 Mini Beta (high)       49.3%  $ 0.73
* gpt-4.1-mini                32.4%  $ 1.99
gpt-4o-2024-11-20             18.2%  $ 6.74
* gpt-4.1-nano                 8.9%  $ 0.43
Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.
Did you benchmark the combo DeepSeek R1 + DeepSeek V3 (0324)? There is a combo in 3rd place, DeepSeek R1 + claude-3-5-sonnet-20241022, and the new V3 beats Claude 3.5, so in theory R1 + V3 should be in 2nd place. Just curious if that would be the case.
What model are you personally using in your aider coding? :)
Mostly Gemini 2.5 Pro lately.
I get asked this often enough that I have a FAQ entry with automatically updating statistics [0].
Model           Tokens     Pct
Gemini 2.5 Pro  4,027,983  88.1%
Sonnet 3.7        518,708  11.3%
gpt-4.1-mini       11,775   0.3%
gpt-4.1            10,687   0.2%
[0] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...
https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro?
Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.
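To put rough numbers on it (the per-million prices are from the table upthread; the per-problem token counts are made up for illustration):

    # Why per-token price can mislead: thinking tokens dominate per-problem cost.
    price_per_m = {"DeepSeek R1": 2.2, "GPT-4.1": 8.0}   # $ per 1M output tokens
    tokens_per_problem = {"DeepSeek R1": 40_000,          # long reasoning trace
                          "GPT-4.1": 4_000}               # no thinking tokens
    for model, price in price_per_m.items():
        cost = price * tokens_per_problem[model] / 1_000_000
        print(f"{model}: ~${cost:.3f} per problem")
    # DeepSeek R1: ~$0.088 per problem
    # GPT-4.1: ~$0.032 per problem

So the nominally cheaper model can easily cost more per solved problem, which is why measuring the actual benchmark run cost is more informative.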
Aider author here.
Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format. And that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.
Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.
Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.
Thank you for providing such amazing tools for us. Aider is a godsend when working with a large codebase to get an overview.
Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark in evaluating overall model quality for use in other tools or contexts, as people use it for today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" is going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff" so I chose the "diff" score. Hard to make a real apples-to-apples comparison.
The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with official performance numbers. Still not sure it makes sense to quote that new score next to the others. In any case Gemini's 69% is the top score even without a special mode.
[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...
The mode wasn't added after the announcement, Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320
This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.
OK but it was still added specifically to improve Gemini and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others. They use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...
They just pick the best performer out of the built-in modes they offer.
Interesting data point about the model's behavior, but even more so it's a recommendation of which way to configure the model for optimal performance.
I do consider this to be an apples-to-apples benchmark, since they're evaluating real-world performance.
Yes, it is available in Cursor[1] and Windsurf[2] as well.
[1] https://twitter.com/cursor_ai/status/1911835651810738406
[2] https://twitter.com/windsurf_ai/status/1911833698825286142
It's available for free in Windsurf, so you can try it out there.
Edit: Now also in Cursor
Yup, GPT-4.1 isn't good at all compared to the others. I tried a bunch of different scenarios; for me the winners are:
- DeepSeek for general chat and research
- Claude 3.7 for coding
- Gemini 2.5 Pro Experimental for deep research
In terms of price Deepseek is still absolutely fire!
OpenAI is in trouble honestly.
One task I do is I feed the models the text of entire books, and ask them various questions about it ('what happened in Chapter 4', 'what did character X do in the book' etc.).
GPT 4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines, and character motivations accurately.
I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:
- telling the model to be persistent (+20%)
- don't self-inject/parse tool calls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
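Stitched together, those recommendations come out to something like the skeleton below. A sketch based on my reading of the guide; the wording and tag names are mine, not OpenAI's exact template. Tool calls themselves would go through the API's native tools parameter rather than being parsed out of the text, per the "don't self-inject/parse tool calls" point.

    # Rough prompt skeleton following the guide's advice: persistence,
    # prompted planning, XML-delimited context, and the instructions
    # repeated at both the top and the bottom. Wording/tag names illustrative.
    INSTRUCTIONS = (
        "You are an agent: keep going until the user's request is fully "
        "resolved before ending your turn (persistence).\n"
        "Plan extensively before each tool call and reflect on the results."
    )

    def build_prompt(task: str, context_docs: list[str]) -> str:
        docs = "\n".join(
            "<doc id='{}'>\n{}\n</doc>".format(i, d)
            for i, d in enumerate(context_docs)
        )
        return (
            INSTRUCTIONS + "\n\n"                        # instructions at the TOP
            + "<context>\n" + docs + "\n</context>\n\n"  # XML rather than JSON
            + "<task>\n" + task + "\n</task>\n\n"
            + INSTRUCTIONS                               # ...repeated at the BOTTOM
        )

    print(build_prompt("Summarize the open bugs.", ["bug 1: ...", "bug 2: ..."]))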
As an aside, one of the worst aspects of the rise of LLMs, for me, has been the wholesale replacement of engineering with trial-and-error hand-waving. Try this, or maybe that, and maybe you'll see a +5% improvement. Why? Who knows.
It's just not how I like to work.
I think trial-and-error hand-waving isn't all that far from experimentation.
As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.
No one knew how to best use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and were giving similar "maybe do this and maybe see x% improvement?" advice. There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.
Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.
Yes, it was the generation of the X360 and PS3. X360 was 3 core and the PS3 was 1+7 core (sort of a big.little setup).
Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67 ms / 8.33 ms budget and rendering tied to world state, it was just really hard not to tie everything into each other.
Even today you'll usually only see 2-4 cores actually getting significant load.
Performance optimization is different, because there's still some kind of baseline truth. Everyone knows what an FPS is, and +5% FPS is +5% FPS. Even the tricky cases have some kind of boundary (+5% FPS on this hardware but -10% on this other hardware, +2% on scenes meeting these conditions but -3% otherwise, etc.).
Meanwhile, nobody can agree on what a "good" LLM is, let alone how to measure it.
There probably was still a structured way to test this, through cross-hatching, but yeah, blind guessing might take longer and arrive at the same solution.
I feel like this a common pattern with people who work in STEM. As someone who is used to working with formal proofs, equations, math, having a startup taught me how to rewire myself to work with the unknowns, imperfect solutions, messy details. I'm going on a tangent, but just wanted to share.
The disadvantage is that LLMs are probabilistic, mercurial, unreliable.
The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.
If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice, etc.), and have the luxury of well-defined schemas, you're not going to see the advantage side.
Software engineering has involved a lot of people doing trial-and-error hand-waving for at least a decade. We are now codifying the trend.
Out of curiosity, what do you work on where you don’t have to experiment with different solutions to see what works best?
Usually when we’re doing it in practice there’s _somewhat_ more awareness of the mechanics than just throwing random obstructions in and hoping for the best.
LLMs are still very young. We'll get there in time. I don't see how it's any different than optimizing for new CPU/GPU architectures other than the fact that the latter is now a decades-old practice.
Not to pick on you, but this is exactly the objectionable handwaving. What makes you think we'll get there? The kinds of errors that these technologies make have not changed, and anything that anyone learns about how to make them better changes dramatically from moment to moment and no one can really control that. It is different because those other things were deterministic ...
In comp sci it’s been deterministic, but in other science disciplines (eg medicine) it’s not. Also in lots of science it looks non-deterministic until it’s not (eg medicine is theoretically deterministic, but you have to reason about it experimentally and with probabilities - doesn’t mean novel drugs aren’t technological advancements).
And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.
The problem has always been that every token is suspect.
It's the whole answer being correct that's the important thing, and if you compare GPT 3 vs where we are today only 5 years later the progress in accuracy, knowledge and intelligence is jaw dropping.