GPT-4.1 in the API
openai.com
678 points by maheshrijal 5 days ago
As a ChatGPT user, I'm weirdly happy that it's not available there yet. I already have to make a conscious choice between
- 4o (can search the web, use Canvas, evaluate Python server-side, generate images, but has no chain of thought)
- o3-mini (web search, CoT, canvas, but no image generation)
- o1 (CoT, maybe better than o3, but no canvas or web search and also no images)
- Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
- 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
- 4o "with scheduled tasks" (why on earth is that a model and not a tool that the other models can use!?)
Why do I have to figure all of this out myself?
> - Deep Research (very powerful, but I have only 10 attempts per month, so I end up using roughly zero)
Same here, which is a real shame. I've switched to DeepResearch with Gemini 2.5 Pro over the last few days where paid users have a 20/day limit instead of 10/month and it's been great, especially since now Gemini seems to browse 10x more pages than OpenAI Deep Research (on the order of 200-400 pages versus 20-40).
The reports are too verbose but having it research random development ideas, or how to do something particularly complex with a specific library, or different approaches or architectures to a problem has been very productive without sliding into vibe coding territory.
Wow, I wondered what the limit was. I never checked, but I've been using it hesitantly since I burn up OpenAI's limit as soon as it resets. Thanks for the clarity.
I'm all-in on Deep Research. It can conduct research on niche historical topics that have no central articles in minutes, which typically were taking me days or weeks to delve into.
I like Deep Research, but as a historian I have to tell you: I've used it on history topics to calibrate my expectations, and it is a nice tool, but... it can easily brush over nuanced discussions and just return folk wisdom from blogs.
What I love most about history is it has lots of irreducible complexity and poring over the literature, both primary and secondary sources, is often the only way to develop an understanding.
I read Being and Time recently and it has a load of concepts that are defined iteratively. There's a lot wrong with how it's written, but it's an unfinished book written a hundred years ago, so I can't complain too much.
Because it's quite long, if I asked Perplexity* to remind me what something meant, it would very rarely return something helpful. But, to be fair, I can't really fault it for being a bit useless with a very difficult-to-comprehend text, where there are several competing styles of reading, many of which are convinced they are correct.
But I started to notice a pattern of it pulling answers from some weird spots, especially when I asked it to do deep research. Like, a paper from a university's server that uses concepts in the book to ground qualitative research, which is fine, and practical explications are often useful ways into a dense concept, but it's a really weird place to be the first academic source. It'll draw on Reddit a weird amount too, or it'll somehow pull a page of definitions from a handout for some university tutorial. And it won't default to the peer-reviewed free philosophy encyclopedias that are online and well known.
It's just weird. I was just using it to try and reinforce my actual reading of the text, but I came away thinking that in certain domains, this end of AI is allowing people to conflate having access to information with learning about something.
*it's just what I have access to.
If you're asking an LLM about a particular text, even if it's a well-known text, you might get significantly better results if you provide said text as part of your prompt (context) instead of asking a model to "recall it from memory".
So something like this: "Here's a PDF file containing Being and Time. Please explain the significance of anxiety (Angst) in the uncovering of Being."
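Concretely, via the API that just means pasting the full text into the prompt. A rough sketch with the OpenAI Python SDK (the file name, model choice, and system prompt here are mine, purely illustrative):

    # Minimal sketch: supply the text as context instead of relying on the
    # model's memory of it. File path and model name are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("being_and_time.txt", encoding="utf-8") as f:
        book = f.read()  # has to fit in the model's context window

    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": "Answer only from the provided text."},
            {
                "role": "user",
                "content": (
                    "<book>\n" + book + "\n</book>\n\n"
                    "Explain the significance of anxiety (Angst) "
                    "in the uncovering of Being."
                ),
            },
        ],
    )
    print(resp.choices[0].message.content)

In a chat UI, attaching the PDF accomplishes the same thing; the point is that the model answers from the supplied text rather than from whatever it absorbed in training.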
When I've wanted it to not do things like this, I've had good luck directing it to... not look at those sources.
For example when I've wanted to understand an unfolding story better than the news, I've told it to ignore the media and go only to original sources (e.g. speech transcripts, material written by the people involved, etc.)
Deep Search is pretty good for current news stories. I've had it analyze some legal developments in a European nation recently and it gave me a great overview.
That use case seems pretty self-defeating, when a good news source will usually try to at least validate first-party materials, which an LLM cannot do.
LLMs seem fantastic at generalizing broad thought and not great at outliers. They sort of smooth over the knowledge curve confidently, which is a bit like in psychology, where only CBT therapy is accepted even though there are many much more effective methodologies for individuals, just not at the population level.
Interesting use case. My problem is that for niche subjects the crawled pages probably haven't captured the information and the response becomes irrelevant. Perhaps Gemini will produce better results just because it takes into account many more pages.
I also like Perplexity’s 3/day limit! If I use them up (which I almost never do) I can just refresh the next day
I've only ever had to use DeepResearch for academic literature review. What do you guys use it for which hits your quotas so quickly?
I use it for mundane shit that I don’t want to spend hours doing.
My son and I go to a lot of concerts and collect patches. Unfortunately we started collecting long after we started going to concerts.
I had a list of about 30 bands I wanted patches for.
I was able to give precise instructions on what I wanted. Deep research came back with direct links for every patch I wanted.
It took me two minutes to write up the prompt and it did all the heavy lifting.
I use them as follows:
o1-pro: anything important involving accuracy or reasoning. Does the best at accomplishing things correctly in one go even with lots of context.
deepseek R1: anything where I want high quality non-academic prose or poetry. Hands down the best model for these. Also very solid for fast and interesting analytical takes. I love bouncing ideas around with R1 and Grok-3 bc of their fast responses and reasoning. I think R1 is the most creative yet also the best at mimicking prose styles and tone. I've speculated that Grok-3 is R1 with mods and think it's reasonably likely.
4o: image generation, occasionally something else but never for code or analysis. Can't wait till it can generate accurate technical diagrams from text.
o3-mini-high and grok-3: code or analysis that I don't want to wait for o1-pro to complete.
claude 3.7: occasionally for code if the other models are making lots of errors. Sometimes models will anchor to outdated information in spite of being informed of newer information.
gemini models: occasionally I test to see if they are competitive, so far not really, though I sense they are good at certain things. Excited to try 2.5 Deep Research more, as it seems promising.
Perplexity: discontinued subscription once the search functionality in other models improved.
I'm really looking forward to o3-pro. Let's hope it's available soon as there are some things I'm working on that are on hold waiting for it.
Phind was fine-tuned specifically to produce inline Mermaid diagrams for technical questions (I'm the founder).
I really loved Phind and always think of it as the OG perplexity / RAG search engine.
Sadly I stopped my subscription when you removed the ability to weight my own domains...
Otherwise the fine-tune of your output format for technical questions is great, with the options, the pros/cons, and the Mermaid diagrams. Just way better for technical searches than what all the generic services can provide.
Gemini 2.5 Pro is quite good at code.
Has become my go to for use in Cursor. Claude 3.7 needs to be restrained too much.
Same here, 2.5 Pro is very good at coding. But it's also cocky and blames everything but itself when something isn't working. E.g. "the linter must be wrong, you should reinstall it", "looks to be a problem with the Go compiler", "this function HAS to exist, that's weird that we're getting an error"
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
Weird. For me, Sonnet 3.7 is much more focused and in particular works much better at finding the places that need changes and at using other tooling. I guess the integration in Cursor is just much better and more mature.
This. sonnet 3.7 is a wild horse. Gemini 2.5 Pro is like a 33 yo expert. o1 feels like a mature, senior colleague.
I find that Gemini 2.5 Pro tends to produce working but over-complicated code more often than Claude 3.7.
Which might be a side-effect of the reasoning.
In my experience whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought which show up in the solution.
In contrast a human would come up with a solution with 2-3 steps. Perhaps something similar is going on here with the generated code.
You probably know this but it can already generate accurate diagrams. Just ask for the output in a diagram language like mermaid or graphviz
My experience is it often produces terrible diagrams. Things clearly overlap, lines make no sense. I'm not surprised, as if you told me to lay out a diagram in XML/YAML there would be obvious mistakes and layout issues.
I'm not really certain a text output model can ever do well here.
FWIW I think a multimodal model could be trained to do extremely well with it given sufficient training data. A combination of textual description of the system and/or diagram, source code (mermaid, SVG, etc.) for the diagram, and the resulting image, with training to translate between all three.
Agreed. Even with a simple setup, I'm sure a service like this already exists (or could easily exist) where the workflow is something like:
1. User provides information
2. LLM generates structured output for whatever modeling language
3. The same or another multimodal LLM reviews the generated graph for styling/positioning issues and ensures it matches the user request.
4. LLM generates structured output based on the feedback.
5. etc...
But you could probably fine-tune a multimodal model to do it in one shot, or way more effectively.
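That loop is easy enough to sketch even without fine-tuning. Purely illustrative, assuming mermaid-cli ("mmdc") for rendering and a vision-capable model via the OpenAI Python SDK; the prompts and model name are mine, not an existing service:

    # Sketch of the generate -> render -> review -> regenerate loop above.
    # Assumes mermaid-cli ("mmdc") is installed; model name and prompts are
    # illustrative placeholders.
    import base64
    import subprocess
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4.1"  # placeholder; any vision-capable model would do

    def generate_mermaid(request: str, feedback: str = "") -> str:
        prompt = request + "\nReply with only a Mermaid diagram, no prose."
        if feedback:
            prompt += "\nFix these issues from the previous attempt: " + feedback
        resp = client.chat.completions.create(
            model=MODEL, messages=[{"role": "user", "content": prompt}]
        )
        return resp.choices[0].message.content

    def review_render(png_path: str, request: str) -> str:
        with open(png_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Does this diagram match the request '" + request +
                             "'? List overlap/layout problems, or reply OK."},
                    {"type": "image_url",
                     "image_url": {"url": "data:image/png;base64," + b64}},
                ],
            }],
        )
        return resp.choices[0].message.content

    request = "Request flow: client -> load balancer -> two app servers -> Postgres"
    feedback = ""
    for _ in range(3):  # bounded number of refinement rounds
        with open("diagram.mmd", "w") as f:
            f.write(generate_mermaid(request, feedback))
        subprocess.run(["mmdc", "-i", "diagram.mmd", "-o", "diagram.png"], check=True)
        feedback = review_render("diagram.png", request)
        if feedback.strip().upper().startswith("OK"):
            break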
I had a latex tikz diagram problem which sonnet 3.7 couldn't handle even after 10 attempts. Gemini 2.5 Pro solved it on the second try.
Had the same experience. o3-mini failed miserably, Claude 3.7 as well, but Gemini 2.5 Pro solved it perfectly. (The task: going from an image of a diagram, with no source, to a TikZ diagram.)
I've had mixed and inconsistent results and it hasn't been able to iterate effectively when it gets close. Could be that I need to refine my approach to prompting. I've tried mermaid and SVG mostly, but will also try graphviz based on your suggestion.
You probably know this and are looking for consistency, but a little trick I use is to feed in the original data of what I need as a diagram and ask it to re-imagine it as an image "ready for print". Not native, but still a time saver, and it handles unstructured data surprisingly well. Again, not native... naive, yes; native, not yet. Be sure to double-check, triple-check as always; give it the ol' OCD treatment.
Gemini 2.5 is very good. Since you have to wait for reasoning tokens, it takes longer to come back, but the responses are high quality IME.
re: "grok-3 is r1 with mods" -- do you mean you believe they distilled deepseek r1? that was my assumption as well, though i thought it more jokingly at first it would make a lot of sense. i actually enjoy grok 3 quite a lot, it has some of the most entertaining thinking traces.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers
Ha! That's the funniest and best description of 4.5 I've seen.
> 4.5 (better in creative writing, and probably warmer sound thanks to being vinyl based and using analog tube amplifiers, but slower and request limited, and I don't even know which of the other features it supports)
Is that an LLM hallucination?
It’s a tongue-in-cheek reference to how audiophiles claim to hear differences in audio quality.
Pretty dark times on HN, when a silly (and obvious) joke gets someone labeled as AI.
Switch to Gemini 2.5 Pro, and be happy. It's better in every aspect.
It's somehow not, I've been asking it the same questions as ChatGPT and the answers feel off.
Warning to potential users: it's Google.
Not sure how or why OpenAI would be any better?
It's not. It's closed source. But Google is still the worst when it comes to privacy.
I prefer to use only open source models that don't have the possibility to share my data with a third party.
The notion that Google is worse at carefully managing PII than a Wild West place like OpenAI (or Meta, or almost any major alternative) is…not an accurate characterization, in my experience. Ad tech companies (and AI companies) obsessively capture data, but Google internally has always been equally obsessive about isolating and protecting that data. Almost no one can touch it; access is highly restricted and carefully managed; anything that even smells adjacent to ML on personal data has gotten high-level employees fired.
Fully private and local inference is indeed great, but of the centralized players, Google, Microsoft, and Apple are leagues ahead of the newer generation in conservatism and care around personal data.
For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.
My thought is that this model release is driven by this year's agentic app push. To my knowledge, all the big agentic apps (Cursor, Bolt, Shortwave) use Claude 3.7 because it's so much better at instruction following and tool calling than GPT-4o, so this model feels like GPT-4o (or a distilled 4.5?) with some post-training focused on what these agentic workloads need most.
Hey, also try out Monday, it did something pretty cool. It's a version of 4o which switches between reasoning and plain token generation on the fly. My guess is that's what GPT V will be.
I'm also very curious about the limits for each model. I never thought about limits before upgrading my plan.
Disagree. It's really not complicated at all to me. Not sure why people make a big fuss over this. I don't want an AI automating which AI it chooses for me. I already know through lots of testing intuitively which one I want.
If they abstract all this away into one interface I won't know which model I'm getting. I prefer reliability.
I do like the vinyl and analog amplifiers. I certainly hear the warmth in this case.
What's hilarious to me is that I asked ChatGPT about the model names and approaches, and it did a better job than they have.
Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, output tokens per second, and knowledge cutoff month/year:
             SWE  Aider  Cost  Fast  Fresh
Claude 3.7   70%   65%   $15    77    8/24
Gemini 2.5   64%   69%   $10   200    1/25
GPT-4.1      55%   53%   $8    169    6/24
DeepSeek R1  49%   57%   $2.2   22    7/24
Grok 3 Beta   ?    53%   $15     ?   11/24
I'm not sure this is really an apples-to-apples comparison as it may involve different test scaffolding and levels of "thinking". Tokens per second numbers are from here: https://artificialanalysis.ai/models/gpt-4o-chatgpt-03-25/pr... and I'm assuming 4.1 is the speed of 4o given the "latency" graph in the article putting them at the same latency.
Is it available in Cursor yet?
I just finished updating the aider polyglot leaderboard [0] with GPT-4.1, mini and nano. My results basically agree with OpenAI's published numbers.
Results, with other models for comparison:
Model                         Score    Cost
Gemini 2.5 Pro Preview 03-25  72.9%  $ 6.32
claude-3-7-sonnet-20250219    64.9%  $36.83
o3-mini (high)                60.4%  $18.16
Grok 3 Beta                   53.3%  $11.03
* gpt-4.1                     52.4%  $ 9.86
Grok 3 Mini Beta (high)       49.3%  $ 0.73
* gpt-4.1-mini                32.4%  $ 1.99
gpt-4o-2024-11-20             18.2%  $ 6.74
* gpt-4.1-nano                 8.9%  $ 0.43
Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.
Did you benchmark the combo DeepSeek R1 + DeepSeek V3 (0324)? There is a combo in 3rd place, DeepSeek R1 + claude-3-5-sonnet-20241022, and the new V3 beats Claude 3.5, so in theory R1 + V3 should be in 2nd place. Just curious if that would be the case.
What model are you personally using in your aider coding? :)
Mostly Gemini 2.5 Pro lately.
I get asked this often enough that I have a FAQ entry with automatically updating statistics [0].
Model           Tokens     Pct
Gemini 2.5 Pro  4,027,983  88.1%
Sonnet 3.7        518,708  11.3%
gpt-4.1-mini       11,775   0.3%
gpt-4.1            10,687   0.2%
[0] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...
https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro?
Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.
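To put rough numbers on it (the per-million prices are from the table upthread; the per-problem token counts are made up for illustration):

    # Why per-token price can mislead: thinking tokens dominate per-problem cost.
    price_per_m = {"DeepSeek R1": 2.2, "GPT-4.1": 8.0}   # $ per 1M output tokens
    tokens_per_problem = {"DeepSeek R1": 40_000,          # long reasoning trace
                          "GPT-4.1": 4_000}               # no thinking tokens
    for model, price in price_per_m.items():
        cost = price * tokens_per_problem[model] / 1_000_000
        print(f"{model}: ~${cost:.3f} per problem")
    # DeepSeek R1: ~$0.088 per problem
    # GPT-4.1: ~$0.032 per problem

So the nominally cheaper model can easily cost more per solved problem, which is why measuring the actual benchmark run cost is more informative.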
Aider author here.
Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format. And that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.
Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.
Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.
Thank you for providing such amazing tools for us. Aider is a godsend when working with a large codebase to get an overview.
Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark in evaluating overall model quality for use in other tools or contexts, as people use it for today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.
I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" is going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.
There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff" so I chose the "diff" score. Hard to make a real apples-to-apples comparison.
The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with official performance numbers. Still not sure it makes sense to quote that new score next to the others. In any case Gemini's 69% is the top score even without a special mode.
[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...
The mode wasn't added after the announcement, Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320
This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.
OK but it was still added specifically to improve Gemini and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others. They use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...
They just pick the best performer out of the built-in modes they offer.
Interesting data point about the model's behavior, but even more so it's a recommendation of which way to configure the model for optimal performance.
I do consider this to be an apples-to-apples benchmark, since they're evaluating real-world performance.
Yes, it is available in Cursor[1] and Windsurf[2] as well.
[1] https://twitter.com/cursor_ai/status/1911835651810738406
[2] https://twitter.com/windsurf_ai/status/1911833698825286142
It's available for free in Windsurf, so you can try it out there.
Edit: Now also in Cursor
Yup, GPT-4.1 isn't good at all compared to the others. I tried a bunch of different scenarios; for me the winners are:
- DeepSeek for general chat and research
- Claude 3.7 for coding
- Gemini 2.5 Pro Experimental for deep research
In terms of price Deepseek is still absolutely fire!
OpenAI is in trouble honestly.
One task I do is I feed the models the text of entire books, and ask them various questions about it ('what happened in Chapter 4', 'what did character X do in the book' etc.).
GPT 4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines, and character motivations accurately.
I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.
don't miss that OAI also published a prompting guide WITH RECEIPTS for GPT 4.1 specifically for those building agents... with a new recommendation for:
- telling the model to be persistent (+20%)
- don't self-inject/parse tool calls (+2%)
- prompted planning (+4%)
- JSON BAD - use XML or arxiv 2406.13121 (GDM format)
- put instructions + user query at TOP -and- BOTTOM - bottom-only is VERY BAD
- no evidence that ALL CAPS or Bribes or Tips or threats to grandma work
source: https://cookbook.openai.com/examples/gpt4-1_prompting_guide#...
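Stitched together, those recommendations come out to something like the skeleton below. A sketch based on my reading of the guide; the wording and tag names are mine, not OpenAI's exact template. Tool calls themselves would go through the API's native tools parameter rather than being parsed out of the text, per the "don't self-inject/parse tool calls" point.

    # Rough prompt skeleton following the guide's advice: persistence,
    # prompted planning, XML-delimited context, and the instructions
    # repeated at both the top and the bottom. Wording/tag names illustrative.
    INSTRUCTIONS = (
        "You are an agent: keep going until the user's request is fully "
        "resolved before ending your turn (persistence).\n"
        "Plan extensively before each tool call and reflect on the results."
    )

    def build_prompt(task: str, context_docs: list[str]) -> str:
        docs = "\n".join(
            "<doc id='{}'>\n{}\n</doc>".format(i, d)
            for i, d in enumerate(context_docs)
        )
        return (
            INSTRUCTIONS + "\n\n"                        # instructions at the TOP
            + "<context>\n" + docs + "\n</context>\n\n"  # XML rather than JSON
            + "<task>\n" + task + "\n</task>\n\n"
            + INSTRUCTIONS                               # ...repeated at the BOTTOM
        )

    print(build_prompt("Summarize the open bugs.", ["bug 1: ...", "bug 2: ..."]))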
As an aside, one of the worst aspects of the rise of LLMs, for me, has been the wholesale replacement of engineering with trial-and-error hand-waving. Try this, or maybe that, and maybe you'll see a +5% improvement. Why? Who knows.
It's just not how I like to work.
I think trial-and-error hand-waving isn't all that far from experimentation.
As an aside, I was working in the games industry when multi-core was brand new. Maybe Xbox-360 and PS3? I'm hazy on the exact consoles but there was one generation where the major platforms all went multi-core.
No one knew how to best use the multi-core systems for gaming. I attended numerous tech talks by teams that had tried different approaches and were giving similar "maybe do this and maybe see x% improvement?" advice. There was a lot of experimentation. It took a few years before things settled and best practices became even somewhat standardized.
Some people found that era frustrating and didn't like to work in that way. Others loved the fact it was a wide open field of study where they could discover things.
Yes, it was the generation of the X360 and PS3. X360 was 3 core and the PS3 was 1+7 core (sort of a big.little setup).
Although it took many, many more years until games started to actually use multi-core properly. With rendering being on a 16.67 ms / 8.33 ms budget and rendering tied to world state, it was just really hard not to tie everything into each other.
Even today you'll usually only see 2-4 cores actually getting significant load.
Performance optimization is different, because there's still some kind of baseline truth. Everyone knows what an FPS is, and +5% FPS is +5% FPS. Even the tricky cases have some kind of boundary (+5% FPS on this hardware but -10% on this other hardware, +2% on scenes meeting these conditions but -3% otherwise, etc.).
Meanwhile, nobody can agree on what a "good" LLM is, let alone how to measure it.
There probably was still a structured way to test this, through cross-hatching, but yeah, blind guessing might take longer and arrive at the same solution.
I feel like this a common pattern with people who work in STEM. As someone who is used to working with formal proofs, equations, math, having a startup taught me how to rewire myself to work with the unknowns, imperfect solutions, messy details. I'm going on a tangent, but just wanted to share.
The disadvantage is that LLMs are probabilistic, mercurial, unreliable.
The advantage is that humans are probabilistic, mercurial and unreliable, and LLMs are a way to bridge the gap between humans and machines that, while not wholly reliable, makes the gap much smaller than it used to be.
If you're not making software that interacts with humans or their fuzzy outputs (text, images, voice, etc.), and have the luxury of well-defined schemas, you're not going to see the advantage side.
Software engineering has involved a lot of people doing trial-and-error hand-waving for at least a decade. We are now codifying the trend.
Out of curiosity, what do you work on where you don’t have to experiment with different solutions to see what works best?
Usually when we’re doing it in practice there’s _somewhat_ more awareness of the mechanics than just throwing random obstructions in and hoping for the best.
LLMs are still very young. We'll get there in time. I don't see how it's any different than optimizing for new CPU/GPU architectures other than the fact that the latter is now a decades-old practice.
Not to pick on you, but this is exactly the objectionable handwaving. What makes you think we'll get there? The kinds of errors that these technologies make have not changed, and anything that anyone learns about how to make them better changes dramatically from moment to moment and no one can really control that. It is different because those other things were deterministic ...
In comp sci it’s been deterministic, but in other science disciplines (eg medicine) it’s not. Also in lots of science it looks non-deterministic until it’s not (eg medicine is theoretically deterministic, but you have to reason about it experimentally and with probabilities - doesn’t mean novel drugs aren’t technological advancements).
And while the kind of errors hasn’t changed, the quantity and severity of the errors has dropped dramatically in a relatively short span of time.
The problem has always been that every token is suspect.
It's the whole answer being correct that's the important thing, and if you compare GPT 3 vs where we are today only 5 years later the progress in accuracy, knowledge and intelligence is jaw dropping.