Anthropic's original take home assignment open sourced
github.com | 221 points by myahio | 5 hours ago
Naively tested a set of agents on this task.
Each ran the same spec headlessly in their native harness (one shot).
Results:
Agent                         Cycles     Time
─────────────────────────────────────────────
gpt-5-2                        2,124      16m
claude-opus-4-5-20251101       4,973    1h 2m
gpt-5-1-codex-max-xhigh        5,402      34m
gpt-5-codex                    5,486       7m
gpt-5-1-codex                 12,453       8m
gpt-5-2-codex                 12,905       6m
gpt-5-1-codex-mini            17,480       7m
claude-sonnet-4-5-20250929    21,054      10m
claude-haiku-4-5-20251001    147,734       9m
gemini-3-pro-preview         147,734       3m
gpt-5-2-codex-xhigh          147,734      25m
gpt-5-2-xhigh                147,734      34m
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".

Very interesting, thanks! I wonder what would happen if you kept running Gemini in a loop for a while. Considering how much faster it finished, it seems like there is a lot more potential.
codex cli + gpt-5-2-codex-xhigh got to 1606 with the prompt "beat 1487 cycles. go." in ~53 minutes.
Could you make a repo with solutions given by each model inside a dir/branch for comparison?
I do wonder how Grok would compare, specifically their Grok Code Fast model.
I suspect this was released by Anthropic as a DDOS attack on other AI companies. I prompted 'how do we solve this challenge?' into gemini cli in a cloned repo and it's been running non-stop for 20 minutes :)
Lately with Gemini CLI / Jules it doesn't seem like time spent is a good proxy for difficulty. It has a big problem with getting into loops of "I am preparing the response for the user. I am done. I will output the answer. I am confident. Etc etc".
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
Which Gemini model did you use? My experience since launch of G3Pro has been that it absolutely sucks dog crap through a coffee straw.
/model: Auto (Gemini 3), which lets Gemini CLI decide the best model for the task (gemini-3-pro, gemini-3-flash).
After ~40 minutes, it got to:
> The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
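For anyone else in the same boat: these are standard low-level tricks rather than anything exotic. Loop unrolling, for example, just amortizes per-iteration overhead (the index update and branch) over several elements. A toy Python sketch, purely illustrative and not code from the repo:

    def sum_naive(xs):
        total = 0
        for x in xs:              # pays loop overhead once per element
            total += x
        return total

    def sum_unrolled_by_4(xs):
        total = 0
        i, n = 0, len(xs)
        while i + 4 <= n:         # four elements per trip through the loop
            total += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
            i += 4
        while i < n:              # leftover tail
            total += xs[i]
            i += 1
        return total

On a machine that charges you per issued instruction (or per cycle, as here), cutting the loop bookkeeping by 4x matters a lot more than it does in CPython.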
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
> sucks dog crap through a coffee straw.
That would be impressive.
New LLM benchmark incoming? I bet once it's done, people will still say it's not AGI.
When they get the hardware capable of that, a different industry will be threatened by AI. The oldest industry.
I consider myself rather smart and good at what I do. It's nice to have a look at problems like these once in a while, to remind myself of how little I know, and how much closer I am to the average than to the top.
Well, it is a specialized problem. If you've never worked on anything similar before, it is going to take time. You don't even need to interview for selective billion-dollar companies like Anthropic to encounter these types of problems - after college I interviewed at various electronics/hardware companies where you'd get asked to optimize low-level code, which would have looked quite foreign if you had never actually worked on such problems before.
I'm 30 years in, and literally don't understand the question.
As I get it from a quick glance (so I could be awfully wrong): there's a VLIW VM implementation, but it's simple, naive, and slow. The task is to squeeze some performance out of it.
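Roughly the shape of it, as far as I can tell (a toy sketch with a made-up two-slot machine, not the actual ISA from perf_takehome.py): a VLIW machine executes a whole bundle of independent operations per cycle, so cycle count is basically bundle count, and the optimization game is packing as many non-conflicting ops into each bundle as the dependencies allow.

    # Cost model: cycles == number of bundles issued (hypothetical machine).

    # Naive: one op per bundle -> 4 cycles for c = (a+b)*2 and f = (d+e)*2
    naive_schedule = [
        [("add", "t0", "a", "b")],
        [("mul", "c", "t0", 2)],
        [("add", "t1", "d", "e")],
        [("mul", "f", "t1", 2)],
    ]

    # The two dependency chains are independent, so pairs of ops can share
    # a bundle (assuming two ALU slots per bundle) -> 2 cycles instead of 4.
    packed_schedule = [
        [("add", "t0", "a", "b"), ("add", "t1", "d", "e")],
        [("mul", "c", "t0", 2), ("mul", "f", "t1", 2)],
    ]

    print(len(naive_schedule), "cycles naive,", len(packed_schedule), "cycles packed")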
Just a niche task that requires specialized knowledge, it’s totally fine if you aren’t familiar with its particular domain.
disagree. nobody has a monopoly on what metric makes someone good. I don't understand all this leet code optimization. actually i do understand it, but it's a game that will attract game optimizers.
the hot take is, there are other games.
Also, leetcode does not really provide insight into one's ability to design business solutions, whether that's system design, a small feature implementation, or communication skills within a team. It's just optimizers jerking each other off over cryptic problems 99.999999999% of developers will never see in real life. Maybe it would've been useful 30 years ago, but all commonly used languages have these fancy algorithms baked into their stdlib, so why would I ever have to implement them myself?
Having done a bunch of take-homes for big (and small) AI labs during interviews, this is the 2nd most interesting one I have seen so far.
Having recently learned more about SIMD, PTX and optimization techniques, this is a nice little challenge to learn even more.
As a take-home assignment, though, I would have failed: I would probably have spent 2 hours just sketching out ideas and more on my tablet while reading the code, before even changing it.
Unless I misread, 2 hours isn't the time limit for the candidate to do this but the time Claude eventually needed to outperform the best solution a candidate returned. The best candidate could've taken anywhere from 6 hours to 2 days to achieve this result.
It's pretty interesting how close this assignment looks to demoscene [1] golf [2].
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
it's designed to select for people who can be trusted to manually write ptx :-)
The writing has been on the wall for about half a year (publicly) now. OpenAI's 2nd place at the AtCoder world championship was the first sign, and I remember it being dismissed at the time. Sakana also got 1st place in another AtCoder competition a few weeks ago. Google also published a blog post a few months back about Gemini 2.5 netting them a 1% reduction in training time on real-world tasks by optimising kernels.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
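And this problem is about as good a feedback loop as it gets, since test_kernel_cycles gives an exact, cheap, deterministic score. The outer loop is basically just this (a rough sketch; measure_cycles and propose are made-up stand-ins, not Anthropic's actual harness):

    import random

    def measure_cycles(params):
        # Stand-in for the real scorer (test_kernel_cycles in the repo):
        # here just a made-up function of one knob so the sketch runs.
        return 150_000 // params["unroll"] + random.randint(0, 500)

    def propose(params):
        # Stand-in for "ask the model for a patch": tweak one knob at random.
        new = dict(params)
        new["unroll"] = max(1, new["unroll"] + random.choice([-1, 1, 2]))
        return new

    best_params = {"unroll": 1}
    best_cycles = measure_cycles(best_params)

    for _ in range(1_000):                 # bang tokens against the wall
        candidate = propose(best_params)
        cycles = measure_cycles(candidate)
        if cycles < best_cycles:           # cheap, exact verification
            best_params, best_cycles = candidate, cycles

    print(best_params, best_cycles)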
Oh wow it’s by Tristan Hume, still remember you from EyeLike!
> This repo contains a version of Anthropic's original performance take-home, before Claude Opus 4.5 started doing better than humans given only 2 hours.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
“If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.”
The company that wanted to simply get away with the thievery of terabytes of intellectual property, what a great place to work at! Not. Anthropic has no shame.
> at launch
Does this confirm they actually do kneecap models after the launch period to save money, without telling users?
No, they later updated the harness for this and it subsequently got better scores.
What is the actual assignment here?
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator." from perf_takehome.py
Think that means you failed :(
+1
being cryptic and poorly specified is part of the assignment
just like real code
in fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.
I didn't see much that was cryptic, except having to click on "perf_takehome.py" without being told to. But 2 hours didn't seem like much to bring the sample code into some kind of test environment, debug it enough to work out details of its behaviour, read through the reference kernel and get some idea of what the algorithm is doing, read through the simulator to understand the VM instruction set, understand the test harness enough to see how the parallelism works, re-code the algorithm in the VM's machine language while iterating performance tweaks and running simulations, etc.
Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
My instinct to read about the problem was to open the "problem.py" file, which states "Read the top of perf_takehome.py for more introduction"
So yeah. They _could_ have written it much more clearly in the readme.
2 hours does seem short. It took me a half hour to get through all you listed and figure out how to get the valu instruction working.
I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
Idk maybe I'm slow or really not qualified.
it's "cryptic" for an interview problem. e.g. the fact that you have to actually look at the vm implementation instead of having the full documentation of the instruction set from the get go.
That seems normal for an interview problem. They put you in front of some already-written code and you have to fix a bug or implement a feature. I've done tons of those in live interviews. So that part didn't bother me. It's mostly the rather large effort cost in the case where the person is a job applicant, vs an unknown and maybe quite low chance of getting hired.
With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
There's another thread about this article, which explains an analogous situation about being asked to read AI slop: https://zanlib.dev/blog/reliable-signals-of-honest-intent/
It's definitely cleaner than what you will see in the real world. Research-quality repositories written in partial Chinese with key dependencies missing are common.
IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is because it would make evaluation harder + more costly.
Going through the assignment now. Man it’s really hard to pack the vectors right
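Same here. For anyone who hasn't opened it yet, the flavor of the packing problem is roughly this (toy Python with a made-up lane width, not the repo's actual ISA): vector ops only pay off if you can line the data up into full lanes, and the bookkeeping for ragged or misaligned groups eats into the win.

    LANES = 8  # hypothetical vector width

    def pack(xs, pad=0):
        """Group a flat list into full vectors, padding the ragged tail."""
        groups = []
        for i in range(0, len(xs), LANES):
            chunk = xs[i:i + LANES]
            chunk = chunk + [pad] * (LANES - len(chunk))  # tail handling is the annoying part
            groups.append(chunk)
        return groups

    data = list(range(19))     # 19 elements -> 2 full vectors + 1 padded tail
    vectors = pack(data)
    # 3 "vector ops" instead of 19 scalar ops, *if* the padding doesn't change the result
    print(len(vectors), vectors[-1])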
I wonder if the AI is doing anything novel? Or if it's more like a brute force search, applying all the types of optimizations that already exist and have been written about.
This is a knowledge test of GPU architecture?
Kind of, but not any particular GPU.
The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
But presumably similar principles apply.
It's a test of polyhedral layout algebra, what NVIDIA calls CuTe and what the C++ standard (as of C++23) calls std::mdspan.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
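To make that concrete (an illustrative Python sketch, not CuTe's or mdspan's actual API): a layout is just a map from logical coordinates to a flat memory offset, usually a shape plus strides, and the hard part is composing and transforming those maps without violating the hardware's alignment and banking constraints.

    def make_layout(shape, strides):
        """Return a function mapping logical (row, col) to a flat offset."""
        rows, cols = shape
        def index(r, c):
            assert 0 <= r < rows and 0 <= c < cols
            return r * strides[0] + c * strides[1]
        return index

    row_major = make_layout((4, 8), strides=(8, 1))   # contiguous rows
    col_major = make_layout((4, 8), strides=(1, 4))   # contiguous columns

    # Same logical element, two different physical addresses:
    print(row_major(2, 3))   # 2*8 + 3*1 = 19
    print(col_major(2, 3))   # 2*1 + 3*4 = 14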
The snarky writing of "if you beat our best solution, send us an email and MAYBE we think about interviewing you" is really something, innit?
I feel that came out wrong but the "maybe" was intended to be a way of saying "no guarantees", to avoid giving people the idea "solve this, get hired".
In that case, removing "perhaps" would have helped a lot. It is not about maybe being hired, but about maybe being interviewed.
They wrote:
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
I suppose you could interpret it either way, but having dealt with their interview pipeline I'd choose the snark.
That paraphrases to
"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
I suspect this is partially legal CYA.
There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in an US sanctioned country with no plans to move, is well known for using the n-word on social media or has previously broken an NDA, Anthropic probably doesn't want to interview them.
I understand how it can be interpreted as snarky, but how could it have been written better? It's a hard path to walk and recruiting/interviewing is inherently sensitive it seems.
The original:
>If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
Not condescending:
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
I took the "perhaps" as a decision to be considered by the applicant, considering they'd be competent enough to get in at a place of their choice, not just Anthropic.
Does the applicant or the employer decide if an interview happens in your experience?
Do you think that, if the applicants are really in that level of demand, they would be getting a take-home test instead of being actively recruited?
Legitimately lay out your understanding of a world where an employer is chasing after employees who are in high demand, gives them a test that is expected to take hours, and hedges its wording instead of saying "we will absolutely hire you if you pass X bar".
They may not be able to hire folks in certain jurisdictions. Or even interview them. (Iran, NK)
If you're an asshole that wants millions of dollars...i mean there's still places to say no
it's Anthropic. their entire marketing is just being a pompous ass and AI fear mongering.
>so we can be appropriately impressed and perhaps discuss interviewing.
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
I have to agree. It's off-putting to me too. I'm impressed by the performance of their models on this take-home but I'm not impressed at their (perhaps unintentional) derision of human programmers.
It shocks me that anyone supposedly good enough for anthropic would subject themselves to such a one sided waste of time.
If you look at it as a puzzle game then it's not any different than the time you use to play other games.
I generally have a policy of "over 4 hours and I charge for my time." I did this in the 4-hour window, and it was a lot of fun. Much better than many other take-home assignments.
I don't do take home assignments, but when I did, I would offer to do it at my hourly rate, even if it was just an hour. It's time I would otherwise spend making money.
Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
Does help that I have a very public web presence and portfolio, though.
For many reasons, you’re not gonna get into Anthropic with that attitude.
And Anthropic will never land heavyset_go with their attitude. I guess we’re at an impasse.
Time is the issue, not money.
I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
4 hours continuous or no? I can't imagine finding 4 hours of straight focus.
These kinds of roles are for youngsters with minimal commitments who are looking for their shot to break into a wild industry. It’s not for the middle aged single parent with FTE and just enough free time to do an extra load of laundry.
I’ve been sent the Anthropic interview assignments a few times. I’m not a developer so I don’t bother. At least at the time they didn’t seem to have technical but not-dev screenings. Maybe they do now.
Care to elaborate the first part?
Did you apply for a position? Did they send you the assignment without prior discussion?
Why is writing code to execute a program using the fewest instructions possible on a virtual machine a waste of time?
The expected time you spend on it is much less than the expected time they'll spend on it.
Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
Your comments history suggests you’re rather bitter about “nerds” who are likely a few standard deviations smarter than you (Anthropic OG team, Jeff Dean, proof nerds, Linus, …)
And they’re all dumber than John von Neumann, who cares?
Transitively, you haven't thought the most thoughts or cared the most about anything, therefore we should disregard what you think and care about?
The person replying was trying to turn the conversation into some sort of IQ pissing contest. Not sure why, that seems like their own problem. I was reminding them that there is always someone smarter.
Your comment history is littered with “nerds”, “elite”, “better” and all sorts of comparisons.
> I was reminding them that there is always someone smarter.
And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
> Do you have some high school trauma?
I am not sure ad personam is appropriate here
If they're hiring performance engineers then they're hiring for exactly these sets of skills.
It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.
> And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it's far too risky vs regurgitating some standard set of tried and true algos.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
This would be an inappropriate assignment for a web dev position, but I'm willing to bet that a 1% improvement in cycles per byte in inference (or whatever) saves Anthropic many millions of dollars. This is one case where the whiteboard assignment is clearly related to the actual job duties.
> Seems like they’re trying to hire nerds who know a lot about hardware or compiler optimizations. That will only get you so far. I guess hiring for creativity is a lot harder.
Good. That should be the minimum requirement.
Not another Next.js web app take home project.