Claude Advanced Tool Use
anthropic.com
285 points by lebovic 5 hours ago
A couple points from this I'm trying to understand:
- Is the idea that MCP servers will provide tool use examples in their tool definitions? I'm assuming this is the case but it doesn't seem like this announcement is explicit about it, I assume because Anthropic wants to at least maintain the appearance of having the MCP steering committee have its independence from Anthropic.
- If there are tool use examples and programmatic tool calling (code mode), it could also make sense for tools to specify example code so the codegen step can be skipped. And I'm assuming the reason this isn't done is just that it's a security disaster to be instructing a model to run code specified by a third party that may be malicious or compromised. I'm just curious if my reasoning about this seems to be correct.
Programmatic Tool Calling has been an obvious next step for a while. It is clear we are heading towards code as a language for LLMs, so defining that language is very important. But I'm not convinced of tool search. Good context engineering leaves in the tools you will need, so adding a search when you are going to use all of them is just more overhead. What is needed is a more compact tool definition language like, I don't know, every programming language ever in how they define functions. We also need objects (which hopefully Programmatic Tool Calling solves or the next version will solve). In the end I want to drop objects into context with exposed methods, and have it know the type and what is callable on that type.
Exactly, instead of this mess, you could just give it something like .d.ts.
Easy to maintain, test etc. - like any other library/code.
You want structure? Just export * as Foo from '@foo/foo' and let it read .d.ts for '@foo/foo' if it needs to.
But wait, it's also good at writing code. Give it write access to it then.
Now it can talk to SQL Server, gRPC, GraphQL, REST, JSON-RPC over WebSocket, or whatever, e.g. your USB.
If it needs some tool, it can import or write it itself.
The next realisation may be that a Jupyter/Pluto/Mathematica/Observable-style, but more book-like, AI<->human interaction platform works best for communication itself (there is too much raw text; it'd take you days to comprehend what it spat out in 5 minutes, so better to have summary pictures, interactive charts, whatever).
With voice-to-text because poking at flat squares in all of this feels primitive.
For improved performance you can peer it with other sessions (within your team, or global/public) - surely others solved similar problems to yours where you can grab ready solutions.
It already has the ability to create a tool that copies itself and can talk to a copy, so it's fair to call this system "skynet".
Why exactly do we need a new language? The agents I write get access to a subset of the Python SDK (i.e. non-destructive), packages, and custom functions. All this ceremony around tools and pseudo-RPC seems pointless given LLMs are extremely capable of assembling code by themselves.
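A minimal sketch of the "subset of the SDK" idea, under stated assumptions: `FakeSdk`, its methods, and the `READ_ONLY` allowlist are hypothetical stand-ins, not any real SDK. Only the allowlisted bound methods are placed in the namespace the model-written code runs in. Trimming the namespace like this is not a security sandbox on its own; real deployments would still want process or container isolation.

```python
class FakeSdk:
    """Hypothetical stand-in for a real SDK client."""

    def __init__(self):
        self._items = {"a": 1, "b": 2}

    def list_items(self):
        return sorted(self._items)

    def get_item(self, key):
        return self._items[key]

    def delete_item(self, key):  # destructive: never exposed to the agent
        del self._items[key]


READ_ONLY = ("list_items", "get_item")  # assumed allowlist of safe methods


def build_namespace(sdk, allowed=READ_ONLY):
    """Expose only the allowlisted bound methods, plus a few harmless builtins."""
    namespace = {name: getattr(sdk, name) for name in allowed}
    namespace["__builtins__"] = {"len": len, "sum": sum, "sorted": sorted}
    return namespace


def run_agent_code(code, sdk):
    """Run model-written code against the restricted namespace."""
    namespace = build_namespace(sdk)
    exec(code, namespace)
    return namespace.get("result")


# Code the model might write: plain Python composed from the exposed functions.
agent_code = "result = sum(get_item(k) for k in list_items())"
total = run_agent_code(agent_code, FakeSdk())
```

Here the agent's code can call `list_items` and `get_item`, while a call to `delete_item` fails with a `NameError` because the name was never put in scope.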
The latest MCP specifications (2025-06-18+) introduced crucial enhancements like support for Structured Content and the Output Schema.
Smolagents makes use of this and handles tool output as objects (e.g. dict). Is this what you are thinking about?
Details in a blog post here: https://huggingface.co/blog/llchahn/ai-agents-output-schema
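For readers who haven't seen those fields, a rough sketch of what they look like as plain data. The field names `inputSchema`, `outputSchema`, and `structuredContent` follow the MCP 2025-06-18 specification; the `get_weather` tool and its values are made up for illustration, and `check_output` is a tiny hand-rolled check, not a full JSON Schema validator.

```python
# Hypothetical tool definition that declares the shape of its output.
weather_tool = {
    "name": "get_weather",
    "description": "Current weather for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {
            "temperature_c": {"type": "number"},
            "conditions": {"type": "string"},
        },
        "required": ["temperature_c", "conditions"],
    },
}

# Hypothetical tool result: text for older clients, an object for newer ones.
tool_result = {
    "content": [
        {"type": "text", "text": '{"temperature_c": 21.5, "conditions": "cloudy"}'}
    ],
    "structuredContent": {"temperature_c": 21.5, "conditions": "cloudy"},
}

JSON_TYPES = {"number": (int, float), "string": str}


def check_output(schema, data):
    """Return (missing required keys, keys whose primitive type is wrong)."""
    missing = [key for key in schema.get("required", []) if key not in data]
    wrong = [
        key
        for key, spec in schema["properties"].items()
        if key in data and not isinstance(data[key], JSON_TYPES[spec["type"]])
    ]
    return missing, wrong


problems = check_output(weather_tool["outputSchema"], tool_result["structuredContent"])
```

With the output declared up front, an agent framework can hand the result to generated code as a dict instead of asking the model to re-parse free text.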
We just need simple language syntax like python and for models to be trained on it (which they already mostly are):
    class MyClass(SomeOtherClass):
        def my_func(self, a: str, b: int) -> int:
            # Put the description (if needed) in the body for the LLM.
            ...
That is way more compact than the JSON schema out there. Then you can have 'available objects' listed like: o1 (MyClass), o2 (SomeOtherClass) as the starting context. Combine this with programmatic tool calling and there you go. Much, much more compact. Binds well to actual code and very flexible. This is the obvious direction things are going. I just wish Anthropic and OpenAI would realize it and define it/train models to it sooner rather than later.
edit: I should also add that inline response should be part of this too: the model should be able to do ```<code here>``` and keep executing, with only blocking calls requiring it to stop generating until the block frees up. So, for instance, the model could ```r = start_task(some task)```, generate other things, then ```print(r.value())``` (probably with various awaits and the like here, but you all get the point).
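To make the size comparison concrete, here is a rough sketch that renders a class as Python-style stubs with the standard `inspect` module and compares it with a JSON-schema-style definition of the same method. The class body, the docstring, and the `json_schema` dict are illustrative assumptions, and character count is only a crude proxy for tokens.

```python
import inspect
import json


class SomeOtherClass:
    pass


class MyClass(SomeOtherClass):
    def my_func(self, a: str, b: int) -> int:
        """Hypothetical helper."""
        return len(a) * b


def compact_definition(cls):
    """Render a class as stubs: a header plus one line per public method."""
    bases = ", ".join(b.__name__ for b in cls.__bases__ if b is not object)
    header = f"class {cls.__name__}({bases}):" if bases else f"class {cls.__name__}:"
    lines = [header]
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        doc = inspect.getdoc(func)
        suffix = f"  # {doc}" if doc else ""
        lines.append(f"    def {name}{inspect.signature(func)}: ...{suffix}")
    return "\n".join(lines)


# The same method described in a JSON-schema style tool definition.
json_schema = {
    "name": "my_func",
    "description": "Hypothetical helper.",
    "input_schema": {
        "type": "object",
        "properties": {"a": {"type": "string"}, "b": {"type": "integer"}},
        "required": ["a", "b"],
    },
}

stub = compact_definition(MyClass)
stub_chars = len(stub)
schema_chars = len(json.dumps(json_schema))
```

The stub also carries the return type and the parent class, which the JSON form above does not express at all.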
This is heading in the wrong direction.
> The future of AI agents is one where models work seamlessly across hundreds or thousands of tools.
Says who? I see it going the other way - less tools, better skills to apply those tools.
To take it to an extreme, you could get by with ShellTool.
Yeah I kind of agree. I think there's demand for a connector ecosystem because it's something we can understand and market, but I think it's the wrong paradigm.
While maybe the model could do everything from first principles every time, once you have a known good tool that performs a single action perfectly, why not use that tool for that action? Maybe as part of training, the model could write, test, and learn to trust its own set of tools, rather than rely on humans to set them up afterwards.
The whole time while reading over this, I was thinking how a small local orchestrator model might help with somewhat known workflows. Programmatic orchestration is ideal, but it isn't practical in every case. In the interest of reducing context pollution, improving speed, and providing a better experience, I would think the ideal hierarchy for orchestration would be programmatic > tiny local LLM > frontier LLM. The tiny model doesn't need to be local, as computers have varying resources.
I would think there would be some things a tiny model would be capable of competently managing and faster. The tiny model's context could be regularly cleared, and only relevant outputs could be sent to the larger model's context.
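The hierarchy described above could be sketched as a simple fallback chain. Everything here is a placeholder: the three handlers stand in for real rules and real model calls, and the "four words or fewer" test is an arbitrary assumption about what a tiny model can handle.

```python
def programmatic(task):
    """Deterministic rules for known workflows; None means 'not handled'."""
    parts = task.split()
    if len(parts) == 3 and parts[0] == "add" and parts[1].isdigit() and parts[2].isdigit():
        return str(int(parts[1]) + int(parts[2]))
    return None


def tiny_model(task):
    """Placeholder for a small, cheap model that only takes short tasks."""
    if len(task.split()) <= 4:
        return f"[tiny] {task}"
    return None


def frontier_model(task):
    """Placeholder for the large model; it always produces an answer."""
    return f"[frontier] {task}"


# Cheapest tier first; each tier may decline by returning None.
TIERS = [
    ("programmatic", programmatic),
    ("tiny", tiny_model),
    ("frontier", frontier_model),
]


def route(task):
    """Return (tier name, answer) from the first tier that handles the task."""
    for name, handler in TIERS:
        answer = handler(task)
        if answer is not None:
            return name, answer
    raise RuntimeError("no tier handled the task")
```

Only tasks that fall through both cheaper tiers ever reach the frontier model's context.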
Our agentic builder has a single tool.
It is called graphql.
The agent writes a query and executes it. If the agent does not know how to do a particular type of query, it can use GraphQL introspection. The agent only receives the minimal amount of data as per the GraphQL query, saving valuable tokens.
It works better!
Not only do we not need to load 50+ tools (our entire SDK), but it also solves the N+1 problem you get with traditional REST APIs. Also, you don't need to fall back to writing code, especially for queries and mutations. But if you need to do that, the SDK is always available, following the GraphQL typed schema, which helps agents write better code!
While I was never a big fan of graphql before, considering the state of MCP, I strongly believe it is one of the best technologies for AI agents.
I wrote more about this here if you are interested: https://chatbotkit.com/reflections/why-graphql-beats-mcp-for...
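A minimal sketch of what a single `graphql` tool could look like, using only the Python standard library. The endpoint URL, the `bot`/`conversations` schema, and the function names are assumptions for illustration, not the commenter's actual implementation.

```python
import json
import urllib.request

ENDPOINT = "https://api.example.com/graphql"  # hypothetical endpoint


def build_graphql_request(query, variables=None, endpoint=ENDPOINT):
    """Package a query and its variables as a standard GraphQL-over-HTTP POST."""
    body = json.dumps({"query": query, "variables": variables or {}}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def graphql_tool(query, variables=None):
    """The one tool exposed to the agent: run a query, return the JSON reply."""
    request = build_graphql_request(query, variables)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# A query the agent might write: only the fields it needs, nested in one call.
query = """
query ($id: ID!) {
  bot(id: $id) {
    name
    conversations(first: 5) { id createdAt }
  }
}
"""
request = build_graphql_request(query, {"id": "bot_123"})
```

The nested selection is what avoids the N+1 pattern: one round trip returns the bot and its conversations, and nothing else enters the context.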
Isn't the challenge that introspecting graphql will lead to either a) a very long set of definitions consuming many tokens or b) many calls to drill into the introspection?
In my experience, this was the limitation we ran into with this approach. If you have a large API this will blow up your context.
I have had the best luck with hand-crafted tools that pre-digest your API so you don't have to waste tokens or deal with context rot bugs.
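One way to keep that cost down, sketched under assumptions: introspect a single type at a time with the standard `__type` field and compress the reply into one line before it reaches the model. The `Bot` payload below is a made-up example of an introspection response, and `summarize` is a hypothetical helper.

```python
# Standard introspection query for one named type, instead of the whole schema.
TYPE_QUERY = """
query ($name: String!) {
  __type(name: $name) {
    name
    fields {
      name
      type { kind name ofType { kind name ofType { kind name } } }
    }
  }
}
"""


def render_type(type_ref):
    """Turn a nested introspection type reference into GraphQL notation."""
    kind = type_ref["kind"]
    if kind == "NON_NULL":
        return render_type(type_ref["ofType"]) + "!"
    if kind == "LIST":
        return "[" + render_type(type_ref["ofType"]) + "]"
    return type_ref["name"]


def summarize(type_payload):
    """Compress one introspected type into a single compact line."""
    fields = ", ".join(
        f'{field["name"]}: {render_type(field["type"])}'
        for field in type_payload["fields"]
    )
    return f'{type_payload["name"]} {{ {fields} }}'


# Made-up example of what the server might return under data.__type.
bot_type = {
    "name": "Bot",
    "fields": [
        {"name": "id", "type": {"kind": "NON_NULL", "name": None,
                                "ofType": {"kind": "SCALAR", "name": "ID", "ofType": None}}},
        {"name": "tags", "type": {"kind": "LIST", "name": None,
                                  "ofType": {"kind": "SCALAR", "name": "String", "ofType": None}}},
    ],
}
summary = summarize(bot_type)
```

This trades context size for extra round trips, which is exactly the a) versus b) tension raised above; caching the summaries would reduce the second cost.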
This is actually a really good use of graphql!
IMO the biggest pain points of graphql are authorization/rate limiting, caching, and mutations... But for selective context loading none of those matter actually. Pretty cool!
That is also the approach we took with Exograph (https://exograph.dev). Here is our reasoning (https://exograph.dev/blog/exograph-now-supports-mcp#comparin...). We found that LLMs do a very good job of crafting GraphQL queries for the given schema. While they do make mistakes, returning good descriptive error messages makes it easy for them to fix queries.
1000%
2 years ago I gave a talk on vector DBs and LLM use.
https://www.youtube.com/watch?v=U_g06VqdKUc
TLDR but it shows how you could teach an LLM your GraphQL query language to let it selectively load context into what were very small context windows at the time.
After that the MCP specification came out, which from my vantage point is a poor and half-implemented version of what GraphQL already is.
> It works better!
> I strongly believe it is one of the best technologies for AI agents
Do you have any quantitative evidence to support this?
Sincere question. I feel it would add some much needed credibility in a space where many folks are abusing the hype wave and low key shilling their products with vibes instead of rigor.
I've seen a similar setup with an llm loop integrated with clojure. In clojure, code is data, so the llm can query, execute, and modify the program directly
I have thought about this for all of thirty seconds, but it wouldn't shock me if this was the case. The intuition here is about types, and the ability to introspect them. Agents really love automated guardrails. It makes sense to me that this would work better than RESTish stuff, even with OpenAPI.
Same in terms of time spent. The hypothesis that GraphQL is superior passes the basic sniff test. Assuming GraphQL does what it says on the tin, which my understanding is it does based on my work with Ent, then the claim that it's better for tool and API use by agents follows from common sense.
Better than rest is a low bar though. Ultimately agents should rarely be calling raw rest and graphql apis, which are meant for programmatic use.
Agents should be calling one level of abstraction higher.
E.g. calling a function to “find me relevant events in this city according to this user's preferences” instead of “list all events in this city”.
If you know GraphQL, you may immediately see it: you ask for a specific nested structure of the data, which can span many joins across different related collections. This is not the case with a common REST API or CLI, for example. And introspection is another good reason.
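A toy illustration of that difference, with made-up data and hypothetical function names: the low-level call returns everything in a city, while the higher-level one does the filtering and ranking before anything reaches the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    title: str
    city: str
    tags: frozenset


# Made-up catalogue standing in for a real events API.
EVENTS = [
    Event("Jazz night", "Lisbon", frozenset({"music", "jazz"})),
    Event("Tech meetup", "Lisbon", frozenset({"tech", "networking"})),
    Event("Food market", "Lisbon", frozenset({"food"})),
    Event("Rock concert", "Porto", frozenset({"music", "rock"})),
]


def list_events(city):
    """Low-level call: every event in the city, relevant or not."""
    return [event for event in EVENTS if event.city == city]


def find_relevant_events(city, preferences, limit=2):
    """Higher-level call: rank by tag overlap and return only the best titles."""
    wanted = set(preferences)
    scored = [(len(event.tags & wanted), event) for event in list_events(city)]
    matches = [(score, event) for score, event in scored if score > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [event.title for _, event in matches[:limit]]


top = find_relevant_events("Lisbon", ["jazz", "music", "food"])
```

The agent sees two titles instead of the full listing, and the ranking logic can be tested like any other code.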
I do think that using graphql will solve a lot of problems for people but it's super surprising how many people absolutely hate it.
GraphQL is just a typed schema (good) with a server capable of serving any subset of the entire schema at a time (pain in the ass).
It doesn’t actually require that second part. Every time I’ve used it in a production system, we had an approved list of query shapes that were accepted. If the client wanted to use a new kind of query, it was performance tested and sometimes needed to be optimized before approval for use.
If you open it up for any possible query, then give that to uncontrolled clients, it’s a recipe for disaster.
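The approved-list approach can be sketched as an allowlist keyed by a hash of the query text. This is an illustration of the general idea rather than any particular server's persisted-query feature; the example query is hypothetical, and the whitespace normalization is deliberately crude (it would also collapse whitespace inside string literals).

```python
import hashlib


def normalize(query):
    """Collapse all whitespace so formatting differences don't matter."""
    return " ".join(query.split())


def query_id(query):
    """Stable identifier for a query shape."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


# Queries that were reviewed and performance tested (hypothetical example).
APPROVED_QUERIES = [
    "query Bot($id: ID!) { bot(id: $id) { name } }",
]
APPROVED_IDS = {query_id(query) for query in APPROVED_QUERIES}


def is_approved(query):
    """Accept only queries whose normalized text is on the approved list."""
    return query_id(query) in APPROVED_IDS


reformatted = """
query Bot($id: ID!) {
  bot(id: $id) { name }
}
"""
unknown = "query { bots { conversations { messages { text } } } }"
```

Variables stay free, so clients can still parameterize an approved shape; only the structure of the query is pinned.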
Oh, we have that too! But we call it HTTP endpoints.
Really? Hmm... where in the HTTP spec does it allow for returning an arbitrary subset of any specific request, rather than the whole thing? And where does it ensure all the results are keyed by id so that you can actually build and update a sensible cache around all of it rather than the mess that totally free-form HTTP responses lead to? Oh weird HTTP doesn't have any of that stuff? Maybe we should make a new spec, something which does allow for these patterns and behaviors? And it might be confusing if we use the exact same name as HTTP, since the usage patterns are different and it enables new abilities. If only we could think of such a name...
An HTTP Range request asks the server to send parts of a resource back to a client. Range requests are useful for various clients, including media players that support random access, data tools that require only part of a large file, and download managers that let users pause and resume a download.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
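For reference, a small sketch of what such a request looks like from Python's standard library. The URL is a placeholder and nothing is sent here; a server that honors the range replies with `206 Partial Content` and a `Content-Range` header, which the second helper parses.

```python
import re
import urllib.request


def range_request(url, start, end):
    """Build a GET request for bytes start..end (inclusive) of a resource."""
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value):
    """Parse 'bytes 0-99/1234' into (start, end, total); total may be unknown."""
    match = CONTENT_RANGE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported Content-Range: {value!r}")
    start, end, total = match.groups()
    return int(start), int(end), None if total == "*" else int(total)


request = range_request("https://example.com/large-file.bin", 0, 99)
```

Note that this selects a byte range of one representation, not a subset of fields, which is the distinction the parent comment is drawing.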
also handy for bypassing bandwidth restrictions: capped at 100kbps? launch 1000 workers to grab chunks then assemble the survivors
Without wishing to take part in a pile on - I am wondering why you're using graphql if you are kneecapping it and restricting it to set queries.
Because it solves all sorts of other problems, like having a well-defined way to specify the schema of queries and results, and lots of tools built around that.
I would be surprised to see many (or any) GQL endpoints in systems with significant complexity and scale that allow completely arbitrary requests.
Shopify's GraphQL API limits you in complexity (essentially max number of fields returned), but it's basically arbitrary shapes.
OpenAPI does the same thing for http requests, with tooling around it.
With typed languages you can auto-generate OpenAPI schemas from your code.
Probably for one of the reasons graphql was created in the first place - accomplish a set of fairly complex operations using one rather than a multitude of API calls. The set can be "everything" or it can be "this well-defined subset".
You could be right, but that's really just "Our API makes multiple calls to itself in the background"
I could be wrong but I thought GraphQL's point of difference from a blind proxy was that it was flexible.
It is flexible, but you don’t have to let it be infinitely flexible. There’s no practical use case for that. (Well, until LLMs, perhaps!)
I guess that I'm reading your initial post a little more strictly than you're meaning