Designing MCP tools for on-chain data
A field guide to turning ABIs into MCP tools that LLMs actually call correctly. Naming, descriptions, inputs, outputs, and the anti-patterns to avoid.
An ABI is almost, but not quite, what an LLM needs to call a smart contract correctly. The function signatures are there. The input and output types are there. What is missing is intent, and intent is the half the model needs most.
This is a practical field guide to closing that gap. We will walk through naming, descriptions, input shaping, output shaping, errors, and the anti-patterns we see in the wild. Everything here is what we learned building ChainContext and what we tell teams when they ask why their own MCP server keeps picking the wrong tool.
The bar for a usable MCP server is not “every function is exposed.” The bar is “an agent makes the right decision ninety-five percent of the time, and the five percent it gets wrong are recoverable.” That bar is reachable. It just takes deliberate design on top of the ABI.
The business metric hiding behind tool design
Tool-call accuracy is the closest thing MCP servers have to a product KPI. Every agent run has three failure modes: the agent picks the wrong tool, picks the right tool with the wrong arguments, or picks the right tool with the right arguments and misinterprets the response. All three map back to decisions you made at server-build time.
If you care about one number, watch the ratio of successful tool calls to total tool calls on a representative suite of user prompts. Move that number, and the user experience moves with it. The suggestions in the rest of this post exist because each of them moves that number in practice.
Naming: verb-first, specific, readable
LLMs pick tools by matching the user’s intent to the tool’s name and description. A verb-first, human-readable name is worth more than you think.
Bad: balanceOf
Better: getTokenBalance
Best: get_erc20_balance_for_wallet
Three things are happening in the “best” version. The verb is first. The token standard is explicit so the model can distinguish it from get_erc721_owner_of_token later. The object of the call is fully spelled out so the name stands on its own without the description.
A few patterns that hold up across contracts:
- Start with the verb the model would use in natural language: get_, list_, find_, build_, simulate_.
- Name the object specifically: pool_liquidity, not liquidity; governance_proposal, not proposal.
- Disambiguate variants in the name, not in the description: get_position_by_id and list_positions_for_wallet are clearer than two getPosition tools.
Avoid tokens that mean nothing to the LLM. v2, ext, internal, raw, protocol-specific abbreviations - all noise. If a human reader of the name would not know what the tool does, the model will not either.
Descriptions: help the model decide
The description is not documentation. It is a decision aid. The one question it needs to answer is “should I call this tool for this user turn?”, and the best descriptions answer it in a sentence or two.
A useful template:
What it returns in plain English, plus when to use it vs obvious alternatives, plus any caveats that change correctness.
Bad: balanceOf: returns uint256.
Good: Get the ERC-20 token balance for a wallet address, returned
in both raw base units and human-readable form. Use this for
"how much [TOKEN] does [WALLET] hold" questions. Does not
include pending rewards or staked balances - use
get_staked_balance for those.
The good version does three things the bad version does not. It tells the model what the return value looks like, so it can plan the answer before calling. It carves out a clear scope vs a neighboring tool, so the model does not guess. It pre-empts the most common wrong use.
Descriptions cost nothing at inference time and pay back on every call. Write them like you are briefing a smart colleague who just walked into the room.
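Putting the naming and description guidance together, here is a sketch of a single tool definition in the name/description/inputSchema shape that MCP servers return from tools/list. The field names follow the MCP tools/list response; the schema details are illustrative, and get_staked_balance is the hypothetical neighboring tool from the example above.

```python
# Illustrative tool definition: verb-first name, decision-aid description,
# constrained inputs. Shapes mirror the MCP tools/list response.
GET_ERC20_BALANCE = {
    "name": "get_erc20_balance_for_wallet",
    "description": (
        "Get the ERC-20 token balance for a wallet address, returned in both "
        "raw base units and human-readable form. Use this for 'how much "
        "[TOKEN] does [WALLET] hold' questions. Does not include pending "
        "rewards or staked balances - use get_staked_balance for those."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "wallet": {
                "type": "string",
                "pattern": "^0x[a-fA-F0-9]{40}$",
                "description": "Wallet address to query.",
            },
            "token": {
                "type": "string",
                "description": "ERC-20 contract address or symbol.",
            },
        },
        "required": ["wallet", "token"],
    },
}
```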
Inputs: constrain ruthlessly
The more specific the input schema, the higher the chance the model fills it correctly on the first try. Three techniques do most of the work.
Use enums for finite sets. If your tool takes a network parameter and you only support five chains, make it an enum of those five chain names. The model picks from a list far more reliably than it invents a value, and invented values are how you get runtime errors at the RPC boundary.
"network": {
"type": "string",
"enum": ["ethereum", "base", "arbitrum", "optimism", "polygon"],
"description": "Which chain the wallet is on. Defaults to ethereum."
}
Pattern-match free strings. Addresses, transaction hashes, ENS names, hex-encoded payloads - all have well-known shapes. Enforce them at the schema level with a regex pattern. Malformed inputs fail validation at the edge, not after an expensive RPC call.
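As a minimal sketch, the same patterns you put in the schema (for example "pattern": "^0x[a-fA-F0-9]{40}$") can be mirrored server-side so malformed inputs never reach the RPC layer:

```python
import re

# Well-known on-chain shapes, enforced before any RPC call is made.
ADDRESS = re.compile(r"^0x[a-fA-F0-9]{40}$")
TX_HASH = re.compile(r"^0x[a-fA-F0-9]{64}$")

def is_valid_address(value: str) -> bool:
    # 0x prefix plus exactly 40 hex characters (20 bytes).
    return ADDRESS.fullmatch(value) is not None
```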
Default sensibly. If 90% of callers pass the same value for a parameter, make that value the default. The model skips the field, and the call is faster, cheaper, and more accurate. A common case: blockTag: "latest" on every read - nobody is asking for historical state in casual conversation.
Avoid object inputs with deeply nested structure. Flat schemas with 3-5 top-level fields are the sweet spot. If your tool needs more than that, it is almost certainly two tools.
Outputs: less is more, and human-readable beats on-wire
Raw on-chain returns are noisy. A Uniswap V3 slot0() call hands you sqrtPriceX96, tick, observationIndex, observationCardinality, observationCardinalityNext, feeProtocol, and unlocked. Your user wants the price. Give them the price.
The job of a good output schema is to hand the model the smallest set of well-named fields that can answer the questions users actually ask. A useful rule of thumb: three to five fields per tool, each with a description and units. Everything else goes into an optional raw field if a power user needs it.
Three shaping patterns that come up constantly:
Decimal normalization. uint256 raw balances are unreadable. Surface both forms: raw for provability, formatted for answers.
{
"balance_raw": "1000000000000000000",
"balance_formatted": "1.0",
"decimals": 18,
"symbol": "DAI"
}
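A sketch of the shaping step behind that output, assuming raw balances arrive as stringified uint256 values:

```python
from decimal import Decimal

def shape_balance(raw: str, decimals: int, symbol: str) -> dict:
    # Surface both forms: raw for provability, formatted for answers.
    formatted = Decimal(raw) / (Decimal(10) ** decimals)
    return {
        "balance_raw": raw,
        # "f" formatting avoids scientific notation for large balances.
        "balance_formatted": format(formatted.normalize(), "f"),
        "decimals": decimals,
        "symbol": symbol,
    }
```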
Time normalization. Contracts speak Unix seconds. Humans speak ISO timestamps. Include both, and add a relative form when it is the natural way to describe the value.
{
"unlocks_at_unix": 1735689600,
"unlocks_at_iso": "2025-01-01T00:00:00Z",
"unlocks_in": "in 8 months"
}
Enum decoding. If a field is a uint8 representing a status, decode it. status: "active" beats status: 2 every time, and the model does not need to know your enum mapping to answer correctly.
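A sketch of that decoding, assuming a hypothetical position-status enum where 2 means "active" on-chain; the mapping lives in the server, not in the model's head:

```python
# Hypothetical status enum for an example positions contract.
POSITION_STATUS = {0: "pending", 1: "queued", 2: "active", 3: "closed"}

def decode_status(raw: int) -> str:
    # Fall back to a labeled unknown rather than leaking a bare integer.
    return POSITION_STATUS.get(raw, f"unknown({raw})")
```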
One tool per intent, not one tool per function
The sharpest mental flip for teams coming from ABI-first thinking: tools are indexed by what the user wants to do, not by what the contract can do. A single user intent often maps to several function calls, and a single contract function can serve zero, one, or several intents.
An ERC-4626 vault exposes asset(), convertToShares(), convertToAssets(), maxDeposit(), maxWithdraw(), previewDeposit(), previewWithdraw(), totalAssets(), totalSupply(), and more. If you expose each as its own MCP tool, you have shipped 10 tools the model has to discriminate between on every call. You will miss.
What users want is “how much can I deposit”, “what would X tokens be worth as shares”, “what is the vault’s current yield”. Three tools, each of which internally calls two or three ABI functions and returns a composed answer. Fewer tools, each higher-signal, each mapped to a recognizable user turn.
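Intent-level composition can be sketched like this, assuming a hypothetical vault client that wraps the raw ERC-4626 reads. One user intent ("what would depositing X get me?") fans out to three ABI calls and comes back as one composed answer:

```python
def preview_vault_deposit(vault, assets: int) -> dict:
    # Three ABI reads behind one user-facing intent.
    shares = vault.preview_deposit(assets)
    return {
        "max_deposit": vault.max_deposit(),
        "shares_out": shares,
        "share_of_vault_pct": round(100 * shares / (vault.total_supply() + shares), 2),
    }
```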
Structured errors the model can recover from
When a tool call fails, the model reads the error and decides what to do next. Give it something to work with.
{
"error": "chain_mismatch",
"message": "Wallet is connected to ethereum but this tool targets base. Ask the user to switch networks.",
"recoverable": true,
"hint": "call get_supported_networks to list valid values"
}
A model that gets this error can recover gracefully in a single turn. A model that gets Error: execution reverted cannot. The rule is: every error includes a machine-readable code, a human-sentence explanation of what to do, and a clear signal of whether the call is worth retrying with different arguments.
This matters most on write tools, where a revert can happen for a dozen user-fixable reasons (insufficient balance, missing approval, wrong deadline, paused contract). Each of those is a different code, a different message, a different recovery path.
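A sketch of that mapping, using the common OpenZeppelin revert strings as examples (adjust to whatever your contracts actually emit); build_approval_transaction is a hypothetical sibling tool named in the recovery hint:

```python
# Map raw revert reasons to (code, human message, recoverable) triples.
REVERT_MAP = {
    "ERC20: transfer amount exceeds balance": (
        "insufficient_balance",
        "Wallet balance is lower than the requested amount. Ask the user for a smaller amount.",
        True,
    ),
    "ERC20: insufficient allowance": (
        "missing_approval",
        "Token approval is missing. Call build_approval_transaction first.",
        True,
    ),
    "Pausable: paused": (
        "contract_paused",
        "The contract is paused. Retrying with different arguments will not help.",
        False,
    ),
}

def shape_revert(reason: str) -> dict:
    code, message, recoverable = REVERT_MAP.get(
        reason, ("execution_reverted", f"Unrecognized revert: {reason}", False)
    )
    return {"error": code, "message": message, "recoverable": recoverable}
```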
The anti-patterns we see every week
A quick field survey of what to avoid, based on the MCP servers we audit.
- The 50-tool dump. Every ABI function exposed, nothing shaped, nothing cut. Accuracy tanks past about 15-20 tools per server for current-generation models.
- Naked uint256 outputs. Tools that return { "result": "18923849..." } with no decimals, no units, no symbol. The model has no way to turn that into a user answer.
- Enum-less network parameters. Free-string network fields that accept whatever the model decides. You will see "eth", "mainnet", "Ethereum", "ethereum-mainnet", and "1" - all in the first week.
- Admin functions exposed to end users. pause(), grantRole(), upgradeTo(). These should never be in the tool list a generic assistant sees. Gate them behind a separate permissioned endpoint or do not ship them at all.
- Identical-looking tool pairs. getBalance and getBalanceAtBlock, with one-sentence descriptions. The model picks wrong half the time and you do not find out until users complain.
Every one of these is fixable by going back through the naming, description, input, and output steps above.
Testing the way the model does
The last step, and the one that stops teams from getting trapped in endless local iteration: test with the actual model you expect users to use. Run a representative set of user prompts through Claude or your agent runtime, inspect which tools get picked, and look at the arguments. If the model picks wrong or fills badly, the fix is almost always in the name, description, or schema - not in the implementation.
A light prompt suite of twenty or thirty questions, run after every tool change, catches 90% of regressions before they ship. It is the cheapest insurance in the stack.
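The suite itself can be a sketch this small, assuming a hypothetical pick_tool(prompt) that runs your real model against the server and returns the name of the tool it selected (the prompts and tool names here are illustrative):

```python
# Representative user prompts paired with the tool the model should pick.
SUITE = [
    ("How much DAI does this wallet hold?", "get_erc20_balance_for_wallet"),
    ("What is the vault's current yield?", "get_vault_yield"),
    ("What would depositing 500 USDC get me?", "preview_vault_deposit"),
]

def suite_accuracy(pick_tool) -> float:
    # Fraction of prompts where the model selected the expected tool.
    hits = sum(1 for prompt, expected in SUITE if pick_tool(prompt) == expected)
    return hits / len(SUITE)
```

Run it after every name, description, or schema change and watch the number.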
Recap
- Name tools verb-first, specific, and readable. The name should stand alone.
- Write descriptions as decision aids, not documentation. Answer when to use this and when not to.
- Constrain inputs with enums, patterns, and defaults. Flat schemas beat nested ones.
- Reshape outputs to 3-5 named fields, with decimal, time, and enum decoding done for the model.
- Index tools by user intent, not by ABI function.
- Return structured errors that tell the model how to recover.
- Test with the real model, not with unit tests alone.
For the shorter version of this post and a walkthrough of the ChainContext flow, see From ABI to MCP server in 5 minutes. For the backstory and the wider MCP-for-Web3 thesis, see Introducing ChainContext. New posts drop in the RSS feed.