[{"data":1,"prerenderedAt":22},["ShallowReactive",2],{"blog-post-designing-mcp-tools-for-onchain-data":3},{"post":4,"author":17},{"slug":5,"path":6,"title":7,"description":8,"date":9,"author":10,"tags":11,"draft":15,"html":16},"designing-mcp-tools-for-onchain-data","/blog/designing-mcp-tools-for-onchain-data","Designing MCP tools for on-chain data","A field guide to turning ABIs into MCP tools that LLMs actually call correctly. Naming, descriptions, inputs, outputs, and the anti-patterns to avoid.","2026-04-15","chaincontext-team",[12,13,14],"engineering","mcp","guides",false,"\u003Cp>An ABI is almost, but not quite, what an LLM needs to call a smart contract correctly. The function signatures are there. The input and output types are there. What is missing is \u003Cem>intent\u003C/em>, and intent is the half the model needs most.\u003C/p>\n\u003Cp>This is a practical field guide to closing that gap. We will walk through naming, descriptions, input shaping, output shaping, errors, and the anti-patterns we see in the wild. Everything here is what we learned building \u003Ca href=\"https://chaincontext.dev\">ChainContext\u003C/a> and what we tell teams when they ask why their own MCP server keeps picking the wrong tool.\u003C/p>\n\u003Cp>The bar for a usable MCP server is not “every function is exposed.” The bar is “an agent makes the right decision ninety-five percent of the time, and the five percent it gets wrong are recoverable.” That bar is reachable. It just takes deliberate design on top of the ABI.\u003C/p>\n\u003Ch2>The business metric hiding behind tool design\u003C/h2>\n\u003Cp>Tool-call accuracy is the closest thing MCP servers have to a product KPI. Every agent run has three failure modes: the agent picks the wrong tool, the agent picks the right tool with the wrong arguments, or the agent picks the right tool with right arguments and misinterprets the response. 
All three map back to decisions you made at server-build time.\u003C/p>\n\u003Cp>If you care about one number, watch the ratio of successful tool calls to total tool calls on a representative suite of user prompts. Move that number, and the user experience moves with it. The suggestions in the rest of this post exist because each of them moves that number in practice.\u003C/p>\n\u003Ch2>Naming: verb-first, specific, readable\u003C/h2>\n\u003Cp>LLMs pick tools by matching the user’s intent to the tool’s name and description. A verb-first, human-readable name is worth more than you think.\u003C/p>\n\u003Cpre>\u003Ccode>Bad:    balanceOf\nBetter: getTokenBalance\nBest:   get_erc20_balance_for_wallet\n\u003C/code>\u003C/pre>\n\u003Cp>Three things are happening in the “best” version. The verb is first. The token standard is explicit so the model can distinguish it from \u003Ccode>get_erc721_owner_of_token\u003C/code> later. The object of the call is fully spelled out so the name stands on its own without the description.\u003C/p>\n\u003Cp>A few patterns that hold up across contracts:\u003C/p>\n\u003Cul>\n\u003Cli>Start with the verb the model would use in natural language: \u003Ccode>get_\u003C/code>, \u003Ccode>list_\u003C/code>, \u003Ccode>find_\u003C/code>, \u003Ccode>build_\u003C/code>, \u003Ccode>simulate_\u003C/code>.\u003C/li>\n\u003Cli>Name the object specifically: \u003Ccode>pool_liquidity\u003C/code>, not \u003Ccode>liquidity\u003C/code>. \u003Ccode>governance_proposal\u003C/code>, not \u003Ccode>proposal\u003C/code>.\u003C/li>\n\u003Cli>Disambiguate variants in the name, not in the description: \u003Ccode>get_position_by_id\u003C/code> and \u003Ccode>list_positions_for_wallet\u003C/code> are clearer than two \u003Ccode>getPosition\u003C/code> tools.\u003C/li>\n\u003C/ul>\n\u003Cp>Avoid tokens that mean nothing to the LLM. 
\u003Ccode>v2\u003C/code>, \u003Ccode>ext\u003C/code>, \u003Ccode>internal\u003C/code>, \u003Ccode>raw\u003C/code>, protocol-specific abbreviations - all noise. If a human reader of the name would not know what the tool does, the model will not either.\u003C/p>\n\u003Ch2>Descriptions: help the model decide\u003C/h2>\n\u003Cp>The description is not documentation. It is a decision aid. The one question it needs to answer is \u003Cem>“should I call this tool for this user turn?”\u003C/em>, and the best descriptions answer it in a sentence or two.\u003C/p>\n\u003Cp>A useful template:\u003C/p>\n\u003Cblockquote>\n\u003Cp>\u003Cstrong>What it returns\u003C/strong> in plain English, plus \u003Cstrong>when to use it\u003C/strong> vs obvious alternatives, plus any \u003Cstrong>caveats\u003C/strong> that change correctness.\u003C/p>\n\u003C/blockquote>\n\u003Cpre>\u003Ccode>Bad:  balanceOf: returns uint256.\n\nGood: Get the ERC-20 token balance for a wallet address, returned\n      in both raw base units and human-readable form. Use this for\n      &quot;how much [TOKEN] does [WALLET] hold&quot; questions. Does not\n      include pending rewards or staked balances - use\n      get_staked_balance for those.\n\u003C/code>\u003C/pre>\n\u003Cp>The good version does three things the bad version does not. It tells the model what the return value looks like, so it can plan the answer before calling. It carves out a clear scope vs a neighboring tool, so the model does not guess. It pre-empts the most common wrong use.\u003C/p>\n\u003Cp>Descriptions cost nothing at inference time and pay back on every call. Write them like you are briefing a smart colleague who just walked into the room.\u003C/p>\n\u003Ch2>Inputs: constrain ruthlessly\u003C/h2>\n\u003Cp>The more specific the input schema, the higher the chance the model fills it correctly on the first try. 
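\u003C/p>\n\u003Cp>As a sketch of what that looks like in a schema, here is a hypothetical \u003Ccode>wallet\u003C/code> parameter constrained to the shape of a hex address (the field name, regex, and description wording are illustrative, not a fixed convention):\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">&quot;wallet&quot;: {\n  &quot;type&quot;: &quot;string&quot;,\n  &quot;pattern&quot;: &quot;^0x[a-fA-F0-9]{40}$&quot;,\n  &quot;description&quot;: &quot;The wallet address to query, as a 0x-prefixed hex string.&quot;\n}\n\u003C/code>\u003C/pre>\n\u003Cp>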
Three techniques do most of the work.\u003C/p>\n\u003Cp>\u003Cstrong>Use enums for finite sets.\u003C/strong> If your tool takes a \u003Ccode>network\u003C/code> parameter and you only support five chains, make it an enum of those five chain names. The model picks from a list far more reliably than it invents a value, and invented values are how you get runtime errors at the RPC boundary.\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">&quot;network&quot;: {\n  &quot;type&quot;: &quot;string&quot;,\n  &quot;enum&quot;: [&quot;ethereum&quot;, &quot;base&quot;, &quot;arbitrum&quot;, &quot;optimism&quot;, &quot;polygon&quot;],\n  &quot;description&quot;: &quot;Which chain the wallet is on. Defaults to ethereum.&quot;\n}\n\u003C/code>\u003C/pre>\n\u003Cp>\u003Cstrong>Pattern-match free strings.\u003C/strong> Addresses, transaction hashes, ENS names, hex-encoded payloads - all have well-known shapes. Enforce them at the schema level with a regex pattern. Malformed inputs fail validation at the edge, not after an expensive RPC call.\u003C/p>\n\u003Cp>\u003Cstrong>Default sensibly.\u003C/strong> If 90% of callers pass the same value for a parameter, make that value the default. The model skips the field, and the call is faster, cheaper, and more accurate. A common case: \u003Ccode>blockTag: &quot;latest&quot;\u003C/code> on every read - nobody is asking for historical state in casual conversation.\u003C/p>\n\u003Cp>Avoid \u003Ccode>object\u003C/code> inputs with deeply nested structure. Flat schemas with 3-5 top-level fields are the sweet spot. If your tool needs more than that, it is almost certainly two tools.\u003C/p>\n\u003Ch2>Outputs: less is more, and human-readable beats on-wire\u003C/h2>\n\u003Cp>Raw on-chain returns are noisy. 
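\u003C/p>\n\u003Cp>For a sense of the problem, here is roughly what a decoded-but-unshaped pool state read can look like (field values illustrative):\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">{\n  &quot;sqrtPriceX96&quot;: &quot;79228162514264337593543950336&quot;,\n  &quot;tick&quot;: 0,\n  &quot;observationIndex&quot;: 144,\n  &quot;observationCardinality&quot;: 720,\n  &quot;observationCardinalityNext&quot;: 720,\n  &quot;feeProtocol&quot;: 0,\n  &quot;unlocked&quot;: true\n}\n\u003C/code>\u003C/pre>\n\u003Cp>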
A Uniswap V3 \u003Ccode>slot0()\u003C/code> call hands you \u003Ccode>sqrtPriceX96\u003C/code>, \u003Ccode>tick\u003C/code>, \u003Ccode>observationIndex\u003C/code>, \u003Ccode>observationCardinality\u003C/code>, \u003Ccode>observationCardinalityNext\u003C/code>, \u003Ccode>feeProtocol\u003C/code>, and \u003Ccode>unlocked\u003C/code>. Your user wants the price. Give them the price.\u003C/p>\n\u003Cp>The job of a good output schema is to hand the model the smallest set of well-named fields that can answer the questions users actually ask. A useful rule of thumb: three to five fields per tool, each with a description and units. Everything else goes into an optional \u003Ccode>raw\u003C/code> field if a power user needs it.\u003C/p>\n\u003Cp>Three shaping patterns that come up constantly:\u003C/p>\n\u003Cp>\u003Cstrong>Decimal normalization.\u003C/strong> \u003Ccode>uint256\u003C/code> raw balances are unreadable. Surface both forms: raw for provability, formatted for answers.\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">{\n  &quot;balance_raw&quot;: &quot;1000000000000000000&quot;,\n  &quot;balance_formatted&quot;: &quot;1.0&quot;,\n  &quot;decimals&quot;: 18,\n  &quot;symbol&quot;: &quot;DAI&quot;\n}\n\u003C/code>\u003C/pre>\n\u003Cp>\u003Cstrong>Time normalization.\u003C/strong> Contracts speak Unix seconds. Humans speak ISO timestamps. Include both, and add a relative form when it is the natural way to describe the value.\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">{\n  &quot;unlocks_at_unix&quot;: 1735689600,\n  &quot;unlocks_at_iso&quot;:  &quot;2025-01-01T00:00:00Z&quot;,\n  &quot;unlocks_in&quot;:      &quot;in 8 months&quot;\n}\n\u003C/code>\u003C/pre>\n\u003Cp>\u003Cstrong>Enum decoding.\u003C/strong> If a field is a \u003Ccode>uint8\u003C/code> representing a status, decode it. 
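\u003C/p>\n\u003Cp>A minimal sketch of the decoded output, assuming a hypothetical status enum where \u003Ccode>2\u003C/code> means active (keep the raw code alongside if callers need provability):\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">{\n  &quot;status&quot;: &quot;active&quot;,\n  &quot;status_code&quot;: 2\n}\n\u003C/code>\u003C/pre>\n\u003Cp>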
\u003Ccode>status: &quot;active&quot;\u003C/code> beats \u003Ccode>status: 2\u003C/code> every time, and the model does not need to know your enum mapping to answer correctly.\u003C/p>\n\u003Ch2>One tool per intent, not one tool per function\u003C/h2>\n\u003Cp>The sharpest mental flip for teams coming from ABI-first thinking: tools are indexed by what the \u003Cem>user\u003C/em> wants to do, not by what the contract \u003Cem>can\u003C/em> do. A single user intent often maps to several function calls, and a single contract function can serve zero, one, or several intents.\u003C/p>\n\u003Cp>An ERC-4626 vault exposes \u003Ccode>asset()\u003C/code>, \u003Ccode>convertToShares()\u003C/code>, \u003Ccode>convertToAssets()\u003C/code>, \u003Ccode>maxDeposit()\u003C/code>, \u003Ccode>maxWithdraw()\u003C/code>, \u003Ccode>previewDeposit()\u003C/code>, \u003Ccode>previewWithdraw()\u003C/code>, \u003Ccode>totalAssets()\u003C/code>, \u003Ccode>totalSupply()\u003C/code>, and more. If you expose each as its own MCP tool, you have shipped 10 tools the model has to discriminate between on every call. You will miss.\u003C/p>\n\u003Cp>What users want is “how much can I deposit”, “what would X tokens be worth as shares”, “what is the vault’s current yield”. Three tools, each of which internally calls two or three ABI functions and returns a composed answer. Fewer tools, each higher-signal, each mapped to a recognizable user turn.\u003C/p>\n\u003Ch2>Structured errors the model can recover from\u003C/h2>\n\u003Cp>When a tool call fails, the model reads the error and decides what to do next. Give it something to work with.\u003C/p>\n\u003Cpre>\u003Ccode class=\"language-json\">{\n  &quot;error&quot;:      &quot;chain_mismatch&quot;,\n  &quot;message&quot;:    &quot;Wallet is connected to ethereum but this tool targets base. 
Ask the user to switch networks.&quot;,\n  &quot;recoverable&quot;: true,\n  &quot;hint&quot;:       &quot;call get_supported_networks to list valid values&quot;\n}\n\u003C/code>\u003C/pre>\n\u003Cp>A model that gets this error can recover gracefully in a single turn. A model that gets \u003Ccode>Error: execution reverted\u003C/code> cannot. The rule is: every error includes a machine-readable code, a human-sentence explanation of what to do, and a clear signal of whether the call is worth retrying with different arguments.\u003C/p>\n\u003Cp>This matters most on write tools, where a revert can happen for a dozen user-fixable reasons (insufficient balance, missing approval, wrong deadline, paused contract). Each of those is a different code, a different message, a different recovery path.\u003C/p>\n\u003Ch2>The anti-patterns we see every week\u003C/h2>\n\u003Cp>A quick field survey of what to avoid, based on the MCP servers we audit.\u003C/p>\n\u003Cul>\n\u003Cli>\u003Cstrong>The 50-tool dump.\u003C/strong> Every ABI function exposed, nothing shaped, nothing cut. Accuracy tanks past about 15-20 tools per server for current-generation models.\u003C/li>\n\u003Cli>\u003Cstrong>Naked \u003Ccode>uint256\u003C/code> outputs.\u003C/strong> Tools that return \u003Ccode>{ &quot;result&quot;: &quot;18923849...&quot; }\u003C/code> with no decimals, no units, no symbol. The model has no way to turn that into a user answer.\u003C/li>\n\u003Cli>\u003Cstrong>Enum-less network parameters.\u003C/strong> Free-string \u003Ccode>network\u003C/code> fields that accept whatever the model decides. You will see “eth”, “mainnet”, “Ethereum”, “ethereum-mainnet”, and “1” - all in the first week.\u003C/li>\n\u003Cli>\u003Cstrong>Admin functions exposed to end users.\u003C/strong> \u003Ccode>pause()\u003C/code>, \u003Ccode>grantRole()\u003C/code>, \u003Ccode>upgradeTo()\u003C/code>. These should never be in the tool list a generic assistant sees. 
Gate them behind a separate permissioned endpoint or do not ship them at all.\u003C/li>\n\u003Cli>\u003Cstrong>Identical-looking tool pairs.\u003C/strong> \u003Ccode>getBalance\u003C/code> and \u003Ccode>getBalanceAtBlock\u003C/code>, with one-sentence descriptions. The model picks wrong half the time and you do not find out until users complain.\u003C/li>\n\u003C/ul>\n\u003Cp>Every one of these is fixable by going back through the naming, description, input, and output steps above.\u003C/p>\n\u003Ch2>Testing the way the model does\u003C/h2>\n\u003Cp>The last step, and the one that stops teams from getting trapped in endless local iteration: test with the actual model you expect users to use. Run a representative set of user prompts through Claude or your agent runtime, inspect which tools get picked, and look at the arguments. If the model picks wrong or fills badly, the fix is almost always in the name, description, or schema - not in the implementation.\u003C/p>\n\u003Cp>A light prompt suite of twenty or thirty questions, run after every tool change, catches 90% of regressions before they ship. It is the cheapest insurance in the stack.\u003C/p>\n\u003Ch2>Recap\u003C/h2>\n\u003Cul>\n\u003Cli>Name tools verb-first, specific, and readable. The name should stand alone.\u003C/li>\n\u003Cli>Write descriptions as decision aids, not documentation. Answer \u003Cem>when to use this\u003C/em> and \u003Cem>when not to\u003C/em>.\u003C/li>\n\u003Cli>Constrain inputs with enums, patterns, and defaults. 
Flat schemas beat nested ones.\u003C/li>\n\u003Cli>Reshape outputs to 3-5 named fields, with decimal, time, and enum decoding done for the model.\u003C/li>\n\u003Cli>Index tools by user intent, not by ABI function.\u003C/li>\n\u003Cli>Return structured errors that tell the model how to recover.\u003C/li>\n\u003Cli>Test with the real model, not with unit tests alone.\u003C/li>\n\u003C/ul>\n\u003Cp>For the shorter version of this post and a walkthrough of the ChainContext flow, see \u003Ca href=\"/blog/from-abi-to-mcp-server-in-5-minutes\">From ABI to MCP server in 5 minutes\u003C/a>. For the backstory and the wider MCP-for-Web3 thesis, see \u003Ca href=\"/blog/introducing-chaincontext\">Introducing ChainContext\u003C/a>. New posts drop in the \u003Ca href=\"/blog/rss.xml\">RSS feed\u003C/a>.\u003C/p>\n",{"slug":10,"name":18,"role":19,"bio":20,"url":21},"ChainContext Dev Team","Engineering","We build ChainContext - a no-code MCP server builder for Web3 teams. Posts here are collective notes from our engineering and product work.","https://chaincontext.dev",1777015267835]