<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Retrieval | Roy Gabriel</title><link>https://roygabriel.dev/tags/retrieval/</link><description>Roy Gabriel: DevOps Architect &amp; Applied AI Engineer. Technical blog on Go, MCP servers, Kubernetes, and production AI systems.</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Fri, 27 Feb 2026 03:18:04 +0000</lastBuildDate><atom:link href="https://roygabriel.dev/tags/retrieval/index.xml" rel="self" type="application/rss+xml"/><item><title>Tool Discovery at Scale: Solving the Million Tool Problem</title><link>https://roygabriel.dev/blog/million-tool-problem/</link><pubDate>Sat, 15 Nov 2025 12:00:00 -0500</pubDate><guid>https://roygabriel.dev/blog/million-tool-problem/</guid><description>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Tool-using agents are powerful &lt;em&gt;because&lt;/em&gt; they can do real work: read systems, change systems, orchestrate workflows.&lt;/p&gt;
&lt;p&gt;The trap is what I call the &lt;strong&gt;Million Tool Problem&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;The moment you have &amp;ldquo;enough tools,&amp;rdquo; tool selection becomes harder than tool execution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At small scale, you can stuff tool schemas into the prompt and hope the model chooses correctly. At scale, that approach breaks:&lt;/p&gt;</description><content:encoded>&lt;h2 id="why-this-matters"&gt;Why this matters&lt;/h2&gt;
&lt;p&gt;Tool-using agents are powerful &lt;em&gt;because&lt;/em&gt; they can do real work: read systems, change systems, orchestrate workflows.&lt;/p&gt;
&lt;p&gt;The trap is what I call the &lt;strong&gt;Million Tool Problem&lt;/strong&gt;:&lt;/p&gt;
&lt;blockquote class="border-l-4 border-neutral-300 dark:border-neutral-600 pl-4 italic text-neutral-600 dark:text-neutral-400 my-6"&gt;
&lt;p&gt;The moment you have &amp;ldquo;enough tools,&amp;rdquo; tool selection becomes harder than tool execution.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;At small scale, you can stuff tool schemas into the prompt and hope the model chooses correctly. At scale, that approach breaks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;token budgets explode&lt;/li&gt;
&lt;li&gt;accuracy drops (models confuse similar tools)&lt;/li&gt;
&lt;li&gt;latency rises (bigger prompts, more reasoning)&lt;/li&gt;
&lt;li&gt;safety degrades (wrong tool, wrong args, wrong side effects)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This isn&amp;rsquo;t hypothetical. Tool-use research exists because selection is hard. Benchmarks like ToolBench and AgentBench exist specifically to evaluate this capability in interactive settings. [3][6]&lt;/p&gt;
&lt;p&gt;This post is a production-first design for tool discovery that stays:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;fast&lt;/strong&gt; (low latency, bounded prompt size)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;safe&lt;/strong&gt; (tool contracts and policy gates)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;debuggable&lt;/strong&gt; (you can explain why a tool was chosen)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;maintainable&lt;/strong&gt; (tool catalogs evolve constantly)&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Tool discovery is an &lt;strong&gt;information retrieval (IR) problem + a policy problem&lt;/strong&gt;, not a prompt trick.&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;3-stage selector&lt;/strong&gt;:
&lt;ol&gt;
&lt;li&gt;coarse filter (tags / domain / allowlist)&lt;/li&gt;
&lt;li&gt;retrieval (BM25 + embeddings)&lt;/li&gt;
&lt;li&gt;rerank (LLM or learned ranker)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;Treat tool descriptions as a product:
&lt;ul&gt;
&lt;li&gt;consistent naming&lt;/li&gt;
&lt;li&gt;sharp &amp;ldquo;when to use&amp;rdquo; / &amp;ldquo;when not to use&amp;rdquo;&lt;/li&gt;
&lt;li&gt;examples of correct arguments&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;tool quality scoring&lt;/strong&gt; (latency, error rate, drift, safety incidents).&lt;/li&gt;
&lt;li&gt;Build a tight evaluation harness (ToolBench/StableToolBench ideas apply). [3][4]&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="contents"&gt;Contents&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="#why-include-all-tools-fails"&gt;Why &amp;ldquo;include all tools&amp;rdquo; fails&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#the-3-stage-tool-selector"&gt;The 3-stage tool selector&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#tool-metadata-that-makes-models-smarter"&gt;Tool metadata that makes models smarter&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#ranking-bm25--embeddings--rerank"&gt;Ranking: BM25 + embeddings + rerank&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#safety-allowlists-danger-gates-and-budgets"&gt;Safety: allowlists, &amp;ldquo;danger gates,&amp;rdquo; and budgets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#quality-scoring-and-tool-quarantine"&gt;Quality scoring and tool quarantine&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#debuggability-explainable-tool-selection"&gt;Debuggability: explainable tool selection&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-minimal-reference-architecture"&gt;A minimal reference architecture&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#a-production-checklist"&gt;A production checklist&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="#references"&gt;References&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="why-include-all-tools-fails"&gt;Why &amp;ldquo;include all tools&amp;rdquo; fails&lt;/h2&gt;
&lt;h3 id="token-and-latency-pressure"&gt;Token and latency pressure&lt;/h3&gt;
&lt;p&gt;Even if your tool schemas are &amp;ldquo;small,&amp;rdquo; they add up. Once you cross a few dozen tools, you spend more tokens describing tools than describing the task.&lt;/p&gt;
&lt;h3 id="confusability"&gt;Confusability&lt;/h3&gt;
&lt;p&gt;Tools with similar names or overlapping domains cause selection errors:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_events&lt;/code&gt; vs &lt;code&gt;list_events&lt;/code&gt; vs &lt;code&gt;get_event&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;create_task&lt;/code&gt; vs &lt;code&gt;create_issue&lt;/code&gt; vs &lt;code&gt;create_ticket&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="the-long-tail-problem"&gt;The long tail problem&lt;/h3&gt;
&lt;p&gt;Most catalogs have a long tail:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10 tools get used daily&lt;/li&gt;
&lt;li&gt;100 tools get used weekly&lt;/li&gt;
&lt;li&gt;1,000 tools are niche, but critical when needed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is exactly the kind of situation information retrieval was invented for.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="the-3-stage-tool-selector"&gt;The 3-stage tool selector&lt;/h2&gt;
&lt;p&gt;Think like a search engine:&lt;/p&gt;
&lt;h3 id="stage-0-policy-filter-mandatory"&gt;Stage 0: Policy filter (mandatory)&lt;/h3&gt;
&lt;p&gt;Before ranking, enforce policy:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;which tools is this client allowed to call?&lt;/li&gt;
&lt;li&gt;which tools are enabled for this tenant/environment?&lt;/li&gt;
&lt;li&gt;which tools are safe for this context (read-only mode, incident mode, etc.)?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;MCP makes tool discovery explicit: clients request the tool list (&lt;code&gt;tools/list&lt;/code&gt;), and each tool comes with an input schema. That&amp;rsquo;s an interface you can mediate with policy. [1]&lt;/p&gt;
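&lt;p&gt;A minimal sketch of the Stage 0 filter, under stated assumptions: &lt;code&gt;ToolEntry&lt;/code&gt;, &lt;code&gt;RequestContext&lt;/code&gt;, and their fields are hypothetical names for illustration and are not part of MCP. The point is only that the filter runs before any ranking and removes tools outright rather than down-weighting them.&lt;/p&gt;

```python
from dataclasses import dataclass


# Hypothetical catalog record; field names are illustrative, not part of MCP.
@dataclass(frozen=True)
class ToolEntry:
    name: str
    side_effect: str          # "read", "create", "update", or "delete"
    environments: frozenset   # environments where the tool is enabled


# Hypothetical per-request policy context.
@dataclass(frozen=True)
class RequestContext:
    allowed_tools: frozenset  # per-client allowlist
    environment: str          # e.g. "prod" or "dev"
    read_only: bool = False   # e.g. read-only or incident mode


def policy_filter(catalog, ctx):
    """Return only the tools this request may see, before any ranking runs."""
    visible = []
    for tool in catalog:
        if tool.name not in ctx.allowed_tools:
            continue  # client is not allowed to call it
        if ctx.environment not in tool.environments:
            continue  # not enabled for this environment
        if ctx.read_only and tool.side_effect != "read":
            continue  # mutations are hidden in read-only mode
        visible.append(tool)
    return visible


catalog = [
    ToolEntry("list_events", "read", frozenset({"prod", "dev"})),
    ToolEntry("delete_event", "delete", frozenset({"prod", "dev"})),
    ToolEntry("create_task", "create", frozenset({"dev"})),
]
ctx = RequestContext(
    allowed_tools=frozenset({"list_events", "delete_event", "create_task"}),
    environment="prod",
    read_only=True,
)
visible_names = [t.name for t in policy_filter(catalog, ctx)]
```

&lt;p&gt;With this context only &lt;code&gt;list_events&lt;/code&gt; survives: the delete tool is hidden by read-only mode and the task tool is not enabled in prod.&lt;/p&gt;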
&lt;h3 id="stage-1-coarse-routing-cheap"&gt;Stage 1: Coarse routing (cheap)&lt;/h3&gt;
&lt;p&gt;Route into the right &amp;ldquo;tool neighborhood&amp;rdquo; using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tags (&lt;code&gt;kubernetes&lt;/code&gt;, &lt;code&gt;calendar&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;domains (&amp;ldquo;devops&amp;rdquo;, &amp;ldquo;productivity&amp;rdquo;, &amp;ldquo;security&amp;rdquo;)&lt;/li&gt;
&lt;li&gt;environment (&amp;ldquo;prod&amp;rdquo; vs &amp;ldquo;dev&amp;rdquo;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goal: reduce the candidate set from 10,000 to 300.&lt;/p&gt;
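&lt;p&gt;One cheap way to do this routing is a keyword-to-tag lookup followed by a tag-overlap filter. The sketch below assumes that approach; the &lt;code&gt;KEYWORD_TAGS&lt;/code&gt; and &lt;code&gt;TOOL_TAGS&lt;/code&gt; tables and the tool names are made up for illustration, and a real router might use a small classifier instead.&lt;/p&gt;

```python
# Hypothetical keyword-to-tag map; a real system might use a small classifier.
KEYWORD_TAGS = {
    "pod": "kubernetes",
    "deployment": "kubernetes",
    "meeting": "calendar",
    "invite": "calendar",
    "inbox": "email",
}

# Hypothetical tool-to-tags index.
TOOL_TAGS = {
    "get_deployment_status": {"kubernetes"},
    "list_pods": {"kubernetes"},
    "find_calendar_conflicts": {"calendar"},
    "send_email": {"email"},
}


def route_tags(query):
    """Infer candidate tags from keywords in the query."""
    words = query.lower().split()
    return {KEYWORD_TAGS[w] for w in words if w in KEYWORD_TAGS}


def coarse_route(query, tool_tags):
    """Keep tools sharing at least one tag with the query.

    Falls back to the full set when no tag matches, so routing
    never silently returns nothing.
    """
    tags = route_tags(query)
    if not tags:
        return sorted(tool_tags)
    return sorted(name for name, t in tool_tags.items() if t.intersection(tags))


candidates = coarse_route("check if deployment is stuck", TOOL_TAGS)
```

&lt;p&gt;The fallback matters: a router that returns an empty set on an unfamiliar query turns a ranking problem into an outage.&lt;/p&gt;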
&lt;h3 id="stage-2-retrieval-bm25--embeddings"&gt;Stage 2: Retrieval (BM25 + embeddings)&lt;/h3&gt;
&lt;p&gt;Run a hybrid search over:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool name&lt;/li&gt;
&lt;li&gt;tool description&lt;/li&gt;
&lt;li&gt;parameter names&lt;/li&gt;
&lt;li&gt;example calls&lt;/li&gt;
&lt;li&gt;&amp;ldquo;when not to use&amp;rdquo; hints&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hybrid search is pragmatic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lexical retrieval (BM25-style) is great for exact matches and acronyms [9]&lt;/li&gt;
&lt;li&gt;embeddings are great for semantic similarity [7]&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goal: 300 -&amp;gt; 30.&lt;/p&gt;
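&lt;p&gt;The two result lists still have to be merged. This post does not prescribe a fusion method; one common choice is reciprocal rank fusion (RRF), sketched here as an assumption. The tool names and the two hit lists are hypothetical.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=30):
    """Merge several ranked lists of tool names into one.

    Each list contributes 1 / (k + rank) per tool, with rank starting at 1.
    k=60 is a commonly used default, not a tuned value.
    """
    scores = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by name for determinism.
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return ordered[:top_n]


# Hypothetical outputs of the two retrievers for one query.
lexical_hits = ["list_pods", "get_deployment_status", "restart_deployment"]
embedding_hits = ["get_deployment_status", "describe_rollout", "list_pods"]

fused = reciprocal_rank_fusion([lexical_hits, embedding_hits])
```

&lt;p&gt;RRF only looks at ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale.&lt;/p&gt;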
&lt;h3 id="stage-3-rerank-expensive-accurate"&gt;Stage 3: Rerank (expensive, accurate)&lt;/h3&gt;
&lt;p&gt;Rerank the top-K tools using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an LLM judge (cheap if K is small)&lt;/li&gt;
&lt;li&gt;or a learned ranker&lt;/li&gt;
&lt;li&gt;or deterministic rules + a smaller LLM tie-breaker&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Goal: 30 -&amp;gt; 5.&lt;/p&gt;
&lt;p&gt;Then the agent sees a small, high-quality tool set.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="tool-metadata-that-makes-models-smarter"&gt;Tool metadata that makes models smarter&lt;/h2&gt;
&lt;p&gt;If you want better tool selection, stop treating tool schemas as &amp;ldquo;just types.&amp;rdquo; Add metadata that improves discrimination.&lt;/p&gt;
&lt;h3 id="tool-card-fields-recommended"&gt;Tool card fields (recommended)&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Name&lt;/strong&gt;: stable, verb-first&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: one sentence&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When to use&lt;/strong&gt;: 2-4 bullets&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;When NOT to use&lt;/strong&gt;: 2-4 bullets (this is underrated)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Side effects&lt;/strong&gt;: none / read-only / creates / updates / deletes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Required arguments&lt;/strong&gt;: and why they&amp;rsquo;re required&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Examples&lt;/strong&gt;: 2-3 example invocations with realistic args&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Error modes&lt;/strong&gt;: rate limit, auth, not found, validation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This reduces tool confusion because it gives both the retriever and the model &lt;em&gt;differentiating features&lt;/em&gt; for telling similar tools apart.&lt;/p&gt;
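&lt;p&gt;As a rough illustration, the card fields above could be held in a small record and flattened into one searchable string. &lt;code&gt;ToolCard&lt;/code&gt; and &lt;code&gt;index_text&lt;/code&gt; are hypothetical names, and the example card is invented; treat this as a sketch of the idea rather than a schema.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class ToolCard:
    """Hypothetical tool card; fields mirror the recommended list."""
    name: str
    purpose: str
    when_to_use: list
    when_not_to_use: list
    side_effects: str      # "read-only", "creates", "updates", or "deletes"
    required_args: dict    # argument name -> why it is required
    examples: list = field(default_factory=list)
    error_modes: list = field(default_factory=list)

    def index_text(self):
        """Flatten the card into one string for lexical and embedding indexes."""
        parts = [self.name.replace("_", " "), self.purpose]
        parts += ["Use when: " + s for s in self.when_to_use]
        parts += ["Do not use when: " + s for s in self.when_not_to_use]
        parts += ["Argument " + a + ": " + why for a, why in self.required_args.items()]
        parts += self.examples
        return "\n".join(parts)


card = ToolCard(
    name="search_events",
    purpose="Full-text search over calendar events.",
    when_to_use=["the user describes an event but has no event ID"],
    when_not_to_use=["the event ID is known (use get_event)"],
    side_effects="read-only",
    required_args={"query": "free-text search terms"},
    examples=['search_events(query="quarterly review")'],
)
text = card.index_text()
```

&lt;p&gt;Indexing the &amp;ldquo;do not use&amp;rdquo; lines alongside the purpose gives the reranker explicit evidence for choosing &lt;code&gt;get_event&lt;/code&gt; over &lt;code&gt;search_events&lt;/code&gt; when an ID is present.&lt;/p&gt;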
&lt;hr&gt;
&lt;h2 id="ranking-bm25--embeddings--rerank"&gt;Ranking: BM25 + embeddings + rerank&lt;/h2&gt;
&lt;h3 id="lexical-retrieval-bm25"&gt;Lexical retrieval (BM25)&lt;/h3&gt;
&lt;p&gt;BM25 and probabilistic retrieval approaches are foundational in search. [9]&lt;/p&gt;
&lt;p&gt;Practical benefit: it handles queries like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;S3&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;JWT&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;PodDisruptionBudget&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Cron&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;hellip;where embeddings can be inconsistent.&lt;/p&gt;
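&lt;p&gt;For a sense of what the lexical side does, here is a small self-contained BM25 scorer. It is a sketch, not a replacement for a search library: the tokenizer is deliberately naive, &lt;code&gt;k1&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are conventional defaults, the IDF uses the non-negative variant, and the tool descriptions are hypothetical.&lt;/p&gt;

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics; tokens like s3 survive intact."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Small Okapi BM25 scorer; k1 and b are conventional defaults, not tuned."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.names = list(docs)
        self.term_counts = [Counter(tokenize(docs[n])) for n in self.names]
        self.lengths = [sum(tc.values()) for tc in self.term_counts]
        self.avg_len = sum(self.lengths) / len(self.lengths)
        # Document frequency: number of documents containing each term.
        self.doc_freq = Counter(t for tc in self.term_counts for t in tc)

    def idf(self, term):
        n, df = len(self.names), self.doc_freq.get(term, 0)
        return math.log(1.0 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        tc, norm = self.term_counts[i], self.lengths[i] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            tf = tc.get(term, 0)
            denom = tf + self.k1 * (1.0 - self.b + self.b * norm)
            total += self.idf(term) * tf * (self.k1 + 1.0) / denom
        return total

    def search(self, query, top_n=30):
        scored = [(self.score(query, i), n) for i, n in enumerate(self.names)]
        hits = [(s, n) for s, n in scored if s > 0.0]
        return [n for s, n in sorted(hits, key=lambda h: (-h[0], h[1]))[:top_n]]


# Hypothetical tool descriptions.
docs = {
    "list_s3_buckets": "List S3 buckets in the account",
    "verify_jwt": "Verify a JWT signature and expiry",
    "get_pdb": "Get a PodDisruptionBudget for a namespace",
}
index = BM25Index(docs)
hits = index.search("which S3 buckets exist")
```

&lt;p&gt;Exact tokens such as &lt;code&gt;s3&lt;/code&gt; or &lt;code&gt;poddisruptionbudget&lt;/code&gt; either match or they do not, which is the predictability you want for identifiers.&lt;/p&gt;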
&lt;h3 id="embeddings"&gt;Embeddings&lt;/h3&gt;
&lt;p&gt;Sentence embeddings (like SBERT-style approaches) are designed to enable efficient semantic similarity search. [7]&lt;/p&gt;
&lt;p&gt;Practical benefit: it handles intent queries like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;delete all tasks due tomorrow&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;find calendar conflicts next week&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;check if deployment is stuck&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
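&lt;p&gt;The retrieval step itself is a nearest-neighbor search over vectors. The sketch below does it by brute force with cosine similarity. The three-dimensional vectors are invented stand-ins for real sentence embeddings, which would come from an embedding model, and the tool names are hypothetical.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_tools(query_vec, tool_vecs, top_n=30):
    """Exact (brute-force) nearest neighbors by cosine similarity."""
    scored = [(cosine(query_vec, v), name) for name, v in tool_vecs.items()]
    return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]]


# Toy 3-d vectors standing in for real sentence embeddings (made up for
# illustration; a real system would get these from an embedding model).
tool_vecs = {
    "delete_tasks": [0.9, 0.1, 0.0],
    "find_calendar_conflicts": [0.1, 0.9, 0.1],
    "get_deployment_status": [0.0, 0.1, 0.9],
}
query_vec = [0.1, 0.2, 0.8]  # pretend embedding of "check if deployment is stuck"

ranked = nearest_tools(query_vec, tool_vecs, top_n=2)
```

&lt;p&gt;Brute force is exact and easy to reason about, which makes it a useful baseline for checking an approximate index later.&lt;/p&gt;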
&lt;h3 id="approximate-nearest-neighbor-indexing"&gt;Approximate nearest neighbor indexing&lt;/h3&gt;
&lt;p&gt;At scale, you&amp;rsquo;ll want ANN indexing (FAISS is a well-known library in this space). [8]&lt;/p&gt;
&lt;h3 id="rerank"&gt;Rerank&lt;/h3&gt;
&lt;p&gt;This is where you incorporate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool quality score&lt;/li&gt;
&lt;li&gt;tenant policy&lt;/li&gt;
&lt;li&gt;&amp;ldquo;danger tool&amp;rdquo; gating&lt;/li&gt;
&lt;li&gt;recent tool drift&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Reranking is also where you can enforce &amp;ldquo;don&amp;rsquo;t pick write tools unless necessary.&amp;rdquo;&lt;/p&gt;
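&lt;p&gt;A deterministic reranker along these lines might look like the following. The &lt;code&gt;Candidate&lt;/code&gt; record, the &lt;code&gt;needs_write&lt;/code&gt; flag, and the penalty weight are all assumptions made for the sketch; an LLM judge or learned ranker would replace the scoring line.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical rerank input for one tool."""
    name: str
    relevance: float      # retrieval score normalized to 0..1
    quality: float        # tool quality score in 0..1 (1.0 = healthy)
    writes: bool          # True if the tool mutates state
    dangerous: bool = False


def rerank(candidates, needs_write, top_n=5, write_penalty=0.3):
    """Order candidates by relevance weighted by quality, penalizing writes.

    Weights are illustrative assumptions, not tuned values.
    """
    scored = []
    for c in candidates:
        if c.dangerous and not needs_write:
            continue  # danger gate: never surface unless a write is required
        score = c.relevance * c.quality
        if c.writes and not needs_write:
            score -= write_penalty  # prefer read tools unless a write is needed
        scored.append((score, c.name))
    return [name for s, name in sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]]


candidates = [
    Candidate("restart_deployment", relevance=0.9, quality=0.95, writes=True),
    Candidate("get_deployment_status", relevance=0.8, quality=0.9, writes=False),
    Candidate("delete_namespace", relevance=0.5, quality=0.9, writes=True, dangerous=True),
]
read_first = rerank(candidates, needs_write=False)
write_ok = rerank(candidates, needs_write=True)
```

&lt;p&gt;When no write is needed, the read tool outranks the more relevant write tool and the dangerous tool never appears. When a write is needed, plain relevance times quality decides.&lt;/p&gt;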
&lt;hr&gt;
&lt;h2 id="safety-allowlists-danger-gates-and-budgets"&gt;Safety: allowlists, &amp;ldquo;danger gates,&amp;rdquo; and budgets&lt;/h2&gt;
&lt;p&gt;Tool discovery is not neutral. It&amp;rsquo;s an &lt;em&gt;authorization problem&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Your selector should be policy-aware:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Read-only mode&lt;/strong&gt;: only surface read tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No-delete mode&lt;/strong&gt;: deletes never appear&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prod incident mode&lt;/strong&gt;: allow observation tools, restrict mutation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Human approval mode&lt;/strong&gt;: show write tools, but require confirmation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Also: build budgets into selection.
If a tool is expensive (slow, rate-limited, high blast radius), rank it lower unless strongly justified.&lt;/p&gt;
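&lt;p&gt;One way to express that is a multiplicative discount on the relevance score. Everything in this sketch is an assumption: the &lt;code&gt;blast_radius&lt;/code&gt; rating, the latency budget, and the specific factors are placeholders to show the shape of the idea.&lt;/p&gt;

```python
def budget_adjusted(score, p95_latency_ms, rate_limited, blast_radius,
                    latency_budget_ms=2000.0):
    """Scale a relevance score down for slow, rate-limited, or risky tools.

    blast_radius is a hypothetical 0..1 rating (0 = harmless, 1 = severe).
    All factors below are illustrative assumptions.
    """
    # Tools within the latency budget are not penalized; slower ones scale down.
    latency_factor = min(1.0, latency_budget_ms / max(p95_latency_ms, 1.0))
    rate_factor = 0.8 if rate_limited else 1.0
    risk_factor = 1.0 - 0.5 * blast_radius
    return score * latency_factor * rate_factor * risk_factor


cheap = budget_adjusted(0.70, p95_latency_ms=150.0, rate_limited=False, blast_radius=0.0)
costly = budget_adjusted(0.80, p95_latency_ms=4000.0, rate_limited=True, blast_radius=0.5)
```

&lt;p&gt;Here the cheap tool keeps its score while the slower, riskier one drops well below it despite higher raw relevance. Because the discount is multiplicative, a strongly relevant expensive tool can still win.&lt;/p&gt;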
&lt;p&gt;For tool-using agents, OWASP highlights prompt injection and excessive agency as key risks - exactly the failure modes you get when tools are over-exposed without gates. [10]&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="quality-scoring-and-tool-quarantine"&gt;Quality scoring and tool quarantine&lt;/h2&gt;
&lt;p&gt;You need a &lt;strong&gt;tool quality score&lt;/strong&gt; because tools drift:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;upstream APIs change&lt;/li&gt;
&lt;li&gt;auth breaks&lt;/li&gt;
&lt;li&gt;quotas shift&lt;/li&gt;
&lt;li&gt;tool server regressions happen&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Track per tool:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;p50 / p95 latency&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;timeout rate&lt;/li&gt;
&lt;li&gt;&amp;ldquo;invalid argument&amp;rdquo; rate (often a selection problem)&lt;/li&gt;
&lt;li&gt;&amp;ldquo;unsafe attempt&amp;rdquo; rate (policy violations)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then take action:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;quarantine tools with regression spikes&lt;/li&gt;
&lt;li&gt;degrade to read-only tools during outages&lt;/li&gt;
&lt;li&gt;route to backups (alternate implementations)&lt;/li&gt;
&lt;/ul&gt;
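&lt;p&gt;A quality score and quarantine rule can start very simply. The sketch below assumes per-tool counters over a recent window and compares them with the tool&amp;rsquo;s own baseline; &lt;code&gt;ToolStats&lt;/code&gt;, the weights, and the thresholds are hypothetical, and latency percentiles are left out for brevity.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    """Counters for one tool over a recent window (hypothetical shape)."""
    calls: int
    errors: int
    timeouts: int
    invalid_args: int


def quality_score(stats):
    """Return a 0..1 score; 1.0 means no failures in the window.

    The weights are illustrative assumptions, not recommendations.
    """
    if stats.calls == 0:
        return 1.0  # no evidence yet; treat as healthy
    error_rate = stats.errors / stats.calls
    timeout_rate = stats.timeouts / stats.calls
    invalid_rate = stats.invalid_args / stats.calls
    penalty = 1.0 * error_rate + 1.0 * timeout_rate + 0.5 * invalid_rate
    return max(0.0, 1.0 - penalty)


def should_quarantine(current, baseline, min_calls=50, max_drop=0.2):
    """Quarantine when quality falls sharply below the tool's own baseline."""
    if min_calls > current.calls:
        return False  # too few calls to judge
    return quality_score(baseline) - quality_score(current) > max_drop


baseline = ToolStats(calls=1000, errors=10, timeouts=5, invalid_args=20)
regressed = ToolStats(calls=200, errors=60, timeouts=10, invalid_args=4)
quarantine = should_quarantine(regressed, baseline)
```

&lt;p&gt;Comparing against a per-tool baseline, rather than a global threshold, keeps a tool that has always been flaky from being quarantined on every evaluation.&lt;/p&gt;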
&lt;hr&gt;
&lt;h2 id="debuggability-explainable-tool-selection"&gt;Debuggability: explainable tool selection&lt;/h2&gt;
&lt;p&gt;If you can&amp;rsquo;t answer &lt;strong&gt;&amp;ldquo;why did the agent pick that tool?&amp;rdquo;&lt;/strong&gt;, you won&amp;rsquo;t be able to operate the system.&lt;/p&gt;
&lt;p&gt;Log (or attach to traces) the selection evidence:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;query text&lt;/li&gt;
&lt;li&gt;candidate tools (top 30)&lt;/li&gt;
&lt;li&gt;retrieval scores&lt;/li&gt;
&lt;li&gt;rerank scores&lt;/li&gt;
&lt;li&gt;policy filters applied&lt;/li&gt;
&lt;li&gt;final selected tools and why&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This also becomes training data later.&lt;/p&gt;
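&lt;p&gt;In practice this can be one structured record per selection. The field names below are hypothetical and should be aligned with whatever tracing schema is already in use; the scores are invented for the example.&lt;/p&gt;

```python
import json


def selection_record(query, candidates, filters, selected, reason):
    """Build one JSON log line describing a tool-selection decision.

    candidates: list of (tool name, retrieval score, rerank score).
    Field names are hypothetical; align them with your tracing schema.
    """
    record = {
        "query": query,
        "policy_filters": filters,
        "candidates": [
            {"tool": n, "retrieval_score": r, "rerank_score": k}
            for n, r, k in candidates
        ],
        "selected": selected,
        "reason": reason,
    }
    # sort_keys keeps the line stable so records diff cleanly.
    return json.dumps(record, sort_keys=True)


line = selection_record(
    query="check if deployment is stuck",
    candidates=[
        ("get_deployment_status", 0.91, 0.72),
        ("restart_deployment", 0.88, 0.56),
    ],
    filters=["allowlist", "read_only"],
    selected=["get_deployment_status"],
    reason="highest rerank score among read tools",
)
```

&lt;p&gt;Keeping rejected candidates and their scores in the record is what makes it usable as training data: each line is a labeled ranking example.&lt;/p&gt;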
&lt;hr&gt;
&lt;h2 id="a-minimal-reference-architecture"&gt;A minimal reference architecture&lt;/h2&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-text" data-lang="text"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Agent runtime (planner) -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Tool Selector Service -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- - policy filter -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- - hybrid retrieval -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- - rerank -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- - tool quality weighting -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; - returns top-K tools + schemas
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; v
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- Agent execution -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- - calls tools via MCP -
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;-------------------------------
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Where MCP fits: MCP provides a standardized way for clients to discover tools and invoke them. [1]&lt;/p&gt;
&lt;p&gt;The selector doesn&amp;rsquo;t replace MCP. It makes MCP usable at scale.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="a-production-checklist"&gt;A production checklist&lt;/h2&gt;
&lt;h3 id="tool-catalog-hygiene"&gt;Tool catalog hygiene&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Stable naming conventions.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; &amp;ldquo;When NOT to use&amp;rdquo; bullets exist.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Examples exist for the top tools.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool side effects are classified.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="selection-pipeline"&gt;Selection pipeline&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Mandatory policy filter before ranking.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Hybrid retrieval (lexical + embeddings). [7][9]&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Rerank top-K with quality + policy.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Candidate set bounded (K is small).&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="safety"&gt;Safety&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Dangerous tools are gated and not surfaced by default.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Budget-aware ranking exists.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; OWASP LLM risks considered in tool exposure strategy. [10]&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="operability"&gt;Operability&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Selection decisions are explainable (log evidence).&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Tool quality scoring exists and drives quarantine.&lt;/li&gt;
&lt;li&gt;&lt;input disabled="" type="checkbox"&gt; Selection regressions are covered by evals (next article).&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;h2 id="references"&gt;References&lt;/h2&gt;
&lt;p&gt;[1] Model Context Protocol (MCP) - Specification (Protocol Revision 2025-11-25): &lt;a href="https://modelcontextprotocol.io/specification/2025-11-25" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-11-25&lt;/a&gt;
[2] MCP - Transports (including stdio and Streamable HTTP): &lt;a href="https://modelcontextprotocol.io/specification/2025-03-26/basic/transports" target="_blank" rel="noopener noreferrer"&gt;https://modelcontextprotocol.io/specification/2025-03-26/basic/transports&lt;/a&gt;
[3] ToolLLM / ToolBench (tool-use dataset + evaluation): &lt;a href="https://arxiv.org/abs/2307.16789" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2307.16789&lt;/a&gt;
[4] StableToolBench (stable tool-use benchmarking): &lt;a href="https://arxiv.org/abs/2403.07714" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2403.07714&lt;/a&gt;
[5] tau-bench (tool-agent-user interaction benchmark): &lt;a href="https://arxiv.org/abs/2406.12045" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2406.12045&lt;/a&gt;
[6] AgentBench (evaluating LLMs as agents): &lt;a href="https://arxiv.org/abs/2308.03688" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2308.03688&lt;/a&gt;
[7] Sentence-BERT (efficient semantic similarity search via embeddings): &lt;a href="https://arxiv.org/abs/1908.10084" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/1908.10084&lt;/a&gt;
[8] FAISS / Billion-scale similarity search with GPUs: &lt;a href="https://arxiv.org/abs/1702.08734" target="_blank" rel="noopener noreferrer"&gt;https://arxiv.org/abs/1702.08734&lt;/a&gt;
and &lt;a href="https://github.com/facebookresearch/faiss" target="_blank" rel="noopener noreferrer"&gt;https://github.com/facebookresearch/faiss&lt;/a&gt;
[9] Robertson (BM25 and probabilistic relevance framework): &lt;a href="https://dl.acm.org/doi/abs/10.1561/1500000019" target="_blank" rel="noopener noreferrer"&gt;https://dl.acm.org/doi/abs/10.1561/1500000019&lt;/a&gt;
[10] OWASP - Top 10 for Large Language Model Applications: &lt;a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/" target="_blank" rel="noopener noreferrer"&gt;https://owasp.org/www-project-top-10-for-large-language-model-applications/&lt;/a&gt;
&lt;/p&gt;</content:encoded></item></channel></rss>