Twenty-six million parameters. That is roughly one-thousandth the size of the models that dominate the AI conversation right now. A team called Cactus Compute has published a small model named Needle that, by their account, reproduces a meaningful slice of what Gemini does when it decides which software tools to call and in what order. If they are right—even partially right—the number that matters is not the benchmark score. It is the electricity bill.

What “Tool Calling” Actually Means, and Why Getting It Small Is Hard
When you ask an AI assistant to book a flight, it does not search the internet the way you would. It issues a structured command to a piece of software—a function, an API—that does the searching for it. Then it reads the result, decides what to do next, and possibly calls another tool. This chain of decisions is what the industry means by “agentic” behavior, and it is where most small models fail badly. They can answer questions. They cannot reliably orchestrate actions across multiple steps without hallucinating a tool name, flipping an argument, or simply stopping mid-chain.
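To make the failure modes concrete, here is a minimal sketch of what a structured tool call looks like and how an application might check one before executing it. The tool names, argument names, and the JSON shape (`name` plus `arguments`) are hypothetical choices for illustration, not the format Needle or Gemini actually emits.

```python
import json

# Hypothetical tool schema: tool name -> required argument names and types.
TOOL_SCHEMA = {
    "search_flights": {"origin": str, "destination": str, "date": str},
    "book_flight": {"flight_id": str, "passenger_name": str},
}


def validate_tool_call(raw: str) -> list[str]:
    """Return the problems found in a model-emitted tool call (empty if valid)."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(call, dict):
        return ["output is not a JSON object"]

    name = call.get("name")
    if name not in TOOL_SCHEMA:
        # The "hallucinated tool name" failure.
        return [f"unknown tool: {name!r}"]

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments is not a JSON object"]

    expected = TOOL_SCHEMA[name]
    problems = []
    for key, expected_type in expected.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected_type):
            problems.append(f"wrong type for argument: {key}")
    for key in args:
        if key not in expected:
            problems.append(f"unexpected argument: {key}")
    return problems


good = '{"name": "search_flights", "arguments": {"origin": "SFO", "destination": "JFK", "date": "2025-01-01"}}'
bad = '{"name": "find_flights", "arguments": {}}'
print(validate_tool_call(good))
print(validate_tool_call(bad))
```

A check like this catches malformed calls, but not a well-formed call with the wrong values in it (origin and destination swapped, for instance). That second kind of error is the harder one for a small model to avoid.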
The reason that competence has been locked inside large, expensive models is partly about raw capability and partly about training data. You need to have seen enough examples of correct multi-step tool use to generalize. Gemini and GPT-4o have seen those examples at scale. A 26-million-parameter model, trained conventionally, has not.
What Needle does differently is use Gemini as a teacher. The process, distilling agent capabilities from a large model into a small one, is not new in principle. The idea is that you run the big model on thousands of tool-calling scenarios, capture its outputs, and train the small model to reproduce those outputs rather than to learn from raw text. The student does not need to rediscover everything the teacher knows. It needs to learn the teacher’s decisions. That is a narrower problem, and narrow problems are solvable at smaller scale.
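The data-collection half of that process can be sketched in a few lines. This is an illustration under stated assumptions, not Cactus Compute's pipeline: `fake_teacher` is a stand-in for a call to the large model, the scenario strings are invented, and the step that discards non-JSON teacher outputs is an added assumption about how one might clean the set.

```python
import json
from typing import Callable


def build_distillation_set(
    scenarios: list[str], teacher: Callable[[str], str]
) -> list[dict]:
    """Run the teacher on each scenario and keep (input, target) pairs.

    Teacher outputs that are not parseable JSON are dropped, so the
    student only ever trains on well-formed tool calls.
    """
    examples = []
    for prompt in scenarios:
        output = teacher(prompt)
        try:
            json.loads(output)
        except json.JSONDecodeError:
            continue  # skip malformed teacher output
        examples.append({"input": prompt, "target": output})
    return examples


def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for the large model; a real one would be an API call."""
    if "inventory" in prompt:
        return json.dumps({"name": "check_inventory", "arguments": {"query": prompt}})
    return "Sorry, I can't help with that."  # not a tool call


scenarios = ["check inventory for SKU 12", "tell me a joke"]
dataset = build_distillation_set(scenarios, fake_teacher)
print(len(dataset), "training example(s) kept")
```

The student is then fine-tuned on these pairs with an ordinary supervised objective. The sketch stops at the dataset because the source does not describe Needle's training setup.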
The Benchmark That Made This Legible
Measuring whether a small model can actually do this has historically been murky. That changed with the release of Full-Duplex-Bench-v3 (FDB-v3), a benchmark specifically designed for spoken-language models operating under real-world conditions—interruptions, fillers, restarts, the kind of messy audio that actual humans produce. FDB-v3 evaluated six model configurations, including GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, and Ultravox, against chained API calls across four task domains.
The benchmark matters here for what it reveals structurally: even frontier models degrade meaningfully under disfluency, and multi-step tool chains amplify every error. If errors at each step are independent, a model that gets step one 90% right and step two 90% right gets the full chain right only 81% of the time. Add a third step and you are at 73%. This is the compounding problem that makes distilling agent capabilities so commercially significant. The question is not whether a small model can match a large one on a clean single-step query, but whether it can hold together across a realistic sequence.
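The compounding arithmetic is easy to reproduce. The sketch below assumes every step succeeds independently with the same probability, which is a simplification: real tool chains can have correlated errors. The function name is illustrative.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming
    independent steps with identical per-step accuracy."""
    return per_step_accuracy ** steps


for steps in range(1, 6):
    print(f"{steps} step(s): {chain_success(0.9, steps):.1%}")
```

Under the same assumption, raising per-step accuracy from 90% to 95% lifts a three-step chain from about 73% to about 86%, which is why small per-step gains matter more as chains get longer.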
Needle’s claim is not that it beats Gemini overall. It is that it beats Gemini on the latency side of that trade-off for simpler, well-defined task categories, while accepting lower accuracy on complex multi-step reasoning. That is an honest characterization, and it is more useful than the usual benchmark theater.
The One Situation Where Small Actually Wins
Here is the scenario where Needle, or something like it, is the correct choice: a voice interface running on a device that cannot tolerate a round-trip to a cloud API. Think of a warehouse handset, a medical bedside terminal, a vehicle infotainment system operating in a low-connectivity environment. The task is bounded—check inventory, update a patient record, confirm a navigation waypoint. The tool schema is fixed. The user population speaks in predictable patterns. In that environment, the latency penalty of routing to Gemini is not a minor inconvenience. It is a product failure.
For that use case, agent capability distilled into 26 million parameters is not a compromise. It is an architecture decision. The model runs locally, responds in milliseconds, and handles the three to five tool-call types the application actually needs. Gemini is not competing for this slot. It is too large, too slow, and too expensive for what the application requires.
The situation where you keep Gemini—or something comparably large—is any scenario where the tool set is open-ended, the stakes of an error are high, or the user can generate requests that fall outside the training distribution. A customer service agent handling edge cases. A research assistant calling unfamiliar APIs. A financial workflow where a malformed function argument triggers an irreversible transaction. In those contexts, the accuracy gap is not a reasonable trade-off. It is a liability.
What Builders Are Actually Getting Wrong Right Now
If you are building an AI-powered product today, you are probably making one of two mistakes. The first is routing every inference through a frontier model because it feels safer, without calculating whether the latency and cost profile can survive at production scale. The second is reaching for a small model because it is cheap, without testing whether it degrades gracefully at the edges of its training distribution.
The smarter path, and the one that Needle gestures toward, is treating distillation of agent capabilities as a design primitive rather than a fallback. You build with the large model first, use it to generate the training signal for the small model, and then deploy the small model with an escalation path to the large model for queries where its confidence falls below a threshold. This is not a novel concept in software engineering. It is what load balancers and tiered caching have done for twenty years. The AI industry is just arriving at it late, because it was easier to pretend that one very large model could handle everything.
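A minimal sketch of that escalation pattern follows. Everything named here is hypothetical: the `Prediction` type, the 0.8 threshold, and the stub models. How a confidence score is actually obtained (mean token probability, a separate verifier, something else) is left open because the source does not say.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Prediction:
    tool_call: str
    confidence: float  # assumed to lie in [0, 1]


def route(
    query: str,
    small_model: Callable[[str], Prediction],
    large_model: Callable[[str], str],
    threshold: float = 0.8,
) -> tuple[str, str]:
    """Try the small model first; escalate to the large model when the
    small model's confidence is below the threshold."""
    prediction = small_model(query)
    if prediction.confidence >= threshold:
        return "small", prediction.tool_call
    return "large", large_model(query)


# Stub models for illustration only.
def stub_small(query: str) -> Prediction:
    if "inventory" in query:
        return Prediction('{"name": "check_inventory"}', 0.95)
    return Prediction('{"name": "unknown"}', 0.30)


def stub_large(query: str) -> str:
    return '{"name": "handle_with_frontier_model"}'


print(route("check inventory for bin 7", stub_small, stub_large))
print(route("reconcile last quarter's returns", stub_small, stub_large))
```

The threshold is the tuning knob. Set it too low and the small model's mistakes reach production; set it too high and most traffic escalates, which erases the cost and latency benefit.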
The Berkeley Function-Calling Leaderboard and MCPMark evaluations that tracked GPT-5, Claude Sonnet 4, and Gemini 2.5 across 2025 tell a consistent story: the gap between frontier models on complex agentic workflows is narrowing faster than anyone expected, but the gap between frontier models and small models on those same tasks remains wide. Distillation is the most credible mechanism for closing it without waiting for hardware to catch up.
“The hard part isn’t getting the small model to call the right tool once. It’s getting it to recover when the first call returns something unexpected.”
— Senior ML engineer, enterprise AI infrastructure team
That recovery problem is where the current generation of distilled models, including Needle, remains genuinely fragile. The teacher-student training process transfers the teacher’s confident decisions well. It transfers the teacher’s error-recovery behavior poorly, because error-recovery scenarios are underrepresented in the training distribution. The large model handles them well, but they are rare enough that the distillation set may not capture them in sufficient volume.
Who Loses When Small Models Get Good Enough
The commercial stakes here are not symmetric. If distilled agent capabilities continue to compress toward the edge at the pace implied by Needle’s release, the first market to restructure is inference-as-a-service. The business model of companies that charge per token for hosted frontier models depends, in part, on there being no credible alternative for production agentic workloads. That assumption is eroding.
The second market to watch is the chip market. Nvidia’s data center dominance is partly a function of the inference compute requirements of large models. A world where meaningful agentic behavior runs on ARM processors in edge devices is a world where the addressable market for H100s looks different. This is not a near-term threat. It is a direction.
The clearest winner in the short run is the developer who builds a product that does not require a frontier model for its core loop but has been paying for one anyway. That developer gets a margin improvement and a latency improvement simultaneously. The losers are the hosted inference providers who have been pricing on the assumption that agentic capability requires their most expensive tier.
You have probably already noticed the implication sitting underneath all of this: the value in AI is migrating from the model itself toward whoever controls the training signal. Gemini’s tool-calling behavior is valuable. But once that behavior is encoded in a 26-million-parameter model, the barrier to replicating it (in a different domain, with a different tool schema, for a different industry vertical) drops substantially. The capability that took Google years to develop becomes, via distillation, a methodology that a small team can apply in weeks.
That is what the Cactus Compute release actually demonstrates. Not that the 26-million-parameter model is better than Gemini. It is not. It demonstrates that the process of making small models agentic is now legible enough to be repeated. The first time something is done, it is a research result. The second and third time, it is a technique. At scale, it is an industry assumption. We are somewhere between the first and second of those stages, which is exactly the moment when it is worth paying attention.
The benchmark researchers who built FDB-v3 were trying to measure something specific: whether voice models could handle the messiness of real human speech while still executing reliable multi-step tool chains. Their findings showed consistent degradation under disfluency across all tested configurations—including frontier models. Needle’s bet is that “good enough under realistic conditions at a fraction of the cost” is a better product than “excellent under clean conditions at full price.” For a large enough slice of production deployments, that bet is probably correct.
FetchLogic Take
Within eighteen months, at least two enterprise SaaS categories—field service management and clinical documentation—will ship production agentic features built on distilled sub-100M parameter models rather than hosted frontier APIs. The cost and latency math will force it, the tools to execute it now exist, and the FDB-v3 benchmark will become the reference evaluation that procurement teams cite when vendors make accuracy claims. Any inference provider that has not announced a distillation-as-a-service offering by Q1 2026 will be visibly late to a market that its own customers are building around it.