The Room Where the NSA Chose Claude


The calendar showed eighteen days until the congressional briefing. Three secure terminals in a Virginia facility had been sitting dark for eleven weeks while analysts waited for clearance on a language model that officially did not exist on any approved procurement list. Someone had already written two versions of the memo explaining why the agency would miss its signals intelligence modernization deadline. Neither version mentioned that the French were already running inference at scale.

The National Security Agency’s deployment of Anthropic’s Claude—specifically the Mythos variant tuned for classified workflows—broke no laws. The model never appeared on a formal blacklist, because formal blacklists require formal proceedings, and formal proceedings leave paper trails that congressional staffers read during budget negotiations. What existed instead was something more pernicious: a consensus among procurement officers that touching anything Anthropic-adjacent would trigger the kind of interagency email storm that turns promotion tracks into early retirement conversations. Yet by late March, Claude was processing signals intelligence across three NSA facilities, marking the sharpest divergence yet between stated technology policy and operational reality in government AI adoption.

Anthropic had spent fourteen months positioning itself as the safety-conscious alternative to OpenAI’s breakneck commercialization. Constitutional AI, the company’s signature approach, promised language models that could refuse harmful requests while maintaining competitive performance. Defense contractors noticed something else: the architecture’s modularity made it easier to air-gap, to fine-tune on classified corpora, to audit for information leakage. Raw performance specifications mattered less than the fifty-page technical report that explained exactly how the model weighted competing objectives—a level of transparency that made risk assessors comfortable and procurement lawyers nervous.

When Policy Lags at Light Speed

Three paths sat on the conference table during the February decision meeting. Path one: wait for the interagency working group on AI procurement to issue guidelines, expected in eight to fourteen months based on the previous three working groups’ timelines. Analysts would continue using legacy keyword systems that missed context, missed sarcasm, missed the entire semantic layer that makes modern signals intelligence worthwhile. Path two: build something in-house, the way NSA built its cryptographic tools for decades. Early prototypes suggested eighteen months and forty million dollars to reach performance parity with commercial models from 2022. Path three: deploy Claude under an existing software licensing authority that technically covered “analytical tools” and prepare very good answers for the inevitable questions.

Pressure came from operational commanders who had seen what Google’s models did for allied services. Context windows had expanded to the point where an analyst could feed an entire day’s worth of intercepts and ask: what changed? The machine would catch the idiom shift that signaled a new operations chief, the repeated phrase that suggested coordinated messaging, the gap in communications that meant someone had switched protocols. Legacy systems required analysts to specify what they were looking for. Modern systems let analysts specify what they were trying to understand. The gap between those two capabilities was the gap between reading transcripts and reading minds.
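That “what changed?” workflow is easy to caricature in a few lines. Below is a minimal sketch of the pattern against Anthropic’s public Messages API; the model string, file names, and prompt wording are illustrative assumptions rather than the agency’s actual tooling, and a classified deployment would not be calling a public endpoint at all.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative placeholders: yesterday's summary plus a day of intercept transcripts.
yesterday_summary = open("yesterday_summary.txt").read()
todays_intercepts = [open(p).read() for p in ["intercept_001.txt", "intercept_002.txt"]]

prompt = (
    "Yesterday's summary:\n" + yesterday_summary +
    "\n\nToday's intercepts:\n" + "\n---\n".join(todays_intercepts) +
    "\n\nWhat changed? Flag shifts in idiom, repeated phrasing, and gaps in traffic."
)

response = client.messages.create(
    model="claude-3-opus-20240229",  # illustrative public model, not the classified variant
    max_tokens=2048,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```

The point of the sketch is the shape of the question, not the plumbing: the analyst describes what they want to understand and lets the model propose what is worth looking at.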

But the real driver was simpler. China’s government AI adoption had accelerated past the point where American technical superiority could compensate for American procedural caution. NSA’s director had seen the intelligence assessments on Ernie 4.0’s deployment across Ministry of State Security facilities. The calculus was blunt: technological advantage degrades faster than policy frameworks adapt. Someone would need to accept career risk.

The Constitutional Loophole

Constitutional AI turned out to be the policy key. Anthropic’s approach meant the model could be prompted with classification guidelines, compartmentalization rules, and handling restrictions that became part of its inference process. An analyst asking Claude to summarize a signals package received summaries that automatically elided sources and methods. The model wouldn’t explain how it knew something if explaining would reveal collection capabilities. Traditional software required programmers to anticipate every potential disclosure scenario. Constitutional AI let security officers specify principles and trust the model to extrapolate.
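In operational terms, the mechanism the paragraph above describes maps onto something quite ordinary: handling principles supplied as a standing system prompt that rides along with every request. Here is a minimal sketch of that pattern using the public Anthropic Messages API; the principle text, function name, and model string are hypothetical, and Anthropic’s published Constitutional AI work also bakes principles in at training time rather than only at inference.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical handling principles a security officer might specify once,
# instead of enumerating every possible disclosure scenario in code.
HANDLING_PRINCIPLES = """\
- Never describe sources, collection methods, or sensor capabilities.
- Summarize content; elide originator identities unless explicitly cleared.
- If a question cannot be answered without revealing how the material was
  obtained, refuse and explain why at an unclassified level.
"""

def summarize_package(package_text: str) -> str:
    """Summarize a signals package under the standing handling principles."""
    response = client.messages.create(
        model="claude-3-opus-20240229",   # illustrative stand-in for a tuned variant
        max_tokens=1024,
        system=HANDLING_PRINCIPLES,       # principles apply to every call
        messages=[{"role": "user", "content": f"Summarize this package:\n\n{package_text}"}],
    )
    return response.content[0].text
```

The upgrade property described below falls out of the same mechanism: editing the principle text changes behavior on the next request, with no retraining and no redeployment.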

Trust is the wrong word—security officers never trust anything—but the testing regime was manageable. Red teams spent five weeks trying to trick Claude into revealing compartmented information. The model failed in ways that were predictable and logged. It occasionally refused to answer questions it should have answered, but overcaution beats disclosure in the national security risk calculus. What sealed the decision was realizing the model could be upgraded without retraining: new restrictions could be added to the constitutional framework and take effect immediately across all deployments. Policy could update at the same cadence as threats.

Which is how you end up with a language model processing some of the nation’s most sensitive intelligence while technically violating no rules, because the rules assumed AI meant narrow systems for specific tasks, not general-purpose reasoning engines that redefined what “analytical tool” meant.

“We stopped asking whether it was allowed and started asking whether we could explain the alternative. When the alternative is telling operational commanders we’re waiting for a policy working group while adversaries are moving at machine speed, that’s not risk management, that’s risk acceptance of a different kind.”

Commercial implications rippled outward faster than the deployment itself. Anthropic had positioned itself as the enterprise-safe choice, the model for institutions that needed explainability and control. Microsoft’s partnership with OpenAI had frozen out customers who couldn’t accept the same infrastructure serving consumer chatbots and classified analytics. Google’s models carried the baggage of its advertising business and its uncertain commitment to any product line. Anthropic occupied a narrow gap: technically competitive, architecturally transparent, commercially independent enough to sign the liability frameworks that government AI adoption requires.

What Changed in Building 2

The analysts notice first. Questions that took forty minutes of database queries now take forty seconds of conversation. The shift isn’t just speed—speed optimization could come from caching and indexing. The shift is that questions themselves become exploratory. An analyst who once needed to know what to search for can now describe what feels wrong about a pattern and let the model propose hypotheses. Some of those hypotheses are brilliant. Some are nonsense. All of them are documented, logged, and traceable in ways that let human judgment operate at a higher level of abstraction.

Mistakes happen differently. A keyword system misses things: false negatives that disappear into the noise. A language model hallucinates things: false positives that demand investigation before they’re revealed as pattern-matching errors. The error profile trades silent failure for noisy failure. Security culture prefers noisy failure—at least you know something broke. What nobody anticipated was how much analyst time would shift from finding needles in haystacks to explaining to Claude why certain needles weren’t interesting. The model’s eagerness to find patterns exceeded its ability to weight significance. Training humans to prompt effectively became as critical as training the model itself.

Researchers watching from outside see something different. NSA’s deployment validates a technical bet Anthropic made two years ago: that constitutional approaches would matter more than raw performance. Turns out institutions with legal departments and classification authorities and congressional oversight care deeply about explainability. They care less about whether a model scores three points higher on MMLU. Palantir’s stock moved on rumors of the deployment—not because Palantir had any involvement, but because investors finally understood that government AI adoption would favor architectures built for auditability over architectures built for benchmarks.

The Blacklist That Never Was

Calling it a blacklist was always shorthand. What existed was an informal agreement among procurement officers to avoid anything that might trigger the April 2023 memo from the Office of Management and Budget about AI systems in sensitive contexts. That memo mentioned no vendors by name. It established principles: transparency in training data, auditability in decision processes, American ownership of critical infrastructure. Anthropic technically satisfied all three, but the company’s association with effective altruism and AI safety activism made risk-averse bureaucrats nervous. Nervousness isn’t policy, but in government procurement, nervousness functions as policy until someone senior enough decides it shouldn’t.

That decision landed in mid-February, after a deputy director asked why the agency was optimizing for optics over operations. Nobody had a good answer. The blacklist dissolved not through formal reversal but through operational defiance that senior leadership chose not to see. Plausible deniability works in both directions: commanders could claim they were using approved analytical tools, leadership could claim they weren’t micromanaging software choices, and the legal opinion supporting the deployment could remain classified for the next twenty-five years.

Practitioners should watch what happens in the next six months more than what happened in the last six. NSA’s move gives cover to the Defense Intelligence Agency, the National Geospatial-Intelligence Agency, and every other three-letter organization that’s been running quiet pilot programs while waiting for someone else to test the policy boundaries. Government AI adoption accelerates not through top-down mandates but through middle-layer imitation. One agency’s deployment becomes three agencies’ proof of concept becomes nine agencies’ standard practice. Anthropic’s challenge now is scaling support for customers who can’t tweet about successful implementations, who need security clearances for bug reports, who measure latency in lives rather than milliseconds.

Whose Margin Expands

Follow the money after the policy shift settles. Anthropic captures margin not from NSA’s direct payments—government contracts are rarely profitable at scale—but from the cleared workforce that learns to build on Claude and carries those preferences to defense contractors. Booz Allen Hamilton, Leidos, SAIC, the integrators who turn agency requirements into deployed systems—they’re staffed by former agency personnel who default to tools they already know. An analyst who spent two years prompting Claude doesn’t want to learn a new model when they transition to the private sector. Lock-in happens through human capital, not contract terms.

OpenAI’s loss here is subtler than it appears. GPT-4 remains technically superior on many benchmarks. But government AI adoption isn’t a technical competition—it’s a trust competition wrapped in a procurement process designed to minimize individual accountability. Anthropic wins not by being better but by being explainable in the specific way that lets GS-14s approve purchases without fear. Microsoft’s tight integration with OpenAI, once an advantage, becomes a liability when customers need air-gapped deployments and dedicated instances. The cloud hyperscalers’ business model assumes shared infrastructure. National security assumes the opposite.

Google watches this play out with institutional whiplash. The company pioneered transformer architectures, trained models that still define state-of-the-art in academic contexts, and then ceded the government market through a combination of employee activism and executive caution. Gemini’s technical capabilities exceed Claude’s in specific domains. But technical capability stopped being the binding constraint eighteen months ago. The binding constraint is willingness to sign liability frameworks that don’t cap damages, to accept security requirements that void standard terms of service, to support customers who can’t publicly acknowledge the relationship. Anthropic’s advantage is structural, not technical.

FetchLogic Take

Within fourteen months, at least five major intelligence agencies will formally standardize on Claude or a direct competitor—not through coordinated policy but through cascading operational adoption that outpaces procurement reform. The real signal will be when OMB’s AI guidelines, expected by year-end, retroactively legitimize deployments that are already processing classified data. Policy will document reality rather than shape it. Watch for Anthropic to announce a dedicated government cloud offering by Q3 2024, probably through a partnership with a defense-cleared infrastructure provider. The company’s current trajectory gives it eighteen months of runway before OpenAI credibly matches its auditability features or Google overcomes its institutional hesitation. That window determines whether Anthropic becomes the default government AI layer or just the vendor who moved first. Momentum matters more than technology in markets where switching costs include congressional testimony.

