
The Evel Knievel Problem: Why the AI Race Demands Multi-Model Verification

Three companies control 88% of enterprise AI. Their breakneck release pace — 14 frontier models in 12 months — means no single model stays reliable for long.

Brian R. Demsey

February 2026 | 10 min read


Evel Knievel clears 14 Greyhound buses at Kings Island, October 25, 1975. Photo: Cincinnati Enquirer

Fourteen Buses and Fourteen Models

On October 25, 1975, Evel Knievel pointed his Harley-Davidson XR-750 at a row of fourteen Greyhound buses in an Ohio parking lot, hit 95 miles per hour, and sailed 133 feet through cold, drizzly air. Half the nation watched on ABC. It was his longest successful jump — a record that stood for 24 years — and he barely made it, his rear wheel clipping the roof of the final bus.

I think about that jump a lot these days. Not because I'm nostalgic for the 1970s, though at 83 I've earned the right to be. I think about it because the artificial intelligence industry in 2025 looked a lot like Knievel lining up for that fourteenth bus: moving at tremendous speed, adding obstacles faster than anyone can safely clear them, and hoping the landing gear holds.

Consider the numbers. Between January 2025 and February 2026, OpenAI, Anthropic, and Google collectively released at least twenty major model versions. GPT-4.5 in February. Gemini 2.5 Pro in March. Claude Sonnet 4 and Opus 4 in May. GPT-5 in August. Claude Sonnet 4.5 in September. Gemini 3 Pro in November. Claude Opus 4.5 a week later. GPT-5.2 two weeks after that. Gemini 3 Flash five days later. And those are just the headline releases — each company shipped dozens of incremental updates, reasoning variants, and specialized models alongside them.

The pace is exhilarating. It is also, I believe, the single greatest argument for why no enterprise, no government agency, and no individual should trust any single AI model to deliver consistently reliable information.


The 88% Problem

According to Menlo Ventures' 2025 State of Generative AI report, OpenAI, Anthropic, and Google together command 88% of enterprise LLM API usage. The remaining 12% is scattered among Meta's Llama, Cohere, Mistral, and a long tail of smaller providers. Enterprise spending on generative AI hit $37 billion in 2025, up 3.2 times from the prior year.

But here's what most people miss: the distribution of that 88% shifted radically. In 2023, OpenAI held 50% of enterprise LLM spend. By 2025, it had fallen to 27%. Anthropic surged from 12% to 40%. Google tripled from 7% to 21%. In the coding market specifically, Anthropic now commands 54% share.

These aren't gentle market rotations. These are tectonic shifts happening in months, not years. And they reflect something important: each new model release doesn't just add capability — it reshuffles which model is best at what. GPT-5.2 excels at spreadsheets and presentations. Claude Opus 4.5 leads in agentic coding. Gemini 3 Pro tops reasoning benchmarks. No single model wins everywhere, and the leader in any given category changes with every release cycle.

For anyone relying on AI-generated information — which increasingly means everyone — this creates a fundamental reliability problem. A model that was the most accurate last month may not be the most accurate this month. A model that handles medical questions brilliantly might hallucinate freely about legal precedents. And you won't know which failure mode you're hitting until the damage is done.


What Fifty Years of Risk Assessment Taught Me

I came to this problem not from Silicon Valley venture circles but from an actuarial background. I spent decades quantifying risk for insurance companies and benefits platforms, including building RemoteNet Corporation's unified benefits system for Fortune 100 companies, most notably Northrop Grumman. Actuaries are professional skeptics. We don't trust single data points. We triangulate. We cross-reference. We build redundancy into every calculation because we know that any individual estimate, no matter how sophisticated, carries embedded error.

When I first encountered ChatGPT in late 2023, my actuarial instincts fired immediately. Here was a system that delivered answers with extraordinary confidence and no disclosure of its margin of error. It didn't say "I'm 73% sure about this." It said "Here's the answer," and sometimes the answer was completely fabricated. I asked it about my own career history and it invented companies I'd never worked for, awards I'd never received, and achievements that never happened — all presented with the same authoritative tone as its accurate responses.

That experience led me to build Hallucinations.cloud and the H-LLM Multi-Model platform. The core idea is simple, borrowed directly from actuarial science: if you want to know whether an answer is reliable, don't ask one oracle. Ask eight.


The Multi-Model Thesis

The H-LLM platform simultaneously queries eight AI models with the same prompt and compares their responses. When models agree, confidence is high. When they diverge, the system flags the discrepancy and generates an H-Score — a reliability rating that tells users how much trust to place in any given answer.
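The actual H-Score algorithm isn't public, but the consensus idea behind it can be sketched in a few lines. The following Python is a toy illustration only — the model names, the normalization, and the pairwise-agreement scoring are all my simplifying assumptions, not H-LLM's implementation:

```python
from itertools import combinations

def agreement_score(responses: dict[str, str]) -> float:
    """Toy reliability score: the fraction of model pairs whose
    normalized answers match. Illustrates the consensus idea only;
    the real H-Score is more sophisticated than string equality."""
    answers = [r.strip().lower() for r in responses.values()]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs)

# Hypothetical fan-out result: six of eight models agree, two diverge.
responses = {
    "model_a": "Paris", "model_b": "Paris", "model_c": "Paris",
    "model_d": "Paris", "model_e": "Paris", "model_f": "Paris",
    "model_g": "Lyon",  "model_h": "Marseille",
}
score = agreement_score(responses)  # 15 agreeing pairs out of 28
```

A score near 1.0 would signal high consensus; a score like this one (roughly 0.54) would flag the answer for the kind of discrepancy review the platform performs.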

This approach works precisely because of the competitive dynamics described above. Because OpenAI, Anthropic, and Google train their models on different data, with different architectures, different safety frameworks, and different optimization targets, their failure modes are largely independent. When Claude hallucinates about a legal case, GPT might get it right. When Gemini invents a statistic, Claude might flag the correct number. The errors don't correlate in the way they would if you simply asked the same model twice.
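Under the (admittedly idealized) assumption of fully independent errors, the benefit compounds fast. A back-of-the-envelope calculation, with an assumed per-model hallucination rate of 10% on a given claim:

```python
# If each model independently produces the SAME wrong claim with
# probability p, the chance that k models all agree on that wrong
# claim is p**k. Full independence is an idealization; real models
# share web-scale training data, so some errors do correlate.
p = 0.10  # assumed per-model rate of a specific hallucination
for k in (1, 2, 4, 8):
    print(f"{k} models unanimously wrong: {p**k:.1e}")
```

Two independent models agreeing on the same fabrication drops to a 1-in-100 event, and eight to roughly 1-in-100-million — which is why correlated training data, not model count, is the real limit on this approach.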

The November 2025 release cluster proved this thesis spectacularly. Within six days, Google launched Gemini 3 Pro (November 18), Anthropic launched Claude Opus 4.5 (November 24), and each claimed superiority on different benchmarks. When we ran identical prompts through both new models plus GPT-5.1, the agreement rate on factual claims was roughly 84% — meaning 16% of responses contained claims that at least one other frontier model contradicted. That 16% is where hallucinations live, and without multi-model verification, users would have no way to identify them.


The Knievel Parallel

Evel Knievel's career offers a surprisingly apt metaphor for where AI stands today. His jumps got longer and more spectacular over time, but his safety engineering didn't keep pace. He relied on speed, courage, and a prayer that the physics would work out. When it didn't — at Caesars Palace in 1967, at Wembley in 1975 — the crashes were catastrophic.

The AI industry is adding buses at an astonishing rate. Fourteen frontier releases in twelve months. Each one faster, more capable, more impressive than the last. But the safety infrastructure — the mechanisms for verifying whether these models are actually telling the truth — hasn't scaled at the same pace.

Knievel's longest successful jump covered 133 feet. The current motorcycle ramp record, set by Australia's Robbie Maddison in 2008, is 351 feet — more than 2.5 times the distance. The difference isn't just courage. It's engineering. Better ramps, better suspension, better landing systems, better understanding of the physics.

AI needs its Maddison moment. The models themselves are improving at a breathtaking pace. What's missing is the verification infrastructure — the equivalent of a properly engineered landing ramp — that lets organizations deploy these models with confidence that the information they produce is actually reliable.


What Comes Next

I'm 83 years old. I've founded companies, paddled solo across ocean channels, run 22 marathons, and written more than 30,000 lines of production code for the H-LLM platform — much of it with Claude as my development partner. I say this not to boast but to underscore a principle I've lived by: don't lament, engage.

The AI arms race isn't slowing down. If anything, the pace is accelerating. OpenAI declared a "code red" after Gemini 3's launch and rushed GPT-5.2 to market. Anthropic responded with Opus 4.6 in February 2026. Google has Gemini 3 Ultra in the pipeline. Each company is adding buses to the row.

The question isn't whether these models will get more powerful. They will. The question is whether we'll build the verification systems that let society use that power safely. Multi-model verification isn't the only answer, but it's an answer that works today, with existing technology, at a cost that enterprises can absorb.

Evel Knievel's back wheel clipped that fourteenth bus. He held on, kept the bike upright, and rolled to a stop. The crowd went wild. But watching the replay, you can see how close it was to disaster.

The AI industry is clipping the fourteenth bus right now. The question is whether we land it.

Brian R. Demsey is the founder and CEO of Hallucinations.cloud LLC, an AI safety company focused on detecting misinformation through multi-model verification. He previously founded RemoteNet Corporation and has over 50 years of experience in enterprise technology.

See Multi-Model Verification in Action

Experience how H-LLM compares responses across eight AI models to surface discrepancies and build confidence in AI outputs.

Explore the Working Model