OpenAI’s New AIs Are Smarter—But Less Trustworthy
OpenAI’s o3 and o4-mini AI models show advanced reasoning skills but hallucinate more than older models. Here’s why accuracy still lags.
OpenAI’s latest creations, the o3 and o4-mini models, are redefining what artificial intelligence can do in reasoning tasks. Yet despite—or perhaps because of—their sharp cognitive leaps, these advanced models are making a startling number of mistakes. They’re hallucinating more than ever, and even OpenAI isn’t sure why.
For a company that’s pushed AI toward precision and reliability, the rise in hallucinations—a term used when models generate false or fabricated information—marks a concerning twist in the story of machine intelligence.
Smarter Tech, Bigger Mistakes
Unlike their predecessors, the o3 and o4-mini models were built to think more like humans—to reason, analyze, and synthesize. But with that higher-order thinking has come an unexpected downside: increased falsehoods. According to OpenAI’s internal testing, o3 hallucinated answers to 33% of questions in PersonQA, a benchmark designed to test factual knowledge about people.
That’s more than double the hallucination rate of its earlier models, o1 and o3-mini, which came in at 16% and 14.8%, respectively. Even more concerning, the o4-mini model performed worse—hallucinating 48% of the time. These results flip the conventional expectation that newer models should be more accurate.
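For a sense of what a number like 33% measures, here is a toy sketch of how a per-question hallucination rate could be tallied. The grading rule (a confident but incorrect answer counts as a hallucination, an abstention does not) and the sample data are illustrative assumptions, not OpenAI’s actual PersonQA methodology.

```python
# Toy illustration: tallying a hallucination rate over a Q&A benchmark.
# The grading rule below is an assumption for illustration only; it is not
# OpenAI's actual PersonQA scoring methodology.

ABSTAIN = "i don't know"

def hallucination_rate(responses: list[dict]) -> float:
    """Fraction of questions answered confidently but incorrectly.

    Each item is expected to look like:
        {"question": ..., "gold": "correct answer", "model": "model answer"}
    """
    hallucinated = 0
    for item in responses:
        answer = item["model"].strip().lower()
        if answer == ABSTAIN:
            continue  # declining to answer is not counted as a hallucination
        if answer != item["gold"].strip().lower():
            hallucinated += 1
    return hallucinated / len(responses)

# Example: one wrong answer out of three questions -> ~33% hallucination rate.
sample = [
    {"question": "Q1", "gold": "paris", "model": "paris"},
    {"question": "Q2", "gold": "1852", "model": "1852"},
    {"question": "Q3", "gold": "blue", "model": "green"},
]
print(f"{hallucination_rate(sample):.0%}")  # prints 33%
```

Real evaluations use far more nuanced graders, but the arithmetic is the same: hallucinated answers divided by questions asked.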
Why Are These Models Losing Their Grip on Truth?
The paradox lies in how these models operate. Because they engage in more complex reasoning, they venture more factual claims, and more claims mean more chances for things to go wrong. OpenAI’s technical report suggests the models make both more accurate and more inaccurate statements simply because they “claim more overall”: a model that asserts twice as much can be right more often and wrong more often at the same time.
Third-party researchers are also sounding the alarm. At Transluce, a nonprofit AI research group, scientists discovered that o3 not only hallucinates facts but even invents parts of its own decision-making process. In one case, the model falsely claimed to have run code on a 2021 MacBook Pro—something it simply cannot do.
Neil Chowdhury, a researcher at Transluce and former OpenAI engineer, believes the issue may stem from the reinforcement learning techniques used to train the o-series models. “These methods can unintentionally amplify hallucinations that would otherwise be smoothed out by traditional post-training techniques,” he explained.
The Business Case Against Hallucination
While occasional AI misfires may seem harmless in casual use, they raise serious concerns in professional settings. Kian Katanforoosh, a Stanford professor and CEO of the AI-skilling startup Workera, has been testing o3 in real-world coding environments. He’s impressed with its performance—but notes it often provides broken or nonexistent web links, a problem that can’t be ignored in critical workflows.
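As a purely illustrative sketch, and not a workflow Katanforoosh or Workera describe, a team that relies on model-suggested links can at least verify them before they reach users. The check below uses only the Python standard library.

```python
# Illustrative guardrail: verify that links suggested by a model actually resolve
# before passing them downstream. Generic pattern only, not a described workflow.
import urllib.request
from urllib.error import HTTPError, URLError

def link_is_live(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL responds with a non-error HTTP status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (HTTPError, URLError, ValueError):
        return False

suggested_links = [
    "https://docs.python.org/3/library/urllib.request.html",  # real page
    "https://example.com/this-page-does-not-exist-12345",     # likely broken
]
for url in suggested_links:
    status = "ok" if link_is_live(url) else "broken or unreachable"
    print(f"{url}: {status}")
```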
In industries like healthcare, law, and finance—where facts aren’t negotiable—these hallucinations could lead to costly errors or legal liabilities. Even one fabricated clause in a legal document or a fictitious citation in a medical report could have damaging consequences.
Is Real-Time Web Search the Answer?
One potential fix lies in grounding AI responses in real-time information. With web search enabled, OpenAI’s GPT-4o scores 90% accuracy on the SimpleQA benchmark. By cross-referencing real-world data, models can better fact-check themselves before answering.
However, this solution comes with trade-offs—mainly privacy. Allowing models to access external web searches may expose user prompts to third parties, a red flag for security-conscious organizations.
Still, this hybrid approach may offer a way to bridge the gap between reasoning power and reliable output.
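A minimal sketch of that grounding idea, written against the OpenAI Python SDK’s chat completions call: the retrieval step, search_web, is a hypothetical placeholder for whatever search service an organization is willing to expose prompts to, and the model name and prompt wording are illustrative assumptions rather than anything the article specifies.

```python
# Minimal sketch of grounding a model's answer in retrieved, real-world text.
# `search_web` is a hypothetical placeholder for a real search/retrieval service;
# the model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str) -> str:
    """Placeholder: return text snippets from a search or retrieval backend."""
    raise NotImplementedError("wire this to your organization's search service")

def grounded_answer(question: str) -> str:
    snippets = search_web(question)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the provided sources. "
                    "If the sources do not contain the answer, say you don't know."
                ),
            },
            {"role": "user", "content": f"Sources:\n{snippets}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The trade-off the article describes is visible in the sketch: every prompt passes through the retrieval backend, which is exactly the exposure that worries security-conscious organizations.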
Redefining Progress in the Age of Reasoning AI
The recent shift in the AI industry toward reasoning-based models represents a strategic pivot. Traditional scaling methods were hitting a wall, demanding massive computing power for marginal gains. Reasoning models like o3 and o4-mini offer a new frontier: they improve performance across a wide range of tasks without requiring ever-larger training runs, working through problems step by step instead.
But with those strengths come new vulnerabilities.
Sarah Schwettmann, co-founder of Transluce, warns that the hallucination rates in these models may ultimately curb their real-world utility. “If a model can’t be trusted to tell the truth, its intelligence becomes a liability,” she said.
OpenAI acknowledges the problem and says reducing hallucinations remains a core research priority. As spokesperson Niko Felix put it, “We’re continually working to improve the accuracy and reliability of all our models.”
Conclusion: Intelligence Without Integrity Is Not Progress
The surge in reasoning power seen in o3 and o4-mini showcases just how far AI has come. But it also reveals how far it still has to go. Intelligence alone isn’t enough—trust is the new frontier. As OpenAI and others race toward smarter, faster, more capable AIs, they must confront an inconvenient truth: brilliance without reliability is just noise dressed up as genius.
Fixing hallucinations won’t just improve performance; it will restore the trust needed for AI to truly integrate into everyday life. Until then, we’ll need to look at each new model with both admiration and a healthy dose of skepticism.
Disclaimer:
This article reflects public findings and expert opinions on AI model performance as of publication. Given the evolving nature of artificial intelligence, behaviors and capabilities may shift over time. Always validate AI-generated content before use in critical applications.
Source: India Today