New ARC-AGI-2 Test Exposes AI’s Intelligence Limits
The ARC-AGI-2 test, developed by the Arc Prize Foundation, reveals a stark gap between human reasoning and current AI models. Top systems from OpenAI and Google struggle with this new benchmark of general intelligence.
When it comes to evaluating artificial intelligence, accuracy alone no longer cuts it. A new benchmark from the Arc Prize Foundation is challenging everything we think we know about machine intelligence—and revealing just how far today’s top AI models still have to go.
This week, the nonprofit foundation co-founded by AI researcher François Chollet unveiled ARC-AGI-2, a significantly more rigorous successor to its earlier intelligence test, ARC-AGI-1. Designed to probe general intelligence, the new test pushes AI models beyond memorization and brute-force pattern matching and instead requires adaptive reasoning—something even the most powerful systems have failed to demonstrate effectively.
So far, the results are startling.
Leading AI Models Score Shockingly Low
Despite their dominance in industry benchmarks and commercial applications, leading AI models like OpenAI’s o1-pro, DeepSeek’s R1, and Anthropic’s Claude 3.7 Sonnet have performed dismally on ARC-AGI-2. According to the official Arc Prize leaderboard, most scored around 1% to 1.3%, indicating that these models struggle to apply logic and reasoning in unfamiliar scenarios.
For context, the original ARC-AGI test was recently conquered by OpenAI’s advanced reasoning model o3, which hit 75.7%. ARC-AGI-2 is a different beast entirely: when the same model was run on the new test, it scored just 4%, even while spending roughly $200 in compute per task.
By contrast, human panels averaged a whopping 60% on the same test, underscoring the critical gap between human cognition and current machine learning capabilities.
ARC-AGI-2: Designed for Adaptability, Not Memorization
Unlike earlier AI benchmarks that often favored models with massive training datasets and compute power, ARC-AGI-2 aims to assess a more nuanced skill: intelligence efficiency. That means not just whether an AI can solve a problem, but how quickly and cost-effectively it can learn and adapt.
The test features a collection of visual puzzles, essentially abstract grids filled with multi-colored squares, that require the model to recognize and extrapolate patterns in real time. It’s a setup that prevents models from leaning on pre-learned patterns or “brute force” methods. Instead, they must reason through the task on the fly.
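To make the format concrete, here is a minimal, hypothetical sketch in Python of what an ARC-style task looks like as data: a handful of input/output grid pairs that share a hidden transformation rule, plus a test input the solver must complete. The grids, colors, and rule below are invented for illustration and are not drawn from ARC-AGI-2 itself.

```python
# Illustrative only: a toy ARC-style task, not an actual ARC-AGI-2 puzzle.
# Grids are small 2-D arrays of integers, where each integer stands for a color.

train_pairs = [
    # The hidden rule in this toy task: keep zeros, recolor every other cell to color 2.
    {"input": [[0, 1], [3, 0]], "output": [[0, 2], [2, 0]]},
    {"input": [[5, 0], [0, 7]], "output": [[2, 0], [0, 2]]},
]
test_input = [[0, 4], [6, 0]]


def apply_rule(grid):
    """Apply the inferred rule: zeros stay zero, every other cell becomes color 2."""
    return [[0 if cell == 0 else 2 for cell in row] for row in grid]


# Sanity check: the inferred rule must reproduce every training output exactly.
assert all(apply_rule(pair["input"]) == pair["output"] for pair in train_pairs)

# A solver is scored on whether its predicted grid matches the hidden answer exactly.
print(apply_rule(test_input))  # [[0, 2], [2, 0]]
```

The point of the format is that the rule is never stated: a solver sees only the example pairs and must infer and apply the transformation itself, which is exactly the kind of on-the-fly generalization the benchmark is probing.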
“In ARC-AGI-2, you can’t rely on memorization,” said François Chollet in a recent post. “The model has to generalize from principles, just like a human would when solving a novel problem.”
A Shift Toward Measuring AI’s True Cognitive Ability
The Arc Prize Foundation is not just issuing a new test—it’s issuing a challenge to the AI community. Alongside ARC-AGI-2, it has launched the Arc Prize 2025, a high-stakes competition calling on developers to build a model that can score at least 85% on ARC-AGI-2 while keeping costs to roughly $0.42 per task.
This target is as ambitious as it is symbolic. It represents a vision of general intelligence that is not just capable, but efficient—a key distinction that has often been lost in the arms race of larger models and higher compute budgets.
“The efficiency with which those capabilities are acquired and deployed is a crucial, defining component,” said Arc Prize co-founder Greg Kamradt. “The core question being asked is not just, ‘Can AI solve the task?’ but also, ‘At what cost and effort?’”
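To put that target in perspective, here is a rough back-of-the-envelope comparison, using only the approximate figures cited in this article, of how far today’s best-known result sits from the competition’s bar:

```python
# Back-of-the-envelope comparison using the approximate figures cited in this article.
o3_cost_per_task = 200.00    # reported compute spend for o3 on an ARC-AGI-2 task
target_cost_per_task = 0.42  # Arc Prize 2025 efficiency target per task

o3_score = 0.04              # o3's reported ARC-AGI-2 score
target_score = 0.85          # Arc Prize 2025 accuracy target

print(f"Cost must fall by roughly {o3_cost_per_task / target_cost_per_task:.0f}x")  # ~476x
print(f"Accuracy must rise by roughly {target_score / o3_score:.0f}x")              # ~21x
```

In other words, entrants are being asked to deliver a score more than twenty times higher at a few tenths of a percent of the reported cost.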
Experts Call for New AI Benchmarks
Chollet’s concerns echo a broader industry sentiment. As AI models have become more capable at mimicking human output—especially in text and code—there’s growing worry that traditional benchmarks no longer capture what truly matters: intelligence, adaptability, and creativity.
Earlier this year, Thomas Wolf, co-founder of open-source AI platform Hugging Face, told TechCrunch that the AI field is “desperately lacking tests that measure the actual traits of general intelligence.” In his view, existing tests fail to challenge models in open-ended or abstract thinking—abilities central to human cognition.
ARC-AGI-2 may be a step toward filling that gap. By prioritizing adaptability and penalizing inefficiency, the test represents a philosophical shift from performance to principled evaluation.
Why Most AI Models Are Failing
The glaring weakness exposed by ARC-AGI-2 isn’t just about performance—it’s about process. Modern AI models are built on pattern recognition at scale, powered by massive datasets and extensive fine-tuning. But this architecture has blind spots.
Even models like GPT-4.5, Gemini 2.0 Flash, and Claude 3.7 Sonnet, which dominate tasks like summarization and code generation, fail when asked to reason from first principles in unfamiliar territory. ARC-AGI-2 highlights that intelligence isn’t just stored data; it’s flexible thinking.
This distinction matters, especially as companies pursue AGI (artificial general intelligence), the hypothetical point where machines can think and learn like humans. If current models barely register on a test where human panels average 60%, it raises serious questions about how close we really are.
What’s Next for AI Evaluation?
As the hype around generative AI begins to collide with practical limitations, the industry may need to rethink what progress looks like. Bigger models aren’t always better—especially if they can’t adapt without enormous compute budgets.
The Arc Prize Foundation’s focus on efficiency as a core metric reflects a broader sustainability issue in AI development. A model like o3 (low) can deliver high performance, but at roughly $200 per task it is far from scalable or practical.
That’s why ARC-AGI-2 matters. It’s not just a test of intelligence, but a test of whether intelligence can be made accessible, efficient, and real.
The Real Test of Intelligence Has Just Begun
ARC-AGI-2 is more than a benchmark—it’s a wake-up call. While AI has made astonishing strides in the past decade, this new test suggests we’re still in the early stages of building machines that can reason, adapt, and think in the ways we take for granted.
For now, humans remain the reigning champions of general intelligence. But the Arc Prize 2025 challenge is open, and the race to reach 85% accuracy at a fraction of today’s cost is on. The question isn’t just who will win, but whether the next generation of AI will learn to think smarter, not just bigger.
(Disclaimer: This article is for informational purposes only and reflects publicly available statements and research. It does not constitute an endorsement of any AI product, model, or benchmark.)