Apple Researchers Study AI’s Mathematical Reasoning Abilities: Key Findings


Apple researchers studied the mathematical reasoning capabilities of large language models (LLMs) and found that these models rely on probabilistic pattern-matching rather than true logical reasoning. The study revealed that LLMs show significant variability when responding to different versions of the same question and struggle with complex reasoning tasks, particularly as the number of steps or tokens increases. The research highlights the limitations of current LLMs in handling formal reasoning, suggesting their performance declines as question complexity grows.


Apple researchers have explored the reasoning capabilities of large language models (LLMs), particularly in the context of mathematics. Their study set out to assess how reliably existing benchmarks measure those capabilities, and it revealed that LLMs show significant variability in their responses to different versions of the same question.

Motivation for the Study

The team questioned whether the mathematical reasoning abilities of LLMs had genuinely advanced, prompting a comprehensive study spanning several state-of-the-art open and closed models.

Study Results

The findings indicate that LLMs rely on probabilistic pattern-matching rather than formal reasoning. In their paper, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” the researchers noted, “LLMs exhibit noticeable variance when responding to different instantiations of the same question.” They also observed that performance declines when only the numerical values in a question are changed. The team measured this with GSM-Symbolic, a benchmark they built by turning questions from GSM8K, the dataset commonly used to evaluate mathematical reasoning on grade-school-level questions, into templates whose names and numbers can be varied.
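To illustrate the idea behind such templated variants, here is a minimal Python sketch. It is not the researchers' tooling; the template text, names, and number ranges are invented for illustration, but it shows how many logically identical instantiations of one grade-school question can be generated and paired with their ground-truth answers.

```python
import random

# Hypothetical illustration of the GSM-Symbolic idea (not the paper's code):
# a grade-school word problem becomes a template whose names and numbers are
# variables, so many logically identical variants can be generated.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "How many apples does {name} have in total?"
)

def make_variants(n, seed=0):
    """Generate n instantiations of the template with different names and numbers."""
    rng = random.Random(seed)
    names = ["Sophie", "Liam", "Maya", "Omar"]
    variants = []
    for _ in range(n):
        x, y = rng.randint(2, 50), rng.randint(2, 50)
        question = TEMPLATE.format(name=rng.choice(names), x=x, y=y)
        variants.append((question, x + y))  # ground-truth answer travels with the question
    return variants

for question, answer in make_variants(3):
    print(question, "->", answer)
```

Comparing a model's accuracy across variants like these, which differ only in surface details, is what exposes the variance the paper reports: a system performing genuine arithmetic reasoning should score the same on every instantiation.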

Limitations of LLMs in Reasoning

While LLMs can emulate certain abstract reasoning patterns, they fall short of genuine logical reasoning. The researchers pointed out that, in tasks requiring the accurate selection of multiple tokens, the likelihood of producing a correct answer decreases exponentially with the number of tokens or steps involved, highlighting their unreliability in complex reasoning situations.
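To make that compounding effect concrete, the sketch below assumes each reasoning step (or token choice) succeeds independently with a fixed probability p; the function and the 0.95 figure are illustrative assumptions, not numbers from the paper.

```python
# Assumed model of compounding error: if every step in an n-step chain must be
# produced correctly with independent probability p, the chance that the whole
# chain is correct is roughly p ** n, which shrinks exponentially with n.
def chain_success_probability(p_per_step: float, n_steps: int) -> float:
    return p_per_step ** n_steps

for n in (5, 10, 20, 40):
    print(f"p=0.95 per step, {n} steps -> {chain_success_probability(0.95, n):.3f}")
# With p=0.95, a 40-step chain succeeds only about 13% of the time,
# even though each individual step is quite reliable.
```

Under these simplifying assumptions, even a 95% per-step success rate leaves long reasoning chains failing most of the time, which mirrors the paper's observation that reliability drops sharply as the number of tokens or steps grows.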

The study also examined the fragility of mathematical reasoning in these models, demonstrating a significant decline in performance as the complexity of the questions increased. The researchers hypothesized that this deterioration occurs because current LLMs do not engage in true logical reasoning; instead, they attempt to mimic the reasoning steps present in their training data.
