How to improve the understanding of numbers in language models?

Hello, this is Yulia Rogozina, a business analyst at Sherpa Robotics. Today I am sharing my translation of an article about the shortcomings of language models when it comes to calculations, and about how researchers keep improving the way models handle even the simplest problems.

The mathematical and logical abilities of large language models (LLMs) are currently quite impressive: they can solve problems at graduate-student level and beyond, such as olympiad problems, GAOKAO national exam questions, and university-level mathematics. However, a closer look at the models' outputs shows that, despite their remarkable skill in devising solution approaches, they often struggle with the basic understanding and processing of numbers.

Like a careless student who claims, "I know how to do it, but I couldn't get it right."

Some of these errors are quite surprising, such as believing that 9.11 > 9.9, or failing at a simple addition like 8/7 + 3/5. These errors are the main cause of hallucinations in mathematical, logical, and analytical tasks: the model lays out a seemingly correct approach to the problem but ends up with a wrong result. Researching and improving the fundamental "numerical understanding and processing abilities" (NUPA) of models is therefore crucial.
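For reference, here is a quick sanity check of the two examples above in Python; exact decimal and fraction types make the correct answers unambiguous.

```python
from decimal import Decimal
from fractions import Fraction

# 9.11 is actually smaller than 9.9, even though "11" looks bigger than "9"
print(Decimal("9.11") > Decimal("9.9"))   # False

# Exact fraction addition: 8/7 + 3/5 = 40/35 + 21/35 = 61/35
print(Fraction(8, 7) + Fraction(3, 5))    # 61/35
```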

However, in modern research the ability to reason and NUPA are usually tested together, as in classic datasets such as GSM8k, MATH, and MMLU, as well as in the more complex tests mentioned above. For example, a GSM8k problem reads: "Natalia sold 48 sets of tracksuits in April, and then sold half as many sets in May. How many sets did Natalia sell in total in April and May?" Solving it requires two things. On the one hand, mathematical reasoning: understanding the text, extracting the necessary information, formulating equations (or finding another solution method), solving the equations or executing an algorithm, and obtaining the result. On the other hand, it also requires understanding and processing the numbers given in the problem or produced as intermediate results at each step, such as 48/2 = 24 and 48 + 24 = 72. Although both abilities are necessary to solve the problem correctly, benchmarks built on such datasets do not distinguish between them.
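A toy illustration (not taken from the paper) of this split: the "reasoning" part is the symbolic plan extracted from the text, while the "number processing" part is the arithmetic executed on the extracted values.

```python
# Symbolic plan for the GSM8k example above: what to compute and in what order.
plan = [
    ("may",   lambda v: v["april"] // 2),        # "half as many sets in May"
    ("total", lambda v: v["april"] + v["may"]),  # "how many in total"
]

# Number processing: executing the plan on the extracted value 48.
values = {"april": 48}
for name, step in plan:
    values[name] = step(values)   # 48 // 2 = 24, then 48 + 24 = 72

print(values["total"])  # 72
```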

A more serious problem is that the numerical content of these datasets is often deliberately simplified. In exam questions, the numbers in both the problems and the answers are frequently chosen to be integers so that the assessment focuses on students' understanding of mathematical concepts, such as setting up correct equations and applying appropriate theorems. In real-world scenarios, however, this is not the case.

Despite the importance of NUPA, there is still no precise, detailed, and comprehensive formalization, measurement, and analysis of this fundamental ability. In this paper, the authors take a first step towards formalizing NUPA in LLMs. They classified the numerical concepts and operations usually taught in elementary and middle school into four types of numerical representation: integers, floating-point numbers (finite decimal fractions), fractions, and exponential (scientific) notation, and into four categories of abilities comprising 17 tasks. Combining representations and tasks yields 41 meaningful task variants, which form the NUPA benchmark (Table 1 in the paper). These representations and tasks cover the most common scenarios of understanding and processing numbers, which usually pose no challenge for humans, since we read, use, or process such numbers almost every day.
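As a rough sketch (not the authors' actual generator), benchmark items for a single task such as addition could be instantiated across the four representations along these lines; the digit lengths and formatting choices here are placeholders.

```python
import random
from fractions import Fraction

def make_add_item(representation: str, n_digits: int = 4):
    """Generate one 'addition' question/answer pair in the given representation."""
    a, b = (random.randrange(10 ** (n_digits - 1), 10 ** n_digits) for _ in range(2))
    if representation == "integer":
        return f"{a} + {b}", str(a + b)
    if representation == "float":
        x, y = a / 100, b / 100
        return f"{x} + {y}", f"{x + y:.2f}"
    if representation == "fraction":
        fa, fb = Fraction(a, b), Fraction(b, a)
        return f"{fa} + {fb}", str(fa + fb)
    if representation == "scientific":
        return f"{a / 1000:.3f}e3 + {b / 1000:.3f}e3", f"{(a + b) / 1000:.3f}e3"
    raise ValueError(representation)

for rep in ("integer", "float", "fraction", "scientific"):
    question, answer = make_add_item(rep)
    print(f"{rep:>10}: {question} = {answer}")
```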

On this benchmark, the authors carefully tested several advanced LLMs, including GPT-4o, Llama-3.1, and Qwen2, asking the models to output answers directly, without invoking external tools. While the latest LLMs perform well on some of the simplest tasks, their performance drops significantly once the tasks become slightly more complex (e.g., multiplication, modulo operations, or digit-level manipulations) or once the number representation goes beyond simple integers.
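A minimal evaluation loop in this spirit might look like the sketch below; query_model is a hypothetical stand-in for whatever API serves the model, and exact string match is only one possible scoring rule.

```python
def evaluate(items, query_model):
    """Score a model by exact match on direct answers, with no tool calls allowed.

    items: iterable of (question, reference_answer) string pairs.
    query_model: hypothetical callable prompt -> raw text completion.
    """
    correct = 0
    total = 0
    for question, reference in items:
        prompt = f"Compute the result and output only the final answer.\n{question} ="
        prediction = query_model(prompt).strip()
        correct += prediction == reference
        total += 1
    return correct / total
```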

Overall, this unsatisfactory performance highlights a significant gap between the high mathematical reasoning claimed for modern LLMs and their weak practical, everyday ability to understand and process numbers.

To address this issue, the authors explored three categories of approaches to improving the NUPA of models.

The first category aims to improve NUPA at the pre-training stage and includes alternative tokenization, specially designed positional encodings (PE), and modified number formats (e.g., zero-padding, index hints, and reversed digit order). The authors evaluated and analyzed these techniques on the new benchmark, testing where they help and where they do not across tasks and representations, which goes beyond previous evaluations focused mainly on integer addition and multiplication. They also grouped these methods into three mechanisms: simplifying the reasoning process, facilitating digit alignment, and providing regularization, and discussed how these mechanisms could be applied to a broader range of numerical representations.
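Simplified, illustrative versions of the three number-format tricks (not the exact schemes from the papers involved) look roughly like this:

```python
def zero_pad(n: int, width: int = 6) -> str:
    # Fixed-width padding so digits of the same significance line up across numbers.
    return str(n).zfill(width)

def index_hint(n: int) -> str:
    # Tag each digit with a positional marker (here a letter) to help digit alignment.
    return "".join(f"{letter}{digit}" for letter, digit in zip("abcdefghij", str(n)))

def reverse_digits(n: int) -> str:
    # Write the least significant digit first, matching the order in which carries appear.
    return str(n)[::-1]

print(zero_pad(305), index_hint(305), reverse_digits(305))  # 000305 a3b0c5 503
```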

The second category of approaches aims to improve NUPA in an already trained model. The researchers found that while simple direct fine-tuning can significantly improve NUPA, applying the methods above (PE, data formats, and tokenizers) at this stage can be harmful. They tried various fine-tuning settings and configurations, but none matched or surpassed the original model. The results indicate that such modifications significantly disrupt the models' established behavior or conflict with their existing knowledge, leading to degraded performance.

The third category concerns "chain of thought" (CoT) methods for processing numerical information. While CoT allows complex tasks to be broken down into simpler subtasks and significantly raises the chance of a correct answer, its drawbacks, such as consuming a large part of the context window and increasing processing time, are especially pronounced in numerical tasks. The authors tested a general CoT method known as RFFT (rule-following fine-tuning) and found that for harder tasks (those with O(n²) complexity, including multiplication, division, and fraction simplification) chain-of-thought methods run into scalability problems, which makes them hard to apply in practical scenarios; a toy illustration of such a digit-by-digit trace is given after the list below. Note that the paper does not consider tool use for NUPA because:

1) they wanted to study the self-contained NUPA of the LLMs themselves,

2) calling an external tool every time a number is encountered increases output latency,

3) the authors believe that tool-free NUPA is a necessary skill for artificial general intelligence (AGI).
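To make the scalability issue concrete, here is a toy rule-following-style trace for addition (in the spirit of RFFT, but not the paper's exact format): every digit step and carry is spelled out, so the trace already grows with the number length for addition, and the analogous trace for multiplication grows roughly quadratically.

```python
def addition_trace(a: int, b: int) -> str:
    """Spell out column addition digit by digit, as an explicit chain of thought."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    carry, lines, digits = 0, [], []
    for i in range(max(len(xs), len(ys))):
        da = int(xs[i]) if i < len(xs) else 0
        db = int(ys[i]) if i < len(ys) else 0
        carry_in = carry
        carry, digit = divmod(da + db + carry_in, 10)
        digits.append(str(digit))
        lines.append(f"position {i}: {da} + {db} + {carry_in} -> write {digit}, carry {carry}")
    if carry:
        digits.append(str(carry))
        lines.append(f"final carry -> write {carry}")
    lines.append("answer: " + "".join(reversed(digits)))
    return "\n".join(lines)

print(addition_trace(857, 964))  # traces its way to 1821
```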

In summary, the researchers proposed a more comprehensive benchmark for assessing the basic ability of LLMs to understand and process numerical information (NUPA), evaluated several advanced LLMs on it, and explored three categories of approaches to improving NUPA: pre-training techniques, fine-tuning, and CoT. The results show that existing work is not yet sufficient to fully solve the NUPA problem, even though this ability underpins many more complex tasks. The authors hope that a systematic classification and a more comprehensive assessment of NUPA will draw the community's attention to this important but often overlooked fundamental ability.

Conclusion

As the article shows, tasks that are simple for us are hard for language models, and vice versa. But progress does not stand still: research continues on making language models more reliable at the simplest mathematical tasks. In the long run this will increase the trustworthiness of the information produced by language models, which are being embedded in more and more of our everyday processes.
