OpenAI's latest experimental reasoning model has demonstrated exceptional performance on the International Mathematical Olympiad (IMO), solving 5 of the 6 problems and earning 35 points, a gold-medal-level score. The result is being hailed as an important milestone for general reasoning in AI, although experts have raised concerns that the evaluation conditions may have differed significantly from those faced by human contestants.
The International Mathematical Olympiad, the most prestigious mathematics competition in the world, has served as a benchmark for high school students' mathematical ability since 1959. The contest runs over two days; on each day, participants have 4.5 hours to solve three exceptionally difficult problems using only pen and paper, with no form of communication permitted.
OpenAI's model was evaluated under the competition rules: two 4.5-hour exam sessions with no external tools, producing natural-language proofs from the official problem statements. Three IMO medalists independently graded the submissions to determine the final score.
Wei noted that the model demonstrates the ability to generate complex and rigorous mathematical arguments, stressing that the achievement did not rely on narrow, task-specific methods but reflects significant progress in general-purpose reinforcement learning and the scaling of compute.
OpenAI CEO Sam Altman said the achievement reflects a decade of progress in AI, while revealing that the model will not be released to the public in the near term. He described the result as part of the vision on which OpenAI was founded.
However, as AI's mathematical capabilities advance rapidly, experts have questioned the evaluation methods used. AI critic Gary Marcus called the model's performance impressive but questioned the validity of the training methods and their practical value to the general public. Some mathematicians have also pointed out that participants given more resources would see their chances of success increase significantly.
Recent test results from the independent evaluation platform MathArena show that major language models, including GPT-4, have performed poorly on IMO problems, producing proofs riddled with logical errors and gaps. That contrast makes OpenAI's announcement all the more striking, though its true significance still needs to be confirmed through independent validation and practical application.



