Are we AI Math yet?

Is there a computer system that, given a theorem statement, can prove it more reliably than 100 top mathematicians given 100 years?

Not yet.

FrontierMath Benchmark

FrontierMath is a benchmark by Epoch AI, created in collaboration with OpenAI, intended to test the limits of AI mathematics. It was announced in November 2024.

It consists of 290 questions whose answers are definite mathematical objects, such as integers or matrices, that can be checked automatically.
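
Because the answers are exact objects rather than free-form proofs, grading can be fully mechanical. Here is a minimal sketch of what such auto-grading might look like; the function names and the example problem are hypothetical illustrations, not taken from the actual benchmark:

```python
from fractions import Fraction

def check_integer_answer(submitted: int, reference: int) -> bool:
    """Exact equality; no floating-point tolerance and no partial credit."""
    return submitted == reference

def check_matrix_answer(submitted, reference) -> bool:
    """Compare two matrices entrywise using exact rational arithmetic."""
    if len(submitted) != len(reference):
        return False
    return all(
        len(row_s) == len(row_r)
        and all(Fraction(a) == Fraction(b) for a, b in zip(row_s, row_r))
        for row_s, row_r in zip(submitted, reference)
    )

# Example: a model asked for det [[2, 1], [7, 4]] must return exactly 1.
assert check_integer_answer(2 * 4 - 1 * 7, 1)
assert check_matrix_answer([[1, 0], [0, 1]], [[Fraction(1), 0], [0, 1]])
```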

[Figure: FrontierMath evaluations (source)]
Math Olympiads

Math olympiads are competitions where contestants solve mathematical problems. The most famous one is the International Mathematical Olympiad, a competition for high-school students from across the globe.

There is much interest in getting AI to perform well in these competitions. In November 2023, XTX Markets launched the AIMO Prize: a $5M grand prize for the first publicly shared AI model capable of winning an IMO gold medal, plus an additional $5M in progress prizes for partial milestones. Entries must work with the same input and output as human contestants (natural language, not computer-formal statements).

AlphaProof and AlphaGeometry 2

In July 2024, Google DeepMind announced "AI achieves silver-medal standard solving International Mathematical Olympiad problems". Working from hand-formalized problem statements, the systems AlphaProof and AlphaGeometry 2 together scored at the silver-medal level on the 2024 IMO.
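
To make "hand-formalized" concrete, here is a sketch of how a statement like IMO 2024 Problem 1 might be written in Lean 4 with mathlib. This is an illustration, not DeepMind's published formalization, and mathlib API details may differ:

```lean
import Mathlib

-- Sketch of IMO 2024 Problem 1: determine all real numbers α such that
-- n divides ⌊α⌋ + ⌊2α⌋ + ⋯ + ⌊nα⌋ for every positive integer n.
-- The answer (α is an even integer) must be conjectured before it can
-- even appear in the statement to be proved.
theorem imo2024_p1_sketch (α : ℝ) :
    (∀ n : ℕ, 0 < n → (n : ℤ) ∣ ∑ i in Finset.Icc 1 n, ⌊(i : ℝ) * α⌋) ↔
    (∃ k : ℤ, α = 2 * (k : ℝ)) := by
  sorry
```

Note that "determining the answer" is part of the work: the solver has to guess the right-hand side of the equivalence before it can attempt a proof.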

The types of problems on the IMO seem to present different levels of challenge for AI solutions:

AlphaProof solved two algebra problems and one number theory problem by first determining the answer and then proving it correct. This included the hardest problem in the competition, solved by only five human contestants at the 2024 IMO. AlphaGeometry 2 proved the geometry problem, while the two combinatorics problems remained unsolved.

MiniF2F

MiniF2F is a benchmark for formal (machine-checked) AI mathematics. Its test set consists of 244 "exercise statements from olympiads (AMC, AIME, IMO) as well as high-school and undergraduate maths classes".
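
To show the format, here is a miniF2F-style entry (paraphrased; the real benchmark file may differ in details). The benchmark ships only the formal statement; a prover "solves" it by replacing `sorry` with a proof the Lean kernel accepts:

```lean
import Mathlib

-- IMO 1959 Problem 1, stated formally: for any positive natural n,
-- the fraction (21n + 4)/(14n + 3) is already in lowest terms.
theorem imo_1959_p1 (n : ℕ) (h₀ : 0 < n) :
    Nat.gcd (21 * n + 4) (14 * n + 3) = 1 := by
  sorry
```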

The current state of the art seems to be ProofAug, solving 66% of the problems.

[Figure: MiniF2F performance over time, from Papers With Code; the top-right point is ProofAug]


Millennium Prize

The Millennium Prize Problems are seven important problems in mathematics for which the Clay Mathematics Institute has offered a prize of $1 million per problem. Currently, 1 problem has been solved by a human, 0 have been solved by AI, and 6 remain unsolved.

| Problem | Human or AI? | Solved By / Date |
| --- | --- | --- |
| Birch and Swinnerton‑Dyer Conjecture | - | - |
| Hodge Conjecture | - | - |
| Navier‑Stokes Equation | - | - |
| P vs NP | - | - |
| Poincaré Conjecture | 🧑 Human | Grigori Perelman, 2003 |
| Riemann Hypothesis | - | - |
| Yang–Mills Existence and Mass Gap | - | - |
Prediction Markets

Prediction markets are a social technology that crowdsources predictions using financial incentives. Included here are:

  1. Polymarket: a real-money prediction market site with curated questions.
  2. Manifold: a play-money prediction market site with user-submitted questions.
  3. Metaculus: a market-adjacent forecasting site.

Prediction markets theoretically let us obtain probabilities of future events without trusting any specific authority. However, on topics such as exciting new technologies, even short-term, real-money prediction markets can be quite dumb.
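
To make the mechanism concrete, here is a minimal sketch of Hanson's logarithmic market scoring rule (LMSR), one classic automated-market-maker design for prediction markets. It is illustrative only, not the exact mechanism used by any of the three sites above:

```python
import math

B = 100.0  # liquidity parameter: larger B means deeper, slower-moving prices

def cost(q_yes: float, q_no: float) -> float:
    """LMSR cost function C(q) = B * ln(e^(q_yes/B) + e^(q_no/B))."""
    return B * math.log(math.exp(q_yes / B) + math.exp(q_no / B))

def price_yes(q_yes: float, q_no: float) -> float:
    """Instantaneous YES price, i.e. the market's implied probability."""
    e_yes = math.exp(q_yes / B)
    return e_yes / (e_yes + math.exp(q_no / B))

# A trader who thinks P(event) exceeds the current price buys YES shares,
# paying the change in the cost function; the price (probability) rises.
q_yes, q_no = 0.0, 0.0
print(f"P(yes) before: {price_yes(q_yes, q_no):.3f}")  # 0.500
paid = cost(q_yes + 50, q_no) - cost(q_yes, q_no)      # cost of 50 YES shares
q_yes += 50
print(f"trader paid {paid:.2f} for 50 YES shares")
print(f"P(yes) after:  {price_yes(q_yes, q_no):.3f}")  # ~0.622
```

The key property is that the instantaneous price of a YES share equals the market's implied probability, so traders with better information profit by pushing the price toward their own estimate.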

[Embedded markets: IMO in 2025 · FrontierMath in 2025 · Long term]
