Solving Math Word Problems


We’ve trained a system that solves grade school math problems with nearly twice the accuracy of a fine-tuned GPT-3 model. It solves about 90% as many problems as real kids: a small sample of 9-12 year olds scored 60% on a test from our dataset, while our system scored 55% on those same problems. This is important because today’s AI is still quite weak at commonsense multistep reasoning, which is easy even for grade school kids. We achieved these results by training our model to recognize its mistakes, so that it can try repeatedly until it finds a solution that works.



Large language models like GPT-3 have many impressive skills, including their ability to imitate many writing styles, and their extensive factual knowledge. However, they struggle to perform tasks that require accurate multistep reasoning, like solving grade school math word problems. Although the model can mimic the cadence of correct solutions, it regularly produces critical errors in logic.

To match human performance in complex logical domains, our models must learn to recognize their mistakes and to choose their steps carefully. To that end, we train verifiers to evaluate whether or not a proposed solution is correct. To solve a new problem, we use verifiers to select the best among many proposed solutions. We collected the new GSM8K dataset to evaluate our methods, and we are releasing this dataset to facilitate research.

In the ten examples below, we show solutions generated by our new method, verification, and our baseline method, fine-tuning.

GSM8K Dataset

GSM8K consists of 8.5K high quality grade school math word problems. Each problem takes between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ − × ÷) to reach the final answer. Fine-tuned state-of-the-art language models perform poorly on this dataset, primarily due to the high diversity of problems. At the same time, GSM8K solutions depend only on elementary concepts, so achieving high test performance is a tractable goal.

Solutions in GSM8K are written as natural language rather than as pure math expressions. By sticking to natural language, model-generated solutions are more readily interpretable by humans, and our methods remain relatively domain agnostic.
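To make the format concrete, here is an illustrative record in the style of the dataset: a natural-language question paired with a step-by-step natural-language solution that ends in a final answer after a delimiter. The exact field names and the small parsing helper below are assumptions for illustration, not the released schema.

```python
# An illustrative GSM8K-style record (field names are assumed for this sketch):
# a word problem plus a natural-language solution ending in "#### <answer>".
example = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether "
        "in April and May?"
    ),
    "answer": (
        "Natalia sold 48 / 2 = 24 clips in May.\n"
        "Natalia sold 48 + 24 = 72 clips altogether in April and May.\n"
        "#### 72"
    ),
}

def final_answer(solution: str) -> str:
    """Extract the final numeric answer after the '####' delimiter."""
    return solution.split("####")[-1].strip()

print(final_answer(example["answer"]))  # -> 72
```

Because every solution spells out its intermediate reasoning in plain English, a reader (or a verifier) can check each step, not just the final number.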

Training Verifiers: Models that Learn from their Mistakes

One significant challenge in mathematical reasoning is the high sensitivity to individual mistakes. Autoregressive models, which generate each solution token by token, have no mechanism to correct their own errors. Solutions that veer off-course quickly become unrecoverable, as can be seen in the examples provided.

We address this problem by training verifiers to evaluate the correctness of model-generated solutions. Verifiers are given many possible solutions, all written by the model itself, and they are trained to decide which ones, if any, are correct.

To solve a new problem at test time, we generate 100 candidate solutions and then select the solution that is ranked highest by the verifier. Verifiers benefit from this inherent optionality, as well as from the fact that verification is often a simpler task than generation.
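The test-time procedure is just best-of-n selection. The sketch below shows the shape of it with toy stand-ins for the generator and the verifier; in practice both would be large language models, and the `generate` and `verifier_score` functions here are hypothetical placeholders.

```python
import random

def solve(problem, generate, verifier_score, n=100):
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=verifier_score)

# Toy stand-ins: a noisy "generator" that sometimes produces a wrong answer,
# and a "verifier" that scores solutions ending in the correct answer higher.
random.seed(0)

def generate(problem):
    return "#### " + random.choice(["72", "144", "72"])

def verifier_score(solution):
    return 1.0 if solution.endswith("72") else 0.0

print(solve("Natalia's clips problem", generate, verifier_score))
```

Even a mediocre generator paired with a reliable verifier does well here: with 100 samples, the chance that no candidate is correct shrinks rapidly, which is the "inherent optionality" described above.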

We find that we get a strong boost in performance from verification, as long as the dataset is large enough. With datasets that are too small, we believe that the verifiers overfit by memorizing the final answers in the training set, rather than learning any more useful properties of mathematical reasoning.

On the full training set, 6B parameter verification slightly outperforms a fine-tuned 175B parameter model, giving a performance boost that is approximately equivalent to a 30x model size increase. Moreover, verification appears to scale more effectively with additional data, if we extrapolate based on current results.


Producing correct arguments and recognizing incorrect ones are key challenges in developing more general AI. Grade school math is an ideal testbed for these capabilities. The problems in GSM8K are conceptually simple, yet one subtle mistake is enough to derail an entire solution. Identifying and avoiding such mistakes is a crucial skill for our models to develop. By training verifiers, we teach our models to separate the good solutions from the ones that didn’t quite work out. We expect these skills to become increasingly relevant as we attempt to apply our models to more logically complex domains.