One of the biggest problems with using AI in education is that the technology hallucinates. That’s the word the artificial intelligence community uses to describe how its newest large language models make up stuff that doesn’t exist or isn’t true. Math is a particular land of make-believe for AI chatbots. Several months ago, I tested Khan Academy’s chatbot, which is powered by ChatGPT. The bot, called Khanmigo, told me I had answered a basic high school Algebra 2 problem involving negative exponents wrong. I knew my answer was right. After typing in the same correct answer three times, Khanmigo finally agreed with me. It was frustrating.
Errors matter. Kids could memorize incorrect solutions that are hard to unlearn, or become more confused about a topic. I also worry about teachers using ChatGPT and other generative AI models to write quizzes or lesson plans. At least a teacher has the opportunity to vet what AI spits out before giving or teaching it to students. It’s riskier when you’re asking students to learn directly from AI.
Computer scientists are attempting to combat these errors in a process they call “mitigating AI hallucinations.” Two researchers from the University of California, Berkeley, recently documented how they reduced ChatGPT’s instructional errors to near zero in algebra. They were not as successful with statistics, where their techniques still left errors 13 percent of the time. Their paper was published in May 2024 in the peer-reviewed journal PLOS One.
In the experiment, Zachary Pardos, a computer scientist at the Berkeley School of Education, and one of his students, Shreya Bhandari, first asked ChatGPT to show how it would solve an algebra or statistics problem. They discovered that ChatGPT was “naturally verbose”; they did not have to prompt the large language model to explain its steps. But all those words didn’t help with accuracy. On average, ChatGPT’s methods and answers were wrong a third of the time. In other words, ChatGPT would earn a grade of D if it were a student.
Current AI models are bad at math because they’re programmed to figure out probabilities, not follow rules. Math calculations are all about rules. It’s ironic because earlier versions of AI were able to follow rules, but unable to write or summarize. Now we have the opposite.
The Berkeley researchers took advantage of the fact that ChatGPT, like humans, is erratic. They asked ChatGPT to answer the same math problem 10 times in a row. I was surprised that a machine might answer the same question differently, but that is what these large language models do. Often the step-by-step process and the answer were the same, but the exact wording differed. Sometimes the methods were bizarre and the results were dead wrong. (See an example in the illustration below.)
Researchers grouped similar answers together. When they assessed the accuracy of the most common answer among the 10 solutions, ChatGPT was astonishingly good. For basic high-school algebra, AI’s error rate fell from 25 percent to zero. For intermediate algebra, the error rate fell from 47 percent to 2 percent. For college algebra, it fell from 27 percent to 2 percent.
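The grouping-and-voting idea above can be sketched as a simple majority vote. This is a toy illustration, not the researchers’ actual code, and the sample answers are made up; in practice, answers worded differently but mathematically equivalent would first need to be normalized into a common form before counting.

```python
from collections import Counter

def self_consistency(answers):
    """Return the most common final answer among repeated samples,
    along with its vote count. Assumes each answer has already been
    extracted and normalized to a comparable string."""
    counts = Counter(answers)
    best, votes = counts.most_common(1)[0]
    return best, votes

# Hypothetical example: 10 runs of the same algebra problem.
# Seven runs agree on x = 4; three erratic runs disagree.
samples = ["x = 4"] * 7 + ["x = -4", "x = 2", "x = 4/3"]
best, votes = self_consistency(samples)
print(best, votes)  # x = 4, with 7 of 10 votes
```

The key assumption is that correct solutions cluster while hallucinated ones scatter, so the majority answer is far more reliable than any single run.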
[Illustration: ChatGPT answered the same algebra question three different ways, but it landed on the right response seven out of 10 times in this example.]
However, when the scientists applied this method, which they call “self-consistency,” to statistics, it did not work as well. ChatGPT’s error rate fell from 29 percent to 13 percent, but still more than one out of 10 answers was wrong. I think that’s too many errors for students who are learning math.
The big question, of course, is whether ChatGPT’s solutions help students learn math better than traditional teaching. In a second part of this study, the researchers recruited 274 adults online to solve math problems and randomly assigned a third of them to see ChatGPT’s solutions as a “hint” if they needed one. (ChatGPT’s wrong answers were removed first.) On a short test afterwards, these adults improved 17 percent, compared with learning gains of less than 12 percent for the adults who could see a different set of hints written by undergraduate math tutors. Those who weren’t offered any hints scored about the same on the post-test as they did on the pre-test.
Those impressive learning results for ChatGPT prompted the study authors to boldly predict that “completely autonomous generation” of an effective computerized tutoring system is “around the corner.” In theory, ChatGPT could instantly digest a book chapter or a video lecture and then immediately turn around and tutor a student on it.
Before I embrace that optimism, I’d like to see how much real students – not just adults recruited online – actually use these automated tutoring systems. Even in this study, where adults were paid to do math problems, 120 of the roughly 400 participants didn’t complete the work, and their results had to be thrown out. For many kids, and especially for students who are struggling in a subject, learning from a computer just isn’t engaging.
This story about AI hallucinations was written by Jill Barshay and produced by The Hechinger Report, a nonprofit, independent news organization focused on inequality and innovation in education.