AI Code Debugging Still a Challenge: Microsoft Research Highlights Limitations of OpenAI and Anthropic Models

Edited by: Veronika Nazarova

A recent Microsoft Research study finds that AI models from OpenAI and Anthropic still struggle to debug code effectively. Published in April 2025, the study introduced debug-gym, a new environment designed to train AI coding tools in the complex art of debugging, and used it to evaluate nine AI models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o1 and o3-mini, on the SWE-bench Lite benchmark with debugging tools. Claude 3.7 Sonnet achieved the highest success rate, at 48.4%. The researchers attributed the limited performance to a shortage of training data representing sequential decision-making behavior, that is, the step-by-step traces a developer produces while working through a debugger. Despite the mixed results, the research underscores the ongoing need for human expertise in software development and the potential for future advances in AI debugging capabilities.
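The "debugging tools" in question are interactive debugger commands the model can issue one step at a time, observing the output of each before choosing the next. As a rough illustration of that loop (not debug-gym's actual API; the buggy function and the scripted command list below are hypothetical stand-ins), this Python sketch drives the standard library's pdb with a fixed command sequence, the way an agent would drive it with model-chosen ones:

```python
import io
import pdb

def buggy_average(values):
    """Return the mean of values -- with a deliberate off-by-one bug."""
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # bug: should divide by len(values)

# Scripted stand-in for the model: in an environment like debug-gym,
# each command would be chosen by the AI after reading the debugger's
# previous output, not fixed in advance like this.
commands = [
    "p values",                     # inspect the input
    "r",                            # run until the function is about to return
    "p total / len(values)",        # the correct mean: 4.0
    "p total / (len(values) - 1)",  # the buggy value actually returned: 6.0
    "c",                            # let execution finish
]

out = io.StringIO()
debugger = pdb.Pdb(stdout=out)
debugger.cmdqueue = list(commands)  # feed the scripted commands to pdb
result = debugger.runcall(buggy_average, [2, 4, 6])

print(out.getvalue())               # transcript a model would condition on
print("returned:", result)          # 6.0, confirming the bad divisor
```

In a training environment of this kind, the command list is generated one step at a time, conditioned on the growing debugger transcript; the study's point is that models have seen very little training data of exactly this sequential form.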
