AI 'Mind's Eye': Visual Reasoning Boosts Performance in Complex Tasks, Cambridge and Microsoft Researchers Find

When humans try to solve problems, they often visualize the tasks in their heads. New research suggests that enabling artificial intelligence to do the same could boost performance on spatial reasoning challenges.

While large language models excel at many text-based tasks, they often struggle with those that require more complex reasoning. To try to close that gap, researchers at the University of Cambridge and Microsoft Research have developed a new approach that lets AI "think" in both text and images.

The technique enables multimodal large language models to generate visual representations of their intermediate reasoning steps. In non-peer-reviewed research posted to arXiv, the researchers report that when they tested the approach on spatial reasoning challenges involving 2D mazes, they saw significant improvements over the typical "chain-of-thought" (CoT) technique on the most challenging scenarios.

"Spatial relations and layouts and also some geometric features are very hard to describe with pure text," says co-lead author Chengzu Li, a Ph.D. student at Cambridge. "That's why we think that reasoning with pure text would limit the performance of the model in spatial tasks. And that's the main motivation for introducing visual 'thoughts,'" he says.

The new approach enables a single multimodal model to generate both visual and textual reasoning steps on its own. For these experiments, the researchers used Anole, a pre-trained model that can respond in either modality, and fine-tuned it on text and image data from three maze-like games of varying complexity. They called their fine-tuned version Multimodal Visualization of Thought (MVoT).
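To make the idea concrete, a single fine-tuning example might interleave textual reasoning with rendered images of intermediate maze states, roughly as in the sketch below. This is an illustration only: the dictionary layout, the coordinates, and the `load_maze_image` helper are assumptions for this article, not the authors' actual data format.

```python
# Hypothetical sketch of one interleaved training example for
# MVoT-style fine-tuning. The dict layout, coordinates, and the
# load_maze_image helper are illustrative assumptions, not the
# authors' actual data format.

def load_maze_image(path: str) -> bytes:
    """Stand-in loader; a real pipeline would read rendered maze frames."""
    # Placeholder bytes so the sketch runs without image assets on disk.
    return f"<image bytes for {path}>".encode()

# The prompt gives the starting state and the action sequence; the
# target interleaves a textual rationale with an image of the maze
# after each action, ending in a prediction of the outcome.
example = {
    "prompt": [
        {"type": "image", "data": load_maze_image("maze_start.png")},
        {"type": "text", "data": "Actions: move right, then move down."},
    ],
    "target": [
        {"type": "text", "data": "After moving right, the agent is at (1, 0)."},
        {"type": "image", "data": load_maze_image("maze_step1.png")},
        {"type": "text", "data": "After moving down, the agent is at (1, 1)."},
        {"type": "image", "data": load_maze_image("maze_step2.png")},
        {"type": "text", "data": "Prediction: the agent reaches the goal."},
    ],
}
```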

During testing, the model was given only the starting image and a sequence of actions to perform. It then generated interleaved image and text reasoning steps, followed by a prediction of the outcome.
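Conceptually, the inference loop looks something like the minimal sketch below, which simplifies matters by emitting one text thought and one image thought per action. The `MultimodalModel` class and its generate methods are toy stand-ins for a fine-tuned Anole-style decoder, not the actual API; in the real system, the model itself decides when to produce text versus image tokens.

```python
# Minimal sketch of MVoT-style inference. MultimodalModel and its
# generate_* methods are toy stand-ins for a fine-tuned Anole-style
# decoder, not a real API.

class MultimodalModel:
    def generate_text(self, context: list) -> str:
        """Toy text 'thought' conditioned on the running context."""
        return f"Step {len(context)}: described the agent's new position."

    def generate_image(self, context: list) -> bytes:
        """Toy visual 'thought': a rendered intermediate maze state."""
        return b"<rendered maze state>"

def mvot_inference(model: MultimodalModel, start_image: bytes,
                   actions: list) -> str:
    # The model sees only the starting image and the action sequence.
    context = [start_image, "Actions: " + ", ".join(actions)]
    for _ in actions:
        # One textual and one visual reasoning step per action, both
        # appended so later steps condition on earlier thoughts.
        context.append(model.generate_text(context))
        context.append(model.generate_image(context))
    # The final prediction conditions on all interleaved thoughts.
    return "Prediction: " + model.generate_text(context)

print(mvot_inference(MultimodalModel(), b"<start image>", ["right", "down"]))
```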

They found that on all three games, the MVoT model significantly outperformed every baseline except the one using traditional text-based CoT. That baseline actually did slightly better on the two simpler mazes, successfully predicting the outcome 98 percent of the time on both, compared with MVoT's scores of 93 percent and 95 percent. But the text-based CoT model did much worse on the most complicated game, scoring just 61 percent compared with MVoT's 86 percent.

The researchers say this outcome is likely because CoT relies on accurate textual descriptions of the environment, which become harder to produce as the mazes grow more complex. In contrast, including images in the reasoning process appears to make MVoT much better at handling challenging environments.

Li says extending this approach into more complex domains could have broad applications. One of the most compelling is robotics, where the approach could help machines reason more effectively about the visual input they get from the environment. It could also help AI tutors better illustrate and explain ideas, particularly in areas like geometry. More broadly, he says the approach can boost model interpretability by giving humans a clear picture of what the model is thinking about in spatial tasks.
