Robots That Think Before They Act

The TiPToP system lets robots reason about a scene, using spatial modeling and plain-language instructions to plan and complete tasks.

TiPToP's interface allows a robot to identify the items in view and determine the most effective and efficient way to grasp and move them.

Image courtesy of William Shen

For decades, roboticists have worked to solve a stubborn problem: while robots can successfully complete specific routine tasks, they are stymied when they encounter something unfamiliar. Telling a robot to "clear the table" doesn’t mean it will pick up the mugs, straighten the papers, and wipe up the spills — that kind of flexible, common-sense competence has remained frustratingly out of reach. Most modern approaches rely on training robots with enormous datasets of demonstrations, teaching a robot to imitate the behaviors it has seen. The hope is that with enough data, the robot will generalize to new situations — but in practice, this often fails if a task differs from what was demonstrated or requires composing several steps.

A new system called TiPToP, developed by researchers at MIT and tested by researchers at the University of Pennsylvania, takes a different approach. Rather than learning from thousands of hours of robot-specific demonstrations, TiPToP gives robots the ability to reason: to look at a scene, understand a plain-language instruction, build a mental model of the environment, and plan a sequence of actions before moving a single joint. The name stands for "TiPToP is a Planner That just works on Pixels," and the system requires zero robot training data. Because no retraining is needed, getting it running on a new robot arm takes just a few hours.

"Instead of training on large amounts of robotics data and hoping new problems are similar to previously-seen problems, TiPToP reasons online: it thinks about how to solve each new problem it encounters," says Nishanth Kumar, SM '24, PhD '26, a primary author of TiPToP.

The foundations of TiPToP rest on three interlocking ideas. The first is modularity: rather than one monolithic model trying to do everything, TiPToP chains together specialized components for perception, planning, and execution, each of which can be independently improved or swapped out as better tools become available. The second is leveraging knowledge from foundation models: TiPToP uses Gemini, Google's large vision-language model, to translate natural-language commands into formal symbolic goals, giving the robot access to the vast visual and semantic knowledge already encoded in a model trained on the breadth of the internet. The third is reasoning via search: instead of committing to an action outright, an algorithm called cuTAMP explores many possible sequences of actions to find solutions that satisfy the goal while respecting physical constraints, such as avoiding other items in the surrounding area or placing objects safely and stably. This allows TiPToP to handle what researchers call "long-horizon" tasks, such as picking just the plush toys from a bin of stuffed animals, wooden trains, and toy food, or moving an obstacle out of the way in order to fill a tray with a drink and snacks.

In head-to-head evaluations against π0.5-DROID, a state-of-the-art robot system fine-tuned on 350 hours of demonstrations, TiPToP achieved a 59.4% success rate compared to 33.3%, and completed tasks 37% faster on average.
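To make the modular, search-based idea concrete, here is a minimal toy sketch in Python. It is not TiPToP's code and does not use cuTAMP or Gemini: the scene, the `BLOCKERS` rule, the tray capacity, and every function name are assumptions invented for illustration. A hard-coded lookup stands in for the vision-language step, and a plain breadth-first search stands in for the planner. It mirrors the tray example above, in which an obstacle has to be moved before the drink can be grasped.

```python
from collections import deque

# Hypothetical physical constraints for this toy world (not from TiPToP):
# the drink cannot be grasped while the box still sits on the table in front of it.
BLOCKERS = {"drink": "box"}
TRAY_CAPACITY = 2


def perceive():
    """Stand-in for perception: returns object name -> location."""
    return {"box": "table", "drink": "table", "snack": "table"}


def parse_goal(instruction):
    """Stand-in for the vision-language step: instruction -> symbolic goal."""
    if instruction == "put the drink and the snack on the tray":
        return {"drink": "tray", "snack": "tray"}
    raise ValueError(f"unsupported instruction: {instruction!r}")


def is_feasible(state, obj, dest):
    """Check the toy constraints for moving obj to dest in the given state."""
    if state[obj] == dest:
        return False  # nothing to do
    blocker = BLOCKERS.get(obj)
    if blocker is not None and state[blocker] == "table":
        return False  # the grasp would collide with the obstacle
    in_tray = sum(loc == "tray" for loc in state.values())
    if dest == "tray" and in_tray >= TRAY_CAPACITY:
        return False  # no stable spot left on the tray
    return True


def plan(start, goal, max_steps=6):
    """Breadth-first search over move sequences; returns a shortest plan or None."""
    queue = deque([(start, [])])
    seen = {tuple(sorted(start.items()))}
    while queue:
        state, actions = queue.popleft()
        if all(state[obj] == loc for obj, loc in goal.items()):
            return actions
        if len(actions) == max_steps:
            continue
        for obj in sorted(state):
            for dest in ("tray", "side"):
                if not is_feasible(state, obj, dest):
                    continue
                nxt = {**state, obj: dest}
                key = tuple(sorted(nxt.items()))
                if key not in seen:
                    seen.add(key)
                    queue.append((nxt, actions + [(obj, dest)]))
    return None


def execute(state, actions):
    """Stand-in for execution: applies each move, re-checking feasibility."""
    state = dict(state)
    for obj, dest in actions:
        if not is_feasible(state, obj, dest):
            raise RuntimeError(f"cannot move {obj} to {dest}")
        state[obj] = dest
    return state


def run(instruction):
    """Chain the modules: perceive, translate the goal, plan, then execute."""
    scene = perceive()
    goal = parse_goal(instruction)
    actions = plan(scene, goal)
    if actions is None:
        raise RuntimeError("no plan found within the step limit")
    return actions, execute(scene, actions)
```

Calling `run("put the drink and the snack on the tray")` yields a three-move plan that sets the box aside before the drink is picked up, even though the instruction never mentions the box. Because each stage is a separate function, any one of them could be replaced without touching the others, which is the point of the modular design described above.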

TiPToP in action: Using TiPToP, a robot can identify all objects in a scene before executing a task.

"What surprised us is that a system using no robot training data could match — and often beat — end-to-end models trained on hundreds of hours of demonstrations," says William Shen, SM '23 and current PhD student, who is also a primary author of TiPToP. "We don’t see planning and learning as competing. The really exciting open question is how best to combine them."

The researchers see systems like TiPToP potentially being used in semi-structured environments with varied, unpredictable tasks — warehouses, manufacturing floors, and eventually, domestic settings. A warehouse robot using TiPToP could handle novel products without retraining; a factory arm could adapt to new assembly configurations as needed. Further down the road, the team envisions TiPToP as a reasoning core inside a broader system — integrated with mapping technologies that help mobile robots track their surroundings — that could power a household helper robot capable of interpreting everyday requests in everyday language.

The team designed the system to be easy for others to install and use: it is open-source and comes with detailed setup instructions. They are actively working to improve several aspects, particularly making grasps more reliable and executing plans in a closed loop, using perception to trigger replanning when an error occurs.

"TiPToP is a great demonstration of the capacity of planning-based methods to be efficient and general. I hope it will inspire others to explore the integration of reasoning and learning to obtain generally intelligent robot behavior," says Leslie Pack Kaelbling, Panasonic Professor in the Department of Electrical Engineering and Computer Science and Director of Research for the MIT Siegel Family Quest for Intelligence.