Generative AI models are getting closer to taking action in the real world. Already, the big AI companies are introducing AI agents that can take care of web-based busywork for you, ordering your groceries or making your dinner reservation. Today, Google DeepMind announced two generative AI models designed to power tomorrow’s robots.
The models are both built on Google Gemini, a multimodal foundation model that can process text, voice, and image data to answer questions, give advice, and generally help out. DeepMind calls the first of the new models, Gemini Robotics, an “advanced vision-language-action model,” meaning that it can take all those same inputs and then output instructions for a robot’s physical actions. The models are designed to work with any hardware system, but were mostly tested on the two-armed Aloha 2 system that DeepMind introduced last year.

In a demonstration video, a voice says: “Pick up the basketball and slam dunk it” (at 2:27 in the video below). Then a robot arm carefully picks up a miniature basketball and drops it into a miniature net—and while it wasn’t an NBA-level dunk, it was enough to get the DeepMind researchers excited.
Google DeepMind released this demo video showing off the capabilities of its Gemini Robotics foundation model to control robots. Gemini Robotics
“This basketball example is one of my favorites,” said Kanishka Rao, the principal software engineer for the project, in a press briefing. He explained that the robot had “never, ever seen anything related to basketball,” but that its underlying foundation model had a general understanding of the game, knew what a basketball net looks like, and understood what the term “slam dunk” meant. The robot was therefore “able to connect those [concepts] to actually accomplish the task in the physical world,” said Rao.
What are the advances of Gemini Robotics?
Carolina Parada, head of robotics at Google DeepMind, said in the briefing that the new models improve over the company’s prior robots in three dimensions: generalization, adaptability, and dexterity. All of these advances are necessary, she said, to create “a new generation of helpful robots.”
Generalization means that a robot can apply a concept learned in one context to another situation. The researchers looked at visual generalization (for example, does the robot get confused if the color of an object or the background changes), instruction generalization (can it interpret commands that are worded in different ways), and action generalization (can it perform an action it has never done before).
Parada also says that robots powered by Gemini can better adapt to changing instructions and circumstances. To demonstrate that point in a video, a researcher told a robot arm to put a bunch of plastic grapes into the clear Tupperware container, then proceeded to shift three containers around on the table in an approximation of a shyster’s shell game. The robot arm dutifully followed the clear container around until it could fulfill its directive.
Google DeepMind says Gemini Robotics is better than previous models at adapting to changing instructions and circumstances. Google DeepMind
As for dexterity, demo videos showed the robotic arms folding a piece of paper into an origami fox and performing other delicate tasks. However, it’s important to note that this impressive performance comes from training on a narrow set of high-quality data for these specific tasks, so this level of dexterity does not yet generalize.
What Is Embodied Reasoning?
The second model introduced today is Gemini Robotics-ER, with the ER standing for “embodied reasoning,” which is the sort of intuitive physical world understanding that humans develop with experience over time. We’re able to do clever things like look at an object we’ve never seen before and make an educated guess about the best way to interact with it, and this is what DeepMind seeks to emulate with Gemini Robotics-ER.
Parada gave an example of Gemini Robotics-ER’s ability to identify an appropriate grasping point for picking up a coffee cup. The model correctly identifies the handle, because that’s where humans tend to grasp coffee mugs. However, this illustrates a potential weakness of relying on human-centric training data: for a robot, especially a robot that might be able to comfortably handle a mug of hot coffee, a thin handle might be a much less reliable grasping point than a more enveloping grasp of the mug itself.
DeepMind’s Approach to Robotic Safety
Vikas Sindhwani, DeepMind’s head of robotic safety for the project, says the team took a layered approach to safety. It starts with classic physical safety controls that manage things like collision avoidance and stability, but also includes “semantic safety” systems that evaluate both a robot’s instructions and the consequences of following them. These systems are most sophisticated in the Gemini Robotics-ER model, says Sindhwani, which is “trained to evaluate whether or not a potential action is safe to perform in a given scenario.”
And because “safety is not a competitive endeavor,” Sindhwani says, DeepMind is releasing a new data set and what it calls the Asimov benchmark, which is intended to measure a model’s ability to understand common-sense rules of life. The benchmark contains both questions about visual scenes and text scenarios, asking for models’ opinions on things like the desirability of mixing bleach and vinegar (a combination that makes chlorine gas) and putting a soft toy on a hot stove. In the press briefing, Sindhwani said that the Gemini models had “strong performance” on that benchmark, and the technical report showed that the models got more than 80 percent of questions correct.
DeepMind’s Robotic Partnerships
Back in December, DeepMind and the humanoid robotics company Apptronik announced a partnership, and Parada says that the two companies are working together “to build the next generation of humanoid robots with Gemini at its core.” DeepMind is also making its models available to an elite group of “trusted testers”: Agile Robots, Agility Robotics, Boston Dynamics, and Enchanted Tools.