Language models like AI chatbots: Google trains a new generation of robots

The quest for helpful robots has always been a Herculean task, because a robot that is meant to perform general tasks in the real world must be able to handle complex, abstract tasks in highly variable environments – especially ones it has never seen before.

Robotics Transformer 2, or RT-2, is the new version of what Google calls a vision-language-action (VLA) model. RT-2 is a Transformer-based model trained on text and images from the Web that can output robot actions directly. Just as language models learn general ideas and concepts from Web text, RT-2 uses Web data to teach robots to better recognize visual and linguistic patterns, interpret instructions, and infer which objects best fit a given query.
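The key trick behind "outputting robot actions directly" is that continuous actions can be discretized into tokens, so the same model that emits text tokens can emit action tokens. The sketch below is only an illustration of this idea; the function names and the 256-bin scheme are assumptions for demonstration, not Google's actual RT-2 implementation.

```python
# Illustrative sketch: encoding continuous robot actions as discrete
# tokens, so a text-generating model can emit them in its output stream.
# The 256-bin resolution and value range are assumed, not taken from RT-2.

NUM_BINS = 256  # assumed discretization resolution per action dimension


def action_to_tokens(action, low=-1.0, high=1.0):
    """Discretize continuous action values (e.g. end-effector
    displacement, rotation, gripper state) into integer token IDs."""
    tokens = []
    for value in action:
        clipped = max(low, min(high, value))
        bin_id = int((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(bin_id)
    return tokens


def tokens_to_action(tokens, low=-1.0, high=1.0):
    """Invert the discretization: map token IDs back to continuous values."""
    return [low + t / (NUM_BINS - 1) * (high - low) for t in tokens]


# Example: a hypothetical 3-dimensional end-effector displacement
action = [0.25, -0.5, 1.0]
tokens = action_to_tokens(action)
recovered = tokens_to_action(tokens)
```

The round trip loses at most half a bin width of precision, which is the usual trade-off of this tokenization approach: coarse enough to fit a language model's vocabulary, fine enough to drive a robot arm.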

Unlike chatbots, however, robots need something like "grounding": they must link real-world circumstances with their own capabilities. Their training is not just about learning everything there is to know about an apple, for example – how it grows, what its physical properties are, or even that one supposedly fell on Sir Isaac Newton's head. A robot must be able to recognize an apple in context, distinguish it from a red ball, understand what it looks like and, most importantly, know how to pick it up.
