By Hussameddine Al Attar | Staff Writer

 

The space of artificial intelligence has seen an incredible expansion in recent years. The meteoric rise of Generative Pre-trained Transformers (GPTs) has allowed engineers to develop deep learning systems and large language models that can adapt and respond to virtually any user input and produce astounding results. This is largely made possible by the expansive text and image datasets that have been compiled and released. The development of robotic models, however, has not been able to keep pace with that of their linguistic counterparts, largely due to a shortage of relevant training data.

To tackle this impediment, Google developed PaLM-E, an “embodied multimodal language model” that integrates natural language processing with computer vision and other sensory inputs. By leveraging these sensory inputs, PaLM-E can better comprehend language within the context of physical experiences. PaLM-E serves two primary purposes. As a robotics model, it can solve a variety of tasks across different types of robots and modalities, such as images and neural scene representations. It is also a versatile vision-and-language model that can perform visual tasks like image description, object detection, and scene classification, as well as language tasks like generating code, writing poetry, and summarizing text passages.

PaLM-E combines two of Google’s most powerful models: PaLM, one of the largest language models in existence with 540 billion parameters, and ViT, one of its most advanced computer vision models with 22 billion parameters. The result is a groundbreaking vision-language model with an unprecedented 562 billion parameters. This model achieved new state-of-the-art performance on the Outside Knowledge Visual Question Answering (OK-VQA) benchmark without any task-specific fine-tuning. Remarkably, these gains came while retaining nearly all of the language capabilities of the text-only PaLM.

PaLM-E works by injecting sensory inputs such as images, robot states, and scene embeddings into a pre-trained language model. These inputs are transformed into representations through a procedure analogous to how a language model processes words. In language models, text is mathematically represented as tokens that encode (sub)words, and each token is associated with a high-dimensional vector of numbers known as a token embedding. The language model then applies mathematical operations to the sequence of vectors to predict the token most likely to follow. By converting other input types into the same space as word token embeddings, PaLM-E can feed them into the same language model, processing text, images, scene embeddings, and potentially other data types simultaneously.
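To make that idea concrete, here is a minimal sketch of the general technique, not PaLM-E’s actual code: the dimensions are made up and random features stand in for a real vision encoder. It simply projects image features into the same vector space as word token embeddings so the language model can consume them as a single sequence.

```python
# Minimal sketch (illustrative sizes, not PaLM-E's real architecture).
import torch
import torch.nn as nn

d_model = 512          # token-embedding dimension of the language model (assumed)
vocab_size = 32_000    # vocabulary size (assumed)
img_feat_dim = 1024    # feature dimension of a vision encoder's output (assumed)

token_embed = nn.Embedding(vocab_size, d_model)   # standard word-token embeddings
img_project = nn.Linear(img_feat_dim, d_model)    # learned projection for image features

# A tokenized text prompt and a batch of image-patch features from a vision model.
text_ids = torch.tensor([[5, 874, 1209, 42]])     # (batch, num_text_tokens)
img_feats = torch.randn(1, 16, img_feat_dim)      # (batch, num_patches, feat_dim)

text_vecs = token_embed(text_ids)                 # (1, 4, d_model)
img_vecs = img_project(img_feats)                 # (1, 16, d_model) -- now in token space

# The language model sees one sequence of vectors, regardless of whether each
# vector originally came from a word token or an image patch.
sequence = torch.cat([img_vecs, text_vecs], dim=1)   # (1, 20, d_model)
```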

PaLM-E takes in text and other modalities in an arbitrary order, forming what the authors call “multimodal sentences,” and generates text output using the pre-trained language model. For instance, an input could be given as “What color blocks are present in <img>?” where <img> denotes an uploaded image, and the model would answer with the colors of the blocks present in the image. This could be expanded into “Given the following <emb> and <img>, how can we grab the blue block?” where <emb> and <img> correspond to a scene embedding and an image, respectively. The output could be an answer to a question or a sequence of decisions expressed as text.
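One way to picture how such a placeholder might be handled is the hypothetical helper below; the function name, toy tokenizer, and shapes are assumptions for illustration, not the released PaLM-E interface. It splits a prompt on <img> and splices already-projected image vectors into the token-embedding sequence at that position.

```python
# Hypothetical sketch: splicing projected image vectors into a prompt
# wherever the <img> placeholder appears. Not the real PaLM-E API.
import torch
import torch.nn as nn

d_model = 512                                   # illustrative embedding size
token_embed = nn.Embedding(32_000, d_model)     # stand-in word-token embeddings

def toy_tokenizer(text):
    # Toy tokenizer: maps each whitespace-separated word to an arbitrary id.
    return [hash(word) % 32_000 for word in text.split()]

def build_multimodal_sequence(prompt, image_vecs):
    """Split the prompt on <img>, embed each text span, and insert the
    already-projected image vectors at the placeholder's position."""
    chunks = []
    pieces = prompt.split("<img>")
    for i, piece in enumerate(pieces):
        if piece.strip():
            ids = torch.tensor([toy_tokenizer(piece)])    # (1, num_tokens)
            chunks.append(token_embed(ids))               # (1, num_tokens, d_model)
        if i < len(pieces) - 1:
            chunks.append(image_vecs)                     # (1, num_patches, d_model)
    return torch.cat(chunks, dim=1)                       # one sequence for the LM

# Example: 16 projected image-patch vectors stand in for a real encoder's output.
img_vecs = torch.randn(1, 16, d_model)
seq = build_multimodal_sequence("What color blocks are present in <img> ?", img_vecs)
print(seq.shape)   # e.g. torch.Size([1, 23, 512])
```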

PaLM-E introduces a novel approach to training a versatile model by transferring knowledge from the visual and language domains into a robotic system, leading to improved robot learning outcomes. The results indicate that PaLM-E can handle various robotics, vision, and language tasks simultaneously without compromising performance, unlike approaches that train a distinct model for each task. Moreover, training on visual-language data actually improves performance on robot tasks. This knowledge transfer allows PaLM-E to learn robotics tasks efficiently, requiring only a handful of examples to solve a given task, and even to adapt to tasks it had not previously been trained to perform.

The success of knowledge transfer in PaLM-E demonstrates the potential for this strategy to be adopted in future AI models. With the ability to transfer knowledge across domains, AI models may no longer be limited to a narrow set of tasks within a specific field. This approach could pave the way for the development of more versatile and adaptable multi-purpose AI systems. While we are still far from achieving Artificial General Intelligence, or general-purpose AI, PaLM-E represents an important step towards expanding the boundaries of narrow AI.

See PaLM-E in action and read the blog: https://ai.googleblog.com/2023/03/palm-e-embodied-multimodal-language.html

Read the paper:
https://palm-e.github.io/assets/palm-e.pdf