Google releases PaLM-E, the largest vision-language model to date
Google's PaLM-E has 562 billion parameters (compared with the 175 billion of ChatGPT's underlying model), combining the PaLM-540B language model with the ViT-22B vision model. By incorporating continuous real-world sensor modalities directly into the language model, it "gives eyes to AI," establishing connections between words and perception.
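To make the idea of "incorporating continuous sensor modalities directly into the language model" concrete, here is a minimal sketch of the general technique: visual features from an image encoder are projected into the same embedding space as text tokens and interleaved with them. This is an illustration only, not PaLM-E's actual code; all dimensions and variable names are hypothetical.

```python
# Minimal sketch (assumption, not the actual PaLM-E implementation):
# image patch features from a vision encoder are linearly projected into
# the language model's embedding space and concatenated with text token
# embeddings, so the LM processes them like any other input sequence.
import numpy as np

rng = np.random.default_rng(0)

d_vision = 1024    # hypothetical vision-encoder feature dimension
d_model = 4096     # hypothetical LM embedding dimension
n_patches = 16     # hypothetical number of image patch features
n_text = 8         # hypothetical number of text tokens in the prompt

# Stand-ins for real encoder outputs and the LM's token embeddings.
image_patch_features = rng.standard_normal((n_patches, d_vision))
text_token_embeddings = rng.standard_normal((n_text, d_model))

# A learned projection that maps vision features into the LM embedding space.
projection = rng.standard_normal((d_vision, d_model)) / np.sqrt(d_vision)
image_token_embeddings = image_patch_features @ projection

# The multimodal prompt: image "tokens" interleaved with text tokens.
multimodal_sequence = np.concatenate(
    [image_token_embeddings, text_token_embeddings], axis=0
)
print(multimodal_sequence.shape)  # (n_patches + n_text, d_model)
```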
PaLM-E takes raw image data directly from the robot's camera and performs action planning and execution according to natural language instructions, avoiding the need for manual preprocessing or data labeling and allowing these tasks to be learned end to end.
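The planning-and-execution loop described above can be pictured roughly as follows: the model sees the current camera observation and the instruction, emits the next high-level step as text, and a low-level skill carries it out. This is a hedged toy sketch under my own assumptions; the function names, canned plan, and interfaces are hypothetical, not PaLM-E's actual API.

```python
# Toy illustration (hypothetical) of an instruction-following planning loop:
# observe -> ask the multimodal model for the next step -> execute -> repeat.
from typing import List


def plan_next_step(image_tokens: List[str], instruction: str, history: List[str]) -> str:
    """Placeholder for a multimodal language-model call (hypothetical)."""
    # A real system would run a forward pass of the vision-language model;
    # here a canned plan keeps the example self-contained and runnable.
    canned_plan = ["go to the drawer", "open the drawer", "pick up the snack", "done"]
    return canned_plan[len(history)] if len(history) < len(canned_plan) else "done"


def execute_skill(step: str) -> None:
    """Placeholder for the robot's low-level control policy (hypothetical)."""
    print(f"executing: {step}")


instruction = "bring me the snack from the drawer"
history: List[str] = []
while True:
    image_tokens = ["<img>"]  # stand-in for a fresh camera observation
    step = plan_next_step(image_tokens, instruction, history)
    if step == "done":
        break
    execute_skill(step)
    history.append(step)
```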
The research team also found:
1. The larger the language model, the better it maintains its language ability during training on vision-language and robot tasks; at 562 billion parameters, PaLM-E retains nearly all of its language capabilities.
2. Positive transfer from "generalist AI": when PaLM-E is trained on multiple task domains simultaneously, its performance on individual tasks improves significantly compared with "specialist" models trained on a single task.
3. Beyond notable progress in human-computer interaction, the team found that PaLM-E exhibits emergent capabilities such as multimodal chain-of-thought reasoning and multi-image reasoning, and it achieves a new state of the art (SOTA) on the OK-VQA visual question answering benchmark.