InternVL is a large-scale vision-language foundation model developed to integrate visual and linguistic understanding in AI systems. By scaling its vision encoder up to 6 billion parameters and aligning it with large language models (LLMs), InternVL bridges the gap between visual perception and language comprehension.
One of the key features of InternVL is its robust vision encoder, known as InternViT-6B, which has been trained on extensive image-text datasets sourced from the web. This training enables the model to perform a wide array of tasks, such as image and video classification, image-text and video-text retrieval, and, when connected to LLMs, multimodal dialogue. The model’s versatility is evident in its strong performance across 32 visual-linguistic benchmarks, making it a valuable alternative to other large-scale vision models like ViT-22B.
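To illustrate how a shared image-text embedding space supports retrieval, the sketch below ranks candidate captions for an image by cosine similarity between their embeddings. This is a generic illustration, not InternVL's released API; the embeddings here are random placeholders standing in for real encoder outputs.

```python
import torch
import torch.nn.functional as F


def retrieve_best_caption(image_embedding: torch.Tensor,
                          text_embeddings: torch.Tensor) -> int:
    """Return the index of the caption whose embedding is closest
    to the image embedding under cosine similarity.

    image_embedding: (d,) vector from a vision encoder (placeholder here).
    text_embeddings: (n, d) matrix of caption embeddings (placeholder here).
    """
    image_embedding = F.normalize(image_embedding, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    similarities = text_embeddings @ image_embedding  # (n,) cosine scores
    return int(similarities.argmax().item())


# Toy usage with random vectors standing in for real image/text features.
torch.manual_seed(0)
img = torch.randn(768)
captions = torch.randn(5, 768)
print("best caption index:", retrieve_best_caption(img, captions))
```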
To facilitate its application and further development, InternVL has been made open-source, with code and models accessible on platforms like GitHub and Hugging Face. This openness encourages collaboration and innovation within the AI community, fostering advancements in multimodal AI systems.
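As a starting point, a minimal sketch of loading the vision encoder through the Hugging Face transformers library might look like the following. The repository ID `OpenGVLab/InternViT-6B-224px`, the use of `CLIPImageProcessor`, and the `trust_remote_code=True` flag are assumptions based on typical Hugging Face model cards; consult the official GitHub and Hugging Face pages for the exact model names and recommended settings.

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Assumed repository ID; verify against the official model card.
repo_id = "OpenGVLab/InternViT-6B-224px"

model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # checkpoint ships custom modeling code
).eval()

processor = CLIPImageProcessor.from_pretrained(repo_id)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16)

with torch.no_grad():
    outputs = model(pixel_values)  # image features from InternViT-6B
```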
For more information on this foundation model, see the paper InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks.