Microsoft has unveiled two new additions to its Phi-4 family of small language models: Phi-4-multimodal, which integrates speech, vision, and text, and Phi-4-mini. In December 2024, Microsoft ...
Related work is appearing almost daily. More recent works can be found on the GitHub page (https://github.com/BradyFU/Awesome-Multimodal-Large ...
Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.
Overview: Multimodal AI is changing how machines process information by combining text, images, audio, video, and sensor ...
Google Gemma 4 12B, released June 3, is an open-weight multimodal model that processes text, images, audio, and video in a ...