Multimodal learning

Multimodal learning, in the context of machine learning, is a type of deep learning that integrates and processes multiple types of data, known as modalities, such as text, audio, or images.

In contrast, unimodal models process only one type of data, such as text (typically represented as feature vectors) or images. Multimodal learning differs from simply combining independently trained unimodal models: it fuses information from the different modalities within a single model in order to make better predictions.[1]
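
As a minimal sketch of this fusion idea, the PyTorch snippet below combines pre-extracted image and text feature vectors by projecting each modality into a shared space and concatenating the results before a joint prediction head. The feature dimensions, hidden size, and class count here are arbitrary placeholders, and real multimodal systems typically use learned encoders (e.g., vision and language transformers) and more sophisticated fusion than concatenation.

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        """Toy multimodal classifier: fuses image and text feature
        vectors by concatenation before a shared prediction head.
        All dimensions below are illustrative assumptions."""

        def __init__(self, image_dim=512, text_dim=768,
                     hidden_dim=256, num_classes=10):
            super().__init__()
            # Per-modality projections into a shared hidden space
            self.image_proj = nn.Linear(image_dim, hidden_dim)
            self.text_proj = nn.Linear(text_dim, hidden_dim)
            # Joint head operates on the concatenated (fused) features
            self.head = nn.Sequential(
                nn.ReLU(),
                nn.Linear(2 * hidden_dim, num_classes),
            )

        def forward(self, image_feats, text_feats):
            # Fuse the two modalities by concatenating their projections
            fused = torch.cat([self.image_proj(image_feats),
                               self.text_proj(text_feats)], dim=-1)
            return self.head(fused)

    # Usage: a batch of 4 samples with hypothetical feature sizes
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 512), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 10])

Because the prediction head sees both projections at once, gradients flow into both modalities jointly, which is what distinguishes this from averaging the outputs of two separately trained unimodal models.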

Large multimodal models, such as Google Gemini and GPT-4o, have become increasingly popular since 2023, enabling increased versatility and a broader understanding of real-world phenomena.[2]

  1. ^ Rosidi, Nate (March 27, 2023). "Multimodal Models Explained". KDnuggets. Retrieved 2024-06-01.
  2. ^ Zia, Tehseen (January 8, 2024). "Unveiling of Large Multimodal Models: Shaping the Landscape of Language Models in 2024". Unite.ai. Retrieved 2024-06-01.
