The human experience of reality unfolds across multiple sensory channels, creating a rich tapestry of understanding that no single mode can capture alone. As AI systems evolve beyond text-only interactions, the integration of vision, sound, and other modalities mirrors our own multisensory world, promising deeper and more nuanced machine comprehension that enhances rather than diminishes each individual channel's power.
Photo Credit: Rob Grzywinski
Originally posted on March 12, 2023 on LinkedIn. Edited from the original version.
The Human Experience
We experience the world through our senses: we see, hear, touch, taste and smell. No single modality is sufficient to represent our environment. Our senses work together to allow us to interpret the world around us, and without one of them our understanding can be incomplete. If you have a head cold, you may notice that food tastes bland. Taste and smell are best buds (pun intended!), working together to provide us with the full experience.
Beyond Text
With all of the talk about Large Language Models (LLMs) and text, you may be wondering about all of the other modalities. Can the models that enabled LLMs work with other modalities too? The answer is an emphatic YES! The same architectures can be used to process audio, images, video, robot control signals, genomic data and more! What's more, these different modalities can be combined into a single stream and processed by the model. Image and text data can be combined so that the model understands the context of an image from its text description, or audio and text so that it understands the sentiment of a sound clip from the accompanying transcript.
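To make that "single stream" idea concrete, here is a minimal sketch of how text tokens and image patches can be projected into one shared embedding space and run through a single transformer as one sequence. Everything in it (the layer sizes, patch shape, and vocabulary) is an assumption for illustration, not any particular model's architecture.

```python
# Illustrative sketch only: combine text tokens and image patches into one sequence.
import torch
import torch.nn as nn

d_model = 512                    # shared embedding width (assumed)
vocab_size = 32_000              # text vocabulary size (assumed)
patch_dim = 16 * 16 * 3          # a flattened 16x16 RGB image patch (assumed)

text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> vectors
patch_embed = nn.Linear(patch_dim, d_model)      # image patches -> vectors

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# One caption (random token ids) and one image cut into 64 flattened patches.
token_ids = torch.randint(0, vocab_size, (1, 12))
patches = torch.randn(1, 64, patch_dim)

# Embed each modality, then concatenate into a single stream for the same model.
sequence = torch.cat([text_embed(token_ids), patch_embed(patches)], dim=1)
output = encoder(sequence)
print(output.shape)              # torch.Size([1, 76, 512]) -- text and image together
```

The same attention layers attend over both the caption and the image patches, which is why the text can provide context for the image and vice versa.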
The Data Hunger Games
LLMs are data hungry beasts. The amount of data that an LLM needs for training depends on the size of the model. GPT-3, for example, has 175 billion parameters that need to be learned from the training data. When GPT-3 was trained back in 2020, it was thought that around 2 tokens per parameter were necessary to optimally train the model. In 2022, the folks at DeepMind discovered that a better ratio was about twenty tokens per parameter -- ten times as much! That means if GPT-3 were retrained today, it would be trained on around 3.5 trillion tokens! On average there are about 1.4 tokens per word, so this is equivalent to around 5TB of text. LLMs are hungrier than we imagined!
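If you want to sanity-check that arithmetic, here it is spelled out, using only the rough figures quoted above:

```python
# Back-of-the-envelope math using the rough estimates from the paragraph above.
parameters = 175e9            # GPT-3 parameter count
tokens_per_parameter = 20     # the DeepMind (Chinchilla) ratio
tokens_per_word = 1.4         # rough average for English text

tokens = parameters * tokens_per_parameter   # 3.5e12 -> ~3.5 trillion tokens
words = tokens / tokens_per_word             # ~2.5 trillion words
print(f"{tokens:.2e} tokens, {words:.2e} words")
```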
Better Together
Fortunately, multimodal architectures can help alleviate this problem. Instead of training the LLM on 5TB of text alone, you can combine it with audio, images, and other modalities to give it more data to learn from. Even a few gigabytes of data from other modalities can make a huge difference in the performance of the model.

What boggles my mind is that all of these different modalities actually reinforce and enhance each other. From a human perspective, it may be a big fat "Well duh!" Of course combining text with vision increases the understanding of the whole. But it's not obvious that a model of the same size trained on both text and images would keep all of its language ability rather than trading some of it away to handle the images. Yet it does keep it! By combining the two modalities, the model gains a deeper understanding of both.
A Glimpse of Tomorrow
Let me tickle your brain for a moment. Remember OCR (optical character recognition)? That's table stakes for multimodal models. If you read my post about Mental Models, you may have been thinking that sometimes a picture is worth a thousand words. Soon you're going to be able to intersperse pictures and drawings in with your ChatGPT discussion. If you've been tickled, then take a look at the images in the Microsoft paper "Language Is Not All You Need: Aligning Perception with Language Models". 😳

Quick aside: multimodal generation is a bit further off. You're not going to be getting ChatGPT answering you with pictures in 2023. If you've ever used an image generator such as DALL-E or Midjourney, then you know that rendering legible text is not something it excels at. This capability will exist -- papers were published just last week (as of this posting in early March 2023) that exhibit it. Just don't expect it in 2023.

Multimodal models are going to be all the rage starting in 2023. We are just beginning to scratch the surface of what these architectures are capable of. By combining the different modalities, we can create richer and more powerful models that can understand our environments in ways that single-modality models cannot. I'm excited to see what the future holds!

(859 tokens)