The human experience of reality unfolds across multiple sensory channels, creating a rich tapestry of understanding that no single mode can capture alone. As AI systems evolve beyond text-only interactions, the integration of vision, sound, and other modalities mirrors our own multisensory world, promising deeper and more nuanced machine comprehension that enhances rather than diminishes each individual channel's power.
Photo Credit: Rob Grzywinski
Originally posted on March 12, 2023 on LinkedIn. Edited from the original version.
The Human Experience
We experience the world through our senses: we see, hear, touch, taste and smell. No single modality is sufficient to represent our environment. Our senses work together to allow us to interpret the world around us, and without one of them our understanding can be incomplete. If you have a head cold, you may notice that food tastes bland. Taste and smell are best buds (pun intended!), working together to provide us with the full experience.
Beyond Text
With all of the talk about Large Language Models (LLMs) and text, you may be wondering about all of the other modalities. Can the models that enabled LLMs work with other modalities too? The answer is an emphatic YES! The same architectures can be used to process audio, images, video, robot control signals, genomic data and more! What's more, these different modalities can be combined into a single stream and processed by the model. Image and text data can be combined so that the model understands the context of an image from its text description, or audio and text so that it understands the sentiment of a sound clip from the accompanying transcript.
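To make that "single stream" idea concrete, here is a minimal sketch of how text tokens and image patches can be projected into one shared embedding space and run through a single transformer as one sequence. Everything in it (the layer sizes, patch shape, and vocabulary) is an assumption for illustration, not any particular model's architecture.

```python
# Illustrative sketch only: combine text tokens and image patches into one sequence.
import torch
import torch.nn as nn

d_model = 512                    # shared embedding width (assumed)
vocab_size = 32_000              # text vocabulary size (assumed)
patch_dim = 16 * 16 * 3          # a flattened 16x16 RGB image patch (assumed)

text_embed = nn.Embedding(vocab_size, d_model)   # text tokens -> vectors
patch_embed = nn.Linear(patch_dim, d_model)      # image patches -> vectors

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# One caption (random token ids) and one image cut into 64 flattened patches.
token_ids = torch.randint(0, vocab_size, (1, 12))
patches = torch.randn(1, 64, patch_dim)

# Embed each modality, then concatenate into a single stream for the same model.
sequence = torch.cat([text_embed(token_ids), patch_embed(patches)], dim=1)
output = encoder(sequence)
print(output.shape)              # torch.Size([1, 76, 512]) -- text and image together
```

The same attention layers attend over both the caption and the image patches, which is why the text can provide context for the image and vice versa.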
The Data Hunger Games
LLMs are data hungry beasts. The amount of data that an LLM needs for training depends on the size of the model. GPT-3, for example, has 175 billion parameters that need to be learned from the training data. When GPT-3 was trained back in 2020, it was thought that around 2 tokens per parameter were necessary to optimally train the model. In 2022, the folks at DeepMind discovered that a better ratio was about twenty tokens per parameter -- ten times as much! That means if GPT-3 were retrained today, it would be trained on around 3.5 trillion tokens! On average there are about 1.4 tokens per word, so this is equivalent to around 5TB of text. LLMs are hungrier than we imagined!
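If you want to sanity-check that arithmetic, here it is spelled out, using only the rough figures quoted above:

```python
# Back-of-the-envelope math using the rough estimates from the paragraph above.
parameters = 175e9            # GPT-3 parameter count
tokens_per_parameter = 20     # the DeepMind (Chinchilla) ratio
tokens_per_word = 1.4         # rough average for English text

tokens = parameters * tokens_per_parameter   # 3.5e12 -> ~3.5 trillion tokens
words = tokens / tokens_per_word             # ~2.5 trillion words
print(f"{tokens:.2e} tokens, {words:.2e} words")
```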
Better Together
Fortunately, multimodal architectures can help alleviate this problem. Instead of training the LLM on 5TB of text alone, you can combine it with audio, images, and other modalities to give it more data to learn from. Even a few gigabytes of data from other modalities can make a huge difference in the performance of the model.

What boggles my mind is that all of these different modalities actually reinforce and enhance each other. From a human perspective, it may be a big fat "Well duh!" Of course combining text with vision increases the understanding of the whole. But it's not obvious that a model of the same size trained on both text and images would keep all of its language ability rather than trading some of it away to handle the images. Yet it does keep it! By combining the two modalities, the model gains a deeper understanding of both.
A Glimpse of Tomorrow
Let me tickle your brain for a moment. Remember OCR (optical character recognition)? That's table stakes for multimodal models. If you read my post about Mental Models, you may have been thinking that sometimes a picture is worth a thousand words. Soon you're going to be able to intersperse pictures and drawings in with your ChatGPT discussion. If you've been tickled, then take a look at the images in the Microsoft paper "Language Is Not All You Need: Aligning Perception with Language Models". 😳

Quick aside: multimodal generation is a bit further off. You're not going to be getting ChatGPT answering you with pictures in 2023. If you've ever used an image generator such as DALL-E or Midjourney, then you know that rendering legible text is not something it excels at. This capability will exist -- papers were published just last week (as of this posting in early March 2023) that exhibit it. Just don't expect it in 2023.

Multimodal models are going to be all the rage starting in 2023. We are just beginning to scratch the surface of what these architectures are capable of. By combining the different modalities, we can create richer and more powerful models that can understand our environments in ways that single-modality models cannot. I'm excited to see what the future holds!

(859 tokens)