TL;DR Google’s Gemini can use text, image, video, and audio as input. This makes many new applications possible, easier to implement, or more powerful. In this post, I will introduce the Large Multimodal Model (LMM) with Gemini.

Introduction

We have already seen many amazing applications of Large Language Models (LLMs), e.g. ChatGPT and GPT-3. But most of them only take text as input. Some advanced models, such as GPT-4o, can take text and images as input, but Gemini can take text, image, video, and audio as input. These capabilities open up many new possibilities.

I will cover the following topics in this post:

  • Image understanding
  • Video understanding
  • Audio understanding
  • Retrieval Augmented Generation (RAG) with multimodal data

Gemini version

Not all versions of Gemini support all of these features, so check the version you are using to make sure it supports what you need. For now (2024-10-27), Gemini 1.5 Pro is the only version that supports all of them.
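
To make the rest of the post concrete, here is a minimal sketch of selecting Gemini 1.5 Pro with the google-generativeai Python SDK; the API key placeholder is an assumption you need to replace with your own key.

```python
# Minimal sketch: pick the Gemini 1.5 Pro model with the google-generativeai SDK.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder: use your own API key

# Gemini 1.5 Pro is the version that supports text, image, video, and audio input.
model = genai.GenerativeModel("gemini-1.5-pro")

response = model.generate_content("Say hello in one sentence.")
print(response.text)
```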

Image understanding

Gemini can understand images, which is useful in many applications. For example, a chatbot can accept an image from the user, interpret it, and respond accordingly. Similar functionality exists in GPT-4o, but Gemini is still a competitive choice.

With Gemini, you can also send multiple images to the model in a single prompt. For example, a chatbot can accept several images at once, reason over all of them together, and give a single response.

In Gemini 1.5 Pro, up to 3,000 images can be sent to the model in one prompt. This is a huge number and covers most applications. If you need to send even more images, you should probably consider multimodal RAG, which is covered later in this post.

Best practice: the official documentation recommends placing the image before the text prompt if you have only one image.
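
To illustrate, here is a hedged sketch of a single-image prompt and a multi-image prompt with the google-generativeai Python SDK; the file names photo1.jpg and photo2.jpg are placeholders.

```python
# Sketch: single-image and multi-image prompts with Gemini 1.5 Pro.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

# Single image: place the image before the text prompt, as recommended.
img1 = Image.open("photo1.jpg")
response = model.generate_content([img1, "Describe what is happening in this photo."])
print(response.text)

# Multiple images in one prompt: the model reasons over all of them together.
img2 = Image.open("photo2.jpg")
response = model.generate_content(
    [img1, img2, "What differences do you see between these two photos?"]
)
print(response.text)
```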

Video understanding

Gemini can understand video, which is useful in many applications. There is an alternative approach: split the video into frames and send each frame to a Large Vision Language Model (LVLM) such as GPT-4o. But Gemini understands video directly, which is more efficient and more powerful. The frame-splitting approach is sensitive to how the video is split, while native video understanding is more robust.

In Gemini 1.5 Pro, users can send up to 10 videos to the model in one prompt. Unlike images, the amount of information in a video grows with its length, so there is a limit on video duration. Videos with audio and videos without audio also have different information density, so the limits differ: Gemini 1.5 Pro can understand up to about 50 minutes of video with audio, and up to about 60 minutes of video without audio.

Some notes on how Gemini 1.5 Pro captures video information: videos are sampled at 1 frame per second, and the audio track is processed at a low bitrate. If the video contains high-speed motion, the model may not capture it well.

Best practice: the official documentation recommends placing the video before the text prompt if you have only one video.
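
Below is a sketch of video understanding through the SDK's File API; meeting.mp4 is a placeholder file name, and the polling loop simply waits for the uploaded file to finish processing.

```python
# Sketch: upload a video with the File API, then ask Gemini 1.5 Pro about it.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

# Upload the video and wait until the File API has finished processing it.
video_file = genai.upload_file(path="meeting.mp4")
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Place the video before the text prompt, as recommended.
response = model.generate_content(
    [video_file, "Summarize this video in a few sentences."]
)
print(response.text)
```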

Audio understanding

Gemini can understand audio. For now (2024-10-27), Gemini understands speech rather than music or other kinds of audio. Audio understanding is very similar to video understanding, just with audio as the input, and it is likewise limited by length: Gemini 1.5 Pro can understand up to about 8.4 hours of audio.

Cool feature: Gemini 1.5 Pro can distinguish different speakers in the audio, which makes it more powerful in many applications. One of the most common use cases is meeting transcription: a meeting usually has multiple speakers, and the transcript needs to tell them apart, which Gemini 1.5 Pro does well. Another use case is as an ASR (Automatic Speech Recognition) system that outputs the speaker label, the start time, the end time, and the transcript of each utterance.
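
As an illustration, here is a sketch of a speaker-aware transcription prompt; meeting_audio.mp3 is a placeholder, and the exact output format depends on how you phrase the prompt rather than on a dedicated API.

```python
# Sketch: speech transcription with speaker labels and timestamps.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.5-pro")

# Upload the audio file through the File API.
audio_file = genai.upload_file(path="meeting_audio.mp3")

prompt = (
    "Transcribe this meeting. For each utterance, output the speaker label, "
    "the start time, the end time, and the transcript, one utterance per line."
)
# Place the audio before the text prompt.
response = model.generate_content([audio_file, prompt])
print(response.text)
```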

Retrieval Augmented Generation (RAG) with multimodal data

Traditional RAG only works with text-based data such as web pages, PDF files, and Word documents. With the rapid development of multimodal models, many new applications are beginning to use multimodal data. Like traditional LLMs, LMMs are limited by input length, and when dealing with a lot of input or reference data, that limit becomes a real problem. LMMs therefore still need RAG to overcome it.

Like text-based RAG, multimodal RAG is hard to implement, and the hard part is again retrieval: finding the most relevant data in the reference corpus. In text-based RAG, retrieval means finding the most relevant text among the reference texts. In multimodal RAG, it means finding the most relevant items among reference data that mixes text, images, video, audio, and so on, and having to handle all of these types together is what makes it harder.

Chunking in multimodal RAG is also more complex than in text-based RAG. In text-based RAG, chunking simply splits the text into pieces; in multimodal RAG, you have to split a mixture of modalities into chunks, which is trickier. One simple scheme is shown in the sketch below.
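
Purely as an illustration, here is a hypothetical chunking scheme: text is split into fixed-size passages, while each image or media segment becomes its own chunk. The Chunk class and helper functions are assumptions, not part of any library.

```python
# Hypothetical multimodal chunking: one scheme among many, for illustration only.
from dataclasses import dataclass

@dataclass
class Chunk:
    kind: str      # "text", "image", "video", or "audio"
    content: str   # raw text for text chunks, a path/URI for media chunks
    source: str    # identifier of the document the chunk came from

def chunk_text(doc_id: str, text: str, size: int = 500) -> list[Chunk]:
    # Naive fixed-size splitting; real systems often split on sentences or sections.
    return [
        Chunk(kind="text", content=text[i:i + size], source=doc_id)
        for i in range(0, len(text), size)
    ]

def chunk_media(doc_id: str, paths: list[str], kind: str) -> list[Chunk]:
    # Each media file (or pre-cut segment) is treated as one chunk.
    return [Chunk(kind=kind, content=path, source=doc_id) for path in paths]
```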

Another hard part of multimodal RAG is vectorization: converting the data into vectors so the retrieval engine can recall and rank it. Multimodal vectorization methods and models are not as well studied or as mature as their text-based counterparts, and we are still searching for more efficient and powerful ones.
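
To show where the vectors fit in, here is a sketch of retrieval over multimodal chunks; it assumes a hypothetical embed(chunk) function that maps every modality into a shared vector space (for example, a multimodal embedding model) and ranks chunks by cosine similarity.

```python
# Sketch: rank multimodal chunks against a query vector with cosine similarity.
# embed() is a hypothetical placeholder for a multimodal embedding model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, chunks, embed, top_k: int = 5):
    # Score every chunk against the query and return the best matches.
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

The retrieved chunks can then be passed to Gemini together with the user's question, just like text chunks in a traditional RAG pipeline.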

Useful materials