In a world where unstructured data (text, images, video, and beyond) is growing exponentially, the ability to connect and retrieve meaning across different modalities is becoming essential. Enter multimodal embeddings, a powerful advancement that lets systems understand and relate information across formats. At a recent webinar, Stefan Webb, Developer Advocate for Milvus (an open-source vector database), walked a global audience through the what, why, and how of building multimodal RAG systems. Here’s what you need to know.
What Are Multimodal Embeddings?
Think of embeddings as a way to translate data into a shared language. Whether it’s a paragraph of text or a snapshot of a street corner, multimodal embeddings convert this information into numerical vectors that live in the same space. This makes it possible for, say, a line of text to be compared with an image, and for systems to understand that “a woman walking a dog” in text is semantically related to a corresponding image.
By mapping content to a high-dimensional space, related pieces cluster together. This enables true cross-modal search: text-to-image, image-to-text, even audio-to-video.
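For a concrete taste, here is a minimal sketch using OpenAI’s CLIP through the Hugging Face transformers library (the model name is real; the image paths are placeholders). It embeds one text query and a few candidate images into the same space, then ranks the images by cosine similarity:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A pretrained multimodal embedding model: one encoder for text, one for images,
# both projecting into the same vector space.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

text = ["a woman walking a dog"]
images = [Image.open(p) for p in ["park.jpg", "street_corner.jpg", "beach.jpg"]]  # placeholders

with torch.no_grad():
    text_emb = model.get_text_features(**processor(text=text, return_tensors="pt", padding=True))
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))

# Normalize so the dot product is cosine similarity, then rank images for the text query.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
print(scores.tolist())  # higher score = closer in the shared embedding space
```

In a production system those image vectors would live in a vector database such as Milvus rather than in memory.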
Why Multimodal Retrieval Matters
Modern use cases are everywhere:
- Cross-modal search: Ask a system for “images of the Iberian lynx with audio,” and get relevant visuals and sounds—retrieved efficiently from a vector database.
- Foundation models: Tools like ChatGPT are evolving to handle inputs from various sources (text, images, user behavior) and deliver outputs across these same formats.
- Multimodal RAG (Retrieval-Augmented Generation): Here’s where it gets powerful. Imagine asking for a graph showing your company’s revenue from 2021–2023. Instead of digging manually, a multimodal RAG system pulls images, tables, and docs to assist in generating a complete response automatically.
Training the Models: Data and Design
Creating these embeddings requires datasets where each modality is paired—for example, an image and its description. The internet provides a goldmine for this, especially through attributes like alt text.
The training process typically uses contrastive learning (see the loss sketch after this list):
- Similar (paired) data is brought closer together in the embedding space.
- Dissimilar data is pushed apart. This allows models to differentiate meaning and context more precisely.
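As a rough illustration rather than any model’s exact training code, a CLIP-style symmetric contrastive loss takes only a few lines of PyTorch: matched image–text pairs sit on the diagonal of a similarity matrix and are pulled together, while every other combination is pushed apart.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = cosine similarity between image i and text j, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature

    # For row i, the matching pair is column i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # images -> matching texts
    loss_t2i = F.cross_entropy(logits.T, targets)  # texts -> matching images
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for encoder outputs in this sketch.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```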
Case in point: CLIP (Contrastive Language–Image Pretraining) by OpenAI. It learns image–text relationships from massive web-scale datasets of paired images and captions, and it can even perform zero-shot classification, labeling images with categories it was never explicitly trained on by comparing them against text prompts.
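In practice, zero-shot classification boils down to embedding the image once, embedding one text prompt per candidate label, and picking the closest label. A minimal sketch with the Hugging Face transformers wrapper (the labels and image path are placeholders):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a dog", "a photo of a cat", "a photo of an Iberian lynx"]
inputs = processor(text=labels, images=Image.open("animal.jpg"),
                   return_tensors="pt", padding=True)

# The model scores the image against every label prompt; softmax turns scores into probabilities.
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```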
Instruction-Following Models: A New Frontier
A newer class of models goes one step further: they can follow instructions tied to an image. Think: “front view of this lion” when given a side-view image.
These systems (e.g., MagicLens by Google DeepMind) are trained on triplets: a source image, an instruction, and the correct target image. By mining image pairs that appear near each other on web pages and generating synthetic instructions with LLMs, these models demonstrate significant gains in retrieval performance, especially for abstract or complex instructions.
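MagicLens ships its own model code, so the snippet below is only an abstract sketch of the idea: training data is a (source image, instruction, target image) triplet, and at query time a fused image-plus-instruction embedding is matched against candidate images. Both encoder callables are hypothetical stand-ins, not the MagicLens API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class Triplet:
    source_image: str   # e.g. a side view of a lion
    instruction: str    # e.g. "front view of this lion"
    target_image: str   # the image the model should learn to retrieve

def composed_search(encode_query: Callable[[str, str], np.ndarray],
                    encode_image: Callable[[str], np.ndarray],
                    query_image: str, instruction: str,
                    candidates: Sequence[str]) -> str:
    """Rank candidate images against a fused (image + instruction) query embedding."""
    q = encode_query(query_image, instruction)   # one vector for image + instruction
    q = q / np.linalg.norm(q)
    best, best_score = candidates[0], float("-inf")
    for c in candidates:
        v = encode_image(c)
        score = float(q @ (v / np.linalg.norm(v)))
        if score > best_score:
            best, best_score = c, score
    return best                                  # best-matching target image
```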
Building Your Own Multimodal Search with Milvus
At the webinar, attendees were treated to a practical demo of how to build a multimodal RAG system using:
- Milvus: The industry’s most widely used open-source vector database.
- Visual BGE: An embedding model that processes both images and text.
- Open-source tools: PyTorch, Hugging Face, and PyMilvus.
Demo steps included (see the code sketch after this list):
- Indexing images from a subset of Amazon Reviews.
- Creating a collection in Milvus with image embeddings.
- Performing a search using a hybrid text + image query like “phone case with this leopard image.”
- Visualizing results and using a Large Language Vision Model (LLVM) to rerank for better accuracy.
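The indexing and search steps above boil down to a handful of PyMilvus calls. The sketch below assumes a local Milvus Lite database file and uses a placeholder embed() function standing in for Visual BGE; it is illustrative, not the exact demo notebook.

```python
import random
from pymilvus import MilvusClient

def embed(image: str | None = None, text: str | None = None) -> list[float]:
    """Placeholder for the Visual BGE encoder, which maps an image, a text, or an
    image+text pair to a single vector. Random values keep the sketch runnable."""
    random.seed(hash((image, text)))
    return [random.random() for _ in range(768)]  # 768 dims assumed; match your real model

client = MilvusClient("multimodal_demo.db")  # Milvus Lite: a local, file-backed instance
client.create_collection(collection_name="amazon_reviews", dimension=768)

# Index product images: store each embedding alongside its file path.
image_paths = ["images/case_001.jpg", "images/case_002.jpg"]  # placeholder paths
client.insert(
    collection_name="amazon_reviews",
    data=[{"id": i, "vector": embed(image=p), "image_path": p} for i, p in enumerate(image_paths)],
)

# Hybrid query: a leopard-print photo plus the text "phone case with this leopard image".
query_vec = embed(image="leopard.jpg", text="phone case with this leopard image")
hits = client.search(
    collection_name="amazon_reviews",
    data=[query_vec],
    limit=9,
    output_fields=["image_path"],
)[0]
for hit in hits:
    print(hit["distance"], hit["entity"]["image_path"])
```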
The reranking process (sketched below) involved:
- Captioning the query image.
- Constructing a prompt that combined the caption and the original query.
- Having the model rank the results and explain why the top match was selected.
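A rough sketch of that prompt construction, where caption_image() and rank_with_vlm() are hypothetical wrappers around whichever vision-language model you have access to; the point is the shape of the prompt, not a specific API.

```python
def build_rerank_prompt(caption: str, query_text: str, hits: list[dict]) -> str:
    """Combine the query-image caption and the original text query into one ranking prompt."""
    candidates = "\n".join(
        f"{i + 1}. {hit['entity']['image_path']}" for i, hit in enumerate(hits)
    )
    return (
        f"A user is searching for: '{query_text}'.\n"
        f"Their reference image shows: {caption}.\n"
        f"Candidate results:\n{candidates}\n"
        "Rank the candidates from best to worst match and explain why the top result fits."
    )

# caption = caption_image("leopard.jpg")                      # hypothetical captioning call
# prompt = build_rerank_prompt(caption, "phone case with this leopard image", hits)
# answer = rank_with_vlm(prompt, images=[h["entity"]["image_path"] for h in hits])  # hypothetical
```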
The result? A sharper, more relevant output aligned with the user’s intent.
Wrapping Up: What’s Next?
Multimodal search is no longer theoretical—it’s here, and it’s scalable. Thanks to tools like Milvus, anyone can start experimenting with systems that blend images, text, and more to surface meaning, improve retrieval, and supercharge user experience.
Ready to build your own system?
- Explore the resources at milvus.io
- Join the Milvus community on Discord to share ideas and get support
- Follow Milvus on LinkedIn and Twitter for the latest updates
The future is multimodal. And it starts with a vector.