Microsoft Phi-3 Vision: A New Multimodal Eye-based AI Model Transforming AI

By Zernab Farooqi - May 25, 2024

Phi-3-vision, a 4.2-billion-parameter model, can answer questions about images and charts.

Microsoft has released a new version of the Phi-3 language model, which can analyze pictures and offer insights into their contents.

With its latest AI offerings, the Phi-3 family of models, Microsoft has exceeded expectations. These small yet powerful models, which made their debut at the Microsoft Build 2024 conference, are designed to deliver strong AI performance across a variety of applications.

The Phi-3 family of models includes:

  • the bite-sized Phi-3-mini
  • the slightly larger Phi-3-small
  • the midrange Phi-3-medium
  • the innovative Phi-3-vision

These models are optimized for real-world applications, providing strong reasoning capabilities and very fast responses with minimal processing demands.

The Phi-3 models are trained on high-quality datasets such as filtered public websites, instructional content, and synthetic data, which ensures they perform well on mathematics, coding, reasoning, and language-understanding tasks. Phi-3-vision is the multimodal model of the family, standing out for its visual and language processing skills: it supports a 128K-token context length and performs well on tasks such as OCR and chart interpretation. Developed in accordance with Microsoft's Responsible AI guidelines, the Phi-3 family gives developers a strong, secure, and adaptable toolkit for building AI applications.

Phi-3-vision: Parameters, Multimodal Capabilities, Applications

Within the Phi-3 family, Phi-3-vision is the multimodal model, with 4.2 billion parameters and a 128K-token context length. Because it combines language and visual capabilities, it suits applications that process both text and images, and it excels at OCR, general image understanding, and chart and table interpretation. Training on high-quality datasets, including synthetic data and publicly available documents, gives it robust performance across a range of multimodal scenarios.
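To make the text-plus-image workflow concrete, here is a minimal sketch of how a Phi-3-vision prompt is composed. The `<|image_1|>` placeholder and the `<|user|>`/`<|assistant|>` chat markers follow the format published on the model card for `microsoft/Phi-3-vision-128k-instruct`; treat the exact template as an assumption, and in real code prefer the Hugging Face processor's own chat-template handling.

```python
# Sketch of the Phi-3-vision single-turn chat prompt format (assumed from
# the public model card; verify against the processor's chat template).

def build_prompt(question: str, num_images: int = 1) -> str:
    """Compose a single-turn prompt that references one or more images."""
    # Each image gets a numbered placeholder; the processor later replaces
    # it with the encoded image features.
    placeholders = "\n".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>\n{placeholders}\n{question}<|end|>\n<|assistant|>\n"

prompt = build_prompt("What is happening on this city street?")
print(prompt)
```

In practice you would pass this prompt and the raw image to the model's `AutoProcessor` (loaded with `trust_remote_code=True`) and generate from the resulting tensors.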

A Glimpse into Phi-3 Vision’s Capabilities

Think about an AI that can "see" and comprehend the pictures all around you in addition to understanding the words you speak. That is what Phi-3 Vision promises to deliver. In practice, this model can:

Describe visuals:

When you ask Phi-3 Vision to describe a photo of a busy city street, it can tell you what's going on, who's in it, and even the overall atmosphere.

Extract text from images:

Need to convert a scanned paper into a digital file? Phi-3 Vision can extract the text from an image, making it easy to edit and share.

Analyze diagrams and charts:

When it comes to data presented in charts and diagrams, Phi-3 Vision can analyze them, extract key insights, and answer your questions.

For instance, suppose you show Phi-3 Vision a graph of the sales growth of two of a company's products and ask about the trend of the blue product. It would analyze the chart and report that "The blue product's trend tends to be upward with some fluctuations."


Under the Hood: A Powerful Architecture

The impressive capabilities of Phi-3 Vision rest on a carefully crafted architecture that pairs dedicated image-processing modules with the strengths of the Phi-3 Mini language model:

Image Encoder:

This module converts an input image into a numerical representation that captures the key elements of the image.


Alignment Module:

By lining up the image representation with the internal representation of the language model, this module bridges the image encoder and the language model.

Image Projector:

This module projects the aligned image representation into the language model's vocabulary space, allowing the model to process and generate text based on both textual and visual information.

Phi-3 Mini Language Model:

The Phi-3 Mini language model is the fundamental component of Phi-3 Vision. This small but mighty model, trained on an enormous dataset of 3.3 trillion tokens, serves as the basis for both text comprehension and generation.

All of these parts work together smoothly, enabling Phi-3 Vision to "see" and comprehend images, analyze them in a larger context, and produce meaningful text based on its observations.
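The encoder → projector → language-model flow described above can be sketched with toy numpy stand-ins. Every dimension and module body below is an illustrative assumption, not the real Phi-3 Vision architecture; the point is only the data flow: patch features come out of the encoder, get projected into the language model's embedding space, and are concatenated with text-token embeddings into one sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 64   # toy hidden size of the language model (assumption)
PATCHES = 16  # toy number of image patches (assumption)

class ImageEncoder:
    """Stand-in for the vision backbone: turns pixels into patch features."""
    def __init__(self, feat_dim: int = 32):
        self.feat_dim = feat_dim

    def encode(self, image: np.ndarray) -> np.ndarray:
        # A real encoder (e.g. a vision transformer) attends over patches;
        # here we just slice the flattened image into PATCHES feature rows.
        flat = image.reshape(PATCHES, -1)
        return flat[:, : self.feat_dim]

class ImageProjector:
    """Linear map aligning image features with the LM's embedding space."""
    def __init__(self, feat_dim: int = 32, hidden: int = HIDDEN):
        self.W = rng.standard_normal((feat_dim, hidden)) * 0.02

    def project(self, feats: np.ndarray) -> np.ndarray:
        return feats @ self.W

# Project image patches, then concatenate them with text-token embeddings
# so a single transformer can attend over both modalities at once.
image = rng.standard_normal((64, 64))            # fake 64x64 grayscale image
text_embeds = rng.standard_normal((10, HIDDEN))  # 10 fake text tokens

img_embeds = ImageProjector().project(ImageEncoder().encode(image))
sequence = np.concatenate([img_embeds, text_embeds], axis=0)
print(sequence.shape)  # (26, 64): 16 image patches + 10 text tokens
```

The key design choice this illustrates is that, after projection, image patches are just extra tokens in the sequence, so the language model needs no architectural change to reason over them.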

Performance: Phi-3 Vision’s Impressive Results

Phi-3 Vision has been thoroughly evaluated against other leading multimodal models on a range of benchmarks. The results show that, despite its modest size, Phi-3 Vision consistently delivers strong performance, frequently surpassing larger and more computationally demanding models.

Even though Phi-3 Vision doesn't always achieve the highest scores, it consistently performs near the top of its class, which is remarkable for such a compact model. These findings highlight the potential of Phi-3 Vision as an extremely useful tool for developers building apps that demand deep multimodal understanding.


The Phi-3 Family: A Suite of Efficient Open-Source Tools

Phi-3 Vision is the latest addition to the Phi-3 family of language models. The others include:


Phi-3-mini:

This versatile language model comes in two sizes: a long context length (128K tokens) and a short context length (4K tokens). Phi-3-mini is ideal for tasks that don't call for a lot of text or complex reasoning.


Phi-3-small:

A 7-billion-parameter model offered in two context lengths (128K and 8K tokens). Phi-3-small works well on more difficult problems requiring more advanced reasoning abilities.


Phi-3-medium:

A 14-billion-parameter model, also offered in two context lengths (128K and 4K tokens). Phi-3-medium is best suited to the most demanding assignments, which require a high degree of understanding and reasoning.

Since all Phi-3 models are open-source, developers are free to use and modify them as they see fit. Phi-3 models are available on Hugging Face and Microsoft Azure, ready to be included in your projects.
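A small sketch of what "ready to be included in your projects" looks like: a lookup from model size and context length to the checkpoint id to load. The repository ids below are the ones Microsoft publishes on Hugging Face, but verify current availability before depending on them; the selector function itself is purely illustrative.

```python
# Sketch: picking a Phi-3 checkpoint by size and context length.
# Repo ids are the Microsoft-published Hugging Face names (verify before use).

PHI3_CHECKPOINTS = {
    ("mini", "4k"): "microsoft/Phi-3-mini-4k-instruct",
    ("mini", "128k"): "microsoft/Phi-3-mini-128k-instruct",
    ("small", "8k"): "microsoft/Phi-3-small-8k-instruct",
    ("small", "128k"): "microsoft/Phi-3-small-128k-instruct",
    ("medium", "4k"): "microsoft/Phi-3-medium-4k-instruct",
    ("medium", "128k"): "microsoft/Phi-3-medium-128k-instruct",
    ("vision", "128k"): "microsoft/Phi-3-vision-128k-instruct",
}

def checkpoint_for(size: str, context: str) -> str:
    """Return the Hugging Face repo id for a given Phi-3 variant."""
    try:
        return PHI3_CHECKPOINTS[(size, context)]
    except KeyError:
        raise ValueError(f"no Phi-3 checkpoint for size={size!r}, context={context!r}")

repo = checkpoint_for("vision", "128k")
print(repo)  # microsoft/Phi-3-vision-128k-instruct
# In real code, the next step would be:
#   AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
```

Keeping the checkpoint choice in one table like this makes it easy to swap context lengths (e.g. 4K for chat, 128K for long-document tasks) without touching the rest of the pipeline.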

Examples of Real-World Applications

Phi-3 models are already being used in a number of interesting ways, including:


An Indian company is using Phi-3 to develop a virtual assistant that will allow farmers to ask questions and get answers in their own language.


Khan Academy:

Phi-3 is being used to enhance math tutoring, increasing accessibility and lowering the costs associated with education.


A healthcare organization is using Phi-3 to help doctors provide better care by quickly summarizing complex patient histories.

Digital Green:

They are using Phi-3 to improve Farmer.Chat, their AI assistant, so that it can comprehend visual data more effectively.


To sum up, Phi-3 Vision is an effective, performant open-source model that deserves serious consideration going forward.

So, get ready to use Phi-3 Vision to "vision" the possibilities! It’s time to see the incredible future of artificial intelligence.


Zernab Farooqi

My name is Zernab Farooqi and I am from Lahore, Pakistan. I graduated from Punjab University in Lahore with a master's degree in human resource management in 2016. I am honest with my work. I am a multitasker, a positive thinker, and have a "can do" attitude. My hobbies are traveling, listening to music, cooking, gaming, photography, and writing.