
Pixtral Large: A Guide With Examples

Learn how to use Mistral’s Pixtral Large via Le Chat or access it programmatically through the API on La Plateforme.
Nov 27, 2024  · 8 min read

Mistral AI recently announced Pixtral Large, a multimodal model that is a more powerful version of Pixtral 12B, which was released earlier this year.

Pixtral Large is a 124B multimodal model that can tackle advanced image understanding tasks while preserving the exceptional text-processing capabilities of Mistral Large 2. With a massive 128K context window, this model can analyze up to 30 high-resolution images alongside text, making it ideal for documents, charts, and natural images.

In this tutorial, I’ll walk you through various examples and provide detailed, step-by-step instructions on how to use Pixtral Large through both the Le Chat web interface and the API. Before diving into the hands-on steps, let’s first explore the key theoretical concepts behind Pixtral Large.


What Is Pixtral Large?

Pixtral Large pairs a 123B-parameter multimodal decoder with a 1B-parameter vision encoder (124B parameters in total), enabling it to handle multimodal inputs effectively. It has demonstrated strong capabilities in diverse scenarios such as:

  • Multilingual OCR: Accurately reading documents such as receipts across languages and performing context-aware calculations (for example, analyzing a restaurant bill).
  • Chart understanding: Identifying trends and anomalies, such as pinpointing issues in training-loss curves.
  • General analysis: Extracting insights from screenshots and answering contextual questions about them.

Pixtral Large shows good results on quite a few benchmarks—here are the ones that caught my attention:

  • MathVista: 69.4% accuracy in mathematical reasoning over visual data.
  • ChartQA and DocVQA: Outperformed top competitors like GPT-4o and Gemini-1.5 Pro in reasoning over complex charts and documents.
  • MM-MT-Bench: Surpassed Claude-3.5 Sonnet, Gemini-1.5 Pro, and GPT-4o in real-world multimodal tasks.

Pixtral Large on multiple benchmarks

Source: Mistral AI

Pixtral Large 124B vs Pixtral 12B

Pixtral Large 124B and Pixtral 12B share architectural foundations but are far from identical. Pixtral Large builds on the groundwork laid by Pixtral 12B, with significant differences in scale, context window size, and multimodal integration that translate into superior performance across tasks. Here are a few key differences between the two models:

  • The significant difference in size allows Pixtral Large to handle more complex tasks and larger datasets.
  • A larger context window enables Pixtral Large to process and understand longer sequences of text and more complex images.
  • Both models are designed for multimodal tasks, integrating text and image data. However, Pixtral Large introduces specialized cross-attention layers to better fuse information across different modalities, enhancing its performance in tasks like image understanding and document processing.
  • Pixtral Large requires far more computational resources than Pixtral 12B, including over 200GB of storage and powerful GPU hardware, reflecting the increased complexity and capability of the larger model.

Pixtral Architecture

Pixtral Architecture. Source: Pixtral 12B technical report.

If you want to explore the model architecture more deeply, check out the Pixtral 12B technical report. The Pixtral Large technical report is currently unavailable.

How to Use Pixtral Large on Le Chat

Le Chat offers a chat-based interface, much like the popular ChatGPT.

Le Chat interface from Mistral AI

Although we can’t select the Pixtral Large model specifically, the announcement article makes it clear that “state of the art document and image understanding” is powered by the new multimodal Pixtral Large.

I first tried Pixtral Large on a research paper on Transformer-Lite, which has plenty of text and graphs. I asked the model to provide me with a summary of the document:

Summarizing a research paper using the Pixtral Large model via Le Chat

I was fairly satisfied with the result. Next, I asked Pixtral to explain a pie chart of smartphone brand distribution, but I was careful to keep my prompt vague so as not to give away what the chart shows.

Experiment 2 - Summarizing a pie chart using the Pixtral Large model via Le Chat

The model understood the pie chart and identified the numbers correctly. The response is well structured and also gives an accurate interpretation, saying that while Samsung and Apple lead the market, a significant portion is occupied by a variety of other brands.

How to Use Pixtral Large’s API on La Plateforme

For most everyday use cases, Le Chat is a great option, but if we want to use Pixtral Large programmatically, we need to connect through its API. I’ll first show you how to connect to the API through La Plateforme, and then I’ll test Pixtral Large on my own drawings.

Connect to La Plateforme

To use Pixtral Large through the API, we can take the following steps:

1. Log in to La Plateforme.

2. Click on Quickstart → API Requests

Quickstart to Mistral API

3. Select the free account option by clicking Experiment for free. This takes us to another page where we verify the account with a mobile number and an OTP and finish setting up the account.

Mistral Billing Plan options

4. Once the account is set up, select API keys from the API tab (left side). 

Mistral API keys tab

5. Next, click on Create new key. We may fill in the optional details and click on Create key.

Mistral pop up to create API key

The API key is now ready for use. 

Tab to find active and expired API keys in Mistral

Image Description Generator Using the API on My Drawing

Let’s now create an image description generator using the API. The goal is to pass any image to the model along with a question in the prompt (e.g., “Provide a detailed description of the given image”) and receive a response.

I’ll test this by generating a description for one of my drawings, which is shown below. The image depicts a mobile framework for a 24/7 customer support chatbot application.

drawing used for the pixtral large api demo

Before we get started, make sure you have the following setup:

  1. Environment: Since inference runs on Mistral's servers through the API, no local GPU is required. A free Google Colab notebook is a convenient place to run the code.
  2. Mistral AI library: To install it, run the following command:
!pip install -U mistralai

Let’s begin:

Step 1: Imports

First, we set up our imports. In this tutorial, we use the following libraries:

  • Pillow (PIL) for image processing
  • base64 for encoding binary data into base64 text
  • io.BytesIO for an in-memory byte-stream interface
import base64
from io import BytesIO
from PIL import Image
from mistralai import Mistral

Step 2: Pixtral API client

Next, we configure the API key. To store it in Google Colab, open the Secrets tab by clicking the key icon in the left sidebar, paste your API key, and give the secret a name. Then toggle the button to make the secret available to your code. This allows us to use the following code in our application.

from google.colab import userdata
userdata.get('secretName') # pass in your secretName here

Once the secret is in place, we store it in a variable named API_KEY by passing in our secret name, as shown in the code below. We then pass the API key to the Mistral constructor to set up the client.

from google.colab import userdata
API_KEY = userdata.get('secretName')  # Replace with your actual secret name
client = Mistral(api_key=API_KEY) 

Note: We could pass the API key directly as a string, but this is not recommended for security and privacy reasons.
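Outside Colab, a common alternative is to read the key from an environment variable. This is a minimal sketch; the variable name MISTRAL_API_KEY is a conventional choice I'm using here, not something the SDK requires:

```python
import os

# Read the key from an environment variable instead of hardcoding it.
# Set it in your shell before launching Python:
#   export MISTRAL_API_KEY="your-key-here"
API_KEY = os.environ.get("MISTRAL_API_KEY", "")

# The client is then created the same way as before:
# client = Mistral(api_key=API_KEY)
```

Either way, the key never appears in the notebook itself, so it can't leak when the code is shared.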

Step 3: Encoding Image

Once we have the client set up, we start to process the image that will be the input to the Pixtral Large model.

# Helper function to encode an image to base64
def encode_image(image_obj):
    if isinstance(image_obj, Image.Image):  # Check if it's already a PIL Image
        img = image_obj
    else:  # Otherwise, try opening it as a path
        img = Image.open(image_obj)
    buffered = BytesIO()
    img.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

The primary goal of the above function is to take an image, either as a file path or a preloaded PIL.Image.Image object, and return its base64 string representation. This is useful for:

  • Embedding the image directly into a JSON object or API request.
  • Avoiding the need to send the image as a separate file.
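As a quick, purely illustrative sanity check of this round trip (using a stand-in byte string rather than a real PNG), you can confirm that base64 encoding and decoding recovers the original bytes, which is exactly what the data-URL embedding relies on:

```python
import base64

# Any binary payload stands in for the PNG bytes produced by img.save()
png_bytes = b"\x89PNG\r\n\x1a\n fake image payload"

# Encode exactly as encode_image() does
encoded = base64.b64encode(png_bytes).decode("utf-8")

# Build the data URL that the API message will carry
data_url = f"data:image/png;base64,{encoded}"

# Decoding the base64 portion recovers the original bytes
assert base64.b64decode(data_url.split(",", 1)[1]) == png_bytes
```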

Step 4: Inference on image

Once we have the image, we want to generate its description with the Pixtral Large API. The workflow includes encoding the image, crafting a prompt, sending a request to the Pixtral API, and receiving a descriptive response. Here is how the code works:

  1. The image is encoded into a Base64 format and sent along with the prompt in a structured message to the Pixtral API.
  2. The Pixtral Large model processes the prompt and image to generate a detailed textual description of the image's contents.
  3. The response is parsed to extract the description, which is displayed to the user.

Note: Make sure that in the client.chat.complete call, model is set to "pixtral-large-latest", which is currently the only Pixtral Large model available to the public.

# Function to perform inference for image description
def describe_image(image_path):
    # Load and encode the image
    image_base64 = encode_image(image_path)
    # Prompt for the Pixtral model
    prompt = "Please provide a detailed description of the given image."
    # Prepare input for the Pixtral API
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ]
    # Perform inference
    response = client.chat.complete(
        model="pixtral-large-latest",
        messages=messages,
        max_tokens=300
    )
    # Return the model's output
    return response.choices[0].message.content

# Usage
if __name__ == "__main__":
    # Provide the path to your image file
    image_path = "image.png"  # Replace with the path to your image
    # Get the description of the image
    description = describe_image(image_path)
    
    # Print the result
    print("Image Description:")
    print(description)
Image Description:
The image is a sketch of a mobile app prototype for automated customer support. The sketch depicts a smartphone screen with a customer support interface.
At the top of the screen, there is a header that reads "24x7 Customer Support Powered By AI." Below the header, there is a chat window where a conversation between a user and an AI-powered support system is taking place.
The conversation starts with the AI greeting the user: "Hi! How can I help you?" The user responds: "Hi, I am looking for assistance for featureX." The AI then provides a response: "Sure, here are a few steps:" followed by a placeholder for the steps, which are represented by horizontal lines.
At the bottom of the screen, there is a text input box labeled "Type your message" and a button labeled "Ask AI." The button is highlighted in red, indicating it is an actionable element for the user to interact with.
Overall, the sketch illustrates a user-friendly interface for automated customer support, utilizing AI to provide assistance and guidance to users.

Here’s another example of a Christmas tree I’ve also drawn and below the image is what Pixtral Large thought of it.

Drawing of a Christmas tree

Image Description:
The image presents a charming scene of a Christmas tree, standing tall and proud. 
…
This description is based on the visible content of the image and does not include any speculative or imaginary elements.
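The describe_image function hardcodes its prompt. A small, hypothetical refactor (build_image_messages is a helper name I'm choosing here) separates the message assembly from the API call, so the same pipeline can answer any question about an image, such as OCR or chart queries:

```python
def build_image_messages(prompt, image_base64):
    """Assemble the message payload Pixtral expects for one image plus one question.

    The structure mirrors the messages list used in describe_image() above.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
            ],
        }
    ]

# With the client and encode_image() defined earlier, any question works the same way:
# response = client.chat.complete(
#     model="pixtral-large-latest",
#     messages=build_image_messages("Extract all visible text.", encode_image("receipt.png")),
#     max_tokens=300,
# )
```

This keeps the API call identical while letting you swap prompts freely.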

Conclusion

Pixtral Large looks like a real step forward in multimodal AI, pairing stronger image understanding with solid text processing. We explored the model's capabilities through both Le Chat and La Plateforme, and I hope this tutorial proves useful to you.


Photo of Aashi Dutt
Author
Aashi Dutt
LinkedIn
Twitter

I am a Google Developers Expert in ML(Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.
