Mistral AI recently announced Pixtral Large, a multimodal model that is a more powerful version of Pixtral 12B, which was released earlier this year.
Pixtral Large is a 124B multimodal model that can tackle advanced image understanding tasks while preserving the exceptional text-processing capabilities of Mistral Large 2. With a massive 128K context window, this model can analyze up to 30 high-resolution images alongside text, making it ideal for documents, charts, and natural images.
In this tutorial, I’ll walk you through various examples and provide detailed, step-by-step instructions on how to use Pixtral Large through both the Le Chat web interface and the API. Before diving into the hands-on steps, let’s first explore the key theoretical concepts behind Pixtral Large.
What Is Pixtral Large?
Pixtral Large pairs a 123B-parameter multimodal decoder with a 1B-parameter vision encoder (124B parameters in total), enabling it to handle multimodal inputs effectively. It has demonstrated strong capabilities in diverse scenarios such as:
- Multilingual OCR: Accurately reading receipts and performing context-aware calculations (for example, analyzing a restaurant bill).
- Chart understanding: Identifying anomalies and trends, such as pinpointing issues in training loss curves.
- General analysis: Extracting insights from screenshots and answering contextual queries effectively.
Pixtral Large shows good results on quite a few benchmarks—here are the ones that caught my attention:
- MathVista: 69.4% accuracy in mathematical reasoning over visual data.
- ChartQA and DocVQA: Outperformed top competitors like GPT-4o and Gemini-1.5 Pro in reasoning over complex charts and documents.
- MM-MT-Bench: Surpassed Claude-3.5 Sonnet, Gemini-1.5 Pro, and GPT-4o in real-world multimodal tasks.
Source: Mistral AI
Pixtral Large 124B vs Pixtral 12B
Pixtral Large 124B and Pixtral 12B share some architectural foundations but are not identical. While Pixtral Large builds on the foundations laid by Pixtral 12B, it makes significant advances in architecture, context window size, and multimodal integration, leading to superior performance across a variety of tasks. Here are a few differences between the two models:
- The significant difference in size allows Pixtral Large to handle more complex tasks and larger datasets.
- A larger context window enables Pixtral Large to process and understand longer sequences of text and more complex images.
- Both models are designed for multimodal tasks, integrating text and image data. However, Pixtral Large introduces specialized cross-attention layers to better fuse information across modalities, improving performance on tasks like image understanding and document processing (a toy illustration of cross-attention follows at the end of this section).
- Pixtral Large demands far more computational resources than Pixtral 12B: its weights alone occupy over 200GB, requiring a powerful multi-GPU setup. This reflects the increased complexity and capability of the larger model.
Pixtral Architecture. Source: Pixtral 12B technical report.
If you want to explore the model architecture more deeply, check out the Pixtral 12B technical report. A Pixtral Large technical report has not yet been published.
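Since Mistral hasn't published Pixtral Large's internals beyond the announcement, treat the following as intuition only: a minimal NumPy sketch of single-head cross-attention, in which text tokens (queries) attend over image tokens (keys and values). All dimensions and weight matrices here are made up for illustration; a real model learns the projections during training.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_tokens, image_tokens, d_k=64, seed=0):
    # Toy random projection matrices; a real model learns these
    rng = np.random.default_rng(seed)
    W_q = rng.standard_normal((text_tokens.shape[-1], d_k))
    W_k = rng.standard_normal((image_tokens.shape[-1], d_k))
    W_v = rng.standard_normal((image_tokens.shape[-1], d_k))

    Q = text_tokens @ W_q    # queries come from the text stream   (n_text, d_k)
    K = image_tokens @ W_k   # keys come from the image stream     (n_img, d_k)
    V = image_tokens @ W_v   # values come from the image stream   (n_img, d_k)

    # Each text token attends over all image tokens
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n_text, n_img)
    return weights @ V                          # image-informed text features

# 5 text tokens (dim 512) attending over 8 image patches (dim 1024)
text = np.ones((5, 512))
image = np.ones((8, 1024))
print(cross_attention(text, image).shape)  # (5, 64)

The key design idea is the asymmetry: queries come from one modality and keys/values from the other, so each text position ends up as a weighted mixture of visual features.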
How to Use Pixtral Large on Le Chat
Le Chat offers a chat-based interface, much like the popular ChatGPT.
Although we can’t select the Pixtral Large model specifically, the announcement article makes it clear that “state of the art document and image understanding” is powered by the new multimodal Pixtral Large.
I first tried Pixtral Large on a research paper on Transformer-Lite, which has plenty of text and graphs. I asked the model to provide me with a summary of the document:
I was fairly satisfied with the result. Next, I asked Pixtral to explain this pie chart of smartphone brand distribution, but I was careful to keep my prompt vague so as not to give away what the chart is about.
The model understood the pie chart and identified the numbers correctly. The response is well structured and also gives an accurate interpretation, saying that while Samsung and Apple lead the market, a significant portion is occupied by a variety of other brands.
How to Use Pixtral Large’s API on La Plateforme
For most everyday use cases, Le Chat is a great option, but if we want to use Pixtral Large programmatically, we need to connect through its API. I’ll first show you how to connect to the API through La Plateforme, and then I’ll test Pixtral Large on my own drawings.
Connect to La Plateforme
To use Pixtral Large through the API, we can take the following steps:
1. Log in to La Plateforme.
2. Click on Quickstart → API Requests.
3. Select the free account option. Click on Experiment for free, which takes us to another page where we verify the account with our mobile number via an OTP and set up a new account.
4. Once the account is set up, select API keys from the API tab (left side).
5. Next, click on Create new key. We may fill in the optional details and click on Create key.
The API key is now ready for use.
Image Description Generator Using the API on My Drawing
Let’s now create an image description generator using the API. The goal is to pass any image to the model along with a question in the prompt (e.g., “Provide a detailed description of the given image”) and receive a response.
I’ll test this by generating a description for one of my drawings, which is shown below. The image depicts a mobile framework for a 24/7 customer support chatbot application.
Before we get started, make sure you have the following setup:
- Environment: You can use Google Colab or any local Python environment. Since inference runs on Mistral's servers through the API, no local GPU is needed.
- Mistral AI library: To install it, run the following command:
!pip install -U mistralai
Let’s begin:
Step 1: Imports
First, we set up our imports. For this tutorial, we will use the following libraries:
- Pillow (PIL) for image processing
- base64 for encoding and decoding binary data in Base64 format
- io.BytesIO for an in-memory byte-stream interface
import base64
from io import BytesIO
from PIL import Image
from mistralai import Mistral
Step 2: Pixtral API client
Next, we set up the API key. To store it in Google Colab, go to the Secrets tab: click the key icon in the left sidebar, paste your API key, and assign a name to the secret. Then toggle the button to make the secret available to your code. This allows us to use the following code in our application.
from google.colab import userdata
userdata.get('secretName') # pass in your secretName here
Once the secret is in place, we set up a variable named API_KEY and assign it the value retrieved with our secret name, as shown in the code below. We then pass the API key to the Mistral constructor to set up the client.
from google.colab import userdata

API_KEY = userdata.get('secretName') # Replace with your actual secret name
client = Mistral(api_key=API_KEY)
Note: We can optionally pass the API key directly, but this is not recommended for security reasons.
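If you're running outside Colab, a common alternative is to read the key from an environment variable. Here's a minimal sketch, assuming you've exported a MISTRAL_API_KEY variable in your shell; the models.list call from the mistralai SDK doubles as a sanity check that the key works:

import os
from mistralai import Mistral

# Read the key from an environment variable instead of hardcoding it
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Optional sanity check: if the key is valid, the list of available
# models should include "pixtral-large-latest"
models = client.models.list()
print([m.id for m in models.data])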
Step 3: Encoding Image
Once we have the client set up, we start to process the image that will be the input to the Pixtral Large model.
# Helper function to encode an image to base64
def encode_image(image_obj):
    if isinstance(image_obj, Image.Image):  # Check if it's already a PIL Image
        img = image_obj
    else:  # Otherwise, try opening it as a path
        img = Image.open(image_obj)
    buffered = BytesIO()
    img.save(buffered, format="PNG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")
The primary goal of the above function is to take an image, either as a file path or a preloaded PIL.Image.Image object, and return its Base64 string representation. This is useful for:
- Embedding the image directly into a JSON object or API request.
- Avoiding the need to send the image as a separate file.
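As a quick sanity check (hypothetical, assuming a local file named image.png), the function accepts both input types and yields the same result either way:

# Encode directly from a file path
b64_from_path = encode_image("image.png")

# Encode from an already-loaded PIL image
b64_from_pil = encode_image(Image.open("image.png"))

# Both routes go through the same open-and-save pipeline
assert b64_from_path == b64_from_pil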
Step 4: Inference on image
Once we have the image, we want to generate its description with the Pixtral Large API. The workflow includes encoding the image, crafting a prompt, sending a request to the Pixtral API, and receiving a descriptive response. Here is how the following code works:
- The image is encoded into a Base64 format and sent along with the prompt in a structured message to the Pixtral API.
- The Pixtral Large model processes the prompt and image to generate a detailed textual description of the image's contents.
- The response is parsed to extract the description, which is displayed to the user.
Note: Make sure that under client.chat.complete, the model is set to "pixtral-large-latest", which is currently the only Pixtral Large model available to the public.
# Function to perform inference for image description
def describe_image(image_path):
    # Load and encode the image
    image_base64 = encode_image(image_path)

    # Prompt for the Pixtral model
    prompt = "Please provide a detailed description of the given image."

    # Prepare input for the Pixtral API
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}}
            ]
        }
    ]

    # Perform inference
    response = client.chat.complete(
        model="pixtral-large-latest",
        messages=messages,
        max_tokens=300
    )

    # Return the model's output
    return response.choices[0].message.content


# Usage
if __name__ == "__main__":
    # Provide the path to your image file
    image_path = "image.png"  # Replace with the path to your image

    # Get the description of the image
    description = describe_image(image_path)

    # Print the result
    print("Image Description:")
    print(description)
Image Description:
The image is a sketch of a mobile app prototype for automated customer support. The sketch depicts a smartphone screen with a customer support interface.

At the top of the screen, there is a header that reads "24x7 Customer Support Powered By AI." Below the header, there is a chat window where a conversation between a user and an AI-powered support system is taking place.

The conversation starts with the AI greeting the user: "Hi! How can I help you?" The user responds: "Hi, I am looking for assistance for featureX." The AI then provides a response: "Sure, here are a few steps:" followed by a placeholder for the steps, which are represented by horizontal lines.

At the bottom of the screen, there is a text input box labeled "Type your message" and a button labeled "Ask AI." The button is highlighted in red, indicating it is an actionable element for the user to interact with.

Overall, the sketch illustrates a user-friendly interface for automated customer support, utilizing AI to provide assistance and guidance to users.
Here’s another example: a Christmas tree I’ve also drawn. Below the image is what Pixtral Large thought of it.
Image Description:
The image presents a charming scene of a Christmas tree, standing tall and proud. [...] This description is based on the visible content of the image and does not include any speculative or imaginary elements.
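Since Pixtral Large can reason over up to 30 images within its 128K context, the same message format extends naturally to multiple images: just append one image_url entry per image to the content list. Here's a minimal sketch that reuses the client and encode_image helper from earlier; the compare_images function and the file names drawing1.png and drawing2.png are my own hypothetical additions:

def compare_images(image_paths, question):
    # Build one user message containing the question plus every encoded image
    content = [{"type": "text", "text": question}]
    for path in image_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"}
        })

    response = client.chat.complete(
        model="pixtral-large-latest",
        messages=[{"role": "user", "content": content}],
        max_tokens=300
    )
    return response.choices[0].message.content

# Hypothetical usage: compare two of my sketches in a single request
print(compare_images(["drawing1.png", "drawing2.png"],
                     "What do these two sketches have in common?"))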
Conclusion
Pixtral Large looks like a genuine step forward in multimodal AI, pairing stronger image understanding with the text-processing strengths of Mistral Large 2. We explored the model’s capabilities through both Le Chat and La Plateforme, and I hope this guide proves helpful to you.

I am a Google Developers Expert in ML (Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.