
Pixtral 12B: A Guide With Practical Examples

Learn how to use Mistral’s Pixtral 12B interactively via Le Chat or programmatically through the API available on La Plateforme.
Sep 26, 2024  · 8 min read

Mistral has released Pixtral 12B, a 12-billion-parameter open-source large language model (LLM). It is Mistral's first multimodal model, meaning it can process both text and images.

Here’s why Pixtral is a valuable addition to the LLM landscape:

  • It efficiently processes images of all sizes without pre-processing.
  • A 128K context window enables the handling of complex prompts and numerous images.
  • Strong performance in both text-only and multimodal tasks.
  • Free for non-monetized projects, making it ideal for researchers and hobbyists.
  • Open-source under the Apache 2.0 license, supporting AI democratization.

In this tutorial, I’ll provide numerous examples and step-by-step guidance on using Pixtral through the web chat interface, Le Chat, and programmatically via the API. But first, let’s cover the essential theoretical aspects of Pixtral.

What Is Pixtral 12B?

Mistral AI has launched Pixtral 12B, a model designed to process both images and text together. With 12 billion parameters, it can handle tasks that involve a mix of visuals and language, such as interpreting charts, documents, or graphs.

It’s useful for environments that require a deep understanding of both formats.

A key feature of Pixtral 12B is its capacity to handle multiple images within a single input, processing them at their native resolution. The model has a 128,000-token context window, which allows for the analysis of long and complex documents, images, or multiple data sources simultaneously. This makes it helpful in areas like financial reporting or document scanning for businesses.

Pixtral’s benchmarks

Pixtral performs well in tasks related to Multimodal Knowledge & Reasoning, especially in the MathVista test, where it leads the pack. In multimodal QA tasks, it also holds a strong position, particularly in ChartQA.

Pixtral benchmarks

Source: Mistral AI

However, in instruction following and text-based tasks, other models like Claude-3 Haiku and Gemini Flash-8B show competitive or superior performance. This suggests that Pixtral 12B excels in multimodal and visual reasoning but may not dominate in purely text-based tasks.

Pixtral’s Architecture

The architecture of Pixtral 12B is designed to process both text and images simultaneously. It features two main components: a Vision Encoder and a Multimodal Transformer Decoder.

The Vision Encoder, which has 400 million parameters, is specifically trained to accommodate images of varying sizes and resolutions.

Pixtral's vision encoder.

Source: Mistral AI

The second component, the Multimodal Transformer Decoder, is a more extensive model with 12 billion parameters. It's based on the existing Mistral Nemo architecture and is designed to predict the next text token in sequences that interleave text and image data.

This decoder can process very lengthy contexts (up to 128k tokens), enabling it to handle numerous image tokens and extensive textual information in large documents.

Pixtral's multimodal transformer decoder.

Source: Mistral AI

This combined architecture allows Pixtral to deal with a wide range of image sizes and formats, translating high-resolution images into coherent tokens without losing context.

How to Use Pixtral on Le Chat

The easiest way to access Pixtral for free is through Le Chat, Mistral's chat interface. This interface resembles other LLM chat interfaces—like that of ChatGPT, for example.

The Le Chat web interface.

To use Pixtral, navigate to the model selector located at the bottom, next to the prompt input, and choose the Pixtral model from the list of available models.

Selecting the Pixtral model in Le Chat.

Pixtral is a multimodal model that supports both text and images. Using the paperclip icon located at the bottom, next to the prompt input, we can upload one or more images and combine them with a text prompt. For example, this functionality can help us identify a fruit depicted in an image.

A multimodal prompt example of using Pixtral in Le Chat.

Let's explore another example where we request Pixtral to transform an image containing a pie chart into a markdown table:

Using Le Chat to transform a chart into a text table.

How to Connect to Pixtral’s API on La Plateforme

Although using Pixtral through its web interface is quick and easy, it isn't suitable for integrating the model into our own projects. In this section, we will discuss how to interact with Pixtral via the API using Python, through La Plateforme.

Profile setup

To begin, we need to create an account. This can be done with just one click by using a Google account, or alternatively, by setting up a traditional account with a username and a password.

Upon creating the account, we are prompted to set up a workspace. We can select any name for the workspace and opt for the "I'm a solo creator" option.

Creating a workspace in La Plateforme.

After creating the account, proceed to the billing plans page. Here, we have the option to either create an experimental billing plan, which allows us to try the API for free, or set up a paid plan. It's important to note that the free experimental plan requires us to link a valid phone number to our account.

Selecting a billing plan in La Plateforme.

Our profile should now be ready to create an API key. This key is necessary for making requests to the Mistral API and for programmatically interacting with Pixtral using Python.

Generating the API key

To generate the API key, navigate to the API key page. At the top of the page, we have a button to create a new API key:

Creating API keys in La Plateforme.

When creating a key, we are prompted to name it and set an expiration date. However, both fields are optional, allowing us to leave them blank if desired.

API key creation form for Mistral's La Plateforme

Generally, it is advisable to set an expiration date for keys. Often, keys are created to experiment with an API but then are forgotten, leaving them active indefinitely. Setting an expiration date ensures that if a key is accidentally leaked, it cannot be used forever, thus minimizing potential risks.

Safekeeping the API key after creation.

Once the key is created, it will be displayed. This display is the only opportunity to view the key, so it's essential to copy it. If the key is lost, the solution is to delete it from the list and create a new one.

I recommend creating a .env file in the same directory as the Python script to store the key using the following format (replacing <key_value> with the actual key):

# contents of the .env file
API_KEY=<key_value>

Skipping this step and hardcoding the API key in our script is not recommended. Doing so prevents us from sharing our code without also sharing the key. Learn more about this approach in this tutorial on Python environment variables.

Install dependencies

To begin, we install the necessary dependencies, which include:

  • mistralai, the client library provided by Mistral for API interaction.
  • python-dotenv, a module used for loading environment variables from a .env file.
pip install python-dotenv mistralai

Once the dependencies are installed, we can proceed to script creation. Create a file called mistral_example.py in the same directory as the .env file. The initial step involves importing the modules and loading the API key into a variable.

# Create a mistral_example.py file in the same folder as the .env file

import os
from mistralai import Mistral
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv("API_KEY")
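
If the .env file is missing or the variable name doesn't match, api_key will silently be None, and the first API call will fail with a confusing authentication error. A small guard catches the problem earlier (the require_api_key helper name is my own, not part of any library):

```python
import os

def require_api_key(name="API_KEY"):
    # Fail fast with a clear message instead of a later
    # authentication error from the API.
    key = os.getenv(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set; check the .env file and the load_dotenv() call."
        )
    return key
```

Calling api_key = require_api_key() instead of os.getenv("API_KEY") makes the failure mode explicit.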

Following that, we can proceed to initialize the client.

Initialize the client

The Mistral documentation page provides a list of all available models. We are particularly interested in the latest Pixtral model, which has the API endpoint pixtral-12b-2409.

# Add this code in mistral_example.py after initializing the API key

model = "pixtral-12b-2409"
client = Mistral(api_key=api_key)

Make an API request

We’re now ready to make a request to the Mistral API and interact with Pixtral programmatically. Here’s an example of how to submit a text prompt:

# Add this code in mistral_example.py after initializing the client

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is 1 + 1?"
                }
            ]
        },
    ]
)
print(chat_response.choices[0].message.content)

We can run this script in the terminal and see Pixtral's response to our prompt:

$ python mistral_example.py                  
The sum of 1 + 1 is 2. So,
1 + 1 = 2

Using the API with multimodality

In the example provided, the text prompt is submitted through the content field, where type is set to "text".

{
  "type": "text",
  "text": "What is 1 + 1?"
}

The content field is an array that allows us to send multiple pieces of data. As a multi-modal model, Pixtral also accepts image data. To use an image from a URL in the prompt, we can include it in the content field by specifying "image_url" as the type:

{
  "type": "image_url",
  "image_url": "<image_url>"
}

Replace <image_url> with the actual URL of the image. For example, we can use Pixtral to analyze the performance charts below:

Source: Mistral AI

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "According to the chart, how does Pixtral 12B perform compared to other models?"
                },
                {
                    "type": "image_url",
                    "image_url": "https://mistral.ai/images/news/pixtral-12b/pixtral-benchmarks.png"
                }
            ]
        },
    ]
)
print(chat_response.choices[0].message.content)

When this request is submitted, Pixtral receives both the text prompt and the image containing the charts for analysis and then provides a response detailing the analysis. We won't display the response here due to its considerable length.
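
The pattern generalizes: a single user message can mix one text entry with any number of image entries in its content array. As a minimal sketch (the build_user_message helper name is my own, not part of the mistralai library), such a message can be assembled like this:

```python
def build_user_message(prompt, image_urls):
    # One text entry followed by one image_url entry per image,
    # in the shape the chat endpoint expects.
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": url})
    return {"role": "user", "content": content}
```

The returned dictionary can be passed directly inside the messages list of client.chat.complete().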

Loading local images

In the previous example, we showed how to display an image from a URL. Alternatively, we can use an image stored on our hard drive by loading it as a base-64 encoded image. To load and encode an image in base-64, we use the built-in base64 package:

import base64

def encode_image_base64(image_path):
    # Read the file in binary mode and return its base-64 string
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

When using base-64 encoded images, we still use the image_url type to provide the encoded image, but we need to prefix the encoded data with data:image/jpeg;base64, as follows:

{
  "type": "image_url",
  "image_url": f"data:image/jpeg;base64,{base_64_image}"
}
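
Putting the encoding step and the prefix together, a small helper can produce the complete content entry from a file path (the image_content helper name and its mime parameter are my own additions; adjust the MIME type to match the image format):

```python
import base64

def image_content(image_path, mime="image/jpeg"):
    # Read a local image and wrap it as an image_url entry
    # carrying a base-64 data URI.
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": f"data:{mime};base64,{encoded}"}
```
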

base_64_image is the result of calling the encode_image_base64() function to load the image. Let’s use this to ask Pixtral to build a to-do list website with two pages based on the following two sketches I made:

Sketch for the TODO list app interface.

We provide the two images separately as well as a prompt asking to create an HTML website based on the images:

list_image = encode_image_base64("./todo-list.jpeg")
new_item_image = encode_image_base64("./new-item-form.jpeg")

chat_response = client.chat.complete(
    model=model,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Create an HTML website with two pages like in the images"
                },
                {
                    "type": "image_url",
                    "image_url": f"data:image/jpeg;base64,{list_image}"
                },
                {
                    "type": "image_url",
                    "image_url": f"data:image/jpeg;base64,{new_item_image}"
                },
            ]
        },
    ]
)
print(chat_response.choices[0].message.content)

Pixtral will output two code blocks with the content of the two pages. We saved the code into two files named index.html and add.html and opened them in the browser. This was the result:
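
Copying the blocks by hand works, but the extraction can also be scripted. Here is a minimal sketch, assuming the response wraps each page in a standard markdown code fence (the regex and the extract_code_blocks name are my own):

```python
import re

def extract_code_blocks(markdown_text):
    # Return the body of every triple-backtick fenced code block.
    return re.findall(r"`{3}[a-zA-Z]*\n(.*?)`{3}", markdown_text, re.DOTALL)

# Example with a response-like string; with the real response, pass
# chat_response.choices[0].message.content instead.
fence = "`" * 3
sample = (
    f"First page:\n{fence}html\n<h1>TODO</h1>\n{fence}\n"
    f"Second page:\n{fence}html\n<form></form>\n{fence}"
)
for filename, code in zip(["index.html", "add.html"], extract_code_blocks(sample)):
    with open(filename, "w") as f:
        f.write(code)
```
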

Rendered TODO list app by Pixtral.

Although it's not a fully functional application yet, it's operational and an excellent starting point for further development.

Conclusion

Pixtral 12B is Mistral’s first multimodal model. It can handle images of all sizes without pre-processing, features a 128K context window for complex prompts, and performs well in both text-only and multimodal tasks. 

Available for free in non-monetized projects and open-source under the Apache 2.0 license, Pixtral is valuable for researchers and hobbyists alike.

In this tutorial, I’ve provided practical insights into using Pixtral, highlighting its capabilities through examples and step-by-step guidance.

FAQs

Is Pixtral free to use, and under what conditions?

Pixtral is free for use on Le Chat or via the API for non-monetized projects.

Is Pixtral open source?

Yes. It’s open source under the Apache 2.0 license.

What kind of input data does Pixtral support?

Pixtral is the first multimodal LLM from Mistral AI. It supports both text and images.

What kind of images does Pixtral support?

Pixtral supports images of any size without requiring any type of pre-processing.

What makes Pixtral important?

Pixtral has excellent performance for a model with only 12 billion parameters. On top of that, it can be used for free, and it's open source.


Author: François Aubry