Gemini 2.0 Flash: Step-by-Step Tutorial With Demo Project
Google recently announced Gemini 2.0, with the Gemini 2.0 Flash model at its core—a faster, more powerful version designed to improve image and sound processing.
In this tutorial, I’ll walk you through the steps to use Gemini 2.0 Flash to create a visual assistant that can read on-screen content and answer questions about it.
Here’s a demo of what we’ll be building:
Step 1: Set Up the API Key for Google AI Studio
To set up the API key, navigate to Google AI Studio and click the “Create API Key” button. Remember to copy the key and then paste it into a file named .env, with the following format:
GOOGLE_API_KEY=replace_this_with_api_key
If you have already worked with the Google Cloud Platform using the same Google account, Google AI Studio will prompt you to choose one of your projects to activate the API.
To follow along with this tutorial, the Python code must be in the same folder as the .env file.
Step 2: Install Python Dependencies
For this project, we’ll be using the following packages:
- google-genai: A Python library for integrating Google’s generative AI models into our applications.
- pyautogui: A cross-platform library for programmatically controlling the mouse and keyboard to automate tasks. In our case, we use it to provide the screen content to the AI model.
- python-dotenv: A library to manage environment variables by loading them from .env files into our Python application.
- sounddevice: A Python library for recording and playing sound using simple APIs for audio input and output.
- numpy: A fundamental library for numerical computing in Python, providing support for arrays, matrices, and a wide range of mathematical operations.
To install the dependencies, we can use pip:
pip install google-genai pyautogui python-dotenv sounddevice numpy
Alternatively, we can download the requirements.txt file from the GitHub repository I set up for this project and use it to create a Conda environment:
conda create --name gemini python=3.11
conda activate gemini
pip install -r requirements.txt
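Before moving on, it can help to verify the environment. Here’s a small, optional sanity-check snippet (not part of the project itself) that confirms the packages import correctly and the API key from the .env file is visible to Python:
from google import genai
from dotenv import load_dotenv
import pyautogui
import sounddevice
import numpy
import os

# Load the .env file from the current folder and confirm the key is available
load_dotenv()
if os.getenv("GOOGLE_API_KEY"):
    print("All packages imported and API key loaded!")
else:
    print("API key missing - check your .env file")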
Step 3: Create A Text Chatbot With Google GenAI API
Let’s start by creating a simple command-line AI chat interface using Google’s Gemini 2.0 Flash model with the google.genai library. I recommend checking out the official documentation of Gemini 2.0 in case of any hiccups along the way.
The complete code for this example is available in the text.py file from my GitHub repository.
Creating and connecting to the Google GenAI client
The first step is to load the API key securely and initialize the Google GenAI client. The script uses dotenv to load environment variables from the .env file.
Here’s how to set up the client with the necessary credentials:
from google import genai
from dotenv import load_dotenv
import os
# Load environment variables from a .env file
load_dotenv()
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options={"api_version": "v1alpha"},
)
print("Connected to the AI model!")
Making asynchronous API calls
When working with APIs like Google GenAI, we often need to manage asynchronous operations. Asynchronous programming allows other operations to continue while waiting for network requests, making your application more responsive. This is particularly important when dealing with high-latency operations such as network requests.
In Python, asynchronous programming is made possible using the asyncio library and the async/await syntax.
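As a quick standalone illustration (independent of the GenAI API), here’s a minimal sketch showing how async/await lets two slow operations wait concurrently instead of one after the other:
import asyncio

async def fetch(name, delay):
    # Simulate a slow network call without blocking other tasks
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"

async def main():
    # Both simulated requests run concurrently, so this takes about 2 seconds, not 3
    results = await asyncio.gather(fetch("first", 1), fetch("second", 2))
    print(results)

asyncio.run(main())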
Here’s how we can make an asynchronous request to Google GenAI:
from google import genai
from dotenv import load_dotenv
import os
import asyncio
# Load environment variables from a .env file
load_dotenv()
async def main():
    client = genai.Client(
        api_key=os.getenv("GOOGLE_API_KEY"),
        http_options={"api_version": "v1alpha"},
    )
    # Define the AI model and configuration
    model_id = "gemini-2.0-flash-exp"
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model=model_id, config=config) as session:
        await session.send("Hello", end_of_turn=True)
        # Process responses from the AI
        async for response in session.receive():
            if not response.server_content.turn_complete:
                for part in response.server_content.model_turn.parts:
                    print(part.text, end="", flush=True)

# Run the main function
asyncio.run(main())
This version connects to the AI model and sends a single “Hello” message. The response is printed word by word to the console.
Making it interactive
To make the application interactive, allowing the user to chat back and forth with the AI model, we add a loop that lets the user send multiple messages. The loop continues until the user types "exit."
from google import genai
from dotenv import load_dotenv
import os
import asyncio
# Load environment variables from a .env file
load_dotenv()
async def main():
    client = genai.Client(
        api_key=os.getenv("GOOGLE_API_KEY"),
        http_options={"api_version": "v1alpha"},
    )
    # Define the AI model and configuration
    model_id = "gemini-2.0-flash-exp"
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model=model_id, config=config) as session:
        while True:
            message = input("> ")
            print()
            # Exit the loop if the user types "exit"
            if message == "exit":
                print("Exiting...")
                break
            # Send the user's message to the AI model, marking the end of the turn
            await session.send(message, end_of_turn=True)
            # Receive responses asynchronously and process each response
            async for response in session.receive():
                if not response.server_content.turn_complete:
                    for part in response.server_content.model_turn.parts:
                        print(part.text, end="", flush=True)
            print()

# Run the main function
asyncio.run(main())
And that’s it! With the above script, we’ve created a command-line AI chatbot using the Google GenAI API. Here’s what it looks like:
Step 4: Add Audio Mode
Audio mode enables the model to respond with voice instead of text. To adjust the previous example for handling audio responses, we:
- Import sounddevice for audio playback and numpy to process audio data.
- Change the response modality from TEXT to AUDIO:
config = {"response_modalities": ["AUDIO"]}
- Initialize an audio stream before connecting to the client:
with sd.OutputStream(
    samplerate=24000,
    channels=1,
    dtype="int16",
) as audio_stream:
- Access the audio data from the response part and write it to the audio stream for playback:
for part in response.server_content.model_turn.parts:
    # Get the audio data from the response part and add it to the stream
    inline_data = part.inline_data
    audio_data = np.frombuffer(inline_data.data, dtype="int16")
    audio_stream.write(audio_data)
The audio.py file in the repository contains the full script with these changes applied. The script contains comments on the lines that have changed.
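If it helps to see these changes in context, here’s a condensed sketch of how they fit into the chat loop from Step 3 (the audio.py file in the repository is the reference; this sketch assumes the same structure and may differ in details):
import asyncio
import os
import numpy as np
import sounddevice as sd
from google import genai
from dotenv import load_dotenv

load_dotenv()

async def main():
    client = genai.Client(
        api_key=os.getenv("GOOGLE_API_KEY"),
        http_options={"api_version": "v1alpha"},
    )
    model_id = "gemini-2.0-flash-exp"
    # AUDIO instead of TEXT: the model replies with 24 kHz, 16-bit PCM audio
    config = {"response_modalities": ["AUDIO"]}
    # Open the output stream before connecting so chunks can play as they arrive
    with sd.OutputStream(samplerate=24000, channels=1, dtype="int16") as audio_stream:
        async with client.aio.live.connect(model=model_id, config=config) as session:
            while True:
                message = input("> ")
                if message == "exit":
                    break
                await session.send(message, end_of_turn=True)
                async for response in session.receive():
                    if not response.server_content.turn_complete:
                        for part in response.server_content.model_turn.parts:
                            # Decode the raw PCM bytes and queue them for playback
                            audio_data = np.frombuffer(part.inline_data.data, dtype="int16")
                            audio_stream.write(audio_data)

asyncio.run(main())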
Step 5: Add Code Execution With Tools
One of the great features of modern AI models is their ability to autonomously call custom functions in our code, and Gemini 2 is no exception.
The way it works is that we tell the model which functions are available to be called by registering them as tools. Then, by analyzing the prompt, the function names, and descriptions, the model will decide whether it wants to make a function call. When it decides to do so, it will send a special response with the name of the function it wants to call and the arguments.
To define a tool, we need to:
- Write a Python function with the same name and arguments as defined in the schema.
- Create the function schema, which is a dictionary with metadata about the function, such as its name, a textual description, and a specification of the arguments.
- Provide the function schema to the AI model.
- Execute the function when the model requests it.
To illustrate this, let’s define a tool that can read a file, enabling the model to answer questions about files on our local machine.
Defining the function
This part is just regular Python code. The model expects the answer to be a dictionary with a “result” key if the function was successful and “error” otherwise.
def load_file_content(filename):
    try:
        with open(filename, "rt") as f:
            return {
                "result": f.read()
            }
    except Exception as e:
        return {
            "error": "Could not load file content",
        }
Specifying the schema
Here’s how we can define a schema for this function:
load_file_content_schema = {
    "name": "load_file_content",
    "description": "Load the content of a file",
    "parameters": {
        "type": "object",
        "properties": {
            "filename": {
                "type": "string",
                "description": "The name of the file",
            },
        },
        "required": ["filename"],
    },
    "output": {
        "type": "string",
        "description": "The text content of the file",
    },
}
We provide four fields:
- “name”: The name of the function.
- “description”: A textual description. This is used by the model to decide whether to call the function.
- “parameters”: Description of the function arguments.
- “output”: Description of the function output.
Check the official documentation for more information on function schemas.
Providing the function to the model
To let the model know about our function, we provide the function schema in the model configuration:
config = {
    "tools": [{"function_declarations": [load_file_content_schema]}],
    "response_modalities": ["TEXT"],
}
Processing the function call request from the model
When the model decides to perform a function call, it will add a tool_call to the response. This will contain the name of the function and the arguments. It could contain several call requests, so we need to iterate over all of them, call the corresponding functions, and send the result back to the model:
# A dictionary mapping the function names to the actual functions
FUNCTIONS = {"load_file_content": load_file_content}

# types comes from the SDK: from google.genai import types
if response.tool_call is not None:
    for fc in response.tool_call.function_calls:
        f = FUNCTIONS.get(fc.name)
        tool_response = types.LiveClientToolResponse(
            function_responses=[
                types.FunctionResponse(
                    name=fc.name,
                    id=fc.id,
                    response=f(**fc.args),
                )
            ]
        )
        await session.send(tool_response)
The full implementation of this example, with these changes applied, is provided in the tool.py file in the repository. The function and schema definitions are in the tool_spec.py file.
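To see where these fragments sit relative to each other, here’s a condensed sketch of the chat loop with the tool enabled (it assumes the client, model_id, FUNCTIONS mapping, schema, and types import from the snippets above; tool.py in the repository is the reference and may differ in details):
config = {
    "tools": [{"function_declarations": [load_file_content_schema]}],
    "response_modalities": ["TEXT"],
}

async with client.aio.live.connect(model=model_id, config=config) as session:
    while True:
        message = input("> ")
        if message == "exit":
            break
        await session.send(message, end_of_turn=True)
        async for response in session.receive():
            # The model either asks us to run one of our functions...
            if response.tool_call is not None:
                for fc in response.tool_call.function_calls:
                    f = FUNCTIONS.get(fc.name)
                    tool_response = types.LiveClientToolResponse(
                        function_responses=[
                            types.FunctionResponse(
                                name=fc.name,
                                id=fc.id,
                                response=f(**fc.args),
                            )
                        ]
                    )
                    await session.send(tool_response)
            # ...or streams regular text parts back to us
            elif not response.server_content.turn_complete:
                for part in response.server_content.model_turn.parts:
                    print(part.text, end="", flush=True)
        print()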
Web access
Using tools, we can also give the model the ability to access the web by adding the Google search tool:
search_tool = {"google_search": {}}
config = {
    "tools": [search_tool],
    "response_modalities": ["TEXT"],
}
This tool is built-in, and for that reason, we don’t need to provide a function.
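For example, plugged into the chat loop from Step 3, the search-enabled configuration is all that changes (a minimal sketch; the question is just an illustration):
search_tool = {"google_search": {}}
config = {
    "tools": [search_tool],
    "response_modalities": ["TEXT"],
}

async with client.aio.live.connect(model=model_id, config=config) as session:
    # The model can now ground its answer in live web search results
    await session.send("What are today's top technology headlines?", end_of_turn=True)
    async for response in session.receive():
        if not response.server_content.turn_complete:
            for part in response.server_content.model_turn.parts:
                print(part.text, end="", flush=True)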
Code execution
Another built-in function is code execution. Code execution allows the model to write and run Python code to answer complex questions, usually involving math. For example, with this tool, if we ask it to compute the sum of the first 10 prime numbers, it will first write Python code to calculate this, execute it, and then provide the answer.
To activate the tool, we do:
code_execution_tool = {"code_execution": {}}
config = {
    "tools": [code_execution_tool],
    "response_modalities": ["TEXT"],
}
Here’s an example of the model’s behavior with code execution:
> add the first 10 prime numbers
Okay, I understand. You want me to add the first 10 prime numbers.
Here's my plan:
1. **Identify the first 10 prime numbers:** I will use a python code to find prime numbers.
2. **Sum the prime numbers:** I will also sum them using python.
3. **Report the result**
The first 10 prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Their sum is 129.
Step 6: Build a Visual Assistant
In this last section, we explore Gemini 2's visual capabilities. The aim is to create an AI assistant that can understand the content on our screen and answer questions about it. This can be useful, for example, when we ask it to explain an error we see in our terminal or provide information on something currently displayed on the screen.
In the previous examples, we used asynchronous programming to connect to the AI model and send data back and forth while processing the responses in real time. Unfortunately, the current version of the SDK doesn’t yet support real-time communication with images. Instead, we provide image data through a request-response workflow. Note that this isn't a limitation of Gemini 2.0 itself—it’s just that the current beta API doesn’t support it yet.
Sending a synchronous request to Google GenAI
Let’s start by learning how to send a request with image data to the Google GenAI API. Here’s how we can send a synchronous request:
from google import genai
from dotenv import load_dotenv
import os
load_dotenv()
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options={"api_version": "v1alpha"},
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["Hello"],
)
print(response.text)
The main difference in this example is that we use client.models.generate_content to send a request to the API. This is a synchronous request, which means it doesn’t provide a real-time conversation experience.
Sending an image
We can send an image by loading it and adding it to the contents list. We use the PIL package to create a function called load_and_resize_image() that loads and resizes the image.
from PIL import Image

def load_and_resize_image(image_path, target_width=1024):
    # Scale the image down to a fixed width (1024 px here, an arbitrary choice)
    # while preserving its aspect ratio
    with Image.open(image_path) as img:
        aspect_ratio = img.height / img.width
        new_height = int(target_width * aspect_ratio)
        return img.resize((target_width, new_height), Image.Resampling.LANCZOS)

image = load_and_resize_image("example_image.jpeg")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["Describe the image", image],
)

print(response.text)
Creating the AI visual assistant
The AI visual assistant processes a textual prompt along with a screenshot to help us answer questions about what's on the screen. I've experimented with this extensively, and the model can understand the screen content even when multiple windows are open.
The simplest way to provide the model with the screen content is by taking a screenshot. For this, we use the pyautogui package, a cross-platform library for programmatically controlling the mouse and keyboard to automate tasks. In our case, we'll use it just for taking screenshots, though we could extend its functionality to let the AI perform tasks on our computer autonomously.
Here's a function to take a screenshot:
def capture_screen():
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"screenshot_{timestamp}.jpeg"
    screenshot = pyautogui.screenshot()
    screenshot = screenshot.convert("RGB")
    screenshot.save(filename, format="JPEG")
    return filename
To make it interactive, we repeat the following steps until the user decides to exit the application:
- Ask the user to input a prompt.
- Take a screenshot and send it to the AI model along with the prompt.
- Display the result to the user.
We need to keep one thing in mind with this approach. The application will run in the terminal, which is also displayed on the screen. Therefore, it is better to instruct the model to ignore the terminal window. This can be done by adding a system_instruction to the configuration:
client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[prompt, screen],
    config=types.GenerateContentConfig(
        system_instruction="Ignore the terminal window in the image when analyzing the image",
    ),
)
Putting it all together, here’s our visual AI assistant. The file with the full code is also available as the vision.py file in the repository.
from google import genai
from google.genai import types
from PIL import Image
import pyautogui
import time
import os
from dotenv import load_dotenv
load_dotenv()
# Initialize the GenAI client
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options={"api_version": "v1alpha"},
)

def capture_screen():
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"screenshot_{timestamp}.jpeg"
    screenshot = pyautogui.screenshot()
    screenshot = screenshot.convert("RGB")
    screenshot.save(filename, format="JPEG")
    return filename
def load_and_resize_image(image_path, target_width=1024):
    # Scale the screenshot down to a fixed width (1024 px here, an arbitrary choice)
    # while preserving its aspect ratio
    with Image.open(image_path) as img:
        aspect_ratio = img.height / img.width
        new_height = int(target_width * aspect_ratio)
        return img.resize((target_width, new_height), Image.Resampling.LANCZOS)
def get_genai_response(prompt):
    print("Analyzing screen...")
    screen = load_and_resize_image(capture_screen())
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[prompt, screen],
        config=types.GenerateContentConfig(
            system_instruction="Ignore the terminal window in the image when analyzing the image",
        ),
    )
    return response.text

def main():
    while True:
        prompt = input("> ")
        print()
        if prompt == "exit":
            break
        answer = get_genai_response(prompt)
        print(answer)
        print()

if __name__ == "__main__":
    main()
The current version is a bit clunky, requiring the terminal to be open on top of the current view for us to trigger it. A natural next step would be to run it in the background and use voice input and output instead of text.
Conclusion
We’ve learned how to use the Gemini 2.0 Flash model for various applications: developing chatbots that can engage in real-time conversations with text and voice, enabling the AI model to perform actions using function calls, and building a visual assistant capable of analyzing our computer screen content.
If you want to explore further, Google provides two examples showcasing the model’s ability to detect objects in an image by providing their bounding boxes with labels. The first example focuses on 2D spatial understanding by asking the model to identify and label cupcakes in a picture. Gemini can also understand the 3D context of an image, as showcased in the second example.
Overall, Gemini 2.0 is quite promising despite the fact that the current version of the API doesn’t yet allow us to use it to its full potential. I’m excited to fully use multimodal real-time capabilities in the near future.