Gemini 2.0 Flash: Step-by-Step Tutorial With Demo Project
Google recently announced Gemini 2.0, with the Gemini 2.0 Flash model at its core—a faster, more powerful version designed to improve image and sound processing.
In this tutorial, I’ll walk you through the steps to use Gemini 2.0 Flash to create a visual assistant that can read on-screen content and answer questions about it.
Here’s a demo of what we’ll be building:
Step 1: Set Up the API Key for Google AI Studio
To set up the API key, navigate to Google AI Studio and click the “Create API Key” button. Remember to copy the key and then paste it into a file named .env, with the following format:
GOOGLE_API_KEY=replace_this_with_api_key
If you have already worked with the Google Cloud Platform using the same Google account, Google AI Studio will prompt you to choose one of your projects to activate the API.
To follow along with this tutorial, the Python code must be in the same folder as the .env file.
Step 2: Install Python Dependencies
For this project, we’ll be using the following packages:
- google-genai: A Python library for integrating Google’s generative AI models into our applications.
- pyautogui: A cross-platform library for programmatically controlling the mouse and keyboard to automate tasks. In our case, we use it to provide the screen content to the AI model.
- python-dotenv: A library to manage environment variables by loading them from .env files into our Python application.
- sounddevice: A Python library for recording and playing sound using simple APIs for audio input and output.
- numpy: A fundamental library for numerical computing in Python, providing support for arrays, matrices, and a wide range of mathematical operations.
To install the dependencies, we can use pip:
pip install google-genai pyautogui python-dotenv sounddevice numpy
Alternatively, we can download the requirements.txt file from the GitHub repository I set up for this project and use it to create a Conda environment:
conda create --name gemini python=3.11
conda activate gemini
pip install -r requirements.txt
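Before moving on, it can help to verify the environment. Here’s a small, optional sanity-check snippet (not part of the project itself) that confirms the packages import correctly and the API key from the .env file is visible to Python:
from google import genai
from dotenv import load_dotenv
import pyautogui
import sounddevice
import numpy
import os

# Load the .env file from the current folder and confirm the key is available
load_dotenv()
if os.getenv("GOOGLE_API_KEY"):
    print("All packages imported and API key loaded!")
else:
    print("API key missing - check your .env file")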
Step 3: Create A Text Chatbot With Google GenAI API
Let’s start by creating a simple command-line AI chat interface using Google’s Gemini 2.0 Flash model with the google.genai library. I recommend checking out the official documentation of Gemini 2.0 in case of any hiccups along the way.
The complete code for this example is available in the text.py file from my GitHub repository.
Creating and connecting to the Google GenAI client
The first step is to load the API key securely and initialize the Google GenAI client. The script uses dotenv to load environment variables from the .env file.
Here’s how to set up the client with the necessary credentials:
from google import genai
from dotenv import load_dotenv
import os
# Load environment variables from a .env file
load_dotenv()
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options={"api_version": "v1alpha"},
)
print("Connected to the AI model!")
Making asynchronous API calls
When working with APIs like Google GenAI, we often need to manage asynchronous operations. Asynchronous programming allows other operations to continue while waiting for network requests, making your application more responsive. This is particularly important when dealing with high-latency operations such as network requests.
In Python, asynchronous programming is made possible using the asyncio library and the async/await syntax.
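As a quick standalone illustration (independent of the GenAI API), here’s a minimal sketch showing how async/await lets two slow operations wait concurrently instead of one after the other:
import asyncio

async def fetch(name, delay):
    # Simulate a slow network call without blocking other tasks
    await asyncio.sleep(delay)
    return f"{name} done after {delay}s"

async def main():
    # Both simulated requests run concurrently, so this takes about 2 seconds, not 3
    results = await asyncio.gather(fetch("first", 1), fetch("second", 2))
    print(results)

asyncio.run(main())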
Here’s how we can make an asynchronous request to Google GenAI:
from google import genai
from dotenv import load_dotenv
import os
import asyncio
# Load environment variables from a .env file
load_dotenv()
async def main():
    client = genai.Client(
        api_key=os.getenv("GOOGLE_API_KEY"),
        http_options={"api_version": "v1alpha"},
    )
    # Define the AI model and configuration
    model_id = "gemini-2.0-flash-exp"
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model=model_id, config=config) as session:
        await session.send("Hello", end_of_turn=True)
        # Process responses from the AI
        async for response in session.receive():
            if not response.server_content.turn_complete:
                for part in response.server_content.model_turn.parts:
                    print(part.text, end="", flush=True)

# Run the main function
asyncio.run(main())
This version connects to the AI model and sends a single “Hello” message. The response is printed word by word to the console.
Making it interactive
To make the application interactive, allowing the user to chat back and forth with the AI model, we add a loop that lets the user send multiple messages. The loop continues until the user types "exit."
from google import genai
from dotenv import load_dotenv
import os
import asyncio
# Load environment variables from a .env file
load_dotenv()
async def main():
    client = genai.Client(
        api_key=os.getenv("GOOGLE_API_KEY"),
        http_options={"api_version": "v1alpha"},
    )
    # Define the AI model and configuration
    model_id = "gemini-2.0-flash-exp"
    config = {"response_modalities": ["TEXT"]}
    async with client.aio.live.connect(model=model_id, config=config) as session:
        while True:
            message = input("> ")
            print()
            # Exit the loop if the user types "exit"
            if message == "exit":
                print("Exiting...")
                break
            # Send the user's message to the AI model, marking the end of the turn
            await session.send(message, end_of_turn=True)
            # Receive responses asynchronously and process each response
            async for response in session.receive():
                if not response.server_content.turn_complete:
                    for part in response.server_content.model_turn.parts:
                        print(part.text, end="", flush=True)
            print()

# Run the main function
asyncio.run(main())
And that’s it! With the above script, we’ve created a command-line AI chatbot using the Google GenAI API. Here’s what it looks like:
Step 4: Add Audio Mode
Audio mode enables the model to respond with voice instead of text. To adjust the previous example for handling audio responses, we:
- Import sounddevice for audio playback and numpy to process audio data.
- Change the response modality from TEXT to AUDIO:
config = {"response_modalities": ["AUDIO"]}
- Initialize an audio stream before connecting to the client:
with sd.OutputStream(
    samplerate=24000,
    channels=1,
    dtype="int16",
) as audio_stream:
- Access the audio data from the response part and write it to the audio stream for playback:
for part in response.server_content.model_turn.parts:
    # Get the audio data from the response part and add it to the stream
    inline_data = part.inline_data
    audio_data = np.frombuffer(inline_data.data, dtype="int16")
    audio_stream.write(audio_data)
The audio.py file in the repository contains the full script with these changes applied. The script contains comments on the lines that have changed.
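If it helps to see these changes in context, here’s a condensed sketch of how they fit into the chat loop from Step 3 (the audio.py file in the repository is the reference; this sketch assumes the same structure and may differ in details):
import asyncio
import os
import numpy as np
import sounddevice as sd
from google import genai
from dotenv import load_dotenv

load_dotenv()

async def main():
    client = genai.Client(
        api_key=os.getenv("GOOGLE_API_KEY"),
        http_options={"api_version": "v1alpha"},
    )
    model_id = "gemini-2.0-flash-exp"
    # AUDIO instead of TEXT: the model replies with 24 kHz, 16-bit PCM audio
    config = {"response_modalities": ["AUDIO"]}
    # Open the output stream before connecting so chunks can play as they arrive
    with sd.OutputStream(samplerate=24000, channels=1, dtype="int16") as audio_stream:
        async with client.aio.live.connect(model=model_id, config=config) as session:
            while True:
                message = input("> ")
                if message == "exit":
                    break
                await session.send(message, end_of_turn=True)
                async for response in session.receive():
                    if not response.server_content.turn_complete:
                        for part in response.server_content.model_turn.parts:
                            # Decode the raw PCM bytes and queue them for playback
                            audio_data = np.frombuffer(part.inline_data.data, dtype="int16")
                            audio_stream.write(audio_data)

asyncio.run(main())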
Step 5: Add Code Execution With Tools
One of the great features of modern AI models is their ability to autonomously call custom functions in our code, and Gemini 2 is no exception.
The way it works is that we tell the model which functions are available to be called by registering them as tools. Then, by analyzing the prompt, the function names, and descriptions, the model will decide whether it wants to make a function call. When it decides to do so, it will send a special response with the name of the function it wants to call and the arguments.
To define a tool, we need to:
- Write a Python function with the same name and arguments as defined in the schema.
- Create the function schema, which is a dictionary with metadata about the function, such as its name, a textual description, and a specification of the arguments.
- Provide the function schema to the AI model.
- Execute the function when the model requests it.
To illustrate this, let’s define a tool that can read a file, enabling the model to answer questions about files on our local machine.
Defining the function
This part is just regular Python code. The model expects the answer to be a dictionary with a “result” key if the function was successful and “error” otherwise.
def load_file_content(filename):
    try:
        with open(filename, "rt") as f:
            return {
                "result": f.read()
            }
    except Exception as e:
        return {
            "error": "Could not load file content",
        }
Specifying the schema
Here’s how we can define a schema for this function:
load_file_content_schema = {
    "name": "load_file_content",
    "description": "Load the content of a file",
    "parameters": {
        "type": "object",
        "properties": {
            "filename": {
                "type": "string",
                "description": "The name of the file",
            },
        },
        "required": ["filename"],
    },
    "output": {
        "type": "string",
        "description": "The text content of the file",
    },
}
We provide four fields:
- “name”: The name of the function.
- “description”: A textual description. This is used by the model to decide whether to call the function.
- “parameters”: Description of the function arguments.
- “output”: Description of the function output.
Check the official documentation for more information on function schemas.
Providing the function to the model
To let the model know about our function, we provide the function schema in the model configuration:
config = {
    "tools": [{"function_declarations": [load_file_content_schema]}],
    "response_modalities": ["TEXT"],
}
Processing the function call request from the model
When the model decides to perform a function call, it will add a tool_call to the response. This will contain the name of the function and the arguments. It could contain several call requests, so we need to iterate over all of them, call the corresponding functions, and send the result back to the model:
# A dictionary mapping the function names to the actual functions
FUNCTIONS = {"load_file_content": load_file_content}

# types comes from the SDK: from google.genai import types
if response.tool_call is not None:
    for fc in response.tool_call.function_calls:
        f = FUNCTIONS.get(fc.name)
        tool_response = types.LiveClientToolResponse(
            function_responses=[
                types.FunctionResponse(
                    name=fc.name,
                    id=fc.id,
                    response=f(**fc.args),
                )
            ]
        )
        await session.send(tool_response)
The full implementation of this example, with these changes applied, is provided in the tool.py file in the repository. The function and schema definitions are in the tool_spec.py file.
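To see where these fragments sit relative to each other, here’s a condensed sketch of the chat loop with the tool enabled (it assumes the client, model_id, FUNCTIONS mapping, schema, and types import from the snippets above; tool.py in the repository is the reference and may differ in details):
config = {
    "tools": [{"function_declarations": [load_file_content_schema]}],
    "response_modalities": ["TEXT"],
}

async with client.aio.live.connect(model=model_id, config=config) as session:
    while True:
        message = input("> ")
        if message == "exit":
            break
        await session.send(message, end_of_turn=True)
        async for response in session.receive():
            # The model either asks us to run one of our functions...
            if response.tool_call is not None:
                for fc in response.tool_call.function_calls:
                    f = FUNCTIONS.get(fc.name)
                    tool_response = types.LiveClientToolResponse(
                        function_responses=[
                            types.FunctionResponse(
                                name=fc.name,
                                id=fc.id,
                                response=f(**fc.args),
                            )
                        ]
                    )
                    await session.send(tool_response)
            # ...or streams regular text parts back to us
            elif not response.server_content.turn_complete:
                for part in response.server_content.model_turn.parts:
                    print(part.text, end="", flush=True)
        print()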
Web access
Using tools, we can also give the model the ability to access the web by adding the Google search tool:
search_tool = {"google_search": {}}
config = {
    "tools": [search_tool],
    "response_modalities": ["TEXT"],
}
This tool is built-in, and for that reason, we don’t need to provide a function.
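For example, plugged into the chat loop from Step 3, the search-enabled configuration is all that changes (a minimal sketch; the question is just an illustration):
search_tool = {"google_search": {}}
config = {
    "tools": [search_tool],
    "response_modalities": ["TEXT"],
}

async with client.aio.live.connect(model=model_id, config=config) as session:
    # The model can now ground its answer in live web search results
    await session.send("What are today's top technology headlines?", end_of_turn=True)
    async for response in session.receive():
        if not response.server_content.turn_complete:
            for part in response.server_content.model_turn.parts:
                print(part.text, end="", flush=True)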
Code execution
Another built-in function is code execution. Code execution allows the model to write and run Python code to answer complex questions, usually involving math. For example, with this tool, if we ask it to compute the sum of the first 10 prime numbers, it will first write Python code to calculate this, execute it, and then provide the answer.
To activate the tool, we do:
code_execution_tool = {"code_execution": {}}
config = {
    "tools": [code_execution_tool],
    "response_modalities": ["TEXT"],
}
Here’s an example of the model’s behavior with code execution:
> add the first 10 prime numbers
Okay, I understand. You want me to add the first 10 prime numbers.
Here's my plan:
1. **Identify the first 10 prime numbers:** I will use a python code to find prime numbers.
2. **Sum the prime numbers:** I will also sum them using python.
3. **Report the result**
The first 10 prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, and 29. Their sum is 129.
Step 6: Build a Visual Assistant
In this last section, we explore Gemini 2's visual capabilities. The aim is to create an AI assistant that can understand the content on our screen and answer questions about it. This can be useful, for example, when we ask it to explain an error we see in our terminal or provide information on something currently displayed on the screen.
In the previous examples, we used asynchronous programming to connect to the AI model and send data back and forth while processing the responses in real time. Unfortunately, the current version of the SDK doesn’t yet support real-time communication with images. Instead, we provide image data through a request-response workflow. Note that this isn't a limitation of Gemini 2.0 itself—it’s just that the current beta API doesn’t support it yet.
Sending a synchronous request to Google GenAI
Let’s start by learning how to send a request with image data to the Google GenAI API. Here’s how we can send a synchronous request:
from google import genai
from dotenv import load_dotenv
import os
load_dotenv()
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options={"api_version": "v1alpha"},
)

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["Hello"],
)
print(response.text)
The main difference in this example is that we use client.models.generate_content to send a request to the API. This is a synchronous request, which means it doesn’t provide a real-time conversation experience.
Sending an image
We can send an image by loading it and adding it to the contents list. We use the PIL package to create a function called load_and_resize_image() that loads and resizes the image.
from PIL import Image

def load_and_resize_image(image_path, target_width=1024):
    # Scale the image down to a fixed width (1024 px here, an arbitrary choice)
    # while preserving its aspect ratio
    with Image.open(image_path) as img:
        aspect_ratio = img.height / img.width
        new_height = int(target_width * aspect_ratio)
        return img.resize((target_width, new_height), Image.Resampling.LANCZOS)

image = load_and_resize_image("example_image.jpeg")

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=["Describe the image", image],
)

print(response.text)
Creating the AI visual assistant
The AI visual assistant processes a textual prompt along with a screenshot to help us answer questions about what's on the screen. I've experimented with this extensively, and the model can understand the screen content even when multiple windows are open.
The simplest way to provide the model with the screen content is by taking a screenshot. For this, we use the pyautogui package, a cross-platform library for programmatically controlling the mouse and keyboard to automate tasks. In our case, we'll use it just for taking screenshots, though we could extend its functionality to let the AI perform tasks on our computer autonomously.
Here's a function to take a screenshot:
def capture_screen():
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"screenshot_{timestamp}.jpeg"
    screenshot = pyautogui.screenshot()
    screenshot = screenshot.convert("RGB")
    screenshot.save(filename, format="JPEG")
    return filename
To make it interactive, we repeat the following steps until the user decides to exit the application:
- Ask the user to input a prompt.
- Take a screenshot and send it to the AI model along with the prompt.
- Display the result to the user.
We need to keep one thing in mind with this approach. The application will run in the terminal, which is also displayed on the screen. Therefore, it is better to instruct the model to ignore the terminal window. This can be done by adding a system_instruction to the configuration:
client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=[prompt, screen],
    config=types.GenerateContentConfig(
        system_instruction="Ignore the terminal window in the image when analyzing the image",
    ),
)
Putting it all together, here’s our visual AI assistant. The file with the full code is also available as the vision.py file in the repository.
from google import genai
from google.genai import types
from PIL import Image
import pyautogui
import time
import os
from dotenv import load_dotenv
load_dotenv()
# Initialize the GenAI client
client = genai.Client(
    api_key=os.getenv("GOOGLE_API_KEY"),
    http_options={"api_version": "v1alpha"},
)

def capture_screen():
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    filename = f"screenshot_{timestamp}.jpeg"
    screenshot = pyautogui.screenshot()
    screenshot = screenshot.convert("RGB")
    screenshot.save(filename, format="JPEG")
    return filename
def load_and_resize_image(image_path, target_width=1024):
    # Scale the screenshot down to a fixed width (1024 px here, an arbitrary choice)
    # while preserving its aspect ratio
    with Image.open(image_path) as img:
        aspect_ratio = img.height / img.width
        new_height = int(target_width * aspect_ratio)
        return img.resize((target_width, new_height), Image.Resampling.LANCZOS)
def get_genai_response(prompt):
    print("Analyzing screen...")
    screen = load_and_resize_image(capture_screen())
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=[prompt, screen],
        config=types.GenerateContentConfig(
            system_instruction="Ignore the terminal window in the image when analyzing the image",
        ),
    )
    return response.text

def main():
    while True:
        prompt = input("> ")
        print()
        if prompt == "exit":
            break
        answer = get_genai_response(prompt)
        print(answer)
        print()

if __name__ == "__main__":
    main()
The current version is a bit clunky, requiring the terminal to be open on top of the current view for us to trigger it. A natural next step would be to run it in the background and use voice input and output instead of text.
Conclusion
We’ve learned how to use the Gemini 2.0 Flash model for various applications: developing chatbots that can engage in real-time conversations with text and voice, enabling the AI model to perform actions using function calls, and building a visual assistant capable of analyzing our computer screen content.
If you want to explore further, Google provides two examples showcasing the model’s ability to detect objects in an image by providing their bounding boxes with labels. The first example focuses on 2D spatial understanding by asking the model to identify and label cupcakes in a picture. Gemini can also understand the 3D context of an image, as showcased in the second example.
Overall, Gemini 2.0 is quite promising despite the fact that the current version of the API doesn’t yet allow us to use it to its full potential. I’m excited to fully use multimodal real-time capabilities in the near future.