How to Set Up and Run GPT-OSS Locally With Ollama

Learn how to install, set up, and run GPT-OSS locally with Ollama and build a simple Streamlit application.

Aug 6, 2025 · 12 min read

GPT-OSS is OpenAI’s first open-weight model series, and you can run it directly on your own computer. Designed for powerful reasoning, agentic tasks, and flexible developer use cases, GPT-OSS stands out for its configurable reasoning effort and transparent chain-of-thought capabilities.

In this tutorial, I'll walk you through setting up OpenAI's GPT-OSS models locally using Ollama and demonstrate how to build an interactive demo that showcases the model's unique reasoning capabilities. We'll explore the model's configurable chain-of-thought features and build a Streamlit application that visualizes the model's thinking process.

We keep our readers updated on the latest in AI by sending out The Median, our free Friday newsletter that breaks down the week’s key stories. Subscribe and stay sharp in just a few minutes a week:

What Is OpenAI’s GPT-OSS?

OpenAI's GPT-OSS series marks their first open-weight release since GPT-2, allowing access to high-performance reasoning models under the Apache 2.0 license. These state-of-the-art language models deliver strong performance at low cost and outperform similarly sized open models on reasoning tasks.

There are two variants: gpt-oss-120b and gpt-oss-20b.

Source: OpenAI

gpt-oss-120b

gpt-oss-120b is OpenAI’s flagship open-weight model, with performance that rivals proprietary systems like o4‑mini on tough reasoning benchmarks. Despite its size of 117 billion parameters, it’s optimized to run on a single 80 GB GPU. This makes it a realistic option for research labs and enterprise teams looking to host powerful language models locally.

The model is designed to support complex workflows such as tool use, multi-step reasoning, and fine-grained control over how much “thinking” effort it puts into each task. Its open-weight nature and Apache 2.0 license make it a flexible choice for teams that want to deeply inspect, customize, or fine-tune the model.

gpt-oss-20b

gpt-oss-20b packs surprising reasoning strength into a footprint that fits on consumer laptops. With just 21 billion total parameters, it still outperforms OpenAI’s o3-mini on some benchmarks and runs efficiently on devices with 16 GB of memory. It’s quantized and prepped for rapid local deployment, and compatible with tools like Ollama, vLLM, and even Apple’s Metal platform.

This model is ideal for developers who want to build responsive, private AI apps without relying on cloud infrastructure. Like its larger sibling, it supports adjustable reasoning depth and clear intermediate steps—great for experiments, lightweight assistants, and edge device applications.

Why Run GPT-OSS Locally?

Running GPT-OSS locally offers several compelling advantages over cloud-based alternatives:

Privacy and control: All inference happens on your hardware, i.e., no external dependencies, outages, or API changes, and you get complete access to the model’s internal reasoning, not just its final answers.
Performance: Local models eliminate network latency problems for instant responses, with unlimited usage (no rate limits or token caps), and ensure 24/7 availability.
Cost savings: The model requires one-time download and setup with no subscription fees, and the user can use their existing compute resources without ongoing API costs.
Transparent reasoning: User can flip between low, medium, and high “effort” modes to balance speed with depth, and inspect the full chain-of-thought.
Customizability and extensibility: We can fine-tune the open-weight models to our specific tasks, tweak sampling parameters, and integrate agentic tools seamlessly.

Setting Up GPT-OSS Locally With Ollama

Ollama makes running large language models locally remarkably simple by handling downloads, quantization, and execution. I will guide you step-by-step through setting up GPT-OSS on your system. At the end, our app will look like this:

Step 1: Install Ollama

First, download and install Ollama from the official website.

Once installed, you can verify the installation by opening a terminal and typing:

ollama --version

This will return the latest version of Ollama installed on your system.

Step 2: Download the GPT-OSS models

OpenAI provides two GPT-OSS variants optimized for different use cases, but for the sake of this demo, we will pull only the gpt-oss-20b model variant:

# For production and high reasoning tasks 
ollama pull gpt-oss:120b
# For lower latency and local deployment 
ollama pull gpt-oss:20b

The 20B model is perfect for local development and can run on systems with 16GB+ memory, while the 120B model offers more capabilities but requires 80GB+ memory.

Step 3: Test the installation

Let's verify everything works correctly:

ollama run gpt-oss:20b

You should see a prompt where you can interact with the model directly. Try asking: "Prove that √2 is irrational."

Using GPT-OSS Locally: Interactive ChatBot

Now let's quickly build a simple app that showcases GPT-OSS's reasoning capabilities. Our application will feature:

Interactive reasoning controls
Chain-of-thought visualization
Performance monitoring
Conversation history

Step 1: Install dependencies

Run the following commands to install the necessary dependencies:

pip install streamlit ollama

The above code snippet installs Streamlit for building the web interface and the Ollama Python client for local GPT-OSS model interaction.

Step 2: Import libraries and set up the page

Before we dive into the core chat logic, we need to bring in all the tools and set up our user interface.

import streamlit as st
import json
import ollama
import time
from typing import Dict, Any, List
import re
from datetime import datetime

# Configure page
st.set_page_config(
    page_title="GPT-OSS Chat Demo", 
    layout="wide",
    page_icon="💬"
)
# Custom CSS for UI
st.markdown("""
<style>
.reasoning-box {
    background-
    
    
    border-radius: 10px;
    border-left: 4px solid #00d4aa;
    
}
.answer-box {
    background-
    
    border-radius: 10px;
    border-left: 4px solid #007bff;
    
}
.metric-card {
    background-
    
    border-radius: 10px;
    box-shadow: 0 2px 4px rgba(0,0,0,0.1);
    text-align: center;
    
}
.chat-message {
    
    
    border-radius: 10px;
}
.user-message {
    background-
    border-left: 4px solid #2196f3;
}
.assistant-message {
    background-
    border-left: 4px solid #9c27b0;
}
</style>
""", unsafe_allow_html=True)

We import a few core packages like Streamlit for our web interface, Ollama to call the local GPT-OSS model, along with utility modules like json, time, typing, and re. We then use st.set_page_config() to give the page a title, a wide layout, and an icon, and inject lightweight CSS so that reasoning steps and answers appear in distinct panels.

Step 3: Core model integration

With our UI scaffold in place, let’s build our core model integration call_model() function:

def call_model(messages: List[Dict], model_name: str = "gpt-oss:20b", temperature: float = 1.0) -> Dict[str, Any]:
    try:
        start_time = time.time()        
        # Prepare options for Ollama
        options = {
            'temperature': temperature,
            'top_p': 1.0,
        }        
        response = ollama.chat(
            model=model_name, 
            messages=messages,
            options=options
        )
        end_time = time.time()        
        if isinstance(response, dict) and 'message' in response:
            content = response['message'].get('content', '')
        elif hasattr(response, 'message'):
            content = getattr(response.message, 'content', '')
        else:
            content = str(response)         
        return {
            'content': content, 
            'response_time': end_time - start_time, 
            'success': True
        }
    except Exception as e:
        return {
            'content': f"Error: {e}", 
            'response_time': 0, 
            'success': False
        }

The above code:

Accepts a list of chat messages (system + user), so we can easily maintain multi-turn context.
It then records start_time and end_time around the ollama.chat() call to report response times back to the user.
It also normalizes the responses from the Ollama client by extracting the raw text output.
Finally, the call_model() function wraps the entire call in a try/except block to catch errors.

Step 4: Chain-of-thought parser

To make our demo truly transparent, we need to split the model’s “inner monologue” from its final verdict. The parse_reasoning_response() function does exactly that:

def parse_reasoning_response(content: str) -> Dict[str, str]:
    patterns = [
        r"<thinking>(.*?)</thinking>",
        r"Let me think.*?:(.*?)(?=\n\n|\nFinal|Answer:)",
        r"Reasoning:(.*?)(?=\n\n|\nAnswer:|\nConclusion:)",
    ]
    reasoning = ""
    answer = content    
    for pat in patterns:
        m = re.search(pat, content, re.DOTALL | re.IGNORECASE)
        if m:
            reasoning = m.group(1).strip()
            answer = content.replace(m.group(0), "").strip()
            break    
    if not reasoning and len(content.split('\n')) > 3:
        lines = content.split('\n')
        for i, l in enumerate(lines):
            if any(k in l.lower() for k in ['therefore', 'in conclusion', 'final answer', 'answer:']):
                reasoning = '\n'.join(lines[:i]).strip()
                answer = '\n'.join(lines[i:]).strip()
                break
    return {
        'reasoning': reasoning or "No explicit reasoning detected.", 
        'answer': answer or content
    }

We separate the model’s hidden “thinking” from its final verdict so users can optionally inspect every reasoning step. Here is how the response is parsed:

Looking for explicit markers: First, the model searches for <thinking>…</thinking> tags or phrases like “Reasoning:” and “Let me think through” with regular expressions.
Extracting and cleaning: When a match is found, that segment becomes the reasoning, and the remainder is treated as the answer.
Heuristic fallback: If no tags are present but the output spans multiple lines, it scans for conclusion keywords (“therefore,” “in conclusion,” “final answer,” or “Answer:”) and splits there.
Output: Regardless of format, the parse_reasoning_response() function always returns an answer.

Step 5: Complete Streamlit application

Now that we’ve built each building block, next we assemble them into an interactive chat app.

# Initialize history
if 'history' not in st.session_state:
    st.session_state.history = []
# Sidebar for settings
with st.sidebar:
    st.header("Configuration")    
    model_choice = st.selectbox(
        "Model", 
        ["gpt-oss:20b", "gpt-oss:120b"],
        help="Choose between 20B (faster) or 120B (more capable)"
    )    
    effort = st.selectbox(
        "Reasoning Effort", 
        ["low", "medium", "high"], 
        index=1,
        help="Controls depth of reasoning shown"
    )    
    temperature = st.slider(
        "Temperature", 
        0.0, 2.0, 1.0, 0.1,
        help="Controls response randomness"
    )    
    show_reasoning = st.checkbox(
        "Show Chain-of-Thought", 
        True,
        help="Display model's thinking process"
    )    
    show_metrics = st.checkbox(
        "Show Performance Metrics",
        True,
        help="Display response time and model info"
    )    
    st.markdown("---")        
    if st.button("Clear Conversation"):
        st.session_state.history = []
        st.rerun()  
# Main Chat Interface
st.title("💬 GPT-OSS Interactive Chat")
examples = [
    "",
    "If a train travels 120km in 1.5 hours, then 80km in 45 minutes, what's its average speed?",
    "Prove that √2 is irrational.",
    "Write a function to find the longest palindromic substring.",
    "Explain quantum entanglement in simple terms.",
    "How would you design a recommendation system?",
]
selected_from_dropdown = st.selectbox("Choose from examples:", examples)
# Get the question text
question_value = ""
if hasattr(st.session_state, 'selected_example'):
    question_value = st.session_state.selected_example
    del st.session_state.selected_example
elif selected_from_dropdown:
    question_value = selected_from_dropdown
question = st.text_area(
    "Or Enter your question:", 
    value=question_value, 
    height=100,
    placeholder="Ask anything! Try different reasoning effort levels to see how the model's thinking changes..."
)
# Submit button
col1, col2 = st.columns([1, 4])
with col1:
    submit_button = st.button("Ask GPT-OSS", type="primary")
if submit_button and question.strip():
    # System prompts 
    system_prompts = {
        'low': 'You are a helpful assistant. Provide concise, direct answers.',
        'medium': f'You are a helpful assistant. Show brief reasoning before your answer. Reasoning effort: {effort}',
        'high': f'You are a helpful assistant. Show complete chain-of-thought reasoning step by step. Think through the problem carefully before providing your final answer. Reasoning effort: {effort}'
    }    
    # Message history
    msgs = [{'role': 'system', 'content': system_prompts[effort]}]    
    msgs.extend(st.session_state.history[-6:])
    msgs.append({'role': 'user', 'content': question})    
    with st.spinner(f"GPT-OSS thinking ({effort} effort)..."):
        res = call_model(msgs, model_choice, temperature)
    if res['success']:
        parsed = parse_reasoning_response(res['content'])        
        st.session_state.history.append({'role': 'user', 'content': question})
        st.session_state.history.append({'role': 'assistant', 'content': res['content']})        
        col1, col2 = st.columns([3, 1] if show_metrics else [1])        
        with col1:
            if show_reasoning and parsed['reasoning'] != 'No explicit reasoning detected.':
                st.markdown("### Chain-of-Thought Reasoning")
                st.markdown(f"<div class='reasoning-box'>{parsed['reasoning']}</div>", unsafe_allow_html=True)            
            st.markdown("### Answer")
            st.markdown(f"<div class='answer-box'>{parsed['answer']}</div>", unsafe_allow_html=True)       
        if show_metrics:
            with col2:
                st.markdown(f"""
                <div class="metric-card">
                    <h4> Metrics</h4>
                    <p><strong>Time:</strong><br>{res['response_time']:.2f}s</p>
                    <p><strong>Model:</strong><br>{model_choice}</p>
                    <p><strong>Effort:</strong><br>{effort.title()}</p>
                </div>
                """, unsafe_allow_html=True)
    else:
        st.error(f"{res['content']}")
if st.session_state.history:
    st.markdown("---")
    st.subheader("Conversation History")    
    recent_history = st.session_state.history[-8:]  # Last 8 messages (4 exchanges)    
    for i, msg in enumerate(recent_history):
        if msg['role'] == 'user':
            st.markdown(f"""
            <div class="chat-message user-message">
                <strong>You:</strong> {msg['content']}
            </div>
            """, unsafe_allow_html=True)
        else:
            parsed_hist = parse_reasoning_response(msg['content'])
            with st.expander(f"GPT-OSS Response", expanded=False):
                if parsed_hist['reasoning'] != 'No explicit reasoning detected.':
                    st.markdown("**Reasoning Process:**")
                    st.code(parsed_hist['reasoning'], language='text')
                st.markdown("**Final Answer:**")
                st.markdown(parsed_hist['answer'])
# Footer
st.markdown("---")
st.markdown("""
<div style='text-align: center;'>
    <p><strong>Powered by GPT-OSS-20B via Ollama</strong></p>
</div>
""", unsafe_allow_html=True)

In this final step, we bring together all the pieces of our app, including session state, sidebar controls, example prompts, message assembly, model invocation, parsing, and display into a single loop:

Session state initialization: We use st.session_state.history to remember every user–assistant exchange across reruns.
Sidebar configuration: This configuration allows users to select

Model: gpt-oss:20b or gpt-oss:120b
Reasoning Effort: low/medium/high
Temperature: from 0 to 2.0, which controls the model’s randomness
Chain-of-Thought toggle: To show or hide the model’s internal reasoning
Clear conversation: This wipes history and restarts the app

Example prompts and input: A dropdown of starter questions helps onboard new users, and whatever the user selects (or types) becomes the user’s query in the text area.
Building the request: When Ask GPT-OSS is clicked, we:

Create a system prompt based on the chosen effort level.
Append the last six messages from history to maintain context.
Add the new user question.

Model invocation: We use the call_model() function, which:

Sends the assembled messages to Ollama
Measures response time
Returns a normalized content string

Parsing and display: Using the parse_reasoning_response() function, we split the model’s output into a chain-of-thought response called “reasoning” and a cleaned-up response called “answer”.
History management: Each question and response pair is appended to st.session_state.history list, so the next run includes them for context.

To try it yourself, save the code as gpt_oss_demo.py and launch:

streamlit run gpt_oss_demo.py

Topics

Artificial Intelligence

OpenAI

Learn AI with these courses!

Course

Deploying AI into Production with FastAPI

4 hr

Learn how to use FastAPI to develop APIs that support AI models, built to meet real-world demands.

See Details

Start Course

Course

Building AI Agents with Google ADK

1 hr

3.2K

Build a customer-support assistant step-by-step with Google’s Agent Development Kit (ADK).

See Details

Start Course

Course

Multi-Agent Systems with LangGraph

2 hr 45 min

2.8K

Build powerful multi-agent systems by applying emerging agentic design patterns in the LangGraph framework.

See Details

Start Course

Tutorial

How to Set Up and Run Qwen 3 Locally With Ollama

Learn how to install, set up, and run Qwen3 locally with Ollama and build a simple Gradio-based application.

Aashi Dutt

Tutorial

How to Set Up and Run QwQ 32B Locally With Ollama

Learn how to install, set up, and run QwQ-32B locally with Ollama and build a simple Gradio application.

Aashi Dutt

Tutorial

How to Set Up and Run Gemma 3 Locally With Ollama

Learn how to install, set up, and run Gemma 3 locally with Ollama and build a simple file assistant on your own device.

François Aubry

Tutorial

How to Run Llama 3 Locally With Ollama and GPT4ALL

Run LLaMA 3 locally with GPT4ALL and Ollama, and integrate it into VSCode. Then, build a Q&A retrieval system using Langchain and Chroma DB.

Abid Ali Awan

Tutorial

How to Set Up and Run DeepSeek R1 Locally With Ollama

Learn how to install, set up, and run DeepSeek-R1 locally with Ollama and build a simple RAG application.

Aashi Dutt

Tutorial

AutoGPT Guide: Creating And Deploying Autonomous AI Agents Locally

Learn how to set up AutoGPT, create custom AI agents with a low-code interface, and extend functionality with Python blocks. This hands-on tutorial covers installation, UI basics, and agent creation.

Bex Tuychiev

See More See More

What Is OpenAI’s GPT-OSS?

gpt-oss-120b

gpt-oss-20b

Why Run GPT-OSS Locally?

Setting Up GPT-OSS Locally With Ollama

Step 1: Install Ollama

Step 2: Download the GPT-OSS models

Step 3: Test the installation

Using GPT-OSS Locally: Interactive ChatBot

Step 1: Install dependencies

Step 2: Import libraries and set up the page

Step 3: Core model integration

Step 4: Chain-of-thought parser

Step 5: Complete Streamlit application

Conclusion

How to Set Up and Run Qwen 3 Locally With Ollama

How to Set Up and Run QwQ 32B Locally With Ollama

How to Set Up and Run Gemma 3 Locally With Ollama

How to Run Llama 3 Locally With Ollama and GPT4ALL

How to Set Up and Run DeepSeek R1 Locally With Ollama

AutoGPT Guide: Creating And Deploying Autonomous AI Agents Locally

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Deploying AI into Production with FastAPI

Building AI Agents with Google ADK

Multi-Agent Systems with LangGraph

How to Set Up and Run Qwen 3 Locally With Ollama

How to Set Up and Run QwQ 32B Locally With Ollama

How to Set Up and Run Gemma 3 Locally With Ollama

How to Run Llama 3 Locally With Ollama and GPT4ALL

How to Set Up and Run DeepSeek R1 Locally With Ollama

AutoGPT Guide: Creating And Deploying Autonomous AI Agents Locally

Deploying AI into Production with FastAPI