
How to Set Up and Run Qwen 3 Locally With Ollama

Learn how to install, set up, and run Qwen3 locally with Ollama and build a simple Gradio-based application.
Apr 30, 2025  · 12 min read

Qwen3 is Alibaba's latest generation of open-weight large language models. With support for 100+ languages and strong performance across reasoning, coding, and translation tasks, Qwen3 rivals many top-tier models available today, including DeepSeek-R1, o3-mini, and Gemini 2.5.

In this tutorial, I’ll explain step-by-step how to run Qwen3 locally using Ollama.

We’ll also build a lightweight local application with Qwen3. The app will let you switch between Qwen3’s reasoning modes and translate text between different languages.


Why Run Qwen3 Locally?

Running Qwen3 locally provides several key benefits:

  • Privacy: Your data never leaves your machine.
  • Latency: There are no network round-trips to a remote API, so responses start sooner.
  • Cost-efficiency: No token charges or cloud bills.
  • Control: You can tune your prompts, choose models, and configure thinking modes.
  • Offline access: You can work without an internet connection after downloading the model.

Qwen3 is optimized for both deep reasoning (thinking mode) and fast responses (non-thinking mode), and supports 100+ languages. Let's set it up locally.

Setting Up Qwen3 Locally With Ollama

Ollama is a tool that lets you run language models like Llama or Qwen locally on your computer with a simple command-line interface.

Step 1: Install Ollama

Download Ollama for macOS, Windows, or Linux from: https://ollama.com/download.

Follow the installer instructions, and after installation, verify by running this in the terminal:

ollama --version

Step 2: Download and run Qwen3

Ollama offers a growing range of Qwen3 models designed to suit a variety of hardware configurations, from lightweight laptops to high-end servers.

ollama run qwen3

Running the command above will launch the default Qwen3 model in Ollama, which currently defaults to qwen3:8b. If you're working with limited resources or want faster startup times, you can explicitly run smaller variants like the 4B model:

ollama run qwen3:4b

Qwen3 is currently available in several variants, from the smallest at 0.6B parameters (523 MB) to the largest at 235B parameters (142 GB). Even the smaller variants offer impressive performance for reasoning, translation, and code generation, especially when used in thinking mode.

The MoE models (30b-a3b, 235b-a22b) are particularly interesting as they activate only a subset of experts per inference step, allowing for massive total parameter counts while keeping runtime costs efficient.

In general, use the largest model your hardware can handle, and fall back to the 8B or 4B models for responsive local experiments on consumer machines.

Here’s a quick recap of all the Qwen3 models you can run:

Model | Ollama Command | Best For
Qwen3-0.6B | ollama run qwen3:0.6b | Lightweight tasks, mobile applications, and edge devices
Qwen3-1.7B | ollama run qwen3:1.7b | Chatbots, assistants, and low-latency applications
Qwen3-4B | ollama run qwen3:4b | General-purpose tasks with balanced performance and resource usage
Qwen3-8B | ollama run qwen3:8b | Multilingual support and moderate reasoning capabilities
Qwen3-14B | ollama run qwen3:14b | Advanced reasoning, content creation, and complex problem-solving
Qwen3-32B | ollama run qwen3:32b | High-end tasks requiring strong reasoning and extensive context handling
Qwen3-30B-A3B (MoE) | ollama run qwen3:30b-a3b | Efficient performance with 3B active parameters, suitable for coding tasks
Qwen3-235B-A22B (MoE) | ollama run qwen3:235b-a22b | Massive-scale applications, deep reasoning, and enterprise-level solutions

Step 3: Run Qwen3 in the background (optional)

To serve the model via API, run this command in the terminal:

ollama serve

This will make the model available for integration with other applications at http://localhost:11434.
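To quickly confirm the server is reachable, you can list the models it has available locally. The /api/tags endpoint below is part of Ollama's REST API and returns a JSON list of installed models:

curl http://localhost:11434/api/tags

If qwen3 appears in the output, the server is ready for other applications to call.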

Using Qwen3 Locally

In this section, I’ll walk you through several ways you can use Qwen3 locally, from basic CLI interaction to integrating the model with Python.

Option 1: Running inference via CLI

Once the model is downloaded, you can interact with Qwen3 directly from the command line. For example:

echo "What is the capital of Brazil? /think" | ollama run qwen3:8b

This is useful for quick tests or lightweight interaction without writing any code. The /think tag at the end of the prompt instructs the model to engage in deeper, step-by-step reasoning. You can replace this with /no_think for a faster, shallower response or omit it entirely to use the model’s default reasoning mode.
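For comparison, here is the same question with reasoning disabled:

echo "What is the capital of Brazil? /no_think" | ollama run qwen3:8b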

Running Qwen3 locally with Ollama (CLI inference)

Option 2: Accessing Qwen3 via API

Once ollama serve is running in the background, you can interact with Qwen3 programmatically using an HTTP API, which is perfect for backend integration, automation, or testing REST clients.

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [{ "role": "user", "content": "Define entropy in physics. /think" }],
  "stream": false
}'

Here is how it works:

  • curl sends a POST request (the -d flag makes it a POST) to the local Ollama server running at localhost:11434.
  • The payload is a JSON object with:
    • "model": Specifies the model to use (here it is: qwen3:8b).
    • "messages": A list of chat messages containing role and content.
    • "stream": false: Ensures the response is returned all at once, not token-by-token.

Accessing Qwen3 locally via API
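If you only need the answer text rather than the full JSON payload, you can pipe the response through a short Python one-liner. This is a minimal sketch that assumes the non-streaming response shape (a JSON object with a message.content field), which is what the Ollama chat endpoint currently returns:

curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3:8b",
  "messages": [{ "role": "user", "content": "Define entropy in physics. /no_think" }],
  "stream": false
}' | python -c "import sys, json; print(json.load(sys.stdin)['message']['content'])"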

Option 3: Accessing Qwen3 via Python

If you’re working in a Python environment (like Jupyter, VSCode, or a script), the easiest way to interact with Qwen3 is via the Ollama Python SDK. Start by installing ollama:

pip install ollama

Then, run your Qwen3 model with this script (we’re using qwen3:8b below):

import ollama

# Send a chat request to the local Ollama server (qwen3:8b must already be pulled)
response = ollama.chat(
    model="qwen3:8b",
    messages=[
        {"role": "user", "content": "Summarize the theory of evolution. /think"}
    ]
)

# The reply is a dictionary; the answer text lives under message -> content
print(response["message"]["content"])

In the above code: 

  • ollama.chat(...) sends a chat-style request to the local Ollama server.
  • You specify the model (qwen3:8b) and a list of messages in a format similar to OpenAI’s API.
  • The /think tag tells the model to reason step by step.
  • Finally, the response is returned as a dictionary, and you can access the model’s answer using ["message"]["content"].

This approach is ideal for local experimentation, prototyping, or building LLM-backed apps without relying on cloud APIs.

Accessing Qwen3 locally via Python
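If you’d rather see tokens appear as they are generated instead of waiting for the full reply, the ollama package also supports streaming. Here’s a minimal sketch using the same qwen3:8b model:

import ollama

# stream=True makes ollama.chat() return an iterator of partial responses
stream = ollama.chat(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Explain quantum tunneling briefly. /no_think"}],
    stream=True
)

# Each chunk carries a slice of the reply under message -> content
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)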

Building a Local Reasoning App With Qwen3

Qwen3 supports hybrid inference behavior using /think (deep reasoning) and /no_think (fast response) tags. In this section, we’ll use Gradio to create an interactive local web app with two separate tabs:

  1. A reasoning interface to switch between thinking modes.
  2. A multilingual interface to translate or process text in different languages.

Step 1: Hybrid reasoning demo 

In this step, we build our hybrid reasoning tab with /think and /no_think tags.

import gradio as gr
import subprocess

def reasoning_qwen3(prompt, mode):
    # Append the selected mode as a /think or /no_think suffix
    prompt_with_mode = f"{prompt} /{mode}"
    # Run the model through the Ollama CLI, passing the prompt on stdin
    result = subprocess.run(
        ["ollama", "run", "qwen3:8b"],
        input=prompt_with_mode.encode(),
        stdout=subprocess.PIPE
    )
    return result.stdout.decode()

reasoning_ui = gr.Interface(
    fn=reasoning_qwen3,
    inputs=[
        gr.Textbox(label="Enter your prompt"),
        gr.Radio(["think", "no_think"], label="Reasoning Mode", value="think")
    ],
    outputs="text",
    title="Qwen3 Reasoning Mode Demo",
    description="Switch between /think and /no_think to control response depth."
)

In the above code:

  • The function reasoning_qwen3() takes a user prompt and a reasoning mode ("think" or "no_think").
  • It appends the selected mode as a suffix to the prompt.
  • Then, the subprocess.run() method runs the command ollama run qwen3:8b, feeding the prompt as standard input.
  • Finally, the output (response from Qwen3) is captured and returned as a decoded string.

Once the output-generating function is defined, the gr.Interface() function wraps it into an interactive web UI by specifying input components—a Textbox for the prompt and a Radio button for selecting the reasoning mode—and mapping them to the function's inputs.
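One practical note: in thinking mode, Qwen3 typically wraps its chain of thought in <think>...</think> tags before the final answer, and the CLI output includes that block. If you want the UI to show only the final answer, you can strip it with a small helper; strip_thinking() below is a sketch added for illustration, assuming that tag format:

import re

def strip_thinking(text: str) -> str:
    # Remove the <think>...</think> block Qwen3 emits in thinking mode
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

You could call strip_thinking() on the value returned by reasoning_qwen3() before handing it to Gradio.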

Step 2: Multilingual application demo

Now, let’s set up our multilingual application tab.

import gradio as gr
import subprocess

def multilingual_qwen3(prompt, lang):
    # Prepend a translation instruction for non-English targets
    if lang != "English":
        prompt = f"Translate to {lang}: {prompt}"
    # Run the model through the Ollama CLI, passing the prompt on stdin
    result = subprocess.run(
        ["ollama", "run", "qwen3:8b"],
        input=prompt.encode(),
        stdout=subprocess.PIPE
    )
    return result.stdout.decode()

multilingual_ui = gr.Interface(
    fn=multilingual_qwen3,
    inputs=[
        gr.Textbox(label="Enter your prompt"),
        gr.Dropdown(["English", "French", "Hindi", "Chinese"], label="Target Language", value="English")
    ],
    outputs="text",
    title="Qwen3 Multilingual Translator",
    description="Use Qwen3 locally to translate prompts to different languages."
)

Similar to the previous step, this code works as follows:

  • The multilingual_qwen3() function takes a prompt and a target language.
  • If the target is not English, it prepends the instruction “Translate to {lang}:” to guide the model.
  • Again, the model runs locally via subprocess using Ollama.
  • The result is returned as plain text.
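Before wiring this into the tabbed UI, you can sanity-check the function directly from a Python shell, for example:

print(multilingual_qwen3("Good morning, how are you?", "French"))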

Step 3: Launch both tabs in Gradio

Let’s bring both tabs together in a single Gradio application.

# Combine both interfaces into a single app with two tabs
demo = gr.TabbedInterface(
    [reasoning_ui, multilingual_ui],
    tab_names=["Reasoning Mode", "Multilingual"]
)

# Launch locally with debugging enabled
demo.launch(debug=True)

Here is what we are doing in the above code:

  • The gr.TabbedInterface() function creates a UI with two tabs:
    • One for controlling reasoning depth.
    • One for multilingual prompt translation.
  • The demo.launch(debug=True) function runs the app locally and opens it in the browser with debugging enabled.

Local Gradio app with Qwen3
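To try it yourself, save the snippets from the three steps above into a single script (for example app.py; the filename is just a placeholder) and run it:

python app.py

Gradio will print a local URL (http://127.0.0.1:7860 by default) that you can open in your browser.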

Multilingual application demo

Conclusion

Qwen3 brings advanced reasoning, fast decoding, and multilingual support to your local machine using Ollama.

With minimal setup, you can:

  • Run local LLM inference without cloud dependency
  • Switch between fast and thoughtful responses
  • Use APIs or Python to build intelligent applications

To learn more about Qwen3, I recommend:


Author
Aashi Dutt

I am a Google Developers Expert in ML(Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.
