
How to Set Up and Run QwQ 32B Locally With Ollama

Learn how to install, set up, and run QwQ-32B locally with Ollama and build a simple Gradio application.
Mar 10, 2025 · 12 min read

QwQ-32B is Qwen’s reasoning model, and it’s designed to excel in complex problem-solving and reasoning tasks. Despite having only 32 billion parameters, the model achieves performance comparable to the much larger DeepSeek-R1, which has 671 billion parameters.

In this tutorial, I’ll guide you through setting up and running QwQ-32B locally using Ollama, a tool that simplifies local LLM inference. This guide includes: 

  • Running via terminal or IDE: concise code snippets for running QwQ-32B from the terminal or an IDE of your choice.
  • Hands-on demo: a demo project that uses QwQ-32B's structured thinking capabilities.

Why Run QwQ-32B Locally?

Despite its size, QwQ-32B can be quantized to run efficiently on consumer hardware. Running QwQ-32B locally gives you complete control over model execution without dependency on external servers. Here are a few advantages to running QwQ-32B locally:

  • Privacy & security: No data leaves your system.
  • Uninterrupted access: Avoid rate limits, downtime, or service disruptions.
  • Performance: Get faster responses with local inference, avoiding API latency.
  • Customization: Modify parameters, fine-tune prompts, and integrate the model into local applications.
  • Cost efficiency: Eliminate API fees by running the model locally.
  • Offline availability: Work without an internet connection once the model is downloaded.

Setting Up QwQ-32B Locally With Ollama

Ollama simplifies running LLMs locally by handling model downloads, quantization, and execution.

Step 1: Install Ollama

Download and install Ollama from its official website.

Downloading Ollama

Once the download is complete, install the Ollama application as you would any other application.

Step 2: Download and run QwQ-32B

Let’s test the setup and download our model. Launch the terminal and type the following command to download and run the QwQ-32B model:

ollama run qwq:32b

Downloading the Qwen QwQ-32B model via Ollama

QwQ-32B is a large model. If your system has limited resources, you can opt for a smaller quantized version. For instance, below we use the Q4_K_M version (tagged 32b-q4_K_M on Ollama), a 19.85 GB model that balances performance and size:

ollama run qwq:32b-q4_K_M

QwQ-32B quantized versions

Source: Hugging Face

You can find more quantized versions on Hugging Face.
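
After a download finishes, you can confirm which model variants are on disk with Ollama's list command:

ollama list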

Step 3: Running QwQ-32B in the background

To run QwQ-32B continuously and serve it via an API, start the Ollama server:

ollama serve

This makes the model available over a local API, which we'll use in the applications discussed in the next section.
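
To verify the server is reachable, you can query its model-listing endpoint (Ollama listens on port 11434 by default):

curl http://localhost:11434/api/tags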

Using QwQ-32B Locally

Now that QwQ-32B is set up, let's explore how to interact with it.

Step 1: Running inference via CLI

Once the model is downloaded, you can interact with the QwQ-32B model directly in the terminal:

ollama run qwq
How many r's are in the word "strawberry"?

QwQ-32B model running in the terminal

The model's response generally begins with its reasoning (wrapped in <think> </think> tags), followed by the final answer.
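
For the question above, the output looks roughly like this (the exact reasoning text varies from run to run):

<think>
Let me spell the word out: s-t-r-a-w-b-e-r-r-y. There is one r in "str" and two more in "berry", so that makes three.
</think>

There are 3 r's in the word "strawberry".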

Step 2: Accessing QwQ-32B via API

To integrate QwQ-32B into applications, you can use the Ollama API with curl. Run the following curl command in your terminal.

curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "qwq",
  "messages": [{"role": "user", "content": "Explain Newton second law of motion"}], 
  "stream": false
}'

curl is a command-line tool for making HTTP requests directly from the terminal. It comes preinstalled on macOS and most Linux distributions (and recent versions of Windows), which makes it a convenient way to interact with APIs.

Running the QwQ-32B model using a curl command

Note: Make sure the quotation marks are balanced and that the port matches the one Ollama is serving on (11434 by default). Unbalanced quotes will leave your shell stuck at a dquote> prompt.

Step 3: Running QwQ-32B with Python

You can also call Ollama from Python in any integrated development environment (IDE). Install the Ollama Python package with pip:

pip install ollama

Once Ollama is installed, use the following script to interact with the model:

import ollama

# Send a single-turn chat request to the locally running QwQ model
response = ollama.chat(
    model="qwq",
    messages=[
        {"role": "user", "content": "Explain Newton's second law of motion"},
    ],
)

# The reply text is stored under the "message" key
print(response["message"]["content"])

The ollama.chat() function takes the model name and a user prompt, processing it as a conversational exchange. The script then extracts and prints the model's response.
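
For long reasoning traces, it can be nicer to see tokens as they are generated rather than waiting for the full reply. The ollama package supports this via streaming; here's a minimal sketch:

import ollama

# Stream the response chunk by chunk instead of waiting for the full reply
stream = ollama.chat(
    model="qwq",
    messages=[{"role": "user", "content": "Explain Newton's second law of motion"}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)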

Running the QwQ-32B model using Python

Running a Logical Reasoning App With QwQ-32B Locally

We can create a simple logical reasoning assistant using QwQ-32B and Gradio, which will accept a user's question and generate a structured, logical response. The application will use QwQ-32B's stepwise thinking approach to provide clear, well-reasoned answers, making it useful for problem-solving, tutoring, and AI-assisted decision-making.

Step 1: Prerequisites

Before diving into the implementation, let’s ensure that we have the following tools and libraries installed:

  • Python 3.8+
  • Gradio: to create a user-friendly web interface.
  • Ollama: the Python client for the locally running Ollama server.

Run the following commands to install the necessary dependencies:

pip install gradio ollama

Once the above dependencies are installed, run the following import commands:

import gradio as gr
import ollama
import re

Step 2: Querying QwQ-32B using Ollama

Now that we have our dependencies in place, we will build a query function to pass our question on to the model and get a structured response.

def query_qwq(question):
    response = ollama.chat(
        model="qwq",
        messages=[{"role": "user", "content": question}]
    )
    full_response = response["message"]["content"]
    # Extract the <think> part and the final answer
    think_match = re.search(r"<think>(.*?)</think>", full_response, re.DOTALL)
    think_text = think_match.group(1).strip() if think_match else "Thinking process not explicitly provided."
    final_response = re.sub(r"<think>.*?</think>", "", full_response, flags=re.DOTALL).strip()
    return think_text, final_response

The query_qwq() function interacts with the Qwen QwQ-32B model via Ollama, sending a user-provided question and receiving a structured response. It extracts two key components:

  1. Thinking process: the model's reasoning steps, extracted from the <think>...</think> tags.
  2. Final response: the structured final answer that follows the reasoning, with the <think> section stripped out.

Separating the reasoning steps from the final response keeps the model's path to its conclusions transparent.
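
Before wiring up the UI, you can sanity-check the helper directly from a Python shell; the question below is just an example:

# Quick sanity check of the helper function
think_text, final_response = query_qwq("If all Blips are Blops and some Blops are Bleps, are all Blips Bleps?")
print("Thinking:\n", think_text)
print("Answer:\n", final_response)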

Step 3: Creating the Gradio interface

Now that we have the core function set up, we will build the Gradio UI.

interface = gr.Interface(
    fn=query_qwq,
    inputs=gr.Textbox(label="Ask a logical reasoning question"),
    outputs=[gr.Textbox(label="Thinking Process"), gr.Textbox(label="Final Response")],
    title="QwQ-32B Powered: Logical Reasoning Assistant",
    description="Ask a logical reasoning question and the assistant will provide an explanation."
)
interface.launch(debug=True)

This Gradio interface sets up a logical reasoning assistant that takes a question via gr.Textbox() and processes it using the query_qwq() function. Finally, interface.launch() starts the Gradio app with debugging enabled, allowing real-time error tracking and logs for troubleshooting.
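
If you want to try the demo from another device, Gradio can also generate a temporary public link for the app; just pass share=True to launch():

interface.launch(debug=True, share=True)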

QwQ-32B demo with Gradio

Conclusion

Running QwQ-32B locally with Ollama enables private, fast, and cost-effective model inference. With this tutorial, you can explore its advanced reasoning capabilities in real time. This model can be used for applications in AI-assisted tutoring, logic-based problem-solving, and more.


Author: Aashi Dutt

I am a Google Developers Expert in ML(Gen AI), a Kaggle 3x Expert, and a Women Techmakers Ambassador with 3+ years of experience in tech. I co-founded a health-tech startup in 2020 and am pursuing a master's in computer science at Georgia Tech, specializing in machine learning.
