
How to Run GLM-5.2 Locally Using RunPod and llama.cpp

Run GLM-5.2 privately with llama.cpp, secure it with your own API key, test it through the Web UI and cURL, and connect it to OpenCode for a powerful local coding workflow.
Jun 30, 2026 · 9 min read

GLM-5.2 is Z.ai’s latest flagship open model, built for long-horizon coding, reasoning, and agentic engineering tasks. It comes with a 1M-token context window, multiple thinking modes, tool-calling support, and improvements designed to help the model stay consistent across large codebases and multi-step tasks. 

While the full model is massive, GGUF quantizations make it possible to run GLM-5.2 locally using llama.cpp on the right hardware.

GLM-5.2 official benchmark results

Source: GLM-5.2: Built for Long-Horizon Tasks 

In this guide, I will show you how to install the prebuilt llama.cpp package and use it to serve GLM-5.2 on a RunPod GPU instance.

You will start the server with an API key, test its OpenAI-compatible endpoint with cURL, and use llama.cpp’s built-in Web UI in your browser. 

Next, you will expose the server through RunPod’s proxy URL so it can be reached securely from your laptop or other applications. 

Finally, you will connect that hosted GLM-5.2 server to OpenCode running locally beside your project, allowing OpenCode to read files, edit code, run tests, and use your local shell while GLM-5.2 handles the reasoning remotely.

1. Configure a RunPod GPU Instance for GLM-5.2

Go to your RunPod dashboard and create a new Pod. Before launching it, make sure your account has at least $25 in credit, as GLM-5.2 requires a large multi-GPU setup.

Select a machine with 4× RTX PRO 6000 GPUs. The configuration used in this guide has:

  • 384 GB VRAM (4 × 96 GB)
  • 752 GB system RAM
  • At least 550 GB of disk space

Before deploying, edit the Pod template. Increase the container disk space to at least 550 GB and add the following under Expose HTTP Ports:

8910

This port will be used later for the llama.cpp server, Web UI, and OpenAI-compatible API.

For faster and more reliable model downloads, add your Hugging Face token as an environment variable in the template:

HF_TOKEN=your_hugging_face_token

Editing the RunPod PyTorch template

Once everything is configured, deploy the Pod. After it starts, click Connect and open JupyterLab. Launch a new terminal and run:

nvidia-smi

You should see all four RTX PRO 6000 GPUs listed and available. This confirms that the Pod is ready to download and run GLM-5.2.

nvidia-smi report of all four RTX PRO 6000 GPUs
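If you prefer a scripted check over reading the table by eye, the sketch below counts the GPUs reported by nvidia-smi -L, which prints one line per device. The helper names count_gpus and check_gpus are made up for this example.

```shell
# Count GPUs from `nvidia-smi -L`-style output read on stdin.
# Each device is reported on its own line starting with "GPU <index>:".
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:'
}

# Hypothetical helper: succeed only when the expected number of GPUs is visible.
check_gpus() (
  expected="$1"
  found="$(nvidia-smi -L 2>/dev/null | count_gpus)"
  if [ "$found" -eq "$expected" ]; then
    echo "OK: $found GPUs visible"
  else
    echo "Expected $expected GPUs, found $found" >&2
    return 1
  fi
)

# Usage on the Pod: check_gpus 4
```

If the count is lower than four, it is safer to redeploy the Pod than to continue, since the server command later in this guide assumes four devices.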

2. Install llama.cpp to Serve the GLM-5.2 Model

Rather than compiling llama.cpp from source, install the latest prebuilt version using the official llama.app installer. Run the following command in your JupyterLab terminal:

curl -LsSf https://llama.app/install.sh | sh

Next, add the llama.cpp installation folder to your PATH so you can run the llama command from any terminal:

echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc

Reload your Bash configuration to apply the change:

source ~/.bashrc

Finally, confirm that llama.cpp was installed correctly:

llama help

You should see the available llama.cpp commands.

llama available commands
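One small caveat with the PATH step: running the echo command more than once appends the same line to ~/.bashrc each time. A minimal sketch of an idempotent alternative is shown here; append_once is a hypothetical helper name, not part of llama.cpp.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() (
  line="$1"
  file="$2"
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
)

# Usage:
# append_once 'export PATH="$HOME/.local/bin:$PATH"' "$HOME/.bashrc"
```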

3. Configure Your Hugging Face Cache and API Security

Next, configure a persistent location for the model files. 

RunPod’s /workspace directory remains available even when you pause the pod, so it is a better place to store the Hugging Face cache than the default location.

Run the following commands in the JupyterLab terminal:

export HF_HOME="/workspace/huggingface"
mkdir -p "$HF_HOME"

This ensures that downloaded model files are stored in /workspace/huggingface.
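Before starting a download of this size, it can help to confirm that the disk backing the cache directory has enough free space. The sketch below reads the available space from df; the helper names free_gb and has_free_gb are assumptions for illustration, and the 550 GB threshold is the figure used earlier in this guide.

```shell
# Print the free space, in whole GB, of the filesystem that holds a path.
free_gb() {
  # -P gives one line per filesystem; -k reports 1024-byte blocks.
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when the path has at least the requested number of free GB.
has_free_gb() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Usage on the Pod:
# has_free_gb /workspace 550 || echo "Not enough free space under /workspace" >&2
```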

Now create an API key for your llama.cpp server. Use a long, random value and keep it private, as you will need the same key later when testing the API and connecting OpenCode:

export LLAMA_API_KEY="replace-this-with-a-long-random-secret"
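Rather than typing a secret by hand, you can generate one. The following is a sketch, not the only way to do it: it prefers openssl when available and otherwise reads from /dev/urandom. The function name generate_api_key is made up for this example.

```shell
# Print a 64-character hex string built from 32 random bytes.
generate_api_key() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -hex 32
  else
    # Fallback: read 32 bytes from the kernel's random source and hex-encode them.
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
    echo
  fi
}

# Usage:
# export LLAMA_API_KEY="$(generate_api_key)"
# echo "$LLAMA_API_KEY"   # copy it somewhere safe; you will need it on your laptop
```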

Finally, set a simple alias for the model:

export MODEL_ALIAS="glm-5.2-iq3s"

OpenCode will use this exact model alias later, so keep it unchanged throughout the guide.
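Variables set with export exist only in the terminal where you typed them, and later steps use a second terminal. One option, sketched below, is to save the three values to a file that you can source again. The helper name save_env and the path /workspace/glm-env.sh are hypothetical, and the sketch assumes none of the values contain a single quote.

```shell
# Write the three settings to a file readable only by the current user.
save_env() (
  file="$1"
  umask 077
  cat > "$file" <<EOF
export HF_HOME='$HF_HOME'
export LLAMA_API_KEY='$LLAMA_API_KEY'
export MODEL_ALIAS='$MODEL_ALIAS'
EOF
  chmod 600 "$file"
)

# Usage in the first terminal:  save_env /workspace/glm-env.sh
# Usage in any later terminal:  . /workspace/glm-env.sh
```

Because the file contains your API key in plain text, treat it like any other credential and delete it when you terminate the Pod.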

4. Run the GLM-5.2 GGUF Model with llama.cpp

You are now ready to start the GLM-5.2 server. Run the following command in the same terminal:

CUDA_VISIBLE_DEVICES=0,1,2,3 llama serve \
  -hf unsloth/GLM-5.2-GGUF:UD-IQ3_S \
  --alias "$MODEL_ALIAS" \
  --host 0.0.0.0 \
  --port 8910 \
  --api-key "$LLAMA_API_KEY" \
  --n-gpu-layers 999 \
  --split-mode layer \
  --tensor-split 1,1,1,1 \
  --ctx-size 100000 \
  --parallel 1 \
  --flash-attn on \
  --jinja

The first time you run this command, llama.cpp will download the UD-IQ3_S GGUF quantization of GLM-5.2 from Hugging Face and store it in the cache directory you configured earlier. 

The download may take some time because the model is very large.

llama.cpp downloads the model shards before loading them into memory

After the download finishes, llama.cpp will load the model across all four GPUs. The --split-mode layer and --tensor-split 1,1,1,1 settings divide the model evenly across the available GPUs, while Flash Attention helps improve performance.
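The values passed to --tensor-split are proportions rather than absolute sizes, so 1,1,1,1 asks for an equal share on each GPU. The short sketch below turns a ratio list into percentages to make that concrete; split_shares is a hypothetical helper, and the real distribution is only approximately equal because whole layers are assigned to each device.

```shell
# Convert a comma-separated ratio list into the share each GPU is asked to hold.
split_shares() {
  printf '%s\n' "$1" | awk -F',' '{
    total = 0
    for (i = 1; i <= NF; i++) total += $i
    for (i = 1; i <= NF; i++) printf "GPU %d: %.1f%%\n", i - 1, 100 * $i / total
  }'
}

split_shares "1,1,1,1"
# GPU 0: 25.0%
# GPU 1: 25.0%
# GPU 2: 25.0%
# GPU 3: 25.0%
```

On a machine with mismatched GPUs you would change the ratios instead, for example 3,1 to place roughly three quarters of the model on the first device.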

Once the model has loaded successfully, the local server will be available at:

http://127.0.0.1:8910

The llama.cpp server is running and providing access to the GLM-5.2 model

The server is protected by the API key you set earlier. Keep this terminal open while using the model, as closing it will stop the server.
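Loading a model of this size takes a while, so it is useful to know when the server is actually ready. The sketch below polls a URL from another terminal until it returns a success status. It assumes the /health route of llama.cpp's server, which reports an error status while the model is still loading; wait_for_server is a made-up helper name.

```shell
# Poll a URL until it answers with a success status, or give up.
# Arguments: URL, number of attempts, seconds to sleep between attempts.
wait_for_server() (
  url="$1"
  attempts="$2"
  delay="$3"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP error codes such as 503.
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "Server is ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "Server did not become ready after $attempts attempts" >&2
  return 1
)

# Usage on the Pod, from a second terminal:
# wait_for_server "http://127.0.0.1:8910/health" 120 10
```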

5. Open the llama.cpp Web UI

Open your RunPod Pod and go to the Connect tab. Under the exposed HTTP ports, click the link associated with port 8910. This will open the llama.cpp Web UI in your browser.

Opening the RunPod proxy link for port 8910

The URL will follow this format:

https://YOUR_POD_ID-8910.proxy.runpod.net

Replace YOUR_POD_ID with your actual RunPod Pod ID if you need to enter the URL manually.
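If you would rather build the URL than copy it, the format above is easy to script. This sketch assumes that RunPod sets a RUNPOD_POD_ID environment variable inside the Pod; if it is not set in your environment, pass the ID yourself. The ID abc123 is a made-up placeholder.

```shell
# Build the RunPod proxy URL for a given Pod ID and exposed HTTP port.
proxy_url() {
  printf 'https://%s-%s.proxy.runpod.net\n' "$1" "$2"
}

# Falls back to the placeholder "abc123" when RUNPOD_POD_ID is not set.
proxy_url "${RUNPOD_POD_ID:-abc123}" 8910
```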

Setting the API key, which allows the Web UI to authenticate its requests

In the llama.cpp Web UI, open Settings and go to General. Paste the same API key that you used when starting the llama.cpp server. 

This allows the Web UI to authenticate its requests and communicate with the protected server.

You can now test the model with a simple coding prompt:

Write a Python function that validates an email address without external packages. 
Include three pytest tests.

Testing the GLM-5.2 model in the llama.cpp Web UI

In this setup, GLM-5.2 generated at around 41 tokens per second on average, which is a good speed for a model of this size. 

The response quality was also strong, producing a structured implementation with clear validation rules and test cases.

6. Test the Local API with cURL

Open a second terminal in JupyterLab. The first terminal must remain open because it is running the llama.cpp server.

In the new terminal, set the local API URL, reuse the same API key, and set the model alias:

export BASE_URL="http://127.0.0.1:8910/v1"
export LLAMA_API_KEY="replace-this-with-the-same-server-key"
export MODEL_ALIAS="glm-5.2-iq3s"

First, check that the server is running and that GLM-5.2 is available:

curl --fail-with-body -sS \
  "$BASE_URL/models" \
  -H "Authorization: Bearer $LLAMA_API_KEY"

You should see the model alias in the response:

glm-5.2-iq3s
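To check for the alias in a script instead of reading the JSON yourself, a rough text match is enough. The sketch below assumes the response follows the OpenAI-style shape, where each model appears with an "id" field. It is a pattern match rather than a real JSON parser, and has_model is a hypothetical helper name.

```shell
# Succeed if the JSON on stdin contains an "id" field equal to the given alias.
# Note: the alias is used as a regular expression, so "." matches any character.
has_model() {
  grep -Eq "\"id\"[[:space:]]*:[[:space:]]*\"$1\""
}

# Usage:
# curl --fail-with-body -sS "$BASE_URL/models" \
#   -H "Authorization: Bearer $LLAMA_API_KEY" \
#   | has_model "$MODEL_ALIAS" && echo "Model alias found"
```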

Next, send a test request to the OpenAI-compatible chat completions endpoint:

curl --fail-with-body -sS \
  --connect-timeout 15 \
  --max-time 600 \
  -X POST "$BASE_URL/chat/completions" \
  -H "Authorization: Bearer $LLAMA_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<JSON
{
  "model": "$MODEL_ALIAS",
  "messages": [
    {
      "role": "system",
      "content": "You are a precise senior software engineer."
    },
    {
      "role": "user",
      "content": "Write a Python function that validates an email address without external packages. Include three pytest tests."
    }
  ],
  "temperature": 0.2,
  "max_tokens": 1500,
  "stream": false
}
JSON

The server will return a JSON response containing the model’s answer. 

In this test, GLM-5.2 produced a structured Python implementation with validation logic and pytest test cases at an average generation speed of roughly 41 tokens per second.
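The speed figure also gives a quick way to sanity-check the --max-time value used in the request. As a back-of-the-envelope sketch, generation time is roughly the number of output tokens divided by the generation rate; this ignores prompt processing and network overhead. estimate_seconds is a made-up helper name.

```shell
# Estimate the seconds needed to generate N tokens at a given tokens-per-second rate.
estimate_seconds() {
  awk -v tokens="$1" -v rate="$2" 'BEGIN { printf "%.0f\n", tokens / rate }'
}

estimate_seconds 1500 41
# prints 37
```

At about 41 tokens per second, the 1,500-token limit in this request corresponds to well under a minute of generation, so the 600-second limit leaves plenty of headroom.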

This local URL only works inside the RunPod Pod. To call the same server from your laptop, OpenCode, or another external application, use the RunPod proxy URL instead:

export BASE_URL="https://YOUR_POD_ID-8910.proxy.runpod.net/v1"

Replace YOUR_POD_ID with your actual RunPod Pod ID, and continue using the same API key in the Authorization header.
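Since the proxy URL is reachable from the internet, it is worth confirming that the server really rejects requests that carry no key. The sketch below sends an unauthenticated request and interprets the status code. It assumes the server answers such requests with 401; both function names are hypothetical.

```shell
# Map an HTTP status code to a short verdict.
classify_status() {
  case "$1" in
    401|403) echo "protected" ;;
    200)     echo "open" ;;
    *)       echo "unexpected ($1)" ;;
  esac
}

# Send a request with no Authorization header and print only the status code.
unauthenticated_status() {
  curl -sS -o /dev/null -w '%{http_code}' --max-time 15 \
    -X POST "$1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{}'
}

# Usage:
# classify_status "$(unauthenticated_status "$BASE_URL")"
```

A result of "open" would mean the endpoint accepts requests without your key, in which case you should stop the server and check the --api-key setting before continuing.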

7. Install and Connect OpenCode to GLM-5.2

Install OpenCode on the computer where your code project is stored. Open a terminal and run:

curl -fsSL https://opencode.ai/install | bash

Next, move into your project folder:

cd /path/to/your/project

Export the same API key that you used when starting the llama.cpp server on RunPod:

export LLAMA_API_KEY="replace-with-the-same-server-key"

OpenCode runs locally alongside your project, while GLM-5.2 continues to run remotely on your RunPod Pod. This setup allows OpenCode to read your files, edit code, run tests, and use your local terminal, while GLM-5.2 handles the reasoning through the secured RunPod API.

Create a file named opencode.json in your project root and add the following configuration:

{
  "$schema": "https://opencode.ai/config.json",

  "enabled_providers": ["llama-runpod"],

  "provider": {
    "llama-runpod": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "GLM-5.2 on RunPod",

      "options": {
        "baseURL": "https://YOUR_POD_ID-8910.proxy.runpod.net/v1",
        "apiKey": "{env:LLAMA_API_KEY}",
        "timeout": 600000,
        "chunkTimeout": 120000
      },

      "models": {
        "glm-5.2-iq3s": {
          "name": "GLM-5.2 UD-IQ3_S",
          "limit": {
            "context": 100000,
            "output": 32000
          }
        }
      }
    }
  },

  "model": "llama-runpod/glm-5.2-iq3s",
  "small_model": "llama-runpod/glm-5.2-iq3s"
}

Replace YOUR_POD_ID with your actual RunPod Pod ID. The URL must match the RunPod proxy URL you used to open the llama.cpp Web UI.
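If you restart Pods often, the Pod ID changes and the placeholder has to be filled in again. A small sketch for doing that from the terminal is shown here. It assumes the Pod ID contains only letters and digits; set_pod_id is a made-up helper and abc123 is a placeholder ID.

```shell
# Replace the YOUR_POD_ID placeholder in a config file with a real Pod ID.
set_pod_id() (
  file="$1"
  pod_id="$2"
  tmp="$(mktemp)"
  # Write to a temporary file first, then copy back so the original
  # file keeps its permissions.
  sed "s/YOUR_POD_ID/$pod_id/g" "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
)

# Usage, run in the project root:
# set_pod_id opencode.json abc123
```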

Once the opencode.json file is saved, open a terminal in the same project folder and start OpenCode:

Launching the local terminal in Windows 11

opencode

Then run:

/models

Select:

GLM-5.2 UD-IQ3_S

Selecting the running GLM-5.2 model in OpenCode

OpenCode is now connected to your GLM-5.2 server. It will use the remote model for reasoning while keeping project files, terminal commands, code edits, and test execution on your own laptop.

8. Test OpenCode as a Coding Agent 

Start with a simple test to confirm that OpenCode can reach your GLM-5.2 server and return a response.

In OpenCode, type:

hey

Testing the GLM-5.2 model in OpenCode

Next, ask OpenCode to inspect and explain your existing project:

Explain the project in 3-5 short bullet points, including its purpose, main technologies, 
entry point, and how the main parts work together.

Testing the GLM-5.2 model in OpenCode

OpenCode reads the project files and gives a concise overview instead of guessing. In this example, it correctly identified that the project is a bilingual English/Urdu scam-checking assistant for Pakistani notices, bills, SMS messages, and bank alerts. 

It also explained the main stack, the app.py entry point, the assessment flow, and the supporting test and telemetry files.

Prompt: 

Suggest one useful new feature that fits the project's current scope.

Testing the GLM-5.2 model in OpenCode

It suggested a useful feature: a local directory of verified official sender IDs, bank helplines, courier headers, and public short codes. 

To test OpenCode on a larger task, create a new project folder on your laptop:

mkdir ml-app
cd ml-app
opencode

Then give OpenCode the following prompt:

Build and test a complete Python-based web UI for this machine learning application.

Building the new project from scratch using the GLM-5.2 model in OpenCode

OpenCode first creates a task list and breaks the project into manageable steps. 

It then creates the required application files, machine-learning logic, Streamlit interface, dependencies, and test suite. 

Once the implementation is complete, it runs the tests, fixes any issues it finds, and provides a clear summary of the finished project along with the command needed to launch it. 

Building the new project from scratch using the GLM-5.2 model in OpenCode

In this test, all 10 tests passed, and OpenCode verified that the Streamlit application launched successfully. Start the machine learning application with:

streamlit run app.py

The resulting application looks clean and works as expected. 

Machine learning app created by the GLM-5.2 model in OpenCode

Even with the 3-bit quantized version of GLM-5.2, the reasoning quality was strong in these tests. 

It understood the existing project, proposed a relevant feature, created a complete web application, used tools to inspect and modify files, and ran tests to verify its work. 

Final Thoughts

This setup gives you something that standard API providers do not: your own privately hosted GLM-5.2 server.

Instead of sending every request to a shared model platform with fixed limits, model settings, and per-token pricing, you rent the GPU machine, deploy the model yourself, and control the complete serving stack. 

You choose the model quantization, GPU configuration, context window, server settings, API key, and who can access the endpoint.

Your code, prompts, project context, and API responses remain within the infrastructure you control: your own laptop and your own RunPod deployment. 

They are not sent to an additional hosted inference provider for processing. This is especially useful when you are working with private repositories, internal tools, sensitive code, or company data.

You also avoid the cost and effort of buying, running, and maintaining a high-end multi-GPU server yourself. 

Instead, you can rent powerful GPUs only when you need them, serve GLM-5.2 with llama.cpp, secure the endpoint with your own API key, and connect from your laptop through OpenCode.

In this guide, you configured a multi-GPU RunPod machine, installed the prebuilt llama.cpp package, downloaded and served the GLM-5.2 GGUF model, and protected the server with an API key. 

You then tested the model through both the llama.cpp Web UI and its OpenAI-compatible cURL API before exposing the secured RunPod URL for external access.

Finally, you connected that private model endpoint to OpenCode running on your laptop. This creates a practical hybrid workflow: GLM-5.2 runs on powerful rented GPUs, while OpenCode stays inside your local project and can inspect files, edit code, run tests, and use your shell. 

You get the performance of a top-tier model, the flexibility of self-hosting, and far more control than you would have with a standard hosted API.

FAQs

How large is the GLM-5.2 model, and what is its architecture?

GLM-5.2 is a massive Mixture-of-Experts (MoE) model with roughly 744 to 753 billion total parameters. However, the MoE architecture ensures that only about 40 billion parameters are "active" for any given token during inference. This design gives the model the vast knowledge capacity of a 700B+ model while keeping its compute requirements closer to those of a 40B dense model.

What license is GLM-5.2 released under?

Unlike many "open-weight" models that come with restrictive community or non-commercial clauses, GLM-5.2 was released under the highly permissive MIT License. This means you are free to self-host, heavily modify, and use the model for full commercial applications without worrying about enterprise lock-in or restrictive usage policies.

How does GLM-5.2 perform against frontier closed-source models?

GLM-5.2 is currently regarded as the strongest open-weight model for agentic engineering and long-horizon coding tasks. On major real-world coding benchmarks like SWE-bench Pro and Terminal-Bench 2.1, it decisively outperforms GPT-5.5 and lands within just a few percentage points of Claude Opus 4.8.

How does the model efficiently process a massive 1-million-token context?

Processing a million tokens usually results in a massive computational bottleneck. To solve this, GLM-5.2 introduces an architectural innovation called IndexShare. Instead of computing a separate attention index for every single layer, the model reuses the same lightweight indexer across every four sparse attention layers. This reduces the per-token computational load (FLOPs) by nearly 2.9x at extreme context lengths, making project-wide reasoning economically viable.


Author
Abid Ali Awan

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.
