GGUF Format: A Complete Guide to Local LLM Inference

GGUF packages model weights, tokenizer data, and metadata into a single portable file. Learn how to choose the right quantization level and get started with Ollama.
Jun 17, 2026  · 15 min read

So, you’ve found a 7B-parameter language model you want to try locally. You now face a problem: FP16 weights alone are around 14 GB, and your laptop only has 16 GB of RAM. 

Even before accounting for your operating system, inference runtime, context cache, and temporary buffers, the model is already pushing the limits of your hardware. This is exactly the problem GGUF was designed to solve.

GGUF has become one of the most important formats for running open-weight large language models locally. Instead of needing an enterprise GPU or a cloud API, GGUF makes it practical to run quantized models on laptops, desktops, Apple Silicon machines, and even some mobile or edge devices.

In this article, I will introduce the GGUF format and how it works, explain how quantization reduces model size and how to choose the right quantization level, and finally show how to get started with Ollama and llama.cpp.

In a Nutshell

  • GGUF (GGML Unified Format) is a binary file format that packages model weights, tokenizer data, architecture metadata, and quantization info into a single portable file
  • It replaced the older GGML format in 2023 and is now the dominant format for distributing quantized LLMs on Hugging Face
  • GGUF is used by llama.cpp, Ollama, LM Studio, GPT4All, KoboldCpp, and other local inference tools
  • Quantization is the key feature: a 7B model in FP16 is ~14 GB; a Q4_K_M version is ~4–5 GB
  • Common quantization levels range from Q2_K (smallest, lowest quality) to Q8_0 (largest, near full precision) — Q4_K_M is the standard starting point for most hardware
  • GGUF runs on CPUs, Apple Silicon (Metal), NVIDIA GPUs (CUDA), AMD GPUs (ROCm/Vulkan), and more
  • Choosing the right quant level means balancing memory, output quality, inference speed, and context length

What Is GGUF?

GGUF, short for GGML Unified Format, is a binary file format that packages model weights, tokenizer data, architecture metadata, and quantization information into a single, portable file for inference with GGML-based runtimes, especially llama.cpp.

GGUF solves an LLM deployment problem. Many model formats require users to keep several files together, including model weights, tokenizer files, configuration files, and architecture-specific loading code. GGUF simplifies this by making the model file largely self-describing.

A GGUF file typically contains:

  • Model tensors
  • Quantized or unquantized weights
  • Tokenizer vocabulary
  • Tokenizer configuration
  • Model architecture metadata
  • Context length settings
  • Embedding dimensions
  • Attention head counts
  • RoPE configuration
  • Tensor names, shapes, and data types

The key idea is that the file describes itself. The runtime can inspect the metadata, understand the architecture, load the tokenizer, and map the tensors without relying on a separate config.json or tokenizer folder.

This does not mean every GGUF file is universally compatible with every runtime forever. The runtime still needs to support the model architecture and tensor types used in the file. However, GGUF makes that compatibility far easier than older formats because the file carries much more structured information.

Four defining characteristics of GGUF are:

  1. Single-file deployment
  2. Memory mapping support for efficient loading
  3. Extensible typed key-value metadata
  4. Support for many quantization types, from aggressive low-bit formats to full precision

GGUF was introduced as part of the llama.cpp and GGML ecosystem in 2023. It is now the dominant format for distributing quantized local LLMs on Hugging Face.

GGUF vs. GGML

The GGML (Georgi Gerganov Machine Learning) format was the predecessor to GGUF. It was important because it helped make early local inference possible. However, it had practical limitations as the ecosystem expanded beyond the original LLaMA models.

Common GGML pain points included:

  • Less flexible metadata handling
  • More architecture-specific loading assumptions
  • Tokenizer and configuration handling that was less self-contained
  • Harder extensibility as new model families appeared

GGUF addressed those limitations with a more structured format. It introduced typed metadata, better tokenizer embedding, and a clearer file layout. This made it easier for llama.cpp and related tools to support more architectures without constantly redesigning the loading pipeline.

For users, the important difference is simple: GGUF is the modern format. If you are downloading models today, you should almost always choose GGUF rather than older GGML files.

GGUF vs. GPTQ and AWQ

If you have researched model formats, you have probably come across GGUF, GPTQ (Generative Post-Training Quantization), and AWQ (Activation-Aware Weight Quantization). They are often discussed together because all three are used to make LLM inference more efficient. However, they do not belong to the same category.

GGUF is primarily a file format and deployment container. It supports many quantization types and is closely associated with llama.cpp-style local inference.

GPTQ and AWQ are quantization methods and ecosystems commonly used for GPU-optimized inference, especially on NVIDIA hardware through frameworks such as Transformers, ExLlama, AutoGPTQ, and vLLM-compatible workflows.

| Feature | GGUF | GPTQ | AWQ |
| --- | --- | --- | --- |
| Primary target | Portable local inference | GPU inference | GPU inference |
| Common hardware | CPU, Apple Silicon, NVIDIA, AMD, Vulkan, mobile | NVIDIA GPUs | NVIDIA GPUs |
| CPU support | Strong | Limited | Limited |
| Portability | Very high | Moderate | Moderate |
| Typical ecosystem | llama.cpp, Ollama, LM Studio, GPT4All | Transformers, ExLlama, AutoGPTQ | Transformers, TensorRT-LLM-style workflows |
| GPU throughput | Good, especially with offload | Often very strong | Often very strong |
| Best use case | Local and mixed-hardware inference | High-throughput GPU serving | High-throughput GPU serving |

If your goal is maximum compatibility across laptops, desktops, Apple Silicon, and mixed hardware, GGUF is usually the safer choice.

If your goal is maximum throughput on dedicated NVIDIA inference servers, GPTQ, AWQ, FP8, or other GPU-optimized serving formats may be more appropriate.

Why Use GGUF?

GGUF became popular because it solves practical deployment problems. I have also found it convenient for deploying models locally without a messy setup.

Running local LLMs used to involve fragmented tooling, large uncompressed weights, incompatible model formats, and complicated setup steps. GGUF can now help you standardize a large part of that workflow.

Instead of thinking about many separate files and loading scripts, users can focus on selecting the right model, choosing a quantization level, and running inference.

Run models locally

GGUF allows you to run LLMs on your own machine. This means:

  • No per-token API cost
  • No dependency on a hosted inference provider
  • No need to send prompts to a third-party API
  • Offline inference is possible after the model is downloaded

This is especially useful for privacy-sensitive workflows. Developers may not want to send proprietary code, internal documents, customer records, or confidential prompts to an external API.

Local inference is not automatically secure by itself. You still need to manage your machine, logs, applications, and access control properly. But GGUF makes private local deployment much more accessible.

For hands-on practice running models locally, see our tutorials on serving Mistral Medium 3.5 with SGLang, running DeepSeek V4 Flash locally, running the efficient Bonsai 1-bit model on an old laptop, and running MiniMax M2 locally as a coding assistant.

Hardware flexibility

GGUF is useful because it works across many hardware configurations.

Depending on the runtime and backend, GGUF models can run on:

  • CPU-only machines
  • NVIDIA GPUs through CUDA
  • Apple Silicon through Metal
  • AMD GPUs through HIP or Vulkan
  • Intel GPUs through SYCL or Vulkan
  • Some ARM and mobile environments

This flexibility is a major reason llama.cpp became influential. It was not designed only for high-end server GPUs. It was designed to make local inference possible on a broad range of hardware.

For example, a Mac user may rely on Metal acceleration, while a Linux desktop user may use CUDA or Vulkan. A CPU-only user may still run smaller quantized models, although generation speed will be slower.

Broad ecosystem support

GGUF is supported by many local inference tools. Examples include:

  • llama.cpp for command-line and server inference
  • Ollama for CLI-first model management and API access
  • LM Studio for a desktop GUI
  • GPT4All for privacy-focused local chat
  • KoboldCpp for local roleplay and text-generation workflows
  • Jan and Open WebUI for local AI interfaces

This matters because users are not locked into one interface. The same general model format can be used across different workflows.

A developer might benchmark a model with llama.cpp, chat with it in LM Studio, serve it through Ollama, and connect it to a browser UI through Open WebUI.

Hugging Face distribution

Hugging Face has become a major distribution hub for GGUF models.

Many popular open-weight models receive community-uploaded GGUF variants shortly after release. These repositories often include several quantization options so users can pick a model that fits their hardware.

Common upload variants include:

  • Q4_K_M
  • Q5_K_M
  • Q6_K
  • Q8_0
  • IQ4_XS
  • IQ3_M
  • IQ2_XXS

This means manual conversion is often unnecessary. For the most popular models, someone in the community has already created GGUF files for common quantization levels.

Granular size-quality control

GGUF gives users fine-grained control over the size-quality tradeoff. You can choose:

  • Smaller quantizations for low-memory machines
  • Mid-range quantizations for balanced daily use
  • Higher-bit quantizations for coding, reasoning, or structured output
  • Full or near-full precision when memory is not a constraint

This flexibility is one of the format's biggest advantages. Instead of one fixed deployment target, GGUF lets users adapt the same model family to many hardware tiers.

How Does GGUF Work?

A GGUF file is organized into three major parts:

  1. Header
  2. Metadata and tensor information
  3. Tensor data

The exact structure is defined by the GGUF specification. The important idea is that metadata and tensor information appear before the raw tensor data, allowing a runtime to understand what it is about to load.

The header

The header identifies the file as GGUF and tells the runtime how to parse the rest of the file. It includes:

  • Magic number for GGUF
  • Format version
  • Tensor count
  • Metadata key-value count

Modern GGUF files commonly use GGUF version 3.

Inference engines check the magic number first. If the file does not begin with the expected GGUF identifier, the runtime can reject it before trying to parse tensors or allocate memory.

This is a simple but important safety and reliability step. It prevents a runtime from accidentally treating an unrelated binary file as a model.
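The header check described above can be sketched in a few lines of Python. This is a minimal illustration, not the llama.cpp loader: it assumes the little-endian layout used by GGUF version 2 and later (a 4-byte magic, a 32-bit version, then two 64-bit counts), and the function names are made up for this example.

```python
import struct

GGUF_MAGIC = b"GGUF"          # the first four bytes of a GGUF file
HEADER_SIZE = 4 + 4 + 8 + 8   # magic + version + tensor count + KV count


def parse_gguf_header(raw: bytes) -> dict:
    """Parse the fixed-size header from the first bytes of a file."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("file too short to contain a GGUF header")
    if raw[:4] != GGUF_MAGIC:
        raise ValueError("bad magic number: not a GGUF file")
    # Assumed little-endian: uint32 version, uint64 tensor count, uint64 KV count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


def read_gguf_header(path: str) -> dict:
    """Read only the header bytes from disk; tensor data is never touched."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling read_gguf_header on a model file should return the version and the two counts without loading any weights, which is why rejecting a wrong file is so cheap.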

Metadata key-value pairs

GGUF metadata is a typed key-value store. This metadata can describe:

  • General model information
  • Architecture family
  • Context length
  • Embedding size
  • Number of layers
  • Number of attention heads
  • RoPE parameters
  • Tokenizer vocabulary
  • Special tokens
  • Quantization information

Keys are usually namespaced. Examples include:

  • general.architecture
  • general.alignment
  • llama.context_length
  • tokenizer.ggml.tokens

Namespacing is important because it allows GGUF to support many architectures without changing the entire file format. A LLaMA-family model can use llama.* keys, while other model families can use their own architecture-specific metadata.

This is one reason GGUF adapted well to models beyond the original LLaMA family, including architectures such as Qwen, Mistral, Gemma, DeepSeek, Phi, and others.
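To make the namespacing idea concrete, here is a small sketch of how a loader might resolve an architecture-specific key. The metadata dictionary is hand-written for illustration, and arch_value is a hypothetical helper, not part of any library; a real runtime would read these values from the file.

```python
def arch_value(metadata: dict, suffix: str):
    """Look up an architecture-specific value such as '<arch>.context_length'."""
    arch = metadata["general.architecture"]   # for example "llama"
    return metadata[f"{arch}.{suffix}"]


# Hypothetical metadata; real values come from the GGUF file itself.
example_metadata = {
    "general.architecture": "llama",
    "general.alignment": 32,
    "llama.context_length": 4096,
}

print(arch_value(example_metadata, "context_length"))  # 4096
```

The same helper works unchanged for another model family, because only the value of general.architecture decides which namespace is consulted.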

Tensor information and tensor data

After the metadata, the file stores tensor information and tensor data.

Tensor information describes:

  • Tensor name
  • Shape
  • Data type
  • Offset into the tensor data section

The tensor data section contains the actual model weights. These weights may be stored in full precision or in one of GGUF's supported quantized tensor types.

GGUF uses an alignment value defined in metadata under the general.alignment key. Many GGUF files use 32-byte alignment, which is also the default when the key is absent, but the value is controlled by metadata rather than hardcoded.

Alignment matters because it allows runtimes to access tensor blocks efficiently.
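The padding arithmetic behind alignment is short enough to show directly. The sketch below rounds an offset up to the next multiple of the alignment value; the default of 32 and the example offset are assumptions for illustration.

```python
def align_offset(offset: int, alignment: int = 32) -> int:
    """Round offset up to the next multiple of alignment (no change if aligned)."""
    return offset + (alignment - offset % alignment) % alignment


# Hypothetical byte position where the tensor information ends.
end_of_tensor_info = 1_234_567
data_start = align_offset(end_of_tensor_info)
print(data_start, data_start % 32)
```

A reader would apply the same rounding to find where the tensor data section begins, then add each tensor's stored offset to that position.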

Memory mapping

One of GGUF's practical advantages is memory mapping, often called mmap.

With memory mapping, the operating system can map the model file into virtual memory instead of forcing the runtime to copy the entire file into RAM upfront.

This can make model startup feel much faster, especially on SSDs. It also allows the operating system to page model data in and out as needed.

However, memory mapping is not magic. The model still needs enough practical memory bandwidth and available RAM or VRAM to run well. If your system is constantly paging from disk, inference may become slow.

A better way to think about mmap is this:

  • It improves loading efficiency
  • It reduces unnecessary copying
  • It lets the OS manage paging
  • It does not eliminate the memory requirements of inference
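Python's standard mmap module shows the same mechanism on a small scale. In this sketch, peek_magic is a hypothetical helper that maps a file read-only and touches only its first four bytes, so the operating system only needs to page in the part that is actually read.

```python
import mmap


def peek_magic(path: str) -> bytes:
    """Map a file read-only and return its first four bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[:4]
```

For a valid model file this should return b"GGUF". Note that mapping an empty file raises an error, so the helper assumes the file has content.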

Understanding GGUF Quantization Types

Quantization compresses model weights into lower-precision representations.

Instead of storing every weight as a 16-bit floating point value, a quantized model stores approximate values using fewer bits. This reduces disk size, RAM and VRAM usage, and memory bandwidth pressure.

The key insight is that many neural network weights do not need full floating-point precision during inference. A carefully quantized model can preserve much of the original model's behavior while becoming dramatically smaller.
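A toy example helps show why precision can be traded away. The sketch below quantizes one block of 32 weights to 8-bit integer codes with a single shared scale, then restores them. It is a simplified illustration in the spirit of blockwise 8-bit schemes, not the llama.cpp implementation, and the weights are made up.

```python
BLOCK_SIZE = 32  # weights per block; each block shares one scale


def quantize_block(values):
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the stored scale and codes."""
    return [scale * c for c in codes]


weights = [0.01 * (i - 16) for i in range(BLOCK_SIZE)]  # made-up example weights
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f}, max error={max_error:.6f}")
```

Each restored weight differs from the original by at most half a quantization step, while storage drops from 16 bits per weight to roughly 8 bits plus a small per-block overhead for the scale.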

GGUF quantization naming

GGUF quantization names usually follow this pattern:

  • Q means quantized
  • The number suggests approximate bits per weight
  • K refers to the k-quant family
  • S, M, and L usually mean small, medium, and large variants

Examples include:

  • Q4_K_M
  • Q5_K_M
  • Q6_K
  • Q8_0

The name is a useful guide, but it is not always an exact statement of total file size. Real file size depends on tensor mix, architecture, metadata, tokenizer size, and whether some tensors remain at higher precision.
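As a rough back-of-the-envelope check, weight storage is the parameter count times the bits per weight, divided by eight. The sketch below applies that arithmetic; the 4.8 bits-per-weight figure for Q4_K_M is an assumed effective value for illustration, since real files mix tensor types.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed effective bits per weight; illustrative only, real files vary.
ASSUMED_BITS_PER_WEIGHT = {"F16": 16.0, "Q4_K_M": 4.8}

for name, bits in ASSUMED_BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{estimate_size_gb(7e9, bits):.1f} GB")
```

Any other bits-per-weight value can be plugged in the same way. The estimate covers weights only, so metadata, tokenizer data, and the runtime's own memory come on top.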

Common GGUF quantization types

| Quantization | Description | Approximate 7B file size | Quality note |
| --- | --- | --- | --- |
| Q2_K | Very low-bit quantization | Around 2.5–3 GB | Small, but quality loss is often obvious |
| Q3_K_M | Low-bit balanced quantization | Around 3.3–3.6 GB | Usable for lightweight chat, but not ideal for reasoning |
| Q4_K_M | Balanced 4-bit quantization | Around 4–5 GB | Strong default for most local users |
| Q5_K_M | Higher-quality 5-bit quantization | Around 4.8–5.2 GB | Better for coding, reasoning, and structured tasks |
| Q6_K | High-quality quantization | Around 5.5–6 GB | Often close to higher-precision behavior |
| Q8_0 | 8-bit quantization | Around 7–8 GB | High quality, but much larger than Q4/Q5 |

These numbers are approximations for 7B-class dense models. Newer architectures, mixture-of-experts models, larger tokenizers, and different tensor layouts can change the actual file size.

In practice, Q4_K_M became a popular default because it provides a strong balance between size and quality. Many users find it good enough for general chat, summarization, rewriting, and exploratory local AI work.

Q5_K_M and Q6_K are often better choices for more demanding workloads, such as coding or multi-step instruction following, because these tasks are more sensitive to small quality degradation.

K-quants vs. I-quants

K-quants are the widely used quantization family behind formats such as Q4_K_M, Q5_K_M, and Q6_K.

They use grouped quantization schemes with scaling information that helps preserve model behavior while reducing memory requirements. They are popular because they are reliable, broadly supported, and easy to find in community GGUF releases.

I-quants, often written as IQ formats, are newer quantization types such as:

  • IQ4_XS
  • IQ3_M
  • IQ2_XXS
  • IQ1_S

I-quants are designed to achieve better quality at very small sizes. They can use techniques such as importance-aware quantization and non-linear quantization codebooks. Some workflows use an importance matrix, often called an imatrix, to help preserve more important weights during quantization.

The tradeoff is complexity. I-quants can produce excellent size-quality results, especially at very low bitrates, but they may require more careful quantization workflows and runtime support.

For most beginners, K-quants remain the easiest starting point.

Choosing a quantization level for your hardware

The following table gives practical starting points. Treat these as rules of thumb, not strict guarantees. Context length, operating system overhead, GPU offloading, KV cache size, and the specific model architecture can all change memory requirements.

| Hardware tier | 7B/8B models | 13B/14B models | 30B/34B models | 70B-class models |
| --- | --- | --- | --- | --- |
| 8 GB RAM/VRAM | Q4_K_M or smaller | Q2_K/Q3_K may run slowly | Not practical | Not practical |
| 16 GB RAM/VRAM | Q5_K_M or Q6_K | Q4_K_M | Not practical or very constrained | Not practical |
| 24 GB RAM/VRAM | Q8_0 or Q6_K | Q5_K_M/Q6_K | Q3_K/Q4_K with constraints | Not practical for most users |
| 32 GB RAM/VRAM | Q8_0 | Q6_K/Q8_0 | Q4_K_M/Q5_K_M | Q2_K/Q3_K only for experiments |
| 48 GB+ RAM/VRAM | Q8_0 or FP16/BF16 where supported | Q8_0 | Q5_K_M/Q6_K | Q4_K_M possible with constraints |
| 64 GB+ RAM/VRAM | High precision | High precision | Q6_K/Q8_0 | Q4_K_M/Q5_K_M more practical |

General rules of thumb:

  • Use Q4_K_M as the safe default for most local inference.
  • Use Q5_K_M when quality matters more than saving every gigabyte.
  • Use Q6_K or Q8_0 when memory is available, and you want better fidelity.
  • Avoid Q2_K for serious work unless you are testing extreme memory-constrained scenarios.
  • Leave extra memory for the KV cache, especially when using long context windows.

The KV cache is easy to overlook. A model may fit into RAM at a short context length but fail or slow down at a much longer context length because the cache grows with sequence length.
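The growth is easy to estimate. A common approximation is that the cache stores one key and one value vector per layer, per KV head, per token. The sketch below uses an assumed 7B-class shape (32 layers, 32 KV heads, head size 128, 16-bit cache values); actual models and runtimes differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Assumed 7B-class shape, for illustration only.
for context_len in (4096, 32768):
    gib = kv_cache_bytes(32, 32, 128, context_len) / 2**30
    print(f"context {context_len:>6}: {gib:.1f} GiB")
```

Under these assumptions the cache takes 2 GiB at 4,096 tokens and 16 GiB at 32,768 tokens, which can exceed the size of the quantized weights themselves. Models that share KV heads across attention heads need proportionally less.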

The GGUF Ecosystem

GGUF's adoption is driven as much by tooling as by the format itself.

A format only becomes useful when users can easily download, run, inspect, convert, and serve models. GGUF benefits from a strong ecosystem across command-line tools, desktop apps, APIs, and hosted model repositories.

1. llama.cpp

llama.cpp is the original and most important GGUF runtime. It is a lightweight C/C++ inference engine created by Georgi Gerganov and maintained by the GGML community. Its main goal is to enable efficient LLM inference with minimal setup across many hardware platforms.

Modern llama.cpp supports many backends, including:

  • CPU
  • CUDA for NVIDIA GPUs
  • Metal for Apple devices
  • Vulkan
  • HIP for AMD GPUs through ROCm
  • SYCL for Intel GPUs
  • OpenCL in selected environments
  • Other specialized backends such as CANN, OpenVINO, and WebGPU, depending on platform support

It also includes tools for conversion, quantization, serving, benchmarking, and command-line inference. Common tools include:

  • convert_hf_to_gguf.py
  • llama-quantize
  • llama-cli
  • llama-server
  • llama-bench

The commands to create a basic CPU CMake build are:

cmake -B build
cmake --build build --config Release

For some configurations, certain flags need to be added to the first of those two commands:

  • Disable Apple Metal on macOS (enabled by default): -DGGML_METAL=OFF
  • Vulkan build: -DGGML_VULKAN=1
  • CUDA build for NVIDIA GPUs: -DGGML_CUDA=ON

Note that current builds use GGML_* CMake options such as GGML_CUDA, GGML_VULKAN, and GGML_HIP. Older tutorials may still show earlier LLAMA_*-prefixed options, which have since been deprecated.

2. Ollama

Ollama is one of the easiest ways to run local models. It provides:

  • A simple CLI
  • Model pulling and management
  • A local REST API
  • Official Python and JavaScript libraries
  • Integration with many local AI frontends

Ollama stores and manages models for you, so the user usually does not interact with .gguf files directly. However, Ollama is built around llama.cpp-compatible local inference and can also import GGUF files through a Modelfile workflow.

Ollama exposes a local API at:

http://localhost:11434/api

Two commonly used endpoints are:

  • /api/generate for prompt completion
  • /api/chat for chat-style messages

For beginners, Ollama is often the fastest path from zero to local inference.

3. LM Studio

LM Studio is a desktop application for discovering, downloading, and chatting with local models. It is useful for users who prefer a graphical interface instead of command-line tools.

4. GPT4All

GPT4All is another cross-platform local AI application focused on private, local chatbot workflows. It supports GGUF models and provides a beginner-friendly environment for local inference.

These tools make GGUF accessible to non-specialists. Users do not need to understand CMake, tensor layouts, or quantization internals just to try a local model.

How to Get Started with GGUF Models

There are two practical ways to get started:

  1. Use Ollama for the simplest experience.
  2. Use llama.cpp directly for more control.

Running a model with Ollama

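The simplest workflow is to download the model and start an interactive chat session. Note that llama3.3 is a 70B-parameter model, so the download runs to tens of gigabytes; on a machine with limited memory, substitute a smaller model tag such as llama3.2: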

ollama pull llama3.3
ollama run llama3.3

To call the model from Python using the REST API:

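import requests

# Request body for Ollama's /api/generate endpoint
payload = {
    "model": "llama3.3",
    "prompt": "Give me three practical use cases for GGUF.",
    "stream": False  # return one JSON object instead of a stream
}

response = requests.post(
    "http://localhost:11434/api/generate",
    json=payload,
    timeout=300  # local generation can be slow; adjust as needed
)
response.raise_for_status()  # fail loudly if the server or model is unavailable

print(response.json()["response"])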

For chat-style applications, use /api/chat:

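import requests

# Request body for Ollama's /api/chat endpoint
payload = {
    "model": "llama3.3",
    "messages": [
        {"role": "user", "content": "What is GGUF used for?"}
    ],
    "stream": False  # return one JSON object instead of a stream
}

response = requests.post(
    "http://localhost:11434/api/chat",
    json=payload,
    timeout=300  # local generation can be slow; adjust as needed
)
response.raise_for_status()  # fail loudly if the server or model is unavailable

print(response.json()["message"]["content"])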

The "stream": False field (serialized as JSON false) is important for simple scripts. Without it, Ollama streams a sequence of newline-delimited JSON objects rather than returning one final JSON response.
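If you do want streaming, each line of the reply is a separate JSON object carrying a fragment of text. The sketch below shows one way to reassemble them using only the standard library. The helper names are hypothetical, and it assumes each chunk from /api/generate has a "response" field and that the last chunk sets "done" to true.

```python
import json
import urllib.request


def collect_stream(lines):
    """Join the 'response' fragments from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk is marked done
    return "".join(parts)


def stream_generate(prompt, model="llama3.3",
                    url="http://localhost:11434/api/generate"):
    """Send a streaming request to a locally running Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return collect_stream(response)  # the response iterates line by line
```

In an interactive application you would print each fragment as it arrives instead of joining them at the end, which is what makes streaming feel responsive.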

You can also use Ollama's official Python library:

from ollama import chat

response = chat(
    model="llama3.3",
    messages=[
        {"role": "user", "content": "Explain GGUF quantization simply."}
    ]
)

print(response.message.content)

Running a GGUF file with llama.cpp

If you already have a .gguf file, you can run it directly with llama.cpp after building the project.

Example:

./build/bin/llama-cli \
  -m models/model.Q4_K_M.gguf \
  -p "Explain the difference between GGUF and GPTQ." \
  -n 256

If you have GPU support enabled, you can offload layers to the GPU:

./build/bin/llama-cli \
  -m models/model.Q4_K_M.gguf \
  -p "Summarize GGUF in five bullet points." \
  -n 256 \
  -ngl 99

The -ngl flag controls the number of layers offloaded to the GPU. A high value such as 99 is commonly used to offload as much as possible, assuming the model fits in VRAM.

For API serving, use llama-server:

./build/bin/llama-server \
  -m models/model.Q4_K_M.gguf \
  -ngl 99 \
  --host 127.0.0.1 \
  --port 8080

This starts a local HTTP server, including an OpenAI-compatible chat completions endpoint, for integrating llama.cpp into applications.
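Once the server is running, a client can talk to it over HTTP. The sketch below assumes the server was started on 127.0.0.1:8080 as in the command above and that it exposes an OpenAI-style /v1/chat/completions route. The helper names are made up for this example, and only the standard library is used.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://127.0.0.1:8080"):
    """Build a POST request for the server's OpenAI-style chat endpoint."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def ask(prompt):
    """Send the prompt to a running llama-server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

Because the request shape follows the OpenAI chat format, existing client code written for that format can often be pointed at the local server by changing only the base URL.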

Converting a Hugging Face model to GGUF

Most users do not need to convert models manually because community GGUF releases are widely available.

However, manual conversion is useful when:

  • You have fine-tuned your own model
  • No GGUF version exists yet
  • You want to control the quantization process yourself
  • You need a specific quantization type

A typical workflow is:

  1. Download a Hugging Face model.
  2. Convert it to GGUF.
  3. Quantize the GGUF file.

Example:

huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir mistral-7b

Then convert to GGUF:

python convert_hf_to_gguf.py mistral-7b \
  --outfile mistral-f16.gguf \
  --outtype f16

Then quantize:

./build/bin/llama-quantize \
  mistral-f16.gguf \
  mistral-q4_k_m.gguf \
  Q4_K_M

In current llama.cpp workflows, convert_hf_to_gguf.py and llama-quantize are the relevant tools. Older tutorials may refer to deprecated conversion scripts or older binary names.

Advantages and Limitations of the GGUF Format

GGUF is optimized for practical local inference. It is not a universal replacement for every model format or serving stack.

| Advantages | Limitations |
| --- | --- |
| Single-file model deployment | Not designed for training from scratch |
| Strong local inference ecosystem | Very low-bit quantization can hurt quality |
| Works across many hardware backends | Large models still need significant memory |
| Supports memory mapping | GPU throughput may be lower than specialized GPU serving stacks |
| Many quantization choices | Runtime must still support the model architecture and tensor types |
| Easy distribution on Hugging Face | Context length can increase memory use through the KV cache |

For CPU-first, Apple Silicon, mixed-hardware, and privacy-focused inference, GGUF is often an excellent choice.

For high-throughput NVIDIA server deployment, other formats and engines may be faster depending on the model, batch size, quantization method, and serving framework.

Final Thoughts

GGUF makes local LLM inference practical by packaging everything a runtime needs (weights, tokenizer, metadata, quantization info) into one portable file. Its real strength is the ecosystem around it: llama.cpp, Ollama, LM Studio, and Hugging Face have all made it the default format for local AI deployment.

For most users, the path is simple: install Ollama, pull a model, and run it. Q4_K_M is a solid default; step up to Q5_K_M or Q6_K when you need better reasoning or coding output and have the memory to spare.

If you want to go deeper into LLM deployment, model optimization, and local inference workflows, you should explore the Associate AI Engineer for Data Scientists or the Associate AI Engineer for Developers career track.

GGUF Format FAQs

What does GGUF stand for?

GGUF stands for GGML Unified Format. It is a binary file format designed for storing and running large language models locally. GGUF packages tensors, tokenizer data, metadata, and architecture information into a single portable file, making local deployment much simpler compared to older multi-file workflows.

Is GGUF better than GPTQ or AWQ?

GGUF is not necessarily “better” than GPTQ or AWQ in every scenario. GGUF is optimized for portability and broad hardware compatibility, especially for CPU, Apple Silicon, and mixed-hardware inference through tools like llama.cpp and Ollama. GPTQ and AWQ are typically more optimized for high-throughput NVIDIA GPU inference in server environments.

Which GGUF quantization should beginners use?

For most users, Q4_K_M is the safest starting point. It offers a strong balance between model quality, RAM usage, and inference speed. Users with more memory who want better reasoning or coding performance may prefer Q5_K_M or Q6_K, while lower-bit formats such as Q2_K are usually only suitable for experimentation.

Can GGUF models run without a GPU?

Yes. One of GGUF’s biggest advantages is strong CPU support. Tools such as llama.cpp can run GGUF models entirely on CPUs, although inference speed will usually be slower than GPU acceleration. Smaller quantized models, such as 7B or 8B Q4_K_M variants, are often practical on modern consumer CPUs.

Do I need to manually convert models into GGUF?

Usually not. Most popular open-weight models already have community-uploaded GGUF versions on Hugging Face. Manual conversion is mainly useful if you have fine-tuned your own model, need a specific quantization type, or want tighter control over the conversion and quantization process using llama.cpp.


Author
Austin Chia
I'm Austin, a blogger and tech writer with years of experience both as a data scientist and a data analyst in healthcare. Starting my tech journey with a background in biology, I now help others make the same transition through my tech blog. My passion for technology has led me to my writing contributions to dozens of SaaS companies, inspiring others and sharing my experiences.
