LM Studio Tutorial: Get Started with Local LLMs

Discover how to install and run LLMs locally using LM Studio. Keep your data private, chat with documents using built-in RAG, and set up a local API.

Apr 14, 2026 · 10 min read

Running large language models locally has become increasingly popular, especially when you don’t want to send private data to external servers. When everything runs on your machine, your prompts and data stay within your environment, which gives you more control and better privacy.

If you want that same power, I’ll walk you through how to use LM Studio to run and chat with LLMs locally. It’s a GUI-first tool, so you don’t need terminal experience or deep technical knowledge. The setup is straightforward, and you can get started quickly. Let’s jump in!

If you’re interested in running agentic tools locally, I recommend checking out our tutorials on setting up OpenClaw and Claude Code with Ollama, respectively.

What is LM Studio?

LM Studio is a cross-platform application that lets you download and run large language models locally on your machine so that your data never leaks to external servers.

It comes with a built-in model browser where you can search, browse, and download models directly from Hugging Face. You can download pretty much any model you want, including different versions of DeepSeek, Llama, Gemma, Phi, or Mistral. You don’t need any extra setup either.

LM Studio is also a great option for beginners, especially if you’re not comfortable working with command-line prompts. It gives you a user-friendly interface where you can select a model, adjust the configuration, and start chatting with it right away.

You can also upload your local files and chat with them: LM Studio can attach .docx, .pdf, and .txt files to chat sessions. If a document fits in the context window, it is added in full; if it is very long, LM Studio can use retrieval-augmented generation (RAG) to pull relevant information from the file to answer your queries.

Since LM Studio is cross-platform, it works smoothly across Windows, Mac, and Linux, so you’re not limited by your setup. And once you get past the basics, there’s more you can do with it. You can connect your local LLMs to external tools, data sources, and APIs by integrating MCP servers, which makes it flexible enough for more advanced workflows.

LM Studio vs Ollama

LM Studio and Ollama are both designed to run and chat with large language models locally. However, there are a few key differences:

Feature	LM Studio	Ollama
Interface	GUI-first, user-friendly interface	CLI-first, terminal-based interface
Built-in RAG	Yes, no extra setup needed	Requires external tools
MCP support	Built-in	Limited / not native
Model downloads	Access to Hugging Face from the app	Using commands such as `ollama pull`
Ease of setup	Very beginner-friendly	Slight learning curve if you’re new to CLI

LM Studio System Requirements and Model Choice

Before you start downloading models in LM Studio, it helps to understand what your system can actually handle. The model sizes you can run depend directly on your available RAM, and picking a model that is too large can slow things down or make the app unusable.

Here’s a quick lookup table of suitable model sizes by available RAM:

RAM	What you can run comfortably
8GB	Small models (1B–4B)
16GB	Mid-sized models (7B–9B)
32GB+	Larger models (13B and above)

A GPU is optional, but it makes a noticeable difference: with one, responses are generated much faster. NVIDIA GPUs with CUDA support work best, Apple Silicon uses Metal effectively, and AMD support is partial and depends on the setup.

How to choose the right model for your hardware

Here are some practical recommendations for picking a model that actually runs well on your machine without pushing it too hard.

RAM/VRAM	Recommended models
8GB	Qwen 2.5 3B / 4B, Phi-3 Mini (3.8B), Gemma 2 2B
16GB	Llama 3 8B, Gemma 2 9B, Mistral 7B, Qwen 2.5 7B
24GB	Llama 3.1 8B (higher quality quant), Mixtral 8x7B (quantized), Qwen 2.5 14B
32GB+	Llama 3.1 70B (heavily quantized), Qwen 2.5 32B, Mixtral variants (better configs)

You’ll also notice different versions of the same model with labels like Q4_K_M or Q8_0. These labels refer to quantization levels, which indicate how aggressively the model’s weights are compressed. The number is roughly the bits used per weight: a lower-bit quantization, like Q4, reduces memory usage and runs faster, but you lose some quality. A higher-bit quantization, like Q8, preserves more output quality, but it requires more RAM and runs slower.

If you’re unsure, Q4 or Q5 is usually a safe place to start, especially on a 16GB setup like mine.
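To see why the quantization level matters so much for RAM, here is a rough back-of-the-envelope sketch. The bits-per-weight figures are approximations I’m assuming for illustration (real files vary by model), and the estimate covers the weights only, not the extra memory used by the context.

```python
# Approximate bits per weight for common GGUF quantization labels.
# These figures are assumptions for illustration; real files vary by model.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Return a rough size of the model weights in gigabytes (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billions * 1e9 * bits / 8
    return total_bytes / 1e9


# Compare a 7B model at three quantization levels.
for quant in ("Q4_K_M", "Q8_0", "F16"):
    print(f"7B at {quant}: ~{estimate_weights_gb(7, quant):.1f} GB")
```

Under these assumptions, a 7B model needs roughly 4 GB at Q4_K_M but about 14 GB unquantized, which is why a Q4 or Q5 file is the comfortable choice on a 16GB machine.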

Installing LM Studio

To get started with LM Studio, head over to the official website and download the app. The website automatically detects your operating system and offers to download the relevant version.

You might be asked to allow permissions depending on your system settings. In my case, on Mac, it was available as an application right after opening the installer I downloaded.

Downloading Your First Model In LM Studio

When you open LM Studio for the first time, you’ll land on a clean interface with the model browser. You can immediately search for models, explore available options, and begin downloading one to run locally.

Browsing the Discover tab

Open LM Studio and click the search icon from the left sidebar.

This is essentially your model marketplace where you can search for specific models, filter by size, and explore different options. When you search, each model comes with a model card, which gives you useful context like size, capabilities, and sometimes recommended use cases. It’s worth skimming this before downloading so you know what to expect.

If you’re following along and want a reliable starting point, go with something like Qwen 2.5 7B (Q4_K_M) for a 16GB system (or take one of the suggestions I made above). It strikes a good balance between performance and quality, and it runs smoothly without pushing your machine too hard.

Understanding the model formats

As you browse, you’ll notice most models are available in GGUF format. GGUF, which stands for GPT-Generated Unified Format, is a binary format for storing and running LLMs efficiently on consumer-grade hardware.

GGUF files usually hold quantized weights, meaning the model’s high-precision weights (e.g., Float16) have been mapped to lower-bit integers (e.g., 4-bit, 5-bit). The format packages the weights, metadata, and configuration into a single optimized file, which makes loading faster and ensures compatibility with inference engines like llama.cpp, which LM Studio relies on under the hood.
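If you’re curious what “a single file with metadata” looks like in practice, here is a minimal sketch that reads the fixed-size start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (magic bytes, version, tensor count, metadata entry count); the function name and the file path in the usage comment are hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (assumed v2+ little-endian layout)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # After the magic: uint32 version, uint64 tensor count, uint64 metadata entries.
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Usage with a hypothetical path to a downloaded model:
# with open("model.gguf", "rb") as f:
#     print(read_gguf_header(f.read(24)))
```

The metadata entries that follow the header carry details such as the architecture and tokenizer settings, which is how a runtime can load a model from one file without extra configuration.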

Chatting with a Local LLM In LM Studio

Let’s get to the exciting part and put the model to use.

Loading a model and configuring parameters

Step 1: Open LM Studio and navigate to the My Models section from the left menu bar.

Step 2: Click the Settings icon on the model and click Load Model.

Once it’s loaded, go to the Inference tab on the same screen, and you’ll see controls for context length, temperature, and more.

Context length controls how much text (measured in tokens) the model can keep in view during a conversation. A higher value lets you work with longer inputs, but it also uses more memory. If you’re on limited RAM, it’s better to keep this moderate.
Temperature controls how creative or predictable the model is. Lower values make responses more deterministic, while higher values make them more varied.
System prompt sets the behavior of the model. This is where you define how the assistant should respond, including tone, style, and role.
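To build some intuition for the temperature setting, here is a small sketch of how temperature is commonly applied when sampling: the model’s raw scores are divided by the temperature before being turned into probabilities. The scores below are made up, and this is a generic illustration rather than LM Studio’s actual code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # This sketch rejects 0; real runtimes often treat it as "always pick the top token".
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all of the probability lands on the top-scoring token, so responses are more predictable. At a higher temperature the probabilities even out, so less likely tokens get picked more often.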

Your first conversation

Once everything is set up, you can start chatting with the model just like you would with any AI assistant. Here’s a simple example:

I prompted, “What is the best model to use for 8GB RAM and image data?”

The model suggested using a model with a convolutional neural network (CNN) architecture, as shown in the image.

That’s the basic flow. The quality of responses depends a lot on your configuration. For example, if you set “Explain everything in simple terms with short answers” as the system prompt, the model stays consistent with that style across multiple responses.

Chatting with Your Documents in LM Studio

One of the most useful features in LM Studio is its built-in RAG support. You can directly upload your documents to the chat and start asking questions.

Setting up document Q&A

To get started, open a chat session with your loaded model. You’ll see a + icon for attaching files. Click it to upload documents like PDFs or text files directly into the chat.

Once the file is added, LM Studio prepares it for querying automatically, so you don’t need to configure anything manually. Under the hood, the document is split into smaller chunks so the model can work with it efficiently. These chunks are then converted into embeddings, which are numerical representations of the text.

When you ask a question, LM Studio retrieves the most relevant chunks and passes them to the model along with your query. This way, the model gets extra information from your documents and responds accordingly.
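The chunk, embed, and retrieve flow described above can be sketched in a few lines. To keep it self-contained, this toy version uses word counts as a stand-in for a real embedding model, and all function names are my own. The article doesn’t describe LM Studio’s internal implementation, so treat this as an illustration of the general technique only.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40):
    """Split text into chunks of roughly chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks):
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n---\n".join(chunks)
    return f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"


document = (
    "GGUF is a file format for storing quantized models. "
    "Temperature controls how varied the output is. "
    "The local server listens on port 1234 by default."
)
question = "Which port does the server listen on?"
top_chunks = retrieve(question, chunk_text(document, chunk_size=9), top_k=1)
print(build_prompt(question, top_chunks))
```

A real pipeline swaps the word counts for dense vectors from an embedding model, but the shape is the same: only the best-matching chunks travel to the model alongside your question.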

Querying your knowledge base

For example, I uploaded a research paper on artificial intelligence and asked, “What is your take on the future of AI?”

LM Studio pulls the most relevant sections from the document and sends them to the model along with your prompt. The model then generates a response based on both the provided context and its existing knowledge.

You can see the same thing happening visually in the image below:

There are a few limitations to keep in mind. The model still depends on its context window, so very large documents may not be fully considered at once. Retrieval quality also depends on how well the document is chunked, which means some answers might miss details if the relevant sections aren’t retrieved correctly.

Running LM Studio as a Local API Server

One of the more powerful things you can do with LM Studio is run it as a local API server. This lets you use your local LLM inside scripts, apps, or other tools.

Starting the server

Step 1: Open LM Studio and click the Settings icon in the bottom-left corner of the screen.

Step 2: Go to the Developer section from the left sidebar and turn on the Developer Mode toggle.

Step 3: Go back to the chat interface and click the Developer icon from the left menu bar.

Step 4: Turn on the toggle next to Status to start the server, as in the image below. Once it’s running, you can copy the server address and test it with a simple curl request:

curl http://127.0.0.1:1234/v1/models

If everything is set up correctly, you’ll see a JSON response listing the available models.
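If you’d rather run the same check from Python, here is a minimal sketch using only the standard library. It assumes the response follows the OpenAI-style shape, with a "data" list of objects that each carry an "id"; the function names are my own.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list:
    """Pull model identifiers out of an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_local_models(base_url: str = "http://127.0.0.1:1234/v1") -> list:
    """Query the local server; the LM Studio server must be running."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return extract_model_ids(json.load(resp))
```

With the server switched on, calling list_local_models() should return the identifiers you can pass as the model name in later requests.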

Connecting from Python

Once your local server is running, you can treat it like any other API. The only difference is that instead of calling OpenAI’s servers, you’re calling your own machine.

Here’s a simple example:

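# Requires the openai package (pip install openai).
from openai import OpenAI

# Point the OpenAI client at the local LM Studio server instead of OpenAI's servers.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # identifier of the model loaded in LM Studio
    messages=[
        {"role": "user", "content": "Explain how local LLMs work"}
    ],
)

# Print the text of the first completion.
print(response.choices[0].message.content)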

What’s happening here:

base_url tells the code to use your local LM Studio server instead of OpenAI
api_key can be anything (LM Studio doesn’t enforce it)
model refers to the model you loaded in LM Studio
messages is your prompt

When you run this, your request goes to “localhost:1234”, the model processes it, and you get a response back, just like any API call. This works because LM Studio follows the OpenAI API format.
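Because the server speaks the OpenAI format, you can also call it without the openai package. The sketch below builds the JSON body by hand, including a system prompt and a temperature, and posts it with the standard library. The helper names and default values are my own choices, not part of LM Studio.

```python
import json
import urllib.request


def build_chat_request(user_prompt, system_prompt=None, model="local-model", temperature=0.7):
    """Build an OpenAI-style chat completion body."""
    messages = []
    if system_prompt:
        # The system message sets tone and role, like the System prompt field in the app.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return {"model": model, "messages": messages, "temperature": temperature}


def send_chat_request(body, base_url="http://localhost:1234/v1"):
    """POST the body to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, send_chat_request(build_chat_request("Explain how local LLMs work", system_prompt="Explain everything in simple terms with short answers")) returns the model’s reply as a string.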

Conclusion

LM Studio gives you a clean, practical interface to work with LLMs locally, and you can control the environment completely. You can select the models, configure the settings, chat with them, and even extend the setup to run it as a local API server.

What stands out is that running large language models locally used to involve a lot of setup and tooling. LM Studio reduces that to something that feels closer to installing and using a regular desktop app.

If you want to build on this, the next step is learning how to integrate these models into real workflows. You can explore courses like Working with the OpenAI API or broader AI fundamentals tracks to understand how to structure prompts, build applications, and work with models effectively.

LM Studio FAQs

Is LM Studio free?

Is LM Studio completely offline?

Can I use LM Studio in my own applications?

Can LM Studio handle images?

How much RAM do I need to run LLMs locally?

Author

Srujana Maddula
