Chuyển đến nội dung chính

LM Studio Tutorial: Get Started with Local LLMs

Discover how to install and run LLMs locally using LM Studio. Keep your data private, chat with documents using built-in RAG, and set up a local API.
14 thg 4, 2026  · 10 phút đọc

Running large language models locally has become increasingly popular, especially when you don’t want to send private data to external servers. When everything runs on your machine, your prompts and data stay within your environment, which gives you more control and better privacy.

If you want that same power, I’ll walk you through how to use LM Studio to run and chat with LLMs locally. It’s a GUI-first tool, so you don’t need terminal experience or deep technical knowledge. The setup is straightforward, and you can get started quickly. Let’s jump in!

If you’re interested in running agentic tools locally, I recommend checking out our tutorials on setting up OpenClaw and Claude Code with Ollama, respectively.

What is LM Studio?

LM Studio is a cross-platform application that lets you download and run large language models locally on your machine so that your data never leaks to external servers. 

It comes with a built-in model browser where you can search, browse, and download models directly from Hugging Face. You can download pretty much any model you want, including different versions of DeepSeek, Llama, Gemma, Phi, or Mistral. You don’t need any extra setup either. 

LM Studio is also a great option for beginners, especially if you’re not comfortable working with command-line prompts. It gives you a user-friendly interface where you can select a model, adjust the configuration, and start chatting with it right away. 

You can also upload your local files and chat with them: LM Studio can attach .docx, .pdf, and .txt files to chat sessions. If a document fits in context, it is added in full, and if it is very long, LM Studio can use retrieval-augmented generation (RAG)to pull relevant information from those files to answer your queries.

Since LM Studio is cross-platform, it works smoothly across Windows, Mac, and Linux, so you’re not limited by your setup. And once you get past the basics, there’s more you can do with it. You can connect your local LLMs to external tools, data sources, and APIs by integrating MCP servers, which makes it flexible enough for more advanced workflows.

LM Studio vs Ollama

LM Studio and Ollama are both designed to run and chat with large language models locally. However, there are a few key differences:

Feature

LM Studio

Ollama

Interface

GUI-first, user-friendly interface

CLI-first, terminal-based interface

Built-in RAG

Yes, no extra setup needed

Requires external tools

MCP support

Built-in 

Limited / not native

Model downloads

Access to Huggingface from the app

Using commands such as ollama pull

Ease of setup

Very beginner-friendly

Slight learning curve if you’re new to CLI

LM Studio System Requirements and Model Choice

Before you start downloading models in LM Studio, it helps to understand what your system can actually handle. The model you choose directly depends on your available RAM, and picking the wrong one can slow things down or make the app unusable.

A quick lookup table with suitable models for different available RAM:

RAM

What you can run comfortably

8GB

Small models (1B–4B)

16GB

Mid-sized models (7B-9B)

32GB+

Larger models (13B and above)

GPU is optional, but it makes a noticeable difference. If you have one, model responses become much faster and smoother. NVIDIA GPUs with CUDA support work best, Apple Silicon uses Metal effectively, and AMD has partial support depending on the setup.

How to choose the right model for your hardware

Here are some practical recommendations for picking a model that actually runs well on your machine without pushing it too hard.

RAM/VRAM

Recommended models 

8GB

Qwen 2.5 3B / 4B, Phi-3 Mini (3.8B), Gemma 2 2B

16GB

Llama 3 8B, Gemma 2 9B, Mistral 7B, Qwen 2.5 7B

24GB

Llama 3.1 8B (higher quality quant), Mixtral 8x7B (quantized), Qwen 2.5 14B

32GB+

Llama 3.1 70B (heavily quantized), Qwen 2.5 32B, Mixtral variants (better configs)

You’ll also notice different versions of the same model with labels like Q4_K_M or Q8_0. This refers to quantization levels, which basically tell how the model is compressed. Lower quantization, like Q4, reduces memory usage and runs faster, but you lose some quality. A higher quantization, like Q8, keeps better output quality, but it requires more RAM and runs slower.

If you’re unsure, Q4 or Q5 is usually a safe place to start, especially on a 16GB setup like mine.

Installing LM Studio

To get started with LM Studio, head over to the official website and download the app. The website automatically detects your operating system and offers to download the relevant version.

Downloading LM StudioYou might be asked to allow permissions depending on your system settings. In my case, on Mac, it was available as an application right after opening the installer I downloaded.

Downloading Your First Model In LM Studio

When you open LM Studio for the first time, you’ll land on a clean interface with the model browser. You can immediately search for models, explore available options, and begin downloading one to run locally.

Browsing the Discover tab

Open LM Studio and click the search icon from the left sidebar.

LM Studio Discover Tab

This is essentially your model marketplace where you can search for specific models, filter by size, and explore different options. When you search, each model comes with a model card, which gives you useful context like size, capabilities, and sometimes recommended use cases. It’s worth skimming this before downloading so you know what to expect.

Downloading Qwen2.5-VL-7B in LM Studio

If you’re following along and want a reliable starting point, go with something like Qwen 2.5 7B (Q4_K_M) for a 16GB system (or take one of the suggestions I made above). It strikes a good balance between performance and quality, and it runs smoothly without pushing your machine too hard.

Understanding the model formats

As you browse, you’ll notice most models are available in GGUF format. GGUF, which stands for GPT-Generated Unified format, is a binary format to store and run LLMs efficiently on consumer-grade hardware. 

This format maps high-precision weights (e.g., Float16) of models to lower-bit integers (e.g., 4-bit, 5-bit) and packages the model weights, metadata, and configuration into a single optimized file. It makes loading faster and ensures compatibility with inference engines like llama.cpp, which LM Studio relies on under the hood.

Chatting with a Local LLM In LM Studio

Let’s get to the exciting part and put the model to use.

Loading a model and configuring parameters

Step 1: Open LM Studio and navigate to the My Models section from the left menu bar.

LM Studio My Models

Step 2: Click the Settings icon on the model and click Load Model.

LM Studio Model Settings

Once it’s loaded, go to the Inference tab on the same screen, and you’ll see controls for context length, temperature, and more.

LM Studio Inference Tab

  • Context length controls how much information the model can remember during a conversation. A higher value lets you work with longer inputs, but it also uses more memory. If you’re on limited RAM, it’s better to keep this moderate. 
  • Temperature controls how creative or predictable the model is. Lower values make responses more deterministic, while higher values make them more varied. 
  • System prompt sets the behavior of the model. This is where you define how the assistant should respond, including tone, style, and role.

Your first conversation

Once everything is set up, you can start chatting with the model just like you would with any AI assistant. Here’s a simple example:

I prompted, “What is the best model to use for 8GB RAM and image data?”

The model suggested using a model with a convolutional neural network (CNN) architecture, as shown in the image.

Chatting with Qwen2.5 in LM Studio

That’s the basic flow. The quality of responses depends a lot on your configurations. For example, if you set “Explain everything in simple terms with short answers” in the system prompt settings, the model stays consistent with that style across multiple responses. 

Chatting with Your Documents in LM Studio

One of the most useful features in LM Studio is its built-in RAG support. You can directly upload your documents to the chat and start asking questions.

Setting up document Q&A

To get started, open a chat session with your loaded model. You’ll see a + icon for attaching files. Click it to upload documents like PDFs or text files directly into the chat.

Attaching file for RAG in LM Studio

Once the file is added, LM Studio prepares it for querying automatically, so you don’t need to configure anything manually. Under the hood, the document is split into smaller chunks so the model can work with it efficiently. These chunks are then converted into embeddings, which are numerical representations of the text. 

When you ask a question, LM Studio retrieves the most relevant chunks and passes them to the model along with your query. This way, the model gets extra information from your documents and responds accordingly.

Querying your knowledge base

For example, I uploaded a research paper on artificial intelligence and asked, “What is your take on the future of AI?”

LM Studio pulls the most relevant sections from the document and sends them to the model along with your prompt. The model then generates a response based on both the provided context and its existing knowledge.

You can see the same thing happening visually in the image below:

Using RAG in LM Studio

There are a few limitations to keep in mind. The model still depends on its context window, so very large documents may not be fully considered at once. Retrieval quality also depends on how well the document is chunked, which means some answers might miss details if the relevant sections aren’t retrieved correctly.

Running LM Studio as a Local API Server

One of the more powerful things you can do with LM Studio is run it as a local API server. This lets you use your local LLM inside scripts, apps, or other tools.

Starting the server

Step 1: To enable this, Open LM Studio and click the Settings icon in the bottom-left corner of the screen.

Step 2: Go to the Developer section from the left sidebar and turn on the Developer Mode toggle.

Turn on Developer Mode in LM Studio

Step 3: Go back to the chat interface and click the Developer icon from the left menu bar.

LM Studio Developer Menu

Step 4: Turn on the toggle next to Status to start the server, as in the image below. Once it’s running, you can copy the server address and test it with a simple curl request: 

curl http://127.0.0.1:1234/v1/models

Running LM Studio as local API server

If everything is set up correctly, you’ll see a JSON response listing the available model.

LM Studio API server response

Connecting from Python

Once your local server is running, you can treat it like any other API. The only difference is that instead of calling OpenAI’s servers, you’re calling your own machine.

Here’s a simple example:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",
    messages=[
        {"role": "user", "content": "Explain how local LLMs work"}
    ],
)

print(response.choices[0].message.content)

What’s happening here:

  • base_url tells the code to use your local LM Studio server instead of OpenAI

  • api_key can be anything (LM Studio doesn’t enforce it)

  • model refers to the model you loaded in LM Studio

  • messages is your prompt

When you run this, your request goes to “localhost:1234”, the model processes it, and you get a response back, just like any API call. This works because LM Studio follows the OpenAI API format

Conclusion

LM Studio gives you a clean, practical interface to work with LLMs locally, and you can control the environment completely. You can select the models, configure the settings, chat with it, and even extend the setup to run it as a local API server.

What stands out is that running large language models locally used to involve a lot of setup and tooling. LM Studio reduces that to something that feels closer to installing and using a regular desktop app.

If you want to build on this, the next step is learning how to integrate these models into real workflows. You can explore courses like Working with the OpenAI API or broader AI fundamentals tracks to understand how to structure prompts, build applications, and work with models effectively.

LM Studio FAQs

Is LM Studio free?

Yes, LM Studio is free to download and use. There are no subscription fees, usage limits, or paywalled features for the core app.

Is LM Studio completely offline?

Yes, once you download a model, everything runs locally. You only need an internet connection to download models from Hugging Face.

Can I use LM Studio in my own applications?

Yes, LM Studio can run as a local API server that follows the OpenAI API format. This means you can connect it to scripts, apps, or other tools running on your machine.

Can LM Studio handle images?

Yes, but only with vision-enabled models. If you load a multimodal model like LLaVA or Qwen-VL, you can upload images and ask questions about them. Standard text-only models won’t support image inputs.

How much RAM do I need to run LLMs locally?

It depends on the model size. 8GB can handle smaller models, 16GB is sufficient for many use cases, but 32GB or more is mostly required for larger models.


Srujana Maddula's photo
Author
Srujana Maddula
LinkedIn

Srujana is a freelance tech writer with the four-year degree in Computer Science. Writing about various topics, including data science, cloud computing, development, programming, security, and many others comes naturally to her. She has a love for classic literature and exploring new destinations.

Chủ đề

AI Courses

Tracks

Cơ bản về Trí tuệ Nhân tạo

10 giờ
Khám phá những kiến thức cơ bản về Trí tuệ Nhân tạo (AI), học cách ứng dụng AI một cách hiệu quả trong công việc, và tìm hiểu sâu về các mô hình như ChatGPT để nắm bắt xu hướng phát triển của lĩnh vực AI.
Xem chi tiếtRight Arrow
Bắt đầu khóa học
Xem thêmRight Arrow
Có liên quan

blogs

AnythingLLM: A Complete Guide to Setup, Features, and Use Cases

Learn how to install AnythingLLM with Docker and Ollama, set up RAG pipelines for private document chat, and choose between AnythingLLM, ChatGPT, and Open WebUI.
Khalid Abdelaty's photo

Khalid Abdelaty

11 phút

blogs

13 LLM Projects For All Levels: From Low-Code to AI Agents

Discover 13 LLM project ideas with easy-to-follow guides and code. Build RAG systems, AI apps, and autonomous agents using DeepSeek, LangGraph, and OpenAI.
Abid Ali Awan's photo

Abid Ali Awan

10 phút

Tutorials

Run LLMs Locally: 6 Simple Methods

Run LLMs locally (Windows, macOS, Linux) by using these easy-to-use LLM frameworks: Ollama, LM Studio, vLLM, llama.cpp, Jan, and llamafile.
Abid Ali Awan's photo

Abid Ali Awan

Tutorials

Docker Ollama: Run LLMs Locally for Privacy and Zero Cost

Set up Ollama in Docker to run local LLMs like Llama and Mistral. Keep your data private, eliminate API costs, and build AI apps that work offline.
Dario Radečić's photo

Dario Radečić

Tutorials

vLLM: Setting Up vLLM Locally and on Google Cloud for CPU

Learn how to set up and run vLLM (Virtual Large Language Model) locally using Docker and in the cloud using Google Cloud.
François Aubry's photo

François Aubry

Tutorials

How to Run Llama 3 Locally With Ollama and GPT4ALL

Run LLaMA 3 locally with GPT4ALL and Ollama, and integrate it into VSCode. Then, build a Q&A retrieval system using Langchain and Chroma DB.
Abid Ali Awan's photo

Abid Ali Awan

Xem thêmXem thêm