Run LLMs Locally: 7 Simple Methods
Running large language models (LLMs) on local systems is becoming increasingly popular thanks to the improved privacy, control, and reliability it offers. Sometimes, these models can even be more accurate and faster than ChatGPT.
We’ll show seven ways to run LLMs locally with GPU acceleration on Windows 11, but the methods we cover also work on macOS and Linux.
LLM frameworks that help us run LLMs locally. Image by Abid Ali Awan.
If you want to learn about LLMs from scratch, a good place to start is this course on Large Language Models (LLMs).
Let’s start by exploring our first LLM framework.
1. GPT4All
GPT4All (nomic-ai/gpt4all) is an LLM framework and chatbot application for all operating systems. We can run LLMs locally and then use the API to integrate them with any application, such as an AI coding assistant on VSCode. This is the simplest, most beginner-friendly way to download and run LLMs on your local machine.
A. Downloading the client
Download the installer from the nomic-ai/gpt4all GitHub repository. Your choice depends on your operating system—for this tutorial, we choose Windows.
B. Downloading the model
Install the GPT4All package by selecting the default options. When we launch the GPT4All application, we'll be prompted to download a language model before using it. Download a model of your choice.
C. Selecting the model
Once the downloading is complete, close the model page to access the chat user interface.
Select the model you’ve downloaded—we chose Nous Hermes 2 Mistral DPO.
D. Generating response
If you have CUDA installed, GPT4All will automatically use the GPU to accelerate response generation. If not, and you have an Nvidia GPU, you might want to install the CUDA Toolkit 12.4 first.
We can use the application much like we use ChatGPT online. Notice that responses are generated much faster than a typical GPT-4 response.
E. Model settings
We can customize the model response by going to the settings and playing around with model parameters.
We can also connect a local folder of files to get context-aware responses.
Moreover, we can enable the API server so that any application can use our model using an API key.
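For instance, here is a minimal sketch of how another application could call that server using the OpenAI Python package. It assumes the API server is enabled in GPT4All's settings and listens on its default port 4891; the port, key, and model name may differ on your machine.

from openai import OpenAI

# Point the OpenAI client at GPT4All's local, OpenAI-compatible server
# (port 4891 is an assumption; check the API server settings in the app).
client = OpenAI(base_url="http://localhost:4891/v1", api_key="local-api-key")

response = client.chat.completions.create(
    model="Nous Hermes 2 Mistral DPO",  # the model loaded in the GPT4All app
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(response.choices[0].message.content)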
F. Accessing OpenAI models
We can access GPT-3.5 and GPT-4 models by providing the OpenAI API key.
We need to go to the model’s page, scroll down, provide the API key to the GPT-4 model, and press the install button.
Then, we select the ChatGPT-4 model in the chat user interface.
We can now start using it as if we’re using it on our browser.
Let’s move on to our next LLM tool. This tutorial on LLM classification will help you choose the best LLM for your application.
2. LM Studio
LM Studio provides options similar to GPT4All, except it doesn’t allow connecting a local folder to generate context-aware answers.
A. Installation
We can download the installer from LM Studio’s home page.
Once the download is complete, we install the app with default options.
Finally, we launch LM Studio!
B. Downloading the model
We can download any model from Hugging Face using the search function.
In our case, we'll download the smallest model, Google’s Gemma 2B Instruct.
C. Generating the response
We can select the downloaded model from the drop-down menu at the top and chat with it as usual. LM Studio offers more customization options than GPT4All.
D. Local inference server
Like GPT4All, we can customize the model and launch the API server with one click. To access the model, we can use the OpenAI API Python package, cURL, or integrate it directly with any application.
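As a rough sketch, the snippet below streams a response with the OpenAI Python package. It assumes the server listens on LM Studio's usual port 1234, and the model identifier is a placeholder for the one shown in the server tab.

from openai import OpenAI

# LM Studio exposes an OpenAI-compatible endpoint; port 1234 is assumed here.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

stream = client.chat.completions.create(
    model="gemma-2b-it",  # placeholder; copy the exact identifier LM Studio displays
    messages=[{"role": "user", "content": "Write a haiku about running LLMs locally."}],
    stream=True,  # print tokens as they arrive instead of waiting for the full reply
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming makes the terminal feel as responsive as the chat UI, since tokens are printed as soon as they are generated.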
E. Using multiple models
The key feature of LM Studio is that it offers the option to run and serve multiple models at once. This lets users compare different models' results and use them in multiple applications. Running multiple model sessions, however, requires a GPU with plenty of VRAM.
Fine-tuning is another way of generating context-aware and customized responses. You can learn to fine-tune your Google Gemma model by following the tutorial Fine Tuning Google Gemma: Enhancing LLMs with Customized Instructions. You'll learn to run inference on GPUs/TPUs and fine-tune the latest Gemma 7b-it model on a role-play dataset.
3. Jan
One of the most popular and best-looking local LLM applications is Jan. It was also the fastest local LLM application we tried, generating responses at 53.26 tokens/sec; for comparison, GPT4All's rate was 31 tokens/sec.
A. Installation
We can download the installer from Jan.ai.
Once we install the Jan application with default settings, we’re ready to launch the application.
B. Importing the model
When we covered GPT4All and LM Studio, we already downloaded two models. Instead of downloading another one, we'll import the ones we already have by going to the model page and clicking the Import Model button.
Then, we go to the applications directory, select the GPT4All and LM Studio models, and import each.
- GPT4All: "C:/Users/<user_name>/AppData/Local/nomic.ai/GPT4All/"
- LM Studio: "C:/Users/<user_name>/.cache/lm-studio/models"
C. Accessing the local models
To access the local models, we go to the chat user interface and open the model section in the right panel.
We see our imported models are already there. We can select the one we want and start using it immediately!
D. Generating the response
The response generation is very fast. The user interface feels natural, similar to ChatGPT, and does not slow down your laptop or PC.
Jan's unique feature is that it lets us install extensions and use proprietary models and inference engines from OpenAI, MistralAI, Groq, TensorRT, and Triton RT.
E. Local API server
Like LM Studio and GPT4All, we can also use Jan as a local API server. It provides more logging capabilities and control over the LLM response.
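As an illustration, the sketch below queries Jan's server with the requests library. It assumes the local API server has been started from the app and exposes an OpenAI-style route on port 1337 (Jan's default at the time of writing); the model id is a placeholder, so use the name Jan lists.

import requests

# Jan's local API server is assumed to listen on port 1337 with an OpenAI-style route.
resp = requests.post(
    "http://localhost:1337/v1/chat/completions",
    json={
        "model": "nous-hermes-2-mistral-7b-dpo",  # placeholder id; use the name shown in Jan
        "messages": [{"role": "user", "content": "What does a local API server give me?"}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])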
4. llama.cpp
Another popular open-source LLM framework is llama.cpp. It's written purely in C/C++, which makes it fast and efficient.
Many local and web-based AI applications are based on llama.cpp. Thus, learning to use it locally will give you an edge in understanding how other LLM applications work behind the scenes.
A. Downloading llama.cpp
First, we need to go to our project directory using the cd command in the shell (you can learn more about the terminal in this Introduction to Shell course).
Then, we clone all the files from the GitHub server using the command below:
$ git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
B. Using a Makefile on Windows
The make command-line tool is available by default on Linux and macOS. For Windows, however, we need to take the following steps:
- Download the latest Fortran version of w64devkit for Windows.
- Extract w64devkit to a local directory.
- In the main folder, find the file w64devkit.exe and run it.
- Use the cd command to access the llama.cpp folder:
$ cd C:/Repository/GitHub/llama.cpp
- Type make and press Enter to build llama.cpp:
$ make
C. Starting the llama.cpp web UI server
After we complete the installation, we run the llama.cpp web UI server by typing out the command below. (Note: We’ve copied the model file from the GPT4All folder to the llama.cpp folder so we can easily access the model).
$ ./server -m Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf -ngl 27 -c 2048 --port 6589
The web server is running at http://127.0.0.1:6589/. You can copy this URL and paste it into your browser to access the llama.cpp web interface.
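To check the server from code rather than the browser, we can send a request to its /completion endpoint, as in the short sketch below; the port matches the --port 6589 flag above, and the prompt and sampling parameters are just examples.

import requests

# Query the llama.cpp server started above via its /completion endpoint.
payload = {
    "prompt": "Q: What is a GGUF file?\nA:",
    "n_predict": 128,   # cap the number of generated tokens
    "temperature": 0.7,
}
resp = requests.post("http://127.0.0.1:6589/completion", json=payload, timeout=300)
print(resp.json()["content"])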
Before interacting with the chatbot, we should modify the settings and model’s parameters.
Check out this llama.cpp tutorial if you want to learn more!
D. Generating the response
The response generation is slow because we're running it on the CPU, not the GPU. To run it on the GPU, we must rebuild llama.cpp with CUDA support:
$ make LLAMA_CUDA=1
5. llamafile
If you find llama.cpp a bit too complex, try llamafile. This framework simplifies LLMs for both developers and end-users by combining llama.cpp with Cosmopolitan Libc into a single-file executable. It removes all the complexities associated with LLMs, making them more accessible.
A. Downloading the model file
We can download the model file we want from llamafile’s GitHub repository.
We'll download LLaVA 1.5 because it can also understand images.
B. Making changes for Windows
Windows users must append .exe to the file name so it can run in the terminal. To do this, right-click the downloaded file and select Rename.
C. Running the llamafile
We first go to the llamafile directory using the cd command in the terminal. Then, we run the command below to start the llama.cpp web server.
$ ./llava-v1.5-7b-q4.llamafile -ngl 9999
The web server uses the GPU without requiring you to install or configure anything.
It'll also automatically launch the default web browser with the llama.cpp web application running. If it doesn’t, we can use the URL http://127.0.0.1:8080/ to access it directly.
D. Generating the response
After we settle on the model’s configuration, we can start using the web application.
Running llama.cpp via llamafile is easier and more efficient. We generated responses at 53.18 tokens/sec (without llamafile, the rate was 10.99 tokens/sec).
6. Ollama
Ollama is a tool that lets us access LLMs such as Llama 3, Mistral, and Gemma directly from the terminal.
Additionally, many applications offer Ollama integrations, which makes it an excellent tool for faster and easier access to language models on our local machine.
A. Installing Ollama
We can download Ollama from the download page.
Once we install it (use default settings), the Ollama logo will appear in the system tray.
B. Running Ollama
We can download the Llama 3 model by typing the following terminal command:
$ ollama run llama3
Llama 3 is now ready to use! To run other LLMs, we use the same command with a different model name (for example, ollama run mistral).
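Ollama also runs a local REST API alongside the app (on port 11434 by default), which is how other applications integrate with it. Here is a minimal sketch using the requests library; it assumes the llama3 model has already been pulled with the command above.

import requests

# Ask the local Ollama server (default port 11434) for a single, non-streamed reply.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Give me one tip for running LLMs locally.", "stream": False},
    timeout=300,
)
print(resp.json()["response"])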
C. Running custom models
To access models that have already been downloaded and are available in the llama.cpp folder, we need to:
- Go to the llama.cpp folder using the cd command:
$ cd C:/Repository/GitHub/llama.cpp
- Create a file called Modelfile and add the line "FROM ./Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf":
$ echo "FROM ./Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf" > Modelfile
- Build the model by providing a name for it (here, NHM-7b):
$ ollama create NHM-7b -f Modelfile
- Run the NHM-7b model.
$ ollama run NHM-7b
- Use it like any other chat application.
With this method, we can download any LLM from Hugging Face with the .gguf extension and use it in the terminal, as sketched below. If you want to learn more, check out this course on Working with Hugging Face.
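Here is a hedged sketch of that workflow using the huggingface_hub package; the repo_id and filename are assumptions, so substitute the quantized model you actually want and point your Modelfile's FROM line at the downloaded path.

from huggingface_hub import hf_hub_download

# Download a quantized GGUF file (repo_id and filename below are placeholders).
local_path = hf_hub_download(
    repo_id="NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF",
    filename="Nous-Hermes-2-Mistral-7B-DPO.Q4_0.gguf",
)
print(local_path)  # reference this path in a Modelfile: FROM <local_path>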
7. NextChat
NextChat, previously known as ChatGPT-Next-Web, is a chat application that allows us to use GPT-3.5, GPT-4, and Gemini Pro via an API key.
It's also available as a web UI, and we can even deploy our own web instance with one click on Vercel.
You can learn more about NextChat by following this detailed introduction to ChatGPT Next Web (NextChat).
A. Installation
We can download the installer from the GitHub repository. For Windows, select the .exe file.
As before, we’ll install the package using the default settings.
B. Setting up the API key
The NextChat application won't run until we add a Google AI or OpenAI API key.
To get the API key for Google AI, we need to go to Gemini API and click the blue button Get API key in Google AI Studio.
After that, we need to click the Get API key button and then create and copy the API key. It’s free, with no token limitation.
Once we have the API key, we navigate to NextChat’s settings and scroll down to the Model Provider section. We select Google as the model provider and paste the API key into the designated field.
C. Generating the response
On the main chat user interface page, click the robot (🤖) button above the chat input and select the gemini-pro model.
We can now use Google Gemini from our local app and have full control over customization. The user data is also saved locally.
Similarly, we can use an OpenAI API key to access GPT-4 models from the local app and save on the monthly subscription fee.
Conclusion
Installing and using LLMs locally can be a fun and exciting experience. We can experiment with the latest open-source models on our own while enjoying privacy, control, and an enhanced chat experience.
Using LLMs locally also has practical applications, such as integrating them with other applications through API servers and connecting local folders to provide context-aware responses. In some cases, it is essential to use LLMs locally, especially when privacy and security are critical factors.
You can learn more about LLMs and building AI applications by following these resources:
- Course: Large Language Models for Business
- Course: AI Ethics
- Blog: The Pros and Cons of Using LLMs in the Cloud Versus Running LLMs Locally
- Blog: 12 LLM Projects For All Levels
- Blog: 8 Top Open-Source LLMs for 2024 and Their Uses
- Tutorial: How to Run Llama 3 Locally: A Complete Guide
- Tutorial: How to Build LLM Applications with LangChain Tutorial