
Vicuna-13B Tutorial: A Guide to Running Vicuna-13B

A complete guide to running the Vicuna-13B model through a FastAPI server.
Nov 2023  · 15 min read

The emergence of large language models has transformed industries, bringing the power of technologies like OpenAI's GPT-3.5 and GPT-4 to various applications. However, the high cost and data confidentiality concerns often deter potential adopters. This is where open-source models like Vicuna-13B come in, offering similar performance without the significant expense and risk to sensitive information.

To learn more about GPT-3.5 and GPT-4, our tutorial Using GPT-3.5 and GPT-4 via the OpenAI API in Python is a good starting point to understand how to work with the OpenAI Python package to programmatically have conversations with ChatGPT.

What is Vicuna-13B?

Vicuna-13B is an open-source conversational model trained from fine-tuning the LLaMa 13B model using user-shared conversations gathered from ShareGPT.

A preliminary evaluation using GPT-4 as a judge showed Vicuna-13B achieving more than 90% of the quality of ChatGPT and Google Bard, while outperforming other models like LLaMA and Alpaca in more than 90% of cases. The overall fine-tuning and evaluation workflow is illustrated below:

Vicuna-13B fine-tuning and evaluation workflow (source)

While this highlights the strengths of Vicuna-13B, the model has its own limitations, particularly in tasks involving mathematics and reasoning.

Image from LMSYSORG modified by author

Applications of Vicuna-13B

As an intelligent chatbot, Vicuna-13B has countless applications; some of them are illustrated below across industries such as customer service, healthcare, education, finance, and travel/hospitality.

Five different applications of Vicuna-13B in real life

  • Customer service: with their ability to provide 24/7 service, chatbots can support customers and automate repetitive, simple queries. For instance, Sephora uses chatbots on its website to help customers find products, check their order status, and get makeup tips.
  • Healthcare: patients can quickly get personalized healthcare information, freeing healthcare specialists to focus on research into new treatments. For instance, Babylon Health leverages AI-powered chatbots to conduct health assessments and better monitor patients' health.
  • Education: many renowned educational institutions, such as Duolingo, have been using AI bots in creative ways to enhance students' engagement and facilitate their understanding of more advanced concepts.
  • Finance: the first interaction I have with my bank when I need assistance is through an intelligent bot, and many banks, such as Wells Fargo, use similar technologies to assist their clients with repetitive tasks such as checking account balances, transferring money, and managing accounts.
  • Travel/hospitality: similar to banking services, chatbots can help users find the best destinations based on their budget and travel requirements, and even act as a virtual concierge. The Equinox Hotel New York's hospitality chatbot, Omar, handles almost 85% of customer queries; it allowed the hotel to process 2,600 new inquiries in less than three months and generate 150 qualified leads in only four months.

The above list is not exhaustive, and our article How NLP is changing the future of Data Science discusses how NLP is driving the future of data science and machine learning, its future applications, risks, and how to mitigate them.

Importance and relevance of learning how to run Vicuna-13B

Based on these real-life applications, there is no doubt that there is a tremendous demand for AI-powered conversational agents like Vicuna-13B across major industries.

Hence, mastering how to implement and integrate such technology into an industry workflow is becoming an increasingly valuable skill as more companies use these technologies to improve customer engagement, generate more leads, and increase their return on investment.

The good news is that the whole purpose of this tutorial is to help you develop the skills required to run the Vicuna-13B model and integrate it into any business information system workflow.


Prerequisites

Before diving into running the Vicuna-13B model, it is important to properly set up the prerequisites, which include:

  • Understanding the Vicuna-13B workflow
  • Setting up the necessary hardware and software
  • Obtaining the LLaMa weights for Vicuna-13B

Once these requirements are met, we will have a better understanding of the input and output formats and be able to successfully run the model.

Understanding the Vicuna-13B workflow

Understanding the input and output formats of Vicuna-13B gives the developer a baseline knowledge of the factors that can affect the model's decision-making.

This section explains the main required parameters in the input along with the expected output format of the Vicuna-13B model.

Minimalist view of the end-to-end workflow

Let’s walk through the above workflow. First, the user submits the desired input, which includes the prompt: the main textual request the user wants a response to.

The prompt is provided along with the following additional parameters:

  • Max-length: specifies the maximum number of tokens the model should generate in the final response.
  • Temperature: this value ranges between 0 and 1. The lower the value, the more deterministic the result; a higher value leads to more randomness, hence more diverse and creative output.
  • Top-p: controls the diversity of the predictions by sampling only from the most probable tokens whose cumulative probability exceeds the given threshold. A higher value increases the chance of finding a better output but requires additional computation.
  • Repetition_penalty: penalizes words that have already appeared often, preventing the model from over-using a specific word and making the result more varied.
  • Seed: ensures a consistent result for a given input across multiple runs.
  • Debug: activates the debugging mode, which can be useful for diagnosing the model's behavior and performance.

After submitting the prompt with the above parameters, the Vicuna-13B model performs the relevant processing under the hood to generate the required output, which corresponds to the following schema:

  • Text: the textual output generated by the Vicuna model
  • Metadata: some additional information generated by the model. An in-depth explanation is provided in the next sections.
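
The input parameters above map closely onto the keyword arguments accepted by llama-cpp-python when calling the model (max_tokens, temperature, top_p, repeat_penalty; the seed is typically fixed when loading the model). Here is a minimal sketch, assuming those argument names, that packages the user's choices before invoking the model:

```python
def build_generation_params(max_tokens=256, temperature=0.7,
                            top_p=0.9, repeat_penalty=1.1):
    """Collect the generation parameters described above into the
    keyword arguments expected by the llama-cpp-python model call."""
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    return {
        "max_tokens": max_tokens,         # cap on the generated tokens
        "temperature": temperature,       # lower = more deterministic
        "top_p": top_p,                   # nucleus-sampling threshold
        "repeat_penalty": repeat_penalty, # discourages repeated words
    }

# The dict can then be unpacked into the model call, e.g.:
# response = vicuna_model(prompt, **build_generation_params(temperature=0.2))
```

Keeping the parameters in one helper makes it easy to validate them before they reach the model.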

Setting up the necessary hardware and software

Now that we have a better understanding of the input and output, we can identify the main libraries needed to successfully run the Vicuna model.

  • Conda: ensures a clean and isolated environment for the Vicuna model's dependencies.
  • FastAPI: provides a modern framework to develop the Vicuna model's API with automatic validation of the user's input.
  • Uvicorn: serves the Vicuna model's FastAPI application, ensuring efficient and reliable performance.
  • LLaMa.cpp: handles the processing complexity of the large language model that underlies the Vicuna model.

Create the virtual environment

All the commands in this section are run from a terminal. Using the conda create statement, we create a virtual environment called vicuna-env.

conda create --name vicuna-env

After successfully creating the virtual environment, we activate it using the conda activate statement, as follows:

conda activate vicuna-env

The above statement should display the name of the environment in parentheses at the beginning of the terminal prompt, as follows:

Name of the virtual environment after activation

Installation of the libraries

The above libraries can be installed in two ways. The first is to manually run pip install for each library, which is time-consuming when dealing with a significant number of libraries and is not optimal when working at an industry level.

The second, and better, option uses a requirements.txt file listing each of the required libraries. Below is the content of the requirements.txt file, followed by the installation using a single pip install command.

# Content of the requirements.txt file
fastapi
uvicorn
llama-cpp-python

All three libraries can be installed as follows:

pip install -r requirements.txt
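
After installation, a quick sanity check confirms that the dependencies are importable. This is a small sketch using only the standard library; fastapi, uvicorn, and llama_cpp are the import names of the three packages above:

```python
from importlib.util import find_spec

def missing_dependencies(modules=("fastapi", "uvicorn", "llama_cpp")):
    """Return the import names that cannot be resolved in the
    current environment; an empty list means all are installed."""
    return [name for name in modules if find_spec(name) is None]

# print(missing_dependencies())  # [] when everything is installed
```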

Obtaining the LLaMa weights

From the Hugging Face platform, we can download the Vicuna model weights as shown below:

Two main steps to download the Vicuna-13B weights from Hugging Face

For better organization of the code, we move the downloaded model weights into a new model folder.
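
This move can also be scripted. Here is a small sketch using the standard library; the file name matches the ggml-vicuna-13b-4bit.bin weight downloaded above:

```python
import shutil
from pathlib import Path

def move_weights(weight_file: str, model_dir: str = "model") -> Path:
    """Move the downloaded weight file into the model folder and
    return its new location."""
    dest_dir = Path(model_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    destination = dest_dir / Path(weight_file).name
    shutil.move(weight_file, destination)
    return destination

# Example: move_weights("ggml-vicuna-13b-4bit.bin")
```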

Packaging and Building an API

While all the prerequisites have been met to run the model via the command line, deploying an API that interfaces with the model is a more practical approach. This will make the model easily accessible to a wider range of users.

To achieve that level, we need the following steps:

  • Load the model from the local machine
  • Build an API to expose the model

The overall structure of the source code is given below, and the complete code is available on GitHub.

------ model
         |------- ggml-vicuna-13b-4bit.bin

------ vicuna_app.py

------ requirements.txt
  • ggml-vicuna-13b-4bit.bin is the downloaded model weight in the model folder
  • vicuna_app.py contains the Python code to create the API, which is the main focus of this section
  • requirements.txt contains the dependencies to run the model

Definition of the main routes of the API

In our case, the following two main routes are required. Before that, we need to import the relevant libraries, load the model, and create an instance of the FastAPI module.

# Import the libraries
from fastapi import FastAPI, HTTPException
from llama_cpp import Llama

# Load the model 
vicuna_model_path = "./model/ggml-vicuna-13b-4bit.bin"
vicuna_model = Llama(model_path=vicuna_model_path)

app = FastAPI()
  • First route: the default route "/" returns the JSON message “Message”: “Welcome to the Vicuna Demo FastAPI” through the home() function, which takes no parameters.
# Define the default route
@app.get("/")
def home():
   return {"Message": "Welcome to the Vicuna Demo FastAPI"}
  • Second route: “/vicuna_says” triggers the answer_prompt() function, which takes the user’s prompt as a parameter and returns a JSON response containing the Vicuna model’s result and some additional metadata.
# Define the route answering the user's prompt
@app.post("/vicuna_says")
def answer_prompt(user_prompt: str):

   if not user_prompt:
       raise HTTPException(status_code=400,
                           detail="Please provide a valid text message")

   response = vicuna_model(user_prompt)

   return {"Answer": response}
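
Once the server is running, the API can be called from any HTTP client. Below is a standard-library sketch; the base URL assumes Uvicorn's default local address, and since user_prompt is a plain string parameter, FastAPI reads it from the query string:

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://127.0.0.1:8000"  # Uvicorn's default local address

def build_url(prompt: str, base_url: str = API_URL) -> str:
    """Build the /vicuna_says URL carrying the prompt as a query parameter."""
    query = urllib.parse.urlencode({"user_prompt": prompt})
    return f"{base_url}/vicuna_says?{query}"

def ask_vicuna(prompt: str) -> dict:
    """POST the prompt to the running API and return the parsed JSON answer."""
    request = urllib.request.Request(build_url(prompt), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```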

Execution of the routes

We can finally run the API with the following instruction from the command line, executed from the location of the vicuna_app.py file.

uvicorn vicuna_app:app --reload
  • uvicorn starts the Uvicorn server.
  • vicuna_app: corresponds to the Python file vicuna_app.py; it is the name on the left of the “:” sign.
  • app: corresponds to the object created inside the file with the instruction app = FastAPI(). If the file were named main.py, the instruction would be uvicorn main:app --reload.
  • --reload: an option that restarts the server after any code change. Keep in mind that this should only be used during development, not deployment.

Here is the result of running the previous command, and we can see that the server is running on localhost on port 8000 (http://127.0.0.1:8000).

URL of the API running on localhost

Accessing the default route is straightforward: just open http://127.0.0.1:8000 in any browser.

Response from the default route

Here is an interesting part: the http://127.0.0.1:8000/docs URL provides a complete dashboard for interacting with the API, and below is the result.

Graphical visualization of the two routes

From the previous image, when we select /vicuna_says in the green box, we get the complete guide to interacting with our API, as shown below.

The three main steps to execute the user’s prompt

Just select the Try it out tab and provide the message you want a response for in the user_prompt field.

We finally get the following result after selecting the Execute button.

Server response from the Vicuna model to the user’s query

The Vicuna model response is provided in the “text” field, and the result looks pretty impressive.

Additional metadata is provided, such as the model being used, the response creation date, the number of tokens in the prompt, and the number of tokens in the response, just to mention a few.
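
llama-cpp-python returns its result as an OpenAI-style completion dictionary. Here is a hedged sketch, assuming that structure (a choices list, model, created, and a usage block with token counts), that extracts the fields mentioned above:

```python
def summarize_response(response: dict) -> dict:
    """Pull the generated text and the main metadata fields out of a
    llama-cpp-python completion dictionary."""
    usage = response.get("usage", {})
    return {
        "text": response["choices"][0]["text"],        # the model's answer
        "model": response.get("model"),                # model being used
        "created": response.get("created"),            # creation timestamp
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
    }
```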

Some insights on how deployments facilitate the use of Vicuna-13B

Deploying such a powerful model enhances user experience and operational efficiency in several notable ways, including but not limited to the following:

  • Ease of Access: Deployment allows users to access the model via an API, eliminating the need to download and set up the model locally.

  • Resource management: Adequate server resources can handle multiple simultaneous requests, providing a reliable service.

  • Scalability: Resources can be scaled up to meet growing user demands, ensuring consistent performance.

  • Updates and maintenance: Centralized updates and maintenance ensure users always have access to the latest version of the model.

  • Integration: Easier integration with other systems and applications expands the model's potential use cases and versatility.

Overall, deployment facilitates a more accessible, reliable, and versatile use of the Vicuna-13B model, broadening its applicability for various users and applications.

Vicuna-13B Best Practices and Tips

When working with the Vicuna-13B model, a few best practices and tips can help developers optimize their experience and results and troubleshoot common issues.

Tips on optimizing the performance of the Vicuna-13B model

By following these best practices and tips, developers can ensure a smoother experience and better results when working with the Vicuna-13B model.

4 tips to optimize the performance of the Vicuna-13B model

  • Hardware optimization: Ensure the server hosting Vicuna-13B has ample RAM and CPU resources to manage the model's computational needs efficiently.
  • Batch processing: Leverage the model's parallel processing capabilities by processing data in batches whenever possible.
  • Caching results: Implement caching for the results of common queries to reduce redundant computations and enhance response times.
  • Input preprocessing: Preprocess input data to confirm it is in the optimal format for the model, thus aiding in improving both accuracy and performance.
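
For the caching tip, Python's functools.lru_cache gives a minimal in-memory version. This is a sketch only: a production system would typically use an external cache such as Redis, and caching model answers only makes sense with deterministic settings (e.g. a fixed seed and temperature 0).

```python
from functools import lru_cache

# Counter just to demonstrate that repeated prompts skip recomputation
calls = {"count": 0}

@lru_cache(maxsize=128)
def cached_answer(prompt: str) -> str:
    """Answer a prompt, recomputing only for prompts not seen before."""
    calls["count"] += 1
    # In the real API this would invoke the model: vicuna_model(prompt)
    return f"answer to: {prompt}"
```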

Troubleshooting common issues

Transitioning from performance optimization, it is also important to be adept at troubleshooting common issues that might arise during the use of Vicuna-13B, and below are four examples:

Some common troubleshooting issues

  • Memory errors: If memory errors are encountered, consider options such as increasing the server's RAM or optimizing the input size to make it more manageable for the model.
  • Slow response times: Upgrade the server's CPU or employ batch processing to ameliorate slow response times from the model.
  • Incorrect outputs: In case of incorrect or unexpected outputs, meticulously review the input data for any anomalies or errors that could be affecting the results.
  • Integration issues: Should there be challenges in integrating Vicuna-13B with other systems or applications, thoroughly review the API documentation and verify that all integration points are configured correctly.
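
The memory-error fix can be partly automated on the server side. Here is a sketch of the "optimize the input size" idea, truncating overly long prompts before they reach the model; the 2,048-character limit is an illustrative choice, not a model constant:

```python
MAX_PROMPT_CHARS = 2048  # illustrative cap, tune to your hardware

def clamp_prompt(prompt: str, limit: int = MAX_PROMPT_CHARS) -> str:
    """Trim the prompt so oversized inputs cannot exhaust memory,
    keeping the most recent text (the tail of the prompt)."""
    if len(prompt) <= limit:
        return prompt
    return prompt[-limit:]  # keep the tail, usually the latest context
```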


Conclusion

In summary, this article has provided a comprehensive overview of setting up and utilizing the Vicuna-13B model. We covered the necessary prerequisites, including the hardware, software, and LLaMa weights essential for running the model successfully.

Detailed instructions were provided for setting up the working environment, installing the required software, and setting up Vicuna-13B's weights, emphasizing the significance of weights in language models.

Furthermore, we delved into the packaging and deployment of Vicuna-13B, highlighting how deployment facilitates its usage as a web interface and API.

Finally, practical insights on interacting with Vicuna-13B, examples of its capabilities, and best practices for optimizing its performance and troubleshooting common issues were also shared.

Ready to dive deeper into the world of AI and machine learning? Enhance your skills with the powerful deep learning framework used by AI professionals. Join the Deep Learning with PyTorch course today.

Zoumana Keita

Zoumana develops LLM AI tools to help companies conduct sustainability due diligence and risk assessments. He previously worked as a data scientist and machine learning engineer at Axionable and IBM. Zoumana is the founder of the peer learning education technology platform ETP4Africa. He has written over 20 tutorials for DataCamp.

