
Tiktoken Tutorial: OpenAI's Python Library for Tokenizing Text

Tiktoken is a fast BPE tokenizer developed by OpenAI, primarily used to count tokens for their large language models and ensure efficient text processing within specified limits.
Aug 7, 2024  · 5 min read

Tokenization is a fundamental step in natural language processing (NLP). It involves breaking text down into smaller units, known as tokens, which can be words, subwords, or characters.

Efficient tokenization is crucial for the performance of language models, making it an essential step in various NLP tasks such as text generation, translation, and summarization.

Tiktoken is a fast and efficient tokenization library developed by OpenAI. It provides a robust solution for converting text into tokens and vice versa. Its speed and efficiency make it an excellent choice for developers and data scientists working with large datasets and complex models.

This guide is tailored for developers, data scientists, and anyone who plans to use Tiktoken and wants a practical guide with examples.


Getting Started With Tiktoken

To start using Tiktoken, we need to install it in our Python environment (Tiktoken is also available for other programming languages). This can be done with the following command:

pip install tiktoken

You can check out the code for the open-source Python version of Tiktoken in the openai/tiktoken GitHub repository.

To import the library, we run:

import tiktoken

Encoding Models

Encoding models in Tiktoken determine the rules for breaking down text into tokens. These models are crucial as they define how the text is split and encoded, impacting the efficiency and accuracy of language processing tasks. Different OpenAI models use different encodings.

Tiktoken provides several encoding models optimized for different use cases:

  • o200k_base: Encoding for the newest models, such as GPT-4o and GPT-4o-mini.
  • cl100k_base: Encoding for newer OpenAI models such as GPT-4 and GPT-3.5-Turbo.
  • p50k_base: Encoding for the Codex models, which are used for code applications.
  • r50k_base: Older encoding used by different versions of GPT-3.

All of these encodings cover models available through OpenAI’s API. Note that the API gives access to many more models than are listed here. Fortunately, the Tiktoken library provides an easy way to check which encoding should be used with which model.

For example, if I need to know what encoding model the text-embedding-3-small model uses, I can run the following command and get the answer as an output:

print(tiktoken.encoding_for_model('text-embedding-3-small'))

We get <Encoding 'cl100k_base'> as an output. Before we start working with Tiktoken directly, I want to mention that OpenAI has a tokenization web app where you can see how different strings are tokenized. There is also a third-party online tokenizer, Tiktokenizer, which supports non-OpenAI models.

Encoding Text Into Tokens

To encode text into tokens using Tiktoken, you first need to obtain an encoding object. There are two ways to initialize it. First, you can do it with the tokenizer’s name:

encoding = tiktoken.get_encoding("[name of the tokenizer]")

Or, you can run the previously mentioned encoding_for_model function to get the encoder for a specific model:

encoding = tiktoken.encoding_for_model("[name of the model]")

Now, we can run the encode method of our encoding object to encode a string. For example, we can encode the string “I love DataCamp” in the following way, using the cl100k_base encoder:

print(encoding.encode("I love DataCamp"))

We get [40, 3021, 2956, 34955] as an output.

Decoding Tokens to Text

To decode the tokens back into text, we can use the .decode() method on the encoding object.

Let’s decode the following tokens [40, 4048, 264, 2763, 505, 2956, 34955]:

print(encoding.decode([40, 4048, 264, 2763, 505, 2956, 34955]))

The tokens decode to “I learn a lot from DataCamp.”

Practical Use Cases and Tips

Beyond encoding and decoding, Tiktoken has two other practical use cases worth highlighting.

Cost estimation and management

Knowing the token count before sending a request to the OpenAI API can help you manage costs effectively. Since OpenAI's billing is based on the number of tokens processed, pre-tokenizing your text allows you to estimate the cost of your API usage. Here's how you can count the tokens in your text using Tiktoken:

tokens = encoding.encode(text)
print(len(tokens))

We simply check the length of the resulting list to see how many tokens we got. By knowing the number of tokens in advance, you can decide whether to shorten the text or adjust your usage to stay within budget.

You can read more about this approach in this tutorial on Estimating The Cost of GPT Using The tiktoken Library in Python.

Input length validation

When using OpenAI models through the API, you are constrained by the maximum number of tokens for inputs and outputs. Exceeding these limits can result in errors or truncated outputs. Using Tiktoken, you can validate the input length and ensure it complies with the token limits.

Conclusion

Tiktoken is an open-source tokenization library offering speed and efficiency tailored to OpenAI’s language models.

Understanding how to encode and decode text using Tiktoken, along with its various encoding models, can greatly enhance your work with large language models.

Author
Dimitri Didmanidze
I'm Dimitri Didmanidze, a data scientist currently pursuing a Master's degree in Mathematics with a focus on Machine Learning. My academic journey has also included research about the capabilities of transformer-based models and teaching at the university level, enriching my understanding of complex theoretical concepts. I have also worked in the banking industry, where I've applied these principles to tackle real-world data challenges.