Tokenization is a fundamental step in natural language processing (NLP). It involves breaking text down into smaller units, known as tokens, which can be words, subwords, or characters.
Efficient tokenization is crucial to the performance of language models and underpins tasks such as text generation, translation, and summarization.
Tiktoken is a fast and efficient tokenization library developed by OpenAI. It provides a robust solution for converting text into tokens and vice versa. Its speed and efficiency make it an excellent choice for developers and data scientists working with large datasets and complex models.
This guide is tailored for developers, data scientists, and anyone who plans to use Tiktoken and needs a practical, example-driven introduction.
Getting Started With Tiktoken
To start using Tiktoken, we need to install it in our Python environment (Tiktoken is also available for other programming languages). This can be done with the following command:
pip install tiktoken
You can check out the code for the open-source Python version of Tiktoken in the following GitHub repo.
To import the library, we run:
import tiktoken
Encoding Models
Encoding models in Tiktoken determine the rules for breaking down text into tokens. These models are crucial as they define how the text is split and encoded, impacting the efficiency and accuracy of language processing tasks. Different OpenAI models use different encodings.
Tiktoken provides four encoding models optimized for different use cases:
- o200k_base: Encoding for the newest models, such as GPT-4o and GPT-4o-mini.
- cl100k_base: Encoding model for newer OpenAI models such as GPT-4 and GPT-3.5-Turbo.
- p50k_base: Encoding for the Codex models, which are used for code applications.
- r50k_base: Older encoding for different versions of GPT-3.
All of these models are available with OpenAI’s API. Notice that the API gives access to many more models than I have listed here. Fortunately, the Tiktoken library provides an easy way to check which encoding should be used with which model.
For example, if I need to know what encoding model the text-embedding-3-small model uses, I can run the following command and get the answer as an output:
print(tiktoken.encoding_for_model('text-embedding-3-small'))
We get <Encoding 'cl100k_base'> as an output. Before we start working with Tiktoken directly, I want to mention that OpenAI has a tokenization web app where you can see how different strings are tokenized; you can access it here. There is also a third-party online tokenizer, Tiktokenizer, which supports non-OpenAI models.
Encoding Text Into Tokens
To encode text into tokens using Tiktoken, you first need to obtain an encoding object. There are two ways to initialize it. First, you can do it with the tokenizer’s name:
encoding = tiktoken.get_encoding("[name of the tokenizer]")
Or, you can run the previously mentioned encoding_for_model function to get the encoder for a specific model:
encoding = tiktoken.encoding_for_model("[name of the model]")
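For instance, since GPT-4 uses the cl100k_base encoding, both of the following lines return the same encoder:

import tiktoken

# Both approaches return the cl100k_base encoder
encoding = tiktoken.get_encoding("cl100k_base")
encoding = tiktoken.encoding_for_model("gpt-4")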
Now, we can run the encode method of our encoding object to encode a string. For example, we can encode the “I love DataCamp” string in the following way—here I use the cl100k_base encoder:
print(encoding.encode("I love DataCamp"))
We get [40, 3021, 2956, 34955] as an output.
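If you want to inspect which piece of text each token ID corresponds to, the encoding object also provides a decode_single_token_bytes method, which returns the raw bytes for a single token:

# Print each token ID alongside the byte string it represents
for token in encoding.encode("I love DataCamp"):
    print(token, encoding.decode_single_token_bytes(token))

Note that this method returns bytes objects, because a single token does not always map to a complete Unicode character.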
Decoding Tokens to Text
To decode the tokens back into text, we can use the .decode() method on the encoding object.
Let’s decode the following tokens [40, 4048, 264, 2763, 505, 2956, 34955]:
print(encoding.decode([40, 4048, 264, 2763, 505, 2956, 34955]))
The tokens decode to “I learn a lot from DataCamp.”
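Since decoding is the inverse of encoding, a quick sanity check is to verify that a round trip reproduces the original string (this holds for ordinary text, though decoding an arbitrary slice of tokens can split a character across byte boundaries):

text = "I learn a lot from DataCamp"
# Encoding then decoding should return the original string
assert encoding.decode(encoding.encode(text)) == text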
Practical Use Cases and Tips
Outside of encoding and decoding, there are two other use cases worth covering.
Cost estimation and management
Knowing the token count before sending a request to the OpenAI API can help you manage costs effectively. Since OpenAI's billing is based on the number of tokens processed, pre-tokenizing your text allows you to estimate the cost of your API usage. Here's how you can count the tokens in your text using Tiktoken:
text = "I love DataCamp"
tokens = encoding.encode(text)
print(len(tokens))  # number of tokens this input would be billed for
We simply see how many tokens we got by checking the length of the token list. By knowing the number of tokens in advance, you can decide whether to shorten the text or adjust your usage to stay within budget.
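As a minimal sketch, you can turn the token count into a rough cost estimate. The price below is a hypothetical placeholder, not a real rate; always check OpenAI's pricing page for the model you are using:

# Hypothetical input price in USD per 1,000 tokens (placeholder, not a real rate)
PRICE_PER_1K_INPUT_TOKENS = 0.0005

def estimate_input_cost(text: str, encoding) -> float:
    """Estimate the cost of sending text as input to the API."""
    num_tokens = len(encoding.encode(text))
    return num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(estimate_input_cost("I love DataCamp", encoding))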
You can read more about this approach in this tutorial on Estimating The Cost of GPT Using The tiktoken Library in Python.
Input length validation
When using OpenAI models through the API, you are constrained by maximum token limits for inputs and outputs. Exceeding these limits can result in errors or truncated outputs. Using Tiktoken, you can validate input length ahead of time and ensure it complies with the model's token limits, as shown in the sketch below.
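Here is a minimal sketch of such a check. The 8,192-token limit is an assumption for illustration; real context windows vary by model, so look up the limit for the model you are calling:

MAX_INPUT_TOKENS = 8192  # assumed limit for illustration; varies by model

def fits_within_limit(text: str, encoding) -> bool:
    """Return True if the text's token count is within the assumed limit."""
    return len(encoding.encode(text)) <= MAX_INPUT_TOKENS

# A long input that likely exceeds the assumed limit
long_text = "I love DataCamp. " * 2000
if not fits_within_limit(long_text, encoding):
    print("Input too long: consider truncating or chunking it.")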
Conclusion
Tiktoken is an open-source tokenization library offering speed and efficiency tailored to OpenAI’s language models.
Understanding how to encode and decode text using Tiktoken, along with its various encoding models, can greatly enhance your work with large language models.