Tokenization is a fundamental step in natural language processing (NLP). It involves breaking text down into smaller units, known as tokens, which can be words, subwords, or characters.
Efficient tokenization is crucial to the performance of language models and underpins tasks such as text generation, translation, and summarization.
Tiktoken is a fast and efficient tokenization library developed by OpenAI. It provides a robust solution for converting text into tokens and vice versa. Its speed and efficiency make it an excellent choice for developers and data scientists working with large datasets and complex models.
This guide is tailored for developers, data scientists, and anyone who plans to use Tiktoken and needs a practical, example-driven introduction.
Getting Started With Tiktoken
To start using Tiktoken, we need to install it in our Python environment (Tiktoken is also available for other programming languages). This can be done with the following command:
pip install tiktoken
You can check out the code for the open-source Python version of Tiktoken in the following GitHub repo.
To import the library, we run:
import tiktoken
Encoding Models
Encoding models in Tiktoken determine the rules for breaking down text into tokens. These models are crucial as they define how the text is split and encoded, impacting the efficiency and accuracy of language processing tasks. Different OpenAI models use different encodings.
Tiktoken provides four encoding models optimized for different use cases:
- o200k_base: Encoding for the newest models, such as GPT-4o and GPT-4o-mini.
- cl100k_base: Encoding model for newer OpenAI models such as GPT-4 and GPT-3.5-Turbo.
- p50k_base: Encoding for the Codex models, which are used for code applications.
- r50k_base: Older encoding for different versions of GPT-3.
All of these models are available with OpenAI’s API. Notice that the API gives access to many more models than I have listed here. Fortunately, the Tiktoken library provides an easy way to check which encoding should be used with which model.
For example, if I need to know what encoding model the text-embedding-3-small model uses, I can run the following command and get the answer as an output:
print(tiktoken.encoding_for_model('text-embedding-3-small'))
We get <Encoding 'cl100k_base'> as an output. Before we start working with Tiktoken directly, I want to mention that OpenAI has a tokenization web app where you can see how different strings are tokenized; you can access it here. There is also a third-party online tokenizer, Tiktokenizer, which supports non-OpenAI models.
Encoding Text Into Tokens
To encode text into tokens using Tiktoken, you first need to obtain an encoding object. There are two ways to initialize it. First, you can do it with the tokenizer’s name:
encoding = tiktoken.get_encoding("[name of the tokenizer]")
Or, you can run the previously mentioned encoding_for_model function to get the encoder for a specific model:
encoding = tiktoken.encoding_for_model("[name of the model]")
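For instance, since GPT-4 uses the cl100k_base encoding, both of the following lines return the same encoder:

import tiktoken

# Both approaches return the cl100k_base encoder
encoding = tiktoken.get_encoding("cl100k_base")
encoding = tiktoken.encoding_for_model("gpt-4")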
Now, we can run the encode method of our encoding object to encode a string. For example, we can encode the “I love DataCamp” string in the following way—here I use the cl100k_base encoder:
print(encoding.encode("I love DataCamp"))
We get [40, 3021, 2956, 34955] as an output.
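If you want to inspect which piece of text each token ID corresponds to, the encoding object also provides a decode_single_token_bytes method, which returns the raw bytes for a single token:

# Print each token ID alongside the byte string it represents
for token in encoding.encode("I love DataCamp"):
    print(token, encoding.decode_single_token_bytes(token))

Note that this method returns bytes objects, because a single token does not always map to a complete Unicode character.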
Decoding Tokens to Text
To decode the tokens back into text, we can use the .decode() method on the encoding object.
Let’s decode the following tokens [40, 4048, 264, 2763, 505, 2956, 34955]:
print(encoding.decode([40, 4048, 264, 2763, 505, 2956, 34955]))
The tokens decode to “I learn a lot from DataCamp.”
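Since decoding is the inverse of encoding, a quick sanity check is to verify that a round trip reproduces the original string (this holds for ordinary text, though decoding an arbitrary slice of tokens can split a character across byte boundaries):

text = "I learn a lot from DataCamp"
# Encoding then decoding should return the original string
assert encoding.decode(encoding.encode(text)) == text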
Practical Use Cases and Tips
Outside of encoding and decoding, there are two other use cases worth covering.
Cost estimation and management
Knowing the token count before sending a request to the OpenAI API can help you manage costs effectively. Since OpenAI's billing is based on the number of tokens processed, pre-tokenizing your text allows you to estimate the cost of your API usage. Here's how you can count the tokens in your text using Tiktoken:
text = "I love DataCamp"
tokens = encoding.encode(text)
print(len(tokens))  # number of tokens this input would be billed for
We simply see how many tokens we got by checking the length of the token list. By knowing the number of tokens in advance, you can decide whether to shorten the text or adjust your usage to stay within budget.
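As a minimal sketch, you can turn the token count into a rough cost estimate. The price below is a hypothetical placeholder, not a real rate; always check OpenAI's pricing page for the model you are using:

# Hypothetical input price in USD per 1,000 tokens (placeholder, not a real rate)
PRICE_PER_1K_INPUT_TOKENS = 0.0005

def estimate_input_cost(text: str, encoding) -> float:
    """Estimate the cost of sending text as input to the API."""
    num_tokens = len(encoding.encode(text))
    return num_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS

print(estimate_input_cost("I love DataCamp", encoding))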
You can read more about this approach in this tutorial on Estimating The Cost of GPT Using The tiktoken Library in Python.
Input length validation
When using OpenAI models through the API, you are constrained by maximum token limits for inputs and outputs. Exceeding these limits can result in errors or truncated outputs. Using Tiktoken, you can validate input length ahead of time and ensure it complies with the model's token limits, as shown in the sketch below.
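Here is a minimal sketch of such a check. The 8,192-token limit is an assumption for illustration; real context windows vary by model, so look up the limit for the model you are calling:

MAX_INPUT_TOKENS = 8192  # assumed limit for illustration; varies by model

def fits_within_limit(text: str, encoding) -> bool:
    """Return True if the text's token count is within the assumed limit."""
    return len(encoding.encode(text)) <= MAX_INPUT_TOKENS

# A long input that likely exceeds the assumed limit
long_text = "I love DataCamp. " * 2000
if not fits_within_limit(long_text, encoding):
    print("Input too long: consider truncating or chunking it.")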
Conclusion
Tiktoken is an open-source tokenization library offering speed and efficiency tailored to OpenAI’s language models.
Understanding how to encode and decode text using Tiktoken, along with its various encoding models, can greatly enhance your work with large language models.