# Comparing Trained LLM Tokenizers

Evaluating the tokenizers of trained language models.

For the full course, visit the [DeepLearning.AI](https://learn.deeplearning.ai) short course.
## Tokenizing text
```python
from transformers import AutoTokenizer

# define the sentence to tokenize
sentence = "Hello world!"

# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
```
> [!NOTE]
> **Understanding `AutoTokenizer.from_pretrained()`**
>
> When you call `AutoTokenizer.from_pretrained("bert-base-cased")`, you're not loading a tokenizer named "bert-base-cased". Instead, you're loading the specific tokenizer that was designed to work with the "bert-base-cased" model. The `AutoTokenizer` class automatically identifies and initializes the appropriate tokenizer type with the correct vocabulary and configuration files associated with that model.
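As a quick sanity check, you can print the concrete class that `AutoTokenizer` resolved to; for `"bert-base-cased"` this is typically `BertTokenizerFast`, assuming the Rust-backed "fast" tokenizers are available in your `transformers` install:

```python
# inspect which concrete tokenizer class AutoTokenizer selected;
# for "bert-base-cased" this is typically BertTokenizerFast
print(type(tokenizer).__name__)
```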
```python
# apply the tokenizer to the sentence and extract the token ids
token_ids = tokenizer(sentence).input_ids
print(token_ids)

# map each token id back to its corresponding token
for token_id in token_ids:
    print(tokenizer.decode(token_id))
```
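Note that the tokenizer wraps the sentence in BERT's special tokens, so the decoded output includes `[CLS]` (id 101) at the start and `[SEP]` (id 102) at the end, with `Hello`, `world`, and `!` in between, each printed on its own line.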
## Visualizing Tokenization
```python
# list of colors (RGB values) used to highlight the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]

def show_tokens(sentence: str, tokenizer_name: str):
    """Print each token in the sentence with its own background color."""
    # load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids

    # print the vocabulary length
    print(f"Vocab length: {len(tokenizer)}")

    # print each token on a colored background, cycling through the colors
    for idx, t in enumerate(token_ids):
        print(
            # ANSI escape: black text on a 24-bit RGB background
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',  # reset formatting
            end=' '
        )
```
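The escape sequence `\x1b[0;30;48;2;R;G;Bm` tells the terminal to render black foreground text (`30`) on a 24-bit RGB background (`48;2;R;G;B`), and `\x1b[0m` resets the styling afterwards. Cycling through the six colors with `idx % len(colors)` makes adjacent tokens easy to tell apart, even when a single word is split into several subword pieces.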
text = """
English and CAPITALIZATION
π΅ ιΈ
show_tokens False None elif == >= else: two tabs:" " Three tabs: " "
12.0*50=600
"""
### bert-base-cased
```python
show_tokens(text, "bert-base-cased")
```
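For `bert-base-cased`, expect a vocabulary length of 28,996. In the colored output you should see `CAPITALIZATION` broken into several WordPiece subwords (continuation pieces prefixed with `##`), the Python keywords and operators split apart, and characters missing from the vocabulary, such as the 🎵 emoji and the Chinese character 鸟, replaced by `[UNK]`.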