Evaluating Tokenizers of Trained Language Models

For the full course, visit the [DeepLearning.AI](https://learn.deeplearning.ai) short course.

Tokenizing text

from transformers import AutoTokenizer
# define the sentence to tokenize
sentence = "Hello world!"
# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

[!NOTE] Understanding AutoTokenizer.from_pretrained()

When you call AutoTokenizer.from_pretrained("bert-base-cased"), you're not loading a tokenizer named "bert-base-cased". Instead, you're loading the specific tokenizer that was designed to work with the "bert-base-cased" model. The AutoTokenizer class automatically identifies and initializes the appropriate tokenizer type with the correct vocabulary and configuration files associated with that model.
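
To see this in practice, you can inspect the object that AutoTokenizer returns; a quick check (the exact class shown may vary with your transformers version):

# inspect the concrete tokenizer class that AutoTokenizer selected
print(type(tokenizer))
# e.g. <class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>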

# apply the tokenizer to the sentence and extract the token ids
token_ids = tokenizer(sentence).input_ids
print(token_ids)
# map each token ID back to its corresponding token
for token_id in token_ids:
    print(tokenizer.decode(token_id))
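
decode() renders each ID as display text; if you instead want the raw vocabulary entries, including BERT's [CLS]/[SEP] special tokens and the ## prefix that marks subword continuations, the tokenizer's convert_ids_to_tokens() method returns them directly (a small addition, not in the original lesson):

# view the raw vocabulary tokens behind the same IDs
print(tokenizer.convert_ids_to_tokens(token_ids))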

Visualizing Tokenization

# define a list of RGB colors used to highlight the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]
# create a function to visualize the tokens in a sentence
def show_tokens(sentence: str, tokenizer_name: str):
    """Print each token in a different background color."""
    
    # load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    
    # print the vocabulary size of this tokenizer
    print(f"Vocab length: {len(tokenizer)}")
    
    # print each token on a colored background, cycling through the palette;
    # \x1b[...m is an ANSI escape sequence that sets a 24-bit RGB background
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

text = """
English and CAPITALIZATION
🎡 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

bert-base-cased

show_tokens(text, "bert-base-cased")
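
The same helper works with any tokenizer hosted on the Hugging Face Hub, so you can rerun the text above to compare how different vocabularies split it; the model names below are illustrative choices, not part of the original lesson:

# compare how other tokenizers split the same text
show_tokens(text, "gpt2")
show_tokens(text, "google/flan-t5-small")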