Evaluating Tokenizers of Trained Language Models

For the full course, visit the [DeepLearning.AI](https://learn.deeplearning.ai) short course.

Tokenizing text

from transformers import AutoTokenizer
# define the sentence to tokenize
sentence = "Hello world!"
# load the pretrained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

[!NOTE] Understanding AutoTokenizer.from_pretrained()

When you call AutoTokenizer.from_pretrained("bert-base-cased"), you're not loading a tokenizer named "bert-base-cased". Instead, you're loading the specific tokenizer that was designed to work with the "bert-base-cased" model. The AutoTokenizer class automatically identifies and initializes the appropriate tokenizer type with the correct vocabulary and configuration files associated with that model.
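
To see this in practice, you can inspect the object that AutoTokenizer returns; a quick check (the exact class shown may vary with your transformers version):

# inspect the concrete tokenizer class that AutoTokenizer selected
print(type(tokenizer))
# e.g. <class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>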

# apply the tokenizer to the sentence and extract the token ids
token_ids = tokenizer(sentence).input_ids
print(token_ids)
# map each token ID back to its corresponding token
for token_id in token_ids:
    print(tokenizer.decode(token_id))
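
decode() renders each ID as display text; if you instead want the raw vocabulary entries, including BERT's [CLS]/[SEP] special tokens and the ## prefix that marks subword continuations, the tokenizer's convert_ids_to_tokens() method returns them directly (a small addition, not in the original lesson):

# view the raw vocabulary tokens behind the same IDs
print(tokenizer.convert_ids_to_tokens(token_ids))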

Visualizing Tokenization

# define a list of RGB colors used to highlight the tokens
colors = [
    '102;194;165', '252;141;98', '141;160;203',
    '231;138;195', '166;216;84', '255;217;47'
]
# create a function to visualize the tokens in a sentence
def show_tokens(sentence: str, tokenizer_name: str):
    """Print each token in a different background color."""
    
    # load the tokenizer and tokenize the input
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    
    # print the vocabulary size of this tokenizer
    print(f"Vocab length: {len(tokenizer)}")
    
    # print each token on a colored background, cycling through the palette;
    # \x1b[...m is an ANSI escape sequence that sets a 24-bit RGB background
    for idx, t in enumerate(token_ids):
        print(
            f'\x1b[0;30;48;2;{colors[idx % len(colors)]}m' +
            tokenizer.decode(t) +
            '\x1b[0m',
            end=' '
        )

text = """
English and CAPITALIZATION
🎡 鸟
show_tokens False None elif == >= else: two tabs:"    " Three tabs: "       "
12.0*50=600
"""

bert-base-cased

show_tokens(text, "bert-base-cased")
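
The same helper works with any tokenizer hosted on the Hugging Face Hub, so you can rerun the text above to compare how different vocabularies split it; the model names below are illustrative choices, not part of the original lesson:

# compare how other tokenizers split the same text
show_tokens(text, "gpt2")
show_tokens(text, "google/flan-t5-small")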