Using Open Source AI Models with Hugging Face
Hugging Face has a large ecosystem that consists of a Git platform and multiple open source libraries such as transformers, diffusers, datasets, accelerate, and more. This project is intended to give you a broad understanding of the Hugging Face ecosystem and its components, with a focus on transformers. You will learn:
- How to navigate the Hugging Face Hub
- Design principles of the transformers library
- How to use the datasets library to load and create datasets on the Hub
- How to load and use pre-trained open source models to create custom NLP and Computer Vision pipelines
While this project is meant as an introduction, you can always refer to the Hugging Face documentation pages and the discussion forum to learn more about each library and integrated model.
Before you begin
You'll need a Hugging Face developer account and an access token for the final task in this project.
See getting-started.ipynb for steps on how to create a token and store it in Workspace.
Note that Task 8 is computationally intensive and takes several minutes to run in Workspace Premium. It does not consistently run in the free version of Workspace. Click 'Get Premium' in the top right of Workspace to upgrade.
Task 0: Setup
For this project, we need the torch, transformers, and sentencepiece Python packages in order to load and use pre-trained models. We will also need the huggingface_hub package to programmatically log in to the Hugging Face Hub and manage repositories, and the datasets package to download datasets from the Hub.
Instructions
The sentencepiece package is required by transformers to perform inference with some of the pre-trained open source models on the Hugging Face Hub and does not need to be explicitly imported. Import the remaining packages as follows.
- Run the provided !pip code to install the necessary packages and restart your kernel.
- Import torch
- Import huggingface_hub using the alias hf_hub
- Import datasets
- Import transformers
!pip install datasets==2.13
!pip install huggingface_hub==0.16.4
!pip install "pyarrow>=8.0.0"

import os
import torch
# Import huggingface_hub using the alias hf_hub
import huggingface_hub as hf_hub
import datasets
import transformers

Task 1: Download Pre-trained Models from the HF Hub
Hugging Face Hub as a Git Platform
The Hugging Face website (also known as the Hub) is essentially a Git platform designed to store pretrained models and datasets as Git repositories. Similar to GitHub, it allows users to explore, create, clone, and push repositories, and much more. Each pretrained checkpoint has its own repository and in most cases a descriptive README with code snippets to load and run the model. See the bert-base-cased model repository as an example.
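The Hub can also be explored programmatically with the huggingface_hub package. Below is a minimal sketch, assuming the package from Task 0 is installed; the search string and the result limit are only illustrative choices.

import huggingface_hub as hf_hub

# Search the Hub for repositories matching a query and print their repository IDs
for model_info in hf_hub.list_models(search="twitter-roberta-base-emoji", limit=5):
    print(model_info.modelId)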
How to Use Pretrained Models
While the Hub is a great place to explore different tasks and pretrained models, we need the transformers or diffusers libraries in order to load and make predictions with pre-trained models. These two libraries reimplement state-of-the-art ML research code so that vastly different models can be downloaded, loaded into memory, and used in a unified way with a few lines of code.
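As an illustration of what "a few lines of code" looks like in practice, here is a minimal sketch using the high-level pipeline API of transformers; the checkpoint name and example sentence are only placeholders, and building custom pipelines is covered later in the project.

from transformers import pipeline

# Build a text classification pipeline from a pretrained checkpoint and run it on one sentence
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("Hugging Face makes sharing pretrained models easy!"))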
In this task, you will learn how to use the Auto classes of transformers and the from_pretrained method to download and load any model on the Hugging Face Hub. For a full list of supported models, refer to the GitHub README.
What is the Auto Class?
The Auto classes of transformers are simply tools to load models and their data preprocessors in a unified way. Remember, the library reimplements each model so that each has its own class (BertModel, RobertaModel, T5Model, etc.) with a mostly uniform input and output data format across all models. transformers provides the following Auto class types to load models and their data preprocessors (see the short example after the list):
- AutoModel
- AutoModelFor&lt;TASK&gt; (more on this below)
- AutoTokenizer
- AutoFeatureExtractor
- AutoImageProcessor
- AutoProcessor
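To make the relationship between an Auto class and the model-specific classes concrete, here is a minimal sketch; it assumes the bert-base-cased checkpoint mentioned earlier and simply prints the concrete class that AutoModel resolves to.

from transformers import AutoModel

# AutoModel reads the checkpoint's configuration and instantiates the matching model class
model = AutoModel.from_pretrained("bert-base-cased")
print(type(model).__name__)  # prints: BertModel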
Loading Models into Memory with from_pretrained
For the first task, you will download the pretrained "cardiffnlp/twitter-roberta-base-emoji" model and load the model and its data preprocessor into memory with the from_pretrained(&lt;REPO_NAME_OR_PATH&gt;) method. "cardiffnlp/twitter-roberta-base-emoji" is a text classification model that is trained to predict the emoji class ID of a given tweet.
cardiffnlp/twitter-roberta-base-emoji is a valid Git repository on the Hub, and the from_pretrained() method downloads and uses the tokenizer-specific files from the model repository.
Instructions
Use the AutoTokenizer and AutoModel classes to download a pre-trained language model and its data preprocessor from the Hub and load them into memory.
- Import the AutoTokenizer and AutoModel classes
- Call the from_pretrained() method for both classes using the target repository name as input
- Identify the explicit class name of the pretrained model using the model configuration
# Import the AutoTokenizer and AutoModel classes from transformers
from transformers import AutoTokenizer, AutoModel
# Load the pre-trained tokenizer of the "cardiffnlp/twitter-roberta-base-emoji" model
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-emoji")
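# One possible way to complete the task (a sketch): load the model weights of the same
# checkpoint. Loading it with the base AutoModel class may warn that the classification
# head weights are not used, which is expected here.
model = AutoModel.from_pretrained("cardiffnlp/twitter-roberta-base-emoji")

# Identify the explicit class name of the pretrained model using the model configuration
print(model.config.architectures)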