Building NLP Applications with Hugging Face
Welcome! In this project, you will be learning how to perform common Natural Language Processing (NLP) tasks using Hugging Face. Some of these tasks include:
- sentiment analysis (i.e. categorizing text as negative or positive);
- text embedding (i.e. transforming a piece of text into a numerical, n-dimensional vector representation);
- semantic search (i.e. matching a query with the most appropriate result based on embeddings);
- and more!
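To make the idea of embeddings and semantic search concrete, here is a minimal sketch using only the standard library. The three-dimensional vectors are made-up stand-ins for real embeddings (which a model would produce and which typically have hundreds of dimensions), and the review snippets are hypothetical, not taken from the dataset. Cosine similarity is assumed as the scoring measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones come from an embedding model.
corpus = {
    "The dress fit perfectly": [0.9, 0.1, 0.0],
    "Shipping was slow": [0.0, 0.2, 0.9],
    "Runs a little small": [0.7, 0.6, 0.1],
}

# Hypothetical embedding of a query about fit.
query_vector = [1.0, 0.2, 0.0]

# Semantic search: rank corpus entries by similarity to the query, best first.
ranked = sorted(
    corpus,
    key=lambda text: cosine_similarity(query_vector, corpus[text]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

The later tasks replace the hand-written vectors with model-generated embeddings, but the ranking step follows the same principle.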
The dataset comes from Rent the Runway and consists of user reviews of clothing items, their ratings on fit, and other metadata about the user (i.e. gender, height, size, age, reason for renting) and the item (i.e. category). It is a nice mixture of data types and, most importantly, contains lots of text!
In order to be successful, you should have:
Intermediate knowledge of Python
- list comprehension
- for loops and while loops
- installing packages
- creating and using functions
- using NumPy and Pandas
Basic understanding of NLP
- What it is
- Data preparation steps and why they're important
- Familiarity, though not necessarily expert proficiency, in some NLP tasks
Brief usage of Hugging Face
Most of all, you should have a curiosity about NLP workflows, specifically those in Hugging Face using transformers!
Task 0: Setup
For this project, we will need several Python packages:
pandas
numpy
datetime
re
string
matplotlib.pyplot
seaborn
transformers
sentence_transformers
These packages will help us with the data preprocessing steps, visualization, and, of course, NLP tasks using Hugging Face (i.e. transformers and sentence_transformers).
Instructions
Import the following packages.
- Import re, datetime, and string.
- Import pandas using the alias pd.
- Import numpy using the alias np.
- Import matplotlib.pyplot using the alias plt.
- From the transformers package, import pipeline.
- From the sentence_transformers package, import SentenceTransformer.
- From the sentence_transformers.util package, import semantic_search.
- From the IPython.display package, import display and Markdown.
# Import the required packages and modules.
import re
import datetime
import string
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
# From the IPython.display package, import display and Markdown
from IPython.display import display, Markdown
Task 1: Import the Runway Data
The runway data is contained in a CSV file named runway.csv.
The dataset contains the following columns.
- user_id: the unique identifier for the user.
- item_id: the unique identifier for the item/product rented.
- rating: the rating by the user.
- rented_for: the reason the item was rented.
- review_text: the actual text for the submitted user review.
- category: the category of the item rented.
- height: the height of the user in the format {feet}'{inches}".
- size: the size of the item rented by the user.
- age: the age of the user.
- review_date: the date the review was made by the user.
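Because height is stored as a string in the {feet}'{inches}" format, it cannot be used numerically as-is. The sketch below shows one possible way to convert such a string to total inches. The helper name height_to_inches is hypothetical and is not part of the project; the pattern also assumes there may be optional whitespace between the feet and inches parts.

```python
import re

# Matches strings such as 5'8" or 5' 8" (optional whitespace is an assumption).
HEIGHT_PATTERN = re.compile(r"""^\s*(\d+)'\s*(\d+)"\s*$""")

def height_to_inches(height):
    """Convert a {feet}'{inches}" string to total inches, or None if unparseable."""
    # Missing values often arrive as NaN (a float), so guard against non-strings.
    if not isinstance(height, str):
        return None
    match = HEIGHT_PATTERN.match(height)
    if match is None:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches

print(height_to_inches("5' 8\""))
```

With pandas, a helper like this could be applied to the column with something like runway["height"].apply(height_to_inches), though the project itself does not require it.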
Instructions
Import the runway data to a pandas dataframe.
- Read the data from runway.csv, making sure to parse the date column. Assign to runway.
- Print the column info.
# Read the data from runway.csv
runway = pd.read_csv("runway.csv", parse_dates = ['review_date'])
# Print the column info
runway.info()
Task 2: Preprocessing the review_text
Most unstructured text, such as product reviews, is messy. It contains special characters that may not be necessary, extra spaces, irrelevant digits, and more. Therefore, it is common practice to process, or clean, the text before performing NLP tasks on it.
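As an illustration of the kinds of cleanup described above, here is a minimal sketch using only re and string. The function name clean_text and the sample review are hypothetical, and the specific choices (replacing ASCII punctuation with spaces, dropping all digits) are assumptions for demonstration rather than the steps this project prescribes.

```python
import re
import string

# Map every ASCII punctuation character to a space so joined words stay separate.
PUNCTUATION_TO_SPACE = str.maketrans(
    string.punctuation, " " * len(string.punctuation)
)

def clean_text(text):
    """Remove ASCII punctuation and digits, then collapse extra whitespace."""
    text = text.translate(PUNCTUATION_TO_SPACE)  # special characters
    text = re.sub(r"\d+", "", text)              # digits
    text = re.sub(r"\s+", " ", text).strip()     # extra spaces
    return text

sample = "Loved it!!!  Wore it 2 times, got   compliments."
print(clean_text(sample))
```

Note that string.punctuation covers ASCII characters only, so symbols such as curly quotes or emoji would pass through unchanged.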
You will create several processing steps for the review_text strings.
Note: there may be some instances where special characters, digits, and the like are important to the meaning, or context, of the sentence. It's best to think through the implications of such preprocessing steps before applying them blindly.
Also note: some Pythonistas will say preprocessing text before using transformers, such as those from Hugging Face, is unnecessary. We will explore this in the following tasks.
Instructions