Building NLP Applications with Hugging Face

Welcome! In this project, you will learn how to perform common Natural Language Processing (NLP) tasks using Hugging Face. These tasks include:

  • sentiment analysis (i.e. categorizing text as negative or positive);
  • text embedding (i.e. transforming a piece of text into a numerical, n-dimensional vector representation);
  • semantic search (i.e. matching a query with the most appropriate result based on embeddings);
  • and more!
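To make the embedding and semantic search ideas concrete, here is a minimal sketch that ranks texts by cosine similarity to a query. The three-dimensional vectors are invented for illustration and merely stand in for real embeddings, which would normally come from a model; the review sentences and the helper name cosine_similarity are also assumptions, not part of the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": hand-written vectors, NOT produced by any model.
corpus = {
    "The dress fit perfectly": [0.9, 0.1, 0.0],
    "Shipping was slow": [0.0, 0.2, 0.9],
    "Too tight around the waist": [0.8, 0.3, 0.1],
}

# Stands in for the embedding of a query about fit.
query_embedding = [1.0, 0.0, 0.0]

# Semantic search: rank every text by similarity to the query.
ranked = sorted(
    corpus,
    key=lambda text: cosine_similarity(query_embedding, corpus[text]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

With real embeddings the vectors have hundreds of dimensions, but the ranking step works the same way.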

The dataset comes from "Rent the Runway" and consists of user reviews of clothing items, their ratings on fit, and other metadata about the user (i.e. gender, height, size, age, reason for renting) and the item (i.e. category). It is a nice mixture of data types and, most importantly, contains lots of text!

In order to be successful, you should have:

Intermediate knowledge of Python

  • list comprehension
  • for loops and while loops
  • installing packages
  • creating and using functions
  • using NumPy and Pandas

Basic understanding of NLP

  • What it is
  • Data preparation steps and why they're important
  • Familiarity with, though not necessarily expert proficiency in, some NLP tasks

Brief usage of Hugging Face

Most of all, you should have a curiosity about NLP workflows, specifically those in Hugging Face using transformers!

Task 0: Setup

For this project, we will need several Python packages:

  • pandas
  • numpy
  • datetime
  • re
  • string
  • matplotlib.pyplot
  • seaborn
  • transformers
  • sentence_transformers

These packages will help us with the data preprocessing steps, visualization, and, of course, NLP tasks using Hugging Face (i.e. transformers and sentence_transformers).
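Before importing, it can help to confirm that the third-party packages are installed (re, string, and datetime ship with Python's standard library). The sketch below is one possible check; the helper name missing_packages is hypothetical, and the list only covers top-level package names.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Third-party packages used in this project (top-level names only).
required = [
    "pandas",
    "numpy",
    "matplotlib",
    "seaborn",
    "transformers",
    "sentence_transformers",
]

missing = missing_packages(required)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    print("All third-party packages are available.")
```

Anything reported as missing would need to be installed with your environment's package manager before the imports below can succeed.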

Instructions

Import the following packages.

  • Import re, datetime, and string.
  • Import pandas using the alias pd.
  • Import numpy using the alias np.
  • Import matplotlib.pyplot using the alias plt.
  • Import seaborn using the alias sns.
  • From the transformers package, import pipeline.
  • From the sentence_transformers package, import SentenceTransformer.
  • From the sentence_transformers.util package, import semantic_search.
  • From the IPython.display package, import display and Markdown.
# Import the other required packages and modules.
import re
import datetime
import string
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
from IPython.display import display, Markdown

Task 1: Import the Runway Data

The runway data is contained in a CSV file named runway.csv.

The dataset contains the following columns.

  • user_id: the unique identifier for the user.
  • item_id: the unique identifier for the item/product rented.
  • rating: the rating by the user.
  • rented_for: the reason the item was rented.
  • review_text: the actual text for the submitted user review.
  • category: the category of the item rented.
  • height: the height of the user in the format {feet}'{inches}".
  • size: the size of the item rented by the user.
  • age: the age of the user.
  • review_date: the date the review was made by the user.
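Because height is stored as text in the {feet}'{inches}" format, it usually needs converting before it can be used numerically. The following is a rough sketch of one way to do that; the function name height_to_inches and the choice to return None for unparseable values are assumptions, not part of the project instructions.

```python
import re
from typing import Optional

# Matches strings such as 5'8" (optional spaces are tolerated).
HEIGHT_PATTERN = re.compile(r"""^\s*(\d+)'\s*(\d+)"\s*$""")


def height_to_inches(height) -> Optional[int]:
    """Convert a {feet}'{inches}" string to total inches, or None if it does not match."""
    # Missing values in pandas are floats (NaN), not strings.
    if not isinstance(height, str):
        return None
    match = HEIGHT_PATTERN.match(height)
    if match is None:
        return None
    feet, inches = (int(group) for group in match.groups())
    return feet * 12 + inches


print(height_to_inches("5'8\""))
```

A function like this could be applied to the column with the pandas Series.apply method once the data is loaded.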

Instructions

Import the runway data to a pandas dataframe.

  • Read the data from runway.csv, making sure to parse the date column. Assign to runway.
  • Print the column info.
# Read the data from runway.csv
runway = pd.read_csv("runway.csv", parse_dates=["review_date"])

# Print the column info
runway.info()
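To see what parse_dates changes, the sketch below reads the same small CSV twice, once with and once without the argument. The two sample rows are made up for illustration and are not taken from runway.csv; only the column names follow the description above.

```python
import io

import pandas as pd

# Made-up sample rows; only the column names follow the real dataset.
csv_text = (
    "user_id,rating,review_date\n"
    "1,10,2016-04-20\n"
    "2,8,2017-01-05\n"
)

# With parse_dates, review_date is converted to a datetime column.
with_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["review_date"])

# Without it, review_date stays as plain text.
without_dates = pd.read_csv(io.StringIO(csv_text))

print(with_dates["review_date"].dtype)
print(without_dates["review_date"].dtype)
```

A datetime column supports the .dt accessor (for example, with_dates["review_date"].dt.year), which a text column does not.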

Task 2: Preprocessing the review_text

Most unstructured text, such as product reviews, is messy. It often contains unnecessary special characters, extra spaces, irrelevant digits, and more. Therefore, it is common practice to process, or clean, the text before performing NLP tasks on it.

You will create several processing steps for the review_text strings.

Note: there may be some instances where special characters, digits, and the like are important to the meaning, or context, of the sentence. It's best to think through the implications of such preprocessing steps before applying them.

Also note: some Pythonistas will say preprocessing text before using transformers, such as those from Hugging Face, is unnecessary. We will explore this in the following tasks.
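As a rough illustration of the kind of cleaning described above, here is a minimal sketch. The function name clean_text and the particular steps (lowercasing, removing punctuation and digits, collapsing whitespace) are assumptions chosen for illustration, not necessarily the steps this project asks for.

```python
import re
import string


def clean_text(text: str) -> str:
    """Lowercase, drop punctuation and digits, and collapse extra whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove digits.
    text = re.sub(r"\d+", "", text)
    # Collapse runs of whitespace and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    return text


print(clean_text("Loved it!!  Fit was 10/10 :)"))
```

Notice that the "10/10" score disappears entirely from the cleaned text, which is exactly the kind of lost meaning the first note warns about.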

Instructions