Building NLP Applications with Hugging Face

    Welcome! In this project, you will learn how to perform common Natural Language Processing (NLP) tasks using Hugging Face. Some of these tasks include:

    • sentiment analysis (i.e. categorizing text as negative or positive);
    • text embedding (i.e. transforming a piece of text into a numerical, n-dimensional vector representation);
    • semantic search (i.e. matching a query with the most appropriate result based on embeddings);
    • and more!
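
    To make the idea of embedding-based semantic search concrete, here is a minimal sketch. The three-dimensional vectors and the review snippets are made-up, hypothetical values, not output from any real model, and the ranking uses plain-Python cosine similarity rather than the semantic_search helper used later in the project.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
corpus = {
    "The dress fit perfectly": [0.9, 0.1, 0.0],
    "Shipping was slow": [0.0, 0.2, 0.9],
    "Runs a little small": [0.7, 0.6, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # assumed embedding of a query about fit

# Rank the corpus texts from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda text: cosine_similarity(query_embedding, corpus[text]),
    reverse=True,
)
best_match = ranked[0]
print(best_match)
```

    The text whose vector points in the direction closest to the query's vector ranks first; a real workflow would replace the toy vectors with embeddings produced by a model.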

    The dataset comes from "Rent the Runway" and consists of user reviews on clothing items, their ratings on fit, and other metadata about the user (i.e. gender, height, size, age, reason for renting) and the item (i.e. category). It is a nice mixture of data types, but most importantly, lots of text!

    In order to be successful, you should have:

    Intermediate knowledge of Python

    • list comprehension
    • for loops and while loops
    • installing packages
    • creating and using functions
    • using NumPy and Pandas

    Basic understanding of NLP

    • What it is
    • Data preparation steps and why they're important
    • Familiarity, though not necessarily expert proficiency, in some NLP tasks

    Brief usage of Hugging Face

    Most of all, you should have a curiosity about NLP workflows, specifically those in Hugging Face using transformers!

    Task 0: Setup

    For this project, we will need several Python packages:

    • pandas
    • numpy
    • datetime
    • re
    • string
    • matplotlib.pyplot
    • seaborn
    • transformers
    • sentence_transformers

    These packages will help us with the data preprocessing steps, visualization, and, of course, NLP tasks using Hugging Face (i.e. transformers and sentence_transformers).

    Instructions

    Import the following packages.

    • Import re, datetime, and string.
    • Import pandas using the alias pd.
    • Import numpy using the alias np.
    • Import matplotlib.pyplot using the alias plt.
    • Import seaborn using the alias sns.
    • From the transformers package, import pipeline.
    • From the sentence_transformers package, import SentenceTransformer.
    • From the sentence_transformers.util module, import semantic_search.
    • From the IPython.display module, import display and Markdown.
    # Import the required packages and modules.
    import pandas as pd
    import datetime
    import re
    import string
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    from transformers import pipeline
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util import semantic_search
    
    
    # From the IPython.display package, import display and Markdown
    from IPython.display import display, Markdown

    Task 1: Import the Runway Data

    The runway data is contained in a CSV file named runway.csv.

    The dataset contains the following columns.

    • user_id: the unique identifier for the user.
    • item_id: the unique identifier for the item/product rented.
    • rating: the rating by the user.
    • rented_for: the reason the item was rented.
    • review_text: the actual text for the submitted user review.
    • category: the category of the item rented.
    • height: the height of the user in the format {feet}'{inches}".
    • size: the size of the item rented by the user.
    • age: the age of the user.
    • review_date: the date the review was made by the user.
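
    Because height is stored as a string in the {feet}'{inches}" format, it usually has to be converted to a number before it can be analyzed. The following is one possible sketch; the helper name height_to_inches is hypothetical, and it assumes optional whitespace may appear between the feet and inches parts.

```python
import re

# Matches strings such as 5'8" or 5' 8" (feet, apostrophe, inches, double quote).
HEIGHT_PATTERN = re.compile(r"^\s*(\d+)'\s*(\d+)\"\s*$")

def height_to_inches(height):
    """Convert a height string like 5'8" to total inches; return None if it does not match."""
    if not isinstance(height, str):
        return None  # e.g. a missing value (NaN) read from the CSV
    match = HEIGHT_PATTERN.match(height)
    if match is None:
        return None  # unexpected format; leave it for manual inspection
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches

print(height_to_inches("5'8\""))  # 5 * 12 + 8 = 68
```

    Once the data is loaded, a helper like this could be applied to the column with runway["height"].apply(height_to_inches).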

    Instructions

    Import the runway data to a pandas dataframe.

    • Read the data from runway.csv, making sure to parse the review_date column as a date. Assign to runway.
    • Print the column info.
    # Read the data from runway.csv
    runway = pd.read_csv("runway.csv", parse_dates=['review_date'])
    
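    # Print the column info (info() prints its summary directly and returns None,
    # so wrapping it in print() would add a stray "None" line)
    runway.info()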

    Task 2: Preprocessing the review_text

    Most unstructured text, such as product reviews, is messy. It often contains unnecessary special characters, extra spaces, irrelevant digits, and more. Therefore, it is common practice to process, or clean, the text before performing NLP tasks on it.

    You will create several processing steps for the review_text strings.

    Note: there may be some instances where special characters, digits, and the like are important to the meaning, or context, of the sentence. It's best to think through the implications of such preprocessing steps before applying them blindly.

    Also note: some Pythonistas will say preprocessing text before using transformers, such as those from Hugging Face, is unnecessary. We will explore this in the following tasks.
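
    The kinds of cleaning described above (special characters, extra spaces, irrelevant digits) can be sketched with the re and string modules. This is one illustrative approach, not necessarily the steps the project asks for; the function name clean_text and the sample review are hypothetical.

```python
import re
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Lowercase, drop digits and punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)            # replace runs of digits with a space
    text = text.translate(PUNCTUATION_TO_SPACE)  # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text)            # collapse whitespace runs
    return text.strip()

sample = "  LOVED it!!!  Wore it 2 times -- fit was perfect.  "
print(clean_text(sample))  # loved it wore it times fit was perfect
```

    Note how the digit "2" disappears from the cleaned sample; this is exactly the kind of information loss the note above warns about.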

    Instructions