Interactive Course

Feature Engineering for NLP in Python

Learn techniques to extract useful information from text and process it into a format suitable for machine learning.

  • 4 hours
  • 15 Videos
  • 52 Exercises
  • 2,557 Participants
  • 4,200 XP

Course Description

In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for machine learning models. Specifically, you will learn about POS tagging, named entity recognition, readability scores, and the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. Along the way, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. After the course, you will be able to engineer useful features from text for a range of data science problems.

  1. Basic features and readability scores (Free)

    Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.

  2. Text preprocessing, POS tagging and NER

    In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. With these concepts in hand, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

  3. N-Gram models

    Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

  4. TF-IDF and similarity scores

    Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.
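
The basic features named in chapter 1 (word count, character count, average word length, hashtags and mentions) can be sketched in plain Python. The function name `basic_features`, the whitespace tokenization, and the regular expressions are illustrative assumptions, not the course's own code.

```python
import re


def basic_features(text: str) -> dict:
    """Compute simple surface features of a piece of text (illustrative helper)."""
    # Whitespace tokenization: punctuation stays attached to words.
    words = text.split()
    num_words = len(words)
    avg_word_length = sum(len(w) for w in words) / num_words if words else 0.0
    # Assumed patterns: '#' or '@' not preceded by a word character,
    # followed by one or more word characters.
    hashtags = re.findall(r"(?<!\w)#\w+", text)
    mentions = re.findall(r"(?<!\w)@\w+", text)
    return {
        "num_chars": len(text),  # every character, including spaces
        "num_words": num_words,
        "avg_word_length": avg_word_length,
        "num_hashtags": len(hashtags),
        "num_mentions": len(mentions),
    }


features = basic_features("Loving the #NLP course by @DataCamp! #python")
print(features)
```

Each value becomes one numeric column per document, which is the kind of format a machine learning model can consume.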
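
The text cleaning step from chapter 2 might look roughly like the following spaCy sketch. It assumes spaCy is installed and uses a blank English pipeline, which provides only the tokenizer and stop-word list; lemmatization, part-of-speech tags (`token.pos_`), and named entities (`doc.ents`) need a trained pipeline such as `en_core_web_sm`, which is installed separately and loaded with `spacy.load`. The helper name `clean_text` is hypothetical.

```python
import spacy

# Blank English pipeline: tokenizer and stop-word list only, no trained model.
nlp = spacy.blank("en")


def clean_text(text: str) -> list:
    """Lowercase the text, then keep alphabetic tokens that are not stop words."""
    doc = nlp(text.lower())
    return [token.text for token in doc if token.is_alpha and not token.is_stop]


print(clean_text("The nation endures."))
```

Filtering on `is_alpha` drops punctuation and numbers, which is one common cleaning choice rather than the only one.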
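
For the n-gram modeling in chapter 3, a bag of n-grams can be sketched in plain Python as below. The helper names are illustrative assumptions; in scikit-learn, `CountVectorizer` with its `ngram_range` parameter builds comparable features.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> list:
    """Return all contiguous runs of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text: str, n_min: int = 1, n_max: int = 2) -> Counter:
    """Count every n-gram with n_min <= n <= n_max (whitespace tokenization)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


bag = bag_of_ngrams("not good at all")
print(bag)
```

The bigram "not good" shows why n-grams can help sentiment analysis: a unigram model sees only "not" and "good" separately and loses the negation.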
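
The tf-idf weights and cosine similarity from chapter 4 can be illustrated with a small plain-Python sketch. It assumes the textbook weighting tf × ln(N / df) and whitespace tokenization; scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its numbers will differ. Function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list) -> tuple:
    """Return (vocabulary, one tf-idf vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["the cat sat", "the cat ran", "dogs bark loudly"]
vocab, vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # shares "the" and "cat"
print(cosine_similarity(vecs[0], vecs[2]))  # no shared terms
```

A recommender built on this idea would rank every other document by its cosine similarity to the one a user liked and return the top few.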

What do other learners have to say?

“I've used other sites, but DataCamp's been the one that I've stuck with.”

Devon Edwards Joseph

Lloyd's Banking Group

“DataCamp is the top resource I recommend for learning data science.”

Louis Maiden

Harvard Business School

“DataCamp is by far my favorite website to learn from.”

Ronald Bowers

Decision Science Analytics @ USAA

Rounak Banik

Data Scientist at Fractal Analytics

Rounak is a Young India Fellow and the author of the book, Hands-on Recommendation Systems with Python. He currently works as a Data Scientist at Fractal Analytics. He obtained his B.Tech degree in Electronics & Communication Engineering from IIT Roorkee.
