Find Movie Similarity from Plot Summaries

Use NLP and clustering on movie plot summaries from IMDb and Wikipedia to quantify movie similarity.

12 Tasks, 1,500 XP


Project Description

Natural Language Processing (NLP) is an exciting field of study in which data scientists develop algorithms that make sense of the conversational language humans use. In this project, you will use NLP to measure how similar movies are to one another, based on their plot summaries from IMDb and Wikipedia. The dataset contains the titles of the top 100 movies on [IMDb](https://www.imdb.com/), along with each movie's plot summary from both IMDb and Wikipedia.

Project Tasks

  1. Import and observe dataset
  2. Combine Wikipedia and IMDb plot summaries
  3. Tokenization
  4. Stemming
  5. Club together Tokenize & Stem
  6. Create TfidfVectorizer
  7. Fit transform TfidfVectorizer
  8. Import KMeans and create clusters
  9. Calculate similarity distance
  10. Import Matplotlib, Linkage, and Dendrograms
  11. Create merging and plot dendrogram
  12. Which movies are most similar?
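
The tasks above can be sketched roughly as follows. This is a minimal illustration, not the project's actual solution. The four titles and plots are made-up stand-ins for the real dataset. The regex tokenizer, NLTK's SnowballStemmer, the choice of two clusters, complete linkage, and the helper name most_similar are all assumptions made for the example.

```python
import re

import numpy as np
from nltk.stem.snowball import SnowballStemmer
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in data. In the project, each entry would be a movie's
# combined Wikipedia and IMDb plot summary.
titles = ["Heist A", "Heist B", "Space A", "Space B"]
plots = [
    "A crew of thieves plans a daring bank robbery and escapes the police.",
    "Thieves rob a bank, but the police chase the crew across the city.",
    "Astronauts travel through space to find a new planet for humanity.",
    "A lone astronaut is stranded on a distant planet and travels home through space.",
]

stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    """Lowercase the text, keep alphabetic tokens, and stem each one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(token) for token in tokens]


# Turn each plot into a TF-IDF vector built from the stemmed tokens.
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem, token_pattern=None)
tfidf_matrix = vectorizer.fit_transform(plots)

# Group the plots into clusters (two is an arbitrary choice for this toy data).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(tfidf_matrix)

# Similarity distance: 0 means identical direction, 1 means no shared terms.
similarity_distance = 1 - cosine_similarity(tfidf_matrix)

# Hierarchical clustering on the TF-IDF vectors using cosine distance.
mergings = linkage(tfidf_matrix.toarray(), method="complete", metric="cosine")

# no_plot=True only computes the layout. Drop it (and call
# matplotlib.pyplot.show()) to draw the dendrogram with Matplotlib.
tree = dendrogram(mergings, labels=titles, orientation="left", no_plot=True)


def most_similar(title):
    """Return the title whose plot has the smallest distance to `title`."""
    idx = titles.index(title)
    distances = similarity_distance[idx].copy()
    distances[idx] = np.inf  # exclude the movie itself
    return titles[int(np.argmin(distances))]


for title, label in zip(titles, labels):
    print(title, "cluster", label, "closest to", most_similar(title))
```

With the real dataset, the same steps would run on the 100 combined plot summaries, and the dendrogram would show which movies merge earliest, meaning which are most similar.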
Technologies
Python
Topics
Data Manipulation, Data Visualization, Machine Learning, Probability & Statistics, Importing & Cleaning Data
Anubhav Singh

Founder at The Code Foundation
A developer since childhood, Anubhav has been an explorer of technologies. After building his own social network and search engine at the age of 15, he has continued to develop software for the community in domains that require uncommon combinations of technologies and stacks. You can often catch him guiding students on how to approach the fantastic sciences of Artificial Intelligence. He is also the founder of The Code Foundation, an open source organization working on multimedia search and natural language processing.