Premium Project

Find Movie Similarity from Plot Summaries

Use NLP and clustering on movie plot summaries from IMDb and Wikipedia to quantify movie similarity.

  • 12 tasks
  • 2,267 participants
  • 1,500 XP

Project Description

Natural Language Processing (NLP) is a field in which data scientists develop algorithms that make sense of the conversational language humans use. In this Project, you will use NLP to measure how similar movies are to one another, based on their plot summaries from IMDb and Wikipedia.

This Project lets you apply the skills from Natural Language Processing Fundamentals in Python and Unsupervised Learning in Python. We recommend that you be familiar with the content of those courses before starting this Project.

The dataset contains the titles of the top 100 movies on IMDb as well as each movie's plot summary from both IMDb and Wikipedia.

Project Tasks

  • 1. Import and observe dataset
  • 2. Combine Wikipedia and IMDb plot summaries
  • 3. Tokenization
  • 4. Stemming
  • 5. Club together Tokenize & Stem
  • 6. Create TfidfVectorizer
  • 7. Fit transform TfidfVectorizer
  • 8. Import KMeans and create clusters
  • 9. Calculate similarity distance
  • 10. Import Matplotlib, Linkage, and Dendrograms
  • 11. Create merging and plot dendrogram
  • 12. Which movies are most similar?
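
The tasks above can be sketched roughly as follows. This is a minimal illustration and not the Project's own solution: the four titles and plots are invented toy data standing in for the real dataset, the `tokenize` helper is a hypothetical regex tokenizer, and the stemming step is left out to keep dependencies small. The "similarity distance" is taken here to be cosine distance (1 minus cosine similarity) between TF-IDF vectors.

```python
import re

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data (assumption): the real dataset has 100 movies.
titles = ["Heist A", "Heist B", "Space A", "Space B"]
plots = [
    "A crew of thieves plans a bank robbery and escapes with the money.",
    "Thieves plan a daring bank robbery but the police chase the crew.",
    "Astronauts travel through space to a distant planet on a damaged ship.",
    "A ship of astronauts is stranded in space near a strange planet.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


# Turn each plot into a TF-IDF vector using the custom tokenizer.
vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
tfidf_matrix = vectorizer.fit_transform(plots)

# Group the movies into clusters (2 is an arbitrary choice for the toy data).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = km.fit_predict(tfidf_matrix)

# Pairwise cosine distances in condensed form, as linkage() expects.
condensed = pdist(tfidf_matrix.toarray(), metric="cosine")

# Hierarchical clustering; no_plot=True returns the tree without drawing it.
mergings = linkage(condensed, method="complete")
tree = dendrogram(mergings, labels=titles, no_plot=True)

# For each movie, find the other movie with the smallest distance.
dist = squareform(condensed)
np.fill_diagonal(dist, np.inf)  # ignore each movie's distance to itself
nearest = {titles[i]: titles[int(j)] for i, j in enumerate(dist.argmin(axis=1))}

for title, match in nearest.items():
    print(f"{title} is most similar to {match}")
```

Dropping `no_plot=True` draws the dendrogram with Matplotlib, where movies that merge at a low height are the most similar pairs.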
Anubhav Singh

Founder at The Code Foundation

A developer since childhood, Anubhav has been an explorer of technologies. He started out by developing his own social network and search engine at the age of 15, and he continues to develop software for the community in domains that require uncommon combinations of technologies and stacks. You can often catch him guiding students on how to approach the fantastic sciences of Artificial Intelligence. He's also the Founder of The Code Foundation, an open-source organization working on multimedia search and natural language processing.



  • Python
  • Topics

    Data Manipulation, Data Visualization, Machine Learning, Probability & Statistics, Importing & Cleaning Data