Premium Project

Book Recommendations from Charles Darwin

Build a book recommendation system using NLP and the text of books like "On the Origin of Species."

  • 12 tasks
  • 534 participants
  • 1,500 XP

Project Description

Recommendation systems are at the heart of many products, such as Netflix or Amazon. They generally rely on metadata (e.g., the actors or director of a movie) or on user tastes (e.g., the movies you liked before) to determine which items you are most likely to enjoy. But when you are working with text-heavy datasets, you have access to a much richer resource: the whole text! In this project, you will learn how to build the basis of a book recommendation system that relies on the content of the books themselves. You will use Charles Darwin's bibliography to find out which books might interest you.

To complete this project, you should be familiar with the basics of Python, pandas, and Natural Language Processing. The courses "pandas Foundations" and "Natural Language Processing Fundamentals in Python" are recommended as prerequisites.

The dataset was manually collected from Project Gutenberg.

Project Tasks

  1. Darwin's bibliography
  2. Load the contents of each book into Python
  3. Find "On the Origin of Species"
  4. Tokenize the corpus
  5. Stemming of the tokenized corpus
  6. Building a bag-of-words model
  7. The most common words of a given book
  8. Build a tf-idf model
  9. The results of the tf-idf model
  10. Compute distance between texts
  11. The book most similar to "On the Origin of Species"
  12. Which books have similar content?
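
The tasks above follow a common content-based pipeline: tokenize, build a bag-of-words, weight it with tf-idf, then compare texts. The sketch below illustrates that flow using only the Python standard library. It is not the project's actual solution: the three short "books", the stop-word list, and every function name are hypothetical stand-ins, and the stemming step is left out for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus standing in for the full book texts.
books = {
    "origin": "species vary and natural selection preserves favourable variations in species",
    "descent": "natural selection and sexual selection shaped the descent of man",
    "worms": "earthworms form vegetable mould through slow action in the soil",
}

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"and", "the", "of", "in", "through"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return {title: {term: weight}} using weight = tf * log(N / df)."""
    bows = {title: Counter(tokenize(text)) for title, text in docs.items()}
    n_docs = len(bows)
    # Document frequency: number of texts in which each term appears.
    df = Counter(term for bow in bows.values() for term in bow)
    return {
        title: {term: count * math.log(n_docs / df[term]) for term, count in bow.items()}
        for title, bow in bows.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(title, vectors):
    """Rank every other text by similarity to the given title, best first."""
    scores = {
        other: cosine(vectors[title], vec)
        for other, vec in vectors.items()
        if other != title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


vectors = tfidf_vectors(books)
ranking = most_similar("origin", vectors)
print(ranking)
```

In this toy corpus, "descent" ranks first because it shares the terms "natural" and "selection" with "origin", while "worms" shares none. On real books, a stemming step and a fuller stop-word list would likely matter much more than they do here.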
Philippe Julien

Senior Data Scientist at King

Philippe is a Senior Data Scientist at King, where he uses his analytical skills to improve games such as Candy Crush Saga. Before that, he worked for eight years as a researcher in computational biology studying how genomes evolve. In general, he is interested in the creative use of data in fields as diverse as science, gaming, sport, or tech in general.

Technology

  • Python
  • Topics

    Data Manipulation, Data Visualization, Probability & Statistics, Importing & Cleaning Data