Project: Word Frequency in Moby Dick
In this workspace, you'll scrape the novel Moby Dick from Project Gutenberg, a website that hosts a large corpus of books, using the Python requests package. You'll extract the words from this web data using BeautifulSoup, then analyze their distribution using the Natural Language Toolkit (nltk) and Counter.
The data science pipeline you'll build in this workspace can be reused to visualize the word frequency distribution of any novel you can find on Project Gutenberg.
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')
# Request the Moby Dick HTML file
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.html')
# Inspect the encoding requests will use to decode r.text
r.encoding
'utf-8'
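With the page downloaded, the remaining steps described in the introduction (extract the text, split it into words, drop stopwords, count) can be sketched roughly as follows. This is a minimal offline illustration, not the workspace's own code. To keep it runnable without a network connection or extra packages, it swaps in the standard library's html.parser for BeautifulSoup and a tiny hand-written stopword set for nltk's English list. SAMPLE_HTML, STOPWORDS, TextExtractor, and word_counts are hypothetical names introduced only for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical stand-in for the downloaded page (r.text in the workspace).
SAMPLE_HTML = """
<html><head><title>Moby Dick</title></head>
<body><p>Call me Ishmael.</p>
<p>The whale, the white whale!</p></body></html>
"""

# Tiny illustrative stopword set; an assumption standing in for nltk's list.
STOPWORDS = {"the", "me", "a", "of", "and"}


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        self.chunks.append(data)


def word_counts(html, stopwords):
    """Return a Counter of the lowercase words in html, minus stopwords."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Lowercase first so "The" and "the" are counted as the same word.
    words = re.findall(r"\w+", text.lower())
    return Counter(word for word in words if word not in stopwords)


counts = word_counts(SAMPLE_HTML, STOPWORDS)
print(counts.most_common(3))
```

In the workspace itself, the text would come from the response object r rather than a sample string, and the same Counter.most_common call would then rank the words of the whole novel.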