In this workspace, you'll scrape the novel Moby Dick from Project Gutenberg (a website that hosts a large corpus of books) using the Python requests package. You'll extract words from this web data using BeautifulSoup before analyzing the distribution of words using the Natural Language Toolkit (nltk) and Counter.
The data science pipeline you'll build in this workspace can be reused to visualize the word frequency distribution of any novel on Project Gutenberg; a sketch of the plotting step appears after the code below.
# Import packages and download the NLTK stopword corpus
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')
# Request book text HTML
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm'
r = requests.get(url)
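r.raise_for_status()  # a defensive check (not in the original): abort early if the download failed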
r.encoding = 'utf-8'
html = r.text
# Extract the text from the HTML
html_soup = BeautifulSoup(html, 'html.parser')
moby_text = html_soup.get_text().lower()
# Tokenize into alphanumeric word tokens (the regex also keeps dollar amounts like $1.50)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|\$[\d\.]+')
words = tokenizer.tokenize(moby_text)
# Remove English stop words (stored as a set for fast membership tests)
stop_words = set(stopwords.words('english'))
words_no_stop = [t for t in words if t not in stop_words]
# Count each word and select the top ten
count = Counter(words_no_stop)
top_ten = count.most_common(10)
print(top_ten)
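The introduction above mentions visualizing word frequency distributions, but the workspace stops at printing the counts. The following is a minimal sketch of that plotting step, assuming matplotlib is available in the workspace (it is not imported above); it draws the ten most common words as a bar chart.
# Plot the ten most common words (sketch; assumes matplotlib is installed)
import matplotlib.pyplot as plt
labels, counts = zip(*top_ten)  # unzip the (word, count) pairs
plt.bar(labels, counts)
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.title('Top ten words in Moby Dick')
plt.show()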