
In this workspace, you'll scrape the novel Moby Dick from the website Project Gutenberg (which hosts a large corpus of public-domain books) using the Python requests package. You'll extract the words from this web data with BeautifulSoup before analyzing their frequency distribution using the Natural Language Toolkit (nltk) and Counter.

The data science pipeline you'll build in this workspace can be reused to visualize the word frequency distribution of any novel available on Project Gutenberg.

# Import packages and download the stop words corpus
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from collections import Counter

# Fetch nltk's stop word list (skipped if it is already installed)
nltk.download('stopwords')


# Request the book's HTML
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm'
r = requests.get(url)
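# Optional guard (an addition, not in the original code): fail fast if the
# download did not succeed, rather than parsing an error page below.
r.raise_for_status()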
r.encoding = 'utf-8'  # declare the encoding explicitly so r.text decodes correctly
html = r.text

# Extract the text from the HTML
html_soup = BeautifulSoup(html, 'html.parser')  # specify a parser to avoid bs4's warning
moby_text = html_soup.get_text().lower()  # strip the markup and lowercase for counting

# Tokenize the text into words (runs of word characters or dollar amounts)
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|\$[\d\.]+')
words = tokenizer.tokenize(moby_text)
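# For example (illustrative only):
# tokenizer.tokenize('call me ishmael. $9.99') -> ['call', 'me', 'ishmael', '$9.99']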

# Remove stop words (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))
words_no_stop = [t for t in words if t not in stop_words]
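# For reference: nltk's English stop word list holds common function words
# such as 'the', 'of', 'and', and 'a', which would otherwise dominate the counts.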

# Count each word and select the top ten
count = Counter(words_no_stop)
top_ten = count.most_common(10)
print(top_ten)
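
Since the goal of the pipeline is to visualize the word frequency distribution, here is a minimal sketch of one way to plot the counts computed above as a bar chart. It assumes matplotlib is installed; the library is not imported anywhere in the original code.

# Plot the ten most common words as a bar chart
import matplotlib.pyplot as plt

labels, values = zip(*top_ten)  # unpack the (word, count) pairs
plt.bar(labels, values)
plt.title('Top ten words in Moby Dick (stop words removed)')
plt.xlabel('word')
plt.ylabel('count')
plt.show()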