In this workspace, you'll scrape the novel Moby Dick from the website Project Gutenberg (which contains a large corpus of books) using the Python requests package. You'll extract words from this web data using BeautifulSoup before analyzing the distribution of words using the Natural Language Toolkit (nltk) and Counter.
The Data Science pipeline you'll build in this workspace can be used to visualize the word frequency distributions of any novel you can find on Project Gutenberg.
What are the most frequent words in Herman Melville's novel Moby Dick, and how often do they occur?
Note that the HTML file you are asked to request is a cached version of the original file from Project Gutenberg.
Your project will follow these steps:
1. Request the Moby Dick HTML file using requests and set its encoding to utf-8. Here is the URL to scrape from: https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm
2. Extract the HTML and create a BeautifulSoup object using an HTML parser to get the text.
3. Initialize a regex tokenizer object tokenizer using nltk.tokenize.RegexpTokenizer to keep only word characters, assigning the results to tokens (a short illustration follows this list).
4. Transform the tokens to lowercase, remove English stop words, and save the results to words_no_stop.
5. Initialize a Counter object and find the ten most common words, saving the result to top_ten and printing it to see what they are.
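Before diving in, here is a minimal sketch of what the regex tokenizer in step 3 does, using a shortened version of the novel's opening line as a sample; the pattern shown, r'\w+', is the same one used in the code below:

# Quick illustration: RegexpTokenizer with r'\w+' keeps runs of word
# characters (letters, digits, underscore) and drops punctuation.
from nltk.tokenize import RegexpTokenizer

sample = "Call me Ishmael. Some years ago--never mind how long precisely."
print(RegexpTokenizer(r'\w+').tokenize(sample))
# ['Call', 'me', 'Ishmael', 'Some', 'years', 'ago', 'never', 'mind', 'how', 'long', 'precisely']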
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')
# Get the Moby Dick HTML
r = requests.get('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')
# Set the correct text encoding of the HTML page
r.encoding = 'utf-8'
# Extract the HTML from the request object
html = r.text
# Print the first 2000 characters in html
print(html[0:2000])
# Create a BeautifulSoup object from the HTML
html_soup = BeautifulSoup(html, "html.parser")
# Get the text out of the soup
moby_text = html_soup.get_text()
# Create a tokenizer
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
# Tokenize the text
tokens = tokenizer.tokenize(moby_text)
# Create a list called words containing all tokens transformed to lowercase
words = [token.lower() for token in tokens]
# Print out the first eight words
print(words[:8])
# Get the English stop words from nltk
stop_words = nltk.corpus.stopwords.words('english')
# Print out the first eight stop words
print(stop_words[:8])
# Create a list words_no_stop containing all words that are in words but not in stop_words
words_no_stop = [word for word in words if word not in stop_words]
# Print the first five entries of words_no_stop to check that stop words are gone
print(words_no_stop[:5])
# Initialize a Counter object from our processed list of words
count = Counter(words_no_stop)
# Store the ten most common words and their counts as top_ten
top_ten = count.most_common(10)
# Print the top ten words and their counts
print(top_ten)
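The steps above answer the question in text form. To visualize the word frequency distribution mentioned at the start, one option is a simple bar chart of top_ten; this is a minimal sketch, assuming matplotlib is available in the workspace, and the same pipeline works for any other Project Gutenberg novel if you swap in its URL:

# Plot the ten most common words and their counts (assumes matplotlib is installed)
import matplotlib.pyplot as plt

labels, counts = zip(*top_ten)
plt.bar(labels, counts)
plt.xlabel('Word')
plt.ylabel('Count')
plt.title('Ten most common words in Moby Dick')
plt.show()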