Project: Word Frequency in Moby Dick
In this workspace, you'll scrape the novel Moby Dick from the website Project Gutenberg (which hosts a large corpus of books) using the Python requests package. You'll extract words from this web data using BeautifulSoup before analyzing the distribution of words using the Natural Language Toolkit (nltk) and Counter.
The data science pipeline you'll build in this workspace can be used to visualize the word frequency distribution of any novel you can find on Project Gutenberg.
# Import and download packages
import requests
from bs4 import BeautifulSoup
import nltk
from collections import Counter
nltk.download('stopwords')

# 1 Request & encode the text
r = requests.get("https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm")
r.encoding = "utf-8"
len(r.text)

# 2 Extract the text
html = r.text
print(html[:2000])

# 3 Create a BeautifulSoup object and get the text
html_soup = BeautifulSoup(html, "html.parser")
moby_text = html_soup.get_text()
print(moby_text[:1000])

# 4 Tokenize the text
tokenizer = nltk.tokenize.RegexpTokenizer("\w+")
tokens = tokenizer.tokenize(moby_text)
print(tokens[:20])

# 5 Convert words to lowercase
words = [token.lower() for token in tokens]
print(words[:8])

# 6 Load in stop words
# The stopwords corpus was downloaded above; load the English stop word list
stop_words = nltk.corpus.stopwords.words("english")
print(stop_words[:8])

# 7 Remove stop words from the text
words_no_stop = [word for word in words if word not in stop_words]
print(words_no_stop[:5])

# 8 Count the words and report the ten most common
count = Counter(words_no_stop)
top_ten = count.most_common(10)
print(top_ten)
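
To follow through on the visualization mentioned in the introduction, here is a minimal sketch of how the top-ten counts could be plotted. It assumes matplotlib is available in the workspace; the variable top_ten comes from the steps above, while the chart styling is just one possible choice rather than part of the original project.

# Optional: plot the ten most common words (assumes matplotlib is installed)
import matplotlib.pyplot as plt

words_plot, counts_plot = zip(*top_ten)  # unpack (word, count) pairs into two parallel tuples
plt.figure(figsize=(8, 4))
plt.bar(words_plot, counts_plot)
plt.title("Ten most common words in Moby Dick (stop words removed)")
plt.xlabel("Word")
plt.ylabel("Frequency")
plt.tight_layout()
plt.show()

Swapping the URL in step 1 for another Project Gutenberg HTML edition lets the same pipeline and plot be reused for any other novel.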