Project: Word Frequency in Moby Dick
    In this workspace, you'll scrape the novel Moby Dick from the website Project Gutenberg (which contains a large corpus of books) using the Python requests package. You'll extract words from this web data using BeautifulSoup before analyzing the distribution of words using the Natural Language Toolkit (nltk) and Counter.

    The data science pipeline you'll build in this workspace can be reused to visualize the word frequency distribution of any novel you can find on Project Gutenberg; a sketch of such a reusable version appears after the code below.

    # Import packages and download the nltk stopwords corpus
    import requests
    from bs4 import BeautifulSoup
    import nltk
    from nltk.corpus import stopwords
    from collections import Counter
    nltk.download('stopwords')
    
    
    # Request book text HTML
    url = 'https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm'
    r = requests.get(url)
    r.encoding = 'utf-8'
    html = r.text
    
    # Extract the readable text from the HTML and lowercase it
    html_soup = BeautifulSoup(html, 'html.parser')
    moby_text = html_soup.get_text().lower()
    
    # Tokenize the text into word tokens (runs of word characters,
    # plus dollar amounts like $1.50 matched by \$[\d\.]+)
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|\$[\d\.]+')
    words = tokenizer.tokenize(moby_text)
    
    # Remove stop words ("the", "and", "of", ...); a set makes each lookup O(1)
    stop_words = set(stopwords.words('english'))
    words_no_stop = [t for t in words if t not in stop_words]
    
    # Count each word and select the top ten
    count = Counter(words_no_stop)
    top_ten = count.most_common(10)
    print(top_ten)
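    
    As noted above, the same pipeline works for any Project Gutenberg novel. The sketch below wraps the steps into a single reusable function and plots the result as a bar chart. It is a minimal sketch, not part of the project code: it assumes matplotlib is installed, reuses the packages imported at the top of this workspace, and the function name plot_top_words and its parameter n are illustrative choices.
    
    # A minimal reusable sketch of the pipeline above (plot_top_words and n
    # are hypothetical names; matplotlib is an assumed extra dependency, and
    # requests, BeautifulSoup, nltk, stopwords, and Counter come from the
    # imports at the top of this workspace)
    import matplotlib.pyplot as plt
    
    def plot_top_words(url, n=10):
        """Scrape a Project Gutenberg HTML book and plot its n most common words."""
        r = requests.get(url)
        r.encoding = 'utf-8'
        text = BeautifulSoup(r.text, 'html.parser').get_text().lower()
        tokens = nltk.tokenize.RegexpTokenizer(r'\w+|\$[\d\.]+').tokenize(text)
        stop_words = set(stopwords.words('english'))
        counts = Counter(t for t in tokens if t not in stop_words)
        labels, values = zip(*counts.most_common(n))
        plt.bar(labels, values)
        plt.title(f'Top {n} words')
        plt.show()
        return counts.most_common(n)
    
    # Example: rerun the Moby Dick analysis above through the function
    plot_top_words('https://s3.amazonaws.com/assets.datacamp.com/production/project_147/datasets/2701-h.htm')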