Live Coding Recap: Frequencies of Words in Novels

A recap of DataCamp's first live coding session. Here you will find all the resources you need to recreate Hugo's data science pipeline for plotting word frequency distributions in novels!

Many of you joined us on our Facebook page to watch DataCamp's very own Hugo Bowne-Anderson build word frequency histograms for Moby Dick. But for those who joined us late, or couldn't make it, here's a recap of the event.

This live code-along session covers how to build a data science pipeline to plot frequency histograms of words in Moby Dick, among many other novels. Hugo shows you how to scrape the novels from Project Gutenberg using the Python package requests and how to extract the text of the novels from this web data using BeautifulSoup. Then he dives into analyzing the novels using the Natural Language Toolkit (nltk), explaining each step of the process.
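To give a feel for the shape of such a pipeline, here is a minimal sketch. It is not Hugo's notebook: the function names, the tiny stop word list and the sample sentence are made up for illustration, and the counting step uses Python's standard library (re and collections.Counter) as a stand-in for the nltk tools used in the session, so that it runs without any extra downloads.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; nltk ships a much fuller one).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def fetch_html(url):
    """Download a web page with requests.

    requests is imported here so the counting step below still runs
    when the package is not installed.
    """
    import requests

    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def extract_text(html):
    """Strip the HTML markup with BeautifulSoup and return the plain text."""
    from bs4 import BeautifulSoup

    return BeautifulSoup(html, "html.parser").get_text()


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words and count the non-stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Made-up sample sentence so the counting step can be tried offline.
sample = "Call me Ishmael. The whale, the white whale, is the whale that haunts Ahab."
freqs = word_frequencies(sample)
print(freqs.most_common(1))  # [('whale', 3)]
```

For a real novel you would chain the three steps, word_frequencies(extract_text(fetch_html(url))), where url is the address of the novel's HTML edition on Project Gutenberg. nltk offers its own tokenizers, stop word list and frequency distribution class for the last step.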

This code-along session is meant for beginners and intermediates alike. Some programming fundamentals and Python basics will help, though. Hugo uses Jupyter Notebooks and the terminal. However, if you're not super familiar with these tools, never fear! In this session, you'll get a gentle intro to them.

You'll learn how to put all of these tools together into a data science pipeline to produce informative figures such as this:

Can you guess which novel the word distribution in the graph above is from?

Feel free to code along or just watch Hugo do his thing! If you do wish to code, please follow the setup instructions before starting the video. You'll need, amongst other things, to clone the repo and download the Anaconda Distribution for Python 3.6 if you haven't already.

Now you are ready to watch Hugo and follow along. The video lasts about 1 hour and 20 minutes. You can skip to the 12th minute where Hugo comes on and the session actually starts.

You can find the notebook that Hugo was working on, with the solutions, here.

For those of you who prefer a detailed Jupyter Notebook for the same project, it is available here! Can you improve on Hugo's pipeline? (Hint: for now, the words "whale" and "whales" are counted separately).
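As a starting point for that hint, here is one crude way to fold simple plurals into their singular form. This is only a sketch under stated assumptions: the function name merge_simple_plurals and the sample counts are hypothetical, and the rule (drop a trailing "s" only when the shorter word also appears in the counts) is a rough stand-in for the stemmers and lemmatizers that nltk provides.

```python
from collections import Counter


def merge_simple_plurals(counts):
    """Fold a plural like 'whales' into 'whale' when both forms were counted.

    Crude rule: a word longer than three letters that ends in a single 's'
    is merged into the same word without the 's', but only if that shorter
    word is itself present in the counts.
    """
    merged = Counter()
    for word, n in counts.items():
        singular = word[:-1]
        looks_plural = (
            len(word) > 3
            and word.endswith("s")
            and not word.endswith("ss")
        )
        if looks_plural and singular in counts:
            merged[singular] += n
        else:
            merged[word] += n
    return merged


# Made-up counts for illustration.
counts = Counter({"whale": 3, "whales": 2, "sea": 4, "glass": 1})
merged = merge_simple_plurals(counts)
print(merged.most_common(2))  # [('whale', 5), ('sea', 4)]
```

The rule misses irregular plurals and can still merge unrelated words that happen to differ by a final "s", so a proper stemmer or lemmatizer is the more robust choice for a whole novel.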

We had a lot of fun making this live video and will be making more in the future. Help us make it better by sharing feedback on Twitter to @DataCamp & @hugobowne.

Finally, we'd like to announce that we are preparing another live code-along session for the week of November 27th. More on that later. But if you're interested in doing an awesome Machine Learning project using pandas and scikit-learn, make sure to stay tuned!
