Many of you joined us on our Facebook page to see DataCamp's very own Hugo Bowne-Anderson build word frequency histograms for Moby Dick. But for those who joined us late, or couldn't make it, here's a recap of the event.
This live code-along session covers how to build a data science pipeline to plot frequency histograms of words in Moby Dick, among many other novels. Hugo shows you how to scrape the novels from Project Gutenberg using the Python package requests and how to extract the novels from this web data using BeautifulSoup. Then he dives into analyzing the novels using the Natural Language Toolkit (nltk), explaining each step of the process.
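To give a feel for the steps described above, here is a minimal sketch of the same kind of pipeline. It is not Hugo's notebook code: to keep it dependency-free, it assumes standard-library stand-ins (urllib in place of requests, html.parser in place of BeautifulSoup, and a regular expression plus a tiny hard-coded stop word list in place of nltk). The helper names and the sample HTML are made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen

# Tiny illustrative stop word list (an assumption; nltk ships a much fuller one).
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "that", "his", "it", "i"}


class TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def fetch_html(url):
    """Download a page and return it as a string (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def html_to_text(html):
    """Strip the tags from an HTML string and return the remaining text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and count non-stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# A made-up HTML snippet stands in for a downloaded novel.
sample_html = (
    "<html><body><h1>Moby Dick</h1>"
    "<p>Call me Ishmael. The whale, the whale!</p></body></html>"
)
freqs = word_frequencies(html_to_text(sample_html))
print(freqs.most_common(3))
```

In the session's own stack, the counting step is the kind of job nltk's FreqDist handles, and it can also draw the frequency plot; the Counter here only mirrors the counting part.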
For our beginner and intermediate learners
This code-along session is meant for beginners and intermediates alike, though some programming fundamentals and Python basics will help. Hugo uses Jupyter Notebooks and the terminal. However, if you're not super familiar with these tools, never fear! In this session, you'll get a gentle intro to them.
You'll learn how to put all of these tools together into a data science pipeline to produce informative figures such as this:
Can you guess which novel the word distribution in the graph above comes from?
Feel free to code along or just watch Hugo do his thing! If you do wish to code along, please follow the setup instructions before starting the video. You'll need, amongst other things, to clone the repo and download the Anaconda Distribution for Python 3.6 if you haven't already.
Now you are ready to watch Hugo and follow along. The video lasts about 1 hour and 20 minutes. You can skip to the 12th minute where Hugo comes on and the session actually starts.
Accessing solutions notebook
You can find the notebook that Hugo was working on, with the solutions, here.
For those of you who prefer a detailed Jupyter Notebook for the same project, it is available here! Can you improve on Hugo's pipeline? (Hint: for now, the words "whale" and "whales" are counted separately).
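One possible direction for the hint is to normalize words before comparing counts. With nltk you would typically reach for a stemmer or lemmatizer (for example PorterStemmer or WordNetLemmatizer). The sketch below is a deliberately naive, dependency-free stand-in: the hypothetical helper merge_simple_plurals folds a word ending in "s" into its singular only when that singular also appears in the counts.

```python
from collections import Counter


def merge_simple_plurals(counts):
    """Fold simple '-s' plurals into their singular form when both occur."""
    merged = Counter()
    for word, n in counts.items():
        singular = word[:-1]
        # Skip words like "glass", and only merge if the singular was seen too.
        if word.endswith("s") and not word.endswith("ss") and singular in counts:
            merged[singular] += n
        else:
            merged[word] += n
    return merged


# Made-up token list for illustration.
counts = Counter(["whale", "whales", "whale", "sea", "ship", "ships", "glass"])
merged = merge_simple_plurals(counts)
print(merged["whale"], merged["ship"])
```

This rule misses irregular plurals and can wrongly merge unrelated pairs (such as "as" and "a"), which is why a proper stemmer or lemmatizer is the more robust choice.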
Spread the word!
Finally, we'd like to announce that we are preparing another live code-along session for the week of November 27th. More on that later. But if you're interested in doing an awesome Machine Learning project using scikit-learn, make sure to stay tuned!