Data Processing in Shell

Learn powerful command-line skills to download, process, and transform data, and to build a machine learning pipeline.
4 Hours, 13 Videos, 46 Exercises, 7,286 Learners
3550 XP

Course Description

We live in a busy world with tight deadlines. As a result, we fall back on what is familiar and easy, favoring graphical tools like Anaconda and RStudio. However, taking the time to learn data analysis on the command line is a great long-term investment because it makes us stronger and more productive data people.

In this course, we will take a practical approach to learning simple, powerful, and data-specific command-line skills. Using publicly available Spotify datasets, we will learn how to download, process, clean, and transform data, all via the command line. We will also learn advanced techniques such as command-line-based SQL database operations. Finally, we will combine the powers of the command line and Python to build a data pipeline for automating a predictive model.

  1. Downloading Data on the Command Line

    In this chapter, we learn how to download data files from web servers via the command line. In the process, we also learn about documentation manuals, option flags, and multi-file processing.
  2. Data Cleaning and Munging on the Command Line

    We continue our data journey from data downloading to data processing. In this chapter, we use the command-line library csvkit to convert, preview, filter, and manipulate files to prepare our data for further analysis.
  3. Database Operations on the Command Line

    In this chapter, we dig deeper into all that the csvkit library has to offer. In particular, we focus on database operations we can do on the command line, including table creation, data pulls, and various ETL transformations.
  4. Data Pipeline on the Command Line

    In the last chapter, we connect the command line with other data science languages and learn how they can work together. Using Python as a case study, we learn to execute Python on the command line, to install dependencies using the package manager pip, and to build an entire model pipeline using the command line.
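
To give a feel for the preview, filter, and transform workflow the chapters describe, here is a minimal sketch. It is not taken from the course: it swaps csvkit for standard POSIX tools (head, awk, tail, wc) so it runs without installing anything, and the file names and sample rows are made up for illustration rather than drawn from the Spotify datasets.

```shell
#!/bin/sh
# Hypothetical sample data (invented rows, not the course's Spotify files).
cat > tracks.csv <<'EOF'
track_id,artist,danceability
1,Artist A,0.81
2,Artist B,0.42
3,Artist C,0.67
EOF

# Preview: print the header and the first data row.
head -n 2 tracks.csv

# Filter and transform: keep the header plus rows whose third column
# (danceability) is above 0.5, and output only the artist and
# danceability columns.
awk -F, 'NR == 1 || $3 > 0.5 { print $2 "," $3 }' tracks.csv > danceable.csv

# Count the data rows that survived the filter (header excluded).
row_count=$(tail -n +2 danceable.csv | wc -l | tr -d ' ')
echo "$row_count rows kept"
```

With csvkit installed, the preview and column-selection steps would typically be handled by its csvlook and csvcut commands, and row filtering by pattern by csvgrep; the sketch above only mirrors the overall shape of such a pipeline.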
In the following tracks
Data Engineer
Adrián Soto, Hillary Green-Lerman

Susan Sun

Data Person
I want to make a statistically significant difference in the world.

My expertise lies in design-based strategic data consulting for start-ups, data-related training & education, conscientious application of data science for business, and pro bono data work for social change.

What do other learners have to say?

I've used other sites—Coursera, Udacity, things like that—but DataCamp's been the one that I've stuck with.

Devon Edwards Joseph
Lloyds Banking Group

DataCamp is the top resource I recommend for learning data science.

Louis Maiden
Harvard Business School

DataCamp is by far my favorite website to learn from.

Ronald Bowers
Decision Science Analytics, USAA