8 Useful Shell Commands for Data Science
There are many scenarios where you need to quickly analyze, modify and process large files, both in number and size. These files are often text-based Comma Separated Values (CSV) files. Opening them with standard spreadsheet applications, such as Excel, LibreOffice or OpenOffice, overloads a machine's memory, and processing a large number of files in one batch often fails after a few hours because of some unexpected file anomaly.
You inevitably spend a lot of time on frozen screens, restarts and long waits.
However, most of these tasks could be carried out with a few lines of code. What's more, familiarity with a few simple shell command lines can go a long way in saving time and reducing frustration.
This post will give an overview of some shell commands that I use nearly every day. I hope you find them useful.
Quick Notes
Note that there are different types of shell (bash, zsh, ...), with bash being the most common as it is the default shell on macOS and major Linux distributions. Zsh (and Oh My Zsh) is a popular and powerful alternative. The shell commands included in this blog post have been tested in bash on macOS and should work with other shells and environments.
All the following examples are based on the adult dataset from the UCI Machine Learning Repository, also known as the "Census Income" dataset and available here. This dataset is commonly used to predict whether income exceeds $50K/yr based on census data. With 48842 rows and 14 attributes, it is by no means a large dataset, but it will be sufficient to illustrate the examples. Although the data is stored with the .data extension, it is a well-formatted CSV file. Feel free to download the dataset to follow along!
Count with wc
Given a new text-based file, you want to know how many lines it contains. This can be done with the word count command wc -l:
$ wc -l adult.data
32562 adult.data
This tells you that the adult.data file contains 32562 rows. The -l flag tells wc to count lines, but you can also use wc to count words instead with the -w flag.
$ wc -w adult.data
488415 adult.data
The adult.data file contains nearly 500k words.
The wc command can also count the number of files in a directory by using the output of a simple ls -l command as input to wc. Using the output of one command as input to another through the pipe symbol | is a useful shell pattern called pipelining. You'll see it in several examples throughout this post.
Assume now that you have a folder with many files. The following command will give you a quick count of the files it contains (ls -l also prints a total line, so the result is off by one):
$ ls -l <folder> | wc -l
508
There are many other applications for wc if you start adding wildcards and subfolders.
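For instance, wc accepts several filenames at once and prints a count per file plus a grand total. A minimal sketch, assuming the current folder and a data/ subfolder contain CSV files:
$ wc -l *.csv
$ wc -l data/*.csv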
Let's go back to your adult.data file and look at its first lines with the head command. By default, head outputs 10 lines; you can limit the output to the first 2 lines using the -n flag.
$ head -n 2 adult.data
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
You'll notice that the file does not have a header line: the column names are not stored in the file. You can add that header line by concatenating two files using the cat command.
Concatenate Files with cat
The cat command prints a file's content to the standard output (aka your terminal). It can also be used to concatenate a series of files. The command cat file_1.csv file_2.csv > target_file.csv will merge the content of file_1.csv and file_2.csv into target_file.csv, appending file_2.csv to the end of file_1.csv.
The header line is not in the original dataset, so you need to create it. To do so, echo the comma-separated list of column names into a header.csv file:
$ echo "age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,class" > header.csv
In this example, you used the shell redirection character > to dump the output of echo into the header.csv file. The > operator will either create a file or entirely replace its content if the file already exists. Doubling the symbol, >>, will append the new content to an already existing file without erasing its content.
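As a quick illustration of the difference between > and >>, here is a throwaway sketch with a hypothetical notes.txt file:
$ echo "first line" > notes.txt
$ echo "second line" >> notes.txt
$ cat notes.txt
first line
second line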
Let's now add the header.csv file at the beginning of the adult.data file. At the same time, rename adult.data to adult.csv since it is, after all, a CSV-formatted file:
$ cat header.csv adult.data > adult.csv
Check that adult.csv now starts with the column names:
$ head -n 2 adult.csv
age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
Modify a File with sed
Another frequent data-mining scenario happens when a file is corrupted or badly formatted, for instance with non-UTF-8 characters or a misplaced comma. You can correct that file without actually opening it using the sed command.
The generic sed pattern is:
$ sed "s/<string to replace>/<string to replace it with>/g" <source_file> > <target_file>
The adult.csv dataset uses the ? character to denote a missing value. You can replace that character with a better-suited default value using sed. An empty string is preferable as an indication of a missing value, since it will be interpreted as a NaN value when loading the data into a Pandas DataFrame.
$ grep ", ?," adult.csv | wc -l
2399
This gives you a count of 2399 lines with at least one column containing a missing value denoted by ? (grep prints the lines matching a pattern, and wc -l counts them). The following command replaces all the ? columns with an empty string, so every cell that contained only a ? followed by a space will now be truly empty. Note that you should not redirect sed's output to the very file it is reading (the shell would truncate it before sed gets to it), so write to a new file and then rename it:
$ sed "s/, ?,/,,/g" adult.csv > adult_clean.csv && mv adult_clean.csv adult.csv
Note that you use the column delimiter , in the source and target strings to avoid replacing legitimate question marks that could be present elsewhere in the dataset.
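If you prefer to modify the file in place, sed also offers a -i flag. The exact syntax differs between GNU sed (common on Linux) and BSD sed (the default on macOS), so treat the following as a sketch to adapt to your system:
# GNU sed (Linux)
$ sed -i "s/, ?,/,,/g" adult.csv
# BSD sed (macOS)
$ sed -i "" "s/, ?,/,,/g" adult.csv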
Subset a Large File
Imagine now that you're dealing with a file with over 30 million rows. As you craft your Python or R script, you will want to test your workflow on just a sample of that original large file rather than load the whole dataset into memory. You can use a combination of head and tail to create a subsample of the original dataset.
head outputs the first part of a file, while tail outputs the last part.
As you've seen before, these commands take a -n flag, which stands for --lines and restricts the number of lines in the output. tail also takes a -f flag, where f stands for --follow, which outputs appended data as lines are added to the end of the file. This is particularly convenient for monitoring log files while your scripts are running, using tail -f <logfile>.
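For example, a minimal sketch (the training.log filename is hypothetical) that watches a log file and only shows lines containing ERROR as they are written:
$ tail -f training.log | grep "ERROR"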
By mixing in the pipe symbol | with head and tail, you can extract a certain number of lines from within a source file and export the content to a subset file. For example, to extract the 20 lines that follow line 100 (lines 101 to 120):
$ head -n 120 adult.csv | tail -n 20 > adult_sample.csv
Note that with head you first take first_line + number_of_lines, and tail then keeps only the last number_of_lines. The generic subsetting command is:
$ head -n <total_lines> <source_file> | tail -n <number_of_lines> > <target_file>
where total_lines = first_line + number_of_lines.
However, your new sampled file no longer contains the header row. You already know how to add the header back by using cat to concatenate the header.csv file created before with the subset file:
$ cat header.csv adult_sample.csv > adult_sample_with_header.csv
Note that you did not use the source filename adult_sample.csv as the target filename, but instead created a new file titled adult_sample_with_header.csv. Using the same file as both one of the sources and the target of cat will result in unexpected file content; it is better to create a new file as the output of the cat command.
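Alternatively, you can build the sampled file together with its header in one short sequence, reusing the redirection operators from earlier (a sketch of the same approach):
$ head -n 1 adult.csv > adult_sample_with_header.csv
$ head -n 120 adult.csv | tail -n 20 >> adult_sample_with_header.csv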
Finding Duplicates with uniq
With the uniq command you can find adjacent repeated lines in a file. uniq takes several flags, the most useful ones being:
- uniq -c: adds the repetition count to each line;
- uniq -d: only outputs duplicated lines; and
- uniq -u: only outputs unique lines.
However, uniq is not a smart command: repeated lines will not be detected unless they are adjacent, which means that you first need to sort the file. The following command counts the number of duplicated lines in adult.csv:
$ sort adult.csv | uniq -d | wc -l
23
and shows that there are 23 duplicates. The next command takes the output of uniq -c (every line with its repetition count), sorts it in reverse numerical order and outputs the first 3 duplicates:
$ sort adult.csv | uniq -c | sort -rn | head -n 3
3 25, Private, 195994, 1st-4th, 2, Never-married, ...
2 90, Private, 52386, Some-college, 10, Never-married, ...
2 49, Self-emp-not-inc, 43479, Some-college, 10, Married-civ-spouse, ...
There are many powerful options that can be obtained by combining sort and uniq with different flags. Use man sort and man uniq to further explore these commands.
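Beyond counting duplicates, the same combination can produce a deduplicated copy of the file, since uniq with no flags keeps a single copy of each adjacent repeated line. A quick sketch (keep in mind that sorting also moves the header line out of first position):
$ sort adult.csv | uniq > adult_dedup.csv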
Selecting Columns with cut
The great thing about CSV files and shell commands is that you can also work at the column level by using cut to select a particular column. cut takes two main flags: -d to specify the column delimiter and -f to specify the column(s) you want to work on. In the following example, you use cut to find the number of unique values taken by the categorical variable workclass (column 2).
First, select the workclass column and pipe it to head to verify that you have the right column:
$ cut -d "," -f 2 adult.csv | head -n 3
workclass
State-gov
Self-emp-not-inc
Now, to count the unique values, sort the output of cut and pipe the result to uniq -c, as such:
$ cut -d "," -f 2 adult.csv | sort | uniq -c
1837
960 Federal-gov
2093 Local-gov
7 Never-worked
22696 Private
1116 Self-emp-inc
2541 Self-emp-not-inc
1298 State-gov
14 Without-pay
1 workclass
This tells you, for instance, that you have 1837 missing values, and that the main class is by far the Private class with 22696 occurrences. The command line above is similar to the value_counts() method applied to a DataFrame containing the adult.csv data.
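cut can also select several columns at once by passing -f a comma-separated list of field numbers, which is handy for quickly eyeballing a couple of variables together. A small sketch using the age and education columns (fields 1 and 4):
$ cut -d "," -f 1,4 adult.csv | head -n 3
age,education
39, Bachelors
50, Bachelors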
Looping
So far, you've mostly worked on one file. To rename, process or transfer large numbers of files, you should add loops to your shell toolkit. The generic format of a loop in bash is:
while true; do
    _do something_ ;
done
Let's put that to work. Imagine you have 1000 files with spaces in their filenames and you want to replace each space with an underscore "_".
replace_source=' '
replace_target='_'
for filename in ./*.csv; do
    new_filename=${filename//$replace_source/$replace_target}
    mv "$filename" "$new_filename"
done
In this example:
- You first declare two variables, replace_source and replace_target, as placeholders for the space and the underscore characters;
- You loop over all the *.csv files in the current folder;
- For each filename, you create a new_filename by replacing each space with an underscore;
- And you rename the file from its current filename to the new filename with the mv command.
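The same pattern applies to many other batch tasks. For example, a minimal sketch that prints the line count of every CSV file in the current folder, reusing wc from earlier:
for filename in ./*.csv; do
    wc -l "$filename"
done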
In fact, not only have you used the shell to loop through files in a directory, but you have also created variables.
Variables
You will end this intro to shell commands with variables. Variable creation in shell is done simply with:
$ varname='<a string>'
$ varname=<a number>
For instance:
$ varname='Hello world'
$ varname=123.4
Note that there is no space around the = sign; var = '<a string>' would not work. To get the variable's value back, simply echo it:
$ echo $varname
123.4
And to use it in a script, wrap it in double quotes as you've seen before:
$ mv "$filename" "$new_filename"
Writing loops and using variables opens the door to more complex file manipulations and is a gateway to shell scripting, which is beyond the scope of this introductory article. However, shell scripting is simple enough if you start small and improve as your confidence grows.
Conclusion
Shell command lines are extremely useful in your daily work as a data scientist. There are many more examples and use cases that could be explored, and, as you will see, there are often several different ways to achieve a particular result with the wide range of available shell commands and related flags. In those cases, keeping it simple is always the winning strategy.
Hopefully the examples presented here will help you integrate shell commands in your data processing toolkit. Feel free to reach out and share your shell tricks with me on twitter: @alexip.
Check out our Setup a Data Science Environment on your Computer tutorial.