Tutorials
pandas
+4

pandas Tutorial for Beginners

Are you ready to begin your pandas journey? Here’s a step-by-step guide on how to get started.

What is pandas?

pandas is a very popular and easy-to-learn Python library for handling tabular data. It can take in data from a wide range of sources such as CSV files, Excel files, HTML tables on the web, and text files. It allows you to apply the same framework to all of these sources to clean and analyze the data using optimized built-in functionality which scales very well with large datasets.

In our blog post on how to learn pandas, we discussed the learning path you may take to master this package. In this tutorial, we’ll walk you through the installation process and basic pandas syntax, and then apply the syntax to a real-world Airbnb dataset to answer a few business questions. Much of this information is covered in our course, pandas Foundations. This article is aimed at beginners with basic knowledge of Python and no prior experience with pandas to help you get started.

How to Install pandas

Anaconda Distribution

There are a few different ways to install pandas onto your computer. The recommended method in the pandas documentation is to install it as part of the Anaconda distribution, a distribution for data analysis and scientific computing that works cross platform. This distribution also includes other popular packages in the SciPy stack such as NumPy, Matplotlib, and IPython which all work great together with pandas. Installation instructions and links for each operating system for Anaconda can be found here (Windows / MacOs).

Pip Install

Finally, the second recommended installation method is to install pandas using pip from PyPI. This is a very simple option. pip allows you to install individual packages to your computer using the pip install command in the terminal and comes with Python 2 installations greater than or equal to 2.7.9 and Python 3 greater than or equal to 3.4. For a more in-depth introduction to pip, we have a tutorial here.

pip install pandas

Getting Started with pandas

Now that pandas is installed, let’s learn a bit of the basics about how to use the package to handle real-world data. This tutorial will use an Airbnb listings dataset. Specifically, we will use the listings.csv file for New York City. This file includes the the name of the listing and host, information about the location of the listing including the neighborhood and longitude and latitude coordinates, the type of listing (entire home, private room, hotel, and shared room), information about past reviews, and the price/night and minimum nights for a booking.

Importing and Viewing Data

As mentioned previously, pandas allows us to convert data from different formats, such as a CSV file, into a DataFrame object. A DataFrame is a data structure we use quite often in pandas that serves as a tabular representation of data. Think of a DataFrame as an Excel spreadsheet or database table.

We begin by importing pandas, conventionally aliased as pd. We can then import a CSV file as a DataFrame using the pd.read_csv() function, which takes in the path of the file you want to import. To view the DataFrame in a Jupyter notebook, we simply type the name of the variable.

# import pandas package
import pandas as pd
# read the airbnb NYC listings csv file
airbnb = pd.read_csv("listings.csv")
# display the pandas DataFrame
display(airbnb)
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 100 30 48 11/4/2019 0.35 2 365
1 3831 Whole flr w/private bdrm, bath & kitchen(pls r... 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 73 1 386 1/27/2021 4.99 1 249
2 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 30 50 12/2/2019 0.35 1 365
3 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Midtown 40.76468 -73.98315 Private room 79 2 474 9/25/2020 3.31 1 343
4 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 75 2 118 7/21/2017 0.85 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
37007 48033101 Cozy One-Bedroom with Full Kitchen Near Manhattan 251455701 Alexander Queens Woodside 40.74415 -73.90941 Entire home/apt 90 3 0 NaN NaN 2 365
37008 48033611 Grand Concourse Gem 287782529 Adriana Bronx Norwood 40.87575 -73.88399 Private room 520 1 0 NaN NaN 1 292
37009 48038944 Natural Light-Filled Home in Upper East Side 305240193 June Manhattan Midtown 40.75774 -73.96173 Private room 63 30 0 NaN NaN 181 1
37010 48039640 Spacious Living in the Heart of Upper East Side 305240193 June Manhattan Upper East Side 40.76242 -73.95966 Private room 67 30 0 NaN NaN 181 360
37011 48039776 Find Cozy in this Upper West Side Furnished Home 305240193 June Manhattan Upper West Side 40.79850 -73.96216 Private room 66 30 0 NaN NaN 181 353

37012 rows × 16 columns

Since there are so many rows in the DataFrame, we see that most of the data is truncated. We can view just the first or last few entries in the DataFrame using the .head() and .tail() methods.

airbnb.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 100 30 48 11/4/2019 0.35 2 365
1 3831 Whole flr w/private bdrm, bath & kitchen(pls r... 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 73 1 386 1/27/2021 4.99 1 249
2 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 30 50 12/2/2019 0.35 1 365
3 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Midtown 40.76468 -73.98315 Private room 79 2 474 9/25/2020 3.31 1 343
4 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 75 2 118 7/21/2017 0.85 1 0
airbnb.tail()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
37007 48033101 Cozy One-Bedroom with Full Kitchen Near Manhattan 251455701 Alexander Queens Woodside 40.74415 -73.90941 Entire home/apt 90 3 0 NaN NaN 2 365
37008 48033611 Grand Concourse Gem 287782529 Adriana Bronx Norwood 40.87575 -73.88399 Private room 520 1 0 NaN NaN 1 292
37009 48038944 Natural Light-Filled Home in Upper East Side 305240193 June Manhattan Midtown 40.75774 -73.96173 Private room 63 30 0 NaN NaN 181 1
37010 48039640 Spacious Living in the Heart of Upper East Side 305240193 June Manhattan Upper East Side 40.76242 -73.95966 Private room 67 30 0 NaN NaN 181 360
37011 48039776 Find Cozy in this Upper West Side Furnished Home 305240193 June Manhattan Upper West Side 40.79850 -73.96216 Private room 66 30 0 NaN NaN 181 353

Selecting Columns

Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.

# Results for a single column
airbnb['name']
0                                    Skylit Midtown Castle
1        Whole flr w/private bdrm, bath & kitchen(pls r...
2                                          BlissArtsSpace!
3                         Large Furnished Room Near B'way 
4                       Cozy Clean Guest Room - Family Apt
                               ...                        
37007    Cozy One-Bedroom with Full Kitchen Near Manhattan
37008                                  Grand Concourse Gem
37009         Natural Light-Filled Home in Upper East Side
37010      Spacious Living in the Heart of Upper East Side
37011     Find Cozy in this Upper West Side Furnished Home
Name: name, Length: 37012, dtype: object

The result is a Series object with its own set of attributes and methods. These objects are like arrays and are the building blocks of DataFrames; each DataFrame is made up of a set of Series.

To select multiple columns at once, we use double brackets and commas between column names as shown below.

# results for multiple columns
hosts = airbnb[['host_id', 'host_name']]
hosts.head()
host_id host_name
0 2845 Jennifer
1 4869 LisaRoxanne
2 7356 Garon
3 8967 Shunichi
4 7490 MaryEllen

The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving onto the analysis, especially if the data is wide with many unnecessary variables.

Case Study: Analyzing Airbnb Data with pandas

Now that we understand the basics of pandas, it will be interesting to answer some questions using the built-in functions in pandas. Our goals in this case study will be to understand some of the differences between each type of listing in our data, defined by the room_type column. Specifically, we would like to know: What is the typical price for a listing? What is the difference in median price for different types of listings? How have these prices changed over time for each listing type? Finally, what is the distribution of typical listing prices?

Data Cleaning

Like any data analysis exercise, we’ll be starting off with some data cleaning. Luckily, pandas has some easy to use functionality to streamline data cleaning, to get it ready for analysis.

Changing Column Formatting

pandas does a relatively good job of understanding what data types each column is meant to be stored as. However, sometimes, we would like to change the default type. For example, dates are commonly seen in DataFrames. pandas has a useful built-in data type to handle dates, called a datetime index, which allows us to extract useful information like the year and month for a particular row.

To check the data types of columns we call the .dtypes attribute of the DataFrame. To convert a column to a datetime index, we use the .to_datetime() functions (these functions exist for all supported data types like .to_string() to convert a column to be stored as a string).

In the code below, we also see the syntax to both edit existing columns and create new ones. Specifically, we want to convert the last_review column to a datetime column. So we select it as seen in the previous section and set it equal to the result of the operation. Datetime series have a .dt attribute with built-in attributes and functions. Below, we select the .year attribute of the newly typed datetime column, last_review, to get the year of each row.

# Show the data types for each column
airbnb.dtypes
id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object
# Change the type of a column to datetime
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb.dtypes
id                                         int64
name                                      object
host_id                                    int64
host_name                                 object
neighbourhood_group                       object
neighbourhood                             object
latitude                                 float64
longitude                                float64
room_type                                 object
price                                      int64
minimum_nights                             int64
number_of_reviews                          int64
last_review                       datetime64[ns]
reviews_per_month                        float64
calculated_host_listings_count             int64
availability_365                           int64
dtype: object
# extract the year from a datetime series
airbnb['year'] = airbnb['last_review'].dt.year
airbnb['year'].head()
0    2019.0
1    2021.0
2    2019.0
3    2020.0
4    2017.0
Name: year, dtype: float64

Series String Functions

Another useful data cleaning tool is removing leading and trailing whitespace from string data. This can be done using the strip method.

# Strip leading and trailing spaces from a string series
airbnb['name'] = airbnb['name'].str.strip()
airbnb['name'].head()
0                                Skylit Midtown Castle
1    Whole flr w/private bdrm, bath & kitchen(pls r...
2                                      BlissArtsSpace!
3                      Large Furnished Room Near B'way
4                   Cozy Clean Guest Room - Family Apt
Name: name, dtype: object

pandas series have other interesting built-in functions such as making all of the inputs lowercase using .str.lower(). This is particularly useful in cases when the data do not have standard capitalization practices which could lead to classifying the same entity as two separate entities. Here, we are lower-casing all listing names that can be found in the name column, and assigning the updated listing names to the newly created name_lower column.

# lowercase all strings in a series
airbnb['name_lower'] = airbnb['name'].str.lower()
airbnb['name_lower'].head()
0                                skylit midtown castle
1    whole flr w/private bdrm, bath & kitchen(pls r...
2                                      blissartsspace!
3                      large furnished room near b'way
4                   cozy clean guest room - family apt
Name: name_lower, dtype: object

Combining Columns

One final useful data cleaning/preparation technique we will cover is combining rows. If we want to make calculations between columns, we can easily do this by applying the operation to each of the series as shown below. Here, we are calculating the minimum number of revenue a listing generates, by calculating the product of the minimum number of stays and the price per night.

# calculate using two columns
airbnb['min_revenue'] = airbnb['minimum_nights'] * airbnb['price']
airbnb[['minimum_nights', 'price', 'min_revenue']].head()
minimum_nights price min_revenue
0 30 100 3000
1 1 73 73
2 30 60 1800
3 2 79 158
4 2 75 150

Descriptive Analysis

Summary Statistics

Once the data is clean and ready to analyze, we can compute some interesting statistics to answer some business questions. The first question we may have is what the average and median price is for the listings in our data. We use the built-in .mean() and .median() methods to compute these.

# get the mean price
airbnb['price'].mean()
142.84224035447963
# get the median price
airbnb['price'].median()
99.0

This significant difference in mean and median suggests the distribution in price is heavily skewed. We should rely more on median prices for our analysis to get a sense of typical listings.

Grouped Statistics

We can also conduct these calculations on groupings of data using the .groupby() method. This function is very similar to using pivot tables in excel as we select a subset of columns in our data and then conduct aggregate calculations on them. As we mentioned in the introduction of this case study, we are interested in the difference in prices between each type of room listing in our data.

# get the mean grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).mean()
room_type price
0 Entire home/apt 191.461308
1 Hotel room 228.568562
2 Private room 86.318701
3 Shared room 100.991254
# get the median grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).median()
room_type price
0 Entire home/apt 140
1 Hotel room 150
2 Private room 60
3 Shared room 45

We see that hotel room median prices are the highest followed by entire home/apartment listings with shared room listings being the cheapest in terms of median price.

Note that we select the specific columns we are interested in before calling .groupby(). This allows us to have a slightly cleaner output as the calculations are done on all columns that are present in the grouped DataFrame. We also use an additional optional parameter, as_index which gives us a slightly cleaner and easier to use output, although this is not required.

If we would like to group on additional variables, we can input a list rather than a string as the first argument of .groupby().

# get the median grouped by type of room and year
airbnb[['room_type', 'year', 'price']].groupby(['room_type', 'year'], as_index=False).median()
room_type year price
0 Entire home/apt 2010.0 99.0
1 Entire home/apt 2011.0 199.5
2 Entire home/apt 2012.0 125.0
3 Entire home/apt 2013.0 150.0
4 Entire home/apt 2014.0 173.0
5 Entire home/apt 2015.0 150.0
6 Entire home/apt 2016.0 150.0
7 Entire home/apt 2017.0 149.0
8 Entire home/apt 2018.0 150.0
9 Entire home/apt 2019.0 150.0
10 Entire home/apt 2020.0 130.0
11 Entire home/apt 2021.0 112.0
12 Hotel room 2016.0 349.0
13 Hotel room 2017.0 248.0
14 Hotel room 2018.0 100.0
15 Hotel room 2019.0 229.0
16 Hotel room 2020.0 142.0
17 Hotel room 2021.0 108.5
18 Private room 2011.0 249.0
19 Private room 2012.0 164.0
20 Private room 2013.0 85.0
21 Private room 2014.0 75.0
22 Private room 2015.0 75.0
23 Private room 2016.0 65.0
24 Private room 2017.0 65.0
25 Private room 2018.0 65.0
26 Private room 2019.0 69.0
27 Private room 2020.0 60.0
28 Private room 2021.0 50.0
29 Shared room 2014.0 84.0
30 Shared room 2015.0 57.5
31 Shared room 2016.0 97.0
32 Shared room 2017.0 70.0
33 Shared room 2018.0 42.0
34 Shared room 2019.0 50.0
35 Shared room 2020.0 39.0
36 Shared room 2021.0 31.5

We see the price fluctuations over time for each of the room types. It appears that median prices have generally decreased over time in each of the four categories, most significantly for hotel room listings.

Filtering Data

Often, we are only interested in a subset of the rows in our dataset. For example, we may only be interested in listings under $1000 as they are more common and closer to the typical listing. We do this by passing a Boolean expression into single brackets as shown below.

# get all rows with price < 1000
airbnb_under_1000 = airbnb[airbnb['price'] < 1000]
airbnb_under_1000.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 year name_lower min_revenue
0 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 100 30 48 2019-11-04 0.35 2 365 2019.0 skylit midtown castle 3000
1 3831 Whole flr w/private bdrm, bath & kitchen(pls r... 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 73 1 386 2021-01-27 4.99 1 249 2021.0 whole flr w/private bdrm, bath & kitchen(pls r... 73
2 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 30 50 2019-12-02 0.35 1 365 2019.0 blissartsspace! 1800
3 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Midtown 40.76468 -73.98315 Private room 79 2 474 2020-09-25 3.31 1 343 2020.0 large furnished room near b'way 158
4 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 75 2 118 2017-07-21 0.85 1 0 2017.0 cozy clean guest room - family apt 150

We can also pass in multiple filters by surrounding each expression in parenthesis and using either & (for and expressions) or | (for or expressions). You will get an error if you do not surround the expressions with parentheses.

# get all rows with price < 1000 and year equal to 2020
airbnb_2019_under_1000 = airbnb[(airbnb['price'] < 1000) & (airbnb['year'] == 2020)]
airbnb_2019_under_1000.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 year name_lower min_revenue
3 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Midtown 40.76468 -73.98315 Private room 79 2 474 2020-09-25 3.31 1 343 2020.0 large furnished room near b'way 158
5 5803 Lovely Room 1, Garden, Best Area, Legal rental 9744 Laurie Brooklyn South Slope 40.66829 -73.98779 Private room 83 4 182 2020-10-17 1.27 3 365 2020.0 lovely room 1, garden, best area, legal rental 332
6 6848 Only 2 stops to Manhattan studio 15991 Allen & Irina Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 109 30 181 2020-03-16 1.27 1 309 2020.0 only 2 stops to manhattan studio 3270
14 9657 Modern 1 BR / NYC / East Village 21904 Dana Manhattan East Village 40.72920 -73.98542 Entire home/apt 150 10 35 2020-11-06 0.25 1 74 2020.0 modern 1 br / nyc / east village 1500
15 9704 Spacious 1 bedroom in luxe building 32045 Teri Manhattan Harlem 40.81305 -73.95466 Private room 55 30 98 2020-03-14 1.21 1 365 2020.0 spacious 1 bedroom in luxe building 1650

Plotting

pandas also has built-in plotting capabilities. For example, we can see the distribution of prices for each listing in our dataset using a histogram in one line of code. Note, we use the under $1000 DataFrame here as we cannot see the bars very clearly when including all prices.

# distribution of prices under $1000
ax = airbnb_under_1000['price'].plot.hist(bins=40)

Next Steps

You now have the basic tools required to analyze tabular data in python using pandas. We are already able to extract a lot of useful insights by just doing some simple discrete analysis on the Airbnb dataset. pandas is optimized to work well with a lot of other Python libraries like Matplotlib and seaborn which create great data visualizations in just a few lines of code. It is also typically used in Jupyter notebooks as they make it easy to quickly conduct analyses and observe results.

The pandas library scales very well to larger datasets. Now that you have the fundamentals down, it’s time to apply these skills to a real-world dataset on your own or to learn some of the more advanced functionality with some of our other courses and suggested projects. Some good courses to expand your pandas toolkit include our courses Merging DataFrames with pandas, Working with Categorical Data in Python, and chapter four of Writing Efficient Python Code. If you prefer written posts like this one, we have articles discussing more complex dataframe operations, joining dataframes, and an in-depth explanation of parameters for reading CSV files in pandas.