pandas Tutorial for Beginners
What is pandas?
pandas is a very popular and easy-to-learn Python library for handling tabular data. It can take in data from a wide range of sources such as CSV files, Excel files, HTML tables on the web, and text files. It allows you to apply the same framework to all of these sources to clean and analyze the data using optimized built-in functionality which scales very well with large datasets.
In our blog post on how to learn pandas, we discussed the learning path you may take to master this package. In this tutorial, we’ll walk you through the installation process and basic pandas syntax, and then apply the syntax to a real-world Airbnb dataset to answer a few business questions. Much of this information is covered in our course, pandas Foundations. This article is aimed at beginners with basic knowledge of Python and no prior experience with pandas to help you get started.
How to Install pandas
Anaconda Distribution
There are a few different ways to install pandas onto your computer. The recommended method in the pandas documentation is to install it as part of the Anaconda distribution, a distribution for data analysis and scientific computing that works cross platform. This distribution also includes other popular packages in the SciPy stack such as NumPy, Matplotlib, and IPython which all work great together with pandas. Installation instructions and links for each operating system for Anaconda can be found here (Windows / MacOs).
Pip Install
Finally, the second recommended installation method is to install pandas using pip from PyPI. This is a very simple option. pip allows you to install individual packages to your computer using the pip install
command in the terminal and comes with Python 2 installations greater than or equal to 2.7.9 and Python 3 greater than or equal to 3.4. For a more in-depth introduction to pip, we have a tutorial here.
pip install pandas
Getting Started with pandas
Now that pandas is installed, let’s learn a bit of the basics about how to use the package to handle real-world data. This tutorial will use an Airbnb listings dataset. Specifically, we will use the listings.csv
file for New York City. This file includes the the name of the listing and host, information about the location of the listing including the neighborhood and longitude and latitude coordinates, the type of listing (entire home, private room, hotel, and shared room), information about past reviews, and the price/night and minimum nights for a booking.
Importing and Viewing Data
As mentioned previously, pandas allows us to convert data from different formats, such as a CSV file, into a DataFrame object. A DataFrame is a data structure we use quite often in pandas that serves as a tabular representation of data. Think of a DataFrame as an Excel spreadsheet or database table.
We begin by importing pandas, conventionally aliased as pd. We can then import a CSV file as a DataFrame using the pd.read_csv()
function, which takes in the path of the file you want to import. To view the DataFrame in a Jupyter notebook, we simply type the name of the variable.
# import pandas package
import pandas as pd
# read the airbnb NYC listings csv file
airbnb = pd.read_csv("listings.csv")
# display the pandas DataFrame
display(airbnb)
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 100 | 30 | 48 | 11/4/2019 | 0.35 | 2 | 365 |
1 | 3831 | Whole flr w/private bdrm, bath & kitchen(pls r... | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 73 | 1 | 386 | 1/27/2021 | 4.99 | 1 | 249 |
2 | 5121 | BlissArtsSpace! | 7356 | Garon | Brooklyn | Bedford-Stuyvesant | 40.68688 | -73.95596 | Private room | 60 | 30 | 50 | 12/2/2019 | 0.35 | 1 | 365 |
3 | 5178 | Large Furnished Room Near B'way | 8967 | Shunichi | Manhattan | Midtown | 40.76468 | -73.98315 | Private room | 79 | 2 | 474 | 9/25/2020 | 3.31 | 1 | 343 |
4 | 5203 | Cozy Clean Guest Room - Family Apt | 7490 | MaryEllen | Manhattan | Upper West Side | 40.80178 | -73.96723 | Private room | 75 | 2 | 118 | 7/21/2017 | 0.85 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
37007 | 48033101 | Cozy One-Bedroom with Full Kitchen Near Manhattan | 251455701 | Alexander | Queens | Woodside | 40.74415 | -73.90941 | Entire home/apt | 90 | 3 | 0 | NaN | NaN | 2 | 365 |
37008 | 48033611 | Grand Concourse Gem | 287782529 | Adriana | Bronx | Norwood | 40.87575 | -73.88399 | Private room | 520 | 1 | 0 | NaN | NaN | 1 | 292 |
37009 | 48038944 | Natural Light-Filled Home in Upper East Side | 305240193 | June | Manhattan | Midtown | 40.75774 | -73.96173 | Private room | 63 | 30 | 0 | NaN | NaN | 181 | 1 |
37010 | 48039640 | Spacious Living in the Heart of Upper East Side | 305240193 | June | Manhattan | Upper East Side | 40.76242 | -73.95966 | Private room | 67 | 30 | 0 | NaN | NaN | 181 | 360 |
37011 | 48039776 | Find Cozy in this Upper West Side Furnished Home | 305240193 | June | Manhattan | Upper West Side | 40.79850 | -73.96216 | Private room | 66 | 30 | 0 | NaN | NaN | 181 | 353 |
37012 rows × 16 columns
Since there are so many rows in the DataFrame, we see that most of the data is truncated. We can view just the first or last few entries in the DataFrame using the .head()
and .tail()
methods.
airbnb.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 100 | 30 | 48 | 11/4/2019 | 0.35 | 2 | 365 |
1 | 3831 | Whole flr w/private bdrm, bath & kitchen(pls r... | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 73 | 1 | 386 | 1/27/2021 | 4.99 | 1 | 249 |
2 | 5121 | BlissArtsSpace! | 7356 | Garon | Brooklyn | Bedford-Stuyvesant | 40.68688 | -73.95596 | Private room | 60 | 30 | 50 | 12/2/2019 | 0.35 | 1 | 365 |
3 | 5178 | Large Furnished Room Near B'way | 8967 | Shunichi | Manhattan | Midtown | 40.76468 | -73.98315 | Private room | 79 | 2 | 474 | 9/25/2020 | 3.31 | 1 | 343 |
4 | 5203 | Cozy Clean Guest Room - Family Apt | 7490 | MaryEllen | Manhattan | Upper West Side | 40.80178 | -73.96723 | Private room | 75 | 2 | 118 | 7/21/2017 | 0.85 | 1 | 0 |
airbnb.tail()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37007 | 48033101 | Cozy One-Bedroom with Full Kitchen Near Manhattan | 251455701 | Alexander | Queens | Woodside | 40.74415 | -73.90941 | Entire home/apt | 90 | 3 | 0 | NaN | NaN | 2 | 365 |
37008 | 48033611 | Grand Concourse Gem | 287782529 | Adriana | Bronx | Norwood | 40.87575 | -73.88399 | Private room | 520 | 1 | 0 | NaN | NaN | 1 | 292 |
37009 | 48038944 | Natural Light-Filled Home in Upper East Side | 305240193 | June | Manhattan | Midtown | 40.75774 | -73.96173 | Private room | 63 | 30 | 0 | NaN | NaN | 181 | 1 |
37010 | 48039640 | Spacious Living in the Heart of Upper East Side | 305240193 | June | Manhattan | Upper East Side | 40.76242 | -73.95966 | Private room | 67 | 30 | 0 | NaN | NaN | 181 | 360 |
37011 | 48039776 | Find Cozy in this Upper West Side Furnished Home | 305240193 | June | Manhattan | Upper West Side | 40.79850 | -73.96216 | Private room | 66 | 30 | 0 | NaN | NaN | 181 | 353 |
Selecting Columns
Typically, we will only want a subset of the available columns in our DataFrame. We can select a single column using single brackets and the name of the column as shown below.
# Results for a single column
airbnb['name']
0 Skylit Midtown Castle
1 Whole flr w/private bdrm, bath & kitchen(pls r...
2 BlissArtsSpace!
3 Large Furnished Room Near B'way
4 Cozy Clean Guest Room - Family Apt
...
37007 Cozy One-Bedroom with Full Kitchen Near Manhattan
37008 Grand Concourse Gem
37009 Natural Light-Filled Home in Upper East Side
37010 Spacious Living in the Heart of Upper East Side
37011 Find Cozy in this Upper West Side Furnished Home
Name: name, Length: 37012, dtype: object
The result is a Series object with its own set of attributes and methods. These objects are like arrays and are the building blocks of DataFrames; each DataFrame is made up of a set of Series.
To select multiple columns at once, we use double brackets and commas between column names as shown below.
# results for multiple columns
hosts = airbnb[['host_id', 'host_name']]
hosts.head()
host_id | host_name | |
---|---|---|
0 | 2845 | Jennifer |
1 | 4869 | LisaRoxanne |
2 | 7356 | Garon |
3 | 8967 | Shunichi |
4 | 7490 | MaryEllen |
The result is a new DataFrame object with the selected columns. It is useful to select the columns you are interested in analyzing before moving onto the analysis, especially if the data is wide with many unnecessary variables.
Case Study: Analyzing Airbnb Data with pandas
Now that we understand the basics of pandas, it will be interesting to answer some questions using the built-in functions in pandas. Our goals in this case study will be to understand some of the differences between each type of listing in our data, defined by the room_type
column. Specifically, we would like to know: What is the typical price for a listing? What is the difference in median price for different types of listings? How have these prices changed over time for each listing type? Finally, what is the distribution of typical listing prices?
Data Cleaning
Like any data analysis exercise, we’ll be starting off with some data cleaning. Luckily, pandas has some easy to use functionality to streamline data cleaning, to get it ready for analysis.
Changing Column Formatting
pandas does a relatively good job of understanding what data types each column is meant to be stored as. However, sometimes, we would like to change the default type. For example, dates are commonly seen in DataFrames. pandas has a useful built-in data type to handle dates, called a datetime index, which allows us to extract useful information like the year and month for a particular row.
To check the data types of columns we call the .dtypes
attribute of the DataFrame. To convert a column to a datetime index, we use the .to_datetime()
functions (these functions exist for all supported data types like .to_string()
to convert a column to be stored as a string).
In the code below, we also see the syntax to both edit existing columns and create new ones. Specifically, we want to convert the last_review
column to a datetime column. So we select it as seen in the previous section and set it equal to the result of the operation. Datetime series have a .dt
attribute with built-in attributes and functions. Below, we select the .year
attribute of the newly typed datetime column, last_review
, to get the year of each row.
# Show the data types for each column
airbnb.dtypes
id int64
name object
host_id int64
host_name object
neighbourhood_group object
neighbourhood object
latitude float64
longitude float64
room_type object
price int64
minimum_nights int64
number_of_reviews int64
last_review object
reviews_per_month float64
calculated_host_listings_count int64
availability_365 int64
dtype: object
# Change the type of a column to datetime
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb.dtypes
id int64
name object
host_id int64
host_name object
neighbourhood_group object
neighbourhood object
latitude float64
longitude float64
room_type object
price int64
minimum_nights int64
number_of_reviews int64
last_review datetime64[ns]
reviews_per_month float64
calculated_host_listings_count int64
availability_365 int64
dtype: object
# extract the year from a datetime series
airbnb['year'] = airbnb['last_review'].dt.year
airbnb['year'].head()
0 2019.0
1 2021.0
2 2019.0
3 2020.0
4 2017.0
Name: year, dtype: float64
Series String Functions
Another useful data cleaning tool is removing leading and trailing whitespace from string data. This can be done using the strip method.
# Strip leading and trailing spaces from a string series
airbnb['name'] = airbnb['name'].str.strip()
airbnb['name'].head()
0 Skylit Midtown Castle
1 Whole flr w/private bdrm, bath & kitchen(pls r...
2 BlissArtsSpace!
3 Large Furnished Room Near B'way
4 Cozy Clean Guest Room - Family Apt
Name: name, dtype: object
pandas series have other interesting built-in functions such as making all of the inputs lowercase using .str.lower()
. This is particularly useful in cases when the data do not have standard capitalization practices which could lead to classifying the same entity as two separate entities. Here, we are lower-casing all listing names that can be found in the name column, and assigning the updated listing names to the newly created name_lower
column.
# lowercase all strings in a series
airbnb['name_lower'] = airbnb['name'].str.lower()
airbnb['name_lower'].head()
0 skylit midtown castle
1 whole flr w/private bdrm, bath & kitchen(pls r...
2 blissartsspace!
3 large furnished room near b'way
4 cozy clean guest room - family apt
Name: name_lower, dtype: object
Combining Columns
One final useful data cleaning/preparation technique we will cover is combining rows. If we want to make calculations between columns, we can easily do this by applying the operation to each of the series as shown below. Here, we are calculating the minimum number of revenue a listing generates, by calculating the product of the minimum number of stays and the price per night.
# calculate using two columns
airbnb['min_revenue'] = airbnb['minimum_nights'] * airbnb['price']
airbnb[['minimum_nights', 'price', 'min_revenue']].head()
minimum_nights | price | min_revenue | |
---|---|---|---|
0 | 30 | 100 | 3000 |
1 | 1 | 73 | 73 |
2 | 30 | 60 | 1800 |
3 | 2 | 79 | 158 |
4 | 2 | 75 | 150 |
Descriptive Analysis
Summary Statistics
Once the data is clean and ready to analyze, we can compute some interesting statistics to answer some business questions. The first question we may have is what the average and median price is for the listings in our data. We use the built-in .mean()
and .median()
methods to compute these.
# get the mean price
airbnb['price'].mean()
142.84224035447963
# get the median price
airbnb['price'].median()
99.0
This significant difference in mean and median suggests the distribution in price is heavily skewed. We should rely more on median prices for our analysis to get a sense of typical listings.
Grouped Statistics
We can also conduct these calculations on groupings of data using the .groupby()
method. This function is very similar to using pivot tables in excel as we select a subset of columns in our data and then conduct aggregate calculations on them. As we mentioned in the introduction of this case study, we are interested in the difference in prices between each type of room listing in our data.
# get the mean grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).mean()
room_type | price | |
---|---|---|
0 | Entire home/apt | 191.461308 |
1 | Hotel room | 228.568562 |
2 | Private room | 86.318701 |
3 | Shared room | 100.991254 |
# get the median grouped by type of room
airbnb[['room_type', 'price']].groupby('room_type', as_index=False).median()
room_type | price | |
---|---|---|
0 | Entire home/apt | 140 |
1 | Hotel room | 150 |
2 | Private room | 60 |
3 | Shared room | 45 |
We see that hotel room median prices are the highest followed by entire home/apartment listings with shared room listings being the cheapest in terms of median price.
Note that we select the specific columns we are interested in before calling .groupby()
. This allows us to have a slightly cleaner output as the calculations are done on all columns that are present in the grouped DataFrame. We also use an additional optional parameter, as_index
which gives us a slightly cleaner and easier to use output, although this is not required.
If we would like to group on additional variables, we can input a list rather than a string as the first argument of .groupby()
.
# get the median grouped by type of room and year
airbnb[['room_type', 'year', 'price']].groupby(['room_type', 'year'], as_index=False).median()
room_type | year | price | |
---|---|---|---|
0 | Entire home/apt | 2010.0 | 99.0 |
1 | Entire home/apt | 2011.0 | 199.5 |
2 | Entire home/apt | 2012.0 | 125.0 |
3 | Entire home/apt | 2013.0 | 150.0 |
4 | Entire home/apt | 2014.0 | 173.0 |
5 | Entire home/apt | 2015.0 | 150.0 |
6 | Entire home/apt | 2016.0 | 150.0 |
7 | Entire home/apt | 2017.0 | 149.0 |
8 | Entire home/apt | 2018.0 | 150.0 |
9 | Entire home/apt | 2019.0 | 150.0 |
10 | Entire home/apt | 2020.0 | 130.0 |
11 | Entire home/apt | 2021.0 | 112.0 |
12 | Hotel room | 2016.0 | 349.0 |
13 | Hotel room | 2017.0 | 248.0 |
14 | Hotel room | 2018.0 | 100.0 |
15 | Hotel room | 2019.0 | 229.0 |
16 | Hotel room | 2020.0 | 142.0 |
17 | Hotel room | 2021.0 | 108.5 |
18 | Private room | 2011.0 | 249.0 |
19 | Private room | 2012.0 | 164.0 |
20 | Private room | 2013.0 | 85.0 |
21 | Private room | 2014.0 | 75.0 |
22 | Private room | 2015.0 | 75.0 |
23 | Private room | 2016.0 | 65.0 |
24 | Private room | 2017.0 | 65.0 |
25 | Private room | 2018.0 | 65.0 |
26 | Private room | 2019.0 | 69.0 |
27 | Private room | 2020.0 | 60.0 |
28 | Private room | 2021.0 | 50.0 |
29 | Shared room | 2014.0 | 84.0 |
30 | Shared room | 2015.0 | 57.5 |
31 | Shared room | 2016.0 | 97.0 |
32 | Shared room | 2017.0 | 70.0 |
33 | Shared room | 2018.0 | 42.0 |
34 | Shared room | 2019.0 | 50.0 |
35 | Shared room | 2020.0 | 39.0 |
36 | Shared room | 2021.0 | 31.5 |
We see the price fluctuations over time for each of the room types. It appears that median prices have generally decreased over time in each of the four categories, most significantly for hotel room listings.
Filtering Data
Often, we are only interested in a subset of the rows in our dataset. For example, we may only be interested in listings under $1000 as they are more common and closer to the typical listing. We do this by passing a Boolean expression into single brackets as shown below.
# get all rows with price < 1000
airbnb_under_1000 = airbnb[airbnb['price'] < 1000]
airbnb_under_1000.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | year | name_lower | min_revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 100 | 30 | 48 | 2019-11-04 | 0.35 | 2 | 365 | 2019.0 | skylit midtown castle | 3000 |
1 | 3831 | Whole flr w/private bdrm, bath & kitchen(pls r... | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 73 | 1 | 386 | 2021-01-27 | 4.99 | 1 | 249 | 2021.0 | whole flr w/private bdrm, bath & kitchen(pls r... | 73 |
2 | 5121 | BlissArtsSpace! | 7356 | Garon | Brooklyn | Bedford-Stuyvesant | 40.68688 | -73.95596 | Private room | 60 | 30 | 50 | 2019-12-02 | 0.35 | 1 | 365 | 2019.0 | blissartsspace! | 1800 |
3 | 5178 | Large Furnished Room Near B'way | 8967 | Shunichi | Manhattan | Midtown | 40.76468 | -73.98315 | Private room | 79 | 2 | 474 | 2020-09-25 | 3.31 | 1 | 343 | 2020.0 | large furnished room near b'way | 158 |
4 | 5203 | Cozy Clean Guest Room - Family Apt | 7490 | MaryEllen | Manhattan | Upper West Side | 40.80178 | -73.96723 | Private room | 75 | 2 | 118 | 2017-07-21 | 0.85 | 1 | 0 | 2017.0 | cozy clean guest room - family apt | 150 |
We can also pass in multiple filters by surrounding each expression in parenthesis and using either &
(for and expressions) or |
(for or expressions). You will get an error if you do not surround the expressions with parentheses.
# get all rows with price < 1000 and year equal to 2020
airbnb_2019_under_1000 = airbnb[(airbnb['price'] < 1000) & (airbnb['year'] == 2020)]
airbnb_2019_under_1000.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | year | name_lower | min_revenue | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 5178 | Large Furnished Room Near B'way | 8967 | Shunichi | Manhattan | Midtown | 40.76468 | -73.98315 | Private room | 79 | 2 | 474 | 2020-09-25 | 3.31 | 1 | 343 | 2020.0 | large furnished room near b'way | 158 |
5 | 5803 | Lovely Room 1, Garden, Best Area, Legal rental | 9744 | Laurie | Brooklyn | South Slope | 40.66829 | -73.98779 | Private room | 83 | 4 | 182 | 2020-10-17 | 1.27 | 3 | 365 | 2020.0 | lovely room 1, garden, best area, legal rental | 332 |
6 | 6848 | Only 2 stops to Manhattan studio | 15991 | Allen & Irina | Brooklyn | Williamsburg | 40.70837 | -73.95352 | Entire home/apt | 109 | 30 | 181 | 2020-03-16 | 1.27 | 1 | 309 | 2020.0 | only 2 stops to manhattan studio | 3270 |
14 | 9657 | Modern 1 BR / NYC / East Village | 21904 | Dana | Manhattan | East Village | 40.72920 | -73.98542 | Entire home/apt | 150 | 10 | 35 | 2020-11-06 | 0.25 | 1 | 74 | 2020.0 | modern 1 br / nyc / east village | 1500 |
15 | 9704 | Spacious 1 bedroom in luxe building | 32045 | Teri | Manhattan | Harlem | 40.81305 | -73.95466 | Private room | 55 | 30 | 98 | 2020-03-14 | 1.21 | 1 | 365 | 2020.0 | spacious 1 bedroom in luxe building | 1650 |
Plotting
pandas also has built-in plotting capabilities. For example, we can see the distribution of prices for each listing in our dataset using a histogram in one line of code. Note, we use the under $1000 DataFrame here as we cannot see the bars very clearly when including all prices.
# distribution of prices under $1000
ax = airbnb_under_1000['price'].plot.hist(bins=40)

Next Steps
You now have the basic tools required to analyze tabular data in python using pandas. We are already able to extract a lot of useful insights by just doing some simple discrete analysis on the Airbnb dataset. pandas is optimized to work well with a lot of other Python libraries like Matplotlib and seaborn which create great data visualizations in just a few lines of code. It is also typically used in Jupyter notebooks as they make it easy to quickly conduct analyses and observe results.
The pandas library scales very well to larger datasets. Now that you have the fundamentals down, it’s time to apply these skills to a real-world dataset on your own or to learn some of the more advanced functionality with some of our other courses and suggested projects. Some good courses to expand your pandas toolkit include our courses Merging DataFrames with pandas, Working with Categorical Data in Python, and chapter four of Writing Efficient Python Code. If you prefer written posts like this one, we have articles discussing more complex dataframe operations, joining dataframes, and an in-depth explanation of parameters for reading CSV files in pandas.