
Top 26 Python pandas Interview Questions and Answers

Explore key Python pandas interview questions and answers for data science roles
Dec 2023  · 15 min read

This article is a valued contribution from our community and has been edited for clarity and accuracy by DataCamp.

Interested in sharing your own expertise? We’d love to hear from you! Feel free to submit your articles or ideas through our Community Contribution Form.

As pandas is one of the most in-demand skills for data-oriented roles, many data professionals and enthusiasts look up the most frequently asked pandas interview questions to land a good job in the booming data industry.

Whether you’re looking for your first job opportunity or want to level up your role, we’ve got you covered. In this article, we’ve compiled the most frequently asked Python pandas interview questions and their answers. Some of the questions are from my own interview experience at Target in a data scientist role.

So, read on to discover pandas interview questions of all levels.

If you’re running low on time and need a quick recap of pandas, check out our pandas cheat sheet.

pandas Basic Interview Questions

Let’s now look at some basic interview questions on pandas. Kind interviewers may start with these simple questions to put you at ease, while others may use them to assess your basic grasp of the library.

1. What is pandas in Python?

Pandas is an open-source Python library with powerful and built-in methods to efficiently clean, analyze, and manipulate datasets. Developed by Wes McKinney in 2008, this powerful package can easily blend with various other data science modules in Python.

Pandas is built on top of the NumPy library, i.e., its core data structures, Series and DataFrame, build on NumPy arrays.

2. How do you access the top 6 rows and last 7 rows of a pandas DataFrame?

The head() method in pandas is used to access the initial rows of a DataFrame, and the tail() method is used to access the last rows.

To access the top 6 rows: dataframe_name.head(6)

To access the last 7 rows: dataframe_name.tail(7)
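As a quick sketch with a small illustrative DataFrame:

```python
import pandas as pd

# Illustrative DataFrame with 10 rows
df = pd.DataFrame({'value': range(10)})

top = df.head(6)     # first 6 rows
bottom = df.tail(7)  # last 7 rows

print(len(top), len(bottom))  # 6 7
```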

3. Why doesn’t DataFrame.shape have parenthesis?

In pandas, shape is an attribute and not a method. So, you should access it without parentheses.

DataFrame.shape outputs a tuple with the number of rows and columns in a DataFrame.
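For example, on a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# shape is an attribute, so no parentheses
print(df.shape)  # (3, 2)
```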

4. What is the difference between Series and DataFrame?

DataFrame: A pandas DataFrame is a two-dimensional, tabular structure with multiple rows and columns, where each column can have a different data type.

Series: A Series is a one-dimensional labeled array that can store any data type, but all of its values must be of the same data type. A Series is essentially a single column of a DataFrame.

The Series data structure consumes less memory than a DataFrame. So, certain data manipulation tasks are faster on it.

However, DataFrame can store large and complex datasets, while Series can handle only homogeneous data. So, the set of operations you can perform on DataFrame is significantly higher than on Series data structure.
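A minimal illustration of the relationship between the two structures, using made-up data:

```python
import pandas as pd

# A Series: one-dimensional, homogeneous values
s = pd.Series([70, 80, 95], name='Marks')

# A DataFrame: two-dimensional, columns may have different dtypes
df = pd.DataFrame({'Name': ['John', 'Cataline', 'Matt'],
                   'Marks': [70, 80, 95]})

# Selecting a single column of a DataFrame returns a Series
print(type(df['Marks']))  # <class 'pandas.core.series.Series'>
```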

5. What is an index in pandas?

The index is a series of labels that uniquely identify each row of a DataFrame. The index can be of any data type, such as integer, string, or hash.

df.index prints the current row indexes of the DataFrame df.
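A short sketch with an illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Matt'], 'Marks': [70, 95]})

# The default index is a RangeIndex of integers
print(df.index.tolist())  # [0, 1]

# An existing column can be promoted to the index
df2 = df.set_index('Name')
print(df2.index.tolist())  # ['John', 'Matt']
```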

Intermediate pandas Interview Questions

These questions will be a bit more challenging, and you are more likely to encounter them in roles requiring previous experience with pandas.

6. What is Multi indexing in pandas?

Index in pandas uniquely specifies each row of a DataFrame. We usually pick the column that can uniquely identify each row of a DataFrame and set it as the index. But what if you don’t have a single column that can do this?

For example, suppose a DataFrame has the columns “name”, “age”, “address”, and “marks”. None of these columns alone may have unique values across all rows, making each of them unfit as an index on its own.

However, the columns “name” and “address” together may uniquely identify each row of the DataFrame. So you can set both columns as the index. Your DataFrame now has a multi-index or hierarchical index.
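Here is a small sketch using the columns from the example above (the data itself is made up):

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'John', 'Matt'],
                   'address': ['Austin', 'Boston', 'Austin'],
                   'marks': [70, 80, 95]})

# Set a hierarchical (multi-level) index on two columns
df_multi = df.set_index(['name', 'address'])

print(df_multi.index.nlevels)                     # 2
print(df_multi.loc[('John', 'Boston'), 'marks'])  # 80
```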

7. Explain pandas Reindexing

Reindexing in pandas lets us create a new DataFrame object from the existing DataFrame with the updated row indexes and column labels.

You can provide a set of new indexes to the function DataFrame.reindex() and it will create a new DataFrame object with given indexes and take values from the actual DataFrame.

If values for these new indexes were not present in the original DataFrame, the function fills those positions with NaN by default. However, you can replace the default NaN with any value you like using the fill_value parameter.

Here is the sample code:

Create a DataFrame df with indexes:

import pandas as pd

data = [['John', 50, 'Austin', 70],
        ['Cataline', 45 , 'San Francisco', 80],
        ['Matt', 30, 'Boston' , 95]]

columns = ['Name', 'Age', 'City', 'Marks']

#row indexes
idx = ['x', 'y', 'z']

df = pd.DataFrame(data, columns=columns, index=idx)

print(df)

Reindex with new set of indexes:

new_idx = ['a', 'y', 'z']

new_df = df.reindex(new_idx)

print(new_df)

The new_df has values from df for the common indexes (‘y’ and ‘z’), and the new index ‘a’ is filled with the default NaN.

8. What is the difference between loc and iloc?

Both the loc and iloc methods in pandas are used to select subsets of a DataFrame. In practice, they are widely used for filtering a DataFrame based on conditions.

Use the loc method to select data by the actual labels of rows and columns, and the iloc method to select data by the integer positions of rows and columns.
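A minimal sketch of the difference, using an illustrative DataFrame with string row labels:

```python
import pandas as pd

df = pd.DataFrame({'Marks': [70, 80, 95]}, index=['x', 'y', 'z'])

# loc selects by label
print(df.loc['y', 'Marks'])  # 80

# iloc selects by integer position
print(df.iloc[1, 0])         # 80
```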

9. Show two different ways to create a pandas DataFrame

Using Python Dictionary:

import pandas as pd

data = {'Name': ['John', 'Cataline', 'Matt'],
        'Age': [50, 45, 30],
        'City': ['Austin', 'San Francisco', 'Boston'],
        'Marks' : [70, 80, 95]}

df = pd.DataFrame(data)

Using Python Lists:

import pandas as pd

data = [['John', 25, 'Austin',70],
        ['Cataline', 30, 'San Francisco',80],
        ['Matt', 35, 'Boston',90]]

columns = ['Name', 'Age', 'City', 'Marks']

df = pd.DataFrame(data, columns=columns)

10. How do you get the count of all unique values of a categorical column in a DataFrame?

The function Series.value_counts() returns the count of each unique value of a series or a column.

Example:

Here, we create a DataFrame df that contains a categorical column named ‘Sex’ and run the value_counts() function to see the count of each unique value in that column.

import pandas as pd

data = [['John', 50, 'Male', 'Austin', 70],
        ['Cataline', 45 ,'Female', 'San Francisco', 80],
        ['Matt', 30 ,'Male','Boston', 95]]

# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']

# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)

df['Sex'].value_counts()

pandas Interview Questions for Experienced Practitioners

Those who already have a strong background in pandas and are applying for more senior roles may encounter some of these questions:

11. How do you optimize performance when working with large datasets in pandas?

Load less data: While reading data using pd.read_csv(), choose only the columns you need with the usecols parameter to avoid loading unnecessary data. Additionally, specifying the chunksize parameter splits the data into chunks that can be processed sequentially.

Avoid loops: Loops and iterations are expensive, especially when working with large datasets. Instead, opt for vectorized operations, as they are applied on an entire column at once, making them faster than row-wise iterations.

Use data aggregation: Try aggregating data and perform statistical operations because operations on aggregated data are more efficient than on the entire dataset.

Use the right data types: The default data types in pandas are not memory efficient. For example, integer values take the default datatype of int64, but if your values can fit in int32, adjusting the datatype to int32 can optimize the memory usage.

Parallel processing: Dask provides a pandas-like API for working with large datasets. It uses multiple processes on your system to execute different data tasks in parallel.
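A few of these ideas can be sketched as follows; the CSV content here is illustrative and supplied in memory via StringIO rather than from a real file:

```python
import io
import pandas as pd

csv_data = io.StringIO("name,age,marks\nJohn,50,70\nMatt,30,95\n")

# Load only the columns you need, with smaller integer dtypes
df = pd.read_csv(csv_data, usecols=['age', 'marks'],
                 dtype={'age': 'int32', 'marks': 'int32'})

# Vectorized operation on a whole column (no Python-level loop)
df['scaled'] = df['marks'] * 2

print(df.dtypes['age'])  # int32
```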

12. What is the difference between Join and Merge methods in pandas?

Join: Joins two DataFrames based on their indexes. There is also an optional ‘on’ argument to join on a column of the calling DataFrame instead. By default, this method performs a left join.

Syntax: df1.join(df2)

Merge: The merge function is more versatile, allowing you to specify the columns on which you want to join the DataFrames. It applies inner join by default, but can be customized to use different join types like left, right, outer, inner, and cross.

Syntax: pd.merge(df1, df2, on="column_names")
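A small sketch contrasting the two, with illustrative DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Matt'], 'Age': [50, 30]}).set_index('Name')
df2 = pd.DataFrame({'Name': ['John', 'Matt'], 'Marks': [70, 95]}).set_index('Name')

# join: index-based, left join by default
joined = df1.join(df2)

# merge: column-based, inner join by default
merged = pd.merge(df1.reset_index(), df2.reset_index(), on='Name')

print(joined.loc['John', 'Marks'])  # 70
print(merged.shape)                 # (2, 3)
```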

13. What is Timedelta?

Timedelta represents a duration, i.e., the difference between two dates or times, measured in units such as days, hours, minutes, and seconds.
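For example, subtracting two timestamps yields a Timedelta, and a Timedelta can be added back to a date:

```python
import pandas as pd

start = pd.Timestamp('2023-01-01')
end = pd.Timestamp('2023-01-03 12:00:00')

delta = end - start           # a Timedelta of 2 days 12 hours
print(delta.days)             # 2
print(delta.total_seconds())  # 216000.0

# Timedeltas can also be constructed directly and added to dates
print(start + pd.Timedelta(days=7))  # 2023-01-08 00:00:00
```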

14. What is the difference between append and concat methods?

We can use the concat method to combine DataFrames either along rows or along columns. Similarly, append also combines DataFrames, but only along rows.

Note that append was deprecated in pandas 1.4 and removed in pandas 2.0, so concat is now the recommended approach. Neither method modifies the original DataFrames in place; both return a new object with the combined data.
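A quick sketch of concat along both axes, with illustrative DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John'], 'Marks': [70]})
df2 = pd.DataFrame({'Name': ['Matt'], 'Marks': [95]})

# Stack row-wise (axis=0 is the default)
rows = pd.concat([df1, df2], ignore_index=True)

# Place side by side column-wise
cols = pd.concat([df1, df2], axis=1)

print(rows.shape)  # (2, 2)
print(cols.shape)  # (1, 4)
```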

pandas Coding Interview Questions

Practical skills are just as important as theoretical knowledge when it comes to acing a tech interview. So here are some of the pandas coding interview questions you need to know before facing your interviewer.

15. How do you read Excel files to CSV using pandas?

First, use the read_excel() function to pull the Excel data into a variable. Then, apply the to_csv() function for a seamless conversion.

Here is the sample code:

import pandas as pd

# Input your Excel file path into the read_excel() function
excel_data = pd.read_excel("/content/sample_data/california_housing_test.xlsx")

excel_data.to_csv("CSV_data.csv", index=False, header=True)

16. How do you sort a DataFrame based on columns?

We have the sort_values() method to sort the DataFrame based on a single column or multiple columns.

Syntax: df.sort_values(by=["column_names"])

Example code:

import pandas as pd

data = [['John', 50, 'Male', 'Austin', 70],
        ['Cataline', 45, 'Female', 'San Francisco', 80],
        ['Matt', 30, 'Male', 'Boston', 95],
        ['Oliver', 35, 'Male', 'New York', 65]]

# Column labels of the DataFrame
columns = ['Name', 'Age', 'Sex', 'City', 'Marks']

# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)

# Sort based on the 'Age' column; sort_values returns a new DataFrame
sorted_df = df.sort_values(by=['Age'])

print(sorted_df.head())

17. Show two different ways to filter data

To create a DataFrame:

import pandas as pd

data = {'Name': ['John', 'Cataline', 'Matt'],
        'Age': [50, 45, 30],
        'City': ['Austin', 'San Francisco', 'Boston'],
        'Marks' : [70, 80, 95]}

# Create a DataFrame df
df = pd.DataFrame(data)

Method 1: Based on conditions

new_df = df[(df.Name == "John") | (df.Marks > 90)]
print (new_df)

Method 2: Using the query function

new_df = df.query('Name == "John" or Marks > 90')
print(new_df)

18. How do you aggregate data and apply some aggregation function like mean or sum on it?

The groupby function lets you aggregate data based on certain columns and perform operations on the grouped data. In the following code, the data is grouped on the column ‘Name’ and the mean ‘Marks’ of each group is calculated.

import pandas as pd

# Create a DataFrame
data = {
    'Name': ['John', 'Matt', 'John', 'Matt', 'Matt', 'Matt'],
    'Marks': [10, 20, 30, 15, 25, 18]
}

# Create a DataFrame df
df = pd.DataFrame(data)

# mean marks of John and Matt
print(df.groupby('Name').mean())

19. How can you create a new column derived from existing columns?

We can use the apply() method to derive a new column by performing operations on existing columns.

The following code adds a new column named ‘total’ to the DataFrame, holding the sum of values from the other two columns. (For a simple sum like this, the vectorized df['math_Marks'] + df['science_Marks'] is faster than apply.)

Example code:

import pandas as pd

# Create a DataFrame
data = {
    'Name': ['John', 'Matt', 'John', 'Cateline'],
    'math_Marks': [18, 20, 19, 15],
    'science_Marks': [10, 20, 15, 12]
}

# Create a DataFrame df
df = pd.DataFrame(data)

df['total'] = df.apply(lambda row : row["math_Marks"] + row["science_Marks"], axis=1)


print(df)

pandas Interview Questions for Data Scientists

Now that we have covered all the general and coding interview questions for pandas, let's have a look at pandas data science interview questions.

20. How do you handle null or missing values in pandas?

You can use any of the following three methods to handle missing values in pandas:

dropna() – the function removes the missing rows or columns from the DataFrame.

fillna() – fill nulls with a specific value using this function.

interpolate() – this method fills missing values with computed interpolation values. The interpolation technique can be linear, polynomial, spline, time, etc.
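All three can be sketched on a small illustrative Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.dropna().tolist())       # [1.0, 3.0, 5.0]
print(s.fillna(0).tolist())      # [1.0, 0.0, 3.0, 0.0, 5.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```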

21. Difference between fillna() and interpolate() methods

fillna() –

fillna() fills the missing values with a given constant. Plus, you can pass forward-filling (‘ffill’) or backward-filling (‘bfill’) inputs to its ‘method’ parameter.

interpolate() –

By default, this function fills the missing or NaN values with the linear interpolated values. However, you can customize the interpolation technique to polynomial, time, index, spline, etc., using its ‘method’ parameter.

The interpolation method is highly suitable for time series data, whereas fillna is a more generic approach.
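A quick sketch of the contrast, using a made-up Series (ffill() is the dedicated forward-fill method):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward-fill repeats the last known value
print(s.ffill().tolist())        # [1.0, 1.0, 1.0, 4.0]

# interpolate estimates values between known points (linear by default)
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0]
```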

22. What is Resampling?

Resampling is used to change the frequency at which time series data is reported. Imagine you have monthly time series data and want to convert it into weekly or yearly data; this is where resampling is used.

Converting monthly data to weekly or daily data is upsampling, where interpolation techniques are used to fill in the new, higher-frequency points.

In contrast, converting monthly data to yearly data is termed downsampling, where data aggregation techniques are applied.
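A minimal downsampling sketch, using an illustrative daily series:

```python
import pandas as pd

# Daily series over six days
idx = pd.date_range('2023-01-01', periods=6, freq='D')
s = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Downsample to 2-day bins, aggregating with sum
down = s.resample('2D').sum()
print(down.tolist())  # [3, 7, 11]
```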

23. How do you perform one-hot encoding using pandas?

We perform one-hot encoding to convert categorical values to numeric ones so that they can be fed to machine learning algorithms.

import pandas as pd

data = {'Name': ['John', 'Cateline', 'Matt', 'Oliver'],
        'ID': [1, 22, 23, 36]}

df = pd.DataFrame(data)

#one hot encoding 
new_df = pd.get_dummies(df.Name)
new_df.head()

24. How do you create a line plot in pandas?

To draw a line plot, pandas provides the plot() method, which produces a line chart by default.

import pandas as pd


data = {'units': [1, 2, 3, 4, 5],
        'price': [7, 12, 8, 13, 16]}
# Create a DataFrame df
df = pd.DataFrame(data)

df.plot(x='units', y='price')

25. What is the pandas method to get the statistical summary of all the columns in a DataFrame?

df.describe()

This method returns stats like mean, percentile values, min, max, etc., of each column in the DataFrame.

26. What is Rolling mean?

Rolling mean is also referred to as moving average because the idea is to compute the mean of data points over a specified window and slide that window through the data. This smooths out short-term fluctuations and highlights long-term trends in time series data.

Syntax: df['column_name'].rolling(window=n).mean()
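For example, on an illustrative Series with a window of 3:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Mean over a sliding window of 3 observations;
# the first two positions are NaN (window not yet full)
roll = s.rolling(window=3).mean()
print(roll.tolist())  # [nan, nan, 20.0, 30.0, 40.0]
```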

Preparing for the Interview

In addition to pandas, a data-oriented job role demands a lot of other skills. Here is a checklist to help you succeed in the overall interview process:

Understand the Job Requirements

Revise the job description and responsibilities and ensure your skills and resume are aligned with them. Additionally, knowing about the company and how your role impacts them is a plus.

Code in Python

The interviewer will often check your Python skills before asking about its libraries like pandas. So, equip yourself with strong Python fundamentals.

For analyst roles, solid Python alone may do the job. But if you are applying for data scientist or ML engineer roles, it’s important to practice solving Python coding challenges.

Data Projects

Make sure your resume features some real-world data problems you have solved. If you are experienced, you can talk about your past projects. If you are new to the field, try finishing some projects from Kaggle.

General Concepts

For analysts, questions can be from Excel, data visualization dashboards, statistics, and probability. In addition, the interviewer can go deep into machine learning and deep learning subjects if you are applying for data scientist or ML engineer roles.

Prepare Using Frequently Asked Interview Questions

Chances are high that you will be asked at least a few of the most frequently asked interview questions. So, prepare using cheat sheets and mock interview questions.

ML System Design

Expect questions around system design if you’re applying for highly technical or experienced roles. Revise common design questions and practice end-to-end ML system design problems.

Conclusion

Cracking a job in the data industry requires strong Python pandas skills. The above list of theoretical and hands-on interview questions should help you ace the pandas part of your interview. Plus, the tips at the end will help your entire interview go smoothly.


Author
Srujana Maddula

Srujana is a freelance tech writer with a four-year degree in Computer Science. She writes about a wide range of topics, including data science, cloud computing, development, programming, and security. She also loves classic literature and exploring new destinations.
