
Top 30 Pandas Interview Questions and Answers

Explore key Pandas interview questions and answers for data science roles.
Updated Dec 22, 2025  · 15 min read

In this article, I’ve compiled the most frequently asked Pandas interview questions and their answers. Some of the questions are from my own interview experience at Target in a data scientist role. Let's get started!

Pandas Basic Interview Questions

Let’s look at some basic pandas interview questions. Kind interviewers may start with these simple questions to put you at ease, while others use them to assess your basic grasp of the library.

1. What is pandas in Python?

pandas is an open-source Python library with powerful built-in methods to efficiently clean, analyze, and manipulate datasets. Developed by Wes McKinney in 2008, this powerful package can easily blend with various other data science modules in Python.

While originally built on top of NumPy, modern pandas (v2.0+) also supports the PyArrow backend. This allows for significantly faster operations, true string data types, and better handling of missing values (nullable types) compared to the legacy NumPy-only architecture.

2. How do you quickly access the top 5 rows and last 5 rows of a pandas DataFrame?

The head() method in pandas is used to access the first 5 rows of a DataFrame, and the tail() method is used to access the last 5 rows.

  • To access the top 5 rows: dataframe_name.head()

  • To access the last 5 rows: dataframe_name.tail()

3. Why doesn’t DataFrame.shape have parenthesis?

In pandas, shape is an attribute and not a method. So, you should access it without parentheses.

DataFrame.shape outputs a tuple with the number of rows and columns in a DataFrame.
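A quick illustration of the attribute-versus-method distinction:

```python
import pandas as pd

# shape is an attribute, so it is accessed without parentheses
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.shape)  # (3, 2) -> (rows, columns)
```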

4. What is the difference between a Series and a DataFrame?

  • DataFrame: The pandas DataFrame will be in tabular format with multiple rows and columns, where each column can be of different data types.

  • Series: The Series is a one-dimensional labeled array that can store any data type, but all of its values should be of the same data type. The Series data structure is more like a single column of a DataFrame.

The Series data structure consumes less memory than a DataFrame. So, certain data manipulation tasks are faster on it.

However, a DataFrame can store large and complex datasets, while a Series can handle only homogeneous data. So, the set of operations you can perform on a DataFrame is significantly higher than on a Series data structure.
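A minimal sketch of the two structures side by side (the data is made up for illustration):

```python
import pandas as pd

# A Series: one labeled, one-dimensional array of homogeneous data
s = pd.Series([70, 80, 95], name='Marks')

# A DataFrame: a table of multiple columns, each with its own dtype
df = pd.DataFrame({'Name': ['John', 'Matt'], 'Marks': [70, 95]})

print(s.ndim, df.ndim)  # 1 2
```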

5. What is an index in pandas?

The index is a series of labels that can uniquely identify each row of a DataFrame. The index can be of any datatype (like integer, string, hash, etc.).

df.index prints the current row indexes of the DataFrame df.

Intermediate Pandas Interview Questions

These questions will be a bit more challenging, and you are more likely to encounter them in roles requiring previous experience using pandas.

6. What is multi-indexing in pandas?

The index in pandas uniquely identifies each row of a DataFrame. We usually pick the column that can uniquely identify each row of a DataFrame and set it as the index. But what if you don’t have a single column that can do this?

For example, suppose you have the columns “name”, “age”, “address”, and “marks” in a DataFrame. None of these columns alone may hold unique values for all rows, making each one unfit as an index on its own.

However, the columns “name” and “address” together may uniquely identify each row of the DataFrame. So you can set both columns as the index. Your DataFrame now has a multi-index or hierarchical index.
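As a minimal sketch with hypothetical data matching the columns above:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob'],
    'address': ['12 Oak St', '9 Elm St', '12 Oak St'],
    'age': [30, 31, 45],
    'marks': [88, 92, 75],
})

# "name" and "address" together uniquely identify each row
df = df.set_index(['name', 'address'])

# Select one row by its (name, address) pair
print(df.loc[('Ann', '9 Elm St')])
```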

7. Explain pandas reindexing. Give an example.

Reindexing is used to conform a DataFrame to a new index. It is most commonly used to fill in missing gaps in time-series data or to ensure a report includes all categories, even those with zero values.

If the new index contains labels not present in the original DataFrame, pandas introduces NaN (or a specified fill value) for those rows.

Example: Imagine you have sales data for Q1 and Q3, but Q2 and Q4 are missing because there were zero sales. A standard plot would look misleading. reindex() fixes this by forcing the missing quarters into the DataFrame.

import pandas as pd

# Original data (Note: Q2 and Q4 are missing)
data = {'Quarter': ['Q1', 'Q3'], 'Sales': [15000, 18000]}
df = pd.DataFrame(data)
df = df.set_index('Quarter')

# The complete index we REQUIRE for the report
all_quarters = ['Q1', 'Q2', 'Q3', 'Q4']

# Reindex forces Q2 and Q4 to appear, filling them with 0 instead of NaN
df_full = df.reindex(all_quarters, fill_value=0)

8. What is the difference between loc and iloc?

Both .loc and .iloc are indexers used to select subsets of a DataFrame, and both use square brackets rather than parentheses. In practice, they are widely used for filtering a DataFrame based on conditions.

Use .loc to select data by the actual labels of rows and columns, while .iloc extracts data based on the integer positions of rows and columns.
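A small sketch of the difference, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Marks': [70, 80, 95]},
                  index=['John', 'Cataline', 'Matt'])

print(df.loc['Matt', 'Marks'])   # label-based lookup -> 95
print(df.iloc[0, 0])             # position-based lookup -> 70
```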

9. Show two different ways to create a pandas DataFrame

From a dictionary:

import pandas as pd

data = {'Name': ['John', 'Cataline', 'Matt'],
        'Age': [50, 45, 30],
        'City': ['Austin', 'San Francisco', 'Boston'],
        'Marks' : [70, 80, 95]}

df = pd.DataFrame(data)

From a list of lists:

import pandas as pd

data = [['John', 25, 'Austin', 70],
        ['Cataline', 30, 'San Francisco', 80],
        ['Matt', 35, 'Boston', 90]]

columns = ['Name', 'Age', 'City', 'Marks']

df = pd.DataFrame(data, columns=columns)

10. How do you get the count of all unique values of a categorical column in a DataFrame?

The function Series.value_counts() returns the count of each unique value of a series or a column.

Example:

We have created a DataFrame df that contains a categorical column named Sex, and ran the .value_counts() function to see the count of each unique value in that column.

import pandas as pd

data = [['John', 50, 'Male', 'Austin', 70],
        ['Cataline', 45, 'Female', 'San Francisco', 80],
        ['Matt', 30, 'Male', 'Boston', 95]]

# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']

# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)

df['Sex'].value_counts()

If you want to see the percentage distribution rather than just raw counts, use normalize=True as an argument in .value_counts().
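For instance, on a small made-up column:

```python
import pandas as pd

s = pd.Series(['Male', 'Female', 'Male'])

# Raw counts vs. percentage distribution
print(s.value_counts())
print(s.value_counts(normalize=True))  # Male ~0.67, Female ~0.33
```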

11. What is the category data type, and why use it?

The category data type is used for columns with a limited number of unique string values (low cardinality). It saves massive amounts of memory and speeds up operations like sorting and grouping because Pandas stores the strings once in a lookup table and uses lightweight integers for the actual data column.
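The memory saving is easy to demonstrate on a low-cardinality column (the values here are made up for illustration):

```python
import pandas as pd

# 4,000 rows but only 3 unique string values (low cardinality)
s = pd.Series(['low', 'high', 'low', 'medium'] * 1000)
cat = s.astype('category')

# The category column stores small integer codes plus one lookup table
print(s.memory_usage(deep=True), cat.memory_usage(deep=True))
```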

Pandas Interview Questions for Experienced Practitioners

Those who already have a strong background in pandas and are applying for more senior roles may encounter some of these questions:

12. How do you optimize the performance with large datasets in pandas?

  • Use PyArrow: Load data with engine="pyarrow" and dtype_backend="pyarrow". This is faster and much more memory-efficient than standard NumPy types.

  • Vectorization instead of loops: Loops and iterations are expensive, especially when working with large datasets. Instead, opt for vectorized operations, as they are applied on an entire column at once, making them faster than row-wise iterations.

  • Load only what you need: Use the usecols parameter in read_csv() or read_parquet() to limit the amount of data loaded.

  • Memory-efficient types: The default data types in pandas are not memory-efficient. For example, integer values take the default datatype of int64, but if your values can fit in int32, adjusting the datatype to int32 can optimize the memory usage. Converting low-cardinality strings to category type is a good idea as well.

  • Use data aggregation: Try aggregating data and performing statistical operations because operations on aggregated data are more efficient than on the entire dataset.

  • Parallel processing: Native pandas is single-threaded. For parallelism on a single machine, the preferred solution is using Polars (which is multi-threaded by default) or libraries like Modin that act as a drop-in parallel replacement for pandas.
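The "memory-efficient types" point above can be sketched with pd.to_numeric and its downcast option (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'views': [100, 2500, 32000]})
print(df['views'].dtype)  # int64 by default

# Downcast to the smallest integer type that fits the values
df['views'] = pd.to_numeric(df['views'], downcast='integer')
print(df['views'].dtype)
```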

13. What is the difference between the .join() and .merge() methods in pandas?

  • Join: The .join() method combines two DataFrames based on their index. The optional on argument lets you join a column of the calling DataFrame against the other DataFrame's index. By default, it performs a left join. The syntax is df1.join(df2).

  • Merge: The merge() function is more versatile, allowing you to specify the columns on which to join the DataFrames. It applies an inner join by default but can be customized to use other join types: left, right, outer, and cross. The syntax is pd.merge(df1, df2, on="column_name").
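A minimal merge sketch (column names are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['John', 'Matt', 'Ann']})
right = pd.DataFrame({'id': [2, 3, 4], 'marks': [80, 95, 60]})

# Inner join by default: only ids present in both frames survive
merged = pd.merge(left, right, on='id')
print(merged)
```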

14. What is Timedelta?

Timedelta represents a duration, i.e., the difference between two dates or times, measured in units such as days, hours, minutes, and seconds.
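A quick sketch of producing and inspecting a Timedelta:

```python
import pandas as pd

start = pd.Timestamp('2025-01-01 08:00')
end = pd.Timestamp('2025-01-02 10:30')

# Subtracting two timestamps yields a Timedelta
delta = end - start
print(delta)                   # 1 days 02:30:00
print(delta.total_seconds())   # 95400.0
```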

15. The .append() method is removed. How do you combine DataFrames now?

If you attempt to use .append() in a modern Pandas environment (version 2.0 or later), it will raise an error because the method was removed to encourage more efficient coding practices.

Instead, you should collect all your dataframes or rows into a list, and then call pd.concat() once:

new_df = pd.concat([df1, df2], ignore_index=True)

16. When should you use Polars instead of pandas?

You should consider Polars (a Rust-based DataFrame library) when:

  • The dataset is significantly larger than available RAM: Polars has lazy evaluation/streaming.

  • You need multi-threaded performance: pandas operations are largely single-threaded.

  • Every millisecond counts: You are building a high-performance data pipeline.

Pandas Coding Interview Questions

Practical skills are just as important as theoretical knowledge when it comes to acing a tech interview, so here are some of the pandas coding interview questions you should know before facing your interviewer.

17. How do you read Excel files to CSV using pandas?

First, use the .read_excel() function to read the Excel data into a DataFrame. Then, apply the .to_csv() method for a seamless conversion.

Here is the sample code:

import pandas as pd

# Input your Excel file path into the read_excel() function
excel_data = pd.read_excel("/content/sample_data/california_housing_test.xlsx")

excel_data.to_csv("CSV_data.csv", index=False, header=True)

Note: In a modern interview, mention that preserving data types in parquet format (with .to_parquet() instead of .to_csv()) is often superior when dealing with large datasets.

18. How do you sort a DataFrame based on columns?

We call the .sort_values() method to sort the DataFrame based on a single column or multiple columns.

The syntax looks like this: df.sort_values(by=["column_names"]), as the following example shows:

import pandas as pd

data = [['John', 50, 'Male', 'Austin', 70],
        ['Cataline', 45, 'Female', 'San Francisco', 80],
        ['Matt', 30, 'Male', 'Boston', 95],
        ['Oliver', 35, 'Male', 'New York', 65]]

# Column labels of the DataFrame
columns = ['Name','Age','Sex', 'City', 'Marks']

# Create a DataFrame df
df = pd.DataFrame(data, columns=columns)

# Sort values based on the 'Age' column
df = df.sort_values(by=['Age'])

df.head()

19. Show two different ways to filter data

Let's show how to filter the following DataFrame:

import pandas as pd

data = {'Name': ['John', 'Cataline', 'Matt'],
        'Age': [50, 45, 30],
        'City': ['Austin', 'San Francisco', 'Boston'],
        'Marks' : [70, 80, 95]}

# Create a DataFrame df
df = pd.DataFrame(data)

Method 1: Boolean indexing

new_df = df[(df.Name == "John") | (df.Marks > 90)]
print(new_df)

Method 2: Using the .query() method

new_df = df.query('Name == "John" or Marks > 90')
print(new_df)

20. How do you aggregate data (e.g., mean or sum)?

The .groupby() function lets you aggregate data based on certain columns and perform operations on the grouped data. Combine it with the .agg() method for improved clarity.

In the following code, the data is grouped by the column Name, and the mean Grades of each group are calculated:

import pandas as pd

# Create a DataFrame
data = {
    'Name': ['John', 'Matt', 'John', 'Matt', 'Matt', 'Matt'],
    'Grades': [10, 20, 30, 15, 25, 18]
}

# Create a DataFrame df
df = pd.DataFrame(data)

# mean marks of John and Matt
print(df.groupby('Name').agg(
    avg_grade=('Grades', 'mean'),
))

21. How can you create a new column derived from existing columns?

There are two ways of creating new columns from existing ones:

  • Direct assignment: Assigning a column with a new name to a transformation of existing column(s): df['Total'] = df['Math'] + df['Science']

  • Method chaining: Using a lambda function with .assign(): df = df.assign(Total=lambda x: x['Math'] + x['Science'])

Method chaining is the preferred choice nowadays, as it improves code readability by eliminating the need for intermediate variables. Having your entire data transformation as a single, logical pipeline significantly reduces bugs, as you no longer run the risk of accidentally referencing an outdated or incorrect version of your DataFrame during analysis.
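As a small runnable sketch of both approaches (the Math/Science columns are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Math': [70, 80], 'Science': [90, 85]})

# Direct assignment mutates the existing frame
df['Total'] = df['Math'] + df['Science']

# Method chaining returns a new frame, keeping the pipeline readable
df2 = pd.DataFrame({'Math': [70, 80], 'Science': [90, 85]}).assign(
    Total=lambda x: x['Math'] + x['Science'])
print(df2)
```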

Pandas Interview Questions for Data Scientists

Now that we have covered all the general and coding interview questions for pandas, let's have a look at pandas data science interview questions.

22. How do you handle null or missing values in pandas?

You can use any of the following three methods to handle missing values in pandas:

  • dropna(): removes the missing rows or columns from the DataFrame.

  • fillna(): fills missing values with a specific value using this function.

  • interpolate(): fills the missing values with computed interpolation values.

With the PyArrow backend, integers can now store NA values without casting to float, which preserves data fidelity in scientific workflows.
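The three methods in action on a tiny made-up Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan])

print(s.dropna().tolist())        # [1.0, 3.0]
print(s.fillna(0).tolist())       # [1.0, 0.0, 3.0, 0.0]
print(s.interpolate().tolist())   # [1.0, 2.0, 3.0, 3.0]
```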

23. Difference between fillna() and interpolate() methods

  • fillna(): Fills with a static value (e.g., 0 or "Unknown"). Forward filling using the method='ffill' argument inside .fillna() is deprecated; use .ffill() directly instead.

  • interpolate(): Fills using mathematical estimation (linear, polynomial, spline) between points. Essential for time-series data where you want to "connect the dots."

24. What is resampling?

Resampling is used to change the frequency at which time series data is reported. Imagine you have monthly time series data and want to convert it into weekly data or yearly data; this is where resampling is used.

Converting monthly data to weekly or daily data is essentially upsampling. Interpolation techniques are used to increase the frequencies here.

In contrast, converting monthly to yearly data is termed downsampling, where data aggregation techniques are applied.
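A minimal sketch of both directions, using made-up monthly data (month-start and year-start frequency aliases):

```python
import pandas as pd

# Monthly data for one year
idx = pd.date_range('2025-01-01', periods=12, freq='MS')
monthly = pd.Series(range(1, 13), index=idx)

# Downsample: monthly -> yearly, aggregating with sum
yearly = monthly.resample('YS').sum()

# Upsample: monthly -> daily, forward-filling between known points
daily = monthly.resample('D').ffill()
print(yearly)
```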

25. How do you perform one-hot encoding using pandas?

We perform one-hot encoding to convert categorical values into numeric ones so they can be fed to machine learning algorithms.

import pandas as pd

data = {'Name': ['John', 'Cateline', 'Matt', 'Oliver'],
        'ID': [1, 22, 23, 36],
        'Category': ['A', 'B', 'A', 'B']}

df = pd.DataFrame(data)

#one hot encoding 
new_df = pd.get_dummies(df, columns=['Category'])
new_df.head()

For production, using Scikit-learn's OneHotEncoder is the preferred option because it preserves the schema.

26. How do you create a line plot in pandas?

To create a line plot, we use the plot function in pandas.

import pandas as pd


data = {'units': [1, 2, 3, 4, 5],
        'price': [7, 12, 8, 13, 16]}
# Create a DataFrame df
df = pd.DataFrame(data)

df.plot(kind='line', x='units', y='price')

Note: You can now change the backend to Plotly for interactive charts: pd.options.plotting.backend = "plotly".

27. What is the pandas method to get the statistical summary of all the columns in a DataFrame?

df.describe() returns stats like mean, percentile values, min, max, etc., of each column in the DataFrame.

28. What is the rolling mean?

The rolling mean is also referred to as a moving average because the idea here is to compute the mean of data points for a specified window and slide the window throughout the data. This will lessen the fluctuations and highlight the long-term trends in time series data.

The syntax looks like this: df['column_name'].rolling(window=n).mean()
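For example, a 3-point moving average over a short made-up price series:

```python
import pandas as pd

prices = pd.Series([10, 12, 14, 16, 18])

# The first two entries are NaN because the window is incomplete there
ma = prices.rolling(window=3).mean()
print(ma.tolist())  # [nan, nan, 12.0, 14.0, 16.0]
```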

29. What is the SettingWithCopyWarning, and how can you fix it?

This warning occurred when pandas was unsure whether you were modifying a view or a copy. In modern pandas, Copy-on-Write (CoW) resolves this ambiguity: it is opt-in from version 2.0 and the default behavior in 3.0. CoW strictly separates views and copies, ensuring that modifying a subset never silently alters the original frame unless explicitly requested. This largely eliminates the ambiguity that caused the warning.

To fix it in older pandas versions, use .loc[] for explicit indexing or assign the slice to a new variable before making changes to ensure clarity and avoid unintended behavior. Read more in this blog: How to Fix SettingWithCopyWarning.
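The .loc[] fix can be sketched as follows (the data is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'city': ['Austin', 'Boston'], 'marks': [70, 95]})

# Ambiguous chained indexing (may modify a copy in older pandas):
# df[df['city'] == 'Austin']['marks'] = 100

# Explicit and safe: one .loc call selects and assigns in a single step
df.loc[df['city'] == 'Austin', 'marks'] = 100
print(df)
```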

30. How do you validate DataFrame schemas in production?

The industry standard is to use pandera, a library that enforces rigorous data quality checks at runtime. pandera allows you to define a schema that validates column data types and applies statistical logic, such as ensuring values are within a specific range or that a column is unique.

You can use its decorators (e.g., @pa.check_types) to automatically validate inputs and outputs of your functions, preventing "dirty data" from silently breaking your downstream pipelines. This effectively acts as unit testing for your data itself.

Preparing for the Interview

It's likely that you'll be asked at least a few of the most frequently asked interview questions, which is why it is a good idea to use them for interview preparation. In addition to pandas, a data-oriented job role demands a lot of other skills. Here is the checklist to succeed in the overall interview process:

Understanding the job requirements

Revise the job description and responsibilities, and ensure your skills and resume are aligned with them. Additionally, knowing about the company and how your role impacts them is a plus.

Python coding

The interviewer will often check your Python skills before asking about pandas itself, so equip yourself with strong Python fundamentals.

For analyst roles, solid general-purpose Python may be enough. But if you are applying for data scientist or ML engineer roles, it’s important to practice solving Python coding challenges.

Data projects

Make sure your resume features some real-world data problems you have solved. If you are experienced, you can talk about your past projects; if you are new to the field, try finishing some projects from Kaggle.

Further concepts

Beyond those basics, the additional questions will depend on the role.

For analysts, questions can be from Excel, data visualization dashboards, statistics, and probability.

In addition, the interviewer can go deep into machine learning and deep learning subjects if you are applying for data scientist or ML engineer roles.

Expect questions around system design if you’re applying for highly technical or experienced roles. Revise common design questions and practice end-to-end ML system design problems.

Conclusion

Cracking a job in the data industry often requires strong pandas skills. The above list of theoretical and hands-on interview questions should help you ace the pandas part of your interview, and the tips at the end will help the rest of the process go smoothly.

Author: Srujana Maddula

Srujana is a freelance tech writer with a four-year degree in Computer Science. Writing about various topics, including data science, cloud computing, development, programming, and security, comes naturally to her. She has a love for classic literature and exploring new destinations.
