NumPy is a fundamental component of the data scientist’s toolkit. It allows for efficient data processing in Python, even at large volumes.
The toolset provided by NumPy, such as easy element-wise calculations, matrix multiplication, and vectorization, makes it a top choice for performing complex calculations in Python. As such, it will come up often during the interview process. Make sure you brush up on some potential questions by reading this article!
Also, check other resources on DataCamp for additional hands-on practice with NumPy.
Basic NumPy Interview Questions
Use these basic interview questions to check your understanding of NumPy fundamentals. They are great warmups and starting points for ensuring you know NumPy’s functionality and purpose.
1. What is NumPy, and why is it used in data science?
NumPy is a Python package with many portions built in C/C++ for performance reasons. Its main goal is to make the computation of large data arrays faster and easier in Python. Its core functionality is as follows:
- It provides support for large, multi-dimensional arrays and matrices, which are essential for handling large datasets.
- It offers a comprehensive collection of mathematical functions to operate on these arrays efficiently, enabling fast computations on large datasets.
- NumPy's vectorized operations allow for the efficient execution of complex mathematical operations.
- It forms the foundation for many other data science libraries like pandas, scikit-learn, and SciPy, making it a cornerstone of the Python data science ecosystem.
- NumPy arrays are more memory-efficient than Python lists, which is crucial when working with big data (see the quick comparison below).
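As a rough illustration of the memory point, this minimal sketch compares the approximate memory used by a Python list of one million integers with the equivalent NumPy array; exact numbers vary by platform and Python version:
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# Size of the list object plus the int objects it points to
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
# NumPy stores the data in one contiguous buffer
arr_bytes = np_arr.nbytes

print(f"Python list: ~{list_bytes / 1e6:.1f} MB")
print(f"NumPy array: ~{arr_bytes / 1e6:.1f} MB")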
2. How do you create a 1D array in NumPy?
Creating a NumPy array is straightforward: simply call the np.array() function to create an array object. Understanding how to make 1D arrays will allow you to build out higher n-dimensional NumPy arrays.
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
3. What's the difference between a Python list and a NumPy array?
The main differences between Python lists and NumPy arrays are:
- Homogeneity: NumPy arrays are homogeneous, meaning all elements must be the same type. Python lists can contain elements of different types.
- Memory efficiency: NumPy arrays are more memory-efficient because they store data in a contiguous block of memory, unlike Python lists, which store pointers to objects.
- Performance: NumPy arrays support vectorized operations, making them much faster for numerical computations. Operations are performed element-wise without the need for explicit loops (see the sketch after this list).
- Functionality: NumPy arrays come with a wide range of mathematical operations and functions that can be applied directly to the array, which is not possible with Python lists.
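To illustrate the performance point, here is a minimal timing sketch comparing a vectorized NumPy multiplication with an equivalent list comprehension; the arrays are illustrative and timings depend on your machine:
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Pure-Python element-wise multiply
start = time.perf_counter()
doubled_list = [x * 2 for x in py_list]
print(f"List comprehension: {time.perf_counter() - start:.4f} s")

# Vectorized NumPy multiply
start = time.perf_counter()
doubled_arr = np_arr * 2
print(f"NumPy vectorized:   {time.perf_counter() - start:.4f} s")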
4. How do you check the shape and size of a NumPy array?
Understanding how to check the shape of a NumPy array is important because, during data processing, you may have an expected final output array size.
If your result does not meet your expectations, checking the NumPy array shape will allow you to take steps toward resolving those issues. You can use the shape attribute to get the dimensions of the array and the size attribute to get the total number of elements:
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape) # Output: (2, 3)
print(arr.size) # Output: 6
5. How do you reshape a NumPy array?
Reshaping arrays is a common operation in data preprocessing and feature engineering. It is crucial for adapting data to various algorithms' input requirements or reorganizing data for analysis.
You can reshape a NumPy array using the reshape() method or the np.reshape() function. Here's how:
import numpy as np
# Using reshape() method
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = arr.reshape(2, 3)
print(reshaped_arr)
# Output:
# [[1 2 3]
# [4 5 6]]
# Using np.reshape() function
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped_arr = np.reshape(arr, (3, 2))
print(reshaped_arr)
# Output:
# [[1 2]
# [3 4]
# [5 6]]
Intermediate NumPy Interview Questions
These questions dive deeper into actually using NumPy. Once you’ve established a fundamental understanding of NumPy’s arrays, you should explore its functionality. Performing calculations with NumPy can be expected at the intermediate level.
6. How do you create an array of all zeros or all ones?
Creating arrays filled with zeros or ones is a common requirement in many data science tasks, such as initializing matrices, creating mask arrays, or setting up placeholder data structures.
To create an array of all zeros or all ones in NumPy, you use the np.zeros() or np.ones() functions:
import numpy as np
# Create a 3x4 array of zeros
zeros_arr = np.zeros((3, 4))
print(zeros_arr)
# Output:
# [[0. 0. 0. 0.]
# [0. 0. 0. 0.]
# [0. 0. 0. 0.]]
# Create a 2x2 array of ones
ones_arr = np.ones((2, 2))
print(ones_arr)
# Output:
# [[1. 1.]
# [1. 1.]]
7. What is broadcasting in NumPy?
Broadcasting is a key NumPy behavior that allows for efficient operations on arrays of different sizes.
Simply put, it allows arithmetic operations between arrays of different shapes: NumPy automatically stretches (virtually replicates) the smaller array across the larger one so that their shapes become compatible, without actually copying the data.
See the following example:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([[1], [2], [3]])
print(a + b)
# Output:
# [[2 3 4]
# [3 4 5]
# [4 5 6]]
8. How do you find the mean, median, and standard deviation of a NumPy array?
The mean, median, and standard deviation are key descriptive statistics commonly used to understand our data. NumPy has purpose-built functions that calculate each of them easily.
Knowing these calculations is important for making full use of NumPy:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Mean
mean_value = np.mean(arr)
print("Mean:", mean_value) # Output: 3.0
# Median
median_value = np.median(arr)
print("Median:", median_value) # Output: 3.0
# Standard deviation
std_value = np.std(arr)
print("Standard deviation:", std_value) # Output: 1.4142135623730951
9. How can we utilize NumPy to quickly query our numerical data and perform actions based on a boolean statement?
While we think of NumPy’s primary purpose as a calculation package, it also has powerful data-wrangling capabilities.
NumPy can easily query data through boolean indexing and perform operations based on the results. The np.where() function is particularly useful when we want to change our data based on numerical values.
Assume we have a DataFrame of exam scores named df and want to categorize the students. A simple np.where() call may look like:
df['student_cat'] = np.where(df['score'] > 80, 'good', 'bad')
Note that np.where() takes up to three inputs:
- The first is the boolean statement of any kind
- The second is the result if true
- The third is the result if false
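The same pattern works directly on NumPy arrays. Here's a minimal sketch, with made-up scores for illustration:
import numpy as np

scores = np.array([55, 90, 72, 85, 60])
categories = np.where(scores > 80, 'good', 'bad')
print(categories) # Output: ['bad' 'good' 'bad' 'good' 'bad']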
10. What is a way to utilize NumPy for something like Mean Squared Error?
NumPy’s ability to operate on an entire array at once makes implementing calculations like Mean Squared Error (MSE) simple.
Because arithmetic is applied element by element, the operation vectorizes naturally, providing an efficient way of calculating MSE.
An example of an implementation of MSE in NumPy follows:
import numpy as np

# Example prediction and actual label arrays (values are illustrative)
predictions = np.array([2.5, 0.0, 2.1, 7.8])
labels = np.array([3.0, -0.5, 2.0, 7.5])

# 1. Take the element-wise differences between predictions and labels
# 2. Square the differences and sum them
# 3. Divide by n, the length of the array
n = len(labels)
error = (1 / n) * np.sum(np.square(predictions - labels))
Advanced NumPy Interview Questions
Now, it’s time to enter advanced territory! You’re expected to use NumPy to solve more complex problems at this level.
11. How can you compute the rolling statistics using NumPy, such as a rolling mean?
Rolling statistics such as a rolling average are very important in data science. The rolling mean is often used to smooth noisy data, especially if it is time-based.
A little-known feature of NumPy is “strides.” One way strides are used is to create a sliding window view of your array. Using np.lib.stride_tricks.sliding_window_view(), you can easily generate array subsets.
You can then do any sort of summarization, such as calculating the mean, on each of these subsets to obtain a rolling average. Here’s an example implementation:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(6)
v = sliding_window_view(x, 3)
# v contains the length-3 windows: [0 1 2], [1 2 3], [2 3 4], [3 4 5]

# Averaging across each window gives the rolling mean
rolling_mean = v.mean(axis=-1)
print(rolling_mean) # Output: [1. 2. 3. 4.]
12. How do you perform advanced indexing to select elements from a multidimensional array based on a condition?
While indexing may be a fundamental skill for NumPy, leveraging more advanced indexing techniques can allow data scientists to slice their data more precisely.
By utilizing integer array indexing and boolean indexing, creating datasets that meet specific criteria becomes trivial.
import numpy as np
array = np.array([[10, 15, 20, 25],
                  [30, 35, 40, 45],
                  [50, 55, 60, 65]])
print(array) # Output:
# [[10 15 20 25]
# [30 35 40 45]
# [50 55 60 65]]
# Boolean indexing: Select elements greater than 30
condition = array > 30
print(condition) # Output:
# [[False False False False]
# [False True True True]
# [ True True True True]]
# Apply the condition to get the elements that meet the criteria
filtered_elements = array[condition]
print(filtered_elements) # Output: [35 40 45 50 55 60 65]
# Integer array indexing: Select specific elements based on row and column indices
row_indices = np.array([0, 1, 2])
col_indices = np.array([1, 2, 3])
selected_elements = array[row_indices, col_indices]
print(selected_elements) # Output: [15 40 65]
# Combining boolean and integer indexing
# Select elements greater than 30, restricted to specific rows (here, rows 1 and 2)
row_mask = np.zeros(array.shape[0], dtype=bool)
row_mask[[1, 2]] = True
combined_condition = (array > 30) & row_mask[:, None]
filtered_selected_elements = array[combined_condition]
print(filtered_selected_elements) # Output: [35 40 45 50 55 60 65]
13. How can you use NumPy to perform a linear algebra operation like matrix decomposition or solve a linear equation system?
Performing matrix decomposition is vital for data scientists dealing with large data volumes. Reducing data to its principal components is a critical first step in reducing complexity and noise.
NumPy’s linalg module allows us to easily perform the linear algebra needed to obtain principal components.
import numpy as np

# The underlying signal is a sinusoidally modulated image
# (scipy.misc.lena has been removed from SciPy, so we use a random image instead)
img = np.random.rand(100, 100) * 255
t = np.arange(100)
signal = np.sin(0.1 * t)
true = signal[:, np.newaxis, np.newaxis] * img[np.newaxis, ...]
# We add some noise
noisy = true + np.random.randn(*true.shape) * 255
# (observations, features) matrix
M = noisy.reshape(noisy.shape[0], -1)
# Singular value decomposition factorizes your data matrix such that:
# M = U*S*V.T (where '*' is matrix multiplication)
# * U and V are the singular matrices, containing orthogonal vectors of unit length
# * S is a diagonal matrix containing the singular values of M - we can use these to calculate our PCs
# Obtain the results of SVD from our noisy matrix
U, s, Vt = np.linalg.svd(M, full_matrices=False)
# Transpose Vt to get our PC vectors
V = Vt.T
# PCs are already sorted by descending order of the singular values
# (i.e., by the proportion of total variance they explain)
# If we use all of the PCs, we can reconstruct the noisy signal perfectly
S = np.diag(s)
Mhat = np.dot(U, np.dot(S, V.T))
print("Using all PCs, MSE = %.6G" % np.mean((M - Mhat) ** 2))
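The question also mentions solving a system of linear equations; np.linalg.solve() handles that directly. A minimal sketch, with a made-up 2x2 system:
import numpy as np

# Solve the system: 3x + y = 9, x + 2y = 8
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
solution = np.linalg.solve(A, b)
print(solution) # Output: [2. 3.]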
14. How can you optimize memory usage when working with large arrays in NumPy?
A rarely used feature of NumPy is np.memmap(). It stores an array in a file on disk, letting us read arrays larger than we could hold purely in memory. Its main benefit is lazy data reading, which reduces overall memory needs while still giving access to the entire dataset.
Using this function cleverly allows a data scientist to work with large data more easily and conveniently.
import numpy as np
# Create a large array and save it to a file using memmap
filename = 'large_array.dat'
large_array_shape = (10000, 10000)
dtype = np.float32 # Specify the data type of the array
# Create a memmap object with the desired shape and dtype
large_array = np.memmap(filename, dtype=dtype, mode='w+', shape=large_array_shape)
# Initialize the array with some values (for truly huge arrays, fill in chunks instead)
large_array[:] = np.random.rand(*large_array_shape)
# Access a small part of the array without loading the entire array into memory
sub_array = large_array[5000:5010, 5000:5010]
print(sub_array) # Output: A 10x10 array with random float values
# Flush changes to disk, then release the memmap
large_array.flush()
del large_array
15. How can you handle and manipulate arrays with missing or infinite values in NumPy?
Dealing with missing and infinite values is common for data scientists. First, you might want to use NumPy’s isnan() or isinf() functions to find these missing and infinite values.
If there is a systematic issue, we may want to assess our pipelines; otherwise, we may want to fill in these values.
While we may not use NumPy directly to fill in missing values, we often use NumPy functions in combination with something like pandas' fillna() to fill in missing data. For instance, we may want to use NumPy’s mean() or median() functions (or their NaN-aware variants, nanmean() and nanmedian()) to quickly fill in erroneous values.
NumPy Interview Questions for Data Scientists
Until now, we have covered general NumPy questions. Of course, the questions we reviewed before may apply to data science, but in this section, I compiled specific NumPy questions for data scientists.
16. Is there a way to quickly and easily apply functions to each row and column of a 2D array?
Sometimes, we need to perform custom calculations on our array to get information about each row or column. Thankfully, the NumPy function np.apply_along_axis() can be used to apply a custom function across the entirety of a given axis of the array.
import numpy as np
# Create a 2D array
data = np.array([
    [1, 2, 3, 4, 5],
    [10, 15, 20, 25, 30],
    [100, 200, 300, 400, 500]
])
# Define a function to compute the range of a 1D array
def compute_range(arr):
    return np.max(arr) - np.min(arr)
# Apply the compute_range function to each row (axis=1)
ranges = np.apply_along_axis(compute_range, axis=1, arr=data)
print("Range of each row:", ranges)
# Output:
# Range of each row: [ 4 20 400]
17. How can you leverage NumPy to perform feature scaling and normalization of datasets for machine learning?
Normalization of data ensures we’re properly training our machine learning models. Without normalization, the scale can impact our model results, especially for distance-based models.
We can use NumPy’s functions to perform scaling easily. Here's an example of min-max scaling that scales each column (feature) using the column-wise minimums and maximums. Make sure you choose the right axis when feature scaling.
import numpy as np
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
# Min-Max Scaling
min_vals = np.min(data, axis=0)
max_vals = np.max(data, axis=0)
scaled_data = (data - min_vals) / (max_vals - min_vals)
print("Scaled Data:\n", scaled_data)
# Output:
# Scaled Data:
# [[0. 0. 0. ]
# [0.5 0.5 0.5 ]
# [1. 1. 1. ]]
18. What are some ways we can easily sort and index our NumPy arrays?
While pandas offers DataFrame.sort_values(), there are cases where we want the positions of the sorted values rather than the sorted values themselves.
NumPy’s argsort() returns the indices that would sort a given array. One useful scenario is when we need to index other datasets to match our sorted arrays. With those positions in hand, we can use the output of argsort() to ensure consistency across our datasets.
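A minimal sketch of argsort() in action, with made-up arrays for illustration:
import numpy as np

scores = np.array([88, 55, 92, 70])
names = np.array(['Ann', 'Bob', 'Cam', 'Dee'])

order = np.argsort(scores) # Indices that would sort scores ascending
print(order) # Output: [1 3 0 2]
print(scores[order]) # Output: [55 70 88 92]
print(names[order]) # Output: ['Bob' 'Dee' 'Ann' 'Cam']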
19. What is one important aspect of NumPy’s random number generator that can be utilized to make it predictable and why?
Random number generators in computing are not genuinely random. They are based on an initial seed. Since we often want to test our data and be able to easily assess results, we must minimize the amount of randomness present in our pipeline.
By utilizing NumPy’s random.seed() function, we can set the seed for an entire pipeline so that we get the same results each time. Setting a specific seed allows us to assess whether improvements to our results come from our model adjustments rather than from randomness.
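A quick sketch of the effect; note that newer NumPy code often prefers np.random.default_rng(seed), which returns a seeded generator object:
import numpy as np

np.random.seed(42)
print(np.random.rand(3)) # Same three numbers on every run

np.random.seed(42)
print(np.random.rand(3)) # Identical to the line above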
20. Describe how you might implement K-Means in NumPy.
During an interview, you may be asked to implement an algorithm of some kind. The focus of these questions is to answer with a fundamental understanding of the model and the package.
You do not need to memorize every line of code below, but be able to point out the key steps and methods needed. Make sure you’ve read about K-Means (and other basic algorithms) and understand how the algorithm works as well.
import numpy as np
# Generate a sample dataset
np.random.seed(42) # For reproducibility
data = np.vstack([
    np.random.normal(loc=[1, 1], scale=0.5, size=(50, 2)),
    np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    np.random.normal(loc=[9, 1], scale=0.5, size=(50, 2))
])

def k_means(X, k, max_iters=100, tol=1e-4):
    # Step 1: Initialize centroids randomly
    num_samples, num_features = X.shape
    centroids = X[np.random.choice(num_samples, k, replace=False)]
    for i in range(max_iters):
        # Step 2: Assign each point to the nearest centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        cluster_assignments = np.argmin(distances, axis=1)
        # Step 3: Update centroids as the mean of each cluster
        new_centroids = np.array([X[cluster_assignments == j].mean(axis=0) for j in range(k)])
        # Check for convergence
        if np.all(np.linalg.norm(new_centroids - centroids, axis=1) < tol):
            break
        centroids = new_centroids
    return centroids, cluster_assignments
# Apply k-means clustering
k = 3
centroids, cluster_assignments = k_means(data, k)
Final Thoughts
Building up your interview knowledge with NumPy is one of the critical steps to success in a career in data science.
Start by practicing and understanding the fundamentals, then move on to specific implementations. The more you use NumPy, the better you will understand and internalize its functions. Try some of the courses and tutorials on DataCamp for more hands-on practice.
FAQs
What other topics might be covered in a data science interview?
In addition to key Python programming concepts, it's important to be familiar with libraries like Matplotlib, scikit-learn, and SciPy. Knowledge of these tools can give you an edge in data science interviews.
What are the key things to know about NumPy?
Stay informed about the latest updates to NumPy, as being up-to-date with new features and changes can give you a significant advantage when applying for data science roles.
What are some common applications of NumPy?
NumPy is essential for tasks involving matrix calculations, such as gradient descent and convolutional neural network computations, making it highly applicable in various data science and machine learning scenarios.
How can I effectively learn NumPy?
You can quickly and efficiently learn NumPy by utilizing resources on platforms like DataCamp and gaining hands-on experience. Practical application is the most effective way to master NumPy.
What is a good project idea for practicing NumPy?
Implementing a machine learning model using NumPy is an excellent way to demonstrate your mathematical skills and understanding of the library. Start with a simple project like k-means clustering and progress to more complex tasks like gradient descent.