Course Notes
Identifying a time series
Which of the following data sets is not considered time series data?
A list of the average length of each class at the school.
Plotting a time series (I)
In this exercise, you'll practice plotting the values of two time series without the time component.
Two DataFrames, data and data2, are available in your workspace.
Unless otherwise noted, assume that all required packages are loaded with their common aliases throughout this course.
Note: This course assumes some familiarity with time series data, as well as how to use them in data analytics pipelines. For an introduction to time series, we recommend the Introduction to Time Series Analysis in Python and Visualizing Time Series Data with Python courses.
# Print the first 5 rows of data
print(data.head())

<script.py> output:
    symbol  data_values
    0            214.01
    1            214.38
    2            210.97
    3            210.58
    4            211.98

# Print the first 5 rows of data2
print(data2.head())

<script.py> output:
       data_values
    0       -0.007
    1       -0.008
    2       -0.009
    3       -0.010
    4       -0.011

# Plot the time series in each dataset
fig, axs = plt.subplots(2, 1, figsize=(5, 10))
data.iloc[:1000].plot(y='data_values', ax=axs[0])
data2.iloc[:1000].plot(y='data_values', ax=axs[1])
plt.show()
Plotting a time series (II)
You'll now plot both datasets again, but with the included time stamps for each (stored in the column called "time"). Let's see if this gives you more context for understanding each time series.

# Plot the time series in each dataset
fig, axs = plt.subplots(2, 1, figsize=(5, 10))
data.iloc[:1000].plot(x="time", y='data_values', ax=axs[0])
data2.iloc[:1000].plot(x="time", y='data_values', ax=axs[1])
plt.show()
Fitting a simple model: classification
In this exercise, you'll use the iris dataset (representing petal characteristics of a number of flowers) to practice using the scikit-learn API to fit a classification model.
# Print the first 5 rows for inspection
print(data.head())

from sklearn.svm import LinearSVC

# Construct data for the model
X = data[["petal length (cm)", "petal width (cm)"]]
y = data[['target']]

# Fit the model
model = LinearSVC()
model.fit(X, y)
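The same fit can be tried outside the course workspace. The sketch below assumes that scikit-learn's bundled iris dataset is a reasonable stand-in for the workspace variable data; that equivalence is an assumption (the bundled data has three classes, while the course data may have been filtered). It also passes y as a 1-D Series, which is a small departure from the exercise code.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

# Assumption: the bundled iris data stands in for the workspace `data`
iris = load_iris(as_frame=True)
data = iris.frame  # feature columns plus a "target" column

# Two petal features as inputs, the species label as output
X = data[["petal length (cm)", "petal width (cm)"]]
y = data["target"]  # 1-D Series rather than a one-column DataFrame

# Fit the linear support vector classifier
model = LinearSVC(max_iter=10000)
model.fit(X, y)

# Predict the class of the first five flowers
predictions = model.predict(X.iloc[:5])
print(predictions)
```

Every scikit-learn estimator follows this pattern: construct the model, call fit(X, y), then call predict() on inputs with the same columns.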
Predicting using a classification model
Now that you have fit your classifier, let's use it to predict the type of flower (or class) for some newly-collected flowers.
Information about petal width and length for several new flowers is stored in the variable targets. Using the classifier you fit, you'll predict the type of each flower.
# Create input array
X_predict = targets[['petal length (cm)', 'petal width (cm)']]

# Predict with the model
predictions = model.predict(X_predict)
print(predictions)

# Visualize predictions and actual values
plt.scatter(X_predict['petal length (cm)'], X_predict['petal width (cm)'],
            c=predictions, cmap=plt.cm.coolwarm)
plt.title("Predicted class values")
plt.show()

<script.py> output:
    [2 2 2 1 1 2 2 2 2 1 2 1 1 2 1 1 2 1 2 2]
Fitting a simple model: regression
In this exercise, you'll practice fitting a regression model using data from the California housing market. A DataFrame called housing is available in your workspace. It contains many variables (stored as columns). Can you find a relationship between the following two variables?
"MedHouseVal": the median house value for California districts (in units of $100,000)
"AveRooms": the average number of rooms per dwelling
from sklearn import linear_model

# Prepare input and output DataFrames
X = housing[["MedHouseVal"]]
y = housing[["AveRooms"]]

# Fit the model
model = linear_model.LinearRegression()
model.fit(X, y)
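To see what the fitted model actually stores, it can help to fit it on data where the answer is known. The sketch below uses made-up, noise-free data following y = 2x + 1 (not the housing data) and inspects the learned slope and intercept.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical noise-free data following y = 2x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)   # 10 samples, 1 feature
y = 2.0 * X[:, 0] + 1.0

model = linear_model.LinearRegression()
model.fit(X, y)

# The fitted slope and intercept describe the relationship the model found
print(model.coef_, model.intercept_)
```

On real data such as housing, the same two attributes summarize the straight line the model drew through the points.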
Predicting using a regression model
Now that you've fit a model with the California housing data, let's see what predictions it generates on some new data. You can investigate the underlying relationship that the model has found between inputs and outputs by feeding in a range of numbers as inputs and seeing what the model predicts for each input.
A 1-D array new_inputs consisting of 100 "new" values for "MedHouseVal" (median house value) is available in your workspace along with the model you fit in the previous exercise.
# Generate predictions with the model using those inputs
predictions = model.predict(new_inputs.reshape(-1, 1))

# Visualize the inputs and predicted values
plt.scatter(new_inputs, predictions, color='r', s=3)
plt.xlabel('inputs')
plt.ylabel('predictions')
plt.show()
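The reshape(-1, 1) call is needed because scikit-learn expects inputs shaped (n_samples, n_features), while new_inputs is a flat 1-D array. A minimal sketch, using a made-up array in place of the workspace new_inputs:

```python
import numpy as np

# Hypothetical stand-in for the workspace `new_inputs`
new_inputs = np.linspace(0, 5, 100)   # shape (100,): one flat vector

# -1 lets NumPy infer the row count; 1 forces a single feature column
X_new = new_inputs.reshape(-1, 1)     # shape (100, 1): 100 samples, 1 feature

print(new_inputs.shape, X_new.shape)
```

The values are unchanged; only the shape differs, so the original 1-D array can still be used directly on the x-axis of the scatter plot.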
Inspecting the classification data
In these final exercises of this chapter, you'll explore the two datasets you'll use in this course.
The first is a collection of heartbeat sounds. Hearts normally have a predictable sound pattern as they beat, but some disorders can cause the heart to beat abnormally. This dataset contains a training set with labels for each type of heartbeat, and a testing set with no labels. You'll use the testing set to validate your models.
As you have labeled data, this dataset is ideal for classification. In fact, it was originally offered as a part of a public Kaggle competition.
import librosa as lr
from glob import glob

# List all the wav files in the folder
audio_files = glob(data_dir + '/*.wav')

# Read in the first audio file, create the time array
audio, sfreq = lr.load(audio_files[0])
time = np.arange(0, len(audio)) / sfreq

# Plot audio over time
fig, ax = plt.subplots()
ax.plot(time, audio)
ax.set(xlabel='Time (s)', ylabel='Sound Amplitude')
plt.show()
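The time array works because dividing each sample index by the sampling frequency (samples per second) gives that sample's time stamp in seconds. A small sketch with made-up numbers rather than a real recording:

```python
import numpy as np

# Hypothetical values for illustration only
sfreq = 4                      # samples per second
audio = np.zeros(10)           # 10 samples of silence

# Sample index divided by sampling frequency gives seconds
time = np.arange(0, len(audio)) / sfreq

print(time[:3])   # first three time stamps in seconds
print(time[-1])   # time stamp of the last sample
```

Note that librosa's load() resamples to 22,050 Hz by default unless a different sr is passed, so the returned sfreq may not match the rate stored in the file.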
Inspecting the regression data
The next dataset contains information about company market value over several years. This is one of the most popular kinds of time series data used for regression. If you can model the value of a company as it changes over time, you can make predictions about where that company will be in the future. This dataset was also originally provided as part of a public Kaggle competition.
In this exercise, you'll plot the time series for a number of companies to get an understanding of how they are (or aren't) related to one another.
# Read in the data
data = pd.read_csv('prices.csv', index_col=0)

# Convert the index of the DataFrame to datetime
data.index = pd.to_datetime(data.index)
print(data.head())

# Loop through each column, plot its values over time
fig, ax = plt.subplots()
for column in data:
    data[column].plot(ax=ax, label=column)
ax.legend()
plt.show()
<script.py> output:
                  AAPL   FB   NFLX      V   XOM
    time
    2010-01-04  214.01  NaN  53.48  88.14  69.15
    2010-01-05  214.38  NaN  51.51  87.13  69.42
    2010-01-06  210.97  NaN  53.32  85.96  70.02
    2010-01-07  210.58  NaN  52.40  86.76  69.80
    2010-01-08  211.98  NaN  53.30  87.00  69.52
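Converting the index with pd.to_datetime() matters because read_csv() leaves the dates as plain text unless told otherwise. The sketch below uses a made-up two-row excerpt in place of prices.csv to show the index before and after conversion.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for prices.csv
csv_text = "time,AAPL,XOM\n2010-01-04,214.01,69.15\n2010-01-05,214.38,69.42\n"
data = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(type(data.index))   # a generic Index of text labels

# Convert the text labels to real timestamps
data.index = pd.to_datetime(data.index)
print(type(data.index))   # now a DatetimeIndex

# A datetime index supports selection by timestamp
print(data.loc[pd.Timestamp("2010-01-05"), "AAPL"])
```

With a DatetimeIndex in place, pandas plotting places the values on a proper date axis.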
Many repetitions of sounds
In this exercise, you'll start with perhaps the simplest classification technique: averaging across dimensions of a dataset and visually inspecting the result.
You'll use the heartbeat data described in the last chapter. Some recordings are normal heartbeat activity, while others are abnormal activity. Let's see if you can spot the difference.
Two DataFrames, normal and abnormal, each with shape (n_times_points, n_audio_files) and containing the audio for several heartbeats, are available in your workspace. The sampling frequency is loaded into a variable called sfreq. A convenience plotting function, show_plot_and_make_titles(), is also available in your workspace.
fig, axs = plt.subplots(3, 2, figsize=(15, 7), sharex=True, sharey=True)

# Calculate the time array
time = np.arange(0, len(normal)) / sfreq

# Stack the normal/abnormal audio so you can loop and plot
stacked_audio = np.hstack([normal, abnormal]).T

# Loop through each audio file / ax object and plot
# .T.ravel() transposes the array, then unravels it into a 1-D vector for looping
for iaudio, ax in zip(stacked_audio, axs.T.ravel()):
    ax.plot(time, iaudio)
show_plot_and_make_titles()
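The effect of .T.ravel() on loop order is easier to see with labels than with axes. In the sketch below, a 3x2 array of strings stands in for the grid of axes objects; each string names its (row, column) position.

```python
import numpy as np

# Stand-in for a 3x2 grid of axes: each entry names its (row, col) slot
grid = np.array([["r0c0", "r0c1"],
                 ["r1c0", "r1c1"],
                 ["r2c0", "r2c1"]])

row_major = grid.ravel()        # left-to-right, then down
column_major = grid.T.ravel()   # top-to-bottom, then across

print(row_major)
print(column_major)
```

Because the stacked audio lists all normal files first, looping in column order fills one column of the figure before the other. Assuming three files per class, that would put the normal recordings in the left column and the abnormal ones in the right.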
Invariance in time
While you should always start by visualizing your raw data, this is often uninformative when it comes to discriminating between two classes of data points. Data is usually noisy or exhibits complex patterns that aren't discoverable by the naked eye.
Another common technique to find simple differences between two sets of data is to average across multiple instances of the same class. This may remove noise and reveal underlying patterns (or, it may not).
In this exercise, you'll average across many instances of each class of heartbeat sound.
The two DataFrames (normal and abnormal) and the time array (time) from the previous exercise are available in your workspace.
# Average across the audio files of each DataFrame
mean_normal = np.mean(normal, axis=1)
mean_abnormal = np.mean(abnormal, axis=1)

# Plot each average over time
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
ax1.plot(time, mean_normal)
ax1.set(title="Normal Data")
ax2.plot(time, mean_abnormal)
ax2.set(title="Abnormal Data")
plt.show()
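Why averaging can remove noise is easiest to check on synthetic data where the true pattern is known. The sketch below invents a sine wave shared by 50 recordings and adds independent noise to each; none of these numbers come from the heartbeat data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one shared waveform buried in independent noise
n_times, n_files = 500, 50
signal = np.sin(np.linspace(0, 4 * np.pi, n_times))
recordings = signal[:, None] + rng.normal(scale=1.0, size=(n_times, n_files))

# Average across files (axis=1), matching the (n_times, n_files) layout
mean_recording = recordings.mean(axis=1)

# Residual noise in one recording vs. in the average
noise_single = np.std(recordings[:, 0] - signal)
noise_mean = np.std(mean_recording - signal)
print(noise_single, noise_mean)
```

The average is much closer to the underlying waveform here only because every synthetic recording is aligned in time. If the pattern started at a different moment in each recording, as heartbeats plausibly do, averaging would tend to wash it out along with the noise.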
Build a classification model
While eye-balling differences is a useful way to gain an intuition for the data, let's see if you can operationalize things with a model. In this exercise, you will use each repetition as a datapoint, and each moment in time as a feature to fit a classifier that attempts to predict abnormal vs. normal heartbeats using only the raw data.
We've split the two DataFrames (normal and abnormal) into X_train, X_test, y_train, and y_test.
from sklearn.svm import LinearSVC

# Initialize and fit the model
model = LinearSVC()
model.fit(X_train, y_train)

# Generate predictions and score them manually
predictions = model.predict(X_test)
print(sum(predictions == y_test.squeeze()) / len(y_test))

<script.py> output:
    0.555555
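The manual score is simply the fraction of predictions that match the true labels. The sketch below uses made-up NumPy arrays (not the heartbeat data) to show the calculation and why the labels are squeezed to 1-D first.

```python
import numpy as np

# Hypothetical labels for illustration
predictions = np.array([1, 0, 1, 1])       # shape (4,)
y_test = np.array([[1], [0], [0], [1]])    # shape (4, 1), a single column

# Without squeezing, NumPy broadcasts (4,) against (4, 1) into a (4, 4) grid
print((predictions == y_test).shape)

# Squeezing to shape (4,) compares element by element
matches = predictions == y_test.squeeze()
accuracy = matches.sum() / len(y_test)
print(accuracy)
```

For scikit-learn classifiers, model.score(X_test, y_test) reports the same mean accuracy without the manual comparison.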