Supervised Learning with scikit-learn
Run the hidden code cell below to import the data used in this course.
1 hidden cell
Take Notes
This is how to test your models:
# Split into training and test sets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Print the accuracy
print(knn.score(X_test, y_test))
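The same train/split/score pattern can be run end-to-end on synthetic data (the dataset below is generated with `make_classification` purely for illustration; the course uses its own dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data stands in for the course dataset
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Hold out 20% for testing; stratify keeps class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)  # fraction of test points classified correctly
print(accuracy)
```

Because `random_state` is fixed in both the data generation and the split, this sketch is reproducible from run to run.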
0.8740629685157422
Introduction to Regression
- Now we're going to check out the other type of supervised learning: regression. In regression tasks, the target variable typically has continuous values, such as a country's GDP or the price of a house.
- Creating feature and target arrays: Recall that scikit-learn requires features and target values in distinct variables, X and y. To use all of the features in the dataset, we drop the target column and store the values attribute as X. For y, we take the target column's values attribute. We can print the type of X and y to confirm they are now both NumPy arrays.
- Making predictions from a single feature: y can be one-dimensional, but X must be two-dimensional (apply NumPy's .reshape method, passing -1 followed by 1).
## Query the data with Python; when writing a program, don't forget the print statement
import pandas as pd
hitters = pd.read_csv('hitters.csv')
print(hitters.head())
#Creating feature and target
X = hitters.drop("strikeout", axis=1).values
y = hitters["strikeout"].values
print(type(X), type(y))
# Try to predict strikeouts from at bats (ab, column 6)
X_ab = X[:, 6]
print(y.shape, X_ab.shape)
# Checking the shape of y and X_ab, we see that they are both one-dimensional arrays. This is fine for y, but our features must be formatted as a two-dimensional array to be accepted by scikit-learn. To convert the shape of X_ab we apply NumPy's .reshape method, passing -1 followed by 1.
X_ab = X_ab.reshape(-1, 1)
print(X_ab.shape)
# Plotting strikeouts as a function of at bats
import matplotlib.pyplot as plt
plt.scatter(X_ab, y)
plt.ylabel("Strikeout")
plt.xlabel("At Bats")
plt.show()
# Fitting a regression model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_ab, y)
predictions = reg.predict(X_ab)
plt.scatter(X_ab, y)
plt.plot(X_ab, predictions)
plt.ylabel("Strikeout")
plt.xlabel("At Bats")
plt.show()
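After fitting, the model's quality can be checked with `reg.score`, which returns the R-squared of the fit. A minimal sketch on synthetic data (the "at bats" and "strikeouts" values below are randomly generated stand-ins, not the hitters.csv data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
# Fake "at bats" feature, reshaped to the two-dimensional form scikit-learn expects
X_ab = rng.integers(100, 600, size=100).reshape(-1, 1)
# Fake "strikeouts" target: a linear trend plus noise
y = 0.2 * X_ab.ravel() + rng.normal(0, 10, size=100)

reg = LinearRegression()
reg.fit(X_ab, y)
r2 = reg.score(X_ab, y)  # R-squared: fraction of target variance explained by the line
print(r2)
```

Here R-squared is computed on the training data itself; in practice you would score on a held-out test set, just as with the kNN classifier above.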
THE BASICS OF LINEAR REGRESSION
- We want to fit a line to the data, and in two dimensions this takes the form of y equals ax plus b. Using a single feature is known as simple linear regression, where y is the target, x is the feature, and a and b are the model parameters that we want to learn. a and b are also called the model coefficients, or the slope and intercept, respectively. So how do we accurately choose values for a and b? We can define an error function for any given line and then choose the line that minimizes this function. Error functions are also called loss or cost functions.
Regression Mechanics
- y = ax + b
- y = target
- x = single feature
- a, b = parameters/coefficients of the model (slope and intercept)
- How do we choose a and b?
  - Define an error function for any given line
  - Choose the line that minimizes the error function
- Error function = loss function = cost function
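For simple linear regression with squared error, the minimizing a and b have a closed form: a = cov(x, y) / var(x) and b = mean(y) - a * mean(x). A small sketch on toy numbers (not course data):

```python
import numpy as np

# Toy data that roughly follows y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Least-squares slope: sum of co-deviations over sum of squared x-deviations
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the line must pass through the point of means
b = y.mean() - a * x.mean()
print(a, b)
```

This is exactly the line that `LinearRegression` recovers in the one-feature case; the library solves the same minimization, just in a form that generalizes to many features.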