Course Notes
Use this workspace to take notes, store code snippets, or build your own interactive cheatsheet! The datasets used in this course are available in the datasets folder.
# Import any packages you want to use here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
Take Notes
Add notes here about the concepts you've learned and code cells with code you want to keep.
# Add your code snippets here
h = ols("y ~ x", data=m).fit()
print(h.params)

# To fit a model without an intercept, add "+ 0" to the formula
h = ols("y ~ x + 0", data=m).fit()
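A minimal sketch (with made-up data) of what suppressing the intercept means: with an intercept, the design matrix carries a column of ones; with `+ 0`, it does not. Here NumPy least squares stands in for `ols`:

```python
import numpy as np

# Synthetic data where y = 2 * x exactly, so no intercept is needed
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

# With an intercept: design matrix has a column of ones plus x
X_with = np.column_stack([np.ones_like(x), x])
intercept, slope = np.linalg.lstsq(X_with, y, rcond=None)[0]

# Without an intercept (the "y ~ x + 0" case): design matrix is just x
(slope_no_ic,) = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0]

print(intercept, slope)  # intercept is ~0, slope is ~2
print(slope_no_ic)       # slope is ~2
```

Both fits recover the slope here because the true intercept is zero; on real data, dropping the intercept forces the line through the origin and changes the slope.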
# Extract the model coefficients, coeffs
coeffs = mdl_price_vs_both.params
# Assign each of the coeffs
ic_0_15, ic_15_30, ic_30_45, slope = coeffs
# Draw a scatter plot of price_twd_msq vs. n_convenience, colored by house_age_years
sns.scatterplot(x="n_convenience",
                y="price_twd_msq",
                hue="house_age_years",
                data=taiwan_real_estate)
# Add three parallel lines for each category of house_age_years
# Color the line for ic_0_15 blue
plt.axline(xy1=(0, ic_0_15), slope=slope, color="blue")
# Color the line for ic_15_30 orange
plt.axline(xy1=(0, ic_15_30), slope=slope, color="orange")
# Color the line for ic_30_45 green
plt.axline(xy1=(0, ic_30_45), slope=slope, color="green")
# Show the plot
plt.show()
The prediction workflow
from itertools import product
# Create n_convenience as a range of numbers from 0 to 10
n_convenience = np.arange(0, 11)
# Extract the unique values of house_age_years
house_age_years = taiwan_real_estate["house_age_years"].unique()
# Create p as all combinations of values of n_convenience and house_age_years
p = product(n_convenience, house_age_years)
# Transform p to a DataFrame and name the columns
explanatory_data = pd.DataFrame(p, columns=['n_convenience', 'house_age_years'])
# Add predictions to the DataFrame
prediction_data = explanatory_data.assign(
    price_twd_msq=mdl_price_vs_both.predict(explanatory_data))
print(prediction_data)
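The grid-building step above (`product` of the explanatory values, then a DataFrame) can be sketched with made-up values:

```python
from itertools import product
import pandas as pd

# Hypothetical small grid: every combination of the two variables
n_convenience = range(3)                   # 0, 1, 2
house_age_years = ["0 to 15", "15 to 30"]  # two categories

grid = pd.DataFrame(list(product(n_convenience, house_age_years)),
                    columns=["n_convenience", "house_age_years"])
print(grid.shape)  # 3 values x 2 categories = 6 rows, 2 columns
```

Each row of the grid is one combination, which is exactly the shape `.predict()` expects for its explanatory data.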
# Extract the model coefficients, coeffs
coeffs = mdl_price_vs_both.params
# Print coeffs
print(coeffs)
# Assign each of the coeffs
ic_0_15, ic_15_30, ic_30_45, slope = coeffs
# Create the parallel slopes plot; colors match the default seaborn hue order
plt.axline(xy1=(0, ic_0_15), slope=slope, color="blue")
plt.axline(xy1=(0, ic_15_30), slope=slope, color="orange")
plt.axline(xy1=(0, ic_30_45), slope=slope, color="green")
sns.scatterplot(x="n_convenience",
                y="price_twd_msq",
                hue="house_age_years",
                data=taiwan_real_estate)
# Add the predictions in black
sns.scatterplot(x="n_convenience", y="price_twd_msq", color="black", data=prediction_data)
plt.show()
Manually calculating predictions
# Define conditions
conditions = [
    explanatory_data["house_age_years"] == "0 to 15",
    explanatory_data["house_age_years"] == "15 to 30",
    explanatory_data["house_age_years"] == "30 to 45"]
# Define choices
choices = [ic_0_15, ic_15_30, ic_30_45]
# Create an array of intercepts, one per house_age_years category
intercept = np.select(conditions, choices)
# Create prediction_data with columns intercept and price_twd_msq
prediction_data = explanatory_data.assign(
    intercept=intercept,
    price_twd_msq=intercept + slope * explanatory_data["n_convenience"])
print(prediction_data)
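A self-contained sketch of the `np.select` step, with made-up intercept values: for each row, the first matching condition picks the corresponding choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame of categories
df = pd.DataFrame({"house_age_years": ["0 to 15", "30 to 45", "15 to 30"]})

# One intercept per category (made-up values for illustration)
conditions = [
    df["house_age_years"] == "0 to 15",
    df["house_age_years"] == "15 to 30",
    df["house_age_years"] == "30 to 45"]
choices = [9.4, 8.2, 7.1]

# np.select maps each row to the choice of its first matching condition
df["intercept"] = np.select(conditions, choices)
print(df["intercept"].tolist())  # [9.4, 7.1, 8.2]
```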
Getting the coefficient of determination
To get the coefficient of determination, use the `.rsquared` attribute of the fitted model. For the mass versus length model, the coefficient of determination is 0.82, where 0 is the worst possible fit and 1 is a perfect fit. For the mass versus species model, it is worse, at 0.25. For the mass versus both model, it is the highest, at 0.92. By this metric, the model with both explanatory variables is the best one.
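As a sketch of what `.rsquared` computes: the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares (the numbers below are made up):

```python
import numpy as np

# Hypothetical observed values and model predictions
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([3.1, 4.8, 7.2, 8.9])

# Residual sum of squares: variation the model fails to explain
ss_res = np.sum((y - y_hat) ** 2)
# Total sum of squares: variation around the mean of y
ss_tot = np.sum((y - y.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))  # 0.995
```

A perfect fit gives `ss_res = 0` and so R² = 1; a model no better than predicting the mean gives R² = 0.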