Certification Prep - Regression

Electricity Costs at a Data Center

Project Brief

You are working at a data center. One of the center's biggest costs is the electricity required to run its servers. To keep charging customers appropriately for its services, the center needs to estimate what it will pay to operate, in particular the electricity price. Besides knowing whether a reasonable estimate of the price is possible, the center wants to know which factors cause the price to increase, so it can charge customers more in those situations.

Data

The dataset contains the price of electricity for a data center, along with factors that might affect the price.

DateTime: string, date and time of the sample
Holiday: string, name of the holiday if the day is a bank holiday
HolidayFlag: integer, 1 if the day is a bank holiday, 0 otherwise
DayOfWeek: integer (0-6), day of the week, where 0 is Monday
WeekOfYear: integer, running week within the year of this date
Day: integer, day of the date
Month: integer, month of the date
Year: integer, year of the date
PeriodOfDay: integer (0-47), half-hour period of the day
ORKTemperature: the actual temperature measured at Cork airport
ORKWindspeed: the actual windspeed measured at Cork airport
CO2Intensity: the actual CO2 intensity (in g/kWh) of the electricity produced
ActualWindProduction: the actual wind energy production for this period
SystemLoadEP2: the actual national system load for this period
SMPEP2: the actual price of this time period, the value to be forecasted
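
As a rough illustration of how the calendar columns listed above relate to DateTime, the sketch below derives them from a few made-up timestamps. The PeriodOfDay formula assumes that period 0 starts at midnight, which the data description does not state explicitly, so treat it as an assumption to verify against the real file.

```python
import pandas as pd

# Made-up timestamps for illustration; they are not taken from the dataset.
stamps = pd.Series(
    pd.to_datetime(["2011-11-01 00:00", "2011-11-01 08:30", "2011-11-06 23:30"])
)

derived = pd.DataFrame({
    # pandas counts Monday as 0, matching the DayOfWeek description.
    "DayOfWeek": stamps.dt.dayofweek,
    "Day": stamps.dt.day,
    "Month": stamps.dt.month,
    "Year": stamps.dt.year,
    # Assumption: period 0 covers 00:00-00:30, so each hour spans two periods.
    "PeriodOfDay": stamps.dt.hour * 2 + stamps.dt.minute // 30,
})
print(derived)
```

Comparing columns derived this way with the ones stored in the file is a cheap consistency check before any modelling.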

# Importing the useful modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

np.random.seed(123)
sns.set_style("darkgrid")

First Impressions

Let's look at the data to help plan our first steps.

df = pd.read_csv("Electricity Prices/electricity_prices.csv")
df.sample(10)

df.info()

Data Cleaning

We need to clean the data slightly before beginning exploration. This means dealing with any null values that we can find and ensuring the data type of each column is appropriate.

Null Values

First we find some hidden null values indicated by '?' entries in the numeric columns. We will set these to actual nulls and then drop the rows, since they make up less than 1% of the data and there is a lot of overlap between the null values in each column.

# Count the '?' placeholders in each column
(df == '?').sum()

df = df.replace('?', np.nan)
df[df.ORKTemperature.isnull()].sample(10)

As the random sample shows, roughly 90% of the rows with a null ORKTemperature also have a null ORKWindspeed. With this much overlap, dropping every row that contains a null costs few additional rows. Keeping those rows would instead mean imputing both columns for them, which would produce many near-identical imputed rows and is not ideal.
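
A sample of ten rows only gives a rough impression of the overlap. A minimal sketch of how it could be measured exactly is shown below, using a small made-up frame (the name toy and its values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up data with '?' placeholders, standing in for the real columns.
toy = pd.DataFrame({
    "ORKTemperature": ["9.0", "?", "?", "8.0", "?", "7.5"],
    "ORKWindspeed": ["11.1", "?", "?", "?", "9.3", "13.0"],
})
toy = toy.replace("?", np.nan)

temp_null = toy["ORKTemperature"].isnull()
wind_null = toy["ORKWindspeed"].isnull()

# Share of missing-temperature rows that are also missing windspeed.
overlap_share = (temp_null & wind_null).sum() / temp_null.sum()

# Cross-tabulation of the two null masks (rows: temperature, columns: windspeed).
overlap_table = pd.crosstab(temp_null, wind_null)

# Fraction of all rows that dropping every incomplete row would remove.
rows_lost = toy.isnull().any(axis=1).mean()

print(overlap_share, rows_lost)
print(overlap_table)
```

Running the same three computations on df would replace the sample-based estimate with exact figures.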

df.dropna(inplace=True)
df.info()

The null values have been removed entirely.
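
An alternative to replacing '?' after loading is to let pandas treat it as missing while reading the file, through the na_values parameter of pd.read_csv. The sketch below uses an in-memory CSV with made-up values rather than the real file, so the column subset and numbers are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up CSV text standing in for electricity_prices.csv.
csv_text = """DateTime,ORKTemperature,ORKWindspeed,SMPEP2
2011-11-01 00:00,9.0,11.1,54.3
2011-11-01 00:30,?,?,54.2
2011-11-01 01:00,8.0,?,52.1
2011-11-01 01:30,8.5,13.0,51.0
2011-11-01 02:00,8.5,14.8,?
"""

# na_values="?" turns the placeholder into NaN at parse time,
# so the affected columns are read as floats instead of strings.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

null_counts = toy.isnull().sum()
cleaned = toy.dropna()
fraction_dropped = 1 - len(cleaned) / len(toy)

print(null_counts)
print(f"Dropped {fraction_dropped:.0%} of rows")
```

With this approach the numeric columns need no later conversion from strings, although checking the dropped fraction is still worthwhile before calling dropna.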

Data Types

The data types of most columns are not ideal. We want the category data type for most of the date-related columns (other than the DateTime column itself), and the remaining columns all need to be converted to a numeric format.

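# Parse the timestamp strings into datetime values
df["DateTime"] = pd.to_datetime(df["DateTime"])
# Columns at positions 1-6 are stored as categories
for col in df.columns[1:7]:
    df[col] = df[col].astype("category")
# Columns from position 9 onwards are the numeric measurements
for col in df.columns[9:]:
    df[col] = df[col].astype("float")
# PeriodOfDay (0-47) is also treated as a category
df["PeriodOfDay"] = df["PeriodOfDay"].astype("category")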

df.info()

Now the data types look better.
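
Selecting columns by position depends on the column order in the file. A sketch of the same kind of conversion driven by column names is shown below. The frame toy, its values, and the two column lists are made up for illustration and cover only a subset of the real columns.

```python
import pandas as pd

# Made-up rows; numeric columns start as strings, as they do after loading '?' data.
toy = pd.DataFrame({
    "DateTime": ["2011-11-01 00:00:00", "2011-11-01 00:30:00"],
    "Holiday": ["None", "None"],
    "HolidayFlag": [0, 0],
    "DayOfWeek": [1, 1],
    "PeriodOfDay": [0, 1],
    "ORKTemperature": ["9.0", "8.5"],
    "SMPEP2": ["54.32", "54.23"],
})

# Hypothetical groupings; adjust them to the columns actually present.
categorical_cols = ["Holiday", "HolidayFlag", "DayOfWeek", "PeriodOfDay"]
numeric_cols = ["ORKTemperature", "SMPEP2"]

toy["DateTime"] = pd.to_datetime(toy["DateTime"])

# astype accepts a {column: dtype} mapping, so one call covers both groups.
dtype_map = {col: "category" for col in categorical_cols}
dtype_map.update({col: "float64" for col in numeric_cols})
toy = toy.astype(dtype_map)

print(toy.dtypes)
```

Naming the columns makes the intent visible in the code and keeps the conversion stable if columns are added or reordered.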

Distributions and Outliers

The final thing to look at before beginning the analysis is the distribution of each numeric column. We use the summary statistics from df.describe() for an initial view, then draw boxplots of any columns that look abnormal for more detail.

df.describe()
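
To decide which columns deserve a boxplot, one option is to count the points that fall outside the usual boxplot whiskers, that is, more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies this rule to made-up values. The helper iqr_outlier_mask is a hypothetical name introduced here, not part of pandas.

```python
import pandas as pd

# Made-up values: SMPEP2 contains one extreme price, ORKTemperature has none.
toy = pd.DataFrame({
    "SMPEP2": [40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 500.0],
    "ORKTemperature": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0],
})


def iqr_outlier_mask(series, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Number of flagged points per numeric column.
outlier_counts = toy.select_dtypes("number").apply(
    lambda s: iqr_outlier_mask(s).sum()
)
print(outlier_counts)
```

Columns with a high count are the natural candidates for a closer look with a boxplot.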
