Electricity Costs at a Data Center
Project Brief
You are working at a data center. One of the center's biggest costs is the electricity required to run all of the servers. To keep charging customers appropriately for its services, the center needs to be able to estimate what it will pay to run the facility, in particular the electricity price. As well as knowing whether a reasonable estimate of the price is possible, the center would like to know which factors cause the price to increase, so it can charge customers more in those situations.
Data
Dataset containing the price of electricity for a data center in addition to factors that might affect the price.
- DateTime: String, defines date and time of sample
- Holiday: String, gives name of holiday if day is a bank holiday
- HolidayFlag: integer, 1 if day is a bank holiday, 0 otherwise
- DayOfWeek: integer (0-6), day of the week, where 0 is Monday
- WeekOfYear: integer, running week within year of this date
- Day: integer, day of the date
- Month: integer, month of the date
- Year: integer, year of the date
- PeriodOfDay: integer (0-47), denotes half-hour period of day
- ORKTemperature: the actual temperature measured at Cork airport
- ORKWindspeed: the actual windspeed measured at Cork airport
- CO2Intensity: the actual CO2 intensity in (g/kWh) for the electricity produced
- ActualWindProduction: the actual wind energy production for this period
- SystemLoadEP2: the actual national system load for this period
- SMPEP2: the actual price of this time period, the value to be forecasted
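The calendar fields above can be reproduced from a timestamp, which is a handy sanity check on the encoding. The following is a minimal sketch on made-up timestamps; it assumes PeriodOfDay counts half-hour slots starting at midnight, which the data description does not state explicitly, and the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical timestamps, used only to illustrate the encoding.
toy = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2013-01-07 00:00",  # a Monday, first half hour of the day
        "2013-01-07 00:30",  # same day, second half hour
        "2013-01-13 23:30",  # a Sunday, last half hour of the day
    ])
})

# pandas uses the same convention as the dataset: Monday = 0, Sunday = 6.
toy["DayOfWeek"] = toy["DateTime"].dt.dayofweek

# Assumption: period 0 starts at 00:00, so each hour contributes two periods.
toy["PeriodOfDay"] = toy["DateTime"].dt.hour * 2 + toy["DateTime"].dt.minute // 30

print(toy)
```

Comparing columns derived this way against the ones supplied in the file would show whether the assumption about the period numbering holds.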
# Importing the useful modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
np.random.seed(123)
sns.set_style("darkgrid")
First Impressions
Let's look at the data to help plan our first steps.
df = pd.read_csv("Electricity Prices/electricity_prices.csv")
df.sample(10)
df.info()
Data Cleaning
We need to clean the data slightly before beginning exploration. This means dealing with any null values that we can find and ensuring the data type of each column is appropriate.
Null Values
First we find some hidden null values indicated by '?' entries in the numeric columns. We will set these to actual nulls and then drop the rows, since they make up less than 1% of the data and there is a lot of overlap between the null values in each column.
# Count the '?' placeholders in each column
df.apply(lambda x: (x == '?').sum())
# Turn the placeholders into real nulls
df = df.replace('?', np.nan)
df[df.ORKTemperature.isnull()].sample(10)
In this random sample of 10 rows, around 90% of the rows with a null ORKTemperature are also null in ORKWindspeed. With this much overlap, dropping every row that contains a null removes few additional rows. Keeping the nulls would instead mean imputing values in both columns for the same rows, which would create many near-identical rows and is not ideal.
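The overlap can also be measured directly rather than read off a sample. The sketch below uses a small hypothetical frame (`toy`) in place of the real data; the column names match the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df after '?' has been replaced by NaN.
toy = pd.DataFrame({
    "ORKTemperature": [np.nan, np.nan, 9.0, np.nan, 11.0, 12.0],
    "ORKWindspeed": [np.nan, np.nan, 20.4, 16.7, np.nan, 22.2],
})

temp_null = toy["ORKTemperature"].isnull()
wind_null = toy["ORKWindspeed"].isnull()

# Share of rows with a missing temperature that also lack a windspeed.
overlap = (temp_null & wind_null).sum() / temp_null.sum()

# Share of all rows that dropna() would remove.
dropped = toy.isnull().any(axis=1).mean()

print(f"overlap: {overlap:.1%}, rows dropped: {dropped:.1%}")
```

Running the same two calculations on the full frame would confirm both the overlap and the claim that fewer than 1% of rows are lost.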
df.dropna(inplace=True)
df.info()
The null values have been removed entirely.
Data Types
Data types of most columns are not ideal. We want to use the category data type for most of the date related columns (other than the actual datetime column). The remaining columns all need to be switched to a numeric format.
df.DateTime = pd.to_datetime(df.DateTime)
for col in df.columns[1:7]:
    df[col] = df[col].astype("category")
for col in df.columns[9:]:
    df[col] = df[col].astype("float")
df["PeriodOfDay"] = df.PeriodOfDay.astype("category")
df.info()
Now the data types look better.
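Selecting columns by position works here, but it depends on the column order in the file. As an alternative sketch, not the approach used above, columns can be converted by name, and `pd.to_numeric` with `errors="coerce"` turns any leftover non-numeric strings into nulls instead of raising. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: a numeric column read in as strings, plus a date part.
toy = pd.DataFrame({
    "SMPEP2": ["54.32", "?", "61.10"],
    "Month": [11, 11, 12],
})

# errors="coerce" converts unparseable strings such as '?' to NaN.
toy["SMPEP2"] = pd.to_numeric(toy["SMPEP2"], errors="coerce")

# Date parts are labels rather than quantities, so store them as categories.
toy["Month"] = toy["Month"].astype("category")

print(toy.dtypes)
```

One trade-off of coercion is that it hides unexpected placeholders, so it is worth counting the nulls it produces before dropping them.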
Distributions and Outliers
The final thing to look at before beginning the analysis is the distribution of each numeric column. We use the summary statistics from df.describe() to get an initial view, then draw boxplots of any columns that look abnormal to see more detail.
df.describe()
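A boxplot draws its whiskers using the interquartile range, and the same rule can be applied numerically to count the points that would appear beyond them. This is a minimal sketch assuming the common 1.5 × IQR convention; the helper `iqr_outlier_mask` and the price values are hypothetical and not taken from the dataset.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical prices with one extreme spike.
prices = pd.Series([40.0, 42.5, 45.0, 47.5, 50.0, 52.5, 55.0, 400.0], name="SMPEP2")

mask = iqr_outlier_mask(prices)
print(prices[mask])
```

Points flagged this way are not necessarily errors. Price spikes may be real events that the model needs to learn, so the mask is better used for inspection than for automatic removal.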