Skip to content

You've recently started a new position as a Data Engineer at an energy company. Previously, analysts on other teams had to manually retrieve and clean data every quarter to understand changes in the sales and capability of different energy types. This process normally took days and was something that most analytsts dreaded. Your job is to automate this process by building a data pipeline. You'll write this data pipeline to pull data each month, helping to provide more rapid insights and free up time for your data consumers.

You will achieve this using the pandas library and its powerful parsing features. You'll be working with two raw files; electricity_sales.csv and electricity_capability_nested.json.

Below, you'll find a data dictionary for the electricity_sales.csv dataset, which you'll be transforming in just a bit. Good luck!

FieldData Type
periodstr
stateidstr
stateDescriptionstr
sectoridstr
sectorNamestr
pricefloat
price-unitsstr
import pandas as pd
import json
from pandas import json_normalize
def extract_tabular_data(file_path: str):
    """Extract data from a tabular file (CSV or Parquet)."""
    if file_path.endswith(".csv"):
        return pd.read_csv(file_path)
    elif file_path.endswith(".parquet"):
        return pd.read_parquet(file_path)
    else:
        raise Exception("Warning: Invalid file extension. Please try with .csv or .parquet!")
def extract_json_data(file_path: str):
    """Extract and flatten data from a JSON file."""
    with open(file_path, "r") as f:
        data = json.load(f)
    # Flatten nested JSON
    df = json_normalize(data)
    return df

def transform_electricity_sales_data(raw_data: pd.DataFrame):
    """
    Transform electricity sales to find the total amount of electricity sold
    in the residential and transportation sectors.
    
    Steps:
    - Drop any records with NA values in the `price` column.
    - Only keep records with a `sectorName` of "residential" or "transportation".
    - Create a `month` column using the first 4 characters of the values in `period`.
    - Create a `year` column using the last 2 characters of the values in `period`.
    - Return the transformed `DataFrame`, keeping only the columns:
      `year`, `month`, `stateid`, `price`, `price-units`.
    """
    # Drop NA values in price
    raw_data.dropna(subset=["price"], inplace=True)

    # Keep only residential and transportation sectors
    filtered_df = raw_data[raw_data["sectorName"].isin(["residential", "transportation"])]

    # Create month and year columns
    filtered_df["month"] = filtered_df["period"].str[:4]
    filtered_df["year"] = filtered_df["period"].str[-2:]

    # Select final columns
    transformed_df = filtered_df[["year", "month", "stateid", "price", "price-units"]]

    return transformed_df
def load(dataframe: pd.DataFrame, file_path: str):
    """Load a DataFrame to a file in either CSV or Parquet format."""
    if file_path.endswith(".csv"):
        dataframe.to_csv(file_path, index=False)
    elif file_path.endswith(".parquet"):
        dataframe.to_parquet(file_path, index=False)
    else:
        raise Exception(f"Warning: {file_path} is not a valid file type. Please try again!")
# === Pipeline execution ===
if __name__ == "__main__":
    raw_electricity_capability_df = extract_json_data("electricity_capability_nested.json")
    raw_electricity_sales_df = extract_tabular_data("electricity_sales.csv")

    cleaned_electricity_sales_df = transform_electricity_sales_data(raw_electricity_sales_df)

    load(raw_electricity_capability_df, "loaded__electricity_capability.parquet")
    load(cleaned_electricity_sales_df, "loaded__electricity_sales.csv")