Skip to content
- _In this task, we are working with a dataset named "bank_marketing.csv," which contains information related to a bank's marketing campaigns.My goal is to perform several data preprocessing and transformation steps using Python and SQL, and then save the processed data into three separate CSV files: client.csv, campaign.csv, and economics.csv.
- Additionally, we will create SQL table creation scripts for each of these CSV files.
Step-by-Step Explanation:
Python
- Reading the data from the "bank_marketing.csv" file into a Pandas DataFrame.
- Split the Data into Three DataFrames: client, campaign, and economics.
- Perform data cleaning tasks such as replacing "."; convert values in the columns to binary (1 or 0) based on conditions; change "nonexistent" to NumPy's null values in the column.
- Create Datetime Column, where the year is fixed as 2022, and the month and day values are taken from the "month" and "day" columns.
- Remove any redundant data or columns that might have been used to create new columns.
- Save Data to CSV Files:We save the three DataFrames (client, campaign, and economics) to CSV files without an index as client.csv, campaign.csv, and economics.csv, respectively. SQL
- Create SQL Table Scripts for each of the CSV files: client_table, campaign_table, and economics_table.
- Load Data into SQL Tables from the CSV files into their respective SQL tables using the \copy command in PostgreSQL.
import pandas as pd
import numpy as np
#Reading the original dataset
df = pd.read_csv('bank_marketing.csv')
print(df.columns)
#Split the original data into 3 DataFrames
client = df [['client_id', 'age', 'job', 'marital', 'education', 'credit_default',
'housing', 'loan']]
campaign = df[['client_id','campaign','duration','pdays', 'previous','poutcome','y','day', 'month']]
economics = df[['client_id','emp_var_rate','cons_price_idx','euribor3m', 'nr_employed']]
# Display the 3 new DataFrames
print(client.head())
print(campaign.head())
print(economics.head())
#Rename columns of client and economics
client.rename(columns={"client_id": "id"}, inplace=True)
economics.rename(columns={'euribor3m':'euribor_three_months', 'nr_employed':'number_employed'}, inplace=True)
print(client.columns)
print(economics.columns)
#Rename columns of campaign dataframe
campaign= campaign.rename(columns={'campaign': 'number_contacts',
'duration': 'contact_duration',
'previous': 'previous_campaign_contacts',
'poutcome': 'previous_outcome',
'y': 'campaign_outcome'})
print(campaign.columns)
#Cleaning client dataframe
# Replace "." with "_" in the "education" column
client['education'] = client['education'].str.replace('.', '_')
# Replace "unknown" with null values in the "education" column
client['education'] = client['education'].replace('unknown', np.NaN)
print(client['education'])
#There was value issue while I changed the binary value for the campaign_outcome column. Therefore,I delete the column and add it again from the original dataset (df).
#Add the remove column to campaign dataframe
campaign['campaign_outcome'] = df['y']
campaign.head()
#Cleaning campaign dataframe
# Convert 'campaign_outcome' to binary values (1 or 0) using map()
campaign['campaign_outcome'] = campaign['campaign_outcome'].map({'yes': 1, 'no': 0})
# Print the updated 'campaign_outcome' column
print(campaign['campaign_outcome'])
# Replace 'nonexistent' with null values
campaign['previous_outcome'] = campaign['previous_outcome'].replace('nonexistent', np.NaN)
print(campaign['previous_outcome'].head())
#Add a column called campaign_id in campaign dataframe, where all rows have a value of 1
campaign['campaign_id'] = '1'
#Check the campaign dataframe
campaign.head()
#Create 'datetime' column (last_contact_date) of campaign dataframe
#Capitalize the values of motnh column
campaign['month'] = campaign['month'].str.capitalize()
#Add new column called 'year'
campaign['year']= '2022'
#Convert 'day' column to string
campaign['day'] = campaign['day'].astype(str)
#Add all this columns together
campaign['date'] = campaign['year']+ '-'+ campaign['month']+ '-' + campaign['day']
print(campaign['date'])
#Convert to 'last_contact_date' column
campaign['last_contact_date'] = pd.to_datetime(campaign['date'], format="%Y-%b-%d")
print(campaign['last_contact_date'])
#Columns to drop
columns_to_drop = ['month', 'day', 'year']
campaign.drop(columns=columns_to_drop, inplace=True)
campaign.head()
#Saving data
client.to_csv('client.csv', index= False)
economics.to_csv('economics.csv', index= False)
campaign.to_csv('campaign.csv', index= False)