Skip to content
New Workbook
Sign up
Project: Designing a Bank Marketing Database
  • _In this task, we are working with a dataset named "bank_marketing.csv," which contains information related to a bank's marketing campaigns.My goal is to perform several data preprocessing and transformation steps using Python and SQL, and then save the processed data into three separate CSV files: client.csv, campaign.csv, and economics.csv.
  • Additionally, we will create SQL table creation scripts for each of these CSV files.

Step-by-Step Explanation:

Python

  1. Reading the data from the "bank_marketing.csv" file into a Pandas DataFrame.
  2. Split the Data into Three DataFrames: client, campaign, and economics.
  3. Perform data cleaning tasks such as replacing "."; convert values in the columns to binary (1 or 0) based on conditions; change "nonexistent" to NumPy's null values in the column.
  4. Create Datetime Column, where the year is fixed as 2022, and the month and day values are taken from the "month" and "day" columns.
  5. Remove any redundant data or columns that might have been used to create new columns.
  6. Save Data to CSV Files:We save the three DataFrames (client, campaign, and economics) to CSV files without an index as client.csv, campaign.csv, and economics.csv, respectively. SQL
  7. Create SQL Table Scripts for each of the CSV files: client_table, campaign_table, and economics_table.
  8. Load Data into SQL Tables from the CSV files into their respective SQL tables using the \copy command in PostgreSQL.
import pandas as pd
import numpy as np

#Reading the original dataset
df = pd.read_csv('bank_marketing.csv')
print(df.columns)
#Split the original data into 3 DataFrames 
client = df [['client_id', 'age', 'job', 'marital', 'education', 'credit_default',
              'housing', 'loan']]
campaign = df[['client_id','campaign','duration','pdays', 'previous','poutcome','y','day',                      'month']]
economics = df[['client_id','emp_var_rate','cons_price_idx','euribor3m', 'nr_employed']]
# Display the 3 new DataFrames
print(client.head())
print(campaign.head())
print(economics.head())
#Rename columns of client and economics
client.rename(columns={"client_id": "id"}, inplace=True)

economics.rename(columns={'euribor3m':'euribor_three_months', 'nr_employed':'number_employed'}, inplace=True)

print(client.columns)
print(economics.columns)
#Rename columns of campaign dataframe

campaign= campaign.rename(columns={'campaign': 'number_contacts',
                                    'duration': 'contact_duration',
                                    'previous': 'previous_campaign_contacts',
                                    'poutcome': 'previous_outcome',
                                    'y': 'campaign_outcome'})
print(campaign.columns)
#Cleaning client dataframe
# Replace "." with "_" in the "education" column
client['education'] = client['education'].str.replace('.', '_')

# Replace "unknown" with null values in the "education" column
client['education'] = client['education'].replace('unknown', np.NaN)

print(client['education'])
#There was value issue while I changed the binary value for the campaign_outcome column. Therefore,I delete the column and add it again from the original dataset (df).

#Add the remove column to campaign dataframe
campaign['campaign_outcome'] = df['y']
campaign.head()
#Cleaning campaign dataframe
# Convert 'campaign_outcome' to binary values (1 or 0) using map()
campaign['campaign_outcome'] = campaign['campaign_outcome'].map({'yes': 1, 'no': 0})

# Print the updated 'campaign_outcome' column
print(campaign['campaign_outcome'])
# Replace 'nonexistent' with null values
campaign['previous_outcome'] = campaign['previous_outcome'].replace('nonexistent', np.NaN)
print(campaign['previous_outcome'].head())
#Add a column called campaign_id in campaign dataframe, where all rows have a value of 1
campaign['campaign_id'] = '1'
#Check the campaign dataframe
campaign.head()
#Create 'datetime' column (last_contact_date) of campaign dataframe

#Capitalize the values of motnh column
campaign['month'] = campaign['month'].str.capitalize()

#Add new column called 'year'
campaign['year']= '2022'

#Convert 'day' column to string
campaign['day'] = campaign['day'].astype(str)

#Add all this columns together
campaign['date'] = campaign['year']+ '-'+ campaign['month']+ '-' + campaign['day']
print(campaign['date'])

#Convert to 'last_contact_date' column
campaign['last_contact_date'] = pd.to_datetime(campaign['date'], format="%Y-%b-%d")
print(campaign['last_contact_date'])
#Columns to drop
columns_to_drop = ['month', 'day', 'year']
campaign.drop(columns=columns_to_drop, inplace=True)
campaign.head()
#Saving data
client.to_csv('client.csv', index= False)
economics.to_csv('economics.csv', index= False)
campaign.to_csv('campaign.csv', index= False)