Loan Data

Ready to put your coding skills to the test? Join us for our Workspace Competition.
For more information, visit datacamp.com/workspacecompetition

Context

This dataset (source) contains data on almost 10,000 borrowers who took out loans, some of which have been paid back while others are still in progress. It was extracted from lendingclub.com, an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.

# Load packages
import numpy as np 
import pandas as pd 
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

Load your data

# Load data from the csv file
df = pd.read_csv('loan_data.csv', index_col=None)

# Change the dots in the column names to underscores
df.columns = [c.replace(".", "_") for c in df.columns]
print(f"Number of rows/records: {df.shape[0]}")
print(f"Number of columns/variables: {df.shape[1]}")
df.head()
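
The rename matters because a dot in a column name blocks attribute access and needs backtick quoting inside `DataFrame.query`. Here is a minimal sketch on a made-up three-row frame. The dotted name `not.fully.paid` is inferred from the `not_fully_paid` column used later in this template, and all values are invented for illustration:

```python
import pandas as pd

# Toy stand-in for loan_data.csv; the values are invented for illustration
toy = pd.DataFrame({"installment": [100.0, 250.5, 80.0],
                    "not.fully.paid": [0, 1, 0]})

# Same rename as in the cell above: dots become underscores
toy.columns = [c.replace(".", "_") for c in toy.columns]

# Underscored names work as attributes and as plain identifiers in query()
unpaid = toy.query("not_fully_paid == 1")
unpaid_share = toy.not_fully_paid.mean()

print(unpaid)
print(f"Share not fully paid: {unpaid_share:.2f}")
```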

Understand your variables

# Summarize each column: number of unique values and the values themselves
variables = pd.DataFrame(
    [[var, df[var].nunique(), df[var].unique().tolist()] for var in df.columns],
    columns=['Variable', 'Number of unique values', 'Values'],
)

# Join with the variables dataframe
var_dict = pd.read_csv('variable_explanation.csv', index_col=0)
variables.set_index('Variable').join(var_dict)

Now you can start to explore this dataset with the chance to win incredible prizes! Not sure where to start? Try your hand at these suggestions:

  • Extract useful insights and visualize them in the most interesting way possible.
  • Find out how long it takes for users to pay back their loan.
  • Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
  • Find out what kind of people take a loan for what purposes.
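
One possible way into the last suggestion is a per-group summary: count the loans in each group and compute the share that were not fully paid. The sketch below is a starting point rather than a finished analysis. The helper `summarize_by` and the `purpose` column name are assumptions made for illustration, and the toy rows are invented; with the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

def summarize_by(df, group_col, target_col="not_fully_paid"):
    """Count loans and compute the not-fully-paid share per group."""
    return (
        df.groupby(group_col)[target_col]
          .agg(loans="count", not_fully_paid_rate="mean")
          .sort_values("loans", ascending=False)
    )

# Toy rows, invented for illustration; `purpose` is an assumed column name
toy = pd.DataFrame({
    "purpose": ["debt_consolidation", "credit_card", "debt_consolidation",
                "educational", "debt_consolidation", "credit_card"],
    "not_fully_paid": [0, 0, 1, 1, 0, 0],
})

by_purpose = summarize_by(toy, "purpose")
print(by_purpose)
```

The same helper can group by any categorical column, so it can be reused to compare other borrower segments.
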

# Start coding
df.info()
df.describe()

# Basic totals
print("Total of all installments:", df['installment'].sum())
print("Shape of the data frame:", df.shape)
print("No. of loans taken:", df.shape[0])
print("No. of loans that are not fully paid:", df['not_fully_paid'].sum())

# numeric_only=True skips any text columns, which would otherwise
# raise an error in pandas 2.0+
corr = df.corr(numeric_only=True)
sb.heatmap(corr)
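
As a possible next step after the heatmap, the correlation matrix can be reduced to a single ranked column: how strongly each numeric variable correlates with `not_fully_paid`. This is only a sketch. The helper `rank_by_correlation` is hypothetical, the `fico` and `purpose` column names are assumptions, and the toy values are invented. It relies on the `numeric_only` argument of `DataFrame.corr`, which needs pandas 1.5 or later.

```python
import pandas as pd

def rank_by_correlation(df, target_col="not_fully_paid"):
    """Rank numeric columns by absolute Pearson correlation with the target."""
    corr = df.corr(numeric_only=True)[target_col].drop(target_col)
    # Sort by magnitude but keep the sign so the direction stays visible
    return corr.sort_values(key=abs, ascending=False)

# Toy rows, invented for illustration; `fico` and `purpose` are assumed names
toy = pd.DataFrame({
    "installment": [100.0, 200.0, 300.0, 400.0],
    "fico": [750, 700, 720, 640],
    "purpose": ["a", "b", "a", "b"],  # text column, skipped by numeric_only
    "not_fully_paid": [0, 0, 1, 1],
})

ranking = rank_by_correlation(toy)
print(ranking)
```

Correlation only captures linear association, so a low value here does not rule out a useful nonlinear relationship.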