Loan Data

    This dataset consists of more than 9,500 loans with information on the loan structure, the borrower, and whether the loan was paid back in full. The data was extracted from LendingClub.com, a company that connects borrowers with investors.

    Not sure where to begin? Scroll to the bottom to find challenges!

    import pandas as pd
    loan_data = pd.read_csv("loan_data.csv")
    print(loan_data.shape)
    loan_data.head(100)

    Data dictionary

    | Variable | Explanation |
    |---|---|
    | credit_policy | 1 if the customer meets the credit underwriting criteria; 0 otherwise. |
    | purpose | The purpose of the loan. |
    | int_rate | The interest rate of the loan (more risky borrowers are assigned higher interest rates). |
    | installment | The monthly installments owed by the borrower if the loan is funded. |
    | log_annual_inc | The natural log of the self-reported annual income of the borrower. |
    | dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
    | fico | The FICO credit score of the borrower. |
    | days_with_cr_line | The number of days the borrower has had a credit line. |
    | revol_bal | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle). |
    | revol_util | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available). |
    | inq_last_6mths | The borrower's number of inquiries by creditors in the last 6 months. |
    | delinq_2yrs | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
    | pub_rec | The borrower's number of derogatory public records. |
    | not_fully_paid | 1 if the loan is not fully paid; 0 otherwise. |

    Source of dataset.
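
    The dictionary above spells column names with underscores (credit_policy, not_fully_paid), while the analysis code in this notebook indexes columns with dots (credit.policy, not.fully.paid). Which spelling the CSV really uses is not confirmed here. If the two need to be reconciled, one option is a small renaming step. The sketch below assumes dotted names in the file; `normalize_columns` is a hypothetical helper, not part of the original notebook, and the frame is a made-up stand-in for loan_data.

```python
import pandas as pd


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose column names use underscores instead of dots.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    return df.rename(columns=lambda name: name.replace(".", "_"))


# Tiny made-up frame; the real data would come from loan_data.csv.
example = pd.DataFrame(
    {"credit.policy": [1, 0], "not.fully.paid": [0, 1], "purpose": ["a", "b"]}
)
normalized = normalize_columns(example)
print(list(normalized.columns))
```

    The rest of the notebook is left as written. Applying this step to loan_data would mean switching the column keys in the later cells to the underscore spelling as well.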

    Don't know where to start?

    Challenges are brief tasks designed to help you practice specific skills:

    • πŸ—ΊοΈ Explore: Generate a correlation matrix between the numeric columns. What columns are positively and negatively correlated with each other? Does it change if you segment it by the purpose of the loan?
    • πŸ“Š Visualize: Plot histograms for every numeric column with a color element to segment the bars by not_fully_paid.
    • πŸ”Ž Analyze: Do loans with the same purpose have similar qualities not shared by loans with differing purposes? You can consider only fully paid loans.
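
    For the Explore challenge, one way to segment the correlation matrix by loan purpose is to group before calling corr(). This is only a sketch on a synthetic stand-in frame: the values are made up, the column names follow the data dictionary spelling, and `demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for loan_data (values are made up for illustration only).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "purpose": rng.choice(["credit_card", "debt_consolidation", "educational"], size=n),
    "fico": rng.integers(620, 820, size=n),
    "dti": rng.uniform(0, 30, size=n),
})
# Make int_rate fall as fico rises so there is a correlation to find.
demo["int_rate"] = 0.30 - 0.00025 * demo["fico"] + rng.normal(0, 0.005, size=n)

numeric_cols = list(demo.select_dtypes(include="number").columns)

# Overall correlation matrix across the numeric columns.
overall_corr = demo[numeric_cols].corr()

# One correlation matrix per purpose, stacked under a (purpose, column) MultiIndex.
corr_by_purpose = demo.groupby("purpose")[numeric_cols].corr()

# Pull out a single pair to compare across segments.
fico_vs_rate = corr_by_purpose.xs("fico", level=1)["int_rate"]
print(fico_vs_rate)
```

    Comparing a single pair such as fico against int_rate across purposes is usually easier to read than scanning one full heatmap per segment.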

    Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

    You recently got a job as a machine learning scientist at a startup that wants to automate loan approvals. As your first project, your manager would like you to build a classifier to predict whether a loan will be paid back based on this data. There are two things to note. First, there is class imbalance: there are fewer examples of loans not fully paid. Second, it's more important to accurately predict that a loan will not be paid back than to predict that it will be. Your manager will want to know how you accounted for this when training and evaluating your model.

    You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
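
    The two notes in the scenario (class imbalance, and the higher cost of missing a loan that is not paid back) can be made concrete with a small sketch. It computes "balanced" class weights using the formula n_samples / (n_classes * class_count), the same one scikit-learn documents for class_weight="balanced", and shows why recall on the not-fully-paid class is a more telling metric than accuracy. The labels are made up, and both function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def recall_and_precision(y_true, y_pred, positive=1):
    """Recall and precision for the positive class (here: loan not fully paid)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Made-up labels: 8 fully paid loans (0) and 2 not fully paid loans (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A model that always predicts "fully paid" looks accurate but misses every unpaid loan.
y_pred_majority = [0] * 10

weights = balanced_class_weights(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred_majority)) / len(y_true)
recall, precision = recall_and_precision(y_true, y_pred_majority)

print(weights)   # {0: 0.625, 1: 2.5}
print(accuracy)  # 0.8
print(recall)    # 0.0
```

    The majority-class predictor reaches 80% accuracy with zero recall on unpaid loans, which is why a report for this scenario would likely lead with recall (and precision) on the not_fully_paid class rather than accuracy alone.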

    loan_data.info()
    loan_data.describe()
    
    unique_items = loan_data['purpose'].unique()
    unique_items

    import matplotlib.pyplot as plt
    
    loan_data.hist(bins=30, figsize=(20, 15))
    plt.suptitle('Feature Distributions')
    plt.show()
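
    The histograms above draw every numeric column in a single color. The Visualize challenge asks for the bars to be segmented by not_fully_paid; a minimal sketch of that idea follows. It runs on a synthetic stand-in frame with made-up values (`demo` is a hypothetical name, and the column names follow the data dictionary spelling), and overlays the two classes on shared bin edges.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for loan_data (made-up values, illustration only).
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "fico": rng.integers(620, 820, size=n),
    "dti": rng.uniform(0, 30, size=n),
    "not_fully_paid": rng.choice([0, 1], size=n, p=[0.84, 0.16]),
})

feature_cols = [c for c in demo.columns if c != "not_fully_paid"]
fig, axes = plt.subplots(1, len(feature_cols), figsize=(10, 4), squeeze=False)

for ax, col in zip(axes[0], feature_cols):
    # Shared bin edges keep the two classes comparable within one panel.
    bin_edges = np.histogram_bin_edges(demo[col], bins=20)
    for value, label in [(0, "Fully Paid"), (1, "Not Fully Paid")]:
        subset = demo.loc[demo["not_fully_paid"] == value, col]
        ax.hist(subset, bins=bin_edges, alpha=0.6, label=label)
    ax.set_title(col)
    ax.legend()

fig.suptitle("Feature Distributions by Repayment Status")
plt.show()
```

    Because the classes are imbalanced, passing density=True to ax.hist may make the two shapes easier to compare than raw counts.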
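
    import plotly.graph_objects as go
    
    # Step 1: Count loans by credit.policy value (1 = meets the underwriting criteria, 0 = does not)
    policy_counts = loan_data['credit.policy'].value_counts()
    
    # Step 2: Create the doughnut chart using Plotly
    # Build the labels from the index so each label matches its count,
    # whatever order value_counts() returns
    credit_policy_labels = {1: 'Meets Credit Policy', 0: 'Does Not Meet Credit Policy'}
    labels = [credit_policy_labels[value] for value in policy_counts.index]
    values = policy_counts.values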
    
    fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=0.7)])
    
    fig.update_layout(title='Distribution of Credit Policy',title_x = 0.5,
                      annotations=[dict(text=f'Total Loans: {loan_data.shape[0]}', x=0.5, y=0.5, font_size=20, showarrow=False)])
    
    fig.show()
    
    import plotly.express as px
    
    # Step 1: Prepare Data
    nested_data = loan_data.groupby(['credit.policy', 'not.fully.paid']).size().reset_index(name='count')
    
    # Map credit.policy and not.fully.paid values to custom labels
    credit_policy_labels = {1: 'Meets Credit Policy', 0: 'Does Not Meet Credit Policy'}
    not_fully_paid_labels = {1: 'Not Fully Paid', 0: 'Fully Paid'}
    nested_data['credit.policy'] = nested_data['credit.policy'].map(credit_policy_labels)
    nested_data['not.fully.paid'] = nested_data['not.fully.paid'].map(not_fully_paid_labels)
    
    # Step 2: Create Nested Donut Chart using Plotly
    fig = px.sunburst(nested_data, 
                      path=['credit.policy', 'not.fully.paid'], 
                      values='count',
                      color='not.fully.paid',
                      color_discrete_map={'Fully Paid': 'lightblue', 'Not Fully Paid': 'darkblue'},
                      title='Credit Policy and Not Fully Paid Distribution')
    
    fig.update_traces(textinfo='label+percent entry')
    
    fig.show()

    import plotly.express as px
    
    # Step 1: Count loans per purpose
    purpose_counts = loan_data['purpose'].value_counts()
    
    # Step 2: Create the bar chart using Plotly with different colors for each bar
    fig = px.bar(purpose_counts, x=purpose_counts.index, y=purpose_counts.values, 
                 labels={'x': 'Purpose', 'y': 'Count'}, 
                 title='Distribution of Loans by Purpose',
                 color=purpose_counts.index,  # Color based on purpose
                 color_discrete_sequence=px.colors.qualitative.Plotly)  
    
    fig.show()
    # Step 1: Prepare Data
    purpose_credit_counts = loan_data.groupby(['purpose', 'credit.policy']).size().reset_index(name='count')
    
    # Map credit.policy values to custom labels
    credit_policy_labels = {1: 'Meets Credit Policy', 0: 'Does Not Meet Credit Policy'}
    purpose_credit_counts['credit.policy'] = purpose_credit_counts['credit.policy'].map(credit_policy_labels)
    
    # Step 2: Create Sunburst Chart using Plotly
    fig = px.sunburst(purpose_credit_counts, 
                      path=['purpose', 'credit.policy'], 
                      values='count',
                      color='credit.policy',
                      color_discrete_map={'Meets Credit Policy': 'lightblue', 'Does Not Meet Credit Policy': 'darkblue'},
                      title='Loan Purpose and Credit Policy Distribution')
    
    fig.update_traces(textinfo='label+percent entry')
    
    fig.show()
    import plotly.express as px
    
    # Step 1: Prepare Data
    purpose_credit_counts = loan_data.groupby(['purpose', 'credit.policy']).size().reset_index(name='count')
    
    # Map credit.policy values to custom labels
    credit_policy_labels = {1: 'Meets Credit Policy', 0: 'Does Not Meet Credit Policy'}
    purpose_credit_counts['credit.policy'] = purpose_credit_counts['credit.policy'].map(credit_policy_labels)
    
    # Normalize the counts to get percentages
    total_counts = purpose_credit_counts.groupby('purpose')['count'].transform('sum')
    purpose_credit_counts['percentage'] = purpose_credit_counts['count'] / total_counts * 100
    
    # Step 2: Create 100% Stacked Bar Chart using Plotly
    fig = px.bar(purpose_credit_counts, 
                 x='purpose', 
                 y='percentage', 
                 color='credit.policy', 
                 title='Loan Purpose and Credit Policy Distribution (100% Stacked)',
                 labels={'percentage': 'Percentage', 'purpose': 'Purpose', 'credit.policy': 'Credit Policy'},
                 color_discrete_map={'Meets Credit Policy': 'lightblue', 'Does Not Meet Credit Policy': 'darkblue'})
    
    fig.update_layout(barmode='stack')
    
    # Add percentage labels to the bars
    fig.update_traces(texttemplate='%{y:.2f}%', textposition='inside')
    
    fig.show()
    import plotly.express as px
    
    # Step 1: Prepare Data
    sunburst_data = loan_data.groupby(['not.fully.paid', 'purpose']).size().reset_index(name='count')
    
    # Map not.fully.paid values to custom labels
    not_fully_paid_labels = {1: 'Not Fully Paid', 0: 'Fully Paid'}
    sunburst_data['not.fully.paid'] = sunburst_data['not.fully.paid'].map(not_fully_paid_labels)
    
    # Step 2: Create Sunburst Chart using Plotly
    fig = px.sunburst(sunburst_data, 
                      path=['not.fully.paid', 'purpose'], 
                      values='count',
                      color='not.fully.paid',
                      color_discrete_map={'Fully Paid': 'lightblue', 'Not Fully Paid': 'darkblue'},
                      title='Not Fully Paid Status and Loan Purpose Distribution')
    
    fig.update_traces(textinfo='label+percent entry')
    
    fig.show()
    import seaborn as sns
    import matplotlib.pyplot as plt
    
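    # Correlation matrix over the numeric columns only: 'purpose' is a string column,
    # and DataFrame.corr() raises on non-numeric data in pandas 2.0 and later
    correlation_matrix = loan_data.select_dtypes(include='number').corr()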
    
    # Plotting the correlation matrix
    plt.figure(figsize=(12, 10))
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
    plt.title('Correlation Matrix')
    plt.show()