Loan Data
Ready to put your coding skills to the test? Join us for our Workspace Competition!
For more information, visit datacamp.com/workspacecompetition
Context
This dataset consists of data from almost 10,000 borrowers who took out loans, some of which have been paid back while others are still in progress. It was extracted from lendingclub.com, an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.
Load packages
library(skimr)
library(tidyverse)
Load your Data
loans <- readr::read_csv('data/loans.csv.gz')
skim(loans) %>%
select(-(numeric.p0:numeric.p100)) %>%
select(-(complete_rate))
Understand your data
| Variable | class | description |
|---|---|---|
| credit_policy | numeric | 1 if the customer meets the credit underwriting criteria; 0 otherwise. |
| purpose | character | The purpose of the loan. |
| int_rate | numeric | The interest rate of the loan (more risky borrowers are assigned higher interest rates). |
| installment | numeric | The monthly installments owed by the borrower if the loan is funded. |
| log_annual_inc | numeric | The natural log of the self-reported annual income of the borrower. |
| dti | numeric | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
| fico | numeric | The FICO credit score of the borrower. |
| days_with_cr_line | numeric | The number of days the borrower has had a credit line. |
| revol_bal | numeric | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle). |
| revol_util | numeric | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available). |
| inq_last_6mths | numeric | The borrower's number of inquiries by creditors in the last 6 months. |
| delinq_2yrs | numeric | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
| pub_rec | numeric | The borrower's number of derogatory public records. |
| not_fully_paid | numeric | 1 if the loan is not fully paid; 0 otherwise. |
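Because `log_annual_inc` is stored as a natural log and `not_fully_paid` is a 0/1 flag, a small preparation step can make later tables and plots easier to read. The following is a minimal sketch of one way to do that, not part of the original template: the helper name `prepare_loans` and the new columns `annual_inc` and `not_fully_paid_f` are hypothetical, and the toy rows are made-up values used only to show the transformation.

```r
library(dplyr)

# Hypothetical helper: adds readable columns and leaves the originals in place.
prepare_loans <- function(df) {
  df %>%
    mutate(
      # Undo the natural log to recover the self-reported annual income
      annual_inc = exp(log_annual_inc),
      # Label the 0/1 outcome so tables and legends are self-explanatory
      not_fully_paid_f = factor(not_fully_paid, levels = c(0, 1),
                                labels = c("fully paid", "not fully paid")),
      # Loan purpose is a category, not free text
      purpose = as.factor(purpose)
    )
}

# Toy rows (made-up values, NOT taken from the dataset)
toy <- data.frame(
  log_annual_inc = log(c(50000, 80000)),
  not_fully_paid = c(0, 1),
  purpose = c("credit_card", "debt_consolidation")
)
prepare_loans(toy)
```

With the real data loaded, the same helper could be applied as `loans_prepared <- prepare_loans(loans)`.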
Now you can start to explore this dataset with the chance to win incredible prizes! Can't think of where to start? Try your hand at these suggestions:
- Extract useful insights and visualize them in the most interesting way possible.
- Find out how long it takes for users to pay back their loan.
- Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
- Find out what kind of people take a loan for what purposes.
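As a starting point for the modelling suggestion, here is a minimal sketch of a logistic regression. It rests on several assumptions: the variables listed above include no repayment dates, so it models `not_fully_paid` rather than repayment within a period; the choice of predictors is arbitrary; and the helper names `fit_repayment_model` and `predict_not_paid` are hypothetical. Treat it as one possible baseline, not a recommended model.

```r
# Hypothetical helper: logistic regression of not_fully_paid on a few
# borrower characteristics. The predictor choice is an assumption.
fit_repayment_model <- function(df) {
  glm(not_fully_paid ~ fico + int_rate + dti + credit_policy,
      data = df, family = binomial())
}

# Predicted probability that each loan is NOT fully paid
predict_not_paid <- function(model, newdata) {
  predict(model, newdata = newdata, type = "response")
}

# Runs only when the `loans` data frame from this template is loaded
if (exists("loans")) {
  model <- fit_repayment_model(loans)
  print(summary(model))
  print(head(predict_not_paid(model, loans)))
}
```

The probability of a loan being paid back is then `1 - predict_not_paid(model, loans)`. A fuller analysis would hold out a test set before judging how well the model predicts.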
Judging Criteria
| CATEGORY | WEIGHTAGE | DETAILS |
|---|---|---|
| Analysis | 30% | Documentation on the goal and what was included in the analysis; how the question was approached; visualisation tools and techniques utilized |
| Results | 30% | How the results derived related to the problem chosen; the ability to trigger potential further analysis |
| Creativity | 40% | How "out of the box" the analysis conducted is; whether the publication is properly motivated and adds value |
Part 1 - Exploratory Data Analysis
How many loans are not paid back?
# 1 identifies loans that are not fully paid back by the borrower
table(loans$not_fully_paid)
What purposes do people take a loan for?
First we explore the purposes of the loans and try to answer the following question:
What purposes do people take a loan for?
tbl <- table(loans$not_fully_paid, loans$purpose, dnn = c("not_fully_paid","purpose"))
tbl
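# Row proportions by named margin (same result as prop.table(tbl, 1))
proportions(tbl, "not_fully_paid")
# Row proportions: share of each purpose within each not_fully_paid group
prop.table(tbl, 1)
# Column proportions: share of fully paid vs. not fully paid loans within each purpose
prop.table(tbl, 2)
The main purpose appears to be "debt_consolidation", and the following barplot confirms it.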
options(repr.plot.width = 12)
loans %>%
ggplot(aes(x = purpose, group = not_fully_paid)) +
geom_bar(aes(fill = as.factor(not_fully_paid)), position = "stack") +
coord_flip() + theme(legend.position = "top")
# Are loans fully paid back by borrowers that meet credit policy criteria?
table(loans$not_fully_paid, loans$credit_policy, dnn = c("not_fully_paid", "credit_policy"))
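Raw counts are hard to compare when the groups differ in size, so it can help to turn the cross-tabulation into a rate per group. The sketch below is one way to do that with dplyr; the helper name `not_paid_rate` and the output column names are hypothetical. It relies on the fact that the mean of a 0/1 indicator equals the proportion of 1s.

```r
library(dplyr)

# Hypothetical helper: share of loans not fully paid within each group
not_paid_rate <- function(df, group_col) {
  df %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      n_loans = n(),
      # Mean of a 0/1 indicator is the proportion of 1s
      not_paid_rate = mean(not_fully_paid),
      .groups = "drop"
    )
}

# Runs only when the `loans` data frame from this template is loaded
if (exists("loans")) {
  print(not_paid_rate(loans, "credit_policy"))
  print(not_paid_rate(loans, "purpose"))
}
```

The same helper works for any grouping column, which makes it easy to compare the rate across `purpose` as well as `credit_policy`.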