Analysis of Loan Dataset

Analysis of Loan Dataset

Part A

In this assignment please provide the answer, and the R code used to reach the answer, for each question below.

Question 1

Read the dataset ‘assignment_one_loans_data.csv’ into R and assign to a variable called df. 

Question 2

What was the largest amount of money loaned? 

Question 3

What was the average (mean) interest rate? 

Question 4

Which purpose had the highest average (mean) loan amount? 

Question 5

What was the total value of all defaulted loans? 

Question 6

Create a new variable risk, which is equal to ‘high’ when the interest rate is 15% or above, and ‘normal’ when the interest rate is less than 15% 

Question 7

Find the number of loans in each combination of risk group and loan status. i.e find:

the number of high risk defaults

the number of low risk defaults

the number of high risk fully paid off loans

the number of low risk fully paid off loans 

Question 8

Change the ‘other’ category in the purpose column to be a missing value. 

Question 9

Find the average loan amount in each purpose that has at least 1000 loans made.

Arrange from the highest average loan amount to the lowest. 

Question 10

Looking only at loans for credit cards and debt consolidation, what is the average loan amount in high risk and low risk loans? 

Part B

Investigating the provenance of the loans spreadsheet, you have been informed that the master loans data is held in a spreadsheet compiled by an office administrator, who takes several spreadsheets created by loans officers at different branches of LendingClub, aligns the formats of the spreadsheets, then combines the data into a single sheet and emails this to senior management every month.

(1) Identify any security and/or privacy issues arising from the process described.

(2) Identify two areas where errors or inaccuracies may creep into the data. For each area identified, propose appropriate remedial steps.

(3) Identify any issues of versioning or currency (keeping data up-to-data) which may arise from the process described. 

Solution 

setwd(‘G:/RPro_16.10.17_Emily’)

#Question 1 : Read the dataset ‘assignment_one_loans_data.csv’ into R and assign to a variable called df.

df <- read.csv(‘assignment_one_loans_data.csv’)

# Question 2 : What was the largest amount of money loaned?

(largest_money_loaded <- max(df$loan_amnt))

#Question 3 : What was the average (mean) interest rate?

(average_interest_rate <- mean(df$int_rate))

#Question 4 : Which purpose had the highest average (mean) loan amount?

list <- aggregate(df$loan_amnt, by=list(df$purpose), FUN=mean)

colnames(list) <- c(‘purpose’, ‘mean_loan_amnt’)

(purpose_highest_average <- list$purpose[which.max(list$mean_loan_amnt)])

#Question 5 : What was the total value of all defaulted loans?

(length(which(df$loan_status == ‘Default’)))

#Question 6 : Create a new variable risk, which is equal to ‘high’ when the interest rate is 15% or above, and ‘normal’ when the interest rate is less than 15%

df$risk = ‘normal’

for(i in (1:nrow(df))){

if(df$int_rate[i] >= 15){

df$risk[i] = ‘high’

}

}

#Question 7 : Find the number of loans in each combination of risk group and loan status. i.e find:

# . the number of high risk defaults

(nrow(subset(df, risk == ‘high’ & loan_status == ‘Default’)))

# . the number of low risk defaults

(nrow(subset(df, risk == ‘normal’ & loan_status == ‘Default’)))

# . the number of high risk fully paid off loans

(nrow(subset(df, risk == ‘high’ & loan_status == ‘Fully Paid’)))

# . the number of low risk fully paid off loans

(nrow(subset(df, risk == ‘normal’ & loan_status == ‘Fully Paid’)))

#Question 8 : Change the ‘other’ category in the purpose column to be a missing value.

for(i in (1:nrow(df))){

if(df$purpose[i] == ‘other’){

df$purpose[i] <- NA

}

}

#Question 9

# Find the average loan amount in each purpose that has at least 1000 loans made.

temp <- data.frame(table(df$purpose))

colnames(temp) <- c(‘purpose’, ‘frequency’)

list$purpose[which(list$purpose == ‘other’)] <- NA

temp <- merge(temp, list)

temp <- subset(temp, frequency > 1000)

# Arrange from the highest average loan amount to the lowest.

temp <- temp[order(temp$mean_loan_amnt, decreasing = TRUE),]

#Question 10 : Looking only at loans for credit cards and debt consolidation, what is the average loan amount in high risk and low risk loans?

dt <- subset(df, purpose == ‘credit_card’ | purpose == ‘debt_consolidation’)

List <- aggregate(dt$loan_amnt, by=list(dt$risk), FUN=mean)