Data EDA Temporary File Creation

Data EDA, Data Cleansing, and Temporary File Creation

We have an R modeling assignment in which we need to use data from a Kaggle competition to produce time series models. The assignment is split into two parts, and I would like to submit the first part. The data files can be found on the following web site.

What is needed in the first part:

  1. Evaluation of the data set(s) using Exploratory Data Analysis (EDA) with:
  • Visual Data Evaluations
  • Descriptive Statistics
  • Statistical Graphs
  2. Data cleansing and transformation. Create a temporary data set so as not to corrupt the original data set(s).
  • Missing values
  • Capping of Outliers
  • Creation of Dummy Variables for Categorical Variables
  • Creation of the logError, log(Zestimate) and log(SalesPrice)
  • Log Transformation of other variables
  3. Any regression approaches, including the Backward Regression Approach, to define predictor variables.

If at all possible, please avoid the R Markdown format.

Solution

setwd("C:/Users/AKOSMOS/Desktop/freelancer/tutorials")

# Data import

# This takes a long time because the files are enormous.

# 2016

donnees_2016 <- read.csv("properties_2016.csv", header = TRUE, sep = ",", dec = ".")

len2016 <- ncol(donnees_2016)

# Drop every column in which more than 50% of the values are NA.

L <- nrow(donnees_2016)

# Share of missing values per column

na_share <- colSums(is.na(donnees_2016)) / L

donnees_2016 <- donnees_2016[, na_share <= 0.5, drop = FALSE]

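# Descriptive statistics

L <- nrow(donnees_2016)

C <- ncol(donnees_2016)

# Simple statistics (print() is required for output inside a for loop)

for (i in seq_len(C)) {
  print(names(donnees_2016)[i])
  # Min, max and quantiles
  print(summary(donnees_2016[, i]))
  if (is.numeric(donnees_2016[, i])) {
    # Mean
    print(mean(donnees_2016[, i], na.rm = TRUE))
    # Standard deviation
    print(sd(donnees_2016[, i], na.rm = TRUE))
  }
}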

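# Visualization

for (i in seq_len(C)) {
  if (is.numeric(donnees_2016[, i])) {
    plot(donnees_2016[, i], main = names(donnees_2016)[i], type = "p")
  } else {
    # hist() requires numeric input, so non-numeric columns get a bar chart of counts
    barplot(table(donnees_2016[, i]), main = names(donnees_2016)[i])
  }
}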

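# Handling missing values: replace NA with the column mean (numeric columns only)

for (i in seq_len(C)) {
  if (is.numeric(donnees_2016[, i])) {
    # na.rm = TRUE is needed, otherwise the mean itself is NA
    donnees_2016[is.na(donnees_2016[, i]), i] <- mean(donnees_2016[, i], na.rm = TRUE)
  }
}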
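
Mean imputation, as in the loop above, can be pulled around by skewed columns and does not apply to categorical ones. The following is a minimal sketch, on a made-up data frame, of one possible alternative: the median for numeric columns and the most frequent value for the others. The helper name `impute_column` and the toy data are hypothetical and are not part of the original script.

```r
# Toy data frame (made up for illustration, not the Kaggle data)
toy <- data.frame(
  area = c(100, NA, 120, 4000),
  type = c("house", "condo", NA, "house"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: median for numeric columns, most frequent value otherwise
impute_column <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- median(x, na.rm = TRUE)
  } else {
    counts <- table(x)  # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

# Apply the helper column by column while keeping the data frame structure
toy_clean <- toy
toy_clean[] <- lapply(toy, impute_column)
```

Working on a copy (`toy_clean`) leaves the original object untouched, which matches the assignment's request for a temporary data set.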

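# Outliers: detected with a box-and-whisker plot and replaced by the column mean

for (i in seq_len(C)) {
  if (is.numeric(donnees_2016[, i])) {
    box <- boxplot(donnees_2016[, i])
    if (length(box$out) > 0) {
      # Row positions of the outlying values in column i
      Position <- which(donnees_2016[, i] %in% box$out)
      donnees_2016[Position, i] <- mean(donnees_2016[, i], na.rm = TRUE)
    }
  }
}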
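
The assignment asks for capping of outliers, whereas the loop above replaces them with the column mean. As a rough sketch of capping, the example below pulls values that fall outside the 1.5 × IQR fences back to the fences. This is one common convention, not necessarily the one the assignment expects; `cap_outliers` and the sample vector are hypothetical.

```r
# Hypothetical helper: cap values outside the 1.5 * IQR fences at the fences
cap_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  lower <- q[1] - 1.5 * iqr
  upper <- q[2] + 1.5 * iqr
  # pmax() raises values below the lower fence, pmin() lowers values above the upper fence
  pmin(pmax(x, lower), upper)
}

values <- c(10, 12, 11, 13, 12, 500)  # made-up data with one extreme value
capped <- cap_outliers(values)
```

Note that `quantile()` and `boxplot()` compute the quartiles slightly differently, so the fences may not match the whiskers of a box plot exactly.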

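# Recoding categorical variables as dummy (0/1) variables

for (i in seq_len(C)) {
  if (is.factor(donnees_2016[, i]) || is.character(donnees_2016[, i])) {
    niveaux <- names(table(donnees_2016[, i]))
    for (k in seq_along(niveaux)) {
      # 1 where the row has level k, 0 otherwise
      colonne <- as.numeric(donnees_2016[, i] == niveaux[k])
      # Append the dummy as a new column named <variable>_<level>
      donnees_2016[[paste(names(donnees_2016)[i], niveaux[k], sep = "_")]] <- colonne
    }
  }
}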
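
Base R can also build dummy variables without an explicit loop. The sketch below uses `model.matrix()` on a small made-up data frame; the column `heating` and its levels are hypothetical and only illustrate the idea.

```r
# Toy data frame (made up for illustration)
toy <- data.frame(
  id = 1:4,
  heating = factor(c("gas", "solar", "gas", "none"))
)

# model.matrix() expands a factor into indicator columns.
# "- 1" drops the intercept so that every level gets its own column.
dummies <- model.matrix(~ heating - 1, data = toy)

# Attach the indicator columns to the original data
toy_dummies <- cbind(toy, dummies)
```

When the dummies feed a regression with an intercept, one level is normally left out to avoid perfect collinearity; `lm()` does this automatically when it is given the factor itself.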

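# Backward regression approach

# 'full' must be a fitted model containing all candidate predictors, e.g.
# full <- lm(response ~ ., data = donnees_2016)  # 'response' is a placeholder name

step(full, direction = "backward")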
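
The call above relies on an object `full` that the script never creates. As a self-contained illustration of backward elimination with `step()`, the sketch below uses the built-in `mtcars` data as a stand-in for the cleaned property data; the choice of `mpg` as the response is purely illustrative.

```r
# mtcars ships with base R and stands in for the cleaned property data
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC: start from the full model and drop terms
# as long as the AIC improves. trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

# Predictors retained in the final model
formula(reduced)
```

`step()` selects on AIC by default; passing `k = log(nrow(mtcars))` would switch the criterion to BIC.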

x <- donnees_2016[, 1]

# Write the summary of the first column to summary1.txt

capture.output(summary(as.numeric(x)), file = "summary1.txt")

head(x)

donnees_2017 <- read.csv("properties_2017.csv", header = TRUE, sep = ",", dec = ".")

length(names(donnees_2017))

#data=rbind(donnees_2016,donnees_2017)

# Export all variable names to variables.txt

a <- names(donnees_2016)

write(a, "variables.txt")
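
The assignment also lists the creation of logError, log(Zestimate) and log(SalesPrice), which the script does not cover. The sketch below shows the arithmetic on made-up numbers, assuming that logError is defined as log(Zestimate) minus log(SalesPrice). The column names `zestimate`, `sale_price` and `log_area` are hypothetical and would need to be mapped to the actual competition files.

```r
# Made-up values; 'zestimate' and 'sale_price' are hypothetical column names
toy <- data.frame(
  zestimate  = c(310000, 450000, 199000),
  sale_price = c(300000, 470000, 200000)
)

toy$log_zestimate  <- log(toy$zestimate)
toy$log_sale_price <- log(toy$sale_price)

# logError taken as the difference of the two logs (assumed definition)
toy$logerror <- toy$log_zestimate - toy$log_sale_price

# For other skewed variables that may contain zeros, log1p(x) = log(1 + x)
# avoids -Inf at zero
toy$log_area <- log1p(c(0, 1500, 2300))
```

A positive `logerror` means the estimate was above the sale price, a negative one means it was below.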