Data EDA, Data Cleansing, and Temporary File Creation
We have an R modeling assignment in which we need to use the data from a Kaggle competition to build time series models. The assignment is split into two parts, and I would like to submit the first part. The data files can be found on the following website.
What is needed in the first part:
- Evaluation of the data set(s) using Exploratory Data Analysis (EDA) with:
  - Visual data evaluations
  - Descriptive statistics
  - Statistical graphs
- Data cleansing and transformation. Create a temporary data set so as not to corrupt the original data set(s).
  - Missing values
  - Capping of outliers
  - Creation of dummy variables for categorical variables
  - Creation of logError, log(Zestimate) and log(SalesPrice)
  - Log transformation of other variables
- Any regression approaches, including the backward regression approach, to define predictor variables.
If at all possible, please avoid the R Markdown format.
Solution
setwd("C:/Users/AKOSMOS/Desktop/freelancer/tutorials")
# Data import:
# This takes a long time because the files are enormous.
# 2016
donnees_2016 <- read.csv("properties_2016.csv", header = TRUE, sep = ",", dec = ".")
len2016 <- length(names(donnees_2016))
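The assignment asks for the cleaning to happen on a temporary data set so that the original stays intact. A minimal sketch of that idea, using a small made-up data frame (`original`, `working`, and the toy columns are illustrative names, not taken from the Kaggle files):

```r
# Toy stand-in for the imported data (hypothetical columns and values).
original <- data.frame(id = 1:4, area = c(120, NA, 95, 300))

# Work on a copy; R copies on modification, so `original` is never altered.
working <- original
working$area[is.na(working$area)] <- mean(working$area, na.rm = TRUE)

# Optionally persist the working copy to a temporary file for later steps.
tmp_path <- tempfile(fileext = ".rds")
saveRDS(working, tmp_path)
restored <- readRDS(tmp_path)
```

The same pattern would apply to the real data, for example by assigning `donnees_2016` to a second object before any cleaning step.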
# Delete a column if more than 50% of its values are NA.
part_na <- colMeans(is.na(donnees_2016))
donnees_2016 <- donnees_2016[, part_na <= 0.5, drop = FALSE]
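# Descriptive statistics:
L <- nrow(donnees_2016)
C <- ncol(donnees_2016)
# Simple statistics for each column
for (i in 1:C) {
  print(names(donnees_2016)[i])
  # Min, max and quantiles
  print(summary(donnees_2016[, i]))
  if (is.numeric(donnees_2016[, i])) {
    # Mean
    print(mean(donnees_2016[, i], na.rm = TRUE))
    # Standard deviation
    print(sd(donnees_2016[, i], na.rm = TRUE))
  }
}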
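# Visualization
for (i in 1:C) {
  if (is.numeric(donnees_2016[, i])) {
    # Numeric column: histogram
    hist(donnees_2016[, i], breaks = "Sturges", main = names(donnees_2016)[i])
  } else {
    # Categorical column: bar chart of the level counts
    barplot(table(donnees_2016[, i]), main = names(donnees_2016)[i])
  }
}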
# Handling of missing values: replace NA in numeric columns with the column mean.
for (i in 1:C) {
  if (is.numeric(donnees_2016[, i])) {
    donnees_2016[is.na(donnees_2016[, i]), i] <- mean(donnees_2016[, i], na.rm = TRUE)
  }
}
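# Outliers: box plot (box-and-whisker plot).
for (i in 1:C) {
  if (is.numeric(donnees_2016[, i])) {
    box <- boxplot(donnees_2016[, i], main = names(donnees_2016)[i])
    if (length(box$out) > 0) {
      # Rows holding an outlier value are replaced by the column mean.
      Position <- which(donnees_2016[, i] %in% box$out)
      donnees_2016[Position, i] <- mean(donnees_2016[, i], na.rm = TRUE)
    }
  }
}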
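The assignment asks for capping of outliers, whereas the step above replaces them with the column mean. As an alternative, here is a minimal sketch of capping (winsorizing) at the box plot fences. The helper `cap_outliers`, its parameter `k`, and the toy values are hypothetical and only illustrate the technique:

```r
# Cap (winsorize) a numeric vector at the Tukey fences
# Q1 - k * IQR and Q3 + k * IQR. Missing values stay missing.
cap_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

# Toy example (made-up values): 500 lies far above the upper fence.
values <- c(10, 12, 11, 13, 12, 500)
capped <- cap_outliers(values)
```

Unlike mean replacement, capping keeps an extreme observation on the same side of the distribution, so a very large value remains large after the transformation.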
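# Recoding of categorical variables as dummy variables
for (i in 1:C) {
  if (is.factor(donnees_2016[, i]) || is.character(donnees_2016[, i])) {
    niveaux <- names(table(donnees_2016[, i]))
    for (k in seq_along(niveaux)) {
      # 0/1 indicator column for level k
      colonne <- as.numeric(donnees_2016[, i] == niveaux[k])
      donnees_2016 <- cbind(donnees_2016, colonne)
      names(donnees_2016)[ncol(donnees_2016)] <- paste(names(donnees_2016)[i], niveaux[k], sep = "_")
    }
  }
}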
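Base R can also build the indicator columns in one call with `model.matrix`. The sketch below uses a small made-up data frame (`toy`, `price`, and `zone` are hypothetical names) rather than the Kaggle data:

```r
# Toy data with one categorical column (hypothetical names and values).
toy <- data.frame(price = c(100, 150, 120, 90),
                  zone  = factor(c("A", "B", "A", "C")))

# "- 1" drops the intercept so every level of `zone` gets its own 0/1 column.
dummies <- model.matrix(~ zone - 1, data = toy)

# Attach the indicator columns to the working data.
toy_dummies <- cbind(toy, dummies)
```

When the dummies feed a regression with an intercept, one level is usually left out to avoid perfect collinearity. `lm` does this automatically when it is given the factor itself.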
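# Backward regression approach: start from the full model and drop predictors step by step.
# `full` must be a fitted model. The script does not define a response column,
# so `target` below is a placeholder for it (for example the log error).
full <- lm(target ~ ., data = donnees_2016)
step(full, direction = "backward")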
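Since the script does not fix a response variable, the following sketch shows the complete backward-selection workflow on R's built-in `mtcars` data instead. The data set and the choice of `mpg` as response are stand-ins for illustration only:

```r
# Built-in mtcars stands in for the cleaned property data (illustrative only).
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC: start from all predictors and drop one at a time.
reduced <- step(full, direction = "backward", trace = 0)

# Names of the predictors kept in the final model.
kept <- attr(terms(reduced), "term.labels")
```

`step` compares models by AIC, so it stops as soon as removing any remaining predictor would no longer lower the criterion. Setting `trace = 0` suppresses the step-by-step log.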
x=donnees_2016[,1]
capture.output(summary(as.numeric(x)), file = "summary1.txt")
head(x)
donnees_2017 <- read.csv("properties_2017.csv", header = TRUE, sep = ",", dec = ".")
length(names(donnees_2017))
#data=rbind(donnees_2016,donnees_2017)
# Export all variable names to variables.txt
a <- names(donnees_2016)
write(a, "variables.txt")
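The assignment also lists logError, log(Zestimate), log(SalesPrice), and log transformations of other variables, which the script above does not create. A minimal sketch on made-up values follows. The column names `zestimate`, `sale_price`, and `garage_count` are hypothetical, and the definition log error = log(Zestimate) - log(SalesPrice) is an assumption based on how the Zillow Prize competition describes its target:

```r
# Toy values; the column names are hypothetical, not taken from the data files.
homes <- data.frame(zestimate    = c(310000, 455000, 198000),
                    sale_price   = c(300000, 470000, 198000),
                    garage_count = c(0, 2, 1))

homes$log_zestimate  <- log(homes$zestimate)
homes$log_sale_price <- log(homes$sale_price)

# Assumed definition: log error = log(Zestimate) - log(SalesPrice).
homes$log_error <- homes$log_zestimate - homes$log_sale_price

# For variables that can be zero, log1p(x) = log(1 + x) avoids -Inf.
homes$log_garage_count <- log1p(homes$garage_count)
```

A positive log error means the estimate was above the sale price, and a negative one means it was below.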