# Exploratory Data Analysis

Exploratory data analysis (EDA) is an approach to, and a philosophy of, data analysis. It employs a variety of techniques, most of them graphical, to perform the following tasks:

• Maximizing insight into a data set
• Determining the optimal factor settings
• Testing of underlying assumptions
• Extracting essential variables
• Developing parsimonious models
• Uncovering the underlying structure
• Detecting outliers and anomalies

Exploratory data analysis is concerned with how a data analysis should be carried out. Although the terms EDA and statistical graphics are often used interchangeably, they are not identical. Statistical graphics is a collection of graphically based techniques, each focusing on one aspect of data characterization. Exploratory data analysis has a larger scope: it is an approach to data analysis as a whole.

EDA postpones the usual assumptions about the kind of model the data follow in favor of a more direct approach that lets the data themselves reveal their underlying structure and model. It is a philosophy that describes how we dissect a data set, what we should look out for, and how we should interpret the results. EDA makes heavy use of the collection of techniques called statistical graphics, but it is not identical to statistical graphics per se.

The techniques used in EDA

Most of the techniques in exploratory data analysis are graphical in nature, with a few quantitative techniques alongside them. EDA relies heavily on graphics because its main role is to explore the data open-mindedly, and graphics give the analyst unparalleled power to do so. A good plot can reveal the structural secrets of the data and offer new, often unsuspected, insight.

The specific graphical techniques employed in exploratory data analysis include:

• Plotting the raw data, using histograms, data traces, lag plots, probability plots, Youden plots and block plots
• Plotting simple statistics of the raw data, using mean plots, standard deviation plots, box plots and main effects plots
• Positioning the plots mentioned above to make the most of our natural pattern-recognition abilities, for example by placing multiple plots on one page
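
As a rough illustration of what two of the raw-data plots are built from, the sketch below computes the points behind a lag plot and the bin counts behind a histogram. It uses only the Python standard library; the helper names `lag_pairs` and `histogram_counts`, the simulated data and the bin width are assumptions made for this example, and a real analysis would normally hand these values to a plotting library.

```python
import random
from collections import Counter


def lag_pairs(values, lag=1):
    """Return the (y[i - lag], y[i]) points that a lag plot would draw."""
    return list(zip(values[:-lag], values[lag:]))


def histogram_counts(values, bin_width):
    """Count observations per bin; each bin is keyed by its left edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical data: 200 draws from a normal distribution (mean 10, sd 2).
rng = random.Random(0)
data = [rng.gauss(10, 2) for _ in range(200)]

points = lag_pairs(data)
bins = histogram_counts(data, bin_width=1.0)

# Crude text rendering of the histogram, one row per bin.
for left_edge, count in bins.items():
    print(f"{left_edge:5.1f} | {'#' * count}")
```

For random data the lag-plot points scatter without structure, while the histogram rows show the shape of the distribution.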

Underlying assumptions in EDA

• Underlying assumptions in the measurement process

A measurement process typically rests on four assumptions. Together they state that the data from the process should behave like:

• Random drawings
• From a fixed distribution
• With the distribution having a fixed location
• With the distribution having a fixed variation

The fixed location in the third item above depends on the type of problem. The simplest type is the univariate problem, which involves a single variable. The general model, response = deterministic component + random component, then becomes response = constant + error.
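
The univariate model, response = constant + error, can be sketched as a small simulation. This is only an illustration: the function name `simulate_univariate`, the constant 5.0, the normally distributed error and its standard deviation of 0.5 are all assumptions chosen for the example, not values from the text.

```python
import random
import statistics


def simulate_univariate(constant, error_sd, n, seed=0):
    """Generate n responses of the form constant + error (normal error, mean 0)."""
    rng = random.Random(seed)
    return [constant + rng.gauss(0, error_sd) for _ in range(n)]


responses = simulate_univariate(constant=5.0, error_sd=0.5, n=1000)

# Under this model the sample mean estimates the constant and the
# sample standard deviation estimates the spread of the error term.
estimated_constant = statistics.mean(responses)
estimated_error_sd = statistics.stdev(responses)
print(round(estimated_constant, 2), round(estimated_error_sd, 2))
```

With enough observations, both estimates should land close to the values used to generate the data.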

• Assumptions for the univariate model

In this case the fixed location is simply the unknown constant. The process is imagined to be operating under constant conditions, producing a single column of data with the following properties:

• The random component exhibits a fixed distribution
• The data are uncorrelated with one another
• The deterministic component is only made up of a constant
• The random component has a fixed variation
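
EDA normally checks these properties graphically, but a few simple numbers can hint at the same thing. The sketch below is one possible quantitative companion, not a standard procedure: it uses the lag-1 autocorrelation as a rough check for uncorrelated data, and compares the mean and standard deviation of the two halves of the series as a rough check for fixed location and fixed variation. The helper names and both simulated series are assumptions made for this example.

```python
import random
import statistics


def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation; close to 0 for uncorrelated data."""
    mean = statistics.fmean(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values[:-1], values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


def split_half_summary(values):
    """Mean and standard deviation of the first and second half of the series."""
    mid = len(values) // 2
    first, second = values[:mid], values[mid:]
    return {
        "means": (statistics.fmean(first), statistics.fmean(second)),
        "sds": (statistics.stdev(first), statistics.stdev(second)),
    }


rng = random.Random(1)
# Hypothetical series that follows constant + error.
stable = [rng.gauss(0, 1) for _ in range(400)]
# Hypothetical series whose location drifts upward over time.
drifting = [0.01 * i + rng.gauss(0, 1) for i in range(400)]

for name, series in (("stable", stable), ("drifting", drifting)):
    print(name, round(lag1_autocorrelation(series), 2), split_half_summary(series))
```

For the stable series the autocorrelation should be near zero and the two halves should have similar summaries. For the drifting series the autocorrelation is clearly positive and the second-half mean is noticeably larger than the first, which signals a violated fixed-location assumption.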

The univariate model extends readily to the more general case in which the deterministic component is not just a constant but a function of many variables. The engineering objective is then to characterize and model that function. It does not matter how many factors there are or how complicated the function is: as long as the chosen model is a good one, the differences (residuals) between the raw response data and the predicted values from the fitted model should themselves behave like a univariate process. That is, the residuals should behave like:

• Random drawings
• The random drawings are from a fixed distribution
• With a fixed location
• With a fixed variation

If the residuals from the fitted model do behave like this ideal, testing the underlying assumptions becomes a tool for validating the chosen model. Conversely, if the residuals violate one or more of the univariate assumptions listed above, the chosen model is inadequate.
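
The idea of validating a model through its residuals can be illustrated with a small sketch. It fits a straight line by ordinary least squares to two simulated data sets, one truly linear and one curved, and uses the lag-1 autocorrelation of the residuals as a rough indicator of leftover structure. The data, the helper names and the choice of autocorrelation as the indicator are assumptions made for this example; in practice the residuals would also be inspected graphically.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x_mean, y_mean = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def residuals(x, y, intercept, slope):
    """Differences between the observed responses and the fitted line."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation; close to 0 for uncorrelated data."""
    mean = statistics.fmean(values)
    num = sum((a - mean) * (b - mean) for a, b in zip(values[:-1], values[1:]))
    den = sum((v - mean) ** 2 for v in values)
    return num / den


rng = random.Random(2)
x = [i / 10 for i in range(200)]
# Hypothetical responses: one truly linear in x, one with curvature.
y_linear = [2.0 + 0.5 * xi + rng.gauss(0, 0.3) for xi in x]
y_curved = [2.0 + 0.05 * xi ** 2 + rng.gauss(0, 0.3) for xi in x]

res_ok = residuals(x, y_linear, *fit_line(x, y_linear))
res_bad = residuals(x, y_curved, *fit_line(x, y_curved))

print("adequate model:  ", round(lag1_autocorrelation(res_ok), 2))
print("inadequate model:", round(lag1_autocorrelation(res_bad), 2))
```

When the straight line is an adequate model, its residuals look like uncorrelated noise. When it is fitted to the curved data, the residuals stay strongly correlated from one point to the next, which is the kind of violation that marks the model as inadequate.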

Importance of the assumptions

• Predictability and statistical control

Predictability is an all-important goal in statistics and engineering. If the four underlying assumptions hold, the process is said to be in statistical control and we have achieved probabilistic predictability.

• Validity of conclusions

If the four assumptions are valid, the process is considered amenable to the generation of valid scientific conclusions.