Covariation Relationships
Causality and Lack Thereof
1. A clear temporal precedence of the cause before the effect is necessary for a causal relationship.
2. The roles of the dependent variable and the independent variable may switch depending on the perspective. E.g., the parent's income has an effect on the children's education, but a person's education level also influences her/his income potential.
Note: the parent's income precedes the child's education, and a person's education precedes her/his income.
3. Correlation may be caused by plain random coincidence. Statistical significance tests help quantify these potential random effects.
4. A lurking third variable, jointly related to the two variables under investigation, may induce a relationship. This is known as spurious correlation or a confounding effect.
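The random-coincidence possibility from point 3 can be quantified in R with cor.test(). A minimal sketch on simulated variables that are generated independently of each other (the variable names are invented for illustration):

```r
set.seed(1)
x <- rnorm(30)                 # 30 draws from a standard normal
y <- rnorm(30)                 # generated independently of x
ct <- cor.test(x, y)           # t-test of H0: true correlation is zero
ct$estimate                    # the sample r is rarely exactly 0
ct$p.value                     # p-value for H0: zero correlation
```

Even for truly unrelated variables the sample correlation is almost never exactly zero; the p-value tells us whether the observed value is compatible with pure chance.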
Covariation Relationships
· This lecture discusses bivariate covariation between pairs of variables. In contrast, regression analysis assumes a cause-effect relationship.
· The focus of bivariate analysis is to explore how values taken by one variable co-vary with values taken by another variable.
· Relationships are not directed; thus, one cannot say X1 influences X2 (X1 → X2) or vice versa (X1 ← X2). Both variables just mutually co-vary (X1 ↔ X2).
· These bivariate co-variations can be explored
o for nominal and ordinal scaled variables by nested bar charts (recall the spinogram) and quantified in contingency tables;
o for quantitative variables by scatterplotting both variables and quantifying their linear relationship with the Pearson correlation coefficient and non-linear but monotonic relationships with the Spearman rank correlation coefficient;
o for mixed qualitative and quantitative variables by stratifying the conditional statistics and distributions of the quantitative variable by the categories of the qualitative variable (recall the mean bar plot and parallel boxplots).
· Def. Statistical Dependence: When the probability of a variable taking a particular value is influenced by a given value of another variable, then these two variables are statistically dependent.
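For two categorical variables, this definition can be probed with a chi-square test of independence. A small sketch with an entirely hypothetical 2x2 table of counts (both the variables and the numbers are invented for illustration):

```r
## Hypothetical 2x2 contingency table of counts
tab <- matrix(c(30, 20,    # group A: 30 in category 1, 20 in category 2
                10, 40),   # group B: 10 in category 1, 40 in category 2
              nrow = 2, byrow = TRUE,
              dimnames = list(Group = c("A", "B"),
                              Category = c("1", "2")))
## Chi-square test of independence: a small p-value indicates that
## the category probabilities differ between the groups, i.e.,
## the two variables are statistically dependent
chisq.test(tab)
```

Here the conditional probabilities clearly differ (30/50 versus 10/50 in category 1), so the test rejects independence.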
Correlation Analysis
- Correlation between two metric variables measures their linear relationship:
o A positive linear relationship means that as variable X1 increases, so does variable X2.
o A negative linear relationship means that as X1 increases, variable X2 decreases.
o Zero correlation means that there is no systematic linear relationship between the two variables.
o This bivariate relationship must hold over the full support of X1 and X2, i.e., it cannot hold just piecewise for specific subsets of intervals.
Unless, however, a qualitative variable allows stratifying the relationship into subgroups.
· The relationship between two metric variables is best displayed in a scatterplot. This is also an important exploratory tool.
Meaningfulness of the Correlation Coefficient
· A statistical relationship per se does not need to be meaningful.
o Example: Confounding variable. Birth rates and the density of storks in a rural German province over several decades in the last century.
- Example: Induced Correlation
- Example: Spurious correlation
The factor "Size of Area" induces correlation. Normalization by the size of the area corrects for the absolute size effects.
Example: Aggregation effects
Correlation of individual observations (e.g., individuals living within regions), denoted by dots, versus correlation of aggregated observations (e.g., averaged individual data in regions), denoted by squares. => This is known as the modifiable areal unit problem (MAUP).
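This aggregation effect can be reproduced in a small simulation (the setup and variable names are made up for illustration): the individual-level correlation is weak because individual noise dominates, while the correlation of the regional means is much stronger because the noise averages out within regions.

```r
set.seed(42)
region <- rep(1:20, each = 50)            # 20 regions, 50 individuals each
regEffect <- rnorm(20)[region]            # shared regional component
x <- regEffect + rnorm(1000, sd = 2)      # individual noise dominates
y <- regEffect + rnorm(1000, sd = 2)
rInd <- cor(x, y)                         # individual-level correlation (weak)
xBar <- tapply(x, region, mean)           # aggregate to regional means
yBar <- tapply(y, region, mean)
rAgg <- cor(xBar, yBar)                   # aggregated correlation (stronger)
c(individual = rInd, aggregated = rAgg)
```

The substantive point: correlations computed on aggregated data must not be interpreted as if they held for the underlying individuals.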
Structure of Pearson’s Product-Moment Correlation Coefficient
· Empirically the joint variation between two variables is expressed by their measure of covariance:
cov(X1, X2) = s12 = sum_i (x_i1 − x̄1) · (x_i2 − x̄2) / (n − 1)
Note that the denominator n − 1 remains.
- Geometric interpretation: point pairs [x1, x2] lie in quadrants defined by their variation around their means.
The covariance is not standardized to a comparable value range; it is therefore difficult to compare across different pairs of variables.
This is because the scales of its parent variables X1 and X2 may differ.
· For these reasons the covariance is standardized by the standard deviations s1 and s2 of its parent variables.
This gives the correlation coefficient:
r12 = s12 / (s1 · s2)
- Notes:
o The value range of the correlation coefficient is restricted to the interval −1 ≤ r12 ≤ +1. Negative values imply a negative linear relationship, a value of zero implies no linear relationship, and positive values imply a positive linear relationship.
o The denominator n − 1 in the covariance and in the standard deviations cancels out.
o Burt and Barber give equivalent computational equations.
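The standardization above can be verified directly in R. Using the built-in cars data as a stand-in example, dividing the covariance by the product of the two standard deviations reproduces cor():

```r
x1 <- cars$speed                      # built-in example data
x2 <- cars$dist
s12 <- cov(x1, x2)                    # covariance (denominator n - 1)
r12 <- s12 / (sd(x1) * sd(x2))        # standardize by both standard deviations
all.equal(r12, cor(x1, x2))           # identical to Pearson's r
```

Because cov() and sd() both use the denominator n − 1, it cancels in the ratio, as noted above.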
The Spearman Rank Correlation Coefficient
· The distributions of the variables do not need to be symmetric or free of outliers.
- Replacing the observations in each variable by their associated ranks and then calculating the correlation between these ranks leads to Spearman's rank correlation coefficient.
· Spearman's rank correlation coefficient is robust against skewness and outliers.
· It measures monotonically increasing or decreasing relationships, which do not necessarily need to follow a straight line. However, it cannot measure complex non-linear relationships properly.
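The rank definition can be checked in a few lines: Pearson's r applied to the ranks equals cor(..., method = "spearman"). A monotone but non-linear (exponential) relationship serves as the example; the simulated data are for illustration only.

```r
set.seed(7)
x <- rnorm(100)
y <- exp(x) + rnorm(100, sd = 0.1)          # monotone but non-linear in x
rPearsonOnRanks <- cor(rank(x), rank(y))    # Pearson r computed on the ranks
rSpearman <- cor(x, y, method = "spearman") # built-in Spearman coefficient
all.equal(rPearsonOnRanks, rSpearman)       # the two definitions agree
```

With continuous data there are no ties, so the equality holds exactly up to floating-point tolerance.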
Scatterplot Matrix
There is a priori no reason to believe that the protein consumption of different food groups in 25 European countries before the fall of the iron curtain exhibits a directional influence.
Spurious Relationships in Cross-Tabulations
According to a nautical fish story, a research assistant was asked to explore for the U.S. Coast Guard whether life vests really save lives.
The research assistant sampled 500 records of man-overboard accidents on large vessels with the following result:
Conclusion: Apparently wearing a vest increases the likelihood of being killed.
His supervisor felt extremely uncomfortable with these results and asked the research assistant to investigate the effect that weather may have on the survival chances.
The partial results broken down by the weather conditions tell a different story, more in tune with common sense:
Conclusion: Under both weather conditions wearing a vest increases the likelihood of survival.
Explanation: In foul weather, sailors have the habit of putting life vests on, whereas in fair weather they mostly do not bother doing so. However, falling overboard in foul weather diminishes the likelihood of survival irrespective of wearing a vest or not (see red numbers).
Apparently, the weather conditions are correlated with the status of wearing a vest and, therefore, with the survival likelihood.
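This reversal can be reproduced with a small 2x2x2 table. The counts below are entirely hypothetical (the lecture's actual tables are not reproduced here); they are chosen only so that, aggregated over weather, vest wearers survive less often, while within each weather stratum they survive more often:

```r
## Hypothetical counts (invented for illustration): 500 man-overboard records
acc <- expand.grid(Vest = c("yes", "no"),
                   Outcome = c("survived", "died"),
                   Weather = c("foul", "fair"))
acc$Freq <- c(60, 10, 140, 40,    # foul weather: vest 60/200, no vest 10/50
              45, 160, 5, 40)     # fair weather: vest 45/50, no vest 160/200
## Aggregated over weather: vest wearers appear WORSE off
margTab <- xtabs(Freq ~ Vest + Outcome, data = acc)
prop.table(margTab, 1)            # survival share by vest status
## Stratified by weather: vest wearers do BETTER in each stratum
partTab <- xtabs(Freq ~ Vest + Outcome + Weather, data = acc)
prop.table(partTab, c(1, 3))      # survival share by vest status, per weather
```

The trick is the confounding: vests are worn mostly in foul weather, where survival is low for everyone, which drags down the aggregated survival rate of vest wearers.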
Berkley Tables.R
##
## Working with count data of nominal/ordinal scaled variables
## See also Chapter 6: Tables in Michael Crawley's "The R-Book"
##
data(UCBAdmissions, package = "datasets")
UCBAdmissions # Data come as a cross-tabulation
ftable(UCBAdmissions) # reformat table into a "flat" table
Admiss <- as.data.frame(UCBAdmissions) # Convert table into aggregated dataframe with "Freq" counts
##
## Work with simple pairwise cross-tabulations
##
AdmissByGen <- xtabs(Freq ~ Admit + Gender, data=Admiss)
## Make table with column percentages (admission shares within each gender)
AdmissByGenPropMat <- prop.table(AdmissByGen, 2)
round(AdmissByGenPropMat,2)
##
## Redo pairwise cross-tabulations for males and females
##
## Males
AdmissMale <- Admiss[Admiss$Gender=="Male", ]
AdmissByDeptMale <- xtabs(Freq ~ Admit + Dept, data=AdmissMale)
round(prop.table(AdmissByDeptMale,2), 2)
## Females
AdmissFem <- Admiss[Admiss$Gender=="Female", ]
AdmissByDeptFem <- xtabs(Freq ~ Admit + Dept, data=AdmissFem)
round(prop.table(AdmissByDeptFem,2),2)
BivariateMetricStatsPlots.R
rm(list=ls(all=TRUE))
library(ellipse); library(corrplot); library(car); library(RColorBrewer); library(rgl); library(ggplot2)
data(diamonds, package="ggplot2")
## Select subset of variables
diaSub <- subset(diamonds,select = c(cut,price,carat,depth,table,x,y,z))
summary(diaSub)
## Correlation matrix
corMat <- cor(diaSub[ ,-1], method="pearson", use="complete.obs") # exclude nominal "cut"
round(corMat,2) # print rounded to 2 digits
## Define Colors using RColorBrewer
colors <- RColorBrewer::brewer.pal(11,"RdBu") # Use RColorBrewer diverging red-to-blue palette
colors <- rev(colors) # reverse order of colors so negative correlations are blue
## Recode corMat so it matches the 11 color indices of the palette
colIdx <- 5*corMat + 6 # cor=-1 -> 1; cor=0 -> 6; cor=1 -> 11
## Use library(ellipse) scatterplot function
ellipse::plotcorr(corMat, col=colors[colIdx]) # ellipse function
## Use library(corrplot) scatterplot function
corrplot::corrplot(corMat) # wrong default color coding !!!
corrplot::corrplot(corMat, col=colors, method="ellipse", order="FPC",
                   diag=FALSE, addCoef.col="green")
## standard scatterplot function pairs with TOO many data-points and outliers
pairs(~x+y+z, data=diaSub, labels=c("length","width","depth"),
      main="Diamond Dimensions")
## use a sample of 1000 observations to minimize computational burden and gain visual resolution
diaSample <- diaSub[sample(1:nrow(diaSub),1000), ]
## car::scatterplotMatrix
car::scatterplotMatrix(~price+carat+depth+table, data=diaSample, diagonal="density") # a car function; note the formula syntax
## standard R scatterplot with interactive legend placement
plot(diaSample$carat, diaSample$price, pch=19, col=diaSample$cut, # color points according to "cut"
     xlab="Carat", ylab="Price", main="Diamonds Data")
abline(v=mean(diaSample$carat),h=mean(diaSample$price),lty=2) # reference lines
abline(lm(price~carat, data=diaSample), col=”red”, lwd=2) # add regression line
## interactive legend. Use the mouse pointer to place it at the desired location
legend(locator(1),
       title="Cut:", legend=c("fair","good","very good","premium","ideal"),
       col=c(1,2,3,4,5), pch=19)
## Similar with ggplot
ggplot(diaSample, aes(x=carat, y=price)) +
  geom_point(aes(color=cut)) + # only points are colored by variable "cut"
  geom_smooth(method="lm") +
  geom_hline(yintercept = mean(diaSample$price), linetype="dashed") +
  geom_vline(xintercept = mean(diaSample$carat), linetype="dashed") +
  labs(title="Relationship between Price and Carat", x="Carat", y="Price")
## Similar but augmented with car::scatterplot
car::scatterplot(price~carat|cut, data=diaSample, legend.coords="topleft", # enhanced car function; note the "|" syntax
                 legend.title="Cut:", boxplots="xy", smooth=FALSE, by.groups=FALSE)
## interactive 3D scatterplot. Look for RGL icon in task bar
rgl::plot3d(diaSample$carat, diaSample$price, diaSample$depth, col="red",
            xlab="Carat", ylab="Price", zlab="Depth",
            type="s", size=0.5)
## automatically spin rgl::plot3d plot
rgl::play3d(spin3d())
BivariateNominal.R
rm(list=ls(all=TRUE)) # Start clean
library(foreign); library(ggplot2)
options(digits=2) # round printed output to about 2 significant digits
###############################
## Practice table operations ##
###############################
( hh.car <- read.dbf("E:\\Lectures2017\\GISC6301\\Chapter04\\HHCar.dbf") ) # hh.car is an aggregated data.frame with factors
class(hh.car)
sapply(hh.car, is.factor) # check for factors
( hh.car.tab <- xtabs(Freq~cars+HHSize, data=hh.car) ) # convert dataframe to table with factors as table dims
class(hh.car.tab)
( hh.car.df <- as.data.frame(hh.car.tab) ) # reverse: table to dataframe with "Freq" created
## Table margins and proportions
( addmargins(hh.car.tab) ) # add surrounding sums
( prop.table(hh.car.tab) ) # cell proportions from total
## Does number of cars determine the household size?
( hh.car.rowpro <- prop.table(hh.car.tab,1) ) # row proportions
( addmargins(hh.car.rowpro, 2) ) # add row sums
## Household size determines number of cars
( hh.car.colpro <- prop.table(hh.car.tab,2) ) # column proportions
( addmargins(hh.car.colpro, 1) ) # add column sums
## Dodge plot of absolute counts
ggplot(data=hh.car.df, aes(x=HHSize, y=Freq, fill=cars)) +
  geom_bar(position="dodge", stat="identity")
## Spinogram
vcd::spine(xtabs(Freq~HHSize+cars, data=hh.car), main="Number of Cars by Household Size in %")
## Stacked plot of absolute counts
ggplot(data=hh.car.df, aes(x=HHSize, y=Freq, fill=cars)) +
  geom_bar(position="stack", stat="identity")
## Stacked plot of percentages
hh.car.colpro <- prop.table(hh.car.tab,2) # column proportions
hh.car.colpro.df <- as.data.frame(hh.car.colpro) # convert table into data.frame. Note: "Freq" now holds proportions
hh.car.colpro.df$Percent <- hh.car.colpro.df$Freq*100 # transform proportions into percent
ggplot(data=hh.car.colpro.df, aes(x=HHSize, y=Percent, fill=cars)) +
  geom_bar(position="stack", stat="identity")
##############################################
## List of 100 observations without factors ##
##############################################
( hh.car.lst <- read.dbf("E:\\Lectures2017\\GISC6301\\Chapter04\\HHCarList.dbf") ) # what is the structure?
hh.car.lst$cars <- factor(hh.car.lst$cars,
                          labels=c("c:0","c:1","c:2","c:3")) # convert numeric to factor with labels
hh.car.lst$HHSize <- factor(hh.car.lst$HHSize,
                            labels=c("p:2","p:3","p:4","p:5")) # convert numeric to factor with labels
head(hh.car.lst, n=15)
( hh.car.tab <- xtabs( ~ HHSize+cars, data=hh.car.lst) ) # convert data-frame to table; note the one-sided formula syntax
#####################################################################
## Converts aggregated dataframe into individual records dataframe ##
#####################################################################
( hh.car.df <- as.data.frame(hh.car.tab) ) # convert table to data-frame
hh.car.lst <- as.data.frame(lapply(hh.car.df,
function(x) rep(x, hh.car.df$Freq))) # expand to individual observations
hh.car.lst
########################################
## Cross-tabulations like SAS or SPSS ##
########################################
gmodels::CrossTable(hh.car.lst$HHSize, hh.car.lst$cars, chisq=T) # see help at gmodels::CrossTable
#######################################################
## Recoding a metric variable into an ordered factor ##
#######################################################
income <- rlnorm(200, mean=10, sd=1)/1000 + 15 # 200 random household incomes in $1000
hist(income, xlab="Household Income (in $1000)")
( income.fac <- cut(income, breaks=c(min(income)-1, 40, 90, max(income)+1)) ) # split into classes of type factor
income.ord <- ordered(income.fac, levels(income.fac),
                      labels=c("low","medium","high")) # make ordered factor and rename classes
( income.tab <- table(income.ord) )
barplot(income.tab) # barplot (kept unsorted because the variable is ordinal)
Solution
Part 1:
Task 1:
The table is the same for t1, t2, and t3.
x\y   | 1 | 2 | 3 | 4 | Total
1     | 1 | 1 | 1 | 1 | 4
2     | 1 | 1 | 1 | 1 | 4
3     | 1 | 1 | 1 | 1 | 4
Total | 3 | 3 | 3 | 3 | 12
Task 2:
The graph is the same for the three data frames.
Task 3:
The variables within each data frame are independent because the graph shows that each combination of values occurs with the same probability.
Task4 :
B1$X  | B2$X  | B3$X
24.50 | 24.50 | 24.53
Task 5:
Task 6:
We remark that for b2 and b3, variables X and G are independent because the values of G are not affected by the values of X.
For b1, the value of G depends on the value of X: when X is large, we expect g = 3.
Task 7:
C1:
 1.00000000 | -0.02468068
-0.02468068 |  1.00000000
C2:
1.0000000 | 0.1050691
0.1050691 | 1.0000000
C3:
1.0000000 | 0.5341502
0.5341502 | 1.0000000
Task 8:
C1:
C2:
C3:
Task 9:
The C1 graph does not help us conclude. The C2 graph shows that the variables are independent. The C3 graph shows that the variables are dependent.
Part 2:
Males are about equally likely to be admitted or rejected, but females are rejected more often than admitted.
Task 12:
We remark that males are usually rejected in 4 of the 6 departments, while females are usually accepted in 4 of the 6 departments.
From this we can conclude that, at the department level, males are rejected more often than females.
Sign of the correlation of mpg with each of the other mtcars variables:
Variable             | cyl      | disp     | hp       | drat     | wt       | qsec     | vs       | am       | gear     | carb
Correlation with mpg | negative | negative | negative | positive | negative | positive | positive | positive | positive | negative
(negative: the variables move in opposite directions; positive: they move in the same direction)
Task 14:
scatterplotMatrix(mtcars, var.labels=colnames(mtcars), diagonal="density")
Task 16: sorry, I do not know how to answer this question.
Code client.R
setwd("C:/Users/AKOSMOS/Desktop/freelancer/project 1/Assigment")
load("Part1Data.Data")
#task 1:
TabContingence1=table(t1$X, t1$Y)
TabContingence1
TabContingence2=table(t2$X, t2$Y)
TabContingence2
TabContingence3=table(t3$X, t3$Y)
TabContingence3
#1
col=rep(0,3)
for(i in 1:3)
  col[i]=sum(TabContingence1[i,c(1:4)])
TabContingence1=cbind(TabContingence1,col)
ligne=rep(0,5)                     # five columns after cbind
for(i in 1:5)
  ligne[i]=sum(TabContingence1[c(1:3),i])
TabContingence1=rbind(TabContingence1,ligne)
TabContingence1
#2
col=rep(0,3)
for(i in 1:3)
  col[i]=sum(TabContingence2[i,c(1:4)])
TabContingence2=cbind(TabContingence2,col)
ligne=rep(0,5)                     # five columns after cbind
for(i in 1:5)
  ligne[i]=sum(TabContingence2[c(1:3),i])
TabContingence2=rbind(TabContingence2,ligne)
TabContingence2
#task2
spineplot(t1$X, t1$Y, main="t1$X and t1$Y")
spineplot(t2$X, t2$Y, main="t2$X and t2$Y")
spineplot(t3$X, t3$Y, main="t3$X and t3$Y")
#variables are independent because each cell has the same probability
#task4
means=c(mean(b1$X),mean(b2$X),mean(b3$X))
#task5 and #task6
boxplot(X~G, data=b1) #dependent
boxplot(X~G, data=b2) #independent
boxplot(X~G, data=b3) #independent
#task7
cor(c1, use="complete.obs", method="kendall")
cor(c2, use="complete.obs", method="kendall")
cor(c3, use="complete.obs", method="kendall")
#task8
library(car)
scatterplot(X1~X2, data=c1, main="Enhanced Scatter Plot for c1")
scatterplot(X1~X2, data=c2, main="Enhanced Scatter Plot for c2")
scatterplot(X1~X2, data=c3, main="Enhanced Scatter Plot for c3")
#task9
#independent
#independent
#dependent
#==== Part2:
data(UCBAdmissions, package = "datasets")
#task10
ftable(UCBAdmissions)
Admiss <- as.data.frame(UCBAdmissions)
AdmissByGen <- xtabs(Freq ~ Admit + Gender, data=Admiss)
## Make table with column percentages (within each gender)
AdmissByGenPropMat <- prop.table(AdmissByGen,2)
round(AdmissByGenPropMat,2)
#task11
## Females
AdmissFem <- Admiss[Admiss$Gender=="Female", ]
AdmissByDeptFem <- xtabs(Freq ~ Admit + Dept, data=AdmissFem)
round(prop.table(AdmissByDeptFem,2),2)
#task12: write a comment in the report about gender admission bias
#========== Part3
data(mtcars, package="datasets")
names(mtcars)
#task13
cor(mtcars, use="complete.obs", method="kendall")
#task14
scatterplotMatrix(mtcars, var.labels=colnames(mtcars), diagonal="density") # diagonal takes a single panel type
#task15
library(corrplot)
M <- cor(mtcars)
corrplot(M, method="ellipse")