Covariation Relationships

Causality and Lack Thereof
1. A clear temporal precedence of the cause before the effect is necessary for a causal relationship.

2. The roles of the dependent variable and the independent variable may switch depending on the perspective. E.g., the parents' income has an effect on the children's education, but a person's education level also influences her/his income potential.

Note: the parents' income precedes the child's education, and a person's education precedes her/his income.
3. Correlation may be caused by plain random coincidence. Statistical significance tests help to quantify the potential random effects.

  4. A lurking third variable, jointly related with the two variables under investigation, may induce a relationship. This is known as spurious correlation or a confounding effect.

Covariation Relationships
· This lecture discusses bivariate covariation between pairs of variables. In contrast, regression analysis assumes a cause-effect relationship.
· The focus of bivariate analysis is to explore how values taken by one variable co-vary with values taken by another variable.
· Relationships are not directed; thus, one cannot say X1 influences X2 (X1 → X2) or vice versa (X1 ← X2). Both variables just mutually co-vary: X1 ↔ X2.
· These bivariate co-variations can be explored
o for nominal and ordinal scaled variables by nested bar charts (recall the spinogram) and quantified in contingency tables;

o for quantitative variables we can scatterplot both variables and quantify their linear relationship with the Pearson correlation coefficient and non-linear but monotonic relationships with the Spearman rank correlation coefficient;
o for mixed qualitative and quantitative variables the co-variation of the quantitative variable can be investigated by stratifying its conditional statistics and distributions by the categories of the qualitative variable (recall the mean bar plot and parallel box plots).
· Def. Statistical Dependence: When the probability of a variable taking a particular value is influenced by a given value of another variable, then these two variables are statistically dependent.
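As an illustration of this definition, here is a minimal R sketch with simulated data (the variables region and usage are hypothetical): the conditional distribution of one categorical variable changes with the value of the other, which is exactly what statistical dependence means.

set.seed(42)
region <- sample(c("North", "South"), 200, replace = TRUE)
p.high <- ifelse(region == "North", 0.2, 0.7)       # P(usage = "high") depends on region
usage  <- ifelse(runif(200) < p.high, "high", "low")
round(prop.table(table(region, usage), 1), 2)       # conditional distributions differ => dependence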

Correlation Analysis

  • Correlation between two metric variables measures their linear relationship:
    o A positive linear relationship means that as variable X1 increases, so does variable X2.
    o A negative linear relationship means that as X1 increases, variable X2 decreases.
    o Zero correlation means that there is no systematic linear relationship between both variables.
    o This bivariate relationship must hold over the full support of X1 and X2, i.e., it cannot be piecewise for specific subsets of intervals.

Unless, however, a qualitative variable allows stratifying the relationship down into subgroups.
· The relationship between two metric variables is best displayed in a scatterplot. This is also an important exploratory tool; a minimal sketch follows.
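A minimal sketch with simulated data (hypothetical variables) that plots two positively related metric variables, marks the two means as reference lines, and reports the Pearson correlation:

set.seed(7)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)          # build in a positive linear relationship
plot(x1, x2, pch = 19, main = "Positive linear co-variation")
abline(v = mean(x1), h = mean(x2), lty = 2)    # quadrants around the two means
cor(x1, x2)                                    # clearly positive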

Meaningfulness of the Correlation Coefficient
· A statistical relationship per se does not need to be meaningful.
o Example: Confounding variable.
Birth rates and the density of storks in a rural German province over several decades in the last century.

  • Example: Induced Correlation

  • Example: Spurious correlation
    The factor "Size of Area" induces correlation. Normalization by the size of the area corrects for the absolute size effects.
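A minimal simulated sketch of this size effect (all names hypothetical): two unrelated counts both scale with the area of the region, which induces a strong correlation; dividing by the area, i.e., working with densities, removes it.

set.seed(1)
area   <- runif(50, 10, 1000)               # region sizes
countA <- 2.0 * area + rnorm(50, sd = 50)   # two unrelated counts that both scale with area
countB <- 0.1 * area + rnorm(50, sd = 5)
cor(countA, countB)                         # strong correlation induced by area alone
cor(countA / area, countB / area)           # roughly zero once normalized to densities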

Example: Aggregation effects

Correlation of individual observations (e.g., individuals living within regions), denoted by dots, versus correlation of aggregated observations (e.g., averaged individual data in regions), denoted by squares. => This is known as the modifiable areal unit problem (MAUP).
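A minimal simulated sketch of this aggregation effect (hypothetical data): the individual-level correlation is diluted by individual noise, while the correlation of the regional means is much stronger.

set.seed(2)
region <- rep(1:10, each = 30)            # 10 regions with 30 individuals each
x <- 0.5 * region + rnorm(300)            # shared regional trend plus individual noise
y <- 0.5 * region + rnorm(300)
cor(x, y)                                 # individual-level correlation (moderate)
xm <- tapply(x, region, mean)             # aggregate to regional means
ym <- tapply(y, region, mean)
cor(xm, ym)                               # aggregated correlation is much stronger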

Structure of Pearson’s Product-Moment Correlation Coefficient
· Empirically the joint variation between two variables is expressed by their measure of covariance:

$\operatorname{cov}(X_1, X_2) = \frac{1}{n-1} \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)$

Note that the denominator $n-1$ remains, just as in the variance.
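A tiny numerical check of the definition (illustrative vectors only): the hand computation matches R's built-in cov().

x1 <- c(2, 4, 6, 8)
x2 <- c(1, 3, 2, 5)
sum((x1 - mean(x1)) * (x2 - mean(x2))) / (length(x1) - 1)   # covariance by the formula
cov(x1, x2)                                                 # identical result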

  • Geometric interpretation: point pairs [xi1, xi2] fall into the four quadrants defined by their variation around their means. Pairs in the upper-right and lower-left quadrants contribute positive products to the covariance; pairs in the other two quadrants contribute negative products.

The covariance is not standardized to a comparable value range. It therefore becomes difficult to compare across different pairs of variables. This is because the scales of its parent variables X1 and X2 may differ.
· For these reasons the covariance is standardized by the standard deviations $s_{X_1}$ and $s_{X_2}$ of its parent variables. This gives the correlation coefficient:

$r = \frac{\operatorname{cov}(X_1, X_2)}{s_{X_1} \cdot s_{X_2}}$
  • Notes:
    o The value range of the correlation coefficient is restricted to the interval $-1 \le r \le +1$.
Negative values imply a negative linear relationship, a value of zero implies no linear relationship, and positive values imply a positive linear relationship.
o The denominator $n-1$ in the covariance and the standard deviations cancels out.
o Burt and Barber give an equivalent computational equation.
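A quick numerical check that standardizing the covariance by the two standard deviations reproduces R's cor() (again with illustrative vectors):

x1 <- c(2, 4, 6, 8)
x2 <- c(1, 3, 2, 5)
cov(x1, x2) / (sd(x1) * sd(x2))   # covariance standardized by both standard deviations
cor(x1, x2)                       # the built-in gives the identical value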

The Spearman Rank Correlation Coefficient
· The distributions of the variables do not need to be symmetric or free of outliers.

  • Replacing the observations in each variable by their associated ranks and then calculating the
    correlation between these ranks leads to the Spearman’s rank correlation coefficient.
    · Spearman’s rank correlation coefficient is robust against skewness and outliers.
    · It measures monotonically increasing or decreasing relationships, which do not necessarily need to follow a straight line. However, it cannot measure complex non-linear relationships properly.
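A minimal sketch of the rank idea (simulated data): Spearman equals Pearson computed on the ranks, and it reaches 1 for a monotone but non-linear relationship where Pearson stays below 1.

x <- 1:20
y <- exp(x / 4)                     # monotone but strongly non-linear
cor(x, y, method = "pearson")       # noticeably below 1
cor(x, y, method = "spearman")      # exactly 1 for any strictly monotone relationship
cor(rank(x), rank(y))               # Pearson on the ranks reproduces Spearman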

Scatterplot Matrix

There is a priori no reason to believe that the protein consumption of different food groups in 25 European countries before the fall of the iron curtain exhibits a directional influence.

Spurious Relationships in Cross-Tabulations 

According to a nautical fish story, a research assistant was asked to explore for the U.S. Coast Guard whether life vests really save lives.
The research assistant sampled 500 records of man-overboard accidents on large vessels with the following result:

Conclusion: Apparently wearing a vest increases the likelihood of being killed.
His supervisor felt extremely uncomfortable with these results and asked the research assistant to investigate the effect that the weather may have on the survival chances.
The partial results, broken down by the weather conditions, tell a different story, more in tune with common sense:

Conclusion: Under both weather conditions wearing a vest increases the likelihood of survival.
Explanation: In foul weather, sailors have the habit of putting life vests on, whereas in fair weather they mostly do not bother doing so. However, falling overboard in foul weather diminishes the likelihood of survival irrespective of wearing a vest or not (see red numbers).
Apparently, the weather conditions are correlated with the status of wearing a vest and, therefore, with the survival likelihood.
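The original counts are not reproduced in these notes, so the following sketch uses purely hypothetical numbers to show the mechanics of such a reversal (Simpson's paradox): every weather stratum favors the vest, while the aggregated table points the other way.

## hypothetical counts only -- the original tables are not reproduced here
foul <- matrix(c(80, 120, 10, 40), nrow = 2,
               dimnames = list(c("survived", "killed"), c("vest", "no vest")))
fair <- matrix(c(45, 5, 160, 40), nrow = 2,
               dimnames = list(c("survived", "killed"), c("vest", "no vest")))
round(prop.table(foul, 2), 2)          # foul weather: vest helps (0.40 vs 0.20 survival)
round(prop.table(fair, 2), 2)          # fair weather: vest helps (0.90 vs 0.80 survival)
round(prop.table(foul + fair, 2), 2)   # aggregated: vest looks harmful (0.50 vs 0.68)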

Berkley Tables.R

##

## Working with count data of nominal/ordinal scaled variables

## See also Chapter 6: Tables in Michael Crawley’s “The R-Book”

##

data(UCBAdmissions, package = "datasets")

UCBAdmissions                                   # Data come as cross-tabulation

ftable(UCBAdmissions)                           # reformat table into a "flat" table

Admiss <- as.data.frame(UCBAdmissions)        # Convert table into aggregated dataframe with "Freq" counts

##

## Work with simple pairwise cross-tabulations

##

AdmissByGen <- xtabs(Freq ~  Admit + Gender, data=Admiss)

## Make table with column percentages (admission rates within each gender)

AdmissByGenPropMat <- prop.table(AdmissByGen,2)

round(AdmissByGenPropMat,2)

##

## Redo pairwise cross-tabulations for males and females

##

## Males

AdmissMale <- Admiss[Admiss$Gender=="Male", ]

AdmissByDeptMale <- xtabs(Freq ~  Admit + Dept, data=AdmissMale)

round(prop.table(AdmissByDeptMale,2), 2)

## Females

AdmissFem <- Admiss[Admiss$Gender=="Female", ]

AdmissByDeptFem <- xtabs(Freq ~  Admit + Dept, data=AdmissFem)

round(prop.table(AdmissByDeptFem,2),2) 

BivariateMetricStatsPlots.R 

rm(list=ls(all=TRUE))

library(ellipse); library(corrplot); library(car); library(RColorBrewer); library(rgl); library(ggplot2)

data(diamonds, package="ggplot2")

## Select subset of variables

diaSub <- subset(diamonds,select = c(cut,price,carat,depth,table,x,y,z))

summary(diaSub)

## Correlation matrix

corMat <- cor(diaSub[ ,-1], method="pearson", use="complete.obs")  # exclude nominal "cut"

round(corMat,2)                                                       # print rounded to 2 digits

## Define Colors using RColorBrewer

colors <- RColorBrewer::brewer.pal(11,"RdBu")   # Use RColorBrewer diverging red to blue palette

colors <- rev(colors)                           # reverse order of colors so negative correlations are blue

## Recode corMat so it matches the 11 palette indices

colIdx <- 5*corMat + 6                          # cor=-1 -> 1; cor=0 -> 6; cor=1 -> 11

## Use library(ellipse) scatterplot function

ellipse::plotcorr(corMat, col=colors[colIdx])   # ellipse function

## Use library(corrplot) scatterplot function

corrplot::corrplot(corMat)                      # wrong default color coding !!!

corrplot::corrplot(corMat, col=colors, method="ellipse", order="FPC",

diag=FALSE, addCoef.col="green")

## standard scatterplot function pairs with TOO many data-points and outliers

pairs(~x+y+z, data=diaSub, labels=c("length","width","depth"),

main="Diamond Dimensions")

## use a sample of 1000 observations to minimize computational burden and gain visual resolution

diaSample <- diaSub[sample(1:nrow(diaSub),1000), ]

## car::scatterplotMatrix

car::scatterplotMatrix(~price+carat+depth+table, data=diaSample, diagonal="density") # is a car function. See formula syntax

## standard R scatterplot with interactive legend placement

plot(diaSample$carat,diaSample$price,pch=19,col=diaSample$cut,                       # color points according to cut category

xlab="Carat",ylab="Price",main="Diamonds Data")

abline(v=mean(diaSample$carat),h=mean(diaSample$price),lty=2)                        # reference lines

abline(lm(price~carat, data=diaSample), col="red", lwd=2)                            # add regression line

## interactive legend: use the mouse pointer to place it at the desired location

legend(locator(1),

title="Cut:",legend=c("fair","good","very good","premium","ideal"),

col=c(1,2,3,4,5),pch=19)

## Similar with ggplot

ggplot(diaSample, aes(x=carat, y=price)) +

geom_point( aes(color=cut) ) +                                                      # only points are colored by variable "cut"

geom_smooth(method="lm") +

geom_hline(yintercept = mean(diaSample$price), linetype="dashed") +

geom_vline(xintercept = mean(diaSample$carat), linetype="dashed") +

labs(title="Relationship between Price and Carat", x="Carat", y="Price")

## Similar but augmented with car::scatterplot

car::scatterplot(price~carat|cut, data=diaSample, legend.coords="topleft",           # Enhanced car function. See "|" syntax

legend.title="Cut:", boxplots="xy", smooth=FALSE, by.groups=FALSE)

## interactive 3D scatterplot. Look for RGL icon in task bar

rgl::plot3d(diaSample$carat, diaSample$price, diaSample$depth, col="red",

xlab="Carat", ylab="Price", zlab="Depth",

type="s",size=0.5)

## automatically spin rgl::plot3d plot

rgl::play3d(spin3d()) 

BivariateNominal.R 

rm(list=ls(all=TRUE))                                                     # Start clean

library(foreign); library(ggplot2)

options(digits=1)                                                          # shorten printed output; try digits=2 as well

###############################

## Practice table operations ##

###############################

( hh.car <- read.dbf("E:\\Lectures2017\\GISC6301\\Chapter04\\HHCar.dbf") ) # hh.car is an aggregated data.frame with factors

class(hh.car)

sapply(hh.car, is.factor)                                                  # check for factors

( hh.car.tab <- xtabs(Freq~cars+HHSize, data=hh.car) )                     # convert dataframe to table with factors as table dims

class(hh.car.tab)

(hh.car.df <- as.data.frame(hh.car.tab) )                                  # reverse: table to dataframe with "Freq" created

## Table margins and proportions

( addmargins(hh.car.tab) )                                                 # add surrounding sums

( prop.table(hh.car.tab) )                                                 # cell proportions from total

## Does number of cars determine the household size?

( hh.car.rowpro <- prop.table(hh.car.tab,1) )                              # row proportions

( addmargins(hh.car.rowpro, 2) )                                           # add row sums

## Household size determines number of cars

( hh.car.colpro <- prop.table(hh.car.tab,2) )                              # column proportions

( addmargins(hh.car.colpro, 1) )                                           # add column sums

## Dodge plot of absolute counts

ggplot(data=hh.car.df, aes(x=HHSize, y=Freq, fill=cars)) +

geom_bar(position="dodge", stat="identity")

## Spinogram

vcd::spine(xtabs(Freq~HHSize+cars, data=hh.car), main="Number of Cars by Household Size in %" )

## Stacked plot of absolute counts

ggplot(data=hh.car.df, aes(x=HHSize, y=Freq, fill=cars)) +

geom_bar(position="stack", stat="identity")

## Stacked plot of percentages

hh.car.colpro <- prop.table(hh.car.tab,2)                                 # column proportions

hh.car.colpro.df <- as.data.frame(hh.car.colpro)                          # convert table into data.frame; proportions end up in "Freq"

hh.car.colpro.df$Percent <- hh.car.colpro.df$Freq*100                     # transform proportions into percent

ggplot(data=hh.car.colpro.df, aes(x=HHSize, y=Percent, fill=cars)) +

geom_bar(position="stack", stat="identity")

##############################################

## List of 100 observations without factors ##

##############################################

( hh.car.lst <- read.dbf("E:\\Lectures2017\\GISC6301\\Chapter04\\HHCarList.dbf") )  # what is the structure ?

hh.car.lst$cars <- factor(hh.car.lst$cars,

labels=c("c:0","c:1","c:2","c:3"))                # convert numeric to factor with labels

hh.car.lst$HHSize <- factor(hh.car.lst$HHSize,

labels=c("p:2","p:3","p:4","p:5"))              # convert numeric to factor with labels

head(hh.car.lst, n=15)

( hh.car.tab <- xtabs( ~ HHSize+cars, data=hh.car.lst) )                    # convert data frame to table; note the one-sided formula syntax

#####################################################################

## Converts aggregated dataframe into individual records dataframe ##

#####################################################################

( hh.car.df <- as.data.frame(hh.car.tab) )                                   # convert table to data-frame

hh.car.lst <- as.data.frame(lapply(hh.car.df,

function(x) rep(x, hh.car.df$Freq)))      # expand to individual observations

hh.car.lst

########################################

## Cross-tabulations like SAS or SPSS ##

########################################

gmodels::CrossTable(hh.car.lst$HHSize, hh.car.lst$cars, chisq=TRUE)             # see help at gmodels::CrossTable

#######################################################

## Recoding a metric variable into an ordered factor ##

#######################################################

income <- rlnorm(200, meanlog=10, sdlog=1)/1000 + 15                        # 200 random household incomes in $1000

hist(income, xlab="Household Income (in $1000)")

( income.fac <- cut(income,breaks=c(min(income)-1,40,90,max(income)+1)) )   # split into classes of type factor

income.ord <- ordered(income.fac,levels(income.fac),

labels=c("low","medium","high"))                      # make ordered factor and rename classes

( income.tab <- table(income.ord) )

barplot(income.tab)                                                         # barplot; classes stay in their ordinal order rather than being sorted by frequency

Solution 

Part 1:

Task 1:

The table is the same for t1, t2, and t3.

x\y     1   2   3   4   Total
1       1   1   1   1       4
2       1   1   1   1       4
3       1   1   1   1       4
Total   3   3   3   3      12

Task 2:

The graph is the same for the three data frames.

Task 3:

The variables within each data frame are independent because the graph shows that each point has the same probability of occurring.

Task 4:

B1$X B2$X B3$X
24.50 24.50 24.53

Task 5:

Task 6:

We remark that for b2 and b3 the variables X and G are independent because the values of G are not affected by the values of X.

For b1, the value of G depends on the value of X: when we have a high value of X, we expect to obtain g = 3.

Task 7:

C1 :

1 -0.02468068
-0.02468068 1

C2 :

1 0.1050691
0.1050691 1

C3 :

1 0.5341502
0.5341502 1

Task 8:

C1:

C2:

C3:

Task 9:

The C1 graph does not help us to conclude. The C2 graph shows that the variables are independent. The C3 graph shows that the variables are dependent.

Part 2:

Males have about the same probability of being admitted or rejected, but females are rejected more often than admitted.

Task 12:

We remark that males are usually rejected in 4 of the 6 departments, while females are usually accepted in 4 of the 6 departments.

We can conclude that males are rejected more often than females.

Task 13:

Signs of the correlations of mpg with the other mtcars variables:

cyl: negative; disp: negative; hp: negative; drat: positive; wt: negative; qsec: positive; vs: positive; am: positive; gear: positive; carb: negative.

(Negative correlation: the two variables move in opposite directions; positive correlation: they move in the same direction.)

Task 14:

scatterplotMatrix(mtcars, var.labels=colnames(mtcars), diagonal="density")

Task 16: Sorry for this question, I don't know.

Code client.R 

setwd("C:/Users/AKOSMOS/Desktop/freelancer/project 1/Assigment")

load("Part1Data.Data")

#task 1:

TabContingence1=table(t1$X, t1$Y)

TabContingence1

TabContingence2=table(t2$X, t2$Y)

TabContingence2

TabContingence3=table(t3$X, t3$Y)

TabContingence3

#1

col=rep(0,3)

for(i in 1:3)

col[i]=sum(TabContingence1[i,c(1:4)])              # row totals

TabContingence1=cbind(TabContingence1,col)

ligne=rep(0,5)                                     # five columns after cbind, so length 5

for(i in 1:5)

ligne[i]=sum(TabContingence1[c(1:3),i])            # column totals

TabContingence1=rbind(TabContingence1,ligne)

TabContingence1

#2

col=rep(0,3)

for(i in 1:3)

col[i]=sum(TabContingence2[i,c(1:4)])              # row totals

TabContingence2=cbind(TabContingence2,col)

ligne=rep(0,5)                                     # five columns after cbind, so length 5

for(i in 1:5)

ligne[i]=sum(TabContingence2[c(1:3),i])            # column totals

TabContingence2=rbind(TabContingence2,ligne)

TabContingence2

#task2

spineplot(t1$X, t1$Y, main="t1$X and t1$Y")

spineplot(t2$X, t2$Y, main="t2$X and t2$Y")

spineplot(t3$X, t3$Y, main="t3$X and t3$Y")

#variables are independent because each cell has the same probability.

#task4

means=c(mean(b1$X),mean(b2$X),mean(b3$X))

#task5 and #task6

boxplot(X~G, data=b1) # dependent

boxplot(X~G, data=b2) # independent

boxplot(X~G, data=b3) # independent

#task7

cor(c1, use="complete.obs", method="kendall")

cor(c2, use="complete.obs", method="kendall")

cor(c3, use="complete.obs", method="kendall")

#task8

library(car)

scatterplot(X1~X2, data=c1, main="Enhanced Scatter Plot for c1")

scatterplot(X1~X2, data=c2, main="Enhanced Scatter Plot for c2")

scatterplot(X1~X2, data=c3, main="Enhanced Scatter Plot for c3")

#task9

#independent

#independent

#dependent

#==== Part2:

data(UCBAdmissions, package = "datasets")

#task10

ftable(UCBAdmissions)

Admiss <- as.data.frame(UCBAdmissions)

AdmissByGen <- xtabs(Freq ~  Admit + Gender, data=Admiss)

## Make table with column percentages (admission rates within each gender)

AdmissByGenPropMat <- prop.table(AdmissByGen,2)

round(AdmissByGenPropMat,2)

#task11

## Females

AdmissFem <- Admiss[Admiss$Gender=="Female", ]

AdmissByDeptFem <- xtabs(Freq ~  Admit + Dept, data=AdmissFem)

round(prop.table(AdmissByDeptFem,2),2)

#task12: write a comment in the report about gender admission bias

#========== Part3

data(mtcars, package="datasets")

names(mtcars)

#task13

cor(mtcars, use="complete.obs", method="kendall")

#task14

scatterplotMatrix(mtcars, var.labels=colnames(mtcars), diagonal="density")   # pick one diagonal panel type

#task15

library(corrplot)

M <- cor(mtcars)

corrplot(M, method="ellipse")