# Construct and Correlate Data

Part I: Construct and Correlate Data

Please run the following code in R:

library(psych) # because you’ll need it later

x <- rnorm(30,0,1) # generates 30 random normal numbers

# with a mean of 0 and an SD of 1

y <- rnorm(30,0,1)

xydata<- data.frame(x=c(x),y=c(y)) #puts it into a dataframe

cor(xydata)

Questions:

1a.  What is the expected value of the correlation between these columns of numbers?

1b.  If you generate 10 different sets of random numbers, do the resulting correlations differ from the expected value? If so, why?

1c. Now redo steps 1a and 1b, but with x and y having 100 cases instead of 30. Do you find anything different between these analyses? If so, what do you find? Do you have an explanation for any difference in your findings?

Part II:

Construct two more arrays of 30 data points each using the code above. Then construct a graph that provides a plot of the data points in a two-dimensional space (i.e., values for x on the x-axis, values for y on the y-axis; a scatterplot). Here’s the simplest way to do that in R:

plot(xydata)

In addition, use the describe function from the psych library to print out values for the various descriptive statistics of each variable. In case it’s not obvious, the command would be describe(xydata).

Question:

2a.  Run all of the code several times (starting with the random number generation) and see if you can spot any patterns between the magnitude of the correlation, the appearance of the scatterplot, and the various descriptive statistics. Note your observations here.

# Part III

The following code will generate correlated data with a population value of .4 from random numbers. It then puts the data into a dataframe, calculates the correlation, and requests a scatterplot.

r <- .4

x <- rnorm(30,0,1)

e <- rnorm(30,0,1)

y <- x*r+e*(sqrt(1-r^2))

xydata<- data.frame(x=c(x),y=c(y))

cor(xydata)

plot(xydata)

Use the code to generate correlated data for r's (i.e., population correlations) of .00, .25, .50, .75, and 1.00.

Questions:

3a. What happens to the shape of the plots as the magnitude of the correlation increases?

3b.  By examining how the formula for y works with the values of r above (the extremes of 0 and 1 are particularly useful), explain how the formula forces y to correlate with x at the specified level.

Part IV

Using the data for the r=.75 plot constructed in Part III (or another run using r = .75), identify the rows in your data set that have the lowest values for x. One way to do this in R (I’ll not say the best way, as I likely don’t know the best way) is to use the View(xydata) command to get the spreadsheet view of the two variables. Once there, you can click on the variable name and it will sort your data from lowest to highest (the first click) or from highest to lowest (the second click). Notice that the row numbers in the data View identify which row is which (i.e., allowing you to identify which rows have the lowest values). Look through the lowest x values until you find one that has a similarly low value for y. Now, replace that y value with a value that is approximately the same but opposite in sign. Here are a couple of ways to do this in R. Let’s say you want to change the 12th row of y to 2.31…

xydata$y[12] <- 2.31 or… xydata[12,2] <- 2.31

Both do the same thing. The first identifies the y variable by name using the $ operator and then specifies the 12th row in brackets afterward, whereas the second specifies the 12th row of the 2nd column (which happens to be the y variable). At any rate, once you’ve done this, you’ve created your own outlier!
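If you would rather find the candidate rows by code than by clicking in the View window (just one possible approach, not part of the assignment), order() and which() do the same job. A sketch, assuming the data were built with the Part III code at r = .75:

```r
# Build correlated data as in Part III (r = .75)
r <- .75
x <- rnorm(30, 0, 1)
e <- rnorm(30, 0, 1)
y <- x * r + e * sqrt(1 - r^2)
xydata <- data.frame(x = x, y = y)

# Rows sorted from lowest to highest x (original row numbers are kept)
head(xydata[order(xydata$x), ])

# Row numbers where both x and y are low, e.g. both below -1
which(xydata$x < -1 & xydata$y < -1)

# Flip the sign of the y value in a chosen row, e.g. the row with the lowest x
i <- which.min(xydata$x)
xydata$y[i] <- -xydata$y[i]
```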

Questions:

4a. What impact does changing the sign of this y value have on the magnitude of the correlation and the appearance of the scatterplot?

4b. Construct another data set that yields a correlation of r = .75, but construct this data set with 100 pairs of numbers. Go ahead and change a y-value as you did before. How do your results compare to those from 4a? If they differ, why?

Part V: Transformations

Transformations are manipulations often used to change the characteristics of data. For instance, standardizing is a manipulation designed to create a common scale for comparing data sets. It combines two steps: a constant (the sample mean) is subtracted from each data point so that a common mean is established, and each data point is divided by a constant (the sample SD) so that the resulting standard deviations of the data sets are equal. This form of standardization produces z-scores, which have a mean of 0 and an SD of 1, but there are other, similar standardizations. For example, the SATs, GREs, and other high-stakes tests have at one time or another set the mean of each test at 500 and the SD at 100, though now each test uses slightly different types of standardization.
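As a concrete sketch of this kind of standardization (the raw scores below are made-up numbers purely for illustration):

```r
# Hypothetical raw test scores
raw <- c(480, 510, 530, 495, 560)

# z-scores: subtract the sample mean, divide by the sample SD
z <- (raw - mean(raw)) / sd(raw)
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1

# Rescale to a mean of 500 and an SD of 100, as some tests have done
rescaled <- z * 100 + 500
mean(rescaled)  # 500
sd(rescaled)    # 100
```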

Using the R code above, construct a dataframe containing two (uncorrelated) arrays of 100 random normal numbers (x and y) and find their correlation.

Questions:

5a.  Construct two more variables (x2 and y2) based on x and y by adding a constant (e.g., 2.00) to one or both columns of numbers. You can do this easily using code such as:

xydata$x2 <- x + 2.00

This creates a new variable (x2) that is a function of the existing variable x.

What effect do these transformations have on the value of the correlation between the two new variables?

5b.  Construct two more new variables (x3 and y3) by multiplying one or both columns by a constant (e.g., 2.00). What effect does this have on the resulting correlation?

5c.  Based on your data for 5a and 5b, what effect would you estimate that standardizing has on the magnitude of correlations? Prepare another two variables to test your theory.

5d.  In some areas, we are often encouraged to engage in other forms of transformation to assist in hypothesis testing. Two examples are squaring variables (X^2) or taking logarithms (e.g., ln, the natural log). Conduct a test to determine what effect, if any, squared or logarithmic transformations of one or both variables have on the magnitudes of correlations.
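One way to set up these transformed variables (a sketch, assuming x and y are the uncorrelated variables constructed above; the column names xsq and xlog are ours). Note that log() of a negative number is NaN, so expect a warning:

```r
# Two independent standard normal variables, as in the setup above
x <- rnorm(100, 0, 1)
y <- rnorm(100, 0, 1)
xydata <- data.frame(x = x, y = y)

xydata$xsq  <- x^2     # squared transformation
xydata$xlog <- log(x)  # warns "NaNs produced": log is undefined for negative x

# Expected to be near 0, since x and y are independent
cor(xydata$xsq, xydata$y)
```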

Part VI: Dichotomization

In many instances the variables we are interested in studying may be continuous (e.g., time), ordinal (e.g., number of males on a four-member team), or dichotomous (e.g., gender, yes/no decisions). Using continuous variables maximizes information by matching the values a variable can hold as closely as possible to the characteristics of whatever we wish to study. But sometimes, particularly with continuous variables, assessments are made with levels data (e.g., a 1-5 Likert scale) or with dichotomous responses. An important question is to what extent, if at all, this influences the resulting data.

Construct a new set of 200 x and y values set to correlate at .50.

Questions:

6a.  Create a new x variable that has a value of 0 if x is less than or equal to 0 and 1 for all other values. To do this in R, adapt the code you used in EX1 to recode gender to a 0/1 variable. Now calculate the correlation between the new x variable and y. How does this correlation compare to the correlation between the original x and y?

6b.  Now conduct additional analyses dichotomizing the data at different split points. Specifically, try making the cut point at .5 SD above the mean, 1 SD above the mean, and 1.5 SD above the mean. As these variables are normally distributed with a mean of 0 and an SD of 1, you can just use the appropriate z-score for each cut. After recalculating the correlations, how do these values compare to each other and to those from 6a?

6c.  Now see what happens when both columns are dichotomized. First, dichotomize both y and x at 1.5 SDs above the mean and compare the new correlation to the original. Next, dichotomize y (but not x) at 1.5 SDs below the mean (i.e., at z = -1.5). What do you find?

# Part VII: Range Restriction/Enhancement

Range restriction or range enhancement occurs when, either by choice or as a result of some systematic factor, our method of developing a sample systematically excludes individuals representing entire regions of the population (as when we choose to study only top-performing organizations or individuals). Range restriction occurs when this produces a systematic reduction in variance on a target measure. Range enhancement occurs when the result is increased variance (e.g., when an analysis includes only extremely high and low performers and omits those with mid-range values).

Questions:

7a.  Construct a new data set that has 500 pairs of x and y values with a population correlation of .50. Be sure to examine the descriptives of these two variables as well as the correlation and scatterplot. Now use the following code to chop off 25% of the cases based on their (low) scores on x:

newxy <- xydata[xydata$x > quantile(xydata$x, 0.25), ]

Now recalculate the descriptives, correlation, and scatterplot. What impact does this have on the magnitude of the correlation and on the value of the means and standard deviations for the two variables?

7b.  Now let’s see what happens with range enhancement. To accomplish this, we’ll put the data from the lowest 25% x values in one dataframe, then we’ll put the data from the highest 25% x values in another dataframe. Finally, we’ll bind the two dataframes back together and compare the correlations. Here’s the code:

new1xy <- xydata[xydata$x < quantile(xydata$x, 0.25), ]

new2xy <- xydata[xydata$x > quantile(xydata$x, 0.75), ]

new3xy <- rbind(new1xy, new2xy)

What impact does this have on the values of the descriptives and correlation?

Solution

Part I: Construct and Correlate Data

1a. Since the columns x and y are generated independently, the expected value of the correlation between them is 0 (independent variables have a correlation of 0).

1b. Yes. If we generate 10 different sets of random numbers, the correlations differ from the expected value of 0. Because the data are random, the sample correlation coefficient is itself random: it does not come out exactly equal to the expected value of 0 in any given sample, though in most samples it is very close to 0.

1c. When we redo the above steps with 100 cases instead of 30, we find that the variability of the correlations decreases: the correlations are closer to 0 than with 30 cases.

30 cases, 10 correlations:

-0.17794934 -0.15272664  0.10095654 -0.06812813  0.05571866 -0.16288484  0.24090639  0.11051410  0.02632407 -0.35227070

Standard deviation of correlations = 0.1764127

100 cases, 10 correlations:

0.074369887  0.047854752  0.187183318  0.052954233  0.166504561 -0.001601338  0.068468256 -0.070076552  0.246488088 -0.109639319

Standard deviation of correlations = 0.1114916

As we can see, the standard deviation of the correlations with 30 observations is higher than with 100 observations, so the spread is greater with 30 cases.
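A simulation along these lines can be written compactly (a sketch; the helper name sim_cors is ours, and the exact values will differ from run to run):

```r
# Draw `reps` pairs of independent normal samples of size n and
# return the sample correlation from each pair.
sim_cors <- function(n, reps = 10) {
  replicate(reps, cor(rnorm(n, 0, 1), rnorm(n, 0, 1)))
}

cors30  <- sim_cors(30)
cors100 <- sim_cors(100)

sd(cors30)   # spread of the 10 correlations with 30 cases
sd(cors100)  # typically smaller with 100 cases
```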

library(psych) # because you’ll need it later

x <- rnorm(30,0,1) # generates 30 random normal numbers

# with a mean of 0 and an SD of 1

y <- rnorm(30,0,1)

xydata<- data.frame(x=c(x),y=c(y)) #puts it into a dataframe

cor(xydata)

Running this, we get the output:

x          y

x 1.00000000 0.02238099

y 0.02238099 1.00000000

So the correlation between x and y for this sample is 0.02238099.

Part II:

2a. After running the code several times, it seems that whenever the correlation is relatively high, the scatterplot shows something like a linear relationship between the variables, and whenever the correlation is low, the scatterplot looks random.

We did not find any notable relationship between the descriptive statistics and the correlation.

Examples of the descriptive statistics from two runs:

Case 1

vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
x    1 30 0.09 0.76   0.22    0.13 0.73 -1.80 1.56  3.36 -0.45    -0.41 0.14
y    2 30 0.06 0.85   0.18    0.05 0.88 -1.26 1.59  2.86  0.00    -1.17 0.15

Correlation = 0.1789013

We see some trend, with higher x going with higher y to some extent.

Case 2

vars  n  mean   sd median trimmed  mad   min  max range skew kurtosis   se
x    1 30 -0.25 0.83  -0.21   -0.26 0.71 -1.91 1.51  3.42 0.09    -0.44 0.15
y    2 30 -0.19 0.88  -0.23   -0.25 0.70 -1.92 1.72  3.64 0.53    -0.21 0.16

Correlation = -0.0138384

Here we see that there is no pattern whatsoever in the x and y variables.

# Part III

3a. As the correlation increases, the linear dependence between the variables increases. We can see this from the plots below.

[Scatterplots for correlations of .25, .50, .75, and .90]

From the graphs it is evident that as the correlation increases, the dependence increases.

3b. The line y <- x*r+e*(sqrt(1-r^2)) shows that y is built from two pieces: a portion of x (the x*r term) and independent noise (the e*sqrt(1-r^2) term). Because x and e are independent, each with variance 1, y has variance r^2 + (1 - r^2) = 1 and covariance r with x, so the population correlation is exactly r. The extremes make this concrete: at r = 0, y is pure noise and unrelated to x; at r = 1, the noise term vanishes and y equals x.

Part IV

4a. The scatterplots are as given below.

The first is the original xydata; in the second, the outlier has been introduced. On introducing the outlier, the correlation decreases from 0.7606033 to 0.6112308. It decreases because the linear relationship is weakened by the outlier.

4b. Doing the same exercise with 100 pairs of numbers, we find that the correlation decreases from 0.7720025 to 0.7072147. So there is still a decrease, but a smaller one than with 30 pairs. With 100 observations, each individual observation carries less weight in the correlation, so the outlier affects it less.

Part V: Transformations

5a. Using the transformation xydata$x2 <- x + 2.00, the correlation between the two new variables doesn’t change.

5b. Even if we multiply one or both of the variables by a constant, the correlation doesn’t change.

5c. Based on 5a and 5b, the correlation isn’t affected at all by standardization.

And indeed it is true.

xydata$x2 <- (x - mean(x))/sd(x)

xydata$y2 <- (y - mean(y))/sd(y)

We compute the standardized scores x2 and y2 and see that the correlation between x2 and y2 is the same as that between x and y.

5d. Since x and y are uncorrelated here, x^2 and y (or log(x) and y) should also be uncorrelated, so the correlation between them should be very close to zero. The logarithmic transformation gave the warning "In log(x) : NaNs produced". This is because we are taking logs of negative numbers, so we could not compute a correlation between the log-transformed variables. There was no well-defined pattern with the squared transformations either.
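If you do want a usable correlation after a log transformation, two common workarounds (not part of the original exercise) are dropping the NaN rows when correlating, or shifting the variable to be strictly positive before taking logs:

```r
# x and y as in 5d: independent standard normals
x <- rnorm(100, 0, 1)
y <- rnorm(100, 0, 1)

# Option 1: take logs anyway, then ignore the NaN rows when correlating
lx <- suppressWarnings(log(x))
cor(lx, y, use = "complete.obs")

# Option 2: shift x so all values are at least 1 before taking logs
cor(log(x - min(x) + 1), y)
```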

Part VI: Dichotomization

6a. We created a new variable x2 that is 0 when x is at or below 0 and 1 otherwise. The correlation table is as follows:

x         y        x2

x  1.0000000 0.5174247 0.7965665

y  0.5174247 1.0000000 0.3247136

x2 0.7965665 0.3247136 1.0000000

As is evident from the table, the correlation drops substantially, from 0.5174247 to 0.3247136, when we dichotomize x.

The R code is as follows

r <- .5

x <- rnorm(200,0,1)

e <- rnorm(200,0,1)

y <- x*r+e*(sqrt(1-r^2))

xydata<- data.frame(x=c(x),y=c(y))

xydata$x2 <- ifelse(x <= 0, 0, 1)

cor(xydata)

6b. We do as asked and get the following correlation table

x         y        x2        x3        x4        x5
x  1.0000000 0.4883766 0.8145386 0.7758389 0.6329432 0.5279236
y  0.4883766 1.0000000 0.3912904 0.3402826 0.3046051 0.2454053
x2 0.8145386 0.3912904 1.0000000 0.7113803 0.4117647 0.2890440
x3 0.7758389 0.3402826 0.7113803 1.0000000 0.5788250 0.4063144
x4 0.6329432 0.3046051 0.4117647 0.5788250 1.0000000 0.7019641
x5 0.5279236 0.2454053 0.2890440 0.4063144 0.7019641 1.0000000

where x2 is split at 0 (the mean), x3 at 0.5 SD above the mean, x4 at 1 SD above the mean, and x5 at 1.5 SD above the mean.

As we see, the correlation with y decreases steadily from x2 to x5, i.e., as the cut point moves further from the mean.
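The dichotomized columns in the table above can be built in the style of the 6a code (a sketch; the cut points follow the question, and for a standard normal variable the z-score is the cut point itself):

```r
# Correlated data as in 6a
r <- .5
x <- rnorm(200, 0, 1)
e <- rnorm(200, 0, 1)
y <- x * r + e * sqrt(1 - r^2)
xydata <- data.frame(x = x, y = y)

# Cut points: the mean (z = 0) and .5, 1, and 1.5 SDs above it
xydata$x2 <- ifelse(x < 0,   0, 1)
xydata$x3 <- ifelse(x < 0.5, 0, 1)
xydata$x4 <- ifelse(x < 1,   0, 1)
xydata$x5 <- ifelse(x < 1.5, 0, 1)

cor(xydata)
```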

6c. After dichotomizing both x and y at 1.5 SDs above the mean, we have:

x         y        x5        y5

x  1.0000000 0.5708160 0.5739103 0.3166254

y  0.5708160 1.0000000 0.3047006 0.5399813

x5 0.5739103 0.3047006 1.0000000 0.2714341

y5 0.3166254 0.5399813 0.2714341 1.0000000

where x5 and y5 are the dichotomized variables. As we see, the correlation between x5 and y5 is much lower than that between x and y.

x         y        y6

x  1.0000000 0.5614152 0.2004371

y  0.5614152 1.0000000 0.4382670

y6 0.2004371 0.4382670 1.0000000

And this is the case where only y is dichotomized (at 1.5 SDs below the mean). The correlation decreases in this case as well.

# Part VII: Range Restriction/Enhancement

7a. We constructed a data set with a population correlation of 0.5; the scatterplot, the descriptives, and the correlation are below.

Correlation = 0.5353877

vars   n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se

x    1 500 -0.02 0.97   0.02    0.00 0.87 -3.36 3.34  6.70 -0.16     0.41 0.04

y    2 500  0.03 0.99   0.03    0.04 0.99 -3.29 3.14  6.43 -0.10    -0.14 0.04

Next we constructed the range-restricted data set; its correlation and descriptives follow.

Correlation  =0.3882695

vars   n mean   sd median trimmed  mad   min  max range  skew kurtosis   se

x    1 375 0.40 0.67   0.33    0.33 0.70 -0.58 3.34  3.92  0.87     0.79 0.03

y    2 375 0.27 0.89   0.23    0.28 0.84 -2.16 3.14  5.31 -0.06     0.11 0.05

Comparing the descriptives, we see that the mean and median have increased and the standard deviation has decreased.

We also note that the correlation has decreased markedly.

R code:

r <- .5

x <- rnorm(500,0,1)

e <- rnorm(500,0,1)

y <- x*r+e*(sqrt(1-r^2))

xydata<- data.frame(x=c(x),y=c(y))

cor(xydata)

plot(xydata)

describe(xydata)

newxy <- xydata[xydata$x > quantile(xydata$x, 0.25), ]

cor(newxy)

plot(newxy)

describe(newxy)

7b. The descriptives of the original data are as follows:

vars   n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se

x    1 500 -0.04 1.00  -0.08   -0.04 0.94 -2.78 2.44  5.21 -0.03    -0.33 0.04

y    2 500  0.00 0.99  -0.01    0.00 0.98 -2.70 2.71  5.41  0.06    -0.24 0.04

and correlation = 0.4903157

whereas the descriptives for the range-enhanced data set are as follows:

vars   n  mean   sd median trimmed  mad   min  max range  skew kurtosis   se

x    1 250 -0.04 1.37  -0.07   -0.03 1.84 -2.78 2.44  5.21 -0.04    -1.48 0.09

y    2 250 -0.04 1.03  -0.05   -0.04 1.12 -2.70 2.71  5.41  0.07    -0.48 0.07

and correlation = 0.6430433

So in this case the correlation increased, the mean remained almost the same, and the standard deviation increased, i.e., just the opposite of what happened with range restriction.

R code:

r <- .5

x <- rnorm(500,0,1)

e <- rnorm(500,0,1)

y <- x*r+e*(sqrt(1-r^2))

xydata<- data.frame(x=c(x),y=c(y))

cor(xydata)

plot(xydata)

describe(xydata)

new1xy <- xydata[xydata$x < quantile(xydata$x, 0.25), ]

new2xy <- xydata[xydata$x > quantile(xydata$x, 0.75), ]

new3xy <- rbind(new1xy, new2xy)

new3xy <- rbind(new1xy,new2xy)

cor(new3xy)

plot(new3xy)

describe(new3xy)