# Correlation Coefficients

1. Use Stata and the AuSSA 2011 dataset to investigate the predictors of Body Mass Index (Y) by undertaking the following tasks. Include both the do and log files for your analyses in the appendix.

a) BMI is a measure of body fat. In the AuSSA 2011 dataset there is no variable for BMI readily available, but this can be constructed based on information on height and weight. Generate a variable called BMI which captures each respondent’s BMI, based on the definition here: http://www.bmi-calculator.net/bmi-formula.php. Ignore missing values.

b) Produce a histogram of the newly created BMI variable and overlay it with a normal probability distribution. Is the variable roughly normally distributed? Using Stata, find this variable’s mean, median, range and standard deviation.

c) Create a variable capturing the total number of children in the household age 0-17. Find a variable in the data capturing __years__ of education completed and give it a meaningful name. Produce descriptive statistics for the 2 variables.

d) Compute the correlations (using listwise deletion) between the variables on BMI, number of children 0-17 in the household, years of education completed, and age. Write down the correlation coefficients that measure the strength of the relationships between BMI and the other 3 variables. Which variable is most strongly correlated with BMI? How can you tell? How do you interpret this correlation coefficient?

e) Produce a scatterplot of BMI versus age using the jitter option and overlay this plot with a linear regression line.

f) Use Stata to regress BMI (Y) on age. Save the model estimates in Stata’s background memory with the name M1. Write down the model’s r^{2}. What does it tell you about the strength of the relationship between the 2 variables? __Fully__ interpret the coefficient on age.

2. Using Stata and the same set of variables created in Exercise 1 from the AuSSA 2011 dataset, complete the following tasks to develop a multiple linear regression model of BMI. You may continue to use the do and log files for your analyses initiated in the previous exercise. Include them in the appendix.

a) Regress BMI (Y) on the number of children 0-17 in the household, years of education completed, and age. Save the model estimates in Stata’s background memory with the name M2. Write down the prediction equation for this multiple linear regression model.

b) What is the R^{2} for this model? What is your interpretation of this? What is the test statistic used to test the null hypothesis that the coefficient of multiple determination is equal to 0? What is the value of the statistic for this model and what is your conclusion about the test?

c) From the analysis of variance table in the Stata output, write down the values for the total sum of squares and the sum of squared errors. Write down the formula for R^{2} using these values and show that using the formula gives the same result for R^{2} as shown in the Stata output.

d) Which of the partial regression coefficients in this model are significantly different from 0? How can you tell? What is the full interpretation of the partial regression coefficient on the variable capturing years of education completed?

e) Display the estimates that you have stored in Stata’s background memory with the names M1 and M2 side by side, including information on the number of observations, the root mean square error (i.e. σ), and r^{2}/R^{2}. Why is the coefficient on the variable age different in M1 and M2? Which model summary statistics have changed? Explain how and why.

f) Test for an interaction effect in the association between age, years of education completed, and BMI (Y), controlling for the number of children 0-17 in the household. Do you find evidence of such an effect? How can you tell? Interpret the coefficients on age, years of education and the new interaction term in the model (do not comment on statistical significance).

3. Use Stata and the WVS 2012 dataset to examine the associations between acceptance of homosexuality (*v203*), country of birth (*v243*) and religiosity (*v147*) by completing the following tasks. Include both the do and log files for your analyses in the appendix.

a) Produce frequency tables for variable *v243* with and without value labels. What is the value for being Australian born? Create a dummy variable called *Australian* which takes the value 1 if the respondent was born in Australia and the value 0 if the respondent was born in any other country. Remember that missing values in the original variable should also be missing in the new variable. Produce a cross-tabulation of the original and new variables (showing missing values) to check that the new variable is correctly derived.

b) Reformat the religiosity variable (*v147*) by executing the command *format v147 %22.0f* in Stata. Then produce 2 frequency tables for it, one with labels and one without labels. Create 3 dummy variables out of the religiosity variable, one for each of its categories. Remember that missing values in the original variable should also be missing in the new variables. Make sure that the new dummy variables have meaningful names. List the values for the original and new variables for the first 10 observations to check that the latter have been derived correctly.

c) Change the name of variable *v203* to *homosexuality*. Regress *homosexuality* (Y) on *Australian* and the dummy variables for religiosity, using individuals who report being ‘a religious person’ as the reference category. Write down the prediction equation for this regression model. What is the predicted Y score for an Australian-born person who reports being religious?

d) What is your full interpretation of the regression coefficient on the variable *Australian*?

e) What is your full interpretation of the regression coefficients associated with each of the dummy variables for religiosity included in the model?

f) Based on the model results, what is the difference in attitudes towards homosexuality between individuals who describe themselves as *‘not a religious person’* and those who describe themselves as *‘an atheist’*? How do you conclude this?

4. Use Stata and the AuSSA 2011 dataset to explore the relationship between men’s weekly work hours (Y) and age by completing the following tasks. Include both the do and log files for your analyses in the appendix.

a) Locate the variables containing information on age and weekly work hours and give them meaningful names. Create a variable capturing the square of age. Produce descriptive statistics for the 3 variables.

b) Fit a regression model of weekly work hours with age as a predictor for male respondents only. What is the value of R^{2} for this model? What does it tell you? How would you fully interpret the regression coefficient on the variable age?

c) Now regress weekly work hours on age and the square of age for male respondents only. What do the results suggest about the shape of the relationship between age and weekly work hours? Do the results indicate that the square term of age should be retained? Has the model been improved by including the variable for the square of age? Explain how you reached your conclusions.

d) Write down the prediction equation for this regression model (use all decimals) and use it to get the predicted work hours for 3 hypothetical individuals who are 25, 45 and 60 years of age.

e) Manually calculate the turning point for the prediction equation in 4d) and interpret the resulting number.

f) Some British literature suggests that the relationship between weekly hours of work and age amongst male workers should actually be cubic. Test this statement empirically using your Australian data. Explain your working and conclusions.

g) Based on the answer to 4f), use the estimates of your preferred model to get Stata to visually illustrate the relationship between weekly work hours and age amongst men. Restrict the X-axis to encompass only typical working ages for men.

**Solution**

1. Use Stata and the AuSSA 2011 dataset to investigate the predictors of Body Mass Index (Y) by undertaking the following tasks. Include both the do and log files for your analyses in the appendix.

a) BMI is a measure of body fat. In the AuSSA 2011 dataset there is no variable for BMI readily available, but this can be constructed based on information on height and weight. Generate a variable called BMI which captures each respondent’s BMI, based on the definition here: http://www.bmi-calculator.net/bmi-formula.php. Ignore missing values.

b) Produce a histogram of the newly created BMI variable and overlay it with a normal probability distribution. Is the variable roughly normally distributed? Using Stata, find this variable’s mean, median, range and standard deviation.


The variable appears to be approximately normally distributed.


The mean BMI is 0.0026925, the median is 0.0026235, the standard deviation is 0.0005269. The range is 0.0048598.
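The reported magnitudes are roughly 10,000 times smaller than conventional BMI values. That suggests (an inference, not something the solution states) that height was left in centimetres rather than converted to metres. A minimal Python sketch of the metric formula, using a hypothetical respondent who is not taken from the AuSSA data, shows the difference in scale:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Metric BMI: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Hypothetical respondent (illustrative only): 75 kg, 167 cm tall.
weight_kg, height_cm = 75.0, 167.0

bmi_kg_m2 = bmi(weight_kg, height_cm / 100)  # conventional scale (kg/m^2)
bmi_kg_cm2 = weight_kg / height_cm ** 2      # height left in centimetres

# Rescaling the reported mean by 10,000 puts it on the conventional scale.
reported_mean = 0.0026925
rescaled_mean = reported_mean * 10_000

print(round(bmi_kg_m2, 2), round(bmi_kg_cm2, 7), round(rescaled_mean, 3))
```

The rescaling changes only the units. Correlations, R-squared values and significance tests are unaffected, while regression coefficients on BMI are multiplied by the same factor.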

c) Create a variable capturing the total number of children in the household age 0-17. Find a variable in the data capturing __years__ of education completed and give it a meaningful name. Produce descriptive statistics for the 2 variables.

d) Compute the correlations (using listwise deletion) between the variables on BMI, number of children 0-17 in the household, years of education completed, and age. Write down the correlation coefficients that measure the strength of the relationships between BMI and the other 3 variables. Which variable is most strongly correlated with BMI? How can you tell? How do you interpret this correlation coefficient?

The first column of the correlation matrix gives the correlation coefficients between BMI and the other variables. Years of education has the strongest correlation with BMI, r = -.1355, because this coefficient is the largest in absolute value. Even so, it falls well short of a moderate correlation (the threshold used here is .4), so the negative linear relationship between BMI and years of education is very weak. The negative sign means that more years of education tend to go with lower BMI.

e) Produce a scatterplot of BMI versus age using the jitter option and overlay this plot with a linear regression line.

f) Use Stata to regress BMI (Y) on age. Save the model estimates in Stata’s background memory with the name M1. Write down the model’s r^{2}. What does it tell you about the strength of the relationship between the 2 variables? __Fully__ interpret the coefficient on age.

R-squared = .0116. R-squared measures the goodness of fit of the model: it is the proportion of the variability in BMI that is explained by age. Here that is only 1.16%, so the linear relationship is weak. The coefficient on age is .000000352 and is significant at the 5% level (p-value < .0001). On average, each additional year of age is associated with an increase in BMI of .000000352.
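In a regression with a single predictor, r^{2} is the square of the Pearson correlation between Y and X, and the correlation takes the sign of the slope. As a rough cross-check in Python (the assignment itself uses Stata), the implied correlation between BMI and age can be recovered from the two reported numbers. The estimation sample here may differ slightly from the listwise-deleted sample used in part d), so this is only an approximation of the figure in the correlation matrix.

```python
import math

r_squared = 0.0116        # reported r^2 for M1
slope_age = 0.000000352   # reported coefficient on age (positive)

# r has the same sign as the slope and magnitude sqrt(r^2).
r = math.copysign(math.sqrt(r_squared), slope_age)

print(round(r, 4))
```

The implied correlation is smaller in absolute value than the -.1355 reported for years of education, which agrees with education being the variable most strongly correlated with BMI.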

2. Using Stata and the same set of variables created in Exercise 1 from the AuSSA 2011 dataset, complete the following tasks to develop a multiple linear regression model of BMI. You may continue to use the do and log files for your analyses initiated in the previous exercise. Include them in the appendix.

a) Regress BMI (Y) on the number of children 0-17 in the household, years of education completed, and age. Save the model estimates in Stata’s background memory with the name M2. Write down the prediction equation for this multiple linear regression model.

BMI = .0028 + .000000628 × child - .0000163 × years_education + .000000310 × age

b) What is the R^{2} for this model? What is your interpretation of this? What is the test statistic used to test the null hypothesis that the coefficient of multiple determination is equal to 0? What is the value of the statistic for this model and what is your conclusion about the test?

R-squared = .0258 which means that this model is able to explain 2.58% of the variability in BMI. The F-statistic is F(3,1344) = 11.88 with p-value < .0001. The F test rejects the null hypothesis that the R-squared is equal to 0 at the 5% level.

c) From the analysis of variance table in the Stata output, write down the values for the total sum of squares and the sum of squared errors. Write down the formula for R^{2} using these values and show that using the formula gives the same result for R^{2} as shown in the Stata output.

TSS = .00037671

SSE = .00036698

The formula for deriving R-squared from TSS and SSE is:

R^{2} = 1 - SSE/TSS = 1 - .00036698/.00037671 = 1 - .97417 = .0258
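The same arithmetic can be checked outside Stata. The short Python sketch below recomputes R^{2} from the reported sums of squares and then derives the F statistic from it, using the 3 predictors and 1344 residual degrees of freedom reported for F(3,1344). The formula F = (R^{2}/k) / ((1 - R^{2})/df) is the standard one for the overall test and is not quoted from the solution.

```python
tss = 0.00037671   # total sum of squares (reported)
sse = 0.00036698   # sum of squared errors (reported)
k = 3              # number of predictors in M2
df_resid = 1344    # residual degrees of freedom, from F(3, 1344)

r_squared = 1 - sse / tss

# Overall F test of H0: R^2 = 0.
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

print(round(r_squared, 4), round(f_stat, 2))
```

Both values agree with the Stata output quoted above (.0258 and 11.88).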

d) Which of the partial regression coefficients in this model are significantly different from 0? How can you tell? What is the full interpretation of the partial regression coefficient on the variable capturing years of education completed?

All coefficients except the one for number of children are significantly different from 0 at the 5% level because their p-values are less than .05.

The coefficient on years of education means that, holding age and number of children constant, each additional year of completed education is associated with a decrease in BMI of .0000163.

e) Display the estimates that you have stored in Stata’s background memory with the names M1 and M2 side by side, including information on the number of observations, the root mean square error (i.e. σ), and r^{2}/R^{2}. Why is the coefficient on the variable age different in M1 and M2? Which model summary statistics have changed? Explain how and why.

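The coefficient on age differs because M2 also includes number of children and years of education. In M2 it is a partial coefficient, the association between age and BMI holding those two variables constant, whereas in M1 it also absorbs any association that runs through them; it falls from .000000352 in M1 to .000000310 in M2. The model summary statistics change as well because the set of predictors differs: for example, R-squared rises from .0116 in M1 to .0258 in M2. The number of observations can also differ if the added variables have missing values.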

f) Test for an interaction effect in the association between age, years of education completed, and BMI (Y), controlling for the number of children 0-17 in the household. Do you find evidence of such an effect? How can you tell? Interpret the coefficients on age, years of education and the new interaction term in the model (do not comment on statistical significance).

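The interaction term is significant at the 5% level (p-value = .015), which is evidence of an interaction effect. The coefficients are now interpreted as follows. Holding the number of children constant, the effect of each additional year of education on BMI is -.000054 + .0000000672 × age, so -.000054 is the effect of education when age is 0. Vice versa, the effect of each additional year of age on BMI is -.000000631 + .0000000672 × years_edu, so -.000000631 is the effect of age for a respondent with 0 years of education. The interaction coefficient, .0000000672, is the change in the effect of one variable for each one-unit increase in the other.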
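To make the conditional effects concrete, the two expressions above can be evaluated at a few values. The Python sketch below uses the coefficients exactly as reported; the ages and years of education plugged in are illustrative choices, not values taken from the data.

```python
# Coefficients as reported for the interaction model.
B_EDU = -0.000054       # years of education
B_AGE = -0.000000631    # age
B_INT = 0.0000000672    # age x years of education

def effect_of_education(age: float) -> float:
    """Change in predicted BMI per extra year of education at a given age."""
    return B_EDU + B_INT * age

def effect_of_age(years_edu: float) -> float:
    """Change in predicted BMI per extra year of age at a given education level."""
    return B_AGE + B_INT * years_edu

for age in (30, 50, 70):        # illustrative ages
    print("age", age, effect_of_education(age))

for years_edu in (8, 12, 16):   # illustrative years of education
    print("years_edu", years_edu, effect_of_age(years_edu))
```

Because the interaction coefficient is positive, the negative effect of education becomes smaller in magnitude at older ages, and the effect of age becomes more positive as education increases.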

3. Use Stata and the WVS 2012 dataset to examine the associations between acceptance of homosexuality (*v203*), country of birth (*v243*) and religiosity (*v147*) by completing the following tasks. Include both the do and log files for your analyses in the appendix.

a) Produce frequency tables for variable *v243* with and without value labels. What is the value for being Australian born? Create a dummy variable called *Australian* which takes the value 1 if the respondent was born in Australia and the value 0 if the respondent was born in any other country. Remember that missing values in the original variable should also be missing in the new variable. Produce a cross-tabulation of the original and new variables (showing missing values) to check that the new variable is correctly derived.

There are 1,133 Australian-born respondents.

b) Reformat the religiosity variable (*v147*) by executing the command *format v147 %22.0f* in Stata. Then produce 2 frequency tables for it, one with labels and one without labels. Create 3 dummy variables out of the religiosity variable, one for each of its categories. Remember that missing values in the original variable should also be missing in the new variables. Make sure that the new dummy variables have meaningful names. List the values for the original and new variables for the first 10 observations to check that the latter have been derived correctly.

c) Change the name of variable *v203* to *homosexuality*. Regress *homosexuality* (Y) on *Australian* and the dummy variables for religiosity, using individuals who report being ‘a religious person’ as the reference category. Write down the prediction equation for this regression model. What is the predicted Y score for an Australian-born person who reports being religious?

Homosexuality = 4.20 + 0.86 × Australian + 1.18 × religious + 2.25 × non_religious + 3.31 × Atheist

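For an Australian-born person who reports being religious, the coefficient on Australian also applies, so the equation as written gives a predicted Y score of 4.20 + 0.86 + 1.18 = 6.24. (If ‘a religious person’ is the omitted reference category, as the question specifies, there is no separate religious term and the prediction is simply the constant plus the coefficient on Australian.)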

d) What is your full interpretation of the regression coefficient on the variable *Australian*?

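Holding religiosity constant, Australian-born respondents are predicted to score 0.86 points higher on the homosexuality scale than respondents born in any other country.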

e) What is your full interpretation of the regression coefficients associated with each of the dummy variables for religiosity included in the model?

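Each dummy coefficient gives the difference in the predicted Y score between respondents in that category and respondents in the reference category, holding country of birth constant. Because the categories are mutually exclusive, the dummies can never all equal 1 at the same time: when one dummy is 1, the others take the value 0.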

f) Based on the model results, what is the difference in attitudes towards homosexuality between individuals who describe themselves as *‘not a religious person’* and those who describe themselves as *‘an atheist’*? How do you conclude this?

Atheists have a higher predicted Y score than non-religious people: the two coefficients differ by 3.31 - 2.25 = 1.06 points, holding country of birth constant. On average, therefore, atheists regard homosexuality as more justifiable than non-religious people do. This follows from the direction of the homosexuality scale, which runs from 1 (never justifiable) to 10 (always justifiable).
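The comparison between two dummy categories does not depend on the constant or on country of birth, because both cancel when the predictions are subtracted. A small Python sketch, using the coefficients as printed in the prediction equation for part c), illustrates this:

```python
# Coefficients as printed in the prediction equation for part c).
B_AUSTRALIAN = 0.86
B_NON_RELIGIOUS = 2.25
B_ATHEIST = 3.31

def gap_atheist_vs_non_religious(australian: int) -> float:
    """Atheist minus non-religious predicted Y for a given birthplace.

    The constant is omitted because it is common to both predictions
    and cancels in the difference.
    """
    atheist = B_AUSTRALIAN * australian + B_ATHEIST
    non_religious = B_AUSTRALIAN * australian + B_NON_RELIGIOUS
    return atheist - non_religious

for australian in (0, 1):
    print(australian, round(gap_atheist_vs_non_religious(australian), 2))
```

The gap is the same for respondents born in Australia and for those born elsewhere, since the model contains no interaction between birthplace and religiosity.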

4. Use Stata and the AuSSA 2011 dataset to explore the relationship between men’s weekly work hours (Y) and age by completing the following tasks. Include both the do and log files for your analyses in the appendix.

a) Locate the variables containing information on age and weekly work hours and give them meaningful names. Create a variable capturing the square of age. Produce descriptive statistics for the 3 variables.

b) Fit a regression model of weekly work hours with age as a predictor for male respondents only. What is the value of R^{2} for this model? What does it tell you? How would you fully interpret the regression coefficient on the variable age? *(2 marks)*

The R-squared is .0002, which means that age explains only 0.02% of the variability in men's weekly work hours; the linear model fits extremely poorly. The coefficient on age is not significant at the 5% level. Taken at face value, it implies that each additional year of age is associated with an increase of .015 hours in weekly hours worked.

c) Now regress weekly work hours on age and the square of age for male respondents only. What do the results suggest about the shape of the relationship between age and weekly work hours? Do the results indicate that the square term of age should be retained? Has the model been improved by including the variable for the square of age? Explain how you reached your conclusions.

Since both age and the square of age are significant, the relationship is not linear. Because the coefficient on the squared term is negative, the relationship has an inverted-U shape: work hours increase with age up to a peak and decrease with each additional year of age after that. The significance of the squared term indicates that it should be retained.

Including the age2 variable improved the model: the overall model is now significant and the adjusted R-squared has risen to .0644.

d) Write down the prediction equation for this regression model (use all decimals) and use it to get the predicted work hours for 3 hypothetical individuals who are 25, 45 and 60 years of age.

The prediction equation is:

Work hours = 5.679192 + 1.804366 × age - .019784 × age2

Age 25: Work hours = 38.42

Age 45: Work hours = 46.81

Age 60: Work hours = 42.72
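These three predictions can be reproduced by plugging each age into the prediction equation. A minimal Python sketch, using the coefficients reported above (the function name is an illustrative choice):

```python
# Coefficients from the quadratic prediction equation reported above.
B0, B_AGE, B_AGE2 = 5.679192, 1.804366, -0.019784

def predicted_hours(age: float) -> float:
    """Predicted weekly work hours for a man of the given age."""
    return B0 + B_AGE * age + B_AGE2 * age ** 2

for age in (25, 45, 60):
    print(age, round(predicted_hours(age), 2))
```

Rounded to two decimals, the results match the values listed above: 38.42, 46.81 and 42.72 hours.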

e) Manually calculate the turning point for the prediction equation in 4d) and interpret the resulting number.

The turning point can be found using the first derivative and making it equal to 0:

1.804366 - 2 × 0.019784 × age = 0

0.039568 × age = 1.804366

age = 1.804366 / 0.039568 = 45.6

There is a peak at 45.6 years: predicted work hours increase with age up to 45.6 years and decrease with each additional year of age after that.
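For a quadratic b0 + b1 × age + b2 × age2, setting the first derivative to zero gives a turning point at -b1 / (2 × b2), and the sign of b2 determines whether it is a peak or a trough. A short Python check with the reported coefficients:

```python
# Coefficients on age and age squared from the prediction equation.
B_AGE, B_AGE2 = 1.804366, -0.019784

# First derivative B_AGE + 2 * B_AGE2 * age equals zero at the turning point.
turning_point = -B_AGE / (2 * B_AGE2)

# Second derivative is 2 * B_AGE2; a negative value means a maximum.
is_maximum = B_AGE2 < 0

print(round(turning_point, 1), is_maximum)
```

The negative coefficient on the squared term confirms that 45.6 years is a maximum rather than a minimum.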

f) Some British literature suggests that the relationship between weekly hours of work and age amongst male workers should actually be cubic. Test this statement empirically using your Australian data. Explain your working and conclusions.

When the cubic term of age is added, age and age2 remain significant, but age^3 is not significant at the 5% level. The suggestion of a cubic relationship is therefore not supported by these data.

g) Based on the answer to 4f), use the estimates of your preferred model to get Stata to visually illustrate the relationship between weekly work hours and age amongst men. Restrict the X-axis to encompass only typical working ages for men.