# Econometric Analysis

Question 1

Using the 2015 Canadian Tobacco Alcohol Drugs Survey available on odesi:
(a) How many individuals in your sample had a job in the past week?
(b) Regress whether the individual had a job in the past week on whether
they had kids, province of residence, gender, age and alcohol consumption
in the past week (in number of drinks).
(c) You think that there may be endogeneity between past week alcohol
consumption and whether the individual had a job in the past week. Using
alcohol consumption in the past year (number of drinks), estimate the model
properly. Do this using OLS and using the correct method of estimation.
(d) For OLS, do (c) in steps and using the built-in command in Stata.
Do you get the same results?
(e) Is there a difference between the two methods of estimation used in (c)?
(f) Test whether there is a difference between your results in (b) and in
(c). Were you justified to think that endogeneity was present?
(g) Is the instrument used in (c) a "good instrument"? I expect both a
theoretical answer and a quantitative one here…
(h) Is heteroskedasticity present in these models?
(i) Regardless of your answer in (g), suggest a different instrument. Explain why you think this instrument would be better.

Solution

Using the 2015 Canadian Tobacco Alcohol Drugs Survey available on odesi (which you can access through the library’s website)

(a) How many individuals in your sample had a job in the past week?

(You will have to work at this: it is not as straightforward as it sounds).

To answer the question, we combine LMAM_01 (worked at a job last week) with LMAM_02 (absent from job or business last week). A positive answer to either variable indicates having a job.

Code for R

# set the working directory

setwd('b:/R/')

# check what the data look like

# tell R to work with MyData

attach(MyData)

######## q1a ###########

########################

# computes a variable to show if respondent has a job

work<- (ifelse(LMAM_01==1,1,0))+(ifelse(LMAM_02==1,1,0))

# see the distribution of cases

table(work)

# shows directly the number of people that had job in the previous week

sum(work)

Results

>table(work)

work

0    1

6462 8692

> # shows directly the number of people that had job in the previous week

>sum(work)

[1] 8692

8,692 individuals had a job in the past week.
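The same indicator can also be built with a logical OR, which stays 0/1 even if a respondent were coded 1 on both items. The sketch below uses made-up responses; the toy values, the 2 = "no" coding and the name `has_job` are assumptions for illustration, not survey records.

```r
# Toy responses: 1 = yes, 2 = no (assumed coding, for illustration only)
LMAM_01 <- c(1, 2, 2, 1, 2)   # worked at a job last week
LMAM_02 <- c(2, 1, 2, 1, 2)   # had a job but was absent last week

# Logical OR coerced to 0/1; it never exceeds 1 when both answers are "yes"
has_job <- as.integer(LMAM_01 == 1 | LMAM_02 == 1)

table(has_job)   # distribution of cases
sum(has_job)     # number of respondents with a job
```

Adding the two `ifelse()` results, as in the code above, gives the same answer as long as no respondent answers yes to both items.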

(b) Regress whether the individual had a job in the past week on whether they had kids, province of residence, gender, age and alcohol consumption in the past week (in number of drinks).

First, we set up the independent variables. Since the number of drinks in the past week is not recorded, the monthly value is used. Second, we run the model as a logit, because the dependent variable is binary.

Code for R

######## q1b ###########

########################

# preparing independent variables

# Install the car package

install.packages("car")

library(car)

# create variable hasKids (0=has not; 1=has)

# province of residence

PROV<-factor(PROV)

levels(PROV) <- c("Newfoundland and Labrador", "Prince Edward Island", "Nova Scotia",
                  "New Brunswick", "Quebec", "Ontario", "Manitoba", "Saskatchewan",
                  "Alberta", "British Colombia")

#gender

SEX<-factor(SEX)

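levels(SEX) <- c("Male", "Female")

# alcohol consumption in the past week (in number of drinks)

table(ALC7D)

# code 999 as missing; recode() returns a new vector, so the result must be assigned

ALC7D <- recode(ALC7D, "999=NA")

# model: logit of having a job on the covariates

logit1 <- glm(work ~ hasKids + PROV + SEX + DVAGE + ALC7D, family = binomial(link = "logit"))

summary(logit1)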

Results

> summary(logit1)

Call:
glm(formula = work ~ hasKids + PROV + SEX + DVAGE + ALC7D, family = binomial(link = "logit"))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8236  -1.0807   0.7964   1.0084   1.5568

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                1.154e+00  7.078e-02  16.302  < 2e-16 ***
hasKidshas kids            1.120e-01  4.282e-02   2.617 0.008880 **
PROVPrince Edward Island   2.717e-01  7.799e-02   3.484 0.000494 ***
PROVNova Scotia            1.350e-01  7.766e-02   1.739 0.082102 .
PROVNew Brunswick          1.353e-01  7.775e-02   1.740 0.081802 .
PROVQuebec                 2.724e-01  7.659e-02   3.557 0.000375 ***
PROVOntario                2.682e-01  7.778e-02   3.449 0.000563 ***
PROVManitoba               4.458e-01  7.743e-02   5.758 8.53e-09 ***
PROVSaskatchewan           4.306e-01  7.647e-02   5.631 1.79e-08 ***
PROVAlberta                5.532e-01  7.601e-02   7.278 3.39e-13 ***
PROVBritish Colombia       3.336e-01  7.620e-02   4.379 1.19e-05 ***
SEXFemale                 -2.039e-01  3.456e-02  -5.901 3.61e-09 ***
DVAGE                     -2.444e-02  8.691e-04 -28.117  < 2e-16 ***
ALC7D                     -2.486e-04  9.129e-05  -2.723 0.006470 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 20679  on 15153  degrees of freedom
Residual deviance: 19483  on 15140  degrees of freedom
AIC: 19511

Number of Fisher Scoring iterations: 4
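Logit coefficients are on the log-odds scale, so exponentiating them makes the table easier to read. The short sketch below does this for a few coefficients copied by hand from the `summary(logit1)` output; the vector name `b` and the selection of terms are illustrative choices.

```r
# Selected logit coefficients copied from the summary(logit1) output
b <- c(hasKids = 0.1120, SEXFemale = -0.2039, DVAGE = -0.02444, ALC7D = -0.0002486)

# exp(coefficient) is the multiplicative change in the odds of having a job
# for a one-unit increase in the regressor, holding the others fixed
odds_ratio <- exp(b)
round(odds_ratio, 4)
```

For example, having kids multiplies the odds of having a job by about 1.12, while each extra year of age lowers them slightly. With the fitted object in memory, `exp(coef(logit1))` gives the same quantities for every term.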

(c) You think that there may be endogeneity between past week alcohol consumption and whether the individual had a job in the past week. Using alcohol consumption in the past year (number of drinks), estimate the model properly. Do this using OLS and using the correct method of estimation.

First, we instrument ALC7D with ALC_10. Second, we save the predicted values and use them in place of ALC7D in the second-stage model. For comparison with part (d), this time we use a probit model.

Code for R

######## q1c ###########

########################

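# code 96 to 99 as missing; recode() returns a new vector, so the result must be assigned

ALC_10 <- recode(ALC_10, "96:99=NA")

# first stage

install.packages("MASS")

library("MASS")

frststg <- rlm(ALC7D ~ ALC_10)

# predict for every respondent so that ivALC7D stays aligned with work (NA where ALC_10 is missing)

ivALC7D <- predict(frststg, newdata = data.frame(ALC_10 = ALC_10))

# iv model: second-stage probit on the predicted values

install.packages("robustbase")

library("robustbase")

probit2 <- glmrob(work ~ hasKids + PROV + SEX + DVAGE + ivALC7D, family = binomial(link = "probit"))

summary(probit2)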

Results

> summary(probit2)

Call:  glmrob(formula = work ~ hasKids + PROV + SEX + DVAGE + ivALC7D, family = binomial(link = "probit"))

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                0.4083039  0.0462825   8.822  < 2e-16 ***
hasKidshas kids            0.1116060  0.0267517   4.172 3.02e-05 ***
PROVPrince Edward Island   0.1615889  0.0491877   3.285  0.00102 **
PROVNova Scotia            0.0750072  0.0490060   1.531  0.12588
PROVNew Brunswick          0.0853637  0.0490650   1.740  0.08189 .
PROVQuebec                 0.0858084  0.0483377   1.775  0.07587 .
PROVOntario                0.1357301  0.0490417   2.768  0.00565 **
PROVManitoba               0.2616697  0.0486682   5.377 7.59e-08 ***
PROVSaskatchewan           0.2539036  0.0482753   5.259 1.44e-07 ***
PROVAlberta                0.3187946  0.0478673   6.660 2.74e-11 ***
PROVBritish Colombia       0.1391027  0.0479259   2.902  0.00370 **
SEXFemale                 -0.0413826  0.0219149  -1.888  0.05898 .
DVAGE                     -0.0162651  0.0005465 -29.760  < 2e-16 ***
ivALC7D                    0.2598341  0.0111240  23.358  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustness weights w.r * w.x:
 16 observations c(589,884,1734,1836,2522,4078,4286,5254,6850,7144,7249,7405,10704,13171,13570,13602)
  are outliers with |weight| <= 2e-08 ( < 6.6e-06);
 12921 weights are ~= 1. The remaining 2217 ones are summarized as
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.4046  0.7770  0.8742  0.8501  0.9458  0.9989

Number of observations: 15154
Fitted by method 'Mqle'  (in 6 iterations)

(Dispersion parameter for binomial family taken to be 1)

No deviance values available
Algorithmic parameters:
   acc    tcc
0.0001 1.3450
maxit
   50
test.acc
  "coef"

(d) For OLS, do (c) in steps and using the built-in command in Stata. Do you get the same results?

Code for R[1]

######## q1d ###########

########################

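install.packages("AER")

library(AER)

# built-in 2SLS: regressors before the bar, instruments (with ALC_10 replacing ALC7D) after it

iv1 <- ivreg(work ~ hasKids + PROV + SEX + DVAGE + ALC7D | hasKids + PROV + SEX + DVAGE + ALC_10)

summary(iv1)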

Results

> summary(iv1)

Call:
ivreg(formula = work ~ hasKids + PROV + SEX + DVAGE + ALC7D | hasKids + PROV + SEX + DVAGE + ALC_10)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8726 -0.4731  0.2502  0.3873  1.4289

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                7.999e-01  1.686e-02  47.437  < 2e-16 ***
hasKids                    1.833e-02  9.974e-03   1.838 0.066082 .
PROVPrince Edward Island   5.702e-02  1.861e-02   3.064 0.002189 **
PROVNova Scotia            2.577e-02  1.858e-02   1.387 0.165468
PROVNew Brunswick          2.683e-02  1.860e-02   1.442 0.149255
PROVQuebec                 6.993e-02  1.826e-02   3.829 0.000129 ***
PROVOntario                6.494e-02  1.853e-02   3.504 0.000460 ***
PROVManitoba               1.025e-01  1.828e-02   5.606 2.11e-08 ***
PROVSaskatchewan           1.069e-01  1.812e-02   5.900 3.71e-09 ***
PROVAlberta                1.367e-01  1.787e-02   7.649 2.15e-14 ***
PROVBritish Colombia       8.637e-02  1.812e-02   4.766 1.90e-06 ***
SEXFemale                 -6.616e-02  8.276e-03  -7.994 1.40e-15 ***
DVAGE                     -5.491e-03  2.033e-04 -27.003  < 2e-16 ***
ALC7D                     -8.307e-04  5.925e-05 -14.019  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4964 on 15140 degrees of freedom
Multiple R-Squared: -0.006423,  Adjusted R-squared: -0.007287
Wald test: 104.1 on 13 and 15140 DF,  p-value: < 2.2e-16

The results differ: the alcohol coefficient remains significant, but its sign is reversed (positive in the two-step probit of (c), negative in ivreg).
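When both stages are linear, doing the estimation "in steps" and using a built-in 2SLS routine should give the same point estimates. The sketch below checks this on simulated data using base R only; every variable (`z`, `w`, `u`, `x`, `y`) and every coefficient value is a hypothetical stand-in, not a survey variable, and the closed-form line plays the role of the built-in command.

```r
set.seed(1)
n <- 2000
z <- rnorm(n)                                # instrument (hypothetical)
w <- rnorm(n)                                # exogenous control (hypothetical)
u <- rnorm(n)                                # unobserved confounder
x <- 0.8 * z + 0.5 * w + u + rnorm(n)        # endogenous regressor
y <- 1 + 0.5 * x - 0.3 * w - u + rnorm(n)    # outcome; true effect of x is 0.5

# Step 1: the first stage includes the instrument AND the exogenous control
stage1 <- lm(x ~ z + w)
x_hat  <- fitted(stage1)

# Step 2: replace x with its first-stage fitted values
stage2 <- lm(y ~ x_hat + w)

# Closed-form IV estimator (Z'X)^{-1} Z'y, valid with exactly one instrument
# per endogenous regressor
X <- cbind(1, x, w)
Z <- cbind(1, z, w)
b_iv <- solve(t(Z) %*% X, t(Z) %*% y)

cbind(two_step = coef(stage2), closed_form = drop(b_iv))

# Naive OLS for comparison; it is biased by the confounder u
ols_x <- coef(lm(y ~ x + w))["x"]
ols_x
```

The two columns agree, but the standard errors printed by the second `lm()` are not valid because they ignore that `x_hat` is itself estimated. This is one thing a built-in routine such as `AER::ivreg` corrects.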

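(e) Is there a difference between the two methods of estimation used in (c)?

Yes: the two stages use different estimators. In the first stage, a linear regression is used (fitted here with rlm, a robust M-estimator, rather than plain OLS).

In the second stage, a probit is used because the outcome is binary.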

(f) Test whether there is a difference between your results in (b) and in (c). Were you justified to think that endogeneity was present?

The results differ: the alcohol coefficient remains significant, but its sign is reversed (negative in (b), positive in (c)). One may therefore suspect that endogeneity was present.
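Comparing signs is informal. A regression-based Durbin-Wu-Hausman test gives a formal check: add the first-stage residuals to the outcome equation and test their coefficient. The sketch below runs it on simulated data with base R; the variables and coefficient values are hypothetical, and a linear outcome equation is assumed.

```r
set.seed(2)
n <- 2000
z <- rnorm(n)                       # instrument (hypothetical)
u <- rnorm(n)                       # unobserved confounder
x <- 0.8 * z + u + rnorm(n)         # endogenous regressor
y <- 1 + 0.5 * x - u + rnorm(n)     # outcome; true effect of x is 0.5

# First stage: keep the part of x that the instrument does not explain
v_hat <- resid(lm(x ~ z))

# Augmented regression: a significant coefficient on v_hat rejects
# the null hypothesis that x is exogenous
aug <- lm(y ~ x + v_hat)
hausman <- summary(aug)$coefficients["v_hat", ]
hausman
```

A small p-value on `v_hat` supports the use of an instrument. For a fitted `ivreg` object, `summary(iv1, diagnostics = TRUE)` in AER reports a Wu-Hausman statistic along with a weak-instrument test.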

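(g) Is the instrument used in (c) a "good instrument"? I expect both a theoretical answer and a quantitative one here…

A good instrument should be correlated with the instrumented independent variable (relevance) and should affect the outcome only through that variable, that is, it should be uncorrelated with the error term of the outcome equation (exogeneity).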

Code for R

######## q1g ###########

########################

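# the instrument ALC_10 enters the outcome equation directly

probit3 <- glmrob(work ~ hasKids + PROV + SEX + DVAGE + ALC_10, family = binomial(link = "probit"))

print(summary(probit3))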

Results

> print(summary(probit3))

Call:  glmrob(formula = work ~ hasKids + PROV + SEX + DVAGE + ALC_10, family = binomial(link = "probit"))

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)
(Intercept)                1.4085727  0.0533775  26.389  < 2e-16 ***
hasKidshas kids            0.1116060  0.0267517   4.172 3.02e-05 ***
PROVPrince Edward Island   0.1615889  0.0491877   3.285  0.00102 **
PROVNova Scotia            0.0750072  0.0490060   1.531  0.12588
PROVNew Brunswick          0.0853637  0.0490650   1.740  0.08189 .
PROVQuebec                 0.0858084  0.0483377   1.775  0.07587 .
PROVOntario                0.1357301  0.0490417   2.768  0.00565 **
PROVManitoba               0.2616697  0.0486682   5.377 7.59e-08 ***
PROVSaskatchewan           0.2539036  0.0482753   5.259 1.44e-07 ***
PROVAlberta                0.3187946  0.0478673   6.660 2.74e-11 ***
PROVBritish Colombia       0.1391027  0.0479259   2.902  0.00370 **
SEXFemale                 -0.0413826  0.0219149  -1.888  0.05898 .
DVAGE                     -0.0162651  0.0005465 -29.760  < 2e-16 ***
ALC_10                    -0.1200153  0.0051381 -23.358  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Robustness weights w.r * w.x:
 16 observations c(589,884,1734,1836,2522,4078,4286,5254,6850,7144,7249,7405,10704,13171,13570,13602)
  are outliers with |weight| <= 2e-08 ( < 6.6e-06);
 12921 weights are ~= 1. The remaining 2217 ones are summarized as
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.4046  0.7770  0.8742  0.8501  0.9458  0.9989

Number of observations: 15154
Fitted by method 'Mqle'  (in 6 iterations)

(Dispersion parameter for binomial family taken to be 1)

No deviance values available
Algorithmic parameters:
   acc    tcc
0.0001 1.3450
maxit
   50
test.acc
  "coef"

ALC_10 covaries with the probability of having a job even after controlling for the other regressors; therefore it is not a good instrument.
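The relevance condition can be checked quantitatively with a first-stage F test on the excluded instrument. The sketch below uses simulated data and base R only; `w`, `z`, `x` and their coefficients are hypothetical stand-ins, not survey variables. Exogeneity, by contrast, cannot be tested with a single instrument and has to be argued on theoretical grounds.

```r
set.seed(3)
n <- 1000
w <- rnorm(n)                       # exogenous control (hypothetical)
z <- rnorm(n)                       # candidate instrument (hypothetical)
x <- 0.4 * z + 0.5 * w + rnorm(n)   # endogenous regressor

# First stage with and without the instrument
restricted   <- lm(x ~ w)
unrestricted <- lm(x ~ w + z)

# F test on the excluded instrument; an F statistic above roughly 10
# is a common rule of thumb for a sufficiently strong instrument
f_test <- anova(restricted, unrestricted)
first_stage_F <- f_test$F[2]
first_stage_F
```

With a single instrument, this F statistic equals the squared t statistic of the instrument in the unrestricted first stage.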

(h) Is heteroskedasticity present in these models?

First, we inspect the plots (residuals vs fitted values, and residuals vs leverage).

Second, we compute the Breusch-Pagan test.

######## q1h ###########

########################

par(mfrow=c(2,2)) # init 4 charts in 1 panel

plot(logit1)

Model 1 (the one in (b))

lmtest::bptest(logit1)

car::ncvTest(logit1)  # Breusch-Pagan test

[1] Unfortunately, the ivprobit procedure does not reach convergence; ivreg was used instead.

install.packages("ivprobit")

library(ivprobit)

summary(iv1)

> lmtest::bptest(logit1)  # Breusch-Pagan test

studentized Breusch-Pagan test

data:  logit1
BP = 363.35, df = 13, p-value < 2.2e-16

Heteroskedasticity is already visible in the plots. The test confirms that the null hypothesis of constant residual variance should be rejected.

> lmtest::bptest(probit2)  # Breusch-Pagan test

studentized Breusch-Pagan test

data:  probit2
BP = 602.88, df = 13, p-value < 2.2e-16

The same applies to the two-step IV model.

> lmtest::bptest(iv1)  # Breusch-Pagan test

studentized Breusch-Pagan test

data:  iv1
BP = 702.27, df = 13, p-value < 2.2e-16

Similar results are obtained for the ivreg model (iv1). We conclude that heteroskedasticity is present in all three models.
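To make the statistic less of a black box, the studentized Breusch-Pagan test can be computed by hand as n times the R-squared of a regression of the squared residuals on the regressors. The sketch below does this on simulated data with base R; the data-generating values are hypothetical and chosen so that the error spread grows with `x`.

```r
set.seed(4)
n <- 1000
x <- runif(n, 1, 5)
y <- 2 + 0.5 * x + rnorm(n, sd = x)   # error standard deviation grows with x

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the regressors
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2,
# chi-squared with one degree of freedom per regressor (1 here)
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(BP = bp_stat, p_value = bp_pval)
```

For a binary outcome, rejection is to be expected: in a linear probability model the error variance p(1 - p) changes with the regressors by construction, so heteroskedasticity-robust standard errors are the usual remedy.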

(i) Regardless of your answer in (g), suggest a different instrument. Explain why you think this instrument would be better. If the variable is in the original dataset, and you can demonstrate that it is a better instrument, you will earn 2 bonus marks here.

I checked several potential instruments in the dataset, but all of them are related to the dependent variable. One problem is that education level is not included in the model, so most potential instruments actually act as proxies for education.

A much better instrument might be parental alcohol consumption, but this is not included in the dataset.