# Data Manipulation

### Problem Set 1

Basic Data Manipulation

1. In this question, you will do some basic manipulations of the dataset in STATA.
2. In many datasets, missing  observations are numerically coded (ie, missing=99 or 999). This can affect the regression estimates. Before you get started on this dataset, check to make sure that there are no missing observations that are coded as “99” or “999”. If there are, replace these missing observations so that they are equal to “.”.

Done, there were none.

1. Calculate the means, variances and other descriptive statistics for the hhexp1 and hheducvariables.

Annual HH expenditure per capita in US dollars

————————————————————-

Percentiles      Smallest

1%     17.43458       10.43326

5%     41.17483       12.46576

10%     58.84205       15.54125       Obs                 312

25%     95.83035       17.43458       Sum of Wgt.         312

50%     142.9946                      Mean           181.1164

Largest       Std. Dev.      152.0377

75%     225.2836       918.7469

90%     329.0057       1017.717       Variance       23115.47

95%     406.2822       1112.316       Skewness       3.069079

99%     918.7469       1117.012       Kurtosis       16.42536

Annual HH expenditure per capita in US dollars

————————————————————-

Percentiles      Smallest

1%     17.43458       10.43326

5%     41.17483       12.46576

10%     58.84205       15.54125       Obs                 312

25%     95.83035       17.43458       Sum of Wgt.         312

50%     142.9946                      Mean           181.1164

Largest       Std. Dev.      152.0377

75%     225.2836       918.7469

90%     329.0057       1017.717       Variance       23115.47

95%     406.2822       1112.316       Skewness       3.069079

99%     918.7469       1117.012       Kurtosis       16.42536

1. Produce a scatter  plot  of  the  relationship  between  per  capita  household  expenditures and household education.

Basics of Ordinary Least Squares Estimation

1. Suppose that  (for  policy  analysis  reasons)  we  are  interested  in  learning  about  the  true  effect  of individuals’ years of schooling on their expenditures (at least on average), assuming that we think that expenditures are a good proxy for income or wages. Suppose initially that we believe that the true relationship between schooling and expenditures takes the form:

hhexp = a + b hheduc + u

where I have not bothered to write down subscripts (indicating that this relationship holds true for every individual in some population we are studying). Run an Ordinary Least Squares regression of hourly wages on education by issuing the STATA command:

reg hhexp1 hheduc

1. Write down  the formulas  (see  note  below) for  the  intercept  and  slope  coefficient  estimates  that STATA has just calculated for you, replacing symbols like Y, X and n by the numbers or variables they stand for in this specific application.

hhexp=161.8762+7.114412*hheduc

NOTE: The formula for a coefficient estimate is not the equation for the line in which the coefficient you are trying to estimate belongs.  The formula is an equation with the coefficient estimate you want to calculate on the left hand side, and a mathematical expression describing how you take data and calculate the estimate on the right hand side.

1. These estimators  are  derived  by  applying  the  OLS  Principle  to  the  two-variable  linear  regression model.  What is the OLS Principle?  What is the intuition behind this principle?

OLS principle is to choose parameter estimates that minimize the squared distance between the data and the model. Intuition: to find the OLS estimators that minimize the sum of squared residuals.

Interpretation of the Slope Coefficient Estimate

1. State in plain but precise language (involving numbers and units of measurement) what the estimate of the slope coefficient tells us about what happens on average to household expenditures when the adults in the household get an additional year of education. (Note that you need the precise definitions and units of measurement of each variable).

Household expenditure increases by 7.114412 USD in case one extra year education.

1. Is the estimated effect of education on household expenditure “big” or “small”? Explain. (Note: The best answers to this question will make use of some of the descriptive statistics produced in Question 2 above. In  other  words,  what  percentage  of  mean  household  expenditures  does  this  change represent? Is a one-unit change in the independent variable a big or small change?)

One year has a small effect taking average and SD into account.

1. How would  you  interpret  the  OLS  intercept  coefficient  in  this  regression?  Does  this  make  sense? Why or why not?

It is 161.8762, meaning the expenditure in USD in case if no education (0 years).

Calculating and Examining Properties of Predicted Values and Residuals

1. In this question you will calculate the OLS predicted values two ways: first by hand and then using a built-in  STATA  command.  You  will  give  different  names  to  the  variables  you  create  by  each method, and then check to see that they are basically identical.
1. (“By hand”) Calculate a variable (called hhexphat) that contains the OLS predicted values for each observation, by using the command

genhhexphat= 161.8762+7.114412*hheduc

where you fill in the right expression. [You will need to type in the values of the OLS estimates where relevant.  Please do not round them relative to what appears in the STATA printout.]

1. (“Using STATA command”) Calculate a variable (called hhexphat2) that contains the OLS

predicted values for each observation, by using the command

predict hhexphat2

1. Now check that the two variables (hhexphat and hhexphat2) look the same by looking at summary statistics using the command

summhhexphat hhexphat2

The summarize command lets you make sure that the means, variances, minima and maxima of the two series are (pretty much) the same.  (They might differ a tiny bit as a result of rounding error.)

1. Now compare the means of the actual dependent variable and of the predicted values using the summary command. What do you notice?

They are the same.

1. Graph the scatter plot of the data (hhexp1 and hheduc) with the OLS estimated regression line running through the middle of it, using the following command below.

twoway (scatter hhexp1 hheduc) (line hhexphathheduc)

[Note: In this command, you are superimposing two two-way graphs. The first one is a “scatterplot” of household expenditures and education. This simply puts a “point” in the diagram for each combination of household expenditures and education in the dataset. The second two-way graph is  a “line” plot of hhexp1hat and education. Instead of simply adding a  point  to  the  plot  for  each combination  of  hhexp1hat  and  education  in  the  dataset,  it  connects  the  points  with  straight  lines.

Because all of the hhexp1hat-education combinations lie on a straight line, this simply adds a straight line – the regression line – to the diagram.]

1. In this question you will calculate the OLS residuals in two ways: first “by hand” and then using a built-in STATA command. You will then verify that the two calculations are the same and that the mean of the OLS residuals is zero (apart from small possible rounding error problems).
1. Create a variable (call it uhat) containing the values of the OLS residuals, using a STATA command of the form

genuhat=hhexp1-hhexphat

1. Create a variable (call it uhat2) containing the values of the OLS residuals, using the command

predict uhat2, resid

[Note: Pieces of commands that come after a comma are “options.” In this case, the option “resid” on the predict command tells it to calculate the residuals rather than the predicted values.]

1. Now verify that uhat and uhat2 are pretty much the same. What do you notice about their sample means? Does this make sense?

Yes, were small difference from 0, the rounding is responsible.

[Note: When numbers produced by STATA commands are very large or very small, STATA expresses them using “Scientific Notation,” in which a number is followed by the letter “e”, then a plus or minus sign, and then another number. For example, 2.33e-08. The way to interpret this number is to take the first number (2.33) and multiply it by 10 raised to the power indicated after the e (-08).

That is, 2.33e-08 means:

2.33e-08 = 2.33 x 10-08

= 2.33 x (1/100000000) = .0000000233

When the e is followed by a negative number (as in this example), the number is very small. When e is followed by a positive number (e.g. +09), the number is very large.]

Assessing Goodness of Fit

1. According to  your  estimates,  what  fraction  of  variation  in  household  expenditures  is  explained  by variation in years of education?  Explain briefly.

It is explained 1.11% only, as R-squared.

Understanding the Role of Units of Measurement

1. Create a new variable called ed1000 that is equal to hheduc/1000 (that is, household education divided by 1000).  Run  a  regression  of  per  capita  household  expenditures  on  this  new  variable.  Compare these  new  regression  results  to the  original  regression  results.  For  each of  the  below  items,  write down a mathematical expression stating the relationship between the original and new estimate. (For example, if the two intercepts are identical in the two models, you could write down: old intercept = new intercept.]
2. estimates of the intercept 161.8762=161.8762
3. estimates of the slope 7.114412<7114.411
4. R-squareds for the equation 0.0111=0.0111
1. Create a  new  variable  called  Y  that  is  equal  to  the  hhexp1/1000  (that  is,  household  expenditures divided by 1000). Run a regression of Y on the original education variable. Compare these regression results  to  the  original  regression  results.  For  each  of  the  below  items,  write  down  mathematical expressions stating the relationship between the original and new
2. estimates of the intercept 161.8762>0.1618762
3. estimates of the slope 7.114412>0.007114
4. R-squareds for the equation 0.0111=0.0111

# Problem Set 2

Unbiasedness and Efficiency of the OLS Estimates

Recall the model from problem set 1 that described a relationship between household expenditure and education, where the dependent variable can be seen as a measure of income (even though not precisely correct).

hhexp=b0 +b1hheduc+u

1. Using language that is as intuitive as possible, state what must be true if we are to believe that the OLSestimatorforthismodelisindeedanunbiasedestimatoroftheeffectofschoolingonhousehold expenditures? (State only the assumptions required for unbiasedness. Please state the assumptions in terms of household expenditures and education, not the more abstract Y andX.)
1. Now consider a possible violation of the assumptions you just stated. Suppose that households with higher ability tend to have higher expenditures. Because household ability is not observed and not included in the regression explicitly, the effects of motivation are captured by the error term in the population model. Suppose further that households with higher ability also have more education. Which of the assumptions you mentioned in your answer to question 2 would be violated?Explain.
1. UnderthescenariodescribedinQuestion2,wouldyouexpectOLStohaveatendency(bias)toover- or under-estimate the true effect of education on expenditures?Explain.
1. What additional assumption(s) must be valid if we are to believe that OLS is efficient (as well as unbiased)forthiscase?(Again,pleasestatetheassumptionsintermsthatarespecifictothisexample; not just in terms of abstract Y and X,etc.)
1. The STATA dataset colombia.dta contains data on men and women who are privatesector wage employees from the Colombian EncuestaNacional de Hogaresof 1994, which is an urban household survey. It contains the followingvariables
• wage=average hourly earnings in 1994 ColombianPesos
• school = years of schooling completed by theindividual
• age = age inyears
• male = indicator variable equal to 1 if the individual is male and 0 iffemale
• exper= experience in years

Estimate the following model and report your results:

wage=b0 +b1school +b2 exper+u

1. State in plain but precise language (involving numbers and any relevant units of measurement) what the estimate of the coefficient on “experience” in this regressiontells us about the relationship between experience and wages.
1. Do you consider the size of the estimated effect of experience economically important or not? Explain youranswer.
1. Calculate the 90 percent confidence interval for the coefficient on “experience” in this regression. (We could also call this the α=0.10 interval estimate for thiscoefficient.)
1. State as carefully as possible (with reference to repeated sampling) what it means to say that “this interval we just estimated is the 90 percent confidence interval for this parameter”.
1. Suppose we want to calculate the 95 percent confidence interval for the coefficient on experience. Would the 95 percent confidence interval be wider or narrower than the 90 percent confidence interval you just estimated?Explain.
1. If the assumptions required for unbiasedness and efficiency are true, what is an unbiased estimate of the variance of the OLS estimator of the coefficient on experience in this regression? (Note: I am looking for an actual number for the variance, not just aformula.) Explain your answer (or show the work that you did to calculate the unbiased estimate of thevariance).
1. Making reference to repeated sampling, explain as precisely as possible (in words, not just mathematical or statistical symbols) what the variance you just identifiedrepresents.
1. Now estimate the same model as in Question 1, but add in the variable “male”. State in plain but precise language what the estimate of the coefficient on “male” in thisregression tells us about the relationship between male andwages.

Quiz questions, adapted, on next page

1. Using the STATA output below, calculate the t-statistic for the null hypothesis that the parameter on agesq

1. Using the STATA output above, what can you say about the sign of the coefficienton

baths?

1. Briefly discuss the following statement about the STATA outputbelow:

Given the low R-squared value of 0.0078 we can conclude that the effects of income and education on cigarette consumption are neither statistically nor economically (practically) significant.

Solution

# Problem Set 2

Unbiasedness and Efficiency of the OLS Estimates

Recall the model from problem set 1 that described a relationship between household expenditure and education, where the dependent variable can be seen as a measure of income (even though not precisely correct).

hhexp=b0 +b1hheduc+u

hhexp=161.8762+7.114412*hheduc

1. Using language that is as intuitive as possible, state what must be true if we are to believe that the OLSestimatorforthismodelisindeedanunbiasedestimatoroftheeffectofschoolingonhousehold expenditures? (State only the assumptions required for unbiasedness. Please state the assumptions in terms of household expenditures and education, not the more abstract Y andX.)

No other confounders (or adjustments for them)

1. Now consider a possible violation of the assumptions you just stated. Suppose that households with higher ability tend to have higher expenditures. Because household ability is not observed and not included in the regression explicitly, the effects of motivation are captured by the error term in the population model. Suppose further that households with higher ability also have more education. Which of the assumptions you mentioned in your answer to question 2 would be violated?Explain.

confounding factor without adjustment (partially responsible for the effect)

1. Underthescenario described inQuestion2,wouldyouexpectOLStohaveatendency(bias)toover- or under-estimate the true effect of education on expenditures?Explain.

it would be an underestimationas without the adjustment the effect are from both factors but can be shown only in one (hh education)

1. What additional assumption(s) must be valid if we are to believe that OLS is efficient (as well as unbiased)forthiscase?not just in terms of abstract Y and X,etc.)

Homoskedasticity(variance of hh educations is nearly the same as variance of hh expenditure)

No autocorrelation (between hh expenditure and hh education

No measurement error

1. The STATA dataset colombia.dta contains data on men and women who are privatesector wage employees from the Colombian EncuestaNacional de Hogaresof 1994, which is an urban household survey. It contains the followingvariables
1. wage=average hourly earnings in 1994 ColombianPesos
2. school = years of schooling completed by theindividual
3. age = age inyears
4. male = indicator variable equal to 1 if the individual is male and 0 iffemale
5. exper= experience inyears

Estimate the following model and report your results:

wage=b0 +b1school +b2 exper+u

Source |       SS           df       MS      Number of obs   =    28,329

————-+———————————-   F(2, 28326)     =    979.72

Model |  2.4863e+13         2  1.2432e+13   Prob> F        =    0.0000

Residual |  3.5943e+14    28,326  1.2689e+10   R-squared       =    0.0647

————-+———————————-   Adj R-squared   =    0.0646

Total |  3.8429e+14    28,328  1.3566e+10   Root MSE        =    1.1e+05

——————————————————————————

wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

————-+—————————————————————-

school |   8092.455    183.208    44.17   0.000     7733.359    8451.552

exper |   1494.628   64.92604    23.02   0.000     1367.369    1621.886

_cons |  -48120.95   2482.583   -19.38   0.000    -52986.93   -43254.97

——————————————————————————

wage=-48120.95+8092.455*school+1494.628+exper

1. State in plain but precise language (involving numbers and any relevant units of measurement) what the estimate of the coefficient on “experience” in this regressiontells us about the relationship between experience andwages.

There is a significant positive association between wages and experience. One extra experience year increases wage by 1494.628 pesos.

1. Do you consider the size of the estimated effect of experience economically important or not? Explain youranswer.

It has a weak economically importance, close to moderate.(Cohen (1988) suggested R2 values for endogenous latent variables are assessed as follows: 0.26 (substantial), 0.13 (moderate), 0.02 (weak).)

1. Calculate the 90 percent confidence interval for the coefficient on “experience” in this regression. (We could also call this the α=0.10 interval estimate for thiscoefficient.)

regress wage school exper, level(90)

Source |       SS           df       MS      Number of obs   =    28,329

————-+———————————-   F(2, 28326)     =    979.72

Model |  2.4863e+13         2  1.2432e+13   Prob> F        =    0.0000

Residual |  3.5943e+14    28,326  1.2689e+10   R-squared       =    0.0647

————-+———————————-   Adj R-squared   =    0.0646

Total |  3.8429e+14    28,328  1.3566e+10   Root MSE        =    1.1e+05

——————————————————————————

wage |      Coef.   Std. Err.      t    P>|t|     [90% Conf. Interval]

————-+—————————————————————-

school |   8092.455    183.208    44.17   0.000     7791.095    8393.816

exper |   1494.628   64.92604    23.02   0.000      1387.83    1601.425

_cons |  -48120.95   2482.583   -19.38   0.000    -52204.57   -44037.33

lower: 1387.83    upper:1601.425

1. State as carefully as possible (with reference to repeated sampling) what it means to say that “this interval we just estimated is the 90 percent confidence interval for this parameter”.

Out of 100 estimations, 90 will fall within this range, or by other words 90 is the probability, that the truth is in this range.

1. Suppose we want to calculate the 95 percent confidence interval for the coefficient on experience. Would the 95 percent confidence interval be wider or narrower than the 90 percent confidence interval you just estimated?Explain.

The 95% is wider, as critical number is bigger (which we multiply SE by), also wider range for repeated measurement to be inside (point estimation).

1. If the assumptions required for unbiasedness and efficiency are true, what is an unbiased estimate of the variance of the OLS estimator of the coefficient on experience in this regression? (Note: I am looking for an actual number for the variance, not just aformula.) Explain your answer (or show the work that you did to calculate the unbiased estimate of thevariance).

Its SE*SE which is 4215.3907 (sqrt of var is SE which is SD/sqrt(n))

1. Making reference to repeated sampling, explain as precisely as possible (in words, not just mathematical or statistical symbols) what the variance you just identifiedrepresents.

smaller variance (and SD) results more precise estimation, less vary, tend to be more significant

1. Now estimate the same model as in Question 1, but add in the variable “male”. State in plain but precise language what the estimate of the coefficient on “male” in thisregression tells us about the relationship between male andwages.

The model gets stronger as R-squared is bigger. Male gender is a significant factor, and being male will increase wages by 18702.84 pesos.

Quiz questions, adapted, on next page

11. Using the STATA output below, calculate the t-statistic for the null hypothesis that the parameter on agesq

t=coeff/SE=0.0000284/0.0000117=2.43

1. Using the STATA output above, what can you say about the sign of the coefficienton

baths?

coeff=t*SE=0.0340505*6.95=0.23665097

1. Briefly discuss the following statement about the STATA outputbelow:

Statistical significance is present (as p<0.05) in both cases, while practically significance is low, as outcome is predicted by the predictor in less than 1 %.
Given the low R-squared value of 0.0078 we can conclude that the effects of income and education on cigarette consumption are neither statistically nor economically (practically) significant.