# Data Manipulation

### Problem Set 1** **

**Basic Data Manipulation**

- In this question, you will do some basic manipulations of the dataset in STATA.
- In many datasets, missing observations are numerically coded (ie, missing=99 or 999). This can affect the regression estimates. Before you get started on this dataset, check to make sure that there are no missing observations that are coded as “99” or “999”. If there are, replace these missing observations so that they are equal to “.”.

**Done, there were none.**

- Calculate the means, variances and other descriptive statistics for the hhexp1 and hheducvariables.

Annual HH expenditure per capita in US dollars

————————————————————-

Percentiles Smallest

1% 17.43458 10.43326

5% 41.17483 12.46576

10% 58.84205 15.54125 Obs 312

25% 95.83035 17.43458 Sum of Wgt. 312

50% 142.9946 Mean 181.1164

Largest Std. Dev. 152.0377

75% 225.2836 918.7469

90% 329.0057 1017.717 Variance 23115.47

95% 406.2822 1112.316 Skewness 3.069079

99% 918.7469 1117.012 Kurtosis 16.42536

Annual HH expenditure per capita in US dollars

————————————————————-

Percentiles Smallest

1% 17.43458 10.43326

5% 41.17483 12.46576

10% 58.84205 15.54125 Obs 312

25% 95.83035 17.43458 Sum of Wgt. 312

50% 142.9946 Mean 181.1164

Largest Std. Dev. 152.0377

75% 225.2836 918.7469

90% 329.0057 1017.717 Variance 23115.47

95% 406.2822 1112.316 Skewness 3.069079

99% 918.7469 1117.012 Kurtosis 16.42536

- Produce a scatter plot of the relationship between per capita household expenditures and household education.

Basics of Ordinary Least Squares Estimation

- Suppose that (for policy analysis reasons) we are interested in learning about the true effect of individuals’ years of schooling on their expenditures (at least on average), assuming that we think that expenditures are a good proxy for income or wages. Suppose initially that we believe that the true relationship between schooling and expenditures takes the form:

hhexp = a + b hheduc + u

where I have not bothered to write down subscripts (indicating that this relationship holds true for every individual in some population we are studying). Run an Ordinary Least Squares regression of hourly wages on education by issuing the STATA command:

reg hhexp1 hheduc

- Write down the formulas (see note below) for the intercept and slope coefficient estimates that STATA has just calculated for you, replacing symbols like Y, X and n by the numbers or variables they stand for in this specific application.

**hhexp=161.8762+7.114412*hheduc**

NOTE: The formula for a coefficient estimate is not the equation for the line in which the coefficient you are trying to estimate belongs. The formula is an equation with the coefficient estimate you want to calculate on the left hand side, and a mathematical expression describing how you take data and calculate the estimate on the right hand side.

- These estimators are derived by applying the OLS Principle to the two-variable linear regression model. What is the OLS Principle? What is the intuition behind this principle?

**OLS principle is to choose parameter estimates that minimize the squared distance between the data and the model. Intuition: to find the OLS estimators that minimize the sum of squared residuals.**

Interpretation of the Slope Coefficient Estimate

- State in plain but precise language (involving numbers and units of measurement) what the estimate of the slope coefficient tells us about what happens on average to household expenditures when the adults in the household get an additional year of education. (Note that you need the precise definitions and units of measurement of each variable).

** Household expenditure increases by 7.114412 USD in case one extra year education. **

- Is the estimated effect of education on household expenditure “big” or “small”? Explain. (Note: The best answers to this question will make use of some of the descriptive statistics produced in Question 2 above. In other words, what percentage of mean household expenditures does this change represent? Is a one-unit change in the independent variable a big or small change?)

** One year has a small effect taking average and SD into account.**

- How would you interpret the OLS intercept coefficient in this regression? Does this make sense? Why or why not?

**It is 161.8762, meaning the expenditure in USD in case if no education (0 years).**

Calculating and Examining Properties of Predicted Values and Residuals

- In this question you will calculate the OLS predicted values two ways: first by hand and then using a built-in STATA command. You will give different names to the variables you create by each method, and then check to see that they are basically identical.

- (“By hand”) Calculate a variable (called hhexphat) that contains the OLS predicted values for each observation, by using the command

genhhexphat= **161.8762+7.114412*hheduc**

where you fill in the right expression. [You will need to type in the values of the OLS estimates where relevant. Please do not round them relative to what appears in the STATA printout.]

- (“Using STATA command”) Calculate a variable (called hhexphat2) that contains the OLS

predicted values for each observation, by using the command

predict hhexphat2

- Now check that the two variables (hhexphat and hhexphat2) look the same by looking at summary statistics using the command

summhhexphat hhexphat2

The summarize command lets you make sure that the means, variances, minima and maxima of the two series are (pretty much) the same. (They might differ a tiny bit as a result of rounding error.)

- Now compare the means of the actual dependent variable and of the predicted values using the summary command. What do you notice?

**They are the same.**

- Graph the scatter plot of the data (hhexp1 and hheduc) with the OLS estimated regression line running through the middle of it, using the following command below.

twoway (scatter hhexp1 hheduc) (line hhexphathheduc)

[Note: In this command, you are superimposing two two-way graphs. The first one is a “scatterplot” of household expenditures and education. This simply puts a “point” in the diagram for each combination of household expenditures and education in the dataset. The second two-way graph is a “line” plot of hhexp1hat and education. Instead of simply adding a point to the plot for each combination of hhexp1hat and education in the dataset, it connects the points with straight lines.

Because all of the hhexp1hat-education combinations lie on a straight line, this simply adds a straight line – the regression line – to the diagram.]

- In this question you will calculate the OLS residuals in two ways: first “by hand” and then using a built-in STATA command. You will then verify that the two calculations are the same and that the mean of the OLS residuals is zero (apart from small possible rounding error problems).

- Create a variable (call it uhat) containing the values of the OLS residuals, using a STATA command of the form

genuhat=hhexp1-hhexphat

- Create a variable (call it uhat2) containing the values of the OLS residuals, using the command

predict uhat2, resid

[Note: Pieces of commands that come after a comma are “options.” In this case, the option “resid” on the predict command tells it to calculate the residuals rather than the predicted values.]

- Now verify that uhat and uhat2 are pretty much the same. What do you notice about their sample means? Does this make sense?

**Yes, were small difference from 0, the rounding is responsible.**

[Note: When numbers produced by STATA commands are very large or very small, STATA expresses them using “Scientific Notation,” in which a number is followed by the letter “e”, then a plus or minus sign, and then another number. For example, 2.33e-08. The way to interpret this number is to take the first number (2.33) and multiply it by 10 raised to the power indicated after the e (-08).

That is, 2.33e-08 means:

2.33e-08 = 2.33 x 10-08

= 2.33 x (1/100000000) = .0000000233

When the e is followed by a negative number (as in this example), the number is very small. When e is followed by a positive number (e.g. +09), the number is very large.]

Assessing Goodness of Fit

- According to your estimates, what fraction of variation in household expenditures is explained by variation in years of education? Explain briefly.

**It is explained 1.11% only, as R-squared.**

Understanding the Role of Units of Measurement

- Create a new variable called ed1000 that is equal to hheduc/1000 (that is, household education divided by 1000). Run a regression of per capita household expenditures on this new variable. Compare these new regression results to the original regression results. For each of the below items, write down a mathematical expression stating the relationship between the original and new estimate. (For example, if the two intercepts are identical in the two models, you could write down: old intercept = new intercept.]
- estimates of the intercept 161.8762=161.8762
- estimates of the slope 7.114412<7114.411
- R-squareds for the equation 0.0111=0.0111

- Create a new variable called Y that is equal to the hhexp1/1000 (that is, household expenditures divided by 1000). Run a regression of Y on the original education variable. Compare these regression results to the original regression results. For each of the below items, write down mathematical expressions stating the relationship between the original and new
- estimates of the intercept 161.8762>0.1618762
- estimates of the slope 7.114412>0.007114
- R-squareds for the equation 0.0111=0.0111

# Problem Set 2

**Unbiasedness and Efficiency of the OLS Estimates*** *

Recall the model from problem set 1 that described a relationship between household expenditure and education, where the dependent variable can be seen as a measure of income (even though not precisely correct).

*hh*exp=*b*0 +*b*1*hheduc*+*u** *

- Using language that is as intuitive as possible, state what must be true if we are to believe that the OLSestimatorforthismodelisindeedanunbiasedestimatoroftheeffectofschoolingonhousehold expenditures? (
*State only the assumptions required for unbiasedness. Please state the assumptions in terms of household expenditures and education, not the more abstract Y andX.)*

- Now consider a possible violation of the assumptions you just stated. Suppose that households with higher ability tend to have higher expenditures. Because household ability is not observed and not included in the regression explicitly, the effects of motivation are captured by the error term in the population model. Suppose further that households with higher ability also have more education. Which of the assumptions you mentioned in your answer to question 2 would be violated?Explain.

- UnderthescenariodescribedinQuestion2,wouldyouexpectOLStohaveatendency(bias)toover- or under-estimate the true effect of education on expenditures?Explain.

- What additional assumption(s) must be valid if we are to believe that OLS is efficient (as well as unbiased)forthiscase?(Again,pleasestatetheassumptionsintermsthatarespecifictothisexample; not just in terms of abstract Y and X,etc.)

- The STATA dataset colombia.dta contains data on men and women who are privatesector wage employees from the Colombian
*EncuestaNacional de Hogares*of 1994, which is an urban household survey. It contains the followingvariables*wage*=average hourly earnings in 1994 ColombianPesos*school*= years of schooling completed by theindividual*age*= age inyears*male*= indicator variable equal to 1 if the individual is male and 0 iffemale*exper*= experience in years

Estimate the following model and report your results:

*wage*=*b*0 +*b*1*school *+*b*2 exp*er*+*u** *

- State in plain but precise language (involving numbers and any relevant units of measurement) what the estimate of the coefficient on “experience” in this regressiontells us about the relationship between experience and wages.

- Do you consider the size of the estimated effect of experience economically important or not? Explain youranswer.

- Calculate the 90 percent confidence interval for the coefficient on “experience” in this regression. (We could also call this the
*α*=0.10 interval estimate for thiscoefficient.)

- State as carefully as possible (with reference to repeated sampling) what it means to say that “this interval we just estimated is the 90 percent confidence interval for this parameter”.

- Suppose we want to calculate the 95 percent confidence interval for the coefficient on experience. Would the 95 percent confidence interval be wider or narrower than the 90 percent confidence interval you just estimated?Explain.

- If the assumptions required for unbiasedness and efficiency are true, what is an unbiased estimate of the variance of the OLS estimator of the coefficient on experience in this regression? (Note: I am looking for an actual number for the variance, not just aformula.) Explain your answer (or show the work that you did to calculate the unbiased estimate of thevariance).

- Making reference to repeated sampling, explain as precisely as possible (in words, not just mathematical or statistical symbols) what the variance you just identifiedrepresents.

- Now estimate the same model as in Question 1, but add in the variable “male”. State in plain but precise language what the estimate of the coefficient on “male” in thisregression tells us about the relationship between male andwages.

Quiz questions, adapted, on next page

- Using the STATA output below, calculate the t-statistic for the null hypothesis that the parameter on
*agesq*

- Using the STATA output above, what can you say about the sign of the coefficienton

*baths*?

- Briefly discuss the following statement about the STATA outputbelow:

** **Given the low R-squared value of 0.0078 we can conclude that the effects of income and education on cigarette consumption are neither statistically nor economically (practically) significant.** **

**Solution**

# Problem Set 2* *

*Unbiasedness and Efficiency of the OLS Estimates** *

Recall the model from problem set 1 that described a relationship between household expenditure and education, where the dependent variable can be seen as a measure of income (even though not precisely correct).

*hh*exp=*b*0 +*b*1*hheduc*+*u*** **

**hhexp=161.8762+7.114412*hheduc*** *

- Using language that is as intuitive as possible, state what must be true if we are to believe that the OLSestimatorforthismodelisindeedanunbiasedestimatoroftheeffectofschoolingonhousehold expenditures? (
*State only the assumptions required for unbiasedness. Please state the assumptions in terms of household expenditures and education, not the more abstract Y andX.)*

**No other confounders (or adjustments for them)**

- Now consider a possible violation of the assumptions you just stated. Suppose that households with higher ability tend to have higher expenditures. Because household ability is not observed and not included in the regression explicitly, the effects of motivation are captured by the error term in the population model. Suppose further that households with higher ability also have more education. Which of the assumptions you mentioned in your answer to question 2 would be violated?Explain.

**confounding factor without adjustment (partially responsible for the effect)**

- Underthescenario described inQuestion2,wouldyouexpectOLStohaveatendency(bias)toover- or under-estimate the true effect of education on expenditures?Explain.

**it would be an underestimationas without the adjustment the effect are from both factors but can be shown only in one (hh education)**

- What additional assumption(s) must be valid if we are to believe that OLS is efficient (as well as unbiased)forthiscase?not just in terms of abstract Y and X,etc.)

**Homoskedasticity(variance of hh educations is nearly the same as variance of hh expenditure)**

**No autocorrelation (between hh expenditure and hh education**

**No measurement error**

- The STATA dataset colombia.dta contains data on men and women who are privatesector wage employees from the Colombian
*EncuestaNacional de Hogares*of 1994, which is an urban household survey. It contains the followingvariables*wage*=average hourly earnings in 1994 ColombianPesos*school*= years of schooling completed by theindividual*age*= age inyears*male*= indicator variable equal to 1 if the individual is male and 0 iffemale*exper*= experience inyears

Estimate the following model and report your results:

*wage*=*b*0 +*b*1*school *+*b*2 exp*er*+*u*

* Source | SS df MS Number of obs = 28,329*

*————-+———————————- F(2, 28326) = 979.72*

* Model | 2.4863e+13 2 1.2432e+13 Prob> F = 0.0000*

* Residual | 3.5943e+14 28,326 1.2689e+10 R-squared = 0.0647*

*————-+———————————- Adj R-squared = 0.0646*

* Total | 3.8429e+14 28,328 1.3566e+10 Root MSE = 1.1e+05** *

*——————————————————————————*

*wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]*

*————-+—————————————————————-*

*school | 8092.455 183.208 44.17 0.000 7733.359 8451.552*

*exper | 1494.628 64.92604 23.02 0.000 1367.369 1621.886*

* _cons | -48120.95 2482.583 -19.38 0.000 -52986.93 -43254.97*

*——————————————————————————*

*wage=-48120.95+8092.455*school+1494.628+exper*

- State in plain but precise language (involving numbers and any relevant units of measurement) what the estimate of the coefficient on “experience” in this regressiontells us about the relationship between experience andwages.

**There is a significant positive association between wages and experience. One extra experience year increases wage by 1494.628 pesos.**

- Do you consider the size of the estimated effect of experience economically important or not? Explain youranswer.

**It has a weak economically importance, close to moderate.(Cohen (1988) suggested R2 values for endogenous latent variables are assessed as follows: 0.26 (substantial), 0.13 (moderate), 0.02 (weak).)**** **

- Calculate the 90 percent confidence interval for the coefficient on “experience” in this regression. (We could also call this the
*α*=0.10 interval estimate for thiscoefficient.)

**regress wage school exper, level(90)**

** Source | SS df MS Number of obs = 28,329**

**————-+———————————- F(2, 28326) = 979.72**

** Model | 2.4863e+13 2 1.2432e+13 Prob> F = 0.0000**

** Residual | 3.5943e+14 28,326 1.2689e+10 R-squared = 0.0647**

**————-+———————————- Adj R-squared = 0.0646**

** Total | 3.8429e+14 28,328 1.3566e+10 Root MSE = 1.1e+05**** **

**——————————————————————————**

**wage | Coef. Std. Err. t P>|t| [90% Conf. Interval]**

**————-+—————————————————————-**

**school | 8092.455 183.208 44.17 0.000 7791.095 8393.816**

**exper | 1494.628 64.92604 23.02 0.000 1387.83 1601.425**

** _cons | -48120.95 2482.583 -19.38 0.000 -52204.57 -44037.33**

**lower: 1387.83 upper:1601.425 **

- State as carefully as possible (with reference to repeated sampling) what it means to say that “this interval we just estimated is the 90 percent confidence interval for this parameter”.

**Out of 100 estimations, 90 will fall within this range, or by other words 90 is the probability, that the truth is in this range.**

- Suppose we want to calculate the 95 percent confidence interval for the coefficient on experience. Would the 95 percent confidence interval be wider or narrower than the 90 percent confidence interval you just estimated?Explain.

**The 95% is wider, as critical number is bigger (which we multiply SE by), also wider range for repeated measurement to be inside (point estimation).**

- If the assumptions required for unbiasedness and efficiency are true, what is an unbiased estimate of the variance of the OLS estimator of the coefficient on experience in this regression? (Note: I am looking for an actual number for the variance, not just aformula.) Explain your answer (or show the work that you did to calculate the unbiased estimate of thevariance).

**Its SE*SE which is 4215.3907 (sqrt of var is SE which is SD/sqrt(n))**

- Making reference to repeated sampling, explain as precisely as possible (in words, not just mathematical or statistical symbols) what the variance you just identifiedrepresents.

**smaller variance (and SD) results more precise estimation, less vary, tend to be more significant**

- Now estimate the same model as in Question 1, but add in the variable “male”. State in plain but precise language what the estimate of the coefficient on “male” in thisregression tells us about the relationship between male andwages.

**The model gets stronger as R-squared is bigger. Male gender is a significant factor, and being male will increase wages by 18702.84 pesos.**

Quiz questions, adapted, on next page

11. Using the STATA output below, calculate the t-statistic for the null hypothesis that the parameter on *agesq*

**t=coeff/SE=0.0000284/0.0000117=2.43**

- Using the STATA output above, what can you say about the sign of the coefficienton

*baths*?

**coeff=t*SE=0.0340505*6.95=0.23665097**

- Briefly discuss the following statement about the STATA outputbelow:

**
Statistical significance is present (as p<0.05) in both cases, while practically significance is low, as outcome is predicted by the predictor in less than 1 %.**Given the low R-squared value of 0.0078 we can conclude that the effects of income and education on cigarette consumption are neither statistically nor economically (practically) significant.