Chapter 02 - The Simple Regression Model
import stata_setup
stata_setup.config("C:/Program Files/Stata18/", "se", splash=False)
Problem 2.1. Papke (1995) (401k)
a. Average prate & mrate
%%stata
use 401K.dta, clear
d, short
mean pra mra
. use 401K.dta, clear
. d, short
Contains data from 401K.dta
Observations: 1,534
Variables: 8 9 Jun 1998 08:20
Sorted by:
. mean pra mra
Mean estimation Number of obs = 1,534
--------------------------------------------------------------
| Mean Std. err. [95% conf. interval]
-------------+------------------------------------------------
prate | 87.36291 .4268091 86.52572 88.2001
mrate | .7315124 .0199033 .6924718 .770553
--------------------------------------------------------------
.
b & c. Regress prate on mrate; interpret the intercept and the slope coefficient.
%%stata
reg prate mrate
Source | SS df MS Number of obs = 1,534
-------------+---------------------------------- F(1, 1532) = 123.68
Model | 32001.7271 1 32001.7271 Prob > F = 0.0000
Residual | 396383.812 1,532 258.73617 R-squared = 0.0747
-------------+---------------------------------- Adj R-squared = 0.0741
Total | 428385.539 1,533 279.442622 Root MSE = 16.085
------------------------------------------------------------------------------
prate | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
mrate | 5.861079 .5270107 11.12 0.000 4.82734 6.894818
_cons | 83.07546 .5632844 147.48 0.000 81.97057 84.18035
------------------------------------------------------------------------------
d. Predicted prate at mrate = 3.5
%%stata
display _b[_cons] + _b[mrate]*3.5
103.58923
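As a cross-check on the display command above, the fitted value can be reproduced by hand. This is a minimal Python sketch with the two coefficients copied from the regression table; the helper name `predict_prate` is illustrative and not part of the original notebook.

```python
# Coefficients copied from the regression of prate on mrate above.
b0 = 83.07546   # intercept (_cons)
b1 = 5.861079   # slope on mrate

def predict_prate(mrate):
    """Fitted participation rate (in percent) for a given match rate."""
    return b0 + b1 * mrate

pred = predict_prate(3.5)
print(round(pred, 3))   # fitted prate at mrate = 3.5
print(pred > 100)       # is the prediction above the 100% ceiling?
```

The prediction exceeds 100, which a participation rate measured in percent cannot reach. A match rate of 3.5 is far above the sample mean of about 0.73, so the linear fit is being pushed well outside the bulk of the data.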
e. How much of the variation in prate is explained by mrate? Is it a lot?
%%stata
display " R-squared = " e(r2)
R-squared = .0747031
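The value stored in e(r2) is the model sum of squares divided by the total sum of squares. The short Python sketch below recomputes it from the ANOVA block of the regression output above; the variable names are illustrative.

```python
# Sums of squares copied from the regression output of prate on mrate.
ss_model = 32001.7271
ss_total = 428385.539

# R-squared is the share of total variation in prate explained by mrate.
r2 = ss_model / ss_total
print(round(r2, 4))
print(f"{100 * r2:.1f}% of the variation in prate is explained by mrate")
```

Roughly 7.5% of the variation is explained, so most of the variation in participation rates is left to factors other than the match rate.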
Problem 2.2.
a. Average salary & average tenure
%%stata
use ceosal2.dta , clear
d, short
mean lsalary ceoten comten
. use ceosal2.dta , clear
. d, short
Contains data from ceosal2.dta
Observations: 177
Variables: 15 17 Aug 1999 23:14
Sorted by:
. mean lsalary ceoten comten
Mean estimation Number of obs = 177
--------------------------------------------------------------
| Mean Std. err. [95% conf. interval]
-------------+------------------------------------------------
lsalary | 6.582848 .0455542 6.492945 6.67275
ceoten | 7.954802 .537489 6.894049 9.015555
comten | 22.50282 .9241289 20.67902 24.32662
--------------------------------------------------------------
.
b. CEOs in their first year (ceoten = 0); longest tenure as CEO
%%stata
count if ceoten==0
sum ceoten
display r(max)
. count if ceoten==0
5
. sum ceoten
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
ceoten | 177 7.954802 7.150826 0 37
. display r(max)
37
.
c. OLS of lsalary on ceoten, …
%%stata
reg lsalary ceoten
Source | SS df MS Number of obs = 177
-------------+---------------------------------- F(1, 175) = 2.33
Model | .850907024 1 .850907024 Prob > F = 0.1284
Residual | 63.795306 175 .364544606 R-squared = 0.0132
-------------+---------------------------------- Adj R-squared = 0.0075
Total | 64.6462131 176 .367308029 Root MSE = .60378
------------------------------------------------------------------------------
lsalary | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
ceoten | .0097236 .0063645 1.53 0.128 -.0028374 .0222846
_cons | 6.505498 .0679911 95.68 0.000 6.37131 6.639686
------------------------------------------------------------------------------
Problem 2.3. sleep75.dta (Biddle & Hamermesh, 1990)
a. OLS of sleep on totwrk; report the results in equation form. Interpret the intercept.
%%stata
use sleep75.dta , clear
d sleep totwrk, short
reg sleep totwrk
. use sleep75.dta , clear
. d sleep totwrk, short
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
sleep int %9.0g mins sleep at night, per wk
totwrk int %9.0g mins worked per week
. reg sleep totwrk
Source | SS df MS Number of obs = 706
-------------+---------------------------------- F(1, 704) = 81.09
Model | 14381717.2 1 14381717.2 Prob > F = 0.0000
Residual | 124858119 704 177355.282 R-squared = 0.1033
-------------+---------------------------------- Adj R-squared = 0.1020
Total | 139239836 705 197503.313 Root MSE = 421.14
------------------------------------------------------------------------------
sleep | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
totwrk | -.1507458 .0167403 -9.00 0.000 -.1836126 -.117879
_cons | 3586.377 38.91243 92.17 0.000 3509.979 3662.775
------------------------------------------------------------------------------
.
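Part a asks for the results in equation form. One way to write them, rounding the estimates in the table above (a sketch of the usual presentation, not necessarily the layout the author intended):

```latex
\[
\widehat{sleep} = 3586.38 - 0.151\, totwrk, \qquad n = 706, \quad R^2 = 0.103
\]
```

Read this way, the intercept is the predicted sleep for someone with totwrk = 0: about 3,586 minutes per week, which is roughly 59.8 hours per week or about 8.5 hours per night.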
b. If totwrk increases by 2 hours, by how much is sleep estimated to fall?
%%stata
display _b[totwrk]*2*60
-18.089499
Problem 2.4. Wage2: OLS of wage on IQ
a. Average wage, average IQ, and sample standard deviation of IQ
%%stata
use wage2.dta, clear
sum wage IQ
. use wage2.dta, clear
. sum wage IQ
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
wage | 935 957.9455 404.3608 115 3078
IQ | 935 101.2824 15.05264 50 145
.
b. Effect of a 15-point increase in IQ on wage (in constant dollars)
%%stata
reg wage IQ
display "wage= " %5.3f _b[_cons] "+" %5.3f _b[IQ] "IQ; N=" _N ",Rsq=" %5.4f e(r2)
display _b[IQ]*15
. reg wage IQ
Source | SS df MS Number of obs = 935
-------------+---------------------------------- F(1, 933) = 98.55
Model | 14589782.6 1 14589782.6 Prob > F = 0.0000
Residual | 138126386 933 148045.429 R-squared = 0.0955
-------------+---------------------------------- Adj R-squared = 0.0946
Total | 152716168 934 163507.675 Root MSE = 384.77
------------------------------------------------------------------------------
wage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
IQ | 8.303064 .8363951 9.93 0.000 6.661631 9.944498
_cons | 116.9916 85.64153 1.37 0.172 -51.08078 285.0639
------------------------------------------------------------------------------
. display "wage= " %5.3f _b[_cons] "+" %5.3f _b[IQ] "IQ; N=" _N ",Rsq=" %5.4f e
> (r2)
wage= 116.992+8.303IQ; N=935,Rsq=0.0955
. display _b[IQ]*15
124.54596
.
c. Effect of a 15-point increase in IQ on wage (in percent)
%%stata
reg lwage IQ
display "lwage= " %5.3f _b[_cons] "+" %5.3f _b[IQ] "IQ; N=" _N ",Rsq=" %5.4f e(r2)
display "0" _b[IQ]*15
. reg lwage IQ
Source | SS df MS Number of obs = 935
-------------+---------------------------------- F(1, 933) = 102.62
Model | 16.4150939 1 16.4150939 Prob > F = 0.0000
Residual | 149.241189 933 .159958402 R-squared = 0.0991
-------------+---------------------------------- Adj R-squared = 0.0981
Total | 165.656283 934 .177362188 Root MSE = .39995
------------------------------------------------------------------------------
lwage | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
IQ | .0088072 .0008694 10.13 0.000 .007101 .0105134
_cons | 5.886994 .0890206 66.13 0.000 5.712291 6.061698
------------------------------------------------------------------------------
. display "lwage= " %5.3f _b[_cons] "+" %5.3f _b[IQ] "IQ; N=" _N ",Rsq=" %5.4f
> e(r2)
lwage= 5.887+0.009IQ; N=935,Rsq=0.0991
. display "0" _b[IQ]*15
0.13210734
.
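The figure 0.132 is a change in log(wage), which reads as roughly a 13.2% increase. For a change this large the log approximation understates the exact percentage change, which the Python sketch below illustrates; the coefficient is copied from the lwage regression above and the variable names are illustrative.

```python
import math

b_iq = 0.0088072   # slope on IQ from the regression of lwage on IQ
delta_iq = 15      # size of the IQ increase considered in part c

log_change = b_iq * delta_iq                    # change in log(wage)
approx_pct = 100 * log_change                   # usual approximation
exact_pct = 100 * (math.exp(log_change) - 1)    # exact percentage change

print(round(approx_pct, 2), round(exact_pct, 2))
```

The approximate and exact figures differ by about one percentage point here; for small changes in the regressor the two are nearly identical.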
Problem 2.5 rdchem: R&D on sales
a. Model for elasticity?
\(\log(rd)=\beta_0 +\beta_1 \log(sales) + u\); \(\beta_1\) is the elasticity of R&D spending with respect to sales.
b. Estimate \(\beta_1\).
%%stata
use rdchem.dta , clear
reg lrd lsale
. use rdchem.dta , clear
. reg lrd lsale
Source | SS df MS Number of obs = 32
-------------+---------------------------------- F(1, 30) = 302.72
Model | 84.8395785 1 84.8395785 Prob > F = 0.0000
Residual | 8.40768588 30 .280256196 R-squared = 0.9098
-------------+---------------------------------- Adj R-squared = 0.9068
Total | 93.2472644 31 3.00797627 Root MSE = .52939
------------------------------------------------------------------------------
lrd | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lsales | 1.075731 .0618275 17.40 0.000 .9494619 1.201999
_cons | -4.104722 .4527678 -9.07 0.000 -5.029398 -3.180047
------------------------------------------------------------------------------
.
Problem 2.6 meap93: math pass rate (math10) & spending per student (expend)
a. Diminishing effect
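b. Population model and the effect of spending:
\[
math10 = \beta_0 + \beta_1 \ln(expend) + u
\]
Writing \(y = math10\) and \(x = expend\), the chain rule gives
\[
\frac{dy}{dx} = \frac{dy}{d\ln x}\cdot\frac{d\ln x}{dx} = \beta_1\cdot\frac{1}{x}.
\]
Equivalently, with \(\gamma = dy/dx\), \(\frac{dy}{d\ln x}=\beta_1=\gamma x \implies x=\frac{\beta_1}{\gamma}\). In changes, \(\Delta y \approx \beta_1 \cdot \Delta x / x = (\beta_1/100)\cdot \%\Delta x\), so a 10% increase in expend changes math10 by about \(\beta_1/10\) percentage points.
c. OLS of \(math10\) on \(lexpend\)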
%%stata
use meap93.dta, clear
reg math10 lexpend
display "math10= " %5.3f _b[_cons] "+" %5.3f _b[lexpend] "log(expend); N=" _N ",Rsq=" %5.4f e(r2)
. use meap93.dta, clear
. reg math10 lexpend
Source | SS df MS Number of obs = 408
-------------+---------------------------------- F(1, 406) = 12.41
Model | 1329.42517 1 1329.42517 Prob > F = 0.0005
Residual | 43487.7553 406 107.112698 R-squared = 0.0297
-------------+---------------------------------- Adj R-squared = 0.0273
Total | 44817.1805 407 110.115923 Root MSE = 10.35
------------------------------------------------------------------------------
math10 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
lexpend | 11.16439 3.169011 3.52 0.000 4.934677 17.39411
_cons | -69.3411 26.53013 -2.61 0.009 -121.4947 -17.18753
------------------------------------------------------------------------------
. display "math10= " %5.3f _b[_cons] "+" %5.3f _b[lexpend] "log(expend); N=" _N
> ",Rsq=" %5.4f e(r2)
math10= -69.341+11.164log(expend); N=408,Rsq=0.0297
.
d. How big is the effect if spending increases by 10%? (The result is in percentage points of math10.)
%%stata
display _b[lexpend]/10 "%"
1.1164395%
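Dividing the slope by 10 relies on the approximation that a 10% increase in expend raises log(expend) by 0.10. The Python sketch below compares that with the exact change in the log, ln(1.10); the coefficient is copied from the regression above and the variable names are illustrative.

```python
import math

b_lexpend = 11.16439   # slope on lexpend from the regression of math10 on lexpend

# Approximation: a 10% rise in spending changes log(expend) by about 0.10.
approx_points = b_lexpend * 0.10

# Exact: log(1.10 * expend) - log(expend) = log(1.10).
exact_points = b_lexpend * math.log(1.10)

print(round(approx_points, 3), round(exact_points, 3))
```

Either way the estimated effect is close to one percentage point of math10, which is modest for a 10% increase in spending.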
e. Why is "math10>100" not much of a worry in this data set?
Problem 2.7 charity: gifts and mailings; imported from R (wooldridge package)
a. & b.
%%stata
use charity.dta, clear
sum gift mails
count if gift==0
display 100*r(N)/4268 "%"
. use charity.dta, clear
(Written by R. )
. sum gift mails
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
gift | 4,268 7.44447 15.06256 0 250
mailsyear | 4,268 2.049555 .66758 .25 3.5
. count if gift==0
2,561
. display 100*r(N)/4268 "%"
60.004686%
.
c. Regress gift on mailsyear (mailings per year)
%%stata
reg gift mails
display "gift= " %5.3f _b[_cons] "+" %5.3f _b[mails] "mails; N=" _N ",Rsq=" %5.4f e(r2)
. reg gift mails
Source | SS df MS Number of obs = 4,268
-------------+---------------------------------- F(1, 4266) = 59.65
Model | 13349.7251 1 13349.7251 Prob > F = 0.0000
Residual | 954750.114 4,266 223.804528 R-squared = 0.0138
-------------+---------------------------------- Adj R-squared = 0.0136
Total | 968099.84 4,267 226.880675 Root MSE = 14.96
------------------------------------------------------------------------------
gift | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
mailsyear | 2.649546 .3430598 7.72 0.000 1.976971 3.322122
_cons | 2.01408 .7394696 2.72 0.006 .5643347 3.463825
------------------------------------------------------------------------------
. display "gift= " %5.3f _b[_cons] "+" %5.3f _b[mails] "mails; N=" _N ",Rsq=" %
> 5.4f e(r2)
gift= 2.014+2.650mails; N=4268,Rsq=0.0138
.
d. Does the charity make a profit if the per-unit cost of a mailing is one guilder?
%%stata
display _b[mails] - 1
1.6495464
e. The smallest predicted gift (i.e., at mailsyear = 0)
%%stata
margins, at(mail=0)
Adjusted predictions Number of obs = 4,268
Model VCE: OLS
Expression: Linear prediction, predict()
At: mailsyear = 0
------------------------------------------------------------------------------
| Delta-method
| Margin std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
_cons | 2.01408 .7394696 2.72 0.006 .5643347 3.463825
------------------------------------------------------------------------------
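The margins call evaluates the fitted line at mailsyear = 0, but the summary statistics above show that the smallest value of mailsyear in the sample is 0.25. The Python sketch below compares the two predictions using the coefficients from the regression table; the variable names are illustrative.

```python
# Coefficients copied from the regression of gift on mailsyear.
b0 = 2.01408     # intercept (_cons)
b1 = 2.649546    # slope on mailsyear

min_mailsyear = 0.25   # sample minimum of mailsyear from the summary table

pred_at_zero = b0                            # what margins, at(mail=0) reports
pred_at_sample_min = b0 + b1 * min_mailsyear  # smallest fitted value in the sample

print(round(pred_at_zero, 3), round(pred_at_sample_min, 3))
```

Because both the intercept and the slope are positive, the fitted line never predicts a gift of zero for any nonnegative number of mailings, even though about 60% of the observations have gift = 0.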
Problem 2.8
a.
%%stata
clear
set obs 500
g x_ = uniform()
g x = x_ *10
sum x
. clear
. set obs 500
Number of observations (_N) was 0, now 500.
. g x_ = uniform()
. g x = x_ *10
. sum x
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
x | 500 5.077128 2.924345 .013631 9.997507
.
b.
%%stata
g u_ = runiform()
g u = u_ *6
sum u
. g u_ = runiform()
. g u = u_ *6
. sum u
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
u | 500 2.893636 1.738422 .0011084 5.987668
.
c.
%%stata
g y = 1 + 2*x + u
reg y x
. g y = 1 + 2*x + u
. reg y x
Source | SS df MS Number of obs = 500
-------------+---------------------------------- F(1, 498) = 5664.28
Model | 17151.3257 1 17151.3257 Prob > F = 0.0000
Residual | 1507.93458 498 3.02798109 R-squared = 0.9192
-------------+---------------------------------- Adj R-squared = 0.9190
Total | 18659.2603 499 37.3933072 Root MSE = 1.7401
------------------------------------------------------------------------------
y | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
x | 2.004795 .0266378 75.26 0.000 1.952459 2.057131
_cons | 3.869292 .1560343 24.80 0.000 3.562726 4.175859
------------------------------------------------------------------------------
.
d.
%%stata
qui reg y x
predict uh, residual
g xuh=x*uh
//verify if E(uh)=E(x'uh)=0 ; compare results with E(u)=E(x'u)=0. Discuss.
sum xuh uh u
. qui reg y x
. predict uh, residual
. g xuh=x*uh
. //verify if E(uh)=E(x'uh)=0 ; compare results with E(u)=E(x'u)=0. Discuss.
. sum xuh uh u
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
xuh | 500 2.68e-09 10.04505 -28.13847 26.71127
uh | 500 -9.37e-10 1.738365 -2.895379 3.101357
u | 500 2.893636 1.738422 .0011084 5.987668
.
e.
%%stata
g xu = x * u
sum xu xuh uh u
. g xu = x * u
. sum xu xuh uh u
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
xu | 500 14.73229 13.00821 .0034057 53.14323
xuh | 500 2.68e-09 10.04505 -28.13847 26.71127
uh | 500 -9.37e-10 1.738365 -2.895379 3.101357
u | 500 2.893636 1.738422 .0011084 5.987668
.
f. Rerun the simulation two or three times, compare the results, and draw conclusions.
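The whole experiment in parts a to e can also be repeated outside Stata. Below is a minimal standard-library Python sketch that assumes the same design as the Stata cells (x uniform on [0, 10], u uniform on [0, 6], y = 1 + 2x + u, n = 500). The function name `simulate` and the seeds are illustrative choices, so the numbers will not match the Stata draws.

```python
import random

def simulate(seed, n=500):
    """Draw x ~ U(0,10) and u ~ U(0,6), build y = 1 + 2x + u, fit OLS of y on x."""
    rng = random.Random(seed)
    x = [10 * rng.random() for _ in range(n)]
    u = [6 * rng.random() for _ in range(n)]
    y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]

    # Closed-form simple-regression estimates.
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    return {
        "intercept": intercept,
        "slope": slope,
        "mean_resid": sum(resid) / n,                                  # zero by construction
        "mean_x_resid": sum(xi * ri for xi, ri in zip(x, resid)) / n,  # zero by construction
        "mean_u": sum(u) / n,                                          # near 3, not 0
        "mean_xu": sum(xi * ui for xi, ui in zip(x, u)) / n,           # near 15, not 0
    }

results = [simulate(seed) for seed in (1, 2, 3)]
for r in results:
    print({k: round(v, 4) for k, v in r.items()})
```

Across seeds the residual moments are zero up to rounding, because OLS imposes them, while the true errors average about 3. For the same reason the estimated intercept settles near 1 + 3 = 4 rather than 1, while the slope stays close to 2.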
Problem 2.9 countymurders: 1996 only
a. How many counties had zero murders in 1996? How many had at least one execution? What is the largest number of executions?
%%stata
use countymurders.dta, clear
keep if year==1996
count if murder==0 //counties with zero murder
count if execs>0 //counties with at least one execution
sum execs if murder>0
display r(max)
. use countymurders.dta, clear
(Written by R. )
. keep if year==1996
(35,152 observations deleted)
. count if murder==0 //counties with zero murder
1,051
. count if execs>0 //counties with at least one execution
31
. sum execs if murder>0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
execs | 1,146 .0296684 .1937704 0 3
. display r(max)
3
.
b. OLS of murders on execs; report the results in the usual form, including N and R^2.
%%stata
reg murders execs
display "murders= " %5.2f _b[_cons] "+" %5.2f _b[execs] "execs; N= " _N ",Rsq=" %5.4f e(r2)
. reg murders execs
Source | SS df MS Number of obs = 2,197
-------------+---------------------------------- F(1, 2195) = 100.77
Model | 152381.693 1 152381.693 Prob > F = 0.0000
Residual | 3319359.01 2,195 1512.23645 R-squared = 0.0439
-------------+---------------------------------- Adj R-squared = 0.0435
Total | 3471740.7 2,196 1580.93839 Root MSE = 38.887
------------------------------------------------------------------------------
murders | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
execs | 58.55548 5.833255 10.04 0.000 47.1162 69.99476
_cons | 5.457241 .834838 6.54 0.000 3.820086 7.094396
------------------------------------------------------------------------------
. display "murders= " %5.2f _b[_cons] "+" %5.2f _b[execs] "execs; N= " _N ",Rsq
> =" %5.4f e(r2)
murders= 5.46+58.56execs; N= 2197,Rsq=0.0439
.
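c. Interpret the slope coefficient.
d. The smallest number of murders this model can predict is the intercept, reached when execs = 0; the residual for a county with zero murders and zero executions is the negative of that prediction.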
%%stata
display _b[_cons] + _b[execs]*0
predict u, residual
sum u if murder==0 & execs==0
. display _b[_cons] + _b[execs]*0
5.4572409
. predict u, residual
. sum u if murder==0 & execs==0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
u | 1,050 -5.457241 0 -5.457241 -5.457241
.
e. Why is OLS not suitable? Endogeneity issues: omitted variables, measurement error, simultaneity.
Problem 2.10
a. Sample size, mean & SD of math12 & read12.
%%stata
use catholic.dta, clear
sum math12 read12
. use catholic.dta, clear
(Written by R. )
. sum math12 read12
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
math12 | 7,430 52.13362 9.459117 29.5 71.37
read12 | 7,430 51.7724 9.407761 29.15 68.09
.
b. OLS of math12 on read12.
%%stata
reg math12 read12
display "math12= " %5.2f _b[_cons] "+" %5.2f _b[read12] "read12; N= " _N ",Rsq=" %5.4f e(r2)
. reg math12 read12
Source | SS df MS Number of obs = 7,430
-------------+---------------------------------- F(1, 7428) = 7568.58
Model | 335470.113 1 335470.113 Prob > F = 0.0000
Residual | 329238.93 7,428 44.3240347 R-squared = 0.5047
-------------+---------------------------------- Adj R-squared = 0.5046
Total | 664709.043 7,429 89.4749015 Root MSE = 6.6576
------------------------------------------------------------------------------
math12 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
read12 | .7142915 .0082105 87.00 0.000 .6981966 .7303863
_cons | 15.15304 .432036 35.07 0.000 14.30612 15.99995
------------------------------------------------------------------------------
. display "math12= " %5.2f _b[_cons] "+" %5.2f _b[read12] "read12; N= " _N ",Rs
> q=" %5.4f e(r2)
math12= 15.15+ 0.71read12; N= 7430,Rsq=0.5047
.
c. Interpret the intercept.
d. Are you surprised by the \(\hat\beta_1\) that you found? What about \(R^2\)?
e. I would run the reverse regression to refute the comment.
%%stata
reg read12 math12
Source | SS df MS Number of obs = 7,430
-------------+---------------------------------- F(1, 7428) = 7568.58
Model | 331837.266 1 331837.266 Prob > F = 0.0000
Residual | 325673.561 7,428 43.8440443 R-squared = 0.5047
-------------+---------------------------------- Adj R-squared = 0.5046
Total | 657510.828 7,429 88.5059668 Root MSE = 6.6215
------------------------------------------------------------------------------
read12 | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
math12 | .7065563 .0081216 87.00 0.000 .6906358 .7224769
_cons | 14.93706 .4303184 34.71 0.000 14.09352 15.78061
------------------------------------------------------------------------------
Spurious correlation or causality?
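The two regressions above share the same R-squared and t statistic, which reflects a general property of simple regression: the product of the two slopes equals the squared sample correlation. The Python sketch below checks this with the numbers reported above; the variable names are illustrative.

```python
import math

# Slopes and summary statistics copied from the output above.
b_math_on_read = 0.7142915   # reg math12 read12
b_read_on_math = 0.7065563   # reg read12 math12
r2_reported = 0.5047         # R-squared of both regressions
sd_math, sd_read = 9.459117, 9.407761

# The product of the two slopes equals R-squared in simple regression.
product = b_math_on_read * b_read_on_math

# Each slope is the correlation scaled by the ratio of standard deviations.
corr = math.sqrt(product)
implied_slope = corr * sd_math / sd_read

print(round(product, 4), round(corr, 4), round(implied_slope, 4))
```

Since the fit is symmetric in the two scores, the regression alone gives no reason to read the association as running from reading to math rather than the other way around.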