Chapter 2. The Simple Regression Model
import statsmodels.api as sm
import statsmodels.formula.api as smf
from wooldridge import dataWoo  # only dataWoo is used in this chapter
Example 2.3. CEO Salary & Return on Equity
dataWoo()
J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
Cengage Learning, 6th edition.
401k 401ksubs admnrev affairs airfare
alcohol apple approval athlet1 athlet2
attend audit barium beauty benefits
beveridge big9salary bwght bwght2 campus
card catholic cement census2000 ceosal1
ceosal2 charity consump corn countymurders
cps78_85 cps91 crime1 crime2 crime3
crime4 discrim driving earns econmath
elem94_95 engin expendshares ezanders ezunem
fair fertil1 fertil2 fertil3 fish
fringe gpa1 gpa2 gpa3 happiness
hprice1 hprice2 hprice3 hseinv htv
infmrt injury intdef intqrt inven
jtrain jtrain2 jtrain3 kielmc lawsch85
loanapp lowbrth mathpnl meap00_01 meap01
meap93 meapsingle minwage mlb1 mroz
murder nbasal nyse okun openness
pension phillips pntsprd prison prminwge
rdchem rdtelec recid rental return
saving sleep75 slp75_81 smoke traffic1
traffic2 twoyear volat vote1 vote2
voucher wage1 wage2 wagepan wageprc
wine
df = dataWoo('ceosal1')
dataWoo('ceosal1', description=True)
name of dataset: ceosal1
no of variables: 12
no of observations: 209
+----------+-------------------------------+
| variable | label |
+----------+-------------------------------+
| salary | 1990 salary, thousands $ |
| pcsalary | % change salary, 89-90 |
| sales | 1990 firm sales, millions $ |
| roe | return on equity, 88-90 avg |
| pcroe | % change roe, 88-90 |
| ros | return on firm's stock, 88-90 |
| indus | =1 if industrial firm |
| finance | =1 if financial firm |
| consprod | =1 if consumer product firm |
| utility | =1 if transport. or utilties |
| lsalary | natural log of salary |
| lsales | natural log of sales |
+----------+-------------------------------+
I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
sm.stats.anova_lm(salary_ols)
| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| roe | 1.0 | 5.166419e+06 | 5.166419e+06 | 2.766532 | 0.097768 |
| Residual | 207.0 | 3.865666e+08 | 1.867471e+06 | NaN | NaN |
salary_ols.params
Intercept 963.191336
roe 18.501186
dtype: float64
print(salary_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 2.767
Date: Mon, 11 Dec 2023 Prob (F-statistic): 0.0978
Time: 18:36:14 Log-Likelihood: -1804.5
No. Observations: 209 AIC: 3613.
Df Residuals: 207 BIC: 3620.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 963.1913 213.240 4.517 0.000 542.790 1383.592
roe 18.5012 11.123 1.663 0.098 -3.428 40.431
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
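The fitted line can be used to predict salary at a chosen return on equity. The following is a minimal sketch that copies the coefficients printed by `salary_ols.params` above; the helper name `predict_salary` is illustrative and is not part of statsmodels or the wooldridge package.

```python
# Coefficients copied from the salary_ols.params output above.
INTERCEPT = 963.191336   # predicted salary (thousands $) when roe = 0
SLOPE = 18.501186        # change in predicted salary per one-point change in roe


def predict_salary(roe):
    """Predicted 1990 salary in thousands of dollars (hypothetical helper)."""
    return INTERCEPT + SLOPE * roe


salary_at_0 = predict_salary(0)
salary_at_30 = predict_salary(30)
print(salary_at_0, salary_at_30)
```

With roe = 30 the predicted salary is about 1,518 thousand dollars, i.e. roughly 555 thousand dollars above the intercept.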
Example 2.4. Wage Equation
df = dataWoo('wage1')
dataWoo('wage1', description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526
+----------+---------------------------------+
| variable | label |
+----------+---------------------------------+
| wage | average hourly earnings |
| educ | years of education |
| exper | years potential experience |
| tenure | years with current employer |
| nonwhite | =1 if nonwhite |
| female | =1 if female |
| married | =1 if married |
| numdep | number of dependents |
| smsa | =1 if live in SMSA |
| northcen | =1 if live in north central U.S |
| south | =1 if live in southern region |
| west | =1 if live in western region |
| construc | =1 if work in construc. indus. |
| ndurman | =1 if in nondur. manuf. indus. |
| trcommpu | =1 if in trans, commun, pub ut |
| trade | =1 if in wholesale or retail |
| services | =1 if in services indus. |
| profserv | =1 if in prof. serv. indus. |
| profocc | =1 if in profess. occupation |
| clerocc | =1 if in clerical occupation |
| servocc | =1 if in service occupation |
| lwage | log(wage) |
| expersq | exper^2 |
| tenursq | tenure^2 |
+----------+---------------------------------+
These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.
df[['wage','educ']].describe()
| | wage | educ |
|---|---|---|
| count | 526.000000 | 526.000000 |
| mean | 5.896103 | 12.562738 |
| std | 3.693086 | 2.769022 |
| min | 0.530000 | 0.000000 |
| 25% | 3.330000 | 12.000000 |
| 50% | 4.650000 | 12.000000 |
| 75% | 6.880000 | 14.000000 |
| max | 24.980000 | 18.000000 |
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
sm.stats.anova_lm(wage_ols)
print(wage_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.165
Model: OLS Adj. R-squared: 0.163
Method: Least Squares F-statistic: 103.4
Date: Mon, 11 Dec 2023 Prob (F-statistic): 2.78e-22
Time: 18:36:14 Log-Likelihood: -1385.7
No. Observations: 526 AIC: 2775.
Df Residuals: 524 BIC: 2784.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.9049 0.685 -1.321 0.187 -2.250 0.441
educ 0.5414 0.053 10.167 0.000 0.437 0.646
==============================================================================
Omnibus: 212.554 Durbin-Watson: 1.824
Prob(Omnibus): 0.000 Jarque-Bera (JB): 807.843
Skew: 1.861 Prob(JB): 3.79e-176
Kurtosis: 7.797 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Example 2.6. Table 2.2
# Table 2.2 is built from the CEO salary data, so reload ceosal1 first
df = dataWoo('ceosal1')
df['salary_hat'] = salary_ols.fittedvalues  # fitted values from the Example 2.3 model
df['uhat'] = salary_ols.resid               # residuals: salary - salary_hat
df = round(df[['roe', 'salary', 'salary_hat', 'uhat']], 3)
df.head(16)
df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print(salary_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 2.767
Date: Mon, 11 Dec 2023 Prob (F-statistic): 0.0978
Time: 18:36:15 Log-Likelihood: -1804.5
No. Observations: 209 AIC: 3613.
Df Residuals: 207 BIC: 3620.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 963.1913 213.240 4.517 0.000 542.790 1383.592
roe 18.5012 11.123 1.663 0.098 -3.428 40.431
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Example 2.7. Wage & education
df0 = dataWoo('wage1')
df = df0[['wage', 'educ']]
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
print(wage_ols.summary())
OLS Regression Results
==============================================================================
Dep. Variable: wage R-squared: 0.165
Model: OLS Adj. R-squared: 0.163
Method: Least Squares F-statistic: 103.4
Date: Mon, 11 Dec 2023 Prob (F-statistic): 2.78e-22
Time: 18:36:15 Log-Likelihood: -1385.7
No. Observations: 526 AIC: 2775.
Df Residuals: 524 BIC: 2784.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.9049 0.685 -1.321 0.187 -2.250 0.441
educ 0.5414 0.053 10.167 0.000 0.437 0.646
==============================================================================
Omnibus: 212.554 Durbin-Watson: 1.824
Prob(Omnibus): 0.000 Jarque-Bera (JB): 807.843
Skew: 1.861 Prob(JB): 3.79e-176
Kurtosis: 7.797 Cond. No. 60.2
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# if educ=12.56, then wage_hat is
wage_hat = 0.5414*12.56 - 0.9049
wage_hat
5.895084000000001
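The value 12.56 is the sample mean of educ, and the prediction lands close to the sample mean of wage. This reflects a general property of OLS with an intercept: the fitted line passes through the point of sample means. A small check, assuming the means from the `describe()` output in Example 2.4 and the rounded coefficients from the summary above (so the match is only approximate):

```python
# Sample means copied from the describe() output in Example 2.4.
MEAN_WAGE = 5.896103
MEAN_EDUC = 12.562738

# Rounded coefficients copied from the regression summary above.
INTERCEPT = -0.9049
SLOPE = 0.5414

wage_hat_at_mean = INTERCEPT + SLOPE * MEAN_EDUC
gap = wage_hat_at_mean - MEAN_WAGE  # nonzero only because the coefficients are rounded
print(wage_hat_at_mean, gap)
```

Using the unrounded values in `wage_ols.params` instead of the four-decimal coefficients would shrink the gap further.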
Example 2.8. CEO Salary - R-squared
df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print('Parameters:', salary_ols.params)
print('Std:', salary_ols.bse)
print('R2: ', salary_ols.rsquared)
Parameters: Intercept 963.191336
roe 18.501186
dtype: float64
Std: Intercept 213.240257
roe 11.123251
dtype: float64
R2: 0.01318862408103405
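R-squared is the explained sum of squares divided by the total sum of squares, so it can be recovered from the ANOVA table printed in Example 2.3. The sketch below copies the two `sum_sq` entries and the degrees of freedom from that table; the constant names are illustrative.

```python
# Sums of squares and degrees of freedom copied from the anova_lm output in Example 2.3.
SS_EXPLAINED = 5.166419e+06   # row "roe"
SS_RESIDUAL = 3.865666e+08    # row "Residual"
DF_MODEL = 1
DF_RESIDUAL = 207

ss_total = SS_EXPLAINED + SS_RESIDUAL
r_squared = SS_EXPLAINED / ss_total
# F statistic: ratio of the explained and residual mean squares.
f_stat = (SS_EXPLAINED / DF_MODEL) / (SS_RESIDUAL / DF_RESIDUAL)
print(r_squared, f_stat)
```

Both numbers agree, up to the rounding of the copied inputs, with the `R2` value printed above and the F-statistic in the regression summary.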
Example 2.9. Voting outcome - R-squared
See Example 2.5 for details.
from statsmodels.iolib.summary2 import summary_col
df0 = dataWoo('vote1')
df = df0[['voteA', 'shareA']]
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()
print(summary_col([vote_ols],stars=True, float_format='%0.2f'))
print('R2: ', vote_ols.rsquared)
=======================
voteA
-----------------------
Intercept 26.81***
(0.89)
shareA 0.46***
(0.01)
R-squared 0.86
R-squared Adj. 0.86
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
R2: 0.8561408655827665
Example 2.3 in Section 2.4: Units of measurement & functional form
df = dataWoo('ceosal1')
df['salary1000'] = df.salary * 1000  # salary in dollars instead of thousands of dollars
salary_ols1000 = smf.ols(formula='salary1000 ~ 1 + roe', data=df).fit()
print(salary_ols1000.summary())
OLS Regression Results
==============================================================================
Dep. Variable: salary1000 R-squared: 0.013
Model: OLS Adj. R-squared: 0.008
Method: Least Squares F-statistic: 2.767
Date: Mon, 11 Dec 2023 Prob (F-statistic): 0.0978
Time: 18:36:15 Log-Likelihood: -3248.3
No. Observations: 209 AIC: 6501.
Df Residuals: 207 BIC: 6507.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 9.632e+05 2.13e+05 4.517 0.000 5.43e+05 1.38e+06
roe 1.85e+04 1.11e+04 1.663 0.098 -3428.196 4.04e+04
==============================================================================
Omnibus: 311.096 Durbin-Watson: 2.105
Prob(Omnibus): 0.000 Jarque-Bera (JB): 31120.902
Skew: 6.915 Prob(JB): 0.00
Kurtosis: 61.158 Cond. No. 43.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print(summary_col([salary_ols1000,salary_ols],stars=True,float_format='%0.2f',
info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.2f}".format(x.rsquared)}))
=====================================
salary1000 salary
-------------------------------------
Intercept 963191.34*** 963.19***
(213240.26) (213.24)
roe 18501.19* 18.50*
(11123.25) (11.12)
R-squared 0.01 0.01
R-squared Adj. 0.01 0.01
N 209 209
R2 0.01 0.01
=====================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
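The side-by-side table shows that multiplying the dependent variable by 1000 multiplies the intercept, slope, and standard errors by 1000 while leaving R-squared unchanged. The sketch below reproduces that behaviour with the closed-form simple-regression formulas; the function `simple_ols` and the five data points are made up for illustration and are not taken from ceosal1.

```python
def simple_ols(x, y):
    """Closed-form simple regression: returns (intercept, slope, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ss_res / ss_tot


# Made-up illustrative data, NOT taken from ceosal1.
x = [5.0, 10.0, 15.0, 20.0, 25.0]
y = [900.0, 1100.0, 1050.0, 1400.0, 1300.0]

b0, b1, r2 = simple_ols(x, y)
b0_scaled, b1_scaled, r2_scaled = simple_ols(x, [1000 * yi for yi in y])

# Ratios of the rescaled to the original coefficients, and the change in R-squared.
print(b0_scaled / b0, b1_scaled / b1, r2_scaled - r2)
```

Both coefficient ratios come out at 1000 and the R-squared difference at zero, up to floating-point error.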
Example 2.10. A log wage equation (log-lin model, semi-elasticity)
df0 = dataWoo('wage1')
df = df0[['wage', 'lwage', 'educ']]
lwage_ols = smf.ols(formula='lwage ~ 1 + educ', data=df).fit()
print(summary_col([lwage_ols],stars=True,float_format='%0.3f',
info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.3f}".format(x.rsquared)}))
=======================
lwage
-----------------------
Intercept 0.584***
(0.097)
educ 0.083***
(0.008)
R-squared 0.186
R-squared Adj. 0.184
N 526
R2 0.186
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
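In a log-lin model, 100 times the slope approximates the percentage change in wage for one more year of education; the exact change is 100 * (exp(slope) - 1). A short sketch comparing the two, assuming the rounded educ coefficient from the table above:

```python
import math

BETA_EDUC = 0.083  # educ coefficient from the lwage regression above (rounded)

approx_pct = 100 * BETA_EDUC                 # usual semi-elasticity approximation
exact_pct = 100 * (math.exp(BETA_EDUC) - 1)  # exact percentage change in wage
print(approx_pct, exact_pct)
```

The approximation gives about 8.3% per year of education and the exact figure is slightly higher; the two drift further apart as the coefficient grows.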
Example 2.11. CEO Salary & Firm Sales (log-log model, elasticity)
df = dataWoo('ceosal1')
lsalary_ols = smf.ols(formula='lsalary ~ 1 + lsales', data=df).fit()
print(summary_col([lsalary_ols],stars=True,float_format='%0.3f',
info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.3f}".format(x.rsquared)}))
=======================
lsalary
-----------------------
Intercept 4.822***
(0.288)
lsales 0.257***
(0.035)
R-squared 0.211
R-squared Adj. 0.207
N 209
R2 0.211
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
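In a log-log model the slope is an elasticity: a 1% increase in sales is associated with roughly a 0.257% increase in salary. For larger changes the linear approximation and the exact calculation differ slightly. A sketch for a 10% increase in sales, assuming the rounded lsales coefficient from the table above:

```python
BETA_LSALES = 0.257    # lsales coefficient from the lsalary regression above (rounded)
SALES_INCREASE = 0.10  # a 10% increase in firm sales

# Linear approximation: elasticity times the percentage change in sales.
approx_pct = 100 * BETA_LSALES * SALES_INCREASE
# Exact change implied by log(salary) = b0 + b1 * log(sales).
exact_pct = 100 * ((1 + SALES_INCREASE) ** BETA_LSALES - 1)
print(approx_pct, exact_pct)
```

The approximation gives 2.57% and the exact figure is a little below 2.5%.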
Example 2.12. Student math performance
df = dataWoo('meap93')
df[['math10', 'lnchprg','lsalary']].describe()
| | math10 | lnchprg | lsalary |
|---|---|---|---|
| count | 408.000000 | 408.000000 | 408.000000 |
| mean | 24.106863 | 25.201471 | 10.354385 |
| std | 10.493613 | 13.610075 | 0.154316 |
| min | 1.900000 | 1.400000 | 9.891618 |
| 25% | 16.625000 | 14.625000 | 10.246563 |
| 50% | 23.400000 | 23.849999 | 10.350286 |
| 75% | 30.050000 | 33.825000 | 10.448707 |
| max | 66.699997 | 79.500000 | 10.874494 |
math_ols = smf.ols(formula='math10 ~ 1 + lnchprg', data=df).fit()
print(summary_col([math_ols],stars=True,float_format='%0.3f',
info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
'R2':lambda x: "{:.3f}".format(x.rsquared)}))
========================
math10
------------------------
Intercept 32.143***
(0.998)
lnchprg -0.319***
(0.035)
R-squared 0.171
R-squared Adj. 0.169
N 408
R2 0.171
========================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01