Chapter 2. The Simple Regression Model

import statsmodels.api as sm
import statsmodels.formula.api as smf

from wooldridge import *

Example 2.3. CEO Salary & Return on Equity

dataWoo()
  J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93     meapsingle  minwage       mlb1        mroz
  murder     nbasal      nyse          okun        openness
  pension    phillips    pntsprd       prison      prminwge
  rdchem     rdtelec     recid         rental      return
  saving     sleep75     slp75_81      smoke       traffic1
  traffic2   twoyear     volat         vote1       vote2
  voucher    wage1       wage2         wagepan     wageprc
  wine
df = dataWoo('ceosal1')
dataWoo('ceosal1', description=True)
name of dataset: ceosal1
no of variables: 12
no of observations: 209

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| salary   | 1990 salary, thousands $      |
| pcsalary | % change salary, 89-90        |
| sales    | 1990 firm sales, millions $   |
| roe      | return on equity, 88-90 avg   |
| pcroe    | % change roe, 88-90           |
| ros      | return on firm's stock, 88-90 |
| indus    | =1 if industrial firm         |
| finance  | =1 if financial firm          |
| consprod | =1 if consumer product firm   |
| utility  | =1 if transport. or utilties  |
| lsalary  | natural log of salary         |
| lsales   | natural log of sales          |
+----------+-------------------------------+

I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
sm.stats.anova_lm(salary_ols)
             df        sum_sq       mean_sq         F    PR(>F)
roe         1.0  5.166419e+06  5.166419e+06  2.766532  0.097768
Residual  207.0  3.865666e+08  1.867471e+06       NaN       NaN
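
The two ANOVA rows above are enough to recover the F-statistic and R-squared that the regression summary reports. A minimal sketch, retyping the printed sums of squares by hand, so the results are only as precise as the displayed digits (the variable names are illustrative, not part of the original notebook):

```python
# Values copied from the anova_lm output above (not recomputed from the data).
ss_model = 5.166419e+06     # sum_sq for roe (explained sum of squares)
ss_resid = 3.865666e+08     # sum_sq for Residual
df_model, df_resid = 1, 207

# F = (explained SS / model df) / (residual SS / residual df)
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

# R-squared = explained SS / total SS
r_squared = ss_model / (ss_model + ss_resid)

print(round(f_stat, 3), round(r_squared, 4))
```

Both numbers should agree, up to rounding, with the F-statistic and R-squared shown in the OLS summary for this example.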
salary_ols.params
Intercept    963.191336
roe           18.501186
dtype: float64
print(salary_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:14   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
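
To read the fitted line, recall that salary is measured in thousands of dollars and roe in percentage points, so the slope says that one more point of roe raises predicted salary by about 18.5 thousand dollars. A minimal sketch using the coefficients printed by salary_ols.params; the helper name predict_salary and the roe value of 30 are illustrative choices, not part of the original notebook:

```python
# Coefficients copied from salary_ols.params above.
INTERCEPT = 963.191336   # predicted salary (thousands of $) when roe = 0
SLOPE = 18.501186        # change in predicted salary per one-point rise in roe


def predict_salary(roe):
    """Predicted 1990 salary in thousands of dollars for a given roe (percent)."""
    return INTERCEPT + SLOPE * roe


print(predict_salary(0))    # equals the intercept
print(predict_salary(30))   # roughly 1518, i.e. about $1.52 million
```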

Example 2.4. Wage Equation

df = dataWoo('wage1')
dataWoo('wage1', description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.        |
| profserv | =1 if in prof. serv. indus.     |
| profocc  | =1 if in profess. occupation    |
| clerocc  | =1 if in clerical occupation    |
| servocc  | =1 if in service occupation     |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
| tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.
df[['wage','educ']].describe()
             wage        educ
count  526.000000  526.000000
mean     5.896103   12.562738
std      3.693086    2.769022
min      0.530000    0.000000
25%      3.330000   12.000000
50%      4.650000   12.000000
75%      6.880000   14.000000
max     24.980000   18.000000
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
sm.stats.anova_lm(wage_ols)
print(wage_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           2.78e-22
Time:                        18:36:14   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
==============================================================================
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
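
The slope of about 0.54 means that each additional year of education is associated with roughly 54 cents more in predicted hourly wage. The negative intercept is the predicted wage at educ = 0; it is an extrapolation to the edge of the data and should not be read literally. A minimal sketch with the rounded coefficients from the summary above (predict_wage is a hypothetical helper, not part of the original notebook):

```python
# Coefficients copied from the wage_ols summary above (rounded to 4 decimals).
B0, B1 = -0.9049, 0.5414


def predict_wage(educ):
    """Predicted hourly wage in dollars for a given number of years of education."""
    return B0 + B1 * educ


wage_at_8 = predict_wage(8)                          # about 3.43 dollars per hour
gain_4_years = predict_wage(16) - predict_wage(12)   # 4 * B1, about 2.17 dollars
print(round(wage_at_8, 2), round(gain_4_years, 2))
```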

Example 2.5. Vote share

df = dataWoo('vote1')
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()
print(vote_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  voteA   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.855
Method:                 Least Squares   F-statistic:                     1018.
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           6.63e-74
Time:                        18:36:15   Log-Likelihood:                -565.20
No. Observations:                 173   AIC:                             1134.
Df Residuals:                     171   BIC:                             1141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     26.8122      0.887     30.221      0.000      25.061      28.564
shareA         0.4638      0.015     31.901      0.000       0.435       0.493
==============================================================================
Omnibus:                       20.747   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.613
Skew:                           0.525   Prob(JB):                     2.05e-10
Kurtosis:                       5.255   Cond. No.                         112.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
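
A quick sanity check on these estimates: if candidate A accounts for half of total campaign spending (shareA = 50), the fitted line predicts almost exactly half of the vote. A minimal sketch with the rounded coefficients from the summary above (predict_vote is a hypothetical helper, not part of the original notebook):

```python
# Coefficients copied from the vote_ols summary above (rounded to 4 decimals).
B0, B1 = 26.8122, 0.4638


def predict_vote(share_a):
    """Predicted vote percentage for candidate A given A's share of spending (percent)."""
    return B0 + B1 * share_a


vote_at_50 = predict_vote(50)
print(round(vote_at_50, 2))   # very close to 50
```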

Example 2.6. Table 2.2

# Table 2.2: fitted values and residuals for the first 15 CEOs.
# Reload ceosal1 first: df was last assigned to vote1 in Example 2.5.
df = dataWoo('ceosal1')
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
df['salary_hat'] = salary_ols.fittedvalues
df['uhat'] = salary_ols.resid

round(df[['roe', 'salary', 'salary_hat', 'uhat']], 3).head(15)
df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print(salary_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:15   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.7. Wage & education

df0 = dataWoo('wage1')
df = df0[['wage', 'educ']]
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
print(wage_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           2.78e-22
Time:                        18:36:15   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
==============================================================================
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Predicted wage at educ = 12.56 (roughly the sample mean of educ), using the rounded coefficients above
wage_hat = 0.5414*12.56 - 0.9049
wage_hat
5.895084000000001
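
The value above is close to the sample mean of wage (5.896 in the describe() output of Example 2.4) because an OLS line fitted with an intercept always passes through the point of sample means; the small gap comes from rounding educ and the coefficients. A minimal sketch of that property on made-up numbers, using the textbook simple-regression formulas rather than statsmodels (all data and names here are illustrative assumptions):

```python
from statistics import mean

# Made-up (educ, wage) pairs, for illustration only; not taken from wage1.
educ = [8, 10, 12, 12, 14, 16, 18]
wage = [3.1, 4.0, 4.7, 5.2, 6.9, 7.5, 10.3]

educ_bar, wage_bar = mean(educ), mean(wage)

# Simple-regression OLS: slope = S_xy / S_xx, intercept = ybar - slope * xbar.
s_xy = sum((x - educ_bar) * (y - wage_bar) for x, y in zip(educ, wage))
s_xx = sum((x - educ_bar) ** 2 for x in educ)
slope = s_xy / s_xx
intercept = wage_bar - slope * educ_bar

# The fitted value at the mean of educ reproduces the mean of wage.
fitted_at_mean = intercept + slope * educ_bar
print(fitted_at_mean, wage_bar)
```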

Example 2.8. CEO Salary - R-squared

df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print('Parameters:', salary_ols.params)
print('Std:', salary_ols.bse)
print('R2: ', salary_ols.rsquared)
Parameters: Intercept    963.191336
roe           18.501186
dtype: float64
Std: Intercept    213.240257
roe           11.123251
dtype: float64
R2:  0.01318862408103405
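
The t statistics in the full summary are simply each coefficient divided by its standard error, so they can be recovered from params and bse. A minimal sketch with the printed values retyped by hand (the dictionary names are illustrative):

```python
# Values copied from the printed output above.
params = {"Intercept": 963.191336, "roe": 18.501186}
bse = {"Intercept": 213.240257, "roe": 11.123251}

# t statistic for H0: coefficient = 0 is coefficient / standard error.
t_stats = {name: params[name] / bse[name] for name in params}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```

The results should match the t column of the summary table in Example 2.3 (4.517 and 1.663).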

Example 2.9. Voting outcome - R-squared

See Example 2.5 for details.

from statsmodels.iolib.summary2 import summary_col

df0 = dataWoo('vote1')
df = df0[['voteA', 'shareA']]
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()

print(summary_col([vote_ols],stars=True, float_format='%0.2f'))
print('R2: ', vote_ols.rsquared)
=======================
                voteA  
-----------------------
Intercept      26.81***
               (0.89)  
shareA         0.46*** 
               (0.01)  
R-squared      0.86    
R-squared Adj. 0.86    
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
R2:  0.8561408655827665

Example 2.3 in Section 2.4: Units of measurement & functional form

df = dataWoo('ceosal1')
df['salary1000'] = df.salary * 1000  # salary in dollars instead of thousands of dollars
salary_ols1000 = smf.ols(formula='salary1000 ~ 1 + roe', data=df).fit()

print(salary_ols1000.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             salary1000   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:15   Log-Likelihood:                -3248.3
No. Observations:                 209   AIC:                             6501.
Df Residuals:                     207   BIC:                             6507.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   9.632e+05   2.13e+05      4.517      0.000    5.43e+05    1.38e+06
roe          1.85e+04   1.11e+04      1.663      0.098   -3428.196    4.04e+04
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()

print(summary_col([salary_ols1000,salary_ols],stars=True,float_format='%0.2f',
                  info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.2f}".format(x.rsquared)}))
=====================================
                salary1000    salary 
-------------------------------------
Intercept      963191.34*** 963.19***
               (213240.26)  (213.24) 
roe            18501.19*    18.50*   
               (11123.25)   (11.12)  
R-squared      0.01         0.01     
R-squared Adj. 0.01         0.01     
N              209          209      
R2             0.01         0.01     
=====================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
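
The table shows the general rule: multiplying the dependent variable by a constant multiplies the intercept, the slope and their standard errors by the same constant, while R-squared (and the t statistics) stay unchanged. A minimal sketch of that rule on made-up numbers, with a small hand-written OLS helper (simple_ols and the data are illustrative assumptions, not part of the original notebook):

```python
def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for a regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1 - ssr / sst


# Made-up data for illustration only; not taken from ceosal1.
roe = [5.0, 10.0, 15.0, 20.0, 25.0, 30.0]
salary = [900.0, 1250.0, 1100.0, 1400.0, 1300.0, 1700.0]   # thousands of dollars
salary_dollars = [1000 * s for s in salary]                # same salaries in dollars

b0, b1, r2 = simple_ols(roe, salary)
b0_d, b1_d, r2_d = simple_ols(roe, salary_dollars)

print(b0_d / b0, b1_d / b1)   # both ratios are 1000 up to floating-point error
print(r2, r2_d)               # R-squared is the same in both regressions
```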

Example 2.10. A log wage equation (log-lin model, semi-elasticity)

df0 = dataWoo('wage1')
df = df0[['wage', 'lwage', 'educ']]
lwage_ols = smf.ols(formula='lwage ~ 1 + educ', data=df).fit()
print(summary_col([lwage_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))
=======================
                lwage  
-----------------------
Intercept      0.584***
               (0.097) 
educ           0.083***
               (0.008) 
R-squared      0.186   
R-squared Adj. 0.184   
N              526     
R2             0.186   
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
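
In a log-lin model, 100 times the slope approximates the percentage change in wage per extra year of education; the exact change implied by the model is 100 * (exp(b) - 1). A minimal sketch using the rounded educ coefficient from the table above (variable names are illustrative):

```python
import math

# Coefficient on educ copied from the table above (rounded to 3 decimals).
B_EDUC = 0.083

approx_pct = 100 * B_EDUC                   # usual approximation: about 8.3 percent
exact_pct = 100 * (math.exp(B_EDUC) - 1)    # exact implied change: slightly larger

print(round(approx_pct, 1), round(exact_pct, 2))
```

The approximation works well here because the coefficient is small; the gap between the two figures grows as the coefficient gets larger.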

Example 2.11. CEO Salary & Firm Sales (log-log model, elasticity)

df = dataWoo('ceosal1')
lsalary_ols = smf.ols(formula='lsalary ~ 1 + lsales', data=df).fit()
print(summary_col([lsalary_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))
=======================
               lsalary 
-----------------------
Intercept      4.822***
               (0.288) 
lsales         0.257***
               (0.035) 
R-squared      0.211   
R-squared Adj. 0.207   
N              209     
R2             0.211   
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
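
In a log-log model the slope is an elasticity: a 1 percent increase in sales is associated with about a 0.257 percent increase in salary. For larger changes the linear approximation drifts from the exact figure implied by the model. A minimal sketch using the rounded lsales coefficient from the table above (pct_change_salary is a hypothetical helper):

```python
# Elasticity of salary with respect to sales, copied from the table above.
ELASTICITY = 0.257


def pct_change_salary(pct_change_sales):
    """Exact percentage change in predicted salary implied by the log-log model."""
    return 100 * ((1 + pct_change_sales / 100) ** ELASTICITY - 1)


approx_10 = ELASTICITY * 10          # linear approximation for a 10% rise in sales
exact_10 = pct_change_salary(10)     # exact implied change, slightly smaller

print(round(approx_10, 2), round(exact_10, 2))
```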

Example 2.12. Student math performance

df = dataWoo('meap93')
df[['math10', 'lnchprg','lsalary']].describe()
           math10     lnchprg     lsalary
count  408.000000  408.000000  408.000000
mean    24.106863   25.201471   10.354385
std     10.493613   13.610075    0.154316
min      1.900000    1.400000    9.891618
25%     16.625000   14.625000   10.246563
50%     23.400000   23.849999   10.350286
75%     30.050000   33.825000   10.448707
max     66.699997   79.500000   10.874494
math_ols = smf.ols(formula='math10 ~ 1 + lnchprg', data=df).fit()
print(summary_col([math_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))
========================
                 math10 
------------------------
Intercept      32.143***
               (0.998)  
lnchprg        -0.319***
               (0.035)  
R-squared      0.171    
R-squared Adj. 0.169    
N              408      
R2             0.171    
========================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
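
Taken at face value, the slope says that a 10 percentage point increase in lunch program participation goes with a drop of about 3.2 points in the math pass rate. As a consistency check, the fitted value at the sample mean of lnchprg should land close to the sample mean of math10 from the describe() output. A minimal sketch with the printed values retyped by hand (predict_math10 is a hypothetical helper):

```python
# Coefficients copied from the regression table above (rounded to 3 decimals).
B0, B1 = 32.143, -0.319

# Sample means copied from the describe() output above.
MEAN_LNCHPRG = 25.201471
MEAN_MATH10 = 24.106863


def predict_math10(lnchprg):
    """Predicted math pass rate (percent) for a given lunch program share (percent)."""
    return B0 + B1 * lnchprg


effect_10_points = B1 * 10                       # about -3.19 points
fitted_at_mean = predict_math10(MEAN_LNCHPRG)    # close to MEAN_MATH10

print(round(effect_10_points, 2), round(fitted_at_mean, 3), MEAN_MATH10)
```

The small difference between the fitted value and the sample mean comes from using coefficients rounded to three decimals.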