Chapter 2. The Simple Regression Model


import statsmodels.api as sm
import statsmodels.formula.api as smf

from wooldridge import dataWoo  # dataset loader used throughout this chapter

Example 2.3. CEO Salary & Return on Equity

dataWoo()

  J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93     meapsingle  minwage       mlb1        mroz
  murder     nbasal      nyse          okun        openness
  pension    phillips    pntsprd       prison      prminwge
  rdchem     rdtelec     recid         rental      return
  saving     sleep75     slp75_81      smoke       traffic1
  traffic2   twoyear     volat         vote1       vote2
  voucher    wage1       wage2         wagepan     wageprc
  wine

df = dataWoo('ceosal1')
dataWoo('ceosal1', description=True)

name of dataset: ceosal1
no of variables: 12
no of observations: 209

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| salary   | 1990 salary, thousands $      |
| pcsalary | % change salary, 89-90        |
| sales    | 1990 firm sales, millions $   |
| roe      | return on equity, 88-90 avg   |
| pcroe    | % change roe, 88-90           |
| ros      | return on firm's stock, 88-90 |
| indus    | =1 if industrial firm         |
| finance  | =1 if financial firm          |
| consprod | =1 if consumer product firm   |
| utility  | =1 if transport. or utilties  |
| lsalary  | natural log of salary         |
| lsales   | natural log of sales          |
+----------+-------------------------------+

I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.

salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
sm.stats.anova_lm(salary_ols)

             df        sum_sq       mean_sq         F    PR(>F)
roe         1.0  5.166419e+06  5.166419e+06  2.766532  0.097768
Residual  207.0  3.865666e+08  1.867471e+06       NaN       NaN
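The ANOVA table above is enough to recover the F statistic and R-squared that `summary()` reports. A minimal sketch, assuming the sums of squares exactly as printed (they are rounded, so the results agree only up to rounding); the variable names are illustrative:

```python
# Sums of squares and degrees of freedom copied from the anova_lm output above.
ss_model, df_model = 5.166419e+06, 1.0      # row "roe"
ss_resid, df_resid = 3.865666e+08, 207.0    # row "Residual"

# F = (explained sum of squares / df) / (residual sum of squares / df)
f_stat = (ss_model / df_model) / (ss_resid / df_resid)

# R-squared = explained sum of squares / total sum of squares
r_squared = ss_model / (ss_model + ss_resid)

print(f"F = {f_stat:.3f}")
print(f"R-squared = {r_squared:.4f}")
```

Both numbers should line up with the `F-statistic` and `R-squared` entries of the regression summary for the same model.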

salary_ols.params

Intercept    963.191336
roe           18.501186
dtype: float64

print(salary_ols.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:14   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
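The estimated line is salary_hat = 963.191 + 18.501*roe, with salary in thousands of dollars and roe in percentage points. As a rough illustration of how to read it, the sketch below plugs a few roe values into the rounded coefficients; `predict_salary` is a hypothetical helper written for this note, not part of statsmodels.

```python
# Coefficients copied (rounded) from the summary above; salary is measured
# in thousands of dollars and roe in percentage points.
b0, b1 = 963.191, 18.501

def predict_salary(roe):
    """Predicted 1990 salary (thousands of dollars) at a given roe."""
    return b0 + b1 * roe

for roe in (0, 10, 30):
    print(f"roe = {roe:>2}: salary_hat = {predict_salary(roe):.3f}")

# A one-point increase in roe changes predicted salary by b1 thousand dollars.
print(f"change per point of roe: about ${b1 * 1000:,.0f}")
```

With the fitted model object, `salary_ols.predict(...)` would give the same values without rounding the coefficients.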

Example 2.4. Wage Equation

df = dataWoo('wage1')
dataWoo('wage1', description=True)

name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.        |
| profserv | =1 if in prof. serv. indus.     |
| profocc  | =1 if in profess. occupation    |
| clerocc  | =1 if in clerical occupation    |
| servocc  | =1 if in service occupation     |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
| tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.

df[['wage','educ']].describe()

             wage        educ
count  526.000000  526.000000
mean     5.896103   12.562738
std      3.693086    2.769022
min      0.530000    0.000000
25%      3.330000   12.000000
50%      4.650000   12.000000
75%      6.880000   14.000000
max     24.980000   18.000000

wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
sm.stats.anova_lm(wage_ols)
print(wage_ols.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           2.78e-22
Time:                        18:36:14   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
==============================================================================
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.6. Table 2.2

# Table 2.2: fitted values and residuals for the first 15 CEOs.
# Reload ceosal1 so that df holds the data salary_ols is estimated on.
df = dataWoo('ceosal1')
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
df['salary_hat'] = salary_ols.fittedvalues  # fitted values
df['uhat'] = salary_ols.resid               # residuals

round(df[['roe', 'salary', 'salary_hat', 'uhat']], 3).head(15)
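Table 2.2 is about fitted values and residuals. As a sketch of how they are constructed, and of two algebraic properties of OLS residuals (they sum to zero and have zero sample covariance with the regressor), the block below uses a small made-up data set, not ceosal1. `ols_fit` is a hypothetical helper written for this illustration.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up illustrative data (NOT taken from ceosal1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = ols_fit(x, y)
fitted = [b0 + b1 * xi for xi in x]             # y_hat_i = b0 + b1 * x_i
resid = [yi - fi for yi, fi in zip(y, fitted)]  # u_hat_i = y_i - y_hat_i

for xi, yi, fi, ui in zip(x, y, fitted, resid):
    print(f"x={xi:.1f}  y={yi:.1f}  y_hat={fi:.3f}  u_hat={ui:+.3f}")

# Both sums are zero up to floating-point error.
print("sum of residuals:    ", round(sum(resid), 10))
print("sum of x * residuals:", round(sum(xi * ui for xi, ui in zip(x, resid)), 10))
```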

df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print(salary_ols.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:15   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.7. Wage & education

df0 = dataWoo('wage1')
df = df0[['wage', 'educ']]
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
print(wage_ols.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           2.78e-22
Time:                        18:36:15   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
==============================================================================
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

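# Predicted wage at educ = 12.56, the sample mean of educ. The OLS line passes
# through the sample means, so wage_hat is close to the mean wage (about 5.90).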
wage_hat = 0.5414*12.56 - 0.9049
wage_hat

5.895084000000001

Example 2.8. CEO Salary - R-squared

df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print('Parameters:', salary_ols.params)
print('Std:', salary_ols.bse)
print('R2: ', salary_ols.rsquared)

Parameters: Intercept    963.191336
roe           18.501186
dtype: float64
Std: Intercept    213.240257
roe           11.123251
dtype: float64
R2:  0.01318862408103405
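For reference, R-squared is defined as 1 - SSR/SST, and in a regression with a single regressor it equals the squared sample correlation between x and y. A minimal sketch on made-up data (not ceosal1), with illustrative variable names:

```python
import math

# Made-up illustrative data (NOT taken from ceosal1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # total sum of squares (SST)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# OLS estimates for y = b0 + b1*x
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual sum of squares (SSR) and R-squared
ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ssr / syy

# Sample correlation between x and y
corr = sxy / math.sqrt(sxx * syy)

print(f"R-squared      = {r_squared:.4f}")
print(f"correlation**2 = {corr ** 2:.4f}")
```

A low R-squared such as the 0.013 above says that roe explains little of the sample variation in salary; it does not by itself say the slope estimate is unreliable.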

Example 2.9. Voting outcome - R-squared

See Example 2.5 for details.

from statsmodels.iolib.summary2 import summary_col

df0 = dataWoo('vote1')
df = df0[['voteA', 'shareA']]
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()

print(summary_col([vote_ols],stars=True, float_format='%0.2f'))
print('R2: ', vote_ols.rsquared)

=======================
                voteA  
-----------------------
Intercept      26.81***
               (0.89)  
shareA         0.46*** 
               (0.01)  
R-squared      0.86    
R-squared Adj. 0.86    
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
R2:  0.8561408655827665

Example 2.3 in Section 2.4: Units of measurement & functional form

df = dataWoo('ceosal1')
df['salary1000'] = df['salary'] * 1000  # salary in dollars instead of thousands of dollars
salary_ols1000 = smf.ols(formula='salary1000 ~ 1 + roe', data=df).fit()

print(salary_ols1000.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:             salary1000   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:15   Log-Likelihood:                -3248.3
No. Observations:                 209   AIC:                             6501.
Df Residuals:                     207   BIC:                             6507.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   9.632e+05   2.13e+05      4.517      0.000    5.43e+05    1.38e+06
roe          1.85e+04   1.11e+04      1.663      0.098   -3428.196    4.04e+04
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()

print(summary_col([salary_ols1000,salary_ols],stars=True,float_format='%0.2f',
                  info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.2f}".format(x.rsquared)}))

=====================================
                salary1000    salary 
-------------------------------------
Intercept      963191.34*** 963.19***
               (213240.26)  (213.24) 
roe            18501.19*    18.50*   
               (11123.25)   (11.12)  
R-squared      0.01         0.01     
R-squared Adj. 0.01         0.01     
N              209          209      
R2             0.01         0.01     
=====================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
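The two columns above differ by a factor of 1,000 in the coefficients and standard errors, while R-squared is identical. The sketch below reproduces that pattern on a small made-up data set (not ceosal1): multiplying the dependent variable by a constant multiplies the intercept and slope by the same constant and leaves R-squared alone. `ols` is a hypothetical helper written for this illustration.

```python
import math

def ols(x, y):
    """Return (intercept, slope, r_squared) for the line y = b0 + b1*x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, 1 - ssr / sst

# Made-up illustrative data (NOT taken from ceosal1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_scaled = [1000 * yi for yi in y]   # same variable in units 1,000 times smaller

b0, b1, r2 = ols(x, y)
b0_s, b1_s, r2_s = ols(x, y_scaled)

print("slope ratio:    ", b1_s / b1)
print("intercept ratio:", b0_s / b0)
print("R-squared equal:", math.isclose(r2, r2_s))
```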

Example 2.10. A log wage equation (log-lin model, semi-elasticity)

df0 = dataWoo('wage1')
df = df0[['wage', 'lwage', 'educ']]
lwage_ols = smf.ols(formula='lwage ~ 1 + educ', data=df).fit()
print(summary_col([lwage_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))

=======================
                lwage  
-----------------------
Intercept      0.584***
               (0.097) 
educ           0.083***
               (0.008) 
R-squared      0.186   
R-squared Adj. 0.184   
N              526     
R2             0.186   
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
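With lwage as the dependent variable, 100 times the coefficient on educ approximates the percentage change in wage for one more year of education. The sketch below compares that approximation with the exact change implied by the model, 100*(exp(b1) - 1), using the rounded coefficient from the table above; the variable names are illustrative.

```python
import math

b1 = 0.083  # coefficient on educ, copied (rounded) from the table above

# Approximate percentage change in wage per extra year of education
approx_pct = 100 * b1

# Exact percentage change implied by log(wage) = b0 + b1*educ
exact_pct = 100 * (math.exp(b1) - 1)

print(f"approximate: {approx_pct:.2f}% per year of education")
print(f"exact:       {exact_pct:.2f}% per year of education")
```

The gap between the two grows with the size of the coefficient, so the approximation is best for small changes in log(wage).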

Example 2.11. CEO Salary & Firm Sales (log-log model, elasticity)

df = dataWoo('ceosal1')
lsalary_ols = smf.ols(formula='lsalary ~ 1 + lsales', data=df).fit()
print(summary_col([lsalary_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))

=======================
               lsalary 
-----------------------
Intercept      4.822***
               (0.288) 
lsales         0.257***
               (0.035) 
R-squared      0.211   
R-squared Adj. 0.207   
N              209     
R2             0.211   
=======================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
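In a log-log model the slope is an elasticity: here a 1% increase in sales is associated with roughly a 0.257% increase in salary. As a sketch, the block below works out the implied change for a 10% increase in sales, both with the linear approximation and exactly, using the rounded coefficient from the table above; the variable names are illustrative.

```python
elasticity = 0.257      # coefficient on lsales, copied (rounded) from the table above
sales_increase = 0.10   # a 10% increase in sales, chosen for illustration

# Linear approximation: %change in salary = elasticity * %change in sales
approx_pct = 100 * elasticity * sales_increase

# Exact change implied by log(salary) = b0 + elasticity * log(sales)
exact_pct = 100 * ((1 + sales_increase) ** elasticity - 1)

print(f"approximate: {approx_pct:.2f}% higher salary")
print(f"exact:       {exact_pct:.2f}% higher salary")
```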

Example 2.12. Student math performance

df = dataWoo('meap93')
df[['math10', 'lnchprg','lsalary']].describe()

           math10     lnchprg     lsalary
count  408.000000  408.000000  408.000000
mean    24.106863   25.201471   10.354385
std     10.493613   13.610075    0.154316
min      1.900000    1.400000    9.891618
25%     16.625000   14.625000   10.246563
50%     23.400000   23.849999   10.350286
75%     30.050000   33.825000   10.448707
max     66.699997   79.500000   10.874494

math_ols = smf.ols(formula='math10 ~ 1 + lnchprg', data=df).fit()
print(summary_col([math_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))

========================
                 math10 
------------------------
Intercept      32.143***
               (0.998)  
lnchprg        -0.319***
               (0.035)  
R-squared      0.171    
R-squared Adj. 0.169    
N              408      
R2             0.171    
========================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
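The fitted line is math10_hat = 32.143 - 0.319*lnchprg. A quick sketch of two readings of it, using the rounded coefficients from the table and the sample means from the `describe()` output above: the implied change for a 10 percentage point increase in lnchprg, and the prediction at the mean of lnchprg, which should be close to the mean of math10 because the OLS line passes through the sample means. Variable names are illustrative.

```python
# Rounded coefficients from the regression table above
b0, b1 = 32.143, -0.319

# Sample means copied from the describe() output above
mean_lnchprg = 25.201471
mean_math10 = 24.106863

# Predicted change in math10 for a 10 percentage point increase in lnchprg
change_10pp = b1 * 10

# Prediction at the sample mean of lnchprg
pred_at_mean = b0 + b1 * mean_lnchprg

print(f"change for +10 points of lnchprg: {change_10pp:.2f}")
print(f"prediction at mean lnchprg:       {pred_at_mean:.3f}")
print(f"sample mean of math10:            {mean_math10:.3f}")
```

The small difference between the last two numbers comes from rounding the coefficients to three decimals.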