Python for Introductory Econometrics

Chapter 2. The Simple Regression Model

Example 2.3. CEO Salary & Return on Equity

https://www.solomonegash.com/

In [1]:
import numpy as np
import pandas as pd

import statsmodels.api as sm
import statsmodels.formula.api as smf

from wooldridge import dataWoo
In [2]:
dataWoo()
  J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93     meapsingle  minwage       mlb1        mroz
  murder     nbasal      nyse          okun        openness
  pension    phillips    pntsprd       prison      prminwge
  rdchem     rdtelec     recid         rental      return
  saving     sleep75     slp75_81      smoke       traffic1
  traffic2   twoyear     volat         vote1       vote2
  voucher    wage1       wage2         wagepan     wageprc
  wine
In [3]:
df = dataWoo('ceosal1')
dataWoo('ceosal1', description=True)
name of dataset: ceosal1
no of variables: 12
no of observations: 209

+----------+-------------------------------+
| variable | label                         |
+----------+-------------------------------+
| salary   | 1990 salary, thousands $      |
| pcsalary | % change salary, 89-90        |
| sales    | 1990 firm sales, millions $   |
| roe      | return on equity, 88-90 avg   |
| pcroe    | % change roe, 88-90           |
| ros      | return on firm's stock, 88-90 |
| indus    | =1 if industrial firm         |
| finance  | =1 if financial firm          |
| consprod | =1 if consumer product firm   |
| utility  | =1 if transport. or utilties  |
| lsalary  | natural log of salary         |
| lsales   | natural log of sales          |
+----------+-------------------------------+

I took a random sample of data reported in the May 6, 1991 issue of
Businessweek.
In [4]:
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
sm.stats.anova_lm(salary_ols)
Out[4]:
             df        sum_sq       mean_sq         F    PR(>F)
roe         1.0  5.166419e+06  5.166419e+06  2.766532  0.097768
Residual  207.0  3.865666e+08  1.867471e+06       NaN       NaN
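
The F statistic reported in the regression summary below is simply the ratio of the two mean squares in this ANOVA table. A quick check, reusing the fitted model:

anova = sm.stats.anova_lm(salary_ols)
F = anova.loc['roe', 'mean_sq'] / anova.loc['Residual', 'mean_sq']
print(F)  # about 2.7665, the same as salary_ols.fvalue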
In [5]:
salary_ols.params
Out[5]:
Intercept    963.191336
roe           18.501186
dtype: float64
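
These estimates can be reproduced directly from the textbook formulas for the simple regression slope and intercept; a minimal sketch using the same data frame:

x, y = df['roe'], df['salary']
# beta1_hat = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);  beta0_hat = ybar - beta1_hat * xbar
beta1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # about 963.19 and 18.50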
In [6]:
print(salary_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Wed, 08 Apr 2020   Prob (F-statistic):             0.0978
Time:                        21:22:07   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.4. Wage Equation

In [7]:
df = dataWoo('wage1')
dataWoo('wage1', description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.        |
| profserv | =1 if in prof. serv. indus.     |
| profocc  | =1 if in profess. occupation    |
| clerocc  | =1 if in clerical occupation    |
| servocc  | =1 if in service occupation     |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
| tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.
In [8]:
df[['wage','educ']].describe()
Out[8]:
             wage        educ
count  526.000000  526.000000
mean     5.896103   12.562738
std      3.693086    2.769022
min      0.530000    0.000000
25%      3.330000   12.000000
50%      4.650000   12.000000
75%      6.880000   14.000000
max     24.980000   18.000000
In [9]:
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
sm.stats.anova_lm(wage_ols)
print(wage_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Wed, 08 Apr 2020   Prob (F-statistic):           2.78e-22
Time:                        21:22:07   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
==============================================================================
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.5. Vote share

In [10]:
df = dataWoo('vote1')
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()
print(vote_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  voteA   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.855
Method:                 Least Squares   F-statistic:                     1018.
Date:                Wed, 08 Apr 2020   Prob (F-statistic):           6.63e-74
Time:                        21:22:07   Log-Likelihood:                -565.20
No. Observations:                 173   AIC:                             1134.
Df Residuals:                     171   BIC:                             1141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     26.8122      0.887     30.221      0.000      25.061      28.564
shareA         0.4638      0.015     31.901      0.000       0.435       0.493
==============================================================================
Omnibus:                       20.747   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.613
Skew:                           0.525   Prob(JB):                     2.05e-10
Kurtosis:                       5.255   Cond. No.                         112.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
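
With these estimates, a candidate receiving half of total campaign spending (shareA = 50) is predicted to receive about half of the vote. A quick check with the fitted model:

vote_ols.predict(pd.DataFrame({'shareA': [50]}))  # about 50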

Example 2.6. Table 2.2: Fitted values and residuals

In [23]:
df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']].copy()  # copy to avoid a SettingWithCopyWarning when adding columns below
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print(salary_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Wed, 08 Apr 2020   Prob (F-statistic):             0.0978
Time:                        21:22:34   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [25]:
df['salary_hat'] = salary_ols.fittedvalues
df['uhat'] = salary_ols.resid

df = df.round(3)
df.head(16)
Out[25]:
     roe  salary  salary_hat     uhat
0   14.1    1095    1224.058 -129.058
1   10.9    1001    1164.854 -163.854
2   23.5    1122    1397.969 -275.969
3    5.9     578    1072.348 -494.348
4   13.8    1368    1218.508  149.492
5   20.0    1145    1333.215 -188.215
6   16.4    1078    1266.611 -188.611
7   16.3    1094    1264.761 -170.761
8   10.5    1237    1157.454   79.546
9   26.3     833    1449.773 -616.773
10  25.9     567    1442.372 -875.372
11  26.8     933    1459.023 -526.023
12  14.8    1339    1237.009  101.991
13  22.3     937    1375.768 -438.768
14  56.3    2011    2004.808    6.192
15  12.6    1585    1196.306  388.694
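
The residuals illustrate the OLS algebraic properties from Chapter 2: they sum to zero, they are uncorrelated with roe, and the fitted values average to the sample mean of salary. A quick check on the full sample:

print(salary_ols.resid.sum())                                 # about 0
print((df0['roe'] * salary_ols.resid).sum())                  # about 0
print(salary_ols.fittedvalues.mean(), df0['salary'].mean())   # equal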

Example 2.7. Wage & education

In [13]:
df0 = dataWoo('wage1')
df = df0[['wage', 'educ']]
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
print(wage_ols.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Wed, 08 Apr 2020   Prob (F-statistic):           2.78e-22
Time:                        21:22:07   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
==============================================================================
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [14]:
# predicted wage at educ = 12.56 (roughly the sample mean), using the rounded coefficients
wage_hat = 0.5414*12.56 - 0.9049
wage_hat
Out[14]:
5.895084000000001
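
The same prediction can be obtained with full-precision coefficients through the fitted model's predict method; at the sample mean of educ the predicted wage equals the sample mean of wage:

wage_ols.predict(pd.DataFrame({'educ': [12.56]}))  # about 5.90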

Example 2.8. CEO Salary - R-squared

In [15]:
df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print('Parameters:', salary_ols.params)
print('Std:', salary_ols.bse)
print('R2: ', salary_ols.rsquared)
Parameters: Intercept    963.191336
roe           18.501186
dtype: float64
Std: Intercept    213.240257
roe           11.123251
dtype: float64
R2:  0.01318862408103405
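
The R-squared can also be computed by hand as 1 - SSR/SST, which in a simple regression equals the squared sample correlation between the two variables; a minimal check:

ssr = (salary_ols.resid ** 2).sum()
sst = ((df['salary'] - df['salary'].mean()) ** 2).sum()
print(1 - ssr / sst)                      # about 0.0132
print(df['salary'].corr(df['roe']) ** 2)  # same value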

Example 2.9. Voting outcome - R-squared

See example 2.5 for details.

In [16]:
from statsmodels.iolib.summary2 import summary_col

df0 = dataWoo('vote1')
df = df0[['voteA', 'shareA']]
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()

print(summary_col([vote_ols],stars=True, float_format='%0.2f'))
print('R2: ', vote_ols.rsquared)
==================
           voteA  
------------------
Intercept 26.81***
          (0.89)  
shareA    0.46*** 
          (0.01)  
==================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
R2:  0.8561408655827665

Example 2.3 revisited (Section 2.4): Units of measurement & functional form

In [17]:
df = dataWoo('ceosal1')
df['salary1000'] = df.salary * 1000  # salary in dollars rather than thousands of dollars
salary_ols1000 = smf.ols(formula='salary1000 ~ 1 + roe', data=df).fit()

print(salary_ols1000.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             salary1000   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Wed, 08 Apr 2020   Prob (F-statistic):             0.0978
Time:                        21:22:07   Log-Likelihood:                -3248.3
No. Observations:                 209   AIC:                             6501.
Df Residuals:                     207   BIC:                             6507.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   9.632e+05   2.13e+05      4.517      0.000    5.43e+05    1.38e+06
roe          1.85e+04   1.11e+04      1.663      0.098   -3428.196    4.04e+04
==============================================================================
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [18]:
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()

print(summary_col([salary_ols1000,salary_ols],stars=True,float_format='%0.2f',
                  info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.2f}".format(x.rsquared)}))
================================
           salary1000    salary 
--------------------------------
Intercept 963191.34*** 963.19***
          (213240.26)  (213.24) 
roe       18501.19*    18.50*   
          (11123.25)   (11.12)  
N         209          209      
R2        0.01         0.01     
================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
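
Rescaling the dependent variable by 1000 multiplies the coefficients and standard errors by 1000 but leaves the t statistics and R-squared unchanged, as the comparison above shows. A quick numerical check:

print(salary_ols1000.params / salary_ols.params)     # both ratios equal 1000
print(salary_ols1000.tvalues - salary_ols.tvalues)   # about 0
print(salary_ols1000.rsquared, salary_ols.rsquared)  # identical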

Example 2.10. A log wage equation (log-lin model, semi-elasticity)

In [19]:
df0 = dataWoo('wage1')
df = df0[['wage', 'lwage', 'educ']]
lwage_ols = smf.ols(formula='lwage ~ 1 + educ', data=df).fit()
print(summary_col([lwage_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))
==================
           lwage  
------------------
Intercept 0.584***
          (0.097) 
educ      0.083***
          (0.008) 
N         526     
R2        0.186   
==================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
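
In this log-lin model the slope is a semi-elasticity: another year of education is associated with roughly a 100*0.083 = 8.3% higher wage, and the exact implied percentage change is 100*(exp(b1) - 1). A small check with the fitted coefficient:

b1 = lwage_ols.params['educ']
print(100 * b1, 100 * (np.exp(b1) - 1))  # about 8.3 and 8.6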

Example 2.11. CEO Salary & Firm Sales (log-log model, elasticity)

In [20]:
df = dataWoo('ceosal1')
lsalary_ols = smf.ols(formula='lsalary ~ 1 + lsales', data=df).fit()
print(summary_col([lsalary_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))
==================
          lsalary 
------------------
Intercept 4.822***
          (0.288) 
lsales    0.257***
          (0.035) 
N         209     
R2        0.211   
==================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
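
In this log-log model the slope is an elasticity: a 1% increase in firm sales is associated with about a 0.257% increase in predicted CEO salary. For example, a 10% rise in sales (an illustrative value, not from the text) implies roughly a 2.6% higher salary:

b1 = lsalary_ols.params['lsales']
print(b1)               # about 0.257
print(100 * b1 * 0.10)  # about 2.6 (% change in salary for a 10% rise in sales)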

Example 2.12. Student math performance

In [21]:
df = dataWoo('meap93')
df.describe()
Out[21]:
           lnchprg        enroll       staff       expend        salary      benefits
count   408.000000    408.000000  408.000000   408.000000    408.000000    408.000000
mean     25.201471   2663.806373  100.641667  4376.578431  31774.507353   6463.428922
std      13.610075   2696.820560   13.299518   775.789717   5038.303826   1456.337659
min       1.400000    212.000000   65.900002  3332.000000  19764.000000      0.000000
25%      14.625000   1037.500000   91.450001  3821.250000  28185.500000   5536.500000
50%      23.849999   1840.500000   99.000000  4145.000000  31266.000000   6304.500000
75%      33.825000   3084.750000  108.025000  4658.750000  34499.750000   7228.000000
max      79.500000  16793.000000  166.600006  7419.000000  52812.000000  11618.000000

          droprate    gradrate      math10       sci11       totcomp    ltotcomp
count   408.000000  408.000000  408.000000  408.000000    408.000000  408.000000
mean      5.066422   83.651716   24.106863   49.183088  38237.936275   10.539960
std       5.485072   13.368375   10.493613   12.524668   5985.086038    0.151267
min       0.000000   23.500000    1.900000    7.200000  24498.000000   10.106347
25%       1.900000   77.000000   16.625000   41.299999  34032.000000   10.435057
50%       3.700000   86.300003   23.400000   49.100000  37443.500000   10.530588
75%       6.500000   93.224998   30.050000   57.149999  41637.000000   10.636744
max      61.900002  127.099998   66.699997   85.699997  63518.000000   11.059078

           lexpend     lenroll      lstaff      bensal     lsalary
count   408.000000  408.000000  408.000000  408.000000  408.000000
mean      8.370177    7.509714    4.603369    0.204503   10.354385
std       0.161882    0.867304    0.126683    0.037533    0.154316
min       8.111328    5.356586    4.188138    0.000000    9.891618
25%       8.248333    6.944566    4.515792    0.187977   10.246563
50%       8.329658    7.517791    4.595120    0.202401   10.350286
75%       8.446502    8.034225    4.682363    0.220256   10.448707
max       8.911799    9.728718    5.115596    0.449985   10.874494
In [22]:
math_ols = smf.ols(formula='math10 ~ 1 + lnchprg', data=df).fit()
print(summary_col([math_ols],stars=True,float_format='%0.3f', 
                 info_dict={'N':lambda x: "{0:d}".format(int(x.nobs)),
                             'R2':lambda x: "{:.3f}".format(x.rsquared)}))
===================
            math10 
-------------------
Intercept 32.143***
          (0.998)  
lnchprg   -0.319***
          (0.035)  
N         408      
R2        0.171    
===================
Standard errors in
parentheses.
* p<.1, ** p<.05,
***p<.01
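
Each additional percentage point of students eligible for the lunch program is associated with a 0.319-point lower math10 pass rate. As an illustration (the values 10 and 50 percent are arbitrary choices, not from the text), the predicted pass rates at low and high poverty levels are:

math_ols.predict(pd.DataFrame({'lnchprg': [10, 50]}))  # about 28.9 and 16.2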