Chapter 2. The Simple Regression Model

import statsmodels.api as sm
import statsmodels.formula.api as smf

from wooldridge import *

Example 2.3. CEO Salary & Return on Equity

  J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

df = dataWoo('ceosal1')
dataWoo('ceosal1', description=True)
name of dataset: ceosal1
no of variables: 12
no of observations: 209

| variable | label                         |
| salary   | 1990 salary, thousands $      |
| pcsalary | % change salary, 89-90        |
| sales    | 1990 firm sales, millions $   |
| roe      | return on equity, 88-90 avg   |
| pcroe    | % change roe, 88-90           |
| ros      | return on firm's stock, 88-90 |
| indus    | =1 if industrial firm         |
| finance  | =1 if financial firm          |
| consprod | =1 if consumer product firm   |
| utility  | =1 if transport. or utilties  |
| lsalary  | natural log of salary         |
| lsales   | natural log of sales          |

I took a random sample of data reported in the May 6, 1991 issue of
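salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
sm.stats.anova_lm(salary_ols)  # ANOVA table for the fitted model
             df        sum_sq       mean_sq         F    PR(>F)
roe         1.0  5.166419e+06  5.166419e+06  2.766532  0.097768
Residual  207.0  3.865666e+08  1.867471e+06       NaN       NaN
salary_ols.params  # estimated intercept and slope
Intercept    963.191336
roe           18.501186
dtype: float64
print(salary_ols.summary())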
                            OLS Regression Results                            
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:14   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
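
The coefficients in the summary come from the closed-form simple-regression estimators: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. A minimal sketch on made-up numbers follows; `ols_simple` and the five data points are hypothetical and are not taken from `ceosal1`.

```python
from statistics import mean

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations over sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point of sample means.
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only.
roe = [10.0, 12.0, 15.0, 20.0, 25.0]
salary = [1100.0, 1250.0, 1200.0, 1400.0, 1380.0]

b0, b1 = ols_simple(roe, salary)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
```

By construction, the residuals from this line sum to zero and are uncorrelated with `roe` in the sample, which is what the first-order conditions of least squares require.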

Example 2.4. Wage Equation

df = dataWoo('wage1')
dataWoo('wage1', description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526

| variable | label                           |
| wage     | average hourly earnings         |
| educ     | years of education              |
| exper    | years potential experience      |
| tenure   | years with current employer     |
| nonwhite | =1 if nonwhite                  |
| female   | =1 if female                    |
| married  | =1 if married                   |
| numdep   | number of dependents            |
| smsa     | =1 if live in SMSA              |
| northcen | =1 if live in north central U.S |
| south    | =1 if live in southern region   |
| west     | =1 if live in western region    |
| construc | =1 if work in construc. indus.  |
| ndurman  | =1 if in nondur. manuf. indus.  |
| trcommpu | =1 if in trans, commun, pub ut  |
| trade    | =1 if in wholesale or retail    |
| services | =1 if in services indus.        |
| profserv | =1 if in prof. serv. indus.     |
| profocc  | =1 if in profess. occupation    |
| clerocc  | =1 if in clerical occupation    |
| servocc  | =1 if in service occupation     |
| lwage    | log(wage)                       |
| expersq  | exper^2                         |
| tenursq  | tenure^2                        |

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.
df[['wage', 'educ']].describe()
             wage        educ
count  526.000000  526.000000
mean     5.896103   12.562738
std      3.693086    2.769022
min      0.530000    0.000000
25%      3.330000   12.000000
50%      4.650000   12.000000
75%      6.880000   14.000000
max     24.980000   18.000000
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
print(wage_ols.summary())
                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           2.78e-22
Time:                        18:36:14   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
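
As a hedged reading of the estimates above, the slope implies that one more year of education raises the predicted hourly wage by about $0.54, while the negative intercept is the (not very meaningful) prediction at zero years of education. The sketch below just plugs the rounded coefficients from the summary into the fitted line; `predict_wage` is an illustrative helper, not part of statsmodels.

```python
# Rounded estimates copied from the regression summary above.
b0, b1 = -0.9049, 0.5414

def predict_wage(educ):
    """Predicted hourly wage (dollars) for a given number of years of education."""
    return b0 + b1 * educ

wage_8 = predict_wage(8)   # prediction for eight years of education
gain_4 = 4 * b1            # predicted change from four additional years
print(f"wage_hat(educ=8) = {wage_8:.2f}")
print(f"effect of 4 more years = {gain_4:.2f}")
```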

Example 2.5. Vote share

df = dataWoo('vote1')
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()
print(vote_ols.summary())
                            OLS Regression Results                            
Dep. Variable:                  voteA   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.855
Method:                 Least Squares   F-statistic:                     1018.
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           6.63e-74
Time:                        18:36:15   Log-Likelihood:                -565.20
No. Observations:                 173   AIC:                             1134.
Df Residuals:                     171   BIC:                             1141.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     26.8122      0.887     30.221      0.000      25.061      28.564
shareA         0.4638      0.015     31.901      0.000       0.435       0.493
Omnibus:                       20.747   Durbin-Watson:                   1.826
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.613
Skew:                           0.525   Prob(JB):                     2.05e-10
Kurtosis:                       5.255   Cond. No.                         112.

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.6. Table 2.2

df['salary_hat'] = salary_ols.fittedvalues
df['uhat'] = salary_ols.resid

df = round(df.iloc[:,0:4], 3)
   state  district  democA  voteA
0     AL         7       1     68
1     AK         1       0     62
2     AZ         2       1     73
3     AZ         3       0     69
4     AR         3       0     75
5     AR         4       1     69
6     CA         2       0     59
7     CA         3       1     71
8     CA         5       1     76
9     CA         6       1     73
10    CA         7       1     68
11    CA        11       1     71
12    CA        12       0     52
13    CA        16       1     79
14    CA        19       0     50
15    CA        23       1     64
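
The listing above shows the first four columns of `vote1`, because `df` still refers to that dataset when the slice is taken, so the `salary_hat` and `uhat` columns never appear. A table of fitted values and residuals would pair each observation's `roe` and `salary` with `salary_hat` and `uhat`. As a minimal sketch of that calculation, the following applies the coefficients reported in Example 2.3 to a few hypothetical (roe, salary) pairs; the pairs are made up and are not rows of `ceosal1`.

```python
# Coefficients reported for salary ~ roe in Example 2.3.
b0, b1 = 963.191336, 18.501186

# Hypothetical (roe, salary) pairs, for illustration only.
sample = [(10.0, 1000.0), (20.0, 1500.0), (30.0, 1200.0)]

table = []
for roe, salary in sample:
    salary_hat = b0 + b1 * roe    # fitted value
    uhat = salary - salary_hat    # residual = actual - fitted
    table.append((roe, salary, round(salary_hat, 3), round(uhat, 3)))

print(f"{'roe':>6} {'salary':>8} {'salary_hat':>11} {'uhat':>10}")
for roe, salary, salary_hat, uhat in table:
    print(f"{roe:>6.1f} {salary:>8.0f} {salary_hat:>11.3f} {uhat:>10.3f}")
```

With the real data, the same two columns are available directly as `salary_ols.fittedvalues` and `salary_ols.resid`, provided they are attached to the `ceosal1` frame rather than to `vote1`.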
df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print(salary_ols.summary())
                            OLS Regression Results                            
Dep. Variable:                 salary   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:15   Log-Likelihood:                -1804.5
No. Observations:                 209   AIC:                             3613.
Df Residuals:                     207   BIC:                             3620.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept    963.1913    213.240      4.517      0.000     542.790    1383.592
roe           18.5012     11.123      1.663      0.098      -3.428      40.431
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 2.7. Wage & education

df0 = dataWoo('wage1')
df = df0[['wage', 'educ']]
wage_ols = smf.ols(formula='wage ~ 1 + educ', data=df).fit()
print(wage_ols.summary())
                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.165
Model:                            OLS   Adj. R-squared:                  0.163
Method:                 Least Squares   F-statistic:                     103.4
Date:                Mon, 11 Dec 2023   Prob (F-statistic):           2.78e-22
Time:                        18:36:15   Log-Likelihood:                -1385.7
No. Observations:                 526   AIC:                             2775.
Df Residuals:                     524   BIC:                             2784.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     -0.9049      0.685     -1.321      0.187      -2.250       0.441
educ           0.5414      0.053     10.167      0.000       0.437       0.646
Omnibus:                      212.554   Durbin-Watson:                   1.824
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              807.843
Skew:                           1.861   Prob(JB):                    3.79e-176
Kurtosis:                       7.797   Cond. No.                         60.2

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Predicted wage at educ = 12.56, approximately the sample mean of educ
wage_hat = 0.5414*12.56 - 0.9049
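
The value 12.56 used above is, to two decimals, the sample mean of `educ` from the summary statistics in Example 2.4, and an OLS line with an intercept always passes through the point of sample means. As a quick check, subject to rounding of the coefficients, the sketch below evaluates the fitted line at the unrounded mean of `educ` and compares it with the mean of `wage`; all numbers are copied from the outputs above.

```python
# Rounded coefficients from the wage ~ educ summary.
b0, b1 = -0.9049, 0.5414

# Sample means from the describe() output in Example 2.4.
mean_educ = 12.562738
mean_wage = 5.896103

wage_hat_at_mean = b0 + b1 * mean_educ
gap = wage_hat_at_mean - mean_wage   # nonzero only because b0 and b1 are rounded

print(f"wage_hat at mean educ = {wage_hat_at_mean:.4f}")
print(f"mean wage             = {mean_wage:.4f}")
print(f"difference            = {gap:.4f}")
```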

Example 2.8. CEO Salary - R-squared

df0 = dataWoo('ceosal1')
df = df0[['roe', 'salary']]
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()
print('Parameters:', salary_ols.params)
print('Std:', salary_ols.bse)
print('R2: ', salary_ols.rsquared)
Parameters: Intercept    963.191336
roe           18.501186
dtype: float64
Std: Intercept    213.240257
roe           11.123251
dtype: float64
R2:  0.01318862408103405
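
R-squared is the explained sum of squares divided by the total sum of squares. As a sketch, the sums of squares printed in Example 2.3 (the `sum_sq` column for the `roe` and `Residual` rows) reproduce both the R-squared above and the F statistic of the regression; the variable names here are illustrative.

```python
# Sums of squares copied from the sum_sq column in Example 2.3.
explained_ss = 5.166419e+06   # row "roe"
residual_ss = 3.865666e+08    # row "Residual"
df_model, df_resid = 1, 207

total_ss = explained_ss + residual_ss
r_squared = explained_ss / total_ss

# F statistic: explained mean square over residual mean square.
f_stat = (explained_ss / df_model) / (residual_ss / df_resid)

print(f"R-squared = {r_squared:.6f}")
print(f"F         = {f_stat:.4f}")
```

An R-squared this small means `roe` explains only about 1.3% of the sample variation in `salary`.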

Example 2.9. Voting outcome - R-squared

See Example 2.5 for details.

from statsmodels.iolib.summary2 import summary_col

df0 = dataWoo('vote1')
df = df0[['voteA', 'shareA']]
vote_ols = smf.ols(formula='voteA ~ 1 + shareA', data=df).fit()

print(summary_col([vote_ols],stars=True, float_format='%0.2f'))
print('R2: ', vote_ols.rsquared)
Intercept      26.81***
shareA         0.46*** 
R-squared      0.86    
R-squared Adj. 0.86    
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
R2:  0.8561408655827665
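
In a simple regression, R-squared equals the squared sample correlation between the dependent and the explanatory variable, so the value above corresponds to a correlation of roughly 0.93 between `voteA` and `shareA`. The sketch below checks the identity on a small made-up sample (the `share` and `vote` lists are hypothetical, not rows of `vote1`).

```python
from math import sqrt
from statistics import mean

# Hypothetical spending shares and vote shares, for illustration only.
share = [20.0, 35.0, 50.0, 65.0, 80.0]
vote = [38.0, 41.0, 52.0, 55.0, 66.0]

x_bar, y_bar = mean(share), mean(vote)
sxx = sum((x - x_bar) ** 2 for x in share)
syy = sum((y - y_bar) ** 2 for y in vote)   # total sum of squares
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(share, vote))

# OLS line and its residual sum of squares.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(share, vote))

r2 = 1 - ssr / syy             # R-squared from the regression
corr = sxy / sqrt(sxx * syy)   # sample correlation coefficient
print(f"R2 = {r2:.4f}, corr^2 = {corr ** 2:.4f}")

# Correlation implied by the R-squared reported above (the slope is positive).
implied_corr = sqrt(0.8561408655827665)
print(f"implied corr(voteA, shareA) = {implied_corr:.3f}")
```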

Example 2.3 in Section 2.4: Units of measurement & functional form

df = dataWoo('ceosal1')
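df['salary1000'] = df.salary * 1000  # salary in dollars instead of thousands of dollars
salary_ols1000 = smf.ols(formula='salary1000 ~ 1 + roe', data=df).fit()
print(salary_ols1000.summary())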

                            OLS Regression Results                            
Dep. Variable:             salary1000   R-squared:                       0.013
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     2.767
Date:                Mon, 11 Dec 2023   Prob (F-statistic):             0.0978
Time:                        18:36:15   Log-Likelihood:                -3248.3
No. Observations:                 209   AIC:                             6501.
Df Residuals:                     207   BIC:                             6507.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept   9.632e+05   2.13e+05      4.517      0.000    5.43e+05    1.38e+06
roe          1.85e+04   1.11e+04      1.663      0.098   -3428.196    4.04e+04
Omnibus:                      311.096   Durbin-Watson:                   2.105
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            31120.902
Skew:                           6.915   Prob(JB):                         0.00
Kurtosis:                      61.158   Cond. No.                         43.3

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
salary_ols = smf.ols(formula='salary ~ 1 + roe', data=df).fit()

print(summary_col([salary_ols1000, salary_ols], stars=True, float_format='%0.2f',
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
                             'R2': lambda x: "{:.2f}".format(x.rsquared)}))
                salary1000    salary 
Intercept      963191.34*** 963.19***
               (213240.26)  (213.24) 
roe            18501.19*    18.50*   
               (11123.25)   (11.12)  
R-squared      0.01         0.01     
R-squared Adj. 0.01         0.01     
N              209          209      
R2             0.01         0.01     
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
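
The two columns illustrate a general property: multiplying the dependent variable by a constant multiplies the intercept, slope, and standard errors by the same constant and leaves R-squared unchanged. A minimal sketch with made-up data shows the same pattern for the coefficients and R-squared; the helper `fit_line` and the numbers are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, 1 - ssr / sst

# Hypothetical data, for illustration only.
roe = [10.0, 12.0, 15.0, 20.0, 25.0]
salary = [1100.0, 1250.0, 1200.0, 1400.0, 1380.0]   # thousands of dollars
salary_dollars = [1000 * s for s in salary]         # same salaries in dollars

b0, b1, r2 = fit_line(roe, salary)
b0_d, b1_d, r2_d = fit_line(roe, salary_dollars)

print(f"intercept ratio = {b0_d / b0:.1f}, slope ratio = {b1_d / b1:.1f}")
print(f"R2 (thousands) = {r2:.4f}, R2 (dollars) = {r2_d:.4f}")
```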

Example 2.10. A log wage equation (log-lin model, semi-elasticity)

df0 = dataWoo('wage1')
df = df0[['wage', 'lwage', 'educ']]
lwage_ols = smf.ols(formula='lwage ~ 1 + educ', data=df).fit()
print(summary_col([lwage_ols], stars=True, float_format='%0.3f',
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
                             'R2': lambda x: "{:.3f}".format(x.rsquared)}))
Intercept      0.584***
educ           0.083***
R-squared      0.186   
R-squared Adj. 0.184   
N              526     
R2             0.186   
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
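
In a log-level model, 100 times the slope is the approximate percentage change in `wage` for one more year of education. As a sketch using the rounded coefficient above, the exact percentage change implied by the model is 100*(exp(b1) - 1), which is slightly larger than the approximation.

```python
from math import exp

b1 = 0.083   # rounded coefficient on educ from the table above

approx_pct = 100 * b1              # usual semi-elasticity approximation
exact_pct = 100 * (exp(b1) - 1)    # exact change implied by log(wage) = b0 + b1*educ

print(f"approximate return to one year of education: {approx_pct:.2f}%")
print(f"exact return implied by the model:           {exact_pct:.2f}%")
```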

Example 2.11. CEO Salary & Firm Sales (log-log model, elasticity)

df = dataWoo('ceosal1')
lsalary_ols = smf.ols(formula='lsalary ~ 1 + lsales', data=df).fit()
print(summary_col([lsalary_ols], stars=True, float_format='%0.3f',
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
                             'R2': lambda x: "{:.3f}".format(x.rsquared)}))
Intercept      4.822***
lsales         0.257***
R-squared      0.211   
R-squared Adj. 0.207   
N              209     
R2             0.211   
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
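
In a log-log model the slope is an elasticity: a 1% increase in `sales` is associated with roughly a 0.257% increase in `salary`. For larger changes the approximation can be compared with the exact change implied by the fitted model, as in this sketch for a hypothetical 10% increase in sales.

```python
elasticity = 0.257   # rounded coefficient on lsales from the table above
pct_sales = 10.0     # hypothetical percentage increase in sales

# Approximation: percentage change in salary = elasticity * percentage change in sales.
approx_pct = elasticity * pct_sales

# Exact change implied by log(salary) = b0 + elasticity * log(sales).
exact_pct = 100 * ((1 + pct_sales / 100) ** elasticity - 1)

print(f"approximate change in salary: {approx_pct:.2f}%")
print(f"exact change in salary:       {exact_pct:.2f}%")
```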

Example 2.12. Student math performance

df = dataWoo('meap93')
df[['math10', 'lnchprg','lsalary']].describe()
           math10     lnchprg     lsalary
count  408.000000  408.000000  408.000000
mean    24.106863   25.201471   10.354385
std     10.493613   13.610075    0.154316
min      1.900000    1.400000    9.891618
25%     16.625000   14.625000   10.246563
50%     23.400000   23.849999   10.350286
75%     30.050000   33.825000   10.448707
max     66.699997   79.500000   10.874494
math_ols = smf.ols(formula='math10 ~ 1 + lnchprg', data=df).fit()
print(summary_col([math_ols], stars=True, float_format='%0.3f',
                  info_dict={'N': lambda x: "{0:d}".format(int(x.nobs)),
                             'R2': lambda x: "{:.3f}".format(x.rsquared)}))
Intercept      32.143***
lnchprg        -0.319***
R-squared      0.171    
R-squared Adj. 0.169    
N              408      
R2             0.171    
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
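
Read literally, the slope says that a 10-unit higher value of `lnchprg` goes with a predicted `math10` about 3.2 units lower. The sketch below plugs the rounded coefficients above into the fitted line; `predict_math10` is an illustrative helper, and the negative sign describes an association in this sample rather than an established causal effect.

```python
# Rounded coefficients from the math10 ~ lnchprg table above.
b0, b1 = 32.143, -0.319

def predict_math10(lnchprg):
    """Predicted math10 for a given value of lnchprg."""
    return b0 + b1 * lnchprg

change_10 = 10 * b1              # predicted change for a 10-unit increase
pred_25 = predict_math10(25)     # prediction near the sample mean of lnchprg

print(f"change in math10 for +10 lnchprg: {change_10:.2f}")
print(f"predicted math10 at lnchprg = 25: {pred_25:.3f}")
```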