11. CHAPTER 11. STATISTICAL INFERENCE FOR MULTIPLE REGRESSION#

SET UP

library(foreign) # to open stata.dta files
library(psych) # for better summary of descriptive statistics
library(repr) # to combine graphs with adjustable plot dimensions
options(repr.plot.width = 12, repr.plot.height = 6) # Plot dimensions (in inches)
options(width = 150) # To increase character width of printed output

11.1. 11.1 - 11.4 EXAMPLE: HOUSE PRICE AND CHARACTERISTICS#

df = read.dta(file = "Dataset/AED_HOUSE.DTA") 
attach(df) 
print(describe(df))
          vars  n      mean       sd median   trimmed      mad    min    max  range  skew kurtosis      se
price        1 29 253910.34 37390.71 244000 249296.00 22239.00 204000 375000 171000  1.48     2.23 6943.28
size         2 29   1882.76   398.27   1800   1836.00   296.52   1400   3300   1900  1.64     3.29   73.96
bedrooms     3 29      3.79     0.68      4      3.72     0.00      3      6      3  0.92     1.89    0.13
bathrooms    4 29      2.21     0.34      2      2.16     0.00      2      3      1  1.28     0.23    0.06
lotsize      5 29      2.14     0.69      2      2.16     0.00      1      3      2 -0.17    -1.00    0.13
age          6 29     36.41     7.12     35     36.20     5.93     23     51     28  0.44    -0.75    1.32
monthsold    7 29      5.97     1.68      6      6.04     1.48      3      8      5 -0.47    -1.03    0.31
list         8 29 257824.14 40860.26 245000 252364.00 23721.60 199900 386000 186100  1.65     2.70 7587.56

11.1.1. Table 11.2#

Multiple regression

library(jtools)
summ(ols <- lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold))
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(6,22) = 6.83, p = 0.00
R² = 0.65
Adj. R² = 0.56 

Standard errors: OLS
--------------------------------------------------------
                         Est.       S.E.   t val.      p
----------------- ----------- ---------- -------- ------
(Intercept)         137791.07   61464.95     2.24   0.04
size                    68.37      15.39     4.44   0.00
bedrooms              2685.32    9192.53     0.29   0.77
bathrooms             6832.88   15721.19     0.43   0.67
lotsize               2303.22    7226.54     0.32   0.75
age                   -833.04     719.33    -1.16   0.26
monthsold            -2088.50    3520.90    -0.59   0.56
--------------------------------------------------------

Confidence intervals

summ(ols, digits=3, confint = TRUE) 
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(6,22) = 6.826, p = 0.000
R² = 0.651
Adj. R² = 0.555 

Standard errors: OLS
-------------------------------------------------------------------------
                          Est.         2.5%        97.5%   t val.       p
----------------- ------------ ------------ ------------ -------- -------
(Intercept)         137791.066    10320.557   265261.574    2.242   0.035
size                    68.369       36.454      100.285    4.443   0.000
bedrooms              2685.315   -16378.816    21749.447    0.292   0.773
bathrooms             6832.880   -25770.876    39436.636    0.435   0.668
lotsize               2303.221   -12683.695    17290.138    0.319   0.753
age                   -833.039    -2324.847      658.770   -1.158   0.259
monthsold            -2088.504    -9390.399     5213.392   -0.593   0.559
-------------------------------------------------------------------------

Confidence interval computed manually for size

coef=summary(ols)$coefficients["size",1]
sterror=summary(ols)$coefficients["size",2]
df = summary(ols)$df[2]
N = length(price)
cbind(coef,sterror,df)
coef + c(-1,1)*sterror*qt(0.975, df)
         coef  sterror df
[1,] 68.36942 15.38947 22
[1]  36.45361 100.28523
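As a check, the same interval comes directly from the built-in confint function (a quick sketch, reusing the ols object fit above):

```r
# Built-in 95% confidence interval for the size coefficient;
# should reproduce the manual computation above
confint(ols, "size", level = 0.95)
```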

Test that beta_size = 50 against the two-sided alternative beta_size ≠ 50

coef=summary(ols)$coefficients["size",1]
sterror=summary(ols)$coefficients["size",2]
cbind(coef,sterror)
df = summary(ols)$df[2] # residual degrees of freedom N-k = 22
t = (coef - 50)/sterror
p = 2*(1-pt(abs(t),df))
tcrit = qt(.975,df)
print("t statistic, p-value, critical value")
cbind(t, p, tcrit)
         coef  sterror
[1,] 68.36942 15.38947
[1] "t statistic, p-value, critical value"
            t      p    tcrit
[1,] 1.193635 0.2455 2.073873
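The same two-sided test can be run with car's linearHypothesis, which reports the equivalent F statistic (for a single restriction F = t², and the p-values coincide); a sketch assuming the car package is installed:

```r
library(car)
# Wald test of the single restriction beta_size = 50
linearHypothesis(ols, "size = 50")
```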

11.2. 11.5 JOINT HYPOTHESIS TESTS#

Joint significance of all coefficients

This F statistic is reported as part of the complete OLS summary output.

ols.full = lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold) 
sum.full = summary(ols.full)
F = sum.full$fstatistic
k = sum.full$df[1]
df = sum.full$df[2]
critval = qf(.95,k-1,df)
pval = 1 - pf(F,k-1,df)
cbind(F[1], pval[1], critval)
             F         pval  critval
value 6.826098 0.0003424253 2.549061
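The overall F statistic can also be obtained by comparing the full model with an intercept-only model using anova; a quick sketch:

```r
# Overall F test as a nested-model comparison:
# intercept-only model versus the full regression
ols.null = lm(price ~ 1)
anova(ols.null, ols.full)
```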

Joint test that the coefficients on bedrooms, bathrooms, lotsize, age, and monthsold are all zero

library(car)
print(
    linearHypothesis(ols.full,c("bedrooms=0", "bathrooms=0", "lotsize=0",
      "age=0", "monthsold=0"))
)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:psych':

    logit
Linear hypothesis test:
bedrooms = 0
bathrooms = 0
lotsize = 0
age = 0
monthsold = 0

Model 1: restricted model
Model 2: price ~ size + bedrooms + bathrooms + lotsize + age + monthsold

  Res.Df        RSS Df  Sum of Sq      F Pr(>F)
1     27 1.4975e+10                            
2     22 1.3679e+10  5 1295703800 0.4168  0.832
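linearHypothesis can also run this joint test with heteroskedasticity-robust variance estimates via its white.adjust argument; a sketch of an HC1-robust version of the same test:

```r
# Same joint test using heteroskedasticity-robust (HC1) variance estimates
linearHypothesis(ols.full, c("bedrooms=0", "bathrooms=0", "lotsize=0",
    "age=0", "monthsold=0"), white.adjust = "hc1")
```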

11.3. 11.6 F STATISTIC UNDER ASSUMPTIONS 1-4#

Summary after lm saves a number of quantities:

$sigma is the residual standard error (root mean squared error), equal to sqrt(ResSS/(N-k))

$r.squared is R2, $adj.r.squared is adjusted R2, and $fstatistic is the F statistic

$df saves three quantities, (k, N-k, k*), where k is the effective number of parameters and k* >= k is the number of regressors, including any nonidentified regressors when there is perfect collinearity

so k = $df[1] and the residual degrees of freedom are df = N-k = $df[2]

sum.full = summary(ols.full)

Compute adjusted R-squared

k = sum.full$df[1]
df = sum.full$df[2]
rsq = sum.full$r.squared
s = sum.full$sigma
print(cbind(k,df,rsq,s))
     k df       rsq        s
[1,] 7 22 0.6505528 24935.73
ResSS = df*s^2 # Since s^2 = ResSS/(N-k) we have ResSS = (N-k)*s^2
TotSS = ResSS / (1 - rsq) # Since Rsq = 1 - (ResSS / TotSS) use TotSS = ResSS / (1 - Rsq)
ExpSS = TotSS - ResSS # And ExpSS = TotSS - ResSS
print(cbind(ResSS, ExpSS, TotSS))
           ResSS       ExpSS       TotSS
[1,] 13679397855 25466429042 39145826897
Foverall = (ExpSS/(k-1)) / (ResSS/df)
critval = qf(.95,k-1,df)
pval = 1 - pf(Foverall[1],k-1,df)
print(cbind(Foverall[1], pval[1], critval))
                            critval
[1,] 6.826098 0.0003424253 2.549061

Subset test that drops bedrooms, bathrooms, lotsize, age, and monthsold

Unrestricted model

ols.unrest = lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold)
sum.unrest = summary(ols.unrest)
k_u = sum.unrest$df[1]
df_u = sum.unrest$df[2]
s_u = sum.unrest$sigma
ResSS_u = df_u*s_u^2

Restricted model

ols.rest = lm(price ~ size) 
sum.rest = summary(ols.rest)
k_r = sum.rest$df[1]
df_r = sum.rest$df[2]
s_r = sum.rest$sigma
ResSS_r = df_r*s_r^2
q = k_u - k_r
Fstat = ((ResSS_r-ResSS_u)/q) / (ResSS_u/df_u)
critval = qf(.95,q,df_u)
pval = 1 - pf(Fstat[1],q,df_u)
cbind(Fstat[1], pval[1], critval)
A matrix: 1 × 3 of type dbl
critval
0.4167652 0.8319759 2.661274
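The same subset F test is available directly from anova on the nested models; a quick sketch reusing the fits above:

```r
# F test of the restricted model (size only) against the unrestricted model;
# matches the manually computed F statistic and p-value
anova(ols.rest, ols.unrest)
```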

11.4. 11.7 PRESENTATION OF REGRESSION RESULTS#

The following combines output from two models in a single table. It uses the jtools package, which in turn requires huxtable for table output.

library(huxtable)

# The two models
ols.full = lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold)
summ(ols.full)
ols.small = lm(price ~ size)
summ(ols.small)
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(6,22) = 6.83, p = 0.00
R² = 0.65
Adj. R² = 0.56 

Standard errors: OLS
--------------------------------------------------------
                         Est.       S.E.   t val.      p
----------------- ----------- ---------- -------- ------
(Intercept)         137791.07   61464.95     2.24   0.04
size                    68.37      15.39     4.44   0.00
bedrooms              2685.32    9192.53     0.29   0.77
bathrooms             6832.88   15721.19     0.43   0.67
lotsize               2303.22    7226.54     0.32   0.75
age                   -833.04     719.33    -1.16   0.26
monthsold            -2088.50    3520.90    -0.59   0.56
--------------------------------------------------------
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(1,27) = 43.58, p = 0.00
R² = 0.62
Adj. R² = 0.60 

Standard errors: OLS
--------------------------------------------------------
                         Est.       S.E.   t val.      p
----------------- ----------- ---------- -------- ------
(Intercept)         115017.28   21489.36     5.35   0.00
size                    73.77      11.17     6.60   0.00
--------------------------------------------------------

The following table uses default standard errors, reported in parentheses under each coefficient estimate. For heteroskedasticity-robust standard errors add robust = "HC1".

export_summs(ols.full, ols.small, scale = TRUE)
                   Model 1         Model 2
(Intercept)  253910.34 ***   253910.34 ***
               (4630.45)       (4373.25)
size          27229.63 ***    29380.95 ***
               (6129.20)       (4450.66)
bedrooms        1812.67
               (6205.23)
bathrooms       2331.00
               (5363.19)
lotsize         1596.21
               (5008.23)
age            -5930.38
               (5120.92)
monthsold      -3507.32
               (5912.80)
N                    29              29
R2                 0.65            0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.
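Following the note above, a heteroskedasticity-robust version of the same table can be requested by passing robust = "HC1" (a sketch; jtools computes the robust variances via the sandwich package):

```r
# Same two-model table with HC1 heteroskedasticity-robust standard errors
export_summs(ols.full, ols.small, scale = TRUE, robust = "HC1")
```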

This gives t-statistics

export_summs(ols.full, ols.small, scale = TRUE, 
             error_format = "({statistic})")
                   Model 1         Model 2
(Intercept)  253910.34 ***   253910.34 ***
                 (54.83)         (58.06)
size          27229.63 ***    29380.95 ***
                  (4.44)          (6.60)
bedrooms        1812.67
                  (0.29)
bathrooms       2331.00
                  (0.43)
lotsize         1596.21
                  (0.32)
age            -5930.38
                 (-1.16)
monthsold      -3507.32
                 (-0.59)
N                    29              29
R2                 0.65            0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.

This gives t-statistics and p-values

export_summs(ols.full, ols.small, scale = TRUE, 
    error_format = "({statistic}, p = {p.value})")
                        Model 1                Model 2
(Intercept)       253910.34 ***          253910.34 ***
             (54.83, p = 0.000)     (58.06, p = 0.000)
size               27229.63 ***           29380.95 ***
              (4.44, p = 0.000)      (6.60, p = 0.000)
bedrooms            1812.67
              (0.29, p = 0.773)
bathrooms           2331.00
              (0.43, p = 0.668)
lotsize             1596.21
              (0.32, p = 0.753)
age                -5930.38
             (-1.16, p = 0.259)
monthsold          -3507.32
             (-0.59, p = 0.559)
N                        29                     29
R2                     0.65                   0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.

This gives 95% confidence interval

export_summs(ols.full, ols.small, scale = TRUE,
             error_format = "[{conf.low}, {conf.high}]")
                             Model 1                   Model 2
(Intercept)            253910.34 ***             253910.34 ***
             [244307.38, 263513.31]    [244937.18, 262883.51]
size                    27229.63 ***              29380.95 ***
               [14518.46, 39940.81]      [20248.96, 38512.94]
bedrooms                 1812.67
              [-11056.19, 14681.52]
bathrooms                2331.00
               [-8791.58, 13453.58]
lotsize                  1596.21
               [-8790.23, 11982.65]
age                     -5930.38
               [-16550.53, 4689.77]
monthsold               -3507.32
               [-15769.71, 8755.08]
N                             29                        29
R2                          0.65                      0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.