11. CHAPTER 11. STATISTICAL INFERENCE FOR MULTIPLE REGRESSION#

SET UP

library(foreign) # to open stata.dta files
library(psych) # for better summary of descriptive statistics
library(repr) # to combine graphs with adjustable plot dimensions
options(repr.plot.width = 12, repr.plot.height = 6) # Plot dimensions (in inches)
options(width = 150) # To increase character width of printed output

11.1. 11.1 - 11.4 EXAMPLE: HOUSE PRICE AND CHARACTERISTICS#

df = read.dta(file = "Dataset/AED_HOUSE.DTA") 
attach(df) 
print(describe(df))
          vars  n      mean       sd median   trimmed      mad    min    max  range  skew kurtosis      se
price        1 29 253910.34 37390.71 244000 249296.00 22239.00 204000 375000 171000  1.48     2.23 6943.28
size         2 29   1882.76   398.27   1800   1836.00   296.52   1400   3300   1900  1.64     3.29   73.96
bedrooms     3 29      3.79     0.68      4      3.72     0.00      3      6      3  0.92     1.89    0.13
bathrooms    4 29      2.21     0.34      2      2.16     0.00      2      3      1  1.28     0.23    0.06
lotsize      5 29      2.14     0.69      2      2.16     0.00      1      3      2 -0.17    -1.00    0.13
age          6 29     36.41     7.12     35     36.20     5.93     23     51     28  0.44    -0.75    1.32
monthsold    7 29      5.97     1.68      6      6.04     1.48      3      8      5 -0.47    -1.03    0.31
list         8 29 257824.14 40860.26 245000 252364.00 23721.60 199900 386000 186100  1.65     2.70 7587.56

11.1.1. Table 11.2#

Multiple regression

library(jtools)
summ(ols <- lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold))
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(6,22) = 6.83, p = 0.00
R² = 0.65
Adj. R² = 0.56 

Standard errors: OLS
--------------------------------------------------------
                         Est.       S.E.   t val.      p
----------------- ----------- ---------- -------- ------
(Intercept)         137791.07   61464.95     2.24   0.04
size                    68.37      15.39     4.44   0.00
bedrooms              2685.32    9192.53     0.29   0.77
bathrooms             6832.88   15721.19     0.43   0.67
lotsize               2303.22    7226.54     0.32   0.75
age                   -833.04     719.33    -1.16   0.26
monthsold            -2088.50    3520.90    -0.59   0.56
--------------------------------------------------------

Confidence intervals

summ(ols, digits=3, confint = TRUE) 
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(6,22) = 6.826, p = 0.000
R² = 0.651
Adj. R² = 0.555 

Standard errors: OLS
-------------------------------------------------------------------------
                          Est.         2.5%        97.5%   t val.       p
----------------- ------------ ------------ ------------ -------- -------
(Intercept)         137791.066    10320.557   265261.574    2.242   0.035
size                    68.369       36.454      100.285    4.443   0.000
bedrooms              2685.315   -16378.816    21749.447    0.292   0.773
bathrooms             6832.880   -25770.876    39436.636    0.435   0.668
lotsize               2303.221   -12683.695    17290.138    0.319   0.753
age                   -833.039    -2324.847      658.770   -1.158   0.259
monthsold            -2088.504    -9390.399     5213.392   -0.593   0.559
-------------------------------------------------------------------------

Confidence interval computed manually for size

coef=summary(ols)$coefficients["size",1]
sterror=summary(ols)$coefficients["size",2]
df = summary(ols)$df[2]
N = length(price)
cbind(coef,sterror,df)
coef + c(-1,1)*sterror*qt(0.975, df)
         coef  sterror df
[1,] 68.36942 15.38947 22
[1]  36.45361 100.28523
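As a check, the same interval comes directly from the built-in confint function (a quick sketch, reusing the ols object fit above):

```r
# Built-in 95% confidence interval for the size coefficient;
# should reproduce the manual computation above
confint(ols, "size", level = 0.95)
```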

Test that beta_size = 50 against the two-sided alternative beta_size ≠ 50

coef=summary(ols)$coefficients["size",1]
sterror=summary(ols)$coefficients["size",2]
cbind(coef,sterror)
df = summary(ols)$df[2] # residual degrees of freedom N-k = 22
t = (coef - 50)/sterror
p = 2*(1-pt(abs(t),df))
tcrit = qt(.975,df)
print("t statistic, p-value, critical value")
cbind(t, p, tcrit)
         coef  sterror
[1,] 68.36942 15.38947
[1] "t statistic, p-value, critical value"
            t      p    tcrit
[1,] 1.193635 0.2455 2.073873
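The same two-sided test can be run with car's linearHypothesis, which reports the equivalent F statistic (for a single restriction F = t², and the p-values coincide); a sketch assuming the car package is installed:

```r
library(car)
# Wald test of the single restriction beta_size = 50
linearHypothesis(ols, "size = 50")
```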

11.2. 11.5 JOINT HYPOTHESIS TESTS#

Joint significance of all coefficients

This F statistic is reported as part of the complete OLS summary output.

ols.full = lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold) 
sum.full = summary(ols.full)
F = sum.full$fstatistic
k = sum.full$df[1]
df = sum.full$df[2]
critval = qf(.95,k-1,df)
pval = 1 - pf(F,k-1,df)
cbind(F[1], pval[1], critval)
             F         pval  critval
value 6.826098 0.0003424253 2.549061
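The overall F statistic can also be obtained by comparing the full model with an intercept-only model using anova; a quick sketch:

```r
# Overall F test as a nested-model comparison:
# intercept-only model versus the full regression
ols.null = lm(price ~ 1)
anova(ols.null, ols.full)
```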

Joint test that the coefficients on bedrooms, bathrooms, lotsize, age, and monthsold are all zero

library(car)
print(
    linearHypothesis(ols.full,c("bedrooms=0", "bathrooms=0", "lotsize=0",
      "age=0", "monthsold=0"))
)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:psych':

    logit
Linear hypothesis test:
bedrooms = 0
bathrooms = 0
lotsize = 0
age = 0
monthsold = 0

Model 1: restricted model
Model 2: price ~ size + bedrooms + bathrooms + lotsize + age + monthsold

  Res.Df        RSS Df  Sum of Sq      F Pr(>F)
1     27 1.4975e+10                            
2     22 1.3679e+10  5 1295703800 0.4168  0.832
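linearHypothesis can also run this joint test with heteroskedasticity-robust variance estimates via its white.adjust argument; a sketch of an HC1-robust version of the same test:

```r
# Same joint test using heteroskedasticity-robust (HC1) variance estimates
linearHypothesis(ols.full, c("bedrooms=0", "bathrooms=0", "lotsize=0",
    "age=0", "monthsold=0"), white.adjust = "hc1")
```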

11.3. 11.6 F STATISTIC UNDER ASSUMPTIONS 1-4#

Summary after lm saves a number of quantities:

$sigma is the residual standard error (root mean squared error), equal to sqrt(ResSS/(N-k))

$r.squared is R2, $adj.r.squared is adjusted R2, and $fstatistic is the F statistic

$df saves three quantities, (k, N-k, k*), where k is the effective number of parameters and k* >= k is the number of regressors, including any nonidentified regressors when there is perfect collinearity

so k = $df[1] and the residual degrees of freedom are df = N-k = $df[2]

sum.full = summary(ols.full)

Compute adjusted R-squared

k = sum.full$df[1]
df = sum.full$df[2]
rsq = sum.full$r.squared
s = sum.full$sigma
print(cbind(k,df,rsq,s))
     k df       rsq        s
[1,] 7 22 0.6505528 24935.73
ResSS = df*s^2 # Since s^2 = ResSS/(N-k) we have ResSS = (N-k)*s^2
TotSS = ResSS / (1 - rsq) # Since Rsq = 1 - (ResSS / TotSS) use TotSS = ResSS / (1 - Rsq)
ExpSS = TotSS - ResSS # And ExpSS = TotSS - ResSS
print(cbind(ResSS, ExpSS, TotSS))
           ResSS       ExpSS       TotSS
[1,] 13679397855 25466429042 39145826897
Foverall = (ExpSS/(k-1)) / (ResSS/df)
critval = qf(.95,k-1,df)
pval = 1 - pf(Foverall[1],k-1,df)
print(cbind(Foverall[1], pval[1], critval))
                            critval
[1,] 6.826098 0.0003424253 2.549061

Subset test that drops bedrooms, bathrooms, lotsize, age, and monthsold

Unrestricted model

ols.unrest = lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold)
sum.unrest = summary(ols.unrest)
k_u = sum.unrest$df[1]
df_u = sum.unrest$df[2]
s_u = sum.unrest$sigma
ResSS_u = df_u*s_u^2

Restricted model

ols.rest = lm(price ~ size) 
sum.rest = summary(ols.rest)
k_r = sum.rest$df[1]
df_r = sum.rest$df[2]
s_r = sum.rest$sigma
ResSS_r = df_r*s_r^2
q = k_u - k_r
Fstat = ((ResSS_r-ResSS_u)/q) / (ResSS_u/df_u)
critval = qf(.95,q,df_u)
pval = 1 - pf(Fstat[1],q,df_u)
cbind(Fstat[1], pval[1], critval)
A matrix: 1 × 3 of type dbl
critval
0.4167652 0.8319759 2.661274
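The same subset F test is available directly from anova on the nested models; a quick sketch reusing the fits above:

```r
# F test of the restricted model (size only) against the unrestricted model;
# matches the manually computed F statistic and p-value
anova(ols.rest, ols.unrest)
```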

11.4. 11.7 PRESENTATION OF REGRESSION RESULTS#

The following combines output from two models in a single table. It uses the jtools package, which in turn requires huxtable for table output.

library(huxtable)

# The two models
ols.full = lm(price ~ size+bedrooms+bathrooms+lotsize+age+monthsold)
summ(ols.full)
ols.small = lm(price ~ size)
summ(ols.small)
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(6,22) = 6.83, p = 0.00
R² = 0.65
Adj. R² = 0.56 

Standard errors: OLS
--------------------------------------------------------
                         Est.       S.E.   t val.      p
----------------- ----------- ---------- -------- ------
(Intercept)         137791.07   61464.95     2.24   0.04
size                    68.37      15.39     4.44   0.00
bedrooms              2685.32    9192.53     0.29   0.77
bathrooms             6832.88   15721.19     0.43   0.67
lotsize               2303.22    7226.54     0.32   0.75
age                   -833.04     719.33    -1.16   0.26
monthsold            -2088.50    3520.90    -0.59   0.56
--------------------------------------------------------
MODEL INFO:
Observations: 29
Dependent Variable: price
Type: OLS linear regression 

MODEL FIT:
F(1,27) = 43.58, p = 0.00
R² = 0.62
Adj. R² = 0.60 

Standard errors: OLS
--------------------------------------------------------
                         Est.       S.E.   t val.      p
----------------- ----------- ---------- -------- ------
(Intercept)         115017.28   21489.36     5.35   0.00
size                    73.77      11.17     6.60   0.00
--------------------------------------------------------

The following table uses default standard errors, reported in parentheses under each coefficient estimate. For heteroskedasticity-robust standard errors add robust = "HC1".

export_summs(ols.full, ols.small, scale = TRUE)
                   Model 1         Model 2
(Intercept)  253910.34 ***   253910.34 ***
               (4630.45)       (4373.25)
size          27229.63 ***    29380.95 ***
               (6129.20)       (4450.66)
bedrooms        1812.67
               (6205.23)
bathrooms       2331.00
               (5363.19)
lotsize         1596.21
               (5008.23)
age            -5930.38
               (5120.92)
monthsold      -3507.32
               (5912.80)
N                    29              29
R2                 0.65            0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.
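Following the note above, a heteroskedasticity-robust version of the same table can be requested by passing robust = "HC1" (a sketch; jtools computes the robust variances via the sandwich package):

```r
# Same two-model table with HC1 heteroskedasticity-robust standard errors
export_summs(ols.full, ols.small, scale = TRUE, robust = "HC1")
```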

This gives t-statistics

export_summs(ols.full, ols.small, scale = TRUE, 
             error_format = "({statistic})")
                   Model 1         Model 2
(Intercept)  253910.34 ***   253910.34 ***
                 (54.83)         (58.06)
size          27229.63 ***    29380.95 ***
                  (4.44)          (6.60)
bedrooms        1812.67
                  (0.29)
bathrooms       2331.00
                  (0.43)
lotsize         1596.21
                  (0.32)
age            -5930.38
                 (-1.16)
monthsold      -3507.32
                 (-0.59)
N                    29              29
R2                 0.65            0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.

This gives t-statistics and p-values

export_summs(ols.full, ols.small, scale = TRUE, 
    error_format = "({statistic}, p = {p.value})")
                        Model 1                Model 2
(Intercept)       253910.34 ***          253910.34 ***
             (54.83, p = 0.000)     (58.06, p = 0.000)
size               27229.63 ***           29380.95 ***
              (4.44, p = 0.000)      (6.60, p = 0.000)
bedrooms            1812.67
              (0.29, p = 0.773)
bathrooms           2331.00
              (0.43, p = 0.668)
lotsize             1596.21
              (0.32, p = 0.753)
age                -5930.38
             (-1.16, p = 0.259)
monthsold          -3507.32
             (-0.59, p = 0.559)
N                        29                     29
R2                     0.65                   0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.

This gives 95% confidence interval

export_summs(ols.full, ols.small, scale = TRUE,
             error_format = "[{conf.low}, {conf.high}]")
                             Model 1                   Model 2
(Intercept)            253910.34 ***             253910.34 ***
             [244307.38, 263513.31]    [244937.18, 262883.51]
size                    27229.63 ***              29380.95 ***
               [14518.46, 39940.81]      [20248.96, 38512.94]
bedrooms                 1812.67
              [-11056.19, 14681.52]
bathrooms                2331.00
               [-8791.58, 13453.58]
lotsize                  1596.21
               [-8790.23, 11982.65]
age                     -5930.38
               [-16550.53, 4689.77]
monthsold               -3507.32
               [-15769.71, 8755.08]
N                             29                        29
R2                          0.65                      0.62
All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05.