class: center, middle, inverse, title-slide

# Day 6
## Linear model - regression
### Michael W. Kearney 📊

School of Journalism
Informatics Institute
University of Missouri

### @kearneymw
### @mkearney
---

Build the course directory on your computer:

```r
source(
  "https://raw.githubusercontent.com/mkearney/stat/master/R/build-course.R"
)
```

---
class: inverse, center, middle

## Linear model - regression

---

## Key terms

**DV**: Dependent Variable

+ The output or thing (singular) we are predicting
+ Synonyms: output, outcome, endogenous

**IV**: Independent Variable

+ The inputs or things (one or more) we use to predict
+ Synonyms: input, exogenous, predictor, covariate

---

## Line of best fit

+ Straight line that *best fits* the data
  - **Best fit**: criterion of minimizing the *sum of squared errors*
+ Line represents **predicted** (or **fitted**) values
  - For any value of an input, we predict the outcome value will be on the line

---

## Useful functions

+ `lm()`: Creates a regression model
  - Important args: `formula` and `data`
+ `summary()`: Results of the regression
+ `anova()`: Sum of squares information
+ `confint()`: Confidence intervals
+ `plot()`: Diagnostic displays

---
class: inverse, center, middle

## Models lexicon

---

## Labelling stats/models

**The [general] linear model**: statistical method for predicting an outcome variable using a linear combination of inputs (predictors)

**Difference tests**

+ t-test, F-test

**Analysis of variance**

+ ANOVA, ANCOVA, MANOVA, MANCOVA

**Regression**

+ Ordinary least squares (OLS) regression, linear regression

---

## Equation

**The [general] linear model**: statistical method for predicting an outcome variable using a linear combination of inputs (predictors)

+ Linear models can be represented with a formula

`$$Y_{i} = A + BX_{i} + E_{i}$$`

+ `X` and `Y` are observations of the predictor and outcome
+ `A` is the intercept (fitted value of `Y` when `X` is zero)
+ `B` is the slope: the expected change in `Y` for a one-unit change in `X`
+ `E` is the error (residual) term

---

## Example equation(s)

For a single predictor

`$$Y_{credibility} = A + B * X_{experience}$$`

For any number of predictors

`$$Y_{credibility} = A + B_{1} * X_{experience} + B_{2} * X_{market}$$`

---

## Ordinary Least Squares

+ All linear models attempt to find the **best fit** (matching observed with predicted values)
  - Maximize accuracy/minimize **residuals** (error)

`$$Residual_{i} = y_{i} - \widehat{y}_{i}$$`

+ The most common method is **least squares**, which minimizes the sum of squared errors
  - The squared differences between observed and predicted values of the outcome variable
+ **OLS regression** is simple, powerful, and widely used

---

## What can we regress?

<p style="text-align:center">
<img src="img/06-1.png" />
</p>

---

## Yep!

<p style="text-align:center">
<img src="img/06-2.png" />
</p>

---

## What about this?

<p style="text-align:center">
<img src="img/06-3.png" />
</p>

---

## Not precise!

<p style="text-align:center">
<img src="img/06-4.png" />
</p>

---

## And this one?

<p style="text-align:center">
<img src="img/06-5.png" />
</p>

---

## Yep!

<p style="text-align:center">
<img src="img/06-6.png" />
</p>

---

## And this one?

<p style="text-align:center">
<img src="img/06-7.png" />
</p>

---

## Prob not!

<p style="text-align:center">
<img src="img/06-8.png" />
</p>

---
class: inverse, center, middle

## Assumptions

---

## Assumptions of OLS regression

+ We don't assume much about predictors
+ We assume ***a lot*** about the outcome variable

---

## Assumptions

1. **Linear** combination of parameters (coefficients)
2. Mean of **residuals** is zero (observations fall equally above and below the fitted values)
3. **Homoscedasticity** or equal variance
4. **Independence** of residuals
5. Predictor variables are at least somewhat **orthogonal**/independent
  - Tolerance: 1 minus the squared multiple correlation of each predictor with the other predictors
  - VIF: 1 / tolerance (a quick check in R is sketched on the next slide)
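---

## Checking assumptions in R

A minimal sketch of how these checks might look, assuming a model fit to the built-in `mtcars` data and the car package for VIFs (both are illustrative choices, not requirements):

```r
## fit an example model with the built-in mtcars data
m <- lm(mpg ~ wt + gear, data = mtcars)

## base R diagnostic plots: residuals vs. fitted, normal Q-Q,
## scale-location, and residuals vs. leverage
plot(m)

## variance inflation factors (requires the car package);
## tolerance is simply 1 / VIF
vifs <- car::vif(m)
1 / vifs
```

Large VIFs (equivalently, small tolerances) suggest the predictors are far from orthogonal.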
---

## IID

+ **Independent and identically distributed**
+ Observed scores of the outcome occur independently and are the result of the same underlying data-generating process

---

## Model fit

+ Regression coefficients describe the size, direction, and reliability of the effect of one or more predictors on the expected value of the outcome variable
+ To assess how well a model fits the data overall, we need **model fit** statistics
  - Often conflated with **effect size**

---

## R-squared

+ For OLS regression, the most common fit statistic is `\(R^2\)`
  - Proportion of total variance explained by the model
  - Like the correlation coefficient, but it can only be positive (it ranges from 0 to 1)

`$$R^2 = \frac{s^2_{explained}}{s^2_{total}}$$`

+ Because `\(R^2\)` can never decrease as predictors are added, it's often appropriate to report an adjusted `\(R^2\)`
  - `\(R^2\)` estimate penalized by the number of parameters

---

## Other fit indices

+ **RMSE**: root mean squared error, the average distance of residuals from zero

`$$MSE=\frac{1}{n}\sum_{i=1}^{n} (\widehat{y}_{i} - y_{i})^2$$`

+ **F statistic**: compares the estimated model to a simpler model (e.g., an intercept-only model)
+ **AIC/BIC/CFI/TLI**: there are lots of variations, but these are also common

---

## Formulas in R

Indicate a formula in R with the tilde `~` symbol

```r
f <- y ~ x
class(f)
#> [1] "formula"
```

Use this formula notation to specify your models

+ LHS: always the outcome
+ RHS: always the predictor(s), separated by plus signs
+ **`data`**: make sure to specify your data set

---

Formula with `cor.test()`

```r
cor.test(~ mpg + wt, data = mtcars)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  mpg and wt
#> t = -9.559, df = 30, p-value = 0.000000000129
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.933826 -0.744087
#> sample estimates:
#>       cor 
#> -0.867659
```

Formula with `lm()`

```r
lm(mpg ~ wt, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>       37.29        -5.34
```

---

## Regression in R

Use the `lm()` function to specify the model.

+ First argument, `formula`, is the model specification
+ Specify the data set with the `data` argument

```r
## run model
m <- lm(mpg ~ wt + gear, data = mtcars)
```

---

Use the `summary()` function on the output from `lm()` to see the regression table results

```r
## view results
summary(m)
#> 
#> Call:
#> lm(formula = mpg ~ wt + gear, data = mtcars)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -4.130 -2.306 -0.293  1.441  6.830 
#> 
#> Coefficients:
#>             Estimate Std. Error t value    Pr(>|t|)    
#> (Intercept)   38.916      5.097    7.63 0.000000020 ***
#> wt            -5.485      0.699   -7.85 0.000000012 ***
#> gear          -0.320      0.927   -0.34        0.73    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.09 on 29 degrees of freedom
#> Multiple R-squared:  0.754, Adjusted R-squared:  0.737
#> F-statistic: 44.4 on 2 and 29 DF,  p-value: 0.00000000149
```
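---

## Confidence intervals and sums of squares

The `confint()` and `anova()` functions listed under "Useful functions" work on the same fitted model object. A minimal sketch (output omitted; the values depend on whatever model you fit):

```r
## 95% confidence intervals for the coefficients
confint(m)

## change the confidence level if desired
confint(m, level = 0.99)

## sequential sums of squares for each predictor
anova(m)
```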
---

## broom

The broom package is useful for viewing coefficients and model fit statistics too

```r
## another way to view coefficients
broom::tidy(m)
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)   38.9       5.10      7.63  0.0000000204
#> 2 wt            -5.49      0.699    -7.85  0.0000000117
#> 3 gear          -0.320     0.927    -0.345 0.733

## another way to view model fit
broom::glance(m)
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#> *     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.754         0.737  3.09      44.4 1.49e-9     3  -79.9  168.  174.
#> # ... with 2 more variables: deviance <dbl>, df.residual <int>
```

---

## stargazer

The stargazer package is useful for outputting regression tables (printing to text in the console, HTML code, LaTeX, etc.)

There are a handful of other packages that do this too. I'm honestly not sure which one is best.

---

```r
stargazer::stargazer(m, type = "text")
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                                 mpg            
#> -----------------------------------------------
#> wt                           -5.485***         
#>                               (0.699)          
#> 
#> gear                          -0.320           
#>                               (0.927)          
#> 
#> Constant                     38.916***         
#>                               (5.097)          
#> 
#> -----------------------------------------------
#> Observations                    32             
#> R2                             0.754           
#> Adjusted R2                    0.737           
#> Residual Std. Error       3.092 (df = 29)      
#> F Statistic           44.405*** (df = 2; 29)   
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

---

## knitr

Or, in R Markdown, you can use `knitr::kable()` on a data frame

```r
knitr::kable(broom::tidy(m), format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 38.915653 </td>
   <td style="text-align:right;"> 5.097397 </td>
   <td style="text-align:right;"> 7.634417 </td>
   <td style="text-align:right;"> 0.000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> wt </td>
   <td style="text-align:right;"> -5.485019 </td>
   <td style="text-align:right;"> 0.698658 </td>
   <td style="text-align:right;"> -7.850791 </td>
   <td style="text-align:right;"> 0.000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> gear </td>
   <td style="text-align:right;"> -0.319553 </td>
   <td style="text-align:right;"> 0.926543 </td>
   <td style="text-align:right;"> -0.344887 </td>
   <td style="text-align:right;"> 0.732668 </td>
  </tr>
 </tbody>
</table>
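---

## Predicted values

Recall that the regression line represents **predicted** (fitted) values. A minimal sketch of generating predictions from the fitted model with base R's `predict()`; the new `wt` and `gear` values below are made up for illustration:

```r
## predicted mpg for hypothetical cars weighing 2.5 and 3.5 (1,000 lbs)
## with 4 gears, along with 95% confidence intervals
new_cars <- data.frame(wt = c(2.5, 3.5), gear = c(4, 4))
predict(m, newdata = new_cars, interval = "confidence")

## fitted values and residuals for the observed data
head(fitted(m))
head(resid(m))
```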