class: center, middle, inverse, title-slide

# Day 6
## Linear model - regression
### Michael W. Kearney 📊

School of Journalism
Informatics Institute
University of Missouri

### @kearneymw
### @mkearney
---

Build the course directory on your computer:

```r
source(
  "https://raw.githubusercontent.com/mkearney/stat/master/R/build-course.R"
)
```

---
class: inverse, center, middle

## Linear model - regression

---

## Key terms

**DV**: Dependent Variable

+ The output or thing (singular) we are predicting
+ Synonyms: output, outcome, endogenous

**IV**: Independent Variable

+ The inputs or things (one or more) we use to predict
+ Synonyms: input, exogenous, predictor, covariate

---

## Line of best fit

+ Straight line that *best fits* the data
  - **Best fit**: criterion of minimizing the *sum of squared errors*
+ Line represents **predicted** (or **fitted**) values
  - For any value of an input, we predict the outcome value will be on the line

---

## Useful functions

+ `lm()`: Creates a regression model
  - Important args: `formula` and `data`
+ `summary()`: Results of the regression
+ `anova()`: Sum of squares information
+ `confint()`: Confidence intervals
+ `plot()`: Diagnostic displays

---
class: inverse, center, middle

## Models lexicon

---

## Labelling stats/models

**The [general] linear model**: statistical method for predicting an outcome variable using a linear combination of inputs (predictors)

**Difference tests**

+ t-test, F-test

**Analysis of variance**

+ ANOVA, ANCOVA, MANOVA, MANCOVA

**Regression**

+ Ordinary least squares (OLS) regression, linear regression

---

## Equation

**The [general] linear model**: statistical method for predicting an outcome variable using a linear combination of inputs (predictors)

+ Linear models can be represented with a formula

`$$Y_{i} = A + BX_{i} + E_{i}$$`

+ `X` and `Y` are observations of the predictor and outcome
+ `A` is the intercept (fitted value of `Y` when `X` is zero)
+ `B` is the slope: the expected change in `Y` for a one-unit change in `X`
+ `E` is the error (residual) term

---

## Example equation(s)

For a single predictor

`$$Y_{credibility} = A + B * X_{experience}$$`

For any number of predictors

`$$Y_{credibility} = A + B_{1} * X_{experience} + B_{2} * X_{market}$$`

---

## Ordinary Least Squares

+ All linear models attempt to find the **best fit** (matching observed with predicted values)
  - Maximize accuracy/minimize **residuals** (error)

`$$Residual_{i} = y_{i} - \widehat{y}_{i}$$`

+ The most common method is **least squares**, which minimizes the sum of squared errors
  - The squared differences between observed and predicted values of the outcome variable
+ **OLS regression** is simple, powerful, and widely used

---

## What can we regress?

<p style="text-align:center">
<img src="img/06-1.png" />
</p>

---

## Yep!

<p style="text-align:center">
<img src="img/06-2.png" />
</p>

---

## What about this?

<p style="text-align:center">
<img src="img/06-3.png" />
</p>

---

## Not precise!

<p style="text-align:center">
<img src="img/06-4.png" />
</p>

---

## And this one?

<p style="text-align:center">
<img src="img/06-5.png" />
</p>

---

## Yep!

<p style="text-align:center">
<img src="img/06-6.png" />
</p>

---

## And this one?

<p style="text-align:center">
<img src="img/06-7.png" />
</p>

---

## Prob not!

<p style="text-align:center">
<img src="img/06-8.png" />
</p>

---
class: inverse, center, middle

## Assumptions

---

## Assumptions of OLS regression

+ We don't assume much about predictors
+ We assume ***a lot*** about the outcome variable

---

## Assumptions

1. **Linear** combination of parameters (coefficients)
2. Mean of **residuals** is zero (observations fall equally above and below the fitted values)
3. **Homoscedasticity** or equal variance
4. **Independence** of residuals
5. Predictor variables are at least somewhat **orthogonal**/independent
  - Tolerance: 1 minus the squared multiple correlation of each predictor with the other predictors
  - VIF: 1 / tolerance (a quick check in R is sketched on the next slide)
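---

## Checking assumptions in R

A minimal sketch of how these checks might look, assuming a model fit to the built-in `mtcars` data and the car package for VIFs (both are illustrative choices, not requirements):

```r
## fit an example model with the built-in mtcars data
m <- lm(mpg ~ wt + gear, data = mtcars)

## base R diagnostic plots: residuals vs. fitted, normal Q-Q,
## scale-location, and residuals vs. leverage
plot(m)

## variance inflation factors (requires the car package);
## tolerance is simply 1 / VIF
vifs <- car::vif(m)
1 / vifs
```

Large VIFs (equivalently, small tolerances) suggest the predictors are far from orthogonal.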
---

## IID

+ **Independent and identically distributed**
+ Observed scores of the outcome occur independently and are the result of the same underlying data-generating process

---

## Model fit

+ Regression coefficients describe the size, direction, and reliability of the effect of one or more predictors on the expected value of the outcome variable
+ To assess how well a model fits the data overall, we need **model fit** statistics
  - Often conflated with **effect size**

---

## R-squared

+ For OLS regression, the most common fit statistic is `\(R^2\)`
  - Proportion of total variance explained by the model
  - Like the correlation coefficient, but it can only be positive (it ranges from 0 to 1)

`$$R^2 = \frac{s^2_{explained}}{s^2_{total}}$$`

+ Because `\(R^2\)` can never decrease as predictors are added, it's often appropriate to report an adjusted `\(R^2\)`
  - `\(R^2\)` estimate penalized by the number of parameters

---

## Other fit indices

+ **RMSE**: root mean squared error, the average distance of residuals from zero

`$$MSE=\frac{1}{n}\sum_{i=1}^{n} (\widehat{y}_{i} - y_{i})^2$$`

+ **F statistic**: compares the estimated model to a simpler model (e.g., an intercept-only model)
+ **AIC/BIC/CFI/TLI**: there are lots of variations, but these are also common

---

## Formulas in R

Indicate a formula in R with the tilde `~` symbol

```r
f <- y ~ x
class(f)
#> [1] "formula"
```

Use this formula notation to specify your models

+ LHS: always the outcome
+ RHS: always the predictor(s), separated by plus signs
+ **`data`**: make sure to specify your data set

---

Formula with `cor.test()`

```r
cor.test(~ mpg + wt, data = mtcars)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  mpg and wt
#> t = -9.559, df = 30, p-value = 0.000000000129
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.933826 -0.744087
#> sample estimates:
#>       cor 
#> -0.867659
```

Formula with `lm()`

```r
lm(mpg ~ wt, data = mtcars)
#> 
#> Call:
#> lm(formula = mpg ~ wt, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)           wt  
#>       37.29        -5.34
```

---

## Regression in R

Use the `lm()` function to specify the model.

+ First argument, `formula`, is the model specification
+ Specify the data set with the `data` argument

```r
## run model
m <- lm(mpg ~ wt + gear, data = mtcars)
```

---

Use the `summary()` function on the output from `lm()` to see the regression table results

```r
## view results
summary(m)
#> 
#> Call:
#> lm(formula = mpg ~ wt + gear, data = mtcars)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -4.130 -2.306 -0.293  1.441  6.830 
#> 
#> Coefficients:
#>             Estimate Std. Error t value    Pr(>|t|)    
#> (Intercept)   38.916      5.097    7.63 0.000000020 ***
#> wt            -5.485      0.699   -7.85 0.000000012 ***
#> gear          -0.320      0.927   -0.34        0.73    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.09 on 29 degrees of freedom
#> Multiple R-squared:  0.754, Adjusted R-squared:  0.737
#> F-statistic: 44.4 on 2 and 29 DF,  p-value: 0.00000000149
```
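---

## Confidence intervals and sums of squares

The `confint()` and `anova()` functions listed under "Useful functions" work on the same fitted model object. A minimal sketch (output omitted; the values depend on whatever model you fit):

```r
## 95% confidence intervals for the coefficients
confint(m)

## change the confidence level if desired
confint(m, level = 0.99)

## sequential sums of squares for each predictor
anova(m)
```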
---

## broom

The broom package is useful for viewing coefficients and model fit statistics too

```r
## another way to view coefficients
broom::tidy(m)
#> # A tibble: 3 x 5
#>   term        estimate std.error statistic      p.value
#>   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
#> 1 (Intercept)   38.9       5.10      7.63  0.0000000204
#> 2 wt            -5.49      0.699    -7.85  0.0000000117
#> 3 gear          -0.320     0.927    -0.345 0.733

## another way to view model fit
broom::glance(m)
#> # A tibble: 1 x 11
#>   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
#> *     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>  <dbl> <dbl> <dbl>
#> 1     0.754         0.737  3.09      44.4 1.49e-9     3  -79.9  168.  174.
#> # ... with 2 more variables: deviance <dbl>, df.residual <int>
```

---

## stargazer

The stargazer package is useful for outputting regression tables (printing to text in the console, HTML code, LaTeX, etc.)

There are a handful of other packages that do this too. I'm honestly not sure which one is best.

---

```r
stargazer::stargazer(m, type = "text")
#> 
#> ===============================================
#>                         Dependent variable:    
#>                     ---------------------------
#>                                 mpg            
#> -----------------------------------------------
#> wt                           -5.485***         
#>                               (0.699)          
#> 
#> gear                          -0.320           
#>                               (0.927)          
#> 
#> Constant                     38.916***         
#>                               (5.097)          
#> 
#> -----------------------------------------------
#> Observations                    32             
#> R2                             0.754           
#> Adjusted R2                    0.737           
#> Residual Std. Error       3.092 (df = 29)      
#> F Statistic           44.405*** (df = 2; 29)   
#> ===============================================
#> Note:               *p<0.1; **p<0.05; ***p<0.01
```

---

## knitr

Or, in R Markdown, you can use `knitr::kable()` on a data frame

```r
knitr::kable(broom::tidy(m), format = "html")
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> (Intercept) </td>
   <td style="text-align:right;"> 38.915653 </td>
   <td style="text-align:right;"> 5.097397 </td>
   <td style="text-align:right;"> 7.634417 </td>
   <td style="text-align:right;"> 0.000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> wt </td>
   <td style="text-align:right;"> -5.485019 </td>
   <td style="text-align:right;"> 0.698658 </td>
   <td style="text-align:right;"> -7.850791 </td>
   <td style="text-align:right;"> 0.000000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> gear </td>
   <td style="text-align:right;"> -0.319553 </td>
   <td style="text-align:right;"> 0.926543 </td>
   <td style="text-align:right;"> -0.344887 </td>
   <td style="text-align:right;"> 0.732668 </td>
  </tr>
 </tbody>
</table>
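---

## Predicted values

Recall that the regression line represents **predicted** (fitted) values. A minimal sketch of generating predictions from the fitted model with base R's `predict()`; the new `wt` and `gear` values below are made up for illustration:

```r
## predicted mpg for hypothetical cars weighing 2.5 and 3.5 (1,000 lbs)
## with 4 gears, along with 95% confidence intervals
new_cars <- data.frame(wt = c(2.5, 3.5), gear = c(4, 4))
predict(m, newdata = new_cars, interval = "confidence")

## fitted values and residuals for the observed data
head(fitted(m))
head(resid(m))
```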