class: center, middle, inverse, title-slide

# Day 4
## Correlations and factor analysis

### Michael W. Kearney 📊

School of Journalism<br>
Informatics Institute<br>
University of Missouri

### @kearneymw
@mkearney
---
class: inverse, center, middle

## Agenda

---

## Agenda

1. Co-variance
   - What is covariance?
   - Calculating covariance
1. Correlations
   - Bivariate correlations
   - Correlation matrices
1. Factor analysis
   - What is a factor?
   - Calculating scores
   - Reliability

---
class: inverse, center, middle

## Test

Try this out:

```r
source("https://raw.githubusercontent.com/mkearney/stat/master/static/build.R")
```

---
class: inverse, center, middle

## Co-variance

---

## Variation (review)

**Variance**: average squared distance from the mean

`$$s^2=\frac{\sum (x - \bar{x})^2}{n - 1}$$`

+ Single variable: `\(x\)`
+ Always positive: `\((x - \bar{x})^2\)`

---

## Co-variation

**Covariance**: average product of *paired* distances from the mean

`$$cov_{(x, y)} = \frac{\sum (x-\bar{x})(y-\bar{y})}{n - 1}$$`

+ Two variables: `\(x\)` and `\(y\)`
+ Can be positive or negative
  - Positive means they vary [above/below the mean] together
  - Negative means they vary [above/below the mean] inversely

---

## Note:

When a variable covaries with itself...

`$$cov_{(x, x)} = \frac{\sum (x-\bar{x})(x-\bar{x})}{n - 1}$$`

`$$cov_{(x, x)} = \frac{\sum (x - \bar{x})^2}{n - 1}$$`

`$$cov_{(x, x)} = s^2$$`

> Covariance with self is the same as variance

---

## Correlation

A **correlation** describes how variation in one variable relates to variation in another variable.

In statistics, correlations describe the **magnitude** and **direction** of a relationship between two variables.

---

## Example code...

```r
## load the tidyverse (for tibble() and ggplot2)
library(tidyverse)

## create data frame with x and y coming from random normal draws
df <- tibble(x = rnorm(20), y = x + rnorm(20))

## build plot of points
p <- ggplot(df, aes(x, y)) +
  geom_vline(xintercept = 0, color = "#666666") +
  geom_hline(yintercept = 0, color = "#666666") +
  geom_point(shape = 21, fill = "#F1B82D", size = 8, alpha = .8) +
  labs(title = "Scatter plot of y by x",
    x = "Observations of x", y = "Observations of y")

## save plot
ggsave("img/scatter_plot.png", p, width = 8, height = 7, units = "in")
```

---

<p style="text-align: center;">
<img src="img/scatter_plot.png" />
</p>

---

## Correlation coefficient

**`r`**: describes the magnitude (size) and direction (order) of a relationship

There are several types of correlation coefficients, but typically they all...

+ have a **range of possible values** from -1.0 to +1.0
+ **describe** the magnitude (size) and direction (order) of an association
  - **Direction** by **sign** (positive or negative)
  - **Magnitude** by **absolute value** (distance from zero)

---

## Note(s) on "coefficients"

> Throughout statistics, a *coefficient* refers to the multiplicative factor (a numeric estimate) for one variable in relation to another variable.

> In **regression**, coefficients are often referred to as "beta coefficients" or "beta weights"

> In **machine learning**, coefficients are often referred to as [simply] "weights"
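---

## Direction and magnitude, in code

A quick simulated sketch (added for illustration; the data and `set.seed()` call are not from the original examples) showing how the sign of `r` captures direction and its absolute value captures magnitude:

```r
## simulate three relationships and compare their correlations
set.seed(42)
x <- rnorm(100)

cor(x, x + rnorm(100, sd = .2))   ## strong positive: r near +1
cor(x, -x + rnorm(100, sd = .2))  ## strong negative: r near -1
cor(x, rnorm(100))                ## unrelated: r near 0
```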
---

## Note(s) on notation

In many statistical modeling frameworks, a correlation is often indicated (by convention) by connecting two variable [names] with a double tilde (`~~`) symbol.

I'll occasionally use this kind of syntax to mean **correlation**

```
x ~~ y
```

---

## Types of correlations

**Pearson product-moment**: linear relationship between two variables

`$$r_{(x,y)} = x\sim\sim y$$`

**Spearman's rho**: relationship between rankings of two variables

`$$\rho_{(x,y)} = x\sim\sim y$$`

**Intraclass**: relationship between paired observations

`$$\varrho_{(x_{t1},x_{t2})} = x_{t1}\sim\sim x_{t2}$$`

---

## Pearson's `r`

**Correlation**: the covariation divided by the total variation

`$$r_{(x, y)} = \frac{cov_{(x, y)}}{s_{x}s_{y}}$$`

+ `\(s_{x}\)` and `\(s_{y}\)` are the sample standard deviations of `\(x\)` and `\(y\)`
  - *Note*: `\(s\)` is the square root of the variance, i.e., `\(\sqrt{s^2}\)`

---

## Note(s) on `x ~~ x`

A single variable `\(x\)` correlates perfectly with itself...

`$$r_{(x, x)} = \frac{cov_{(x, x)}}{s_{x}s_{x}}$$`

+ We know from earlier that `\(cov_{(x,x)}\)` is equal to `\(s_{x}^2\)` (the variance)
+ And `\(s_{x}s_{x}\)` can be rewritten as `\(s_{x}^2\)`, so...

`$$r_{(x, x)} = \frac{s_{x}^2}{s_{x}^2} = 1.0$$`

---

## Numeric and rank-order data

```r
## create data frame with x and y coming from random normal draws
df <- tibble(x = rnorm(20), y = x + rnorm(20))

## convert to ranks (factor level codes equal ranks when values are distinct)
dfo <- map_df(df, factor) %>%
  mutate_all(as.numeric)

## print
print(df, n = 6)
```

| x      | y      |
|:------:|:------:|
| 1.698  | 1.325  |
| 0.570  | -0.131 |
| 0.051  | -0.075 |
| -0.027 | 1.236  |
| 0.679  | 1.089  |
| -0.583 | -0.178 |

---

## `cor(method = "pearson")`

By default, `cor()` returns the Pearson product-moment correlation.

```r
## using cor defaults
cor(df$x, df$y)
#> [1] 0.624567

## same thing as
cor(df$x, df$y, method = "pearson")
#> [1] 0.624567
```

Use `method` to specify the type of correlation and `use` to deal with missing data.

```r
## pearson product-moment correlation
cor(df$x, df$y, method = "pearson", use = "pairwise.complete.obs")
#> [1] 0.624567
```

---

## `cor(method = "spearman")`

Get the Spearman's rho (rank-order) correlation.

```r
## spearman rho correlation of df
cor(df$x, df$y, method = "spearman")
#> [1] 0.578947
```

This should be equal to the Pearson correlation of the rank-order data set.

```r
## spearman rho correlation
cor(dfo$x, dfo$y, method = "pearson")
#> [1] 0.578947
```

---

## Rank-order correlation

Calculate **Spearman's rho** the same way, only convert observations to ranked values first.

**Example**: converting observations to ranked values

+ `c(22, 2, 5, 1)` converts to `c(4, 2, 3, 1)`
+ `c(7.2, 3.0, 5.25, 1.9)` converts to `c(4, 2, 3, 1)`

---

## Basic example

Let's say we have four values of x and y, and we want to estimate Pearson's `r`

```r
x <- c(22, 6, 15, 1)
y <- c(1, 3, 9, 10)
cor(x, y, method = "pearson")
#> [1] -0.548093
```

Now estimate Spearman's rho using the same values

```r
cor(x, y, method = "spearman")
#> [1] -0.8
```

Notice a difference?

---

## Example

We can actually replicate the result from the previous slide using Pearson's method if we first convert x and y to ranks

```r
x_ranks <- c(4, 2, 3, 1)
y_ranks <- c(1, 2, 3, 4)
cor(x_ranks, y_ranks, method = "pearson")
#> [1] -0.8
```
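---

## Shortcut: `rank()`

Rather than typing the ranks by hand, base R's `rank()` does the conversion (a quick sketch using the same `x` and `y` as above; the `-0.8` matches the result we just replicated):

```r
## rank() converts raw values to their rank order
rank(x)  ## 4 2 3 1
rank(y)  ## 1 2 3 4

## Pearson's r on the ranks reproduces Spearman's rho
cor(rank(x), rank(y), method = "pearson")
#> [1] -0.8
```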
---

Create a data frame to visualize the difference...

```r
df <- tibble(
  data = c(rep("raw", 4), rep("rank", 4)),
  x = c(x, x_ranks),
  y = c(y, y_ranks))
```

```r
p <- ggplot(df, aes(x = x, y = y, fill = data)) +
  geom_point(shape = 21, size = 8, alpha = .8) +
  labs(title = "Scatter plot of raw and rank-order values") +
  scale_fill_manual(values = c(raw = "#F1B82D", rank = "#222222"))

ggsave("img/scatter_plot-xy.png", p, width = 8, height = 4.5, units = "in")
```

<p style="text-align: center;">
<img src="img/scatter_plot-xy.png" />
</p>

---

## Hypothesis testing

To conduct a significance test of a correlation using R, use `cor.test()`.

```r
## correlation test
cor.test(df$x, df$y, method = "pearson", alternative = "two.sided")
#> 
#> 	Pearson's product-moment correlation
#> 
#> data:  df$x and df$y
#> t = -0.2204, df = 6, p-value = 0.833
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.747114  0.656509
#> sample estimates:
#>        cor 
#> -0.0896272
```

.footnote[To convert the output to a more usable data frame, use `tidy()` from the **{broom}** package.]

---

## `method = "pearson"`

Correlation test of interval/ratio data

```r
## correlation sig test
cor.test(df$x, df$y, method = "pearson") %>%
  broom::tidy() %>%
  print()
```

| estimate | statistic | p.value | parameter | conf.low | conf.high | method                               | alternative |
|:--------:|:---------:|:-------:|:---------:|:--------:|:---------:|:------------------------------------:|:-----------:|
| -0.09    | -0.22     | 0.833   | 6         | -0.747   | 0.657     | Pearson's product-moment correlation | two.sided   |

---

## `method = "spearman"`

Correlation test of ordinal data

```r
## correlation sig test
cor.test(df$x, df$y, method = "spearman") %>%
  broom::tidy() %>%
  print()
#> Warning in cor.test.default(df$x, df$y, method = "spearman"): Cannot
#> compute exact p-value with ties
```

| estimate | statistic | p.value | method                          | alternative |
|:--------:|:---------:|:-------:|:-------------------------------:|:-----------:|
| -0.448   | 121.673   | 0.265   | Spearman's rank correlation rho | two.sided   |

---

## Correlation tool

Visualize correlations of different values (-1 to 1):

+ https://mikewk.shinyapps.io/correlation/

---

## Practice

Guess the correlation coefficient (the game):

+ http://guessthecorrelation.com/

---
class: inverse, center, middle

## Factors and factor analysis

---

## Factor

A **factor** is a psychometric term for a variable.

In statistics, factors describe the **latent** variables that we are attempting to measure.

---

## Items

An **item** refers to a single question/prompt/response-provoking stimulus in a questionnaire.

In the context of a study, a **variable** refers to a construct (or factor) to be examined. Variables can consist of one or more items from a questionnaire.

---

## Multi-item variables

When a variable (or factor) is measured using multiple items, we still want to represent it with one number.

**How can we represent a variable measured with five Likert-type items using one number per observation (respondent)?** One common answer is sketched on the next slide.
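---

## Sketch: scale scores

One common answer (a sketch, not from the original deck; the responses below are hypothetical): reverse-score any negatively worded items, then average the items with `rowMeans()`.

```r
## hypothetical responses to five Likert-type items (1-7 scale)
## (assumes all five items are worded in the same direction)
resp <- data.frame(i1 = c(6, 2, 5), i2 = c(7, 1, 4), i3 = c(5, 2, 4),
                   i4 = c(6, 3, 5), i5 = c(7, 2, 3))

## one number per respondent: the mean across the five items
unname(rowMeans(resp))
#> [1] 6.2 2.0 4.2
```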
---

## Example

Our variable of interest is extraversion/introversion.

We measure it using four Likert-type items:

+ I like talking to lots of people
+ I like attending events where I meet new people
+ I am comfortable at a party where I don't know anyone else
+ Meeting my friends' friends makes me nervous

---

## Example responses

```r
## each person's responses
person1 <- c(1, 2, 6, 7)
person2 <- c(1, 1, 6, 7)
person3 <- c(2, 1, 6, 6)
person4 <- c(2, 1, 7, 7)
person5 <- c(2, 1, 6, 6)

## convert to data frame with four columns
peeps <- c(person1, person2, person3, person4, person5) %>%
  matrix(nrow = 5, byrow = TRUE) %>%
  as.data.frame()

## view data
print(peeps)
#>   V1 V2 V3 V4
#> 1  1  2  6  7
#> 2  1  1  6  7
#> 3  2  1  6  6
#> 4  2  1  7  7
#> 5  2  1  6  6
```

---

## Correlation matrix

View multiple bivariate correlations in a matrix.

```r
## view correlation matrix
cor(peeps, method = "pearson")
#>           V1        V2        V3        V4
#> V1  1.000000 -0.612372  0.408248 -0.666667
#> V2 -0.612372  1.000000 -0.250000  0.408248
#> V3  0.408248 -0.250000  1.000000  0.408248
#> V4 -0.666667  0.408248  0.408248  1.000000
```

---

## Reliability

Cronbach's alpha is typically used to summarize the correlations between measurement items with a single estimate.

```r
suppressWarnings(psych::alpha(peeps, keys = c("V1", "V2"), warnings = FALSE))
#> In factor.stats, I could not find the RMSEA upper bound . Sorry about that
#> 
#> Reliability analysis   
#> Call: psych::alpha(x = peeps, keys = c("V1", "V2"), warnings = FALSE)
#> 
#>   raw_alpha std.alpha G6(smc) average_r   S/N  ase mean   sd median_r
#>       -0.33     -0.24    0.82    -0.051 -0.19 0.99  3.9 0.22    0.079
#> 
#>  lower alpha upper     95% confidence boundaries
#> -2.28 -0.33  1.61 
#> 
#>  Reliability if an item is dropped:
#>    raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r med.r
#> V1      0.45      0.41   0.563      0.19  0.70     0.40  0.14  0.41
#> V2      0.00      0.14   1.000      0.05  0.16     0.84  0.39  0.41
#> V3     -2.50     -2.08  -0.269     -0.29 -0.67     2.57  0.37 -0.61
#> V4     -0.60     -0.65   0.037     -0.15 -0.39     1.17  0.27 -0.25
#> 
#>  Item statistics 
#>    n raw.r std.r  r.cor r.drop mean   sd
#> V1 5  0.10  0.07  0.078  -0.46  1.6 0.55
#> V2 5  0.25  0.30 -0.048  -0.25  1.2 0.45
#> V3 5  0.88  0.85  0.942   0.61  6.2 0.45
#> V4 5  0.61  0.62  0.691   0.00  6.6 0.55
#> 
#> Non missing response frequency for each item
#>      1   2   6   7 miss
#> V1 0.4 0.6 0.0 0.0    0
#> V2 0.8 0.2 0.0 0.0    0
#> V3 0.0 0.0 0.8 0.2    0
#> V4 0.0 0.0 0.4 0.6    0
```

---

## Factor analysis

To make sure we have **reliable** measures, we use factor analysis.

**Factor analysis** essentially finds the correlations between responses for similar items.

There are lots of details and variations, but knowing this much will help you in the future!
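---

## Going further

A minimal sketch (not covered in this deck; output omitted): the `fa()` function in the **{psych}** package fits an exploratory factor analysis, here asking for a single factor underlying the four `peeps` items. With only five respondents it will complain, so treat this purely as syntax to try on real data.

```r
## exploratory factor analysis with a single factor
library(psych)
fa(peeps, nfactors = 1)

## base R alternative: maximum-likelihood factor analysis
## factanal(peeps, factors = 1)
```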