class: center, middle, inverse, title-slide

# Day 4
## Correlations and factor analysis

### Michael W. Kearney 📊

School of Journalism<br>
Informatics Institute<br>
University of Missouri

### @kearneymw
@mkearney
---
class: inverse, center, middle

## Agenda

---

## Agenda

1. Co-variance
   - What is covariance?
   - Calculating covariance
1. Correlations
   - Bivariate correlations
   - Correlation matrices
1. Factor analysis
   - What is a factor?
   - Calculating scores
   - Reliability

---
class: inverse, center, middle

## Test

Try this out:

```r
source("https://raw.githubusercontent.com/mkearney/stat/master/static/build.R")
```

---
class: inverse, center, middle

## Co-variance

---

## Variation (review)

**Variance**: average squared distance from the mean

`$$s^2=\frac{\sum (x - \bar{x})^2}{n - 1}$$`

+ Single variable: `\(x\)`
+ Always positive: `\((x - \bar{x})^2\)`

---

## Co-variation

**Covariance**: average product of *paired* distances from the mean

`$$cov_{(x, y)} = \frac{\sum (x-\bar{x})(y-\bar{y})}{n - 1}$$`

+ Two variables: `\(x\)` and `\(y\)`
+ Can be positive or negative
  - Positive means they vary [above/below the mean] together
  - Negative means they vary [above/below the mean] inversely

---

## Note:

When a variable covaries with itself...

`$$cov_{(x, x)} = \frac{\sum (x-\bar{x})(x-\bar{x})}{n - 1}$$`

`$$cov_{(x, x)} = \frac{\sum (x - \bar{x})^2}{n - 1}$$`

`$$cov_{(x, x)} = s^2$$`

> Covariance with self is the same as variance

---

## Correlation

A **correlation** describes how variation in one variable relates to variation in another variable.

In statistics, correlations describe the **magnitude** and **direction** of a relationship between two variables.

---

## Example code...

```r
## load the tidyverse (for tibble() and ggplot2)
library(tidyverse)

## create data frame with x and y coming from random normal draws
df <- tibble(x = rnorm(20), y = x + rnorm(20))

## build plot of points
p <- ggplot(df, aes(x, y)) +
  geom_vline(xintercept = 0, color = "#666666") +
  geom_hline(yintercept = 0, color = "#666666") +
  geom_point(shape = 21, fill = "#F1B82D", size = 8, alpha = .8) +
  labs(title = "Scatter plot of y by x",
    x = "Observations of x", y = "Observations of y")

## save plot
ggsave("img/scatter_plot.png", p, width = 8, height = 7, units = "in")
```

---

<p style="text-align: center;">
<img src="img/scatter_plot.png" />
</p>

---

## Correlation coefficient

**`r`**: describes the magnitude (size) and direction (order) of a relationship

There are several types of correlation coefficients, but typically they all...

+ have a **range of possible values** from -1.0 to +1.0
+ **describe** the magnitude (size) and direction (order) of an association
  - **Direction** by **sign** (positive or negative)
  - **Magnitude** by **absolute value** (distance from zero)

---

## Note(s) on "coefficients"

> Throughout statistics, a *coefficient* refers to the multiplicative factor (a numeric estimate) for one variable in relation to another variable.

> In **regression**, coefficients are often referred to as "beta coefficients" or "beta weights"

> In **machine learning**, coefficients are often referred to as [simply] "weights"
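---

## Direction and magnitude, in code

A quick simulated sketch (added for illustration; the data and `set.seed()` call are not from the original examples) showing how the sign of `r` captures direction and its absolute value captures magnitude:

```r
## simulate three relationships and compare their correlations
set.seed(42)
x <- rnorm(100)

cor(x, x + rnorm(100, sd = .2))   ## strong positive: r near +1
cor(x, -x + rnorm(100, sd = .2))  ## strong negative: r near -1
cor(x, rnorm(100))                ## unrelated: r near 0
```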
---

## Note(s) on notation

In many statistical modeling frameworks, a correlation is often indicated (by convention) by connecting two variable [names] with a double tilde (`~~`) symbol.

I'll occasionally use this kind of syntax to mean **correlation**

```
x ~~ y
```

---

## Types of correlations

**Pearson product-moment**: linear relationship between two variables

`$$r_{(x,y)} = x\sim\sim y$$`

**Spearman's rho**: relationship between rankings of two variables

`$$\rho_{(x,y)} = x\sim\sim y$$`

**Intraclass**: relationship between paired observations

`$$\varrho_{(x_{t1},x_{t2})} = x_{t1}\sim\sim x_{t2}$$`

---

## Pearson's `r`

**Correlation**: the covariation divided by the total variation

`$$r_{(x, y)} = \frac{cov_{(x, y)}}{s_{x}s_{y}}$$`

+ `\(s_{x}\)` and `\(s_{y}\)` are the sample standard deviations of `\(x\)` and `\(y\)`
  - *Note*: `\(s\)` is the square root of the variance, i.e., `\(\sqrt{s^2}\)`

---

## Note(s) on `x ~~ x`

A single variable `\(x\)` correlates perfectly with itself...

`$$r_{(x, x)} = \frac{cov_{(x, x)}}{s_{x}s_{x}}$$`

+ We know from earlier that `\(cov_{(x,x)}\)` is equal to `\(s_{x}^2\)` (the variance)
+ And `\(s_{x}s_{x}\)` can be rewritten as `\(s_{x}^2\)`, so...

`$$r_{(x, x)} = \frac{s_{x}^2}{s_{x}^2} = 1.0$$`

---

## Numeric and rank-order data

```r
## create data frame with x and y coming from random normal draws
df <- tibble(x = rnorm(20), y = x + rnorm(20))

## convert to ranks (factor level codes equal ranks when values are distinct)
dfo <- map_df(df, factor) %>%
  mutate_all(as.numeric)

## print
print(df, n = 6)
```

| x      | y      |
|:------:|:------:|
| 1.698  | 1.325  |
| 0.570  | -0.131 |
| 0.051  | -0.075 |
| -0.027 | 1.236  |
| 0.679  | 1.089  |
| -0.583 | -0.178 |

---

## `cor(method = "pearson")`

By default, `cor()` returns the Pearson product-moment correlation.

```r
## using cor defaults
cor(df$x, df$y)
#> [1] 0.624567

## same thing as
cor(df$x, df$y, method = "pearson")
#> [1] 0.624567
```

Use `method` to specify the type of correlation and `use` to deal with missing data.

```r
## pearson product-moment correlation
cor(df$x, df$y, method = "pearson", use = "pairwise.complete.obs")
#> [1] 0.624567
```

---

## `cor(method = "spearman")`

Get the Spearman's rho (rank-order) correlation.

```r
## spearman rho correlation of df
cor(df$x, df$y, method = "spearman")
#> [1] 0.578947
```

This should be equal to the Pearson correlation of the rank-order data set.

```r
## spearman rho correlation
cor(dfo$x, dfo$y, method = "pearson")
#> [1] 0.578947
```

---

## Rank-order correlation

Calculate **Spearman's rho** the same way, only convert observations to ranked values first.

**Example**: converting observations to ranked values

+ `c(22, 2, 5, 1)` converts to `c(4, 2, 3, 1)`
+ `c(7.2, 3.0, 5.25, 1.9)` converts to `c(4, 2, 3, 1)`

---

## Basic example

Let's say we have four values of x and y, and we want to estimate Pearson's `r`

```r
x <- c(22, 6, 15, 1)
y <- c(1, 3, 9, 10)
cor(x, y, method = "pearson")
#> [1] -0.548093
```

Now estimate Spearman's rho using the same values

```r
cor(x, y, method = "spearman")
#> [1] -0.8
```

Notice a difference?

---

## Example

We can actually replicate the result from the previous slide using Pearson's method if we first convert x and y to ranks

```r
x_ranks <- c(4, 2, 3, 1)
y_ranks <- c(1, 2, 3, 4)
cor(x_ranks, y_ranks, method = "pearson")
#> [1] -0.8
```
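---

## Shortcut: `rank()`

Rather than typing the ranks by hand, base R's `rank()` does the conversion (a quick sketch using the same `x` and `y` as above; the `-0.8` matches the result we just replicated):

```r
## rank() converts raw values to their rank order
rank(x)  ## 4 2 3 1
rank(y)  ## 1 2 3 4

## Pearson's r on the ranks reproduces Spearman's rho
cor(rank(x), rank(y), method = "pearson")
#> [1] -0.8
```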
---

Create a data frame to visualize the difference...

```r
df <- tibble(
  data = c(rep("raw", 4), rep("rank", 4)),
  x = c(x, x_ranks),
  y = c(y, y_ranks))
```

```r
p <- ggplot(df, aes(x = x, y = y, fill = data)) +
  geom_point(shape = 21, size = 8, alpha = .8) +
  labs(title = "Scatter plot of raw and rank-order values") +
  scale_fill_manual(values = c(raw = "#F1B82D", rank = "#222222"))

ggsave("img/scatter_plot-xy.png", p, width = 8, height = 4.5, units = "in")
```

<p style="text-align: center;">
<img src="img/scatter_plot-xy.png" />
</p>

---

## Hypothesis testing

To conduct a significance test of a correlation using R, use `cor.test()`.

```r
## correlation test
cor.test(df$x, df$y, method = "pearson", alternative = "two.sided")
#> 
#> 	Pearson's product-moment correlation
#> 
#> data:  df$x and df$y
#> t = -0.2204, df = 6, p-value = 0.833
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  -0.747114  0.656509
#> sample estimates:
#>        cor 
#> -0.0896272
```

.footnote[To convert the output to a more usable data frame, use `tidy()` from the **{broom}** package.]

---

## `method = "pearson"`

Correlation test of interval/ratio data

```r
## correlation sig test
cor.test(df$x, df$y, method = "pearson") %>%
  broom::tidy() %>%
  print()
```

| estimate | statistic | p.value | parameter | conf.low | conf.high | method                               | alternative |
|:--------:|:---------:|:-------:|:---------:|:--------:|:---------:|:------------------------------------:|:-----------:|
| -0.09    | -0.22     | 0.833   | 6         | -0.747   | 0.657     | Pearson's product-moment correlation | two.sided   |

---

## `method = "spearman"`

Correlation test of ordinal data

```r
## correlation sig test
cor.test(df$x, df$y, method = "spearman") %>%
  broom::tidy() %>%
  print()
#> Warning in cor.test.default(df$x, df$y, method = "spearman"): Cannot
#> compute exact p-value with ties
```

| estimate | statistic | p.value | method                          | alternative |
|:--------:|:---------:|:-------:|:-------------------------------:|:-----------:|
| -0.448   | 121.673   | 0.265   | Spearman's rank correlation rho | two.sided   |

---

## Correlation tool

Visualize correlations of different values (-1 to 1):

+ https://mikewk.shinyapps.io/correlation/

---

## Practice

Guess the correlation coefficient (the game):

+ http://guessthecorrelation.com/

---
class: inverse, center, middle

## Factors and factor analysis

---

## Factor

A **factor** is a psychometric term for a variable.

In statistics, factors describe the **latent** variables that we are attempting to measure.

---

## Items

An **item** refers to a single question/prompt/response-provoking stimulus in a questionnaire.

In the context of a study, a **variable** refers to a construct (or factor) to be examined. Variables can consist of one or more items from a questionnaire.

---

## Multi-item variables

When a variable (or factor) is measured using multiple items, we still want to represent it with one number.

**How can we represent a variable measured with five Likert-type items using one number per observation (respondent)?** One common answer is sketched on the next slide.
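---

## Sketch: scale scores

One common answer (a sketch, not from the original deck; the responses below are hypothetical): reverse-score any negatively worded items, then average the items with `rowMeans()`.

```r
## hypothetical responses to five Likert-type items (1-7 scale)
## (assumes all five items are worded in the same direction)
resp <- data.frame(i1 = c(6, 2, 5), i2 = c(7, 1, 4), i3 = c(5, 2, 4),
                   i4 = c(6, 3, 5), i5 = c(7, 2, 3))

## one number per respondent: the mean across the five items
unname(rowMeans(resp))
#> [1] 6.2 2.0 4.2
```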
---

## Example

Our variable of interest is extraversion/introversion.

We measure it using four Likert-type items:

+ I like talking to lots of people
+ I like attending events where I meet new people
+ I am comfortable at a party where I don't know anyone else
+ Meeting my friends' friends makes me nervous

---

## Example responses

```r
## each person's responses
person1 <- c(1, 2, 6, 7)
person2 <- c(1, 1, 6, 7)
person3 <- c(2, 1, 6, 6)
person4 <- c(2, 1, 7, 7)
person5 <- c(2, 1, 6, 6)

## convert to data frame with four columns
peeps <- c(person1, person2, person3, person4, person5) %>%
  matrix(nrow = 5, byrow = TRUE) %>%
  as.data.frame()

## view data
print(peeps)
#>   V1 V2 V3 V4
#> 1  1  2  6  7
#> 2  1  1  6  7
#> 3  2  1  6  6
#> 4  2  1  7  7
#> 5  2  1  6  6
```

---

## Correlation matrix

View multiple bivariate correlations in a matrix.

```r
## view correlation matrix
cor(peeps, method = "pearson")
#>           V1        V2        V3        V4
#> V1  1.000000 -0.612372  0.408248 -0.666667
#> V2 -0.612372  1.000000 -0.250000  0.408248
#> V3  0.408248 -0.250000  1.000000  0.408248
#> V4 -0.666667  0.408248  0.408248  1.000000
```

---

## Reliability

Cronbach's alpha is typically used to summarize the correlations between measurement items with a single estimate.

```r
suppressWarnings(psych::alpha(peeps, keys = c("V1", "V2"), warnings = FALSE))
#> In factor.stats, I could not find the RMSEA upper bound . Sorry about that
#> 
#> Reliability analysis   
#> Call: psych::alpha(x = peeps, keys = c("V1", "V2"), warnings = FALSE)
#> 
#>   raw_alpha std.alpha G6(smc) average_r   S/N  ase mean   sd median_r
#>       -0.33     -0.24    0.82    -0.051 -0.19 0.99  3.9 0.22    0.079
#> 
#>  lower alpha upper     95% confidence boundaries
#> -2.28 -0.33  1.61 
#> 
#>  Reliability if an item is dropped:
#>    raw_alpha std.alpha G6(smc) average_r   S/N alpha se var.r med.r
#> V1      0.45      0.41   0.563      0.19  0.70     0.40  0.14  0.41
#> V2      0.00      0.14   1.000      0.05  0.16     0.84  0.39  0.41
#> V3     -2.50     -2.08  -0.269     -0.29 -0.67     2.57  0.37 -0.61
#> V4     -0.60     -0.65   0.037     -0.15 -0.39     1.17  0.27 -0.25
#> 
#>  Item statistics 
#>    n raw.r std.r  r.cor r.drop mean   sd
#> V1 5  0.10  0.07  0.078  -0.46  1.6 0.55
#> V2 5  0.25  0.30 -0.048  -0.25  1.2 0.45
#> V3 5  0.88  0.85  0.942   0.61  6.2 0.45
#> V4 5  0.61  0.62  0.691   0.00  6.6 0.55
#> 
#> Non missing response frequency for each item
#>      1   2   6   7 miss
#> V1 0.4 0.6 0.0 0.0    0
#> V2 0.8 0.2 0.0 0.0    0
#> V3 0.0 0.0 0.8 0.2    0
#> V4 0.0 0.0 0.4 0.6    0
```

---

## Factor analysis

To make sure we have **reliable** measures, we use factor analysis.

**Factor analysis** essentially finds the correlations between responses for similar items.

There are lots of details and variations, but knowing this much will help you in the future!
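---

## Going further

A minimal sketch (not covered in this deck; output omitted): the `fa()` function in the **{psych}** package fits an exploratory factor analysis, here asking for a single factor underlying the four `peeps` items. With only five respondents it will complain, so treat this purely as syntax to try on real data.

```r
## exploratory factor analysis with a single factor
library(psych)
fa(peeps, nfactors = 1)

## base R alternative: maximum-likelihood factor analysis
## factanal(peeps, factors = 1)
```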