Day 2

# Day 2
## Descriptive statistics
### Michael W. Kearney📊 School of Journalism Informatics Institute University of Missouri
### <table style="border-style:none;padding-top:30px;" class=".table">
<tr>
<th style="padding-right:75px!important">
<a href="https://twitter.com/kearneymw"> </a>
</th>
<th style="padding-left:75px!important">
<a href="https://github.com/mkearney"> </a>
</th>
</tr>
<tr style="background-color:#fff">
<th style="padding-right:75px!important">
<a href="https://twitter.com/kearneymw"> @kearneymw </a>
</th>
<th style="padding-left:75px!important">
<a href="https://github.com/mkearney"> @mkearney </a>
</th>
</tr>
</table>

---

## Agenda

---

## Agenda
+ Review
   - Sampling
   - Variables
   - Object classes in R
+ Descriptives
   - Central tendency
   - Dispersion

---
class: inverse, center, middle

# But first, some admin...

---

## Qualtrics
+ Access your Mizzou Qualtrics account: [https://missouri.qualtrics.com/](missouri.qualtrics.com)
  + You should have a survey, `JOURN_8016_FA18`, shared with you
  + Browse the questions or click *preview* to view the survey
+ The survey will include at least the following...

---

## Health comm experiment

Effect of **source ideology** and **message direction** in vaccine-related articles shared on Twitter

+ Design: 3 (conservative/liberal/moderate) X 3 (pro/anti/neutral)
+ Nine conditions
   - Conservative source + pro/anti/neutral-vaccine
   - Liberal source + pro/anti/neutral-vaccine
   - Moderate source + pro/anti/neutral-vaccine

---

## Outcome variables
+ Perceived source credibility
+ Perceived political bias

---

## Other variables
+ Media diet
+ Demographics
+ What else? (doesn't have to be related to experiment)

---
class: inverse, center, middle

# Sampling

---

## Random\* sample
+ \* Technically, there are multiple kinds of random distributions
   - We typically assume a "uniform" random sample
- Everyone in a sampling frame (all possible data points) has an equal probability of being selected
- Best method for making inferences but expensive and difficult to do

---

## Quasi-random sample
- Samples that use mathematical rules but lack access to the full population
- One common technique is **probability matching**, which is when you match the demographics in a sample with the desired population

---

## Snowball sample
- When you use one/some to recruit more into the sample
- Not very representative, but useful for getting access to niche or hard-to-reach groups

---

## Convenience sample
- A sample that is not random but selected largely due to convenience (low cost, ease of access, etc.)

---

## Why care about sampling?
- The sampling method and study design has a direct effect on one's ability to make **inferences** from data
- **Inferential statistics** are conclusions drawn from a sample and applied to a population
- **Descriptive statistics** are conclusions drawn about a population

---
class: inverse, center, middle

# Variables

---

## Variable
- A **constant** is a fixed value that never changes
    - e.g., pi, the number 1, etc.
- A **variable** is a value that differs across observations
    - can often be thought of as features or characteristics

---

## Variable values
- **Values** are measurements (observations) on a given variable
    - e.g., Tracy's height (variable) is `6'6''` (value)
    - e.g., Avery's final race position (variable) is `1` (value)
    - e.g., Cory's skill level in chess (variable) is `master` (value)
    - e.g., Rory's hometown (variable) is `Kansas City` (value)
- Different levels of measurement enable different levels of analysis

---

## Levels of measurement
- **Nominal**: values represent different categories [or named things]
- **Ordinal**: values represent meaningful sequence
- **Interval**: values represent meaningful sequence using equi-distant intervals
- **Ratio**: values represent real numbers

---

## Nominal
Values represent different categories [or named things]

- Can be used to operationalize anything
- This is often done using *dummy codes*

---

## Ordinal
Values represent meaningful sequence

- The order people finish in a race
- The distance from 1st to 2nd can vary wildly with the distance from 2nd to 3rd

---

## Interval
Values represent meaningful sequence using equi-distant intervals

- Likert-type items, e.g., *I always make my bed in the morning: Strongly Agree... Strongly Disagree* and other survey items that measure a range of feelings/attitudes using numbers
- This is why the visual representation of numbers on a survey is often important

---
## Ratio 
Values represent real numbers

- Numbers correspond to some non-arbitrary meaning.
- A true 0 (zero) exists

---

## Temperature
**Do the statements below add up?**

- The temperature today is 100 degrees Fahrenheit
- The temperature yesterday was 50 degrees Fahrenheit
- Today is twice as hot as yesterday

---

## Fahrenheit
What if we convert the Farenheit values to Celsius

- `100 F == 38 C`
- `50 F == 10 C`

---

## Celsius
**Do the statements below add up?**

- The temperature today is 38 degrees Celsius
- The temperature yesterday was 10 degrees Celsius
- Today is twice as hot as yesterday

---

## Zero degrees
**Does your scale have a meaningful zero?**

+ In theory, `0` should mean a complete lack or absence of the variable
+ For temperature, this is called a Kelvin scale

---

## Kelvin scale
What if we convert the values to Kelvin

- `100 F == 38 C == 311 K`
- `050 F == 10 C == 283 K`

---

## Kelvin
**Do the statements below add up?**

- The temperature today is 311 degrees Kelvin
- The temperature yesterday was 283 degrees Kelvin
- Today is twice as hot as yesterday

---

## Actual temperature ratios
**Fahrenheit** (2x)

```r
100/50
#> [1] 2
```

**Celcius** (3.8x)

```r
38/10
#> [1] 3.8
```

**Kelvin** (1.1x)

```r
311/283
#> [1] 1.09894
```

---
class: inverse, center, middle

# R packages

---
## Install packages
+ Install the [tidyverse](https://tidyverse.org) set of packages (dplyr, tibble, purrr, ggplot2, readr, tidyr, etc.)

```r
## tidyverse actaully consists of several packages
install.packages("tidyverse")
```

---
## Load packages
+ Load packages with `library()`

```r
## note: you don't need to quote the package name
library(tidyverse)
```

+ Or specify the package directly when using a function

```r
## select cyl and mpg columns in built-in mtcars data
dplyr::select(mtcars, cyl, mpg)
#>                     cyl  mpg
#> Mazda RX4             6 21.0
#> Mazda RX4 Wag         6 21.0
#> Datsun 710            4 22.8
#> Hornet 4 Drive        6 21.4
#> Hornet Sportabout     8 18.7
#> Valiant               6 18.1
#> Duster 360            8 14.3
#> Merc 240D             4 24.4
#> Merc 230              4 22.8
#> Merc 280              6 19.2
#> Merc 280C             6 17.8
#> Merc 450SE            8 16.4
#> Merc 450SL            8 17.3
#> Merc 450SLC           8 15.2
#> Cadillac Fleetwood    8 10.4
#> Lincoln Continental   8 10.4
#> Chrysler Imperial     8 14.7
#> Fiat 128              4 32.4
#> Honda Civic           4 30.4
#> Toyota Corolla        4 33.9
#> Toyota Corona         4 21.5
#> Dodge Challenger      8 15.5
#> AMC Javelin           8 15.2
#> Camaro Z28            8 13.3
#> Pontiac Firebird      8 19.2
#> Fiat X1-9             4 27.3
#> Porsche 914-2         4 26.0
#> Lotus Europa          4 30.4
#> Ford Pantera L        8 15.8
#> Ferrari Dino          6 19.7
#> Maserati Bora         8 15.0
#> Volvo 142E            4 21.4
```

---
class: inverse, center, middle

# Object classes in R

---

## Character/Factor
- I will refer to nominal variables as categorical variables
   + Variables with only two categories, we call *dichotomous*
- Categorical variables are represented in R as `character` and `factor`

```r
## character vector containing values a, b, and c
x <- c('a', 'b', 'c')
x
#> [1] "a" "b" "c"
class(x)
#> [1] "character"

## factor vector containing [finite] values a, b, and c
x <- factor(c('a', 'b', 'c'))
x
#> [1] a b c
#> Levels: a b c
class(x)
#> [1] "factor"
```

---
## Character
+ Character vectors can be any textual representations
   - Unlike factors, characters are not limited to a finite set of possibilities

```r
## character vector
x <- c("a", "a", "a", "b", "b", "c")

## table function returns frequency count
table(x)
#> x
#> a b c 
#> 3 2 1

## convert character to factor
f <- as.factor(x)
f
#> [1] a a a b b c
#> Levels: a b c
```

---
## Factor
+ Factor vectors are labelled integers representing a finite number of categories

```r
## try to convert character (x) to numeric
x <- c("a", "a", "a", "b", "b", "c")
try(as.numeric(x))
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs
#> introduced by coercion
#> [1] NA NA NA NA NA NA

## convert factor (f) to numeric
f <- as.factor(x)
as.numeric(f)
#> [1] 1 1 1 2 2 3
```

---
## Factor
+ Factors can preserve information about levels/categories even if they are not observed

```r
## likert-type observations
x <- c("Agree", "Neither", "Agree", "Agree")
table(x)
#> x
#> Agree Neither 
#> 3 1

## convert factor (f) to numeric
x <- factor(x, levels = c("Agree", "Neither", "Disagree"))
table(x)
#> x
#> Agree Neither Disagree 
#> 3 1 0
```

---

## Integer
- In R, we refer to **discrete** numbers as `integer`

```r
## R assumes numbers are continuous (numeric)
x <- c(1, 2, 3)
class(x)
#> [1] "numeric"

## use "L" after whole numbers to indicate integer
x <- c(1L, 2L, 3L)
class(x)
#> [1] "integer"
```

---

## Numeric
- In R, we refer to **continuous** interval and ratio variables as `numeric`
- Values are continuous if they don't only exist as discrete units

```r
x <- c(1.25, -3.5, 4)
class(x)
#> [1] "numeric"
```

---

## Data frames
+ Tabular data (like in Excel) is called a `data.frame`
   - Tibbles (`tbl_df`) are special cases of data frames
+ Data frames contain rows (observations) and columns (variables)
   - Variables can be of different classes

```r
## dplyr data_frame is a useful version of base data.frame()
df <- dplyr::data_frame(
 cat = c("a", "b", "c", "a"),
 int = c(1L, 2L, 1L, 3L),
 num = c(-4.3, 3.14, 2, 0.10)
)
df
#> # A tibble: 4 x 3
#> cat int num
#> <chr> <int> <dbl>
#> 1 a 1 -4.3 
#> 2 b 2 3.14
#> 3 c 1 2 
#> 4 a 3 0.1
```

---
class: inverse, center, middle

# Descriptive statistics

---

## "Statistics"
+ **Descriptive** statistics (AKA *parametrics*)
    - Describing the population of data
    - Not a lot of probability theory required
+ **Inferential** statistics (AKA *statistics*)
    - Describing the population using a sample of data
    - Probability theory is key

---

## Descriptive statistics
+ Measures of **central tendency**
    - Describes the **middle** of the data
+ Measures of dispersion
    - Describes the **spread** of the data

---

## Central tendency
+ **Mean**: the expected value (often called "average")
+ **Median**: the mid-point of the data
+ **Mode**: the most common data point

---

## Mean
+ Calculate the mean `mean()` by summing `sum()` all values and dividing `/` by the number of values `length()`

```r
## 10 numbers from random distribution with mean of 0
x <- rnorm(10, mean = 0)

## calculate mean
sum(x) / length(x)
#> [1] 0.0549654

## use mean function
mean(x)
#> [1] 0.0549654
```

---

## Median
+ Calculate the median `median()` by arranging `sort()` all values and find the middle (or values *tied* for the middle) observation.

```r
## sort the numbers by magnitude
x <- sort(x)

## print and find the middle values (5th and 6th)
x
#>  [1] -1.7042248 -1.3132026 -1.2632057 -1.0404326 -0.0684191  0.2202423
#>  [7]  0.7008736  1.4249242  1.6830939  1.9100045

## use median function
median(x)
#> [1] 0.0759116
```

---

## Mode
+ Calculate the mode by creating a frequency table `table()` and sorting `sort()` in descending order (biggest to smallest)

```r
## sample 100 values from the series of numbers 1:10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- sample(1:10, 100, replace = TRUE)

## create frequency table
y_freq <- table(y)

## sort and print table
sort(y_freq, decreasing = TRUE)
#> y
#>  3  2  9  8  5  1  7  4 10  6 
#> 14 13 12 11 10  9  9  8  8  6
```

---

## Important R note
+ In R missing values are coded as `NA`
+ Inputs containing `NA` will return `NA` for these functions:
    - `sum()`, `mean()`, `median()`, `min()`, `max()`, `range()`, `var()`, `sd()`
+ To avoid this, include the argument `na.rm = TRUE`

Assuming you are **aware** of missing data, it is usually (though not always) desirable to omit `NA` values via `na.rm = TRUE`.

---
## `na.rm = TRUE`
+ Example of finding mean when `x` contains a missing value (`NA`)

```r
## add missing value ot x
x <- c(NA, x)

## returns NA
mean(x)
#> [1] NA

## returns the mean we want
mean(x, na.rm = TRUE)
#> [1] 0.0549654
```

---

## Dispersion
+ **Range**: the minimum and maximum values. Often expressed as a distance between the two.
+ **Variance**: distance from the mean
+ **Standard deviation**: distance from the mean expressed in standardized units

---

## Range
+ Calculate the range `range()` by finding the maximum `max()` and minimum `min()` values.

```r
## min and max values of y
min(y)
#> [1] 1
max(y)
#> [1] 10

## calculate distance between the two
max(y) - min(y)
#> [1] 9

## use the range function
range(y)
#> [1]  1 10
```

---

## Variance
+ Calculate the variance `var()` by summing `sum()` the squared `^2` distance `-` from the mean `mean()` and dividing `/` by the number of observations `length(x) - 1`

```r
## calculate variance
sum((y - mean(y))^2) / (length(y) - 1)
#> [1] 8.5499

## use the var function
var(y)
#> [1] 8.5499
```

---

## Standard deviation
+ Calculate the standard deviation `sd()` by taking the square root `sqrt()` of the variance `var()`.

```r
## calculate variance
yvar <- var(y)

## square root of variance
sqrt(yvar)
#> [1] 2.92402

## use sd function
sd(y)
#> [1] 2.92402
```

---
class: inverse, center, middle

# Getting data into R

---

## CSV
+ **CSV**: comma separated value

```r
## readr is in the tidyverse
d <- readr::read_csv("../data/csv.csv")
#> Parsed with column specification:
#> cols(
#> id = col_integer(),
#> name = col_character(),
#> amount = col_integer()
#> )
d
#> # A tibble: 2 x 3
#> id name amount
#> <int> <chr> <int>
#> 1 1 Alice 100
#> 2 2 Bob 200
```

---

## \*SV/delimited

```r
## uncomment following line for help documentation
#?read.table
d <- read.delim("../data/tsv.tsv")
d
#> x1 x2 x3 x4
#> 1 234.000 10.0 1.5 a
#> 2 0.234 3.0 15.0 b
#> 3 -5.200 -5.3 0.0 c
```

---

## dat
+ Wikipedia describes `.dat` as

> not [a] specific file type, often generic extension for "data" files for a variety of applications

+ Often associated with Mplus (statistical software program)
+ See: \*SV methods, i.e., `read.table()`

---

## xlsx (Excel)

```r
## install readxl package if not already
if (!requireNamespace("readxl", quietly = TRUE)) {
 install.packages("readxl")
}
## read an excel file
d <- readxl::read_excel("../data/xlsx.xlsx")
d
#> # A tibble: 4 x 2
#> name value 
#> <chr> <chr> 
#> 1 Name Clippy 
#> 2 Species paperclip
#> 3 Approx date of death 39083 
#> 4 Weight in grams 0.9
```

---

## .sav (SPSS)

```r
## install readxl package if not already
if (!requireNamespace("haven", quietly = TRUE)) {
 install.packages("haven")
}
## read spss (sav) file
d <- haven::read_spss("../data/sav.sav")
d
#> # A tibble: 32 x 2
#> toxin yield
#> <dbl> <dbl>
#> 1 11 4
#> 2 19 4
#> 3 13 4
#> 4 13 4
#> 5 12 4
#> 6 17 4
#> 7 13 4
#> 8 18 4
#> 9 18 4
#> 10 17 4
#> # ... with 22 more rows
```

---

## dta (Stata)

```r
## install readxl package if not already
if (!requireNamespace("haven", quietly = TRUE)) {
 install.packages("haven")
}
## read stata (dta) file
d <- haven::read_stata("../data/dta.dta")
d
#> # A tibble: 2 x 8
#> datetime_c datetime_big_c date weekly_date
#> <dttm> <dttm> <date> <dbl>
#> 1 2006-11-19 23:13:20 2006-11-19 22:57:03 2010-01-20 2601
#> 2 1959-12-31 20:03:20 1959-12-31 23:35:21 1953-10-02 -601
#> # ... with 4 more variables: monthly_date <dbl>, quarterly_date <dbl>,
#> # half_yearly_date <dbl>, yearly_date <dbl>
```

---

## rds (R)

```r
## my favorite
d <- readRDS("../data/rds.rds")
d
#> # A tibble: 88 x 18
#> expertise MR1 MR2 MR3 MR4 MR6 MR7 MR8 MR9 MR5 MR10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 news pre… 0.99 0 0 0 0 0 0 0 0 0
#> 2 page lay… 0.99 0 0 0 0 0 0 0 0 0
#> 3 history … 0.99 0 0 0 0 0 0 0 0 0
#> 4 databases 0.99 0 0 0 0 0 0 0 0 0
#> 5 graphic … 0.99 0 0 0 0 0 0 0 0 0
#> 6 programm… 0.99 0 0 0 0 0 0 0 0 0
#> 7 practice… 0.99 0 0 0 0 0 0 0 0 0
#> 8 data vis… 0.99 0 0 0 0 0 0 0 0 0
#> 9 data-dri… 0.99 0 0 0 0 0 0 0 0 0
#> 10 data jou… 0.99 0 0 0 0 0 0 0 0 0
#> # ... with 78 more rows, and 7 more variables: MR13 <dbl>, MR11 <dbl>,
#> # MR14 <dbl>, MR12 <dbl>, MR15 <dbl>, MR16 <dbl>, var <fct>
```

---

## rda/Rdata

```r
## try to avoid this one
load("../data/rds.rda")
```

---
class: center, middle, inverse

# Write up

---

## Data
+ After data collection and prior to analysis, research projects should describe, or summarize, the data
+ This should be included in the **methods** section of your paper
+ Descriptives for:
    - Participants
    - Variables

---

## Participants
+ Descriptives in the data summary typically include the following
    - Number of observations (total number `N = 345`)
    - Demographic (age, sex, race, education, etc.) breakdowns (numbers and percents *15.3% were female* (`n = 23`))
        - For age, usually a range and the mean

---

## Variables
+ Study variable information: 
    - min
    - max
    - mean
    - sd
+ Correlation table (don't worry about that yet)

---

## Useful functions/packages
+ Fortunately, there are lots of good functions and packages to choose from
    - `summary()` (base R)
    - `psych::describe()`
    - `skimr::skim()`
    - `summarize()`

---

# Assignment

---
## Assignment
+ We'll work through this one together

---
class: center, middle, inverse

# Rmarkdown

---

## Rmarkdown
+ The **{rmarkdown}** package provides a simple front-end framework that (a) is well integrated into R uses and functions and (b) plays well with others front-end frameworks (html, PDF, even Word docs)
+ **Rstudio**, in particular, makes it easy to access cool rmarkdown features
+ Thus, this presentation provides a brief introduction to **rmarkdown**

---

## What is markdown?
+ **Markdown** is a simple markup language written in plain text. It's how I make all these slides and write papers
+ **Rmarkdown** is a version of markdown that integrates the R environment (code and output) via **code chunks**

---

## How do I use it?
+ In **Rstudio** select the new file dropdown (or `File > New File`) and then select **R Markdown**.

![](img/rmd.png)

---

## Render
+ To render the `.Rmd` file, click the **knit** button on the top pane of the script file.

![](img/rmd_rstudio.png)

---

## Very simple rules:
+ R code exists inside of **code chunks**
    - Every time you click `knit`, the file [and the code] is executed
+ Everything outside of code chunks gets converted to HTML text (similar to Word).
    - There are lots of good examples and tutorials online, just google "Rmarkdown tutorials"