class: center, middle, inverse, title-slide

# Day 2
## Descriptive statistics
### Michael W. Kearney📊
School of Journalism
Informatics Institute
University of Missouri

### @kearneymw
@mkearney

---
class: inverse, center, middle

## Agenda

---

## Agenda

+ Review
  - Sampling
  - Variables
  - Object classes in R
+ Descriptives
  - Central tendency
  - Dispersion

---
class: inverse, center, middle

# But first, some admin...

---

## Qualtrics

+ Access your Mizzou Qualtrics account: [missouri.qualtrics.com](https://missouri.qualtrics.com/)
+ You should have a survey, `JOURN_8016_FA18`, shared with you
+ Browse the questions or click *preview* to view the survey
+ The survey will include at least the following...

---

## Health comm experiment

Effect of **source ideology** and **message direction** in vaccine-related articles shared on Twitter

+ Design: 3 (conservative/liberal/moderate) X 3 (pro/anti/neutral)
+ Nine conditions
  - Conservative source + pro/anti/neutral-vaccine
  - Liberal source + pro/anti/neutral-vaccine
  - Moderate source + pro/anti/neutral-vaccine

---

## Outcome variables

+ Perceived source credibility
+ Perceived political bias

---

## Other variables

+ Media diet
+ Demographics
+ What else? (doesn't have to be related to experiment)

---
class: inverse, center, middle

# Sampling

---

## Random\* sample

+ \* Technically, there are multiple kinds of random distributions
  - We typically assume a "uniform" random sample
  - Everyone in a sampling frame (all possible data points) has an equal probability of being selected
  - Best method for making inferences but expensive and difficult to do

---

## Quasi-random sample

- Samples that use mathematical rules but lack access to the full population
- One common technique is **probability matching**, which is when you match the demographics in a sample with those of the desired population

---

## Snowball sample

- When you use one or a few participants to recruit more into the sample
- Not very representative, but useful for getting access to niche or hard-to-reach groups

---

## Convenience sample

- A sample that is not random but selected largely due to convenience (low cost, ease of access, etc.)

---

## Why care about sampling?

- The sampling method and study design have a direct effect on one's ability to make **inferences** from data
- **Inferential statistics** are conclusions drawn from a sample and applied to a population
- **Descriptive statistics** are conclusions drawn about a population

---
class: inverse, center, middle

# Variables

---

## Variable

- A **constant** is a fixed value that never changes
  - e.g., pi, the number 1, etc.
- A **variable** is a value that differs across observations
  - can often be thought of as features or characteristics

---

## Variable values

- **Values** are measurements (observations) on a given variable
  - e.g., Tracy's height (variable) is `6'6''` (value)
  - e.g., Avery's final race position (variable) is `1` (value)
  - e.g., Cory's skill level in chess (variable) is `master` (value)
  - e.g., Rory's hometown (variable) is `Kansas City` (value)
- Different levels of measurement enable different levels of analysis

---

## Levels of measurement

- **Nominal**: values represent different categories [or named things]
- **Ordinal**: values represent a meaningful sequence
- **Interval**: values represent a meaningful sequence using equidistant intervals
- **Ratio**: values represent real numbers

---

## Nominal

Values represent different categories [or named things]

- Can be used to operationalize anything
- This is often done using *dummy codes*

---

## Ordinal

Values represent a meaningful sequence

- The order people finish in a race
- The distance from 1st to 2nd can differ wildly from the distance from 2nd to 3rd

---

## Interval

Values represent a meaningful sequence using equidistant intervals

- Likert-type items, e.g., *I always make my bed in the morning: Strongly Agree... Strongly Disagree* and other survey items that measure a range of feelings/attitudes using numbers
- This is why the visual representation of numbers on a survey is often important

---

## Ratio

Values represent real numbers

- Numbers correspond to some non-arbitrary meaning.
- A true 0 (zero) exists

---

## Temperature

**Do the statements below add up?**

- The temperature today is 100 degrees Fahrenheit
- The temperature yesterday was 50 degrees Fahrenheit
- Today is twice as hot as yesterday

---

## Fahrenheit

What if we convert the Fahrenheit values to Celsius?

- `100 F == 38 C`
- `50 F == 10 C`

---

## Celsius

**Do the statements below add up?**

- The temperature today is 38 degrees Celsius
- The temperature yesterday was 10 degrees Celsius
- Today is twice as hot as yesterday

---

## Zero degrees

**Does your scale have a meaningful zero?**

+ In theory, `0` should mean a complete lack or absence of the variable
+ For temperature, the scale with a true zero is the Kelvin scale

---

## Kelvin scale

What if we convert the values to Kelvin?

- `100 F == 38 C == 311 K`
- `50 F == 10 C == 283 K`

---

## Kelvin

**Do the statements below add up?**

- The temperature today is 311 kelvin
- The temperature yesterday was 283 kelvin
- Today is twice as hot as yesterday

---

## Actual temperature ratios

**Fahrenheit** (2x)

```r
100/50
#> [1] 2
```

**Celsius** (3.8x)

```r
38/10
#> [1] 3.8
```

**Kelvin** (1.1x)

```r
311/283
#> [1] 1.09894
```

---
class: inverse, center, middle

# R packages

---

## Install packages

+ Install the [tidyverse](https://tidyverse.org) set of packages (dplyr, tibble, purrr, ggplot2, readr, tidyr, etc.)

```r
## tidyverse actually consists of several packages
install.packages("tidyverse")
```

---

## Load packages

+ Load packages with `library()`

```r
## note: you don't need to quote the package name
library(tidyverse)
```

+ Or specify the package directly when using a function

```r
## select cyl and mpg columns in built-in mtcars data
dplyr::select(mtcars, cyl, mpg)
#>                     cyl  mpg
#> Mazda RX4             6 21.0
#> Mazda RX4 Wag         6 21.0
#> Datsun 710            4 22.8
#> Hornet 4 Drive        6 21.4
#> Hornet Sportabout     8 18.7
#> Valiant               6 18.1
#> Duster 360            8 14.3
#> Merc 240D             4 24.4
#> Merc 230              4 22.8
#> Merc 280              6 19.2
#> Merc 280C             6 17.8
#> Merc 450SE            8 16.4
#> Merc 450SL            8 17.3
#> Merc 450SLC           8 15.2
#> Cadillac Fleetwood    8 10.4
#> Lincoln Continental   8 10.4
#> Chrysler Imperial     8 14.7
#> Fiat 128              4 32.4
#> Honda Civic           4 30.4
#> Toyota Corolla        4 33.9
#> Toyota Corona         4 21.5
#> Dodge Challenger      8 15.5
#> AMC Javelin           8 15.2
#> Camaro Z28            8 13.3
#> Pontiac Firebird      8 19.2
#> Fiat X1-9             4 27.3
#> Porsche 914-2         4 26.0
#> Lotus Europa          4 30.4
#> Ford Pantera L        8 15.8
#> Ferrari Dino          6 19.7
#> Maserati Bora         8 15.0
#> Volvo 142E            4 21.4
```

---
class: inverse, center, middle

# Object classes in R

---

## Character/Factor

- I will refer to nominal variables as categorical variables
  + Variables with only two categories, we call *dichotomous*
- Categorical variables are represented in R as `character` and `factor`

```r
## character vector containing values a, b, and c
x <- c('a', 'b', 'c')
x
#> [1] "a" "b" "c"
class(x)
#> [1] "character"

## factor vector containing [finite] values a, b, and c
x <- factor(c('a', 'b', 'c'))
x
#> [1] a b c
#> Levels: a b c
class(x)
#> [1] "factor"
```

---

## Character

+ Character vectors can be any textual representations
  - Unlike factors, characters are not limited to a finite set of possibilities

```r
## character vector
x <- c("a", "a", "a", "b", "b", "c")

## table function returns frequency count
table(x)
#> x
#> a b c
#> 3 2 1

## convert character to factor
f <- as.factor(x)
f
#> [1] a a a b b c
#> Levels: a b c
```

---

## Factor

+ Factor vectors are labelled integers representing a finite number of categories

```r
## try to convert character (x) to numeric
x <- c("a", "a", "a", "b", "b", "c")
try(as.numeric(x))
#> Warning in doTryCatch(return(expr), name, parentenv, handler): NAs
#> introduced by coercion
#> [1] NA NA NA NA NA NA

## convert factor (f) to numeric
f <- as.factor(x)
as.numeric(f)
#> [1] 1 1 1 2 2 3
```

---

## Factor

+ Factors can preserve information about levels/categories even if they are not observed

```r
## likert-type observations
x <- c("Agree", "Neither", "Agree", "Agree")
table(x)
#> x
#>   Agree Neither
#>       3       1

## convert to factor, listing every possible level (even unobserved ones)
x <- factor(x, levels = c("Agree", "Neither", "Disagree"))
table(x)
#> x
#>    Agree  Neither Disagree
#>        3        1        0
```

---

## Integer

- In R, we refer to **discrete** numbers as `integer`

```r
## R assumes numbers are continuous (numeric)
x <- c(1, 2, 3)
class(x)
#> [1] "numeric"

## use "L" after whole numbers to indicate integer
x <- c(1L, 2L, 3L)
class(x)
#> [1] "integer"
```

---

## Numeric

- In R, we refer to **continuous** interval and ratio variables as `numeric`
- Values are continuous if they can fall anywhere in a range rather than only on discrete units

```r
x <- c(1.25, -3.5, 4)
class(x)
#> [1] "numeric"
```

---

## Data frames

+ Tabular data (like in Excel) is called a `data.frame`
  - Tibbles (`tbl_df`) are special cases of data frames
+ Data frames contain rows (observations) and columns (variables)
  - Variables can be of different classes

```r
## tibble() is a useful version of base data.frame()
## (it replaces the deprecated data_frame())
df <- tibble::tibble(
  cat = c("a", "b", "c", "a"),
  int = c(1L, 2L, 1L, 3L),
  num = c(-4.3, 3.14, 2, 0.10)
)
df
#> # A tibble: 4 x 3
#>   cat     int   num
#>   <chr> <int> <dbl>
#> 1 a         1 -4.3
#> 2 b         2  3.14
#> 3 c         1  2
#> 4 a         3  0.1
```

---
class: inverse, center, middle

# Descriptive statistics

---

## "Statistics"

+ **Descriptive** statistics (AKA *parameters*)
  - Describing the population of data
  - Not a lot of
    probability theory required
+ **Inferential** statistics (AKA *statistics*)
  - Describing the population using a sample of data
  - Probability theory is key

---

## Descriptive statistics

+ Measures of **central tendency**
  - Describes the **middle** of the data
+ Measures of **dispersion**
  - Describes the **spread** of the data

---

## Central tendency

+ **Mean**: the expected value (often called "average")
+ **Median**: the mid-point of the data
+ **Mode**: the most common data point

---

## Mean

+ Calculate the mean `mean()` by summing `sum()` all values and dividing `/` by the number of values `length()`

```r
## 10 numbers from a normal distribution with mean of 0
x <- rnorm(10, mean = 0)

## calculate mean
sum(x) / length(x)
#> [1] 0.0549654

## use mean function
mean(x)
#> [1] 0.0549654
```

---

## Median

+ Calculate the median `median()` by sorting `sort()` all values and finding the middle observation (or averaging the two values *tied* for the middle).

```r
## sort the numbers by magnitude
x <- sort(x)

## print and find the middle values (5th and 6th)
x
#>  [1] -1.7042248 -1.3132026 -1.2632057 -1.0404326 -0.0684191  0.2202423
#>  [7]  0.7008736  1.4249242  1.6830939  1.9100045

## use median function
median(x)
#> [1] 0.0759116
```

---

## Mode

+ Calculate the mode by creating a frequency table `table()` and sorting `sort()` in descending order (biggest to smallest)

```r
## sample 100 values from the series of numbers 1:10 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- sample(1:10, 100, replace = TRUE)

## create frequency table
y_freq <- table(y)

## sort and print table
sort(y_freq, decreasing = TRUE)
#> y
#>  3  2  9  8  5  1  7  4 10  6
#> 14 13 12 11 10  9  9  8  8  6
```

---

## Important R note

+ In R, missing values are coded as `NA`
+ Inputs containing `NA` will return `NA` for these functions:
  - `sum()`, `mean()`, `median()`, `min()`, `max()`, `range()`, `var()`, `sd()`
+ To avoid this, include the argument `na.rm = TRUE`

<br>

<p class="note">Assuming you are **aware** of missing data, it is usually (though not
always) desirable to omit `NA` values via `na.rm = TRUE`.</p>

---

## `na.rm = TRUE`

+ Example of finding mean when `x` contains a missing value (`NA`)

```r
## add missing value to x
x <- c(NA, x)

## returns NA
mean(x)
#> [1] NA

## returns the mean we want
mean(x, na.rm = TRUE)
#> [1] 0.0549654
```

---

## Dispersion

+ **Range**: the minimum and maximum values. Often expressed as a distance between the two.
+ **Variance**: the average squared distance from the mean
+ **Standard deviation**: the square root of the variance, which puts the typical distance from the mean back in the original units of the data

---

## Range

+ Calculate the range `range()` by finding the minimum `min()` and maximum `max()` values.

```r
## min and max values of y
min(y)
#> [1] 1
max(y)
#> [1] 10

## calculate distance between the two
max(y) - min(y)
#> [1] 9

## use the range function
range(y)
#> [1]  1 10
```

---

## Variance

+ Calculate the variance `var()` by summing `sum()` the squared `^2` distances `-` from the mean `mean()` and dividing `/` by the number of observations minus one `length(y) - 1`

```r
## calculate variance
sum((y - mean(y))^2) / (length(y) - 1)
#> [1] 8.5499

## use the var function
var(y)
#> [1] 8.5499
```

---

## Standard deviation

+ Calculate the standard deviation `sd()` by taking the square root `sqrt()` of the variance `var()`.
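
Written as formulas, the two calculations look like this. This is only a sketch in notation the slides do not use: $y_i$ is each observation, $\bar{y}$ is `mean(y)`, and $n$ is `length(y)`.

$$
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad s = \sqrt{s^2}
$$

Here $s^2$ corresponds to `var(y)` and $s$ to `sd(y)`.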

```r
## calculate variance
yvar <- var(y)

## square root of variance
sqrt(yvar)
#> [1] 2.92402

## use sd function
sd(y)
#> [1] 2.92402
```

<!-- ############################################# -->
<!-- ##           GETTING DATA INTO R           ## -->
<!-- ############################################# -->

---
class: inverse, center, middle

# Getting data into R

---

## CSV

+ **CSV**: comma separated value

```r
## readr is in the tidyverse
d <- readr::read_csv("../data/csv.csv")
#> Parsed with column specification:
#> cols(
#>   id = col_integer(),
#>   name = col_character(),
#>   amount = col_integer()
#> )
d
#> # A tibble: 2 x 3
#>      id name  amount
#>   <int> <chr>  <int>
#> 1     1 Alice    100
#> 2     2 Bob      200
```

---

## \*SV/delimited

```r
## uncomment following line for help documentation
#?read.table
d <- read.delim("../data/tsv.tsv")
d
#>        x1   x2   x3 x4
#> 1 234.000 10.0  1.5  a
#> 2   0.234  3.0 15.0  b
#> 3  -5.200 -5.3  0.0  c
```

---

## dat

+ Wikipedia describes `.dat` as

> not [a] specific file type, often generic extension for "data" files for a variety of applications

+ Often associated with Mplus (statistical software program)
+ See: \*SV methods, i.e., `read.table()`

---

## xlsx (Excel)

```r
## install readxl package if not already
if (!requireNamespace("readxl", quietly = TRUE)) {
  install.packages("readxl")
}

## read an excel file
d <- readxl::read_excel("../data/xlsx.xlsx")
d
#> # A tibble: 4 x 2
#>   name                 value
#>   <chr>                <chr>
#> 1 Name                 Clippy
#> 2 Species              paperclip
#> 3 Approx date of death 39083
#> 4 Weight in grams      0.9
```

---

## .sav (SPSS)

```r
## install haven package if not already
if (!requireNamespace("haven", quietly = TRUE)) {
  install.packages("haven")
}

## read spss (sav) file
d <- haven::read_spss("../data/sav.sav")
d
#> # A tibble: 32 x 2
#>    toxin yield
#>    <dbl> <dbl>
#>  1    11     4
#>  2    19     4
#>  3    13     4
#>  4    13     4
#>  5    12     4
#>  6    17     4
#>  7    13     4
#>  8    18     4
#>  9    18     4
#> 10    17     4
#> # ... with 22 more rows
```

---

## dta (Stata)

```r
## install haven package if not already
if (!requireNamespace("haven", quietly = TRUE)) {
  install.packages("haven")
}

## read stata (dta) file
d <- haven::read_stata("../data/dta.dta")
d
#> # A tibble: 2 x 8
#>   datetime_c          datetime_big_c      date       weekly_date
#>   <dttm>              <dttm>              <date>           <dbl>
#> 1 2006-11-19 23:13:20 2006-11-19 22:57:03 2010-01-20        2601
#> 2 1959-12-31 20:03:20 1959-12-31 23:35:21 1953-10-02        -601
#> # ... with 4 more variables: monthly_date <dbl>, quarterly_date <dbl>,
#> #   half_yearly_date <dbl>, yearly_date <dbl>
```

---

## rds (R)

```r
## my favorite
d <- readRDS("../data/rds.rds")
d
#> # A tibble: 88 x 18
#>    expertise   MR1   MR2   MR3   MR4   MR6   MR7   MR8   MR9   MR5  MR10
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 news pre…  0.99     0     0     0     0     0     0     0     0     0
#>  2 page lay…  0.99     0     0     0     0     0     0     0     0     0
#>  3 history …  0.99     0     0     0     0     0     0     0     0     0
#>  4 databases  0.99     0     0     0     0     0     0     0     0     0
#>  5 graphic …  0.99     0     0     0     0     0     0     0     0     0
#>  6 programm…  0.99     0     0     0     0     0     0     0     0     0
#>  7 practice…  0.99     0     0     0     0     0     0     0     0     0
#>  8 data vis…  0.99     0     0     0     0     0     0     0     0     0
#>  9 data-dri…  0.99     0     0     0     0     0     0     0     0     0
#> 10 data jou…  0.99     0     0     0     0     0     0     0     0     0
#> # ... with 78 more rows, and 7 more variables: MR13 <dbl>, MR11 <dbl>,
#> #   MR14 <dbl>, MR12 <dbl>, MR15 <dbl>, MR16 <dbl>, var <fct>
```

---

## rda/Rdata

```r
## try to avoid this one
load("../data/rds.rda")
```

<!-- ############################################# -->
<!-- ##            DESCRIPTIVES IN R            ## -->
<!-- ############################################# -->

---
class: center, middle, inverse

# Write up

---

## Data

+ After data collection and prior to analysis, research projects should describe, or summarize, the data
+ This should be included in the **methods** section of your paper
+ Descriptives for:
  - Participants
  - Variables

---

## Participants

+ Descriptives in the data summary typically include the following
  - Number of observations (total number `N = 345`)
  - Demographic (age, sex, race, education, etc.)
    breakdowns (numbers and percents, e.g., *15.3% were female* (`n = 23`))
  - For age, usually a range and the mean

---

## Variables

+ Study variable information:
  - min
  - max
  - mean
  - sd
+ Correlation table (don't worry about that yet)

---

## Useful functions/packages

+ Fortunately, there are lots of good functions and packages to choose from
  - `summary()` (base R)
  - `psych::describe()`
  - `skimr::skim()`
  - `dplyr::summarize()`

---
class: center, middle, inverse

# Assignment

---

## Assignment

+ We'll work through this one together

---
class: center, middle, inverse

# Rmarkdown

---

## Rmarkdown

+ The **{rmarkdown}** package provides a simple front-end framework that (a) is well integrated with R and its functions and (b) plays well with other front-end frameworks (html, PDF, even Word docs)
+ **Rstudio**, in particular, makes it easy to access cool rmarkdown features
+ Thus, this presentation provides a brief introduction to **rmarkdown**

---

## What is markdown?

+ **Markdown** is a simple markup language written in plain text. It's how I make all these slides and write papers
+ **Rmarkdown** is a version of markdown that integrates the R environment (code and output) via **code chunks**

---

## How do I use it?

+ In **Rstudio** select the new file dropdown (or `File > New File`) and then select **R Markdown**.

![](img/rmd.png)

---

## Render

+ To render the `.Rmd` file, click the **knit** button on the top pane of the script file.

![](img/rmd_rstudio.png)

---

## Very simple rules:

+ R code exists inside of **code chunks**
  - Every time you click `knit`, the file [and the code] is executed
+ Everything outside of code chunks gets converted to HTML text (similar to Word).
  - There are lots of good examples and tutorials online, just google "Rmarkdown tutorials"
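
---

## Example code chunk

As a rough sketch of what a code chunk for the write-up might contain, the block below computes the min, max, mean, and sd of one variable in the built-in `mtcars` data. The helper `describe_var()` is hypothetical (it is not from these slides or from any package), and defaulting to `na.rm = TRUE` is an assumption.

```r
## hypothetical helper (not from any package): min, max, mean, and sd of one variable
## assumption: missing values are dropped by default (na.rm = TRUE)
describe_var <- function(x, na.rm = TRUE) {
  c(
    min  = min(x, na.rm = na.rm),
    max  = max(x, na.rm = na.rm),
    mean = mean(x, na.rm = na.rm),
    sd   = sd(x, na.rm = na.rm)
  )
}

## describe the mpg column of the built-in mtcars data
mpg_desc <- describe_var(mtcars$mpg)

## round to two decimals for reporting
round(mpg_desc, 2)
```

In an `.Rmd` file this code would sit inside a code chunk, so the summary is recomputed every time you click `knit`.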