
class: center, middle, inverse, title-slide

# Day 12
## Text mining
### Michael W. Kearney📊<br/>School of Journalism<br/>Informatics Institute<br/>University of Missouri
### <table style="border-style:none;padding-top:30px;" class=".table">
<tr style="background-color:#fff">
<th style="padding-right:75px!important">
<a href="https://twitter.com/kearneymw"> @kearneymw </a>
</th>
<th style="padding-left:75px!important">
<a href="https://github.com/mkearney"> @mkearney </a>
</th>
</tr>
</table>

---

class: inverse, center, middle

# Text mining

---

## Agenda
+ Natural Language Processing (NLP)
+ Sentiment analysis
+ Topic modeling
+ Packages/resources

---
class:inverse,middle,center

# Natural Language Processing (NLP)

---

## Natural Language Processing

+ Area of computer science concerned with processing and analyzing natural
[human] language
   - How to deal with large amounts of natural language data
   - Typically focuses on frequency-based patterns
   - Different from **natural language understanding**
+ Key NLP concepts
   - Regular expressions
   - String manipulation
   - Tokenizing

---

## Regular expressions

+ Regular expressions are used to describe a template or textual pattern
+ Pattern matching allows for easier text manipulation
   - Removing punctuation, numbers, etc.
   - Identifying phrases, links, phone numbers, etc.
   - Stemming or reformatting words
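
---

The uses listed on the previous slide can be sketched with base R pattern functions. The example strings, the phone-number format, and the suffix rule are illustrative assumptions, not a general-purpose solution.

```r
x <- c(
  "Call 555-123-4567 NOW!!!",
  "Slides at https://example.com/day12, see you in 2018."
)

## removing punctuation and numbers
no_punct  <- gsub("[[:punct:]]", "", x)
no_digits <- gsub("[[:digit:]]", "", x)

## identifying links and (US-style, assumed format) phone numbers
has_link  <- grepl("https?://[^[:space:]]+", x)
has_phone <- grepl("[0-9]{3}-[0-9]{3}-[0-9]{4}", x)

## extracting the link itself (stop at whitespace or a comma)
links <- regmatches(x, regexpr("https?://[^[:space:],]+", x))
links
#> [1] "https://example.com/day12"

## very crude "stemming": strip a trailing "ing" or "s"
stems <- sub("(ing|s)$", "", c("mining", "topics", "text"))
## gives "min", "topic", "text"
```

The suffix rule turns "mining" into "min", which shows why real stemmers use more careful rules than a single pattern.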

---

## String manipulation

+ Character (textual) observations are referred to as **strings**
+ String manipulation can be achieved via a number of different tools
   - In R, try the **{stringr}** package (part of the tidyverse); the base
   functions `grepl()`, `grep()`, `gregexpr()`, etc. work well too
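
---

A minimal sketch of the base functions named on the previous slide, using made-up strings. {stringr} offers counterparts such as `str_detect()`, `str_to_lower()`, and `str_extract_all()`; only base R is used here so the code runs without extra packages.

```r
x <- c("Text mining", "topic MODELING", "sentiment analysis")

## change case and count characters
lower <- tolower(x)
nchar(x)
#> [1] 11 14 18

## grepl() returns TRUE/FALSE per string; grep() returns matching positions
grepl("ing", x, ignore.case = TRUE)
#> [1]  TRUE  TRUE FALSE
grep("ing", x, ignore.case = TRUE)
#> [1] 1 2

## gregexpr() locates every match; regmatches() pulls the matches out
words <- regmatches(x, gregexpr("[[:alpha:]]+", x))
words[[2]]
#> [1] "topic"    "MODELING"
```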

---

## Tokenizers

+ Tokenizing text refers to the process of systematically splitting textual data
into desired units
   - Sentences
   - Paragraphs
   - Words
+ In R, try the **{tokenizers}** package
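
---

To show what "splitting into units" means, here is a rough base R sketch that tokenizes one made-up string into sentences and then into words. The splitting rules are simplifying assumptions. The {tokenizers} package handles many more edge cases through functions such as `tokenize_sentences()` and `tokenize_words()`.

```r
txt <- "Text mining is fun. Tokenizing comes first! Then we count words."

## sentences: split on whitespace that follows ., !, or ?
sentences <- strsplit(txt, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
length(sentences)
#> [1] 3

## words: lowercase, drop punctuation, split on whitespace
clean <- gsub("[[:punct:]]", "", tolower(txt))
words <- strsplit(clean, "[[:space:]]+")[[1]]
length(words)
#> [1] 11
```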

---

```r
x <- c(
 "This is SEN'ENCE! right here in 2018", 
 "lol u what IM SAYING toDAY everywhere"
)
tokenizers::tokenize_words(x)
#> [[1]]
#> [1] "this" "is" "sen'ence" "right" "here" "in" 
#> [7] "2018" 
#> 
#> [[2]]
#> [1] "lol" "u" "what" "im" "saying" 
#> [6] "today" "everywhere"
```

---
class:inverse,middle,center

# Sentiment analysis

---

## Sentiment analysis

+ Estimate various tonal/affect dimensions associated with words/tokens
+ There are several dictionaries to choose from
+ In R, it's super easy with a vector of text and the **{syuzhet}** package

```r
txt <- c(
 "super awesome positive great best amazing excellent",
 "neutral plain about for on from near is to be are",
 "lowsy terrible horrible awful worst dreadful painful"
)
syuzhet::get_sentiment(txt)
#> [1] 4.6 0.0 -4.0
```
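
---

Dictionary-based scoring boils down to looking up each token in a lexicon and adding up the values. The sketch below uses a tiny hypothetical lexicon and a hypothetical helper, `score_sentiment()`. Neither is part of {syuzhet}, whose dictionaries are far larger and use graded scores.

```r
## hypothetical mini-lexicon: word -> score (not a real dictionary)
lexicon <- c(
  great = 1, awesome = 1, best = 1,
  terrible = -1, awful = -1, worst = -1
)

## hypothetical helper: tokenize each text, look up tokens, sum the scores
score_sentiment <- function(text, lexicon) {
  vapply(text, function(doc) {
    tokens <- strsplit(tolower(doc), "[^[:alpha:]']+")[[1]]
    ## tokens missing from the lexicon come back as NA and are dropped
    sum(lexicon[tokens], na.rm = TRUE)
  }, numeric(1), USE.NAMES = FALSE)
}

scores <- score_sentiment(
  c("awesome great best", "plain about for", "terrible awful worst"),
  lexicon
)
scores
#> [1]  3  0 -3
```

Words that do not appear in the dictionary contribute nothing, so the choice of dictionary directly shapes the scores.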

---
class:inverse,middle,center

# Topic modeling

---

## Topic modeling

+ Identify themes, or topics, from clusters of tokens (words, phrases, etc.)
+ Similar to factor analysis
   - Specify a number of topics
   - Look for that many word/token clusters
   - Get topic loading estimates for each word/token
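
---

Topic models typically start from a document-term matrix of token counts. The sketch below builds one in base R from three made-up documents. It covers only this input step. Fitting the model with a chosen number of topics would be done by a dedicated package (for example `topicmodels::LDA()`, which is an assumption here and not covered in these slides).

```r
## made-up documents, already cleaned and lowercased
docs <- c(
  d1 = "vote election senate vote",
  d2 = "vaccine health doctor health",
  d3 = "election health vote"
)

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

## one row per document, one column per term, cells are counts
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab

dim(dtm)
#> [1] 3 6
dtm["d1", "vote"]
#> [1] 2
```

Words that tend to co-occur across rows of this matrix (here "vote" and "election", or "health" and "doctor") are what a topic model groups into topics.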

---
class:inverse,middle,center

# Text mining resources

---

## Packages

+ **{tidytext}**
   - **Website**: [github.com/juliasilge/tidytext](https://github.com/juliasilge/tidytext)
   - **Book**: [tidytextmining.com](https://www.tidytextmining.com/)
+ **{quanteda}**
   - **Website**: [quanteda.io](https://quanteda.io/)
   - **Tutorials**: [tutorials.quanteda.io/](https://tutorials.quanteda.io/)

---
class:inverse,middle,center

# Exam #2

---

## Exam #2 data

```r
## download CSV version of data to a temporary file
tmp <- tempfile(fileext = ".csv")
download.file(
  "https://github.com/mkearney/stat/blob/master/static/exams/exam2-data.csv?raw=true",
  tmp
)

## read in the twitter-AP news headline study data
twap <- readr::read_csv(tmp)
#> Parsed with column specification:
#> cols(
#> .default = col_integer(),
#> party = col_character(),
#> mancheck = col_logical(),
#> gender = col_character(),
#> race = col_character(),
#> headline_format = col_character(),
#> headline_topic = col_character()
#> )
#> See spec(...) for full column specifications.
```

---

## Data set

|    party    | dem_therm | gop_therm | sci_therm | pol_therm | mov_therm | twuse | mancheck | newsworthy | credibility_1 | credibility_2 | credibility_3 | credibility_4 | credibility_5 | credibility_6 | credibility_7 | gender | edu |           race            | age | headline_format | headline_topic |
|:-----------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:----------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:------:|:---:|:-------------------------:|:---:|:---------------:|:--------------:|
|  Democrat   |    90     |    19     |    99     |    89     |    75     |   3   |   TRUE   |     1      |       4       |       4       |       4       |       5       |       3       |       2       |       3       | Woman  |  8  |    White or Caucasian     | 27  |   apnews.com    |    Politics    |
| Republican  |    26     |    80     |    64     |    80     |    74     |   2   |  FALSE   |     1      |       2       |       4       |       4       |       2       |       4       |       2       |       4       | Woman  |  4  |    White or Caucasian     | 23  |   twitter.com   |     Health     |
| Republican  |    21     |    83     |    90     |    16     |    53     |   0   |   TRUE   |     3      |       4       |       3       |       4       |       4       |       4       |       4       |       4       | Woman  |  6  |    White or Caucasian     | 28  |   apnews.com    |    Politics    |
| Independent |    44     |    29     |    92     |    27     |    87     |   1   |   TRUE   |     3      |       4       |       5       |       5       |       5       |       5       |       5       |       4       |  Man   |  4  |    White or Caucasian     | 30  |   apnews.com    |     Health     |
| Independent |    16     |    80     |    86     |    30     |     5     |   4   |   TRUE   |     0      |       5       |       3       |       4       |       4       |       5       |       5       |       5       |  Man   |  8  |    White or Caucasian     | 32  |   twitter.com   |     Health     |
|  Democrat   |    67     |    41     |    39     |    71     |    63     |   1   |   TRUE   |     3      |       5       |       4       |       3       |       2       |       2       |       2       |       3       | Woman  |  4  | Black or African American | 30  |   apnews.com    |    Politics    |
| Republican  |    10     |    75     |    95     |    40     |    90     |   1   |   TRUE   |     1      |       4       |       3       |       4       |       4       |       3       |       3       |       4       |  Man   |  4  |    White or Caucasian     | 43  |   twitter.com   |    Politics    |
|  Democrat   |    100    |     0     |    48     |    18     |    75     |   0   |   TRUE   |     2      |       4       |       4       |       5       |       4       |       3       |       4       |       3       | Woman  |  1  |    White or Caucasian     | 31  |   twitter.com   |    Politics    |
| Independent |    65     |    69     |    48     |    54     |    73     |   0   |   TRUE   |     2      |       4       |       4       |       4       |       5       |       3       |       4       |       4       | Woman  |  2  |    Hispanic or Latino     | 27  |   apnews.com    |    Politics    |
| Republican  |     8     |    94     |    58     |    13     |    83     |   0   |   TRUE   |     3      |       3       |       3       |       4       |       3       |       3       |       3       |       3       |  Man   |  7  | Asian / Pacific Islander  | 57  |   apnews.com    |     Health     |