class: center, middle, inverse, title-slide

# Day 12
## Text mining
### Michael W. Kearney📊
School of Journalism
Informatics Institute
University of Missouri
### @kearneymw
@mkearney
---
class: inverse, center, middle

# Text mining

---

## Agenda

+ Natural Language Processing (NLP)
+ Sentiment analysis
+ Topic modeling
+ Packages/resources

---
class:inverse,middle,center

# Natural Language Processing (NLP)

---

## Natural Language Processing

+ Area of computer science concerned with processing and analyzing natural [human] language
  - How to deal with large amounts of natural language data
  - Typically focuses on frequency-based patterns
  - Different from **natural language understanding**
+ Key NLP concepts
  - Regular expressions
  - String manipulation
  - Tokenizing

---

## Regular expressions

+ Regular expressions are used to describe a template or textual pattern
+ Pattern matching allows for easier text manipulation
  - Removing punctuation, numbers, etc.
  - Identifying phrases, links, phone numbers, etc.
  - Stemming or reformatting words

---

## String manipulation

+ Character (textual) observations are referred to as **strings**
+ String manipulation can be achieved via a number of different tools
  - In R, try the **{stringr}** package (tidyverse approved), though the base functions `grepl()`, `grep()`, `gregexpr()`, etc. are great as well

---

## Tokenizers

+ Tokenizing text refers to the process of systematically splitting textual data into desired units
  - Sentences
  - Paragraphs
  - Words
  - In R, try the **{tokenizers}** package

---

```r
x <- c(
  "This is SEN'ENCE! 
 right here in 2018",
  "lol u what IM SAYING toDAY everywhere"
)
tokenizers::tokenize_words(x)
#> [[1]]
#> [1] "this"     "is"       "sen'ence" "right"    "here"     "in"
#> [7] "2018"
#>
#> [[2]]
#> [1] "lol"        "u"          "what"       "im"         "saying"
#> [6] "today"      "everywhere"
```

---
class:inverse,middle,center

# Sentiment analysis

---

## Sentiment analysis

+ Estimate various tonal/affect dimensions associated with words/tokens
+ There are several dictionaries to choose from
+ In R, it's super easy with a vector of text and the **{syuzhet}** package

```r
txt <- c(
  "super awesome positive great best amazing excellent",
  "neutral plain about for on from near is to be are",
  "lowsy terrible horrible awful worst dreadful painful"
)
syuzhet::get_sentiment(txt)
#> [1]  4.6  0.0 -4.0
```

---
class:inverse,middle,center

# Topic modeling

---

## Topic modeling

+ Identify themes, or topics, by clusters of tokens (words, phrases, etc.)
+ Similar to factor analysis
  - Specify a number of topics
  - Look for that many word/token clusters
  - Get topic loading estimates for each word/token

---
class:inverse,middle,center

# Text mining resources

---

## Packages

+ **{{tidytext}}**
  - **Website**: [github.com/juliasilge/tidytext](https://github.com/juliasilge/tidytext)
  - **Book**: [tidytextmining.com](https://www.tidytextmining.com/)
+ **{{quanteda}}**
  - **Website**: [quanteda.io](https://quanteda.io/)
  - **Tutorials**: [tutorials.quanteda.io/](https://tutorials.quanteda.io/)

---
class:inverse,middle,center

# Exam #2

---

## Exam #2 data

```r
## download CSV version of data to a temporary .csv file
tmp <- tempfile(fileext = ".csv")
download.file(
  "https://github.com/mkearney/stat/blob/master/static/exams/exam2-data.csv?raw=true",
  tmp
)

## read in the twitter-AP news headline study data
twap <- readr::read_csv(tmp)
#> Parsed with column specification:
#> cols(
#>   .default = col_integer(),
#>   party = col_character(),
#>   mancheck = col_logical(),
#>   gender = col_character(),
#>   race = col_character(),
#>   headline_format = col_character(),
#>   headline_topic = col_character()
#> )
#> See spec(...) for full column specifications.
```

---

## Data set

| party | dem_therm | gop_therm | sci_therm | pol_therm | mov_therm | twuse | mancheck | newsworthy | credibility_1 | credibility_2 | credibility_3 | credibility_4 | credibility_5 | credibility_6 | credibility_7 | gender | edu | race | age | headline_format | headline_topic |
|:-----------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-----:|:--------:|:----------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:------:|:---:|:-------------------------:|:---:|:---------------:|:--------------:|
| Democrat | 90 | 19 | 99 | 89 | 75 | 3 | TRUE | 1 | 4 | 4 | 4 | 5 | 3 | 2 | 3 | Woman | 8 | White or Caucasian | 27 | apnews.com | Politics |
| Republican | 26 | 80 | 64 | 80 | 74 | 2 | FALSE | 1 | 2 | 4 | 4 | 2 | 4 | 2 | 4 | Woman | 4 | White or Caucasian | 23 | twitter.com | Health |
| Republican | 21 | 83 | 90 | 16 | 53 | 0 | TRUE | 3 | 4 | 3 | 4 | 4 | 4 | 4 | 4 | Woman | 6 | White or Caucasian | 28 | apnews.com | Politics |
| Independent | 44 | 29 | 92 | 27 | 87 | 1 | TRUE | 3 | 4 | 5 | 5 | 5 | 5 | 5 | 4 | Man | 4 | White or Caucasian | 30 | apnews.com | Health |
| Independent | 16 | 80 | 86 | 30 | 5 | 4 | TRUE | 0 | 5 | 3 | 4 | 4 | 5 | 5 | 5 | Man | 8 | White or Caucasian | 32 | twitter.com | Health |
| Democrat | 67 | 41 | 39 | 71 | 63 | 1 | TRUE | 3 | 5 | 4 | 3 | 2 | 2 | 2 | 3 | Woman | 4 | Black or African American | 30 | apnews.com | Politics |
| Republican | 10 | 75 | 95 | 40 | 90 | 1 | TRUE | 1 | 4 | 3 | 4 | 4 | 3 | 3 | 4 | Man | 4 | White or Caucasian | 43 | twitter.com | Politics |
| Democrat | 100 | 0 | 48 | 18 | 75 | 0 | TRUE | 2 | 4 | 4 | 5 | 4 | 3 | 4 | 3 | Woman | 1 | White or Caucasian | 31 | twitter.com | Politics |
| Independent | 65 | 69 | 48 | 54 | 73 | 0 | TRUE | 2 | 4 | 4 | 4 | 5 | 3 | 4 | 4 | Woman | 2 | Hispanic or Latino | 27 | apnews.com | Politics |
| Republican | 8 | 94 | 58 | 13 | 83 | 0 | TRUE | 3 | 3 | 3 | 4 | 3 | 3 | 3 | 3 | Man | 7 | Asian / Pacific Islander | 57 | apnews.com | Health |
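
---

## Regular expressions and the exam data

The pattern-matching ideas from the regular expressions and string manipulation slides also apply to the column names and values in this data set. Here is a minimal base R sketch. It assumes a hypothetical two-row stand-in, `twap_demo`, copied from the first two rows of the table, rather than the downloaded file. The object names `cred_cols`, `is_twitter`, and `format_label` are illustrative only.

```r
## hypothetical stand-in: a few columns from the first two rows of the table
twap_demo <- data.frame(
  party = c("Democrat", "Republican"),
  credibility_1 = c(4L, 2L),
  credibility_2 = c(4L, 4L),
  credibility_3 = c(4L, 4L),
  headline_format = c("apnews.com", "twitter.com"),
  stringsAsFactors = FALSE
)

## select the credibility items by matching a pattern against the column names
cred_cols <- grep("^credibility_[0-9]+$", names(twap_demo), value = TRUE)
cred_cols

## flag rows whose headline format mentions twitter (case-insensitive)
is_twitter <- grepl("twitter", twap_demo$headline_format, ignore.case = TRUE)
is_twitter

## strip the trailing ".com" to get a shorter label
format_label <- sub("\\.com$", "", twap_demo$headline_format)
format_label
```

The same calls should carry over to the full `twap` data frame, assuming its column names match the table above.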