Section 3 Introduction to Text Analysis with R

This part of the workshop introduces basic methods of Text Analysis (see Bernard and Ryan 1998; Kabanoff 1997; Popping 2000), i.e. the computer-based analysis of language data or the (semi-)automated extraction of information from text. We will focus on the following methods:

  • Keyword Detection and Sentiment Analysis

  • Network and Collocation Analysis

  • Topic Modelling

3.1 What is Text Analysis?

Text Analysis (TA) refers to the process of examining, processing, and interpreting unstructured data (texts) to uncover actionable knowledge using computational methods. Unstructured data (text) can, for example, include emails, literary texts, letters, articles, advertisements, official documents, social media content, transcripts, and product reviews. Actionable knowledge refers to insights and patterns used to classify, sort, extract information, determine relationships, identify trends, and make informed decisions.
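
As a minimal illustration of turning unstructured text into something countable, the sketch below splits a made-up sentence into words and tabulates their frequencies. It uses base R only, and the example sentence and the names `snippet`, `tokens`, and `freq` are assumptions chosen for illustration, not part of the workshop data.

```r
# A made-up snippet of unstructured text (illustrative only)
snippet <- "The film was good. The acting was good, but the plot was thin."

# Lower-case the text, then split on anything that is not a letter
tokens <- strsplit(tolower(snippet), "[^a-z]+")[[1]]
tokens <- tokens[tokens != ""]  # drop empty strings, if any

# Count how often each word occurs, most frequent first
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 3)
```

The frequency table is a simple structured representation of the text; the methods covered in this part of the workshop build on representations of this kind.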

Sometimes, Text Analysis is distinguished from Text Analytics. In this context, Text Analysis refers to manual, close-reading, and qualitative interpretative approaches, while Text Analytics refers to quantitative, computational analysis of text. However, in this tutorial, we consider Text Analysis and Text Analytics to be synonymous, encompassing any computer-based qualitative or quantitative method for analyzing text.

3.2 Preparation and session setup

To ensure the scripts below run smoothly, we first need to install the required R packages from a repository (by default CRAN) into our local R library. If you have already installed these packages, you can skip the installation step. To install them, run the code below (which may take 1 to 5 minutes).

install.packages("flextable")
install.packages("GGally")
install.packages("ggraph")
install.packages("gutenbergr")
install.packages("here")
install.packages("igraph")
install.packages("lda")
install.packages("ldatuning")
install.packages("Matrix")
install.packages("network")
install.packages("quanteda")
install.packages("quanteda.textplots")
install.packages("quanteda.textstats")
install.packages("RColorBrewer")
install.packages("reshape2")
install.packages("slam")
install.packages("sna")
install.packages("textdata")
install.packages("tidygraph")
install.packages("tidytext")
install.packages("tidyverse")
install.packages("tm")
install.packages("topicmodels")
install.packages("udpipe")
install.packages("wordcloud")
install.packages("wordcloud2")
install.packages("writexl")
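
If some of the packages are already installed, a check like the following avoids reinstalling them. This is a minimal sketch rather than part of the workshop code: `missing_packages` and `workshop_pkgs` are hypothetical names, and only three of the package names are listed for brevity.

```r
# Hypothetical helper: return those packages from `pkgs` that are not yet
# installed in any library on this machine
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Shortened list for illustration; extend it with the packages listed above
workshop_pkgs <- c("quanteda", "tidytext", "topicmodels")
to_install <- missing_packages(workshop_pkgs)

# Install only what is missing. The interactive() check is an assumption made
# here so that nothing is downloaded when the code is knitted or run as a script.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Printing `to_install` before installing shows which packages would be downloaded.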

Once all packages are installed, load them by running the code chunk below.

# load packages
library(flextable)
library(GGally)
library(ggraph)
library(gutenbergr)
library(igraph)
library(lda)
library(ldatuning)
library(Matrix)
library(network)
library(quanteda)
library(quanteda.textplots)
library(quanteda.textstats)
library(RColorBrewer)
library(reshape2)
library(slam)
library(sna)
library(textdata)
library(tidygraph)
library(tidytext)
library(tidyverse)
library(tm)
library(topicmodels)
library(udpipe)
library(wordcloud)
library(wordcloud2)
library(writexl)
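# activate klippy for copy-to-clipboard button (optional; klippy is not among
# the packages installed above, so the call is skipped if it is unavailable)
if (requireNamespace("klippy", quietly = TRUE)) klippy::klippy()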

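Next, we need to load our data. The first part, which focuses on sentiment analysis, uses the IMDB data, which consists of positive and negative reviews. The code chunk below reads the review files from the folders data/reviews_pos and data/reviews_neg and combines them into one character string for the positive reviews and one for the negative reviews.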

# load reviews: read all .txt files per folder, merge them into one string,
# and replace HTML tags with spaces
posreviews <- list.files(here::here("data/reviews_pos"), full.names = TRUE, pattern = "\\.txt$") %>%
  purrr::map_chr(~ readr::read_file(.)) %>%
  str_c(collapse = " ") %>%
  str_replace_all("<.*?>", " ")
negreviews <- list.files(here::here("data/reviews_neg"), full.names = TRUE, pattern = "\\.txt$") %>%
  purrr::map_chr(~ readr::read_file(.)) %>%
  str_c(collapse = " ") %>%
  str_replace_all("<.*?>", " ")
# inspect
str(posreviews); str(negreviews)
##  chr "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right"| __truncated__
##  chr "Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fi"| __truncated__
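
To see what the merging and tag-removal steps in the loading code do, the sketch below applies the same regular expression to two made-up reviews. It uses base R (`paste` and `gsub` with `perl = TRUE`) as stand-ins for `str_c` and `str_replace_all` so that it runs without additional packages; the toy reviews and variable names are assumptions for illustration only.

```r
# Two made-up reviews (not from the IMDB data) containing HTML line breaks
toy_reviews <- c("Great film.<br />Loved it.", "Too long.<br /><br />Dull.")

# Merge all reviews into a single string, as str_c(collapse = " ") does
merged <- paste(toy_reviews, collapse = " ")

# Lazy pattern: .*? stops at the first ">", so each tag is replaced separately
cleaned <- gsub("<.*?>", " ", merged, perl = TRUE)

# Greedy pattern for comparison: .* runs to the LAST ">", which also deletes
# the review text between the first and the last tag
greedy <- gsub("<.*>", " ", merged, perl = TRUE)

cleaned
greedy
```

Comparing `cleaned` with `greedy` shows why the lazy quantifier matters: the greedy version removes everything between the first `<` and the last `>`, including the review text in between.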