Section 3 Working with Text

In this section, we learn how to work with textual data in R, using positive IMDB reviews as our example texts. Before we start, it is important to understand the general logic of R code, so we begin with a very brief explanation of functions and objects.

3.1 Functions and Objects

In R, function calls have the following form: function(argument1, argument2, ..., argumentN). Typically, a function does something to an object (e.g. a table), so the first argument usually specifies the data to which the function is applied. The other arguments then supply additional information. As a side note, functions are also objects, but they contain instructions rather than data.
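To make this pattern concrete, here is a minimal sketch using the base R function round. The number and the object name rounded are made up for illustration.

```r
# first argument: the data; second argument: additional information
rounded <- round(3.14159, digits = 2)
rounded
## [1] 3.14
# functions are objects too: class() reports what kind of object round is
class(round)
## [1] "function"
```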

To assign content to an object, we use <- or =: we first provide a name for the object and then assign some content to it. For example, MyObject <- 1:20 means: create an object called MyObject that contains the numbers 1 to 20.

# generate an object
MyObject <- 1:20
# inspecting my object
MyObject
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

3.2 Inspecting data

There are many ways to inspect data. We will briefly go over the most common ones.

The head function takes the data object as its first argument and, by default, shows the first 6 elements of an object (or the first 6 rows if the data object has a table format). In contrast, the str function shows the structure of an object.

# inspect first six elements of my object
head(MyObject)
## [1] 1 2 3 4 5 6
# inspect structure of my object
str(MyObject)
##  int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
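The same two functions also work on tables. The sketch below assumes a small, made-up data frame called MyTable; the n argument of head changes how many rows are shown.

```r
# a small made-up table with three rows and two columns
MyTable <- data.frame(id = 1:3, word = c("good", "great", "fine"))
# show only the first two rows instead of the default six
head(MyTable, n = 2)
##   id  word
## 1  1  good
## 2  2 great
# show the structure: one line per column with its type
str(MyTable)
```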

Next, we will learn how to load texts into R.

3.3 Loading text data

There are many functions that we can use to load text data into R. For example, we can use the readLines function, which returns one element per line of the file, as shown below.

text <- readLines(here::here("data", "reviews_pos/textpos1.txt"))
# inspect first text element
text
## [1] "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side."
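Because readLines returns one element per line, a file with several lines produces a vector with several elements. A minimal sketch, assuming a temporary file with made-up content rather than the review data:

```r
# write two made-up lines to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("First line.", "Second line."), tmp)
# readLines returns one element per line
txt_lines <- readLines(tmp)
length(txt_lines)
## [1] 2
# collapse the lines into a single string if one text per file is needed
one_text <- paste(txt_lines, collapse = " ")
one_text
## [1] "First line. Second line."
```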

To load many texts, we can read all texts in a folder as shown below. In a first step, we list the paths of the texts; then we use the map_chr function from the purrr package to loop over the paths and read each file into R. Finally, we name the texts after the files from which they were loaded.

# define paths (only files ending in .txt)
reviews_pos <- list.files(here::here("data/reviews_pos"), full.names = TRUE, pattern = "\\.txt$") %>%
  # load data (each file is read into a single string)
  purrr::map_chr(~ readr::read_file(.))
# add names (file names without the .txt extension)
names(reviews_pos) <- list.files(here::here("data/reviews_pos"), pattern = "\\.txt$") %>%
  stringr::str_remove_all("\\.txt$")
# inspect first text element
reviews_pos[1]
## textpos1
## "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side.\r\n"
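The same two-step logic (list the paths, then read each file) can be sketched with base R only. The folder, file names, and contents below are made up for illustration, and vapply stands in for purrr::map_chr; this is one possible variant, not the workflow used above.

```r
# create a temporary folder with two made-up text files
tmp_dir <- file.path(tempdir(), "demo_texts")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines("A great film.", file.path(tmp_dir, "demo1.txt"))
writeLines("Loved every minute.", file.path(tmp_dir, "demo2.txt"))
# step 1: define the paths
paths <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)
# step 2: read each file into a single string
texts <- vapply(paths, function(p) paste(readLines(p), collapse = "\n"), character(1))
# name each text after its file, without folder and extension
names(texts) <- tools::file_path_sans_ext(basename(paths))
texts
```

Unlike readr::read_file, readLines drops the line endings, so the trailing \r\n seen in the output above does not appear in this variant.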

3.4 Saving text data

To save a single text file, we can use the writeLines function, which only needs the text and the location where the text should be saved as its arguments.

writeLines(reviews_pos[1], here::here("data", "review_pos_text1.txt"))

To save many text files on your computer, you first need to define the locations where you want to save the texts; in a second step, you save the files (as shown below).

IMPORTANT: I have created a folder called output in my data folder in which the texts will be saved!

# define where to save each file
outs <- file.path(here::here("data", "output"), paste0(names(reviews_pos), ".txt"))
head(outs)
## [1] "F:/data recovery/Uni/UQ/SLC/LADAL/workshops/IntroR_WS/data/output/textpos1.txt"   
## [2] "F:/data recovery/Uni/UQ/SLC/LADAL/workshops/IntroR_WS/data/output/textpos10.txt"  
## [3] "F:/data recovery/Uni/UQ/SLC/LADAL/workshops/IntroR_WS/data/output/textpos100.txt" 
## [4] "F:/data recovery/Uni/UQ/SLC/LADAL/workshops/IntroR_WS/data/output/textpos1000.txt"
## [5] "F:/data recovery/Uni/UQ/SLC/LADAL/workshops/IntroR_WS/data/output/textpos101.txt" 
## [6] "F:/data recovery/Uni/UQ/SLC/LADAL/workshops/IntroR_WS/data/output/textpos102.txt"

IMPORTANT: I have set the chunk attribute eval to F (FALSE) so that this chunk is not executed automatically. To run the code chunk, please just click the green “play button” in the top right corner of the code chunk.

# save the files
lapply(seq_along(reviews_pos), function(i) 
       writeLines(reviews_pos[[i]],  
       con = outs[i]))
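The two saving steps can be tried out safely in a temporary folder. The sketch below uses made-up texts and hypothetical object names (demo_texts, out_dir, demo_outs), and it creates the output folder from within R instead of assuming that it already exists.

```r
# made-up named texts standing in for the reviews
demo_texts <- c(demo1 = "A great film.", demo2 = "Loved every minute.")
# create the output folder if it does not exist yet
out_dir <- file.path(tempdir(), "demo_output")
if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
# step 1: one output path per text, built from the names
demo_outs <- file.path(out_dir, paste0(names(demo_texts), ".txt"))
# step 2: write each text to its path
# invisible() suppresses the list of NULLs that lapply would otherwise print
invisible(lapply(seq_along(demo_texts), function(i)
  writeLines(demo_texts[[i]], con = demo_outs[i])))
# check that both files were written
file.exists(demo_outs)
## [1] TRUE TRUE
```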

3.5 Piping

Piping is done using the %>% operator (from the magrittr package), which can be read as "and then". In the example below, we take the existing object (text) and then we convert it to upper case and then we store the result in a new object (text2).

text %>%
  toupper() -> text2
# inspect data
text2
## [1] "ONE OF THE OTHER REVIEWERS HAS MENTIONED THAT AFTER WATCHING JUST 1 OZ EPISODE YOU'LL BE HOOKED. THEY ARE RIGHT, AS THIS IS EXACTLY WHAT HAPPENED WITH ME.<BR /><BR />THE FIRST THING THAT STRUCK ME ABOUT OZ WAS ITS BRUTALITY AND UNFLINCHING SCENES OF VIOLENCE, WHICH SET IN RIGHT FROM THE WORD GO. TRUST ME, THIS IS NOT A SHOW FOR THE FAINT HEARTED OR TIMID. THIS SHOW PULLS NO PUNCHES WITH REGARDS TO DRUGS, SEX OR VIOLENCE. ITS IS HARDCORE, IN THE CLASSIC USE OF THE WORD.<BR /><BR />IT IS CALLED OZ AS THAT IS THE NICKNAME GIVEN TO THE OSWALD MAXIMUM SECURITY STATE PENITENTARY. IT FOCUSES MAINLY ON EMERALD CITY, AN EXPERIMENTAL SECTION OF THE PRISON WHERE ALL THE CELLS HAVE GLASS FRONTS AND FACE INWARDS, SO PRIVACY IS NOT HIGH ON THE AGENDA. EM CITY IS HOME TO MANY..ARYANS, MUSLIMS, GANGSTAS, LATINOS, CHRISTIANS, ITALIANS, IRISH AND MORE....SO SCUFFLES, DEATH STARES, DODGY DEALINGS AND SHADY AGREEMENTS ARE NEVER FAR AWAY.<BR /><BR />I WOULD SAY THE MAIN APPEAL OF THE SHOW IS DUE TO THE FACT THAT IT GOES WHERE OTHER SHOWS WOULDN'T DARE. FORGET PRETTY PICTURES PAINTED FOR MAINSTREAM AUDIENCES, FORGET CHARM, FORGET ROMANCE...OZ DOESN'T MESS AROUND. THE FIRST EPISODE I EVER SAW STRUCK ME AS SO NASTY IT WAS SURREAL, I COULDN'T SAY I WAS READY FOR IT, BUT AS I WATCHED MORE, I DEVELOPED A TASTE FOR OZ, AND GOT ACCUSTOMED TO THE HIGH LEVELS OF GRAPHIC VIOLENCE. NOT JUST VIOLENCE, BUT INJUSTICE (CROOKED GUARDS WHO'LL BE SOLD OUT FOR A NICKEL, INMATES WHO'LL KILL ON ORDER AND GET AWAY WITH IT, WELL MANNERED, MIDDLE CLASS INMATES BEING TURNED INTO PRISON BITCHES DUE TO THEIR LACK OF STREET SKILLS OR PRISON EXPERIENCE) WATCHING OZ, YOU MAY BECOME COMFORTABLE WITH WHAT IS UNCOMFORTABLE VIEWING....THATS IF YOU CAN GET IN TOUCH WITH YOUR DARKER SIDE."
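A pipe only changes how the code is written, not what it computes. The sketch below compares a nested call with a piped one on a made-up string. It uses the native pipe |>, which assumes R 4.1 or later and needs no package; %>% behaves the same way in this simple case.

```r
demo <- "a show for the faint hearted"
# nested: read from the inside out
nested <- substr(toupper(demo), 1, 6)
# piped: read from left to right ("and then")
piped <- demo |> toupper() |> substr(1, 6)
# both versions give the same result
identical(nested, piped)
## [1] TRUE
piped
## [1] "A SHOW"
```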