Section 5 Basic String Processing

Before turning to more advanced string processing (in the context of computation, texts are referred to as strings) using the stringr package, let us just focus on some basic functions that are extremely useful when working with texts.

To practice working with texts, we will store a short positive IMDB review under the name review and also generate a tokenized version of this review.

Tokenization: Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be symbols, words, phrases, sentences, or other meaningful elements (e.g. paragraphs). Tokenization is a fundamental step in natural language processing (NLP) and text analysis, as it allows for the text to be analysed and processed systematically.

review <- reviews_pos[4]
review_tok <- quanteda::tokens(review) %>% unlist() %>% as.vector()
# inspect
review; str(review_tok)
##                                                                                                                                                                                                                                                                                                                 textpos1000 
## "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n"
##  chr [1:58] "Gary" "Busey" "is" "superb" "in" "this" "musical" "biography" ...

We can now start to check out functions that are useful and frequently applied to text data. We start by splitting text data.

To split texts, we can use the str_split function. However, there are two issues when using this (very useful) function:

  • the pattern that we want to split on disappears

  • the output is a list (a special type of data format)

To remedy these issues, we

  • combine the str_split function with the unlist function

  • add something right at the beginning of the pattern that we use to split the text. To add something to the beginning of the pattern that we want to split the text by, we use the str_replace_all function. The str_replace_all function takes three arguments, 1. the text, 2. the pattern that should be replaced, 3. the replacement. In the example below, we add ~~~ to the sequence movie and then split on the ~~~ rather than on the sequence “movie” (in other words, we replace movie with ~~~movie and then split on ~~~).

# split text
str_split(
  # attach ~~~ right before where we want to split (we want to split before the token "movie")
    stringr::str_replace_all(reviews_pos, "movie", "~~~movie"),
    # define that we want to split by ~~~
    pattern = "~~~") %>%
  # unlist results
  unlist() -> reviews_pos_split
# inspect data
nchar(reviews_pos_split); str(reviews_pos_split)
##   [1] 1763  641  584  458  127  184 3847  152  685   30  105   92   43   27   11
##  [16]  716  465  417  262  107   59  747  370 1550  122 1693  714   45  726   80
##  [31]  224  906  250  521 1939 2708 1205 2572  819  717   28  109  211   31  554
##  [46]  289  914  440   65  561   62  196   33   55   40  200  121  616   82  135
##  [61]  293   47  280   21  232   89   99  353 2003  379   42   36   42  315  751
##  [76]   33   53  220 1147 1766   11  221   99  214  610   72   27  792  254  131
##  [91]  253   88  250  111   64   84  465  103   10  372
##  [ reached getOption("max.print") -- omitted 2747 entries ]
##  chr [1:2847] "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right"| __truncated__ ...
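The effect of the marker can be checked on a small invented example. The sketch below assumes that the stringr package is installed; the vector toy and the marker ~~~ are illustrative choices and not part of the review data.

```r
library(stringr)

# toy data (invented for illustration)
toy <- c("a good movie and a bad movie", "no match here")

# plain split: the pattern "movie" disappears from the result
plain <- unlist(str_split(toy, "movie"))
plain
# "a good "  " and a bad "  ""  "no match here"

# marker trick: insert ~~~ before "movie", then split on ~~~
marked <- unlist(str_split(str_replace_all(toy, "movie", "~~~movie"), "~~~"))
marked
# "a good "  "movie and a bad "  "movie"  "no match here"
```

With the marker, every piece after the first begins with the sequence we split on, so no text is lost.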

Another very useful function is tolower, which converts everything to lower case.

tolower(review)
##                                                                                                                                                                                                                                                                                                                 textpos1000 
## "gary busey is superb in this musical biography. great singing and excellent soundtrack. the buddy holly story is a much better movie than la bamba. from reading other comments, there may be some historical inaccuracies. regardless, it is a fun toe-tapping film, and a good introduction to buddy holly's music.\r\n"

Conversely, toupper converts everything to upper case.

toupper(review)
##                                                                                                                                                                                                                                                                                                                 textpos1000 
## "GARY BUSEY IS SUPERB IN THIS MUSICAL BIOGRAPHY. GREAT SINGING AND EXCELLENT SOUNDTRACK. THE BUDDY HOLLY STORY IS A MUCH BETTER MOVIE THAN LA BAMBA. FROM READING OTHER COMMENTS, THERE MAY BE SOME HISTORICAL INACCURACIES. REGARDLESS, IT IS A FUN TOE-TAPPING FILM, AND A GOOD INTRODUCTION TO BUDDY HOLLY'S MUSIC.\r\n"

The stringr package is part of the so-called tidyverse - a collection of packages that allows users to write R code in a readable manner - and it is one of the most widely used packages for string processing in R. The advantage of using stringr is that it makes string processing very easy. All stringr functions share a common structure:

str_function(string, pattern)

The two arguments in the structure of stringr functions are: string which is the character string to be processed and a pattern which is either a simple sequence of characters, a regular expression, or a combination of both. Because the string comes first, the stringr functions are ideal for piping and thus use in tidyverse style R.
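Because the string is the first argument, several stringr calls can be chained with the pipe. The following is a minimal sketch, assuming that stringr and magrittr are installed; the title string is made up for illustration.

```r
library(stringr)
library(magrittr)  # provides the %>% pipe

cleaned <- "  The Buddy Holly Story  " %>%
  # remove leading and trailing white space
  str_trim() %>%
  # convert to lower case
  str_to_lower() %>%
  # replace every o with a zero
  str_replace_all("o", "0")
cleaned
# "the buddy h0lly st0ry"
```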

Almost all function names of stringr begin with str, then an underscore and then the name of the action to be performed (an exception is word, which is used further below). For example, to replace the first occurrence of a pattern in a string, we should use str_replace(). In the following, we will use stringr functions to perform various operations on the example text. As we have already loaded the tidyverse package, we can start right away with using stringr functions as shown below.

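Like nchar in base R, str_count provides the number of characters of a text when no pattern is supplied. When a pattern is supplied, it counts the matches of that pattern instead.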

str_count(reviews_pos[1:5])
## [1] 1762  640 1041  310 3846
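The two uses of str_count can be compared on a small invented vector. This sketch assumes only that stringr is installed.

```r
library(stringr)

x <- c("banana", "and", "")

# without a pattern: number of characters per element
n_chars <- str_count(x)
n_chars
# 6 3 0

# with a pattern: number of matches per element
n_an <- str_count(x, "an")
n_an
# 2 1 0
```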

The function str_detect informs about whether a pattern is present in a text and outputs a logical vector with TRUE if the pattern occurs and FALSE if it does not.

str_detect(review, "and")
## [1] TRUE

The function str_extract_all extracts all occurrences of a pattern, if that pattern is present in a text.

str_extract_all(review, "and")
## [[1]]
## [1] "and" "and"

The function str_locate_all provides the start and end positions of the match of the pattern in a text and displays the result in matrix-form.

str_locate_all(review, "and")
## [[1]]
##      start end
## [1,]    63  65
## [2,]   263 265
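The positions returned by str_locate_all can be passed on to other functions. As a hedged sketch on an invented sentence (assuming stringr is installed), the start and end columns are used here to cut the matches back out with str_sub.

```r
library(stringr)

x <- "singing and dancing and more"

# one matrix per input text; take the matrix for the first (only) text
loc <- str_locate_all(x, "and")[[1]]
loc
# start 9 and 21, end 11 and 23

# extract the matched stretches by position
hits <- str_sub(x, loc[, "start"], loc[, "end"])
hits
# "and" "and"
```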

The function str_remove removes the first occurrence of a pattern in a text.

str_remove(review, "and") 
## [1] "Gary Busey is superb in this musical biography. Great singing  excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n"

The function str_remove_all removes all occurrences of a pattern from a text.

str_remove_all(review, "and")
## [1] "Gary Busey is superb in this musical biography. Great singing  excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film,  a good introduction to Buddy Holly's music.\r\n"

The function str_replace_all replaces all occurrences of a pattern with something else in a text.

str_replace_all(review, "and", "AND")
## [1] "Gary Busey is superb in this musical biography. Great singing AND excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, AND a good introduction to Buddy Holly's music.\r\n"
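The difference between the variants with and without _all is easiest to see on a very short string. The following sketch uses an invented string and assumes that stringr is installed.

```r
library(stringr)

x <- "a-b-c"

first_replaced <- str_replace(x, "-", "+")      # only the first hyphen
all_replaced   <- str_replace_all(x, "-", "+")  # every hyphen
first_removed  <- str_remove(x, "-")            # only the first hyphen
all_removed    <- str_remove_all(x, "-")        # every hyphen

c(first_replaced, all_replaced, first_removed, all_removed)
# "a+b-c" "a+b+c" "ab-c" "abc"
```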

Like strsplit in base R, the function str_split splits a text when a given pattern occurs. If an empty pattern ("") is provided, then the text is split into individual symbols.

str_split(review, "and") 
## [[1]]
## [1] "Gary Busey is superb in this musical biography. Great singing "                                                                                                                                       
## [2] " excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, "
## [3] " a good introduction to Buddy Holly's music.\r\n"
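Two further uses of str_split are sketched below on invented strings (assuming stringr is installed): an empty pattern, which yields single characters, and the argument n, which limits the number of pieces.

```r
library(stringr)

# empty pattern: split into individual characters
chars <- str_split("toe", "")[[1]]
chars
# "t" "o" "e"

# n = 2: split at the first match only, keep the rest together
two_parts <- str_split("a-b-c", "-", n = 2)[[1]]
two_parts
# "a" "b-c"
```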

The function str_subset returns those elements of a vector of texts that contain a certain pattern.

str_subset(review_tok, "and") 
## [1] "and" "and"

The function str_which provides a vector with the indices of the texts that contain a certain pattern.

str_which(review_tok, "and")
## [1] 12 50
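The functions str_detect, str_subset, and str_which are closely related. The sketch below, which uses an invented token vector and assumes that stringr is installed, shows how they line up and why anchors matter when only whole tokens should match.

```r
library(stringr)

tokens <- c("Gary", "and", "band", "film", "and")

detected <- str_detect(tokens, "and")  # logical vector
subset   <- str_subset(tokens, "and")  # matching elements
indices  <- str_which(tokens, "and")   # positions of matching elements

detected
# FALSE TRUE TRUE FALSE TRUE
subset
# "and" "band" "and"
indices
# 2 3 5

# "and" also matches inside "band"; anchors restrict the match to whole tokens
exact <- str_which(tokens, "^and$")
exact
# 2 5
```

In the review, only the token and itself contains the sequence, which is why the anchored and unanchored patterns would give the same result there.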

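The function str_view_all shows the locations of all instances of a pattern in a text or vector of texts. In stringr 1.5.0 and later, str_view_all is deprecated because str_view highlights all matches by default.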

str_view_all(review, "and")
## [1] │ Gary Busey is superb in this musical biography. Great singing <and> excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, <and> a good introduction to Buddy Holly's music.{\r}
##     │

The function str_pad adds white spaces to a text or vector of texts so that they reach a given number of symbols.

# create text with white spaces
text <- " this    is a    text   "
str_pad(text, width = 30)
## [1] "       this    is a    text   "

The function str_trim removes white spaces from the beginning(s) and end(s) of a text or vector of texts.

str_trim(text) 
## [1] "this    is a    text"

The function str_squish removes white spaces from the beginning and end of a text (or vector of texts) and reduces repeated white spaces within the text to a single space.

str_squish(text)
## [1] "this is a text"
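The three white space functions also accept a few optional arguments. As a brief sketch on invented strings (assuming stringr is installed), side controls where white space is trimmed or added, and pad sets the padding symbol.

```r
library(stringr)

x <- "  this   is a   text "

trimmed      <- str_trim(x)                 # both ends
trimmed_left <- str_trim(x, side = "left")  # beginning only
squished     <- str_squish(x)               # ends plus repeated inner spaces

# pad with zeros on the left (the default side)
padded_zero  <- str_pad("7", width = 3, pad = "0")
# pad with spaces on the right
padded_right <- str_pad("ab", width = 5, side = "right")

c(trimmed, trimmed_left, squished, padded_zero, padded_right)
# "this   is a   text" "this   is a   text " "this is a text" "007" "ab   "
```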

The function str_order provides a vector of indices that represents the alphabetical order of a vector of texts (not their order by length).

str_order(reviews_pos_split)
##   [1] 1522 2196  170 2011 1320 2161  231 1771 2525 1876  272  936 1413  287 1988
##  [16] 1276   40 1630  530 1727  168  127 2676 1826 2457 1420  438 2332 2696  528
##  [31]  910 2808 2279  828 2162 1580 2282  942 2081 2029 1811 2440 1036 2178  258
##  [46] 2024  622 1728  469 1196 1577 1606  772 1059 2227 1336 1963  313 1888  745
##  [61] 1202 1713  270  883 2126 2500 1111 2220 1965 1353  331 1558 1659  552 2339
##  [76] 1683  904 1144  323 2666 1803  698   70 1928 2016 2090 2747 1167 1007 1137
##  [91] 1175   25 2408 1663 2359 2576 1238  364  882  177
##  [ reached getOption("max.print") -- omitted 2747 entries ]

The function str_sort sorts a vector of texts alphabetically (not by the lengths of the texts in that vector).

str_sort(reviews_pos_split) %>%
  head()
## [1] "...And there were quite a few of these. <br /><br />I do not like this cartoon as much as many others, partly because it was made in its period. I much prefer cartoons with Daffy and Bugs which are fifteen or so years before-hand. Many people will like this, particularly people who always find violence funny, cartoon or not.<br /><br />The basic plot is a pretty well known one for Looney Tunes: Elmer goes out hunting, Daffy leads him to Bugs and Daffy ends up being shot instead. Also inserted are quite clever and highly entertaining jokes (some do not enhance the episode), ugly shooting and animation which is slightly mediocre. The plot is mainly geared by jokes - each joke keeps the episode going. This way of plot-going is not all that unusual in Looney Tunes (of course if you are pretty much a Looney Tunes boffin - or an eager one - like me, then you'll know this already).<br /><br />For people who love everything about Looney Tunes and Daffy Duck and like the sound of what I have said about it, enjoy \"Rabbit Seasoning\"!<br /><br />7 and a half out of ten.\r\n"                                                                                                               
## [2] "'Had Ned Kelly been born later he probably would have won a Victoria Cross at Gallipolli'. such was Ned's Bravery.<br /><br />In Australia and especially country Victoria the name Ned Kelly can be said and immediately recognised. In Greta he is still a Hero, the life Blood of the Town of Jerilderie depends on the tourism he created, but in Mansfield they still haven't forgotten that the three policeman that he 'murdered' were from there.<br /><br />Many of the buildings he visited in his life are still standing. From the Old Melbourne Gaol where he was hanged, to the Post office he held up in Jerilderie. A cell he was once held in in Greta is on display in Benella and the site of Ann Jones' Hotel, the station and even the logs where he was captured in Glenrowan can be visited.<br /><br />Evidence of all the events in the "                                                                                                                                                                                                                                                                                                                                                                      
## [3] "'War "                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
## [4] "'What I Like About You' is definitely a show that I couldn't wait to see each day. Amanda Bynes is such an excellent actress and I grew up watching her show: 'The Amanda Show.' She's a very funny person and seems to be down to earth. \"Holly\" is such a like-able person and has an \"out-there\" personality. I enjoyed how she always seemed to turn things around and upside down, so she messed herself up at times. But that's what made the show so great.<br /><br />I especially loved the show when the character 'Vince' came along. Nick Zano is very HOT and funny, as well as 'Gary', Wesley Jonathan. The whole cast was great, each character had their own personality and charm. Jennie Garth, Allison Munn, and Leslie Grossman were all very interesting. I especially loved 'Lauren'; she's the best! She helped make the show extra funny and you never know what she's gonna do or say next! Overall the show is really nice but the reason I didn't give it a 10 was because there's no more new episodes and because the episodes could've been longer and more deep.\r\n"                                                                                                                                
## [5] "\" Så som i himmelen \" .. as above so below.. that very special point where Divine and Human meet. I ADORE this film ! A gem. YES amazing grace !<br /><br />I was so deeply moved by its very HUMAN quality. I laughed and cried through a whole register , indeed several octaves of emotions.<br /><br />Mikael Nyqvist ís BRILLIANT as Daniel , a first rate passionate performance, charismatic and powerful. His inner light and exceptional talent shines through in every scene, every interaction ,in every meeting. I was totally mesmerised, enchanted and caught up the story, which is our collective story, the story of life itself.<br /><br />The film was also so inclusive of many archetypes, messiah, wounded child ,magical child, artist, teacher, priest, abuser, abused, victim, bully, divine fool - ALL the characters so real and true to life - all awakened great fondness and compassion in me. <br /><br />It is a real treat to see such a thought provoking yet thoroughly enjoyable, entertaining film. Oh ..mustn't forget the heavenly choir of angels and breathtakingly beautiful sound. <br /><br />THANK YOU ALL - This Swedish film will surely captivate people world-wide. BRILLIANT !\r\n"
## [6] "\"Ah Ritchie's made another gangster film with Statham\" thought the average fan, expecting another Snatch/Lock Stock; expecting perhaps a couple of temporal shifts, but none too hard for \"me and the lads\" to swallow after a few beers.<br /><br />Ah, pay attention, you do need to watch this film. No cups of tea, no extra diet cokes from the counter, no \"keep it running\" shouts as you nip to the fridge - watch the film! No laughs other than those you may make yourself from the considerable violence (and if that floats your boat, so be it) but sharp solid direction, excellent dialogue, and great performances.<br /><br />My favourite - Big Pussy from The Sopranos, always a reliable hood.\r\n"
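A small invented vector makes it easy to see that str_order and str_sort work alphabetically. The sketch below assumes that stringr is installed; sorting by length is shown as a separate step that combines base order with str_length.

```r
library(stringr)

x <- c("pear", "fig", "banana")

sorted  <- str_sort(x)   # alphabetical
ordered <- str_order(x)  # indices that would sort x alphabetically
sorted
# "banana" "fig" "pear"
ordered
# 3 2 1

# sorting by length needs the lengths explicitly
by_length <- x[order(str_length(x))]
by_length
# "fig" "pear" "banana"
```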

The function str_c combines texts into one text.

str_c(review, reviews_pos[7])
## [1] "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\nI think this is one hell of a movie...........We can see Steven fighting around with his martial art stuff again and like in all Segal movies there's a message in it, without the message it would be one of many action/fighting movies but the message is what makes segal movies great and special.\r\n"

The function str_conv re-encodes a text. The argument encoding names the encoding in which the text is currently stored (e.g. UTF-8 or Latin1), and the result is returned as UTF-8.

str_conv(review, encoding = "UTF-8")
## [1] "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n"

The function str_flatten combines a vector of texts into one text. The argument collapse defines the symbol that occurs between the combined texts. If the argument collapse is left out, the texts will be combined without any symbol between the combined texts.

str_flatten(c(review, reviews_pos[7]), collapse = " ")
## [1] "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n I think this is one hell of a movie...........We can see Steven fighting around with his martial art stuff again and like in all Segal movies there's a message in it, without the message it would be one of many action/fighting movies but the message is what makes segal movies great and special.\r\n"
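The difference between str_c and str_flatten is easier to see on short invented strings. The sketch below assumes that stringr is installed: str_c pastes its arguments together element by element (with sep between them), while str_flatten collapses one vector into a single text.

```r
library(stringr)

# element-wise combination of two vectors
pairs <- str_c(c("a", "b"), c("x", "y"), sep = "-")
pairs
# "a-x" "b-y"

# collapsing one vector into a single text
flat <- str_flatten(c("a", "b", "c"), collapse = ", ")
flat
# "a, b, c"

# str_c can also collapse if the collapse argument is set
collapsed <- str_c(c("a", "b", "c"), collapse = "")
collapsed
# "abc"
```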

The function str_length provides the length of texts in characters.

str_length(review)
## [1] 311

The function str_sub extracts a string from a text from a start location to an end position (expressed as character positions).

str_sub(review, 5, 25)
## [1] " Busey is superb in t"

The function word extracts words from a text (expressed as word positions).

word(review, 2:7)
## [1] "Busey"   "is"      "superb"  "in"      "this"    "musical"
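Both str_sub and word accept negative positions, which count from the end of the text. The following sketch illustrates this on the first words of the review, assuming that stringr is installed.

```r
library(stringr)

x <- "Gary Busey is superb"

first_four <- str_sub(x, 1, 4)    # characters 1 to 4
last_six   <- str_sub(x, -6, -1)  # the last six characters
second     <- word(x, 2)          # the second word
two_words  <- word(x, 2, 3)       # words 2 to 3 as one text
last_word  <- word(x, -1)         # the last word

c(first_four, last_six, second, two_words, last_word)
# "Gary" "superb" "Busey" "Busey is" "superb"
```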

5.1 Extracting frequency information from text

Frequency lists are very basic but also important when analysing text. Fortunately, it is very easy to extract frequency information and to create frequency lists with R. We can do this by first using the unnest_tokens function from the tidytext package, which splits texts into individual words, and then using the count function from the dplyr package to get the raw frequencies of all word types in a text.

reviews_pos %>%
  # convert to data frame
  as.data.frame()%>%
  # give name column with text
  dplyr::rename(text = 1) %>%
  # tokenise
  tidytext::unnest_tokens(word, text) %>%
  # count tokens
  dplyr::count(word, sort = TRUE) %>%
  # inspect 20 most frequent tokens
  head(20)
##     word     n
## 1    the 13395
## 2    and  6869
## 3      a  6487
## 4     of  5998
## 5     to  5153
## 6     is  4351
## 7     br  3968
## 8     in  3827
## 9     it  2982
## 10     i  2935
## 11  this  2845
## 12  that  2552
## 13    as  2025
## 14  with  1817
## 15   was  1709
## 16   for  1699
## 17   but  1571
## 18  film  1558
## 19 movie  1527
## 20    on  1436
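To make explicit what such a frequency list contains, the same idea can be sketched with stringr and base R on two invented sentences. This is not the tidytext pipeline used above, only a rough equivalent: the texts, the simple token pattern, and the use of table are assumptions made for illustration.

```r
library(stringr)

# toy texts (invented for illustration)
texts <- c("The film is good.", "The movie is good, the cast is great.")

# lower-case the texts and extract runs of letters (and apostrophes) as tokens
tokens <- unlist(str_extract_all(str_to_lower(texts), "[a-z']+"))

# count each word type and sort by frequency
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 3)
```

Here the and is each occur three times and good occurs twice. Real tokenizers handle punctuation, numbers, and non-ASCII letters more carefully than this simple pattern does.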

Extracting n-grams is also very easy, as the unnest_tokens function has an argument called token in which we can specify that we want to extract n-grams. If we do this, then we need to specify the n as a separate argument. Below we specify that we want the frequencies of all 3-grams (trigrams).

reviews_pos %>%
  # convert to data frame
  as.data.frame()%>%
  # give name column with text
  dplyr::rename(text = 1) %>%
  # clean data
  dplyr::mutate(text = str_remove_all(text, "<.*?>")) %>%
  # tokenise and extract trigrams
  tidytext::unnest_tokens(word, text, token = "ngrams", n = 3) %>%
  # count tokens
  dplyr::count(word, sort = TRUE) %>%
  # inspect ten most frequent tri-grams
  head(10)
##             word   n
## 1     one of the 219
## 2      this is a 127
## 3    some of the  96
## 4      is one of  92
## 5    of the film  92
## 6  this movie is  89
## 7       a lot of  85
## 8   this film is  70
## 9    of the best  68
## 10  of the movie  68
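What a trigram is can be made concrete by building trigrams by hand on one invented sentence. The sketch below assumes that stringr is installed and is not how unnest_tokens works internally. Note that removing a tag outright joins the words on either side of it, so this sketch replaces tags with a space instead; whether this matters for the review data is not checked here.

```r
library(stringr)

raw <- "One of the best<br /><br />films ever"

# removing tags without a replacement glues neighbouring words together
glued <- str_remove_all(raw, "<.*?>")
glued
# "One of the bestfilms ever"

# replace tags with a space, tidy up white space, lower-case, split into words
clean  <- str_squish(str_replace_all(raw, "<.*?>", " "))
tokens <- str_split(str_to_lower(clean), " ")[[1]]

# slide a window of n words over the tokens
n <- 3
trigrams <- vapply(
  seq_len(length(tokens) - n + 1),
  function(i) str_c(tokens[i:(i + n - 1)], collapse = " "),
  character(1)
)
trigrams
# "one of the" "of the best" "the best films" "best films ever"
```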

5.2 Regular Expressions

In this section, we focus on regular expressions (to learn more about regular expressions, have a look at this very recommendable tutorial). Regular expressions are powerful tools used to search and manipulate text patterns. They provide a way to find specific sequences of characters within larger bodies of text.

There are two basic types of regular expressions:

  • regular expressions that stand for frequencies (quantifiers)

  • regular expressions that stand for classes of symbols (types)

The regular expressions below show the first type of regular expressions, i.e. quantifiers.
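As a hedged illustration of common quantifiers, the sketch below tests a few patterns against an invented vector (assuming stringr is installed). Each quantifier applies to the symbol directly before it, here the k.

```r
library(stringr)

x <- c("wal", "walk", "walkk", "walkkk")

q_optional  <- str_detect(x, "^walk?$")      # ?     zero or one k
q_any       <- str_detect(x, "^walk*$")      # *     zero or more k
q_some      <- str_detect(x, "^walk+$")      # +     one or more k
q_exact     <- str_detect(x, "^walk{2}$")    # {2}   exactly two k
q_at_least  <- str_detect(x, "^walk{2,}$")   # {2,}  two or more k
q_range     <- str_detect(x, "^walk{1,2}$")  # {1,2} one or two k

q_optional
# TRUE TRUE FALSE FALSE
q_exact
# FALSE FALSE TRUE FALSE
```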

The regular expressions below show the second type of regular expressions, i.e. types.
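A few frequently used classes can be tried out on an invented string. The sketch below assumes that stringr is installed and writes the classes in the same form as the practice examples in this section (e.g. [:alpha:]); in base R functions such as grepl, the same classes need a second pair of brackets (e.g. [[:alpha:]]).

```r
library(stringr)

x <- "Buddy Holly, 1959: toe-tapping!"

letters_found <- str_extract_all(x, "[:alpha:]+")[[1]]  # runs of letters
digits_found  <- str_extract_all(x, "[:digit:]+")[[1]]  # runs of digits
upper_found   <- str_extract_all(x, "[:upper:]")[[1]]   # upper case letters
punct_found   <- str_extract_all(x, "[:punct:]")[[1]]   # punctuation marks
n_spaces      <- str_count(x, "[:space:]")              # white space symbols

letters_found
# "Buddy" "Holly" "toe" "tapping"
punct_found
# "," ":" "-" "!"
```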

Types can be expanded to include structural properties as shown below.
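Structural properties refer to positions rather than symbols, for example the beginning or end of a text or the boundary of a word. The following sketch (invented vector, stringr assumed to be installed) shows the anchors ^ and $ and the word boundary \b, which has to be written as \\b inside an R string.

```r
library(stringr)

x <- c("and", "band", "android", "sand castle")

starts_with <- str_detect(x, "^and")       # "and" at the beginning of the text
ends_with   <- str_detect(x, "and$")       # "and" at the end of the text
whole_word  <- str_detect(x, "\\band\\b")  # "and" as a word of its own
word_end    <- str_detect(x, "and\\b")     # "and" at the end of a word

starts_with
# TRUE FALSE TRUE FALSE
ends_with
# TRUE TRUE FALSE FALSE
whole_word
# TRUE FALSE FALSE FALSE
word_end
# TRUE TRUE FALSE TRUE
```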

5.3 Practice: regular expressions

We now want to show all words in the tokenized review that contain y.

review_tok[str_detect(review_tok, "[y]")]
## [1] "Gary"      "Busey"     "biography" "Buddy"     "Holly"     "Story"    
## [7] "may"       "Buddy"     "Holly's"

Show all words in the tokenized review that begin with a lower case a.

review_tok[str_detect(review_tok, "^a")]
## [1] "and" "a"   "a"   "and" "a"

Show all words in the tokenized review that end in a lower case s.

review_tok[str_detect(review_tok, "s$")]
## [1] "is"           "this"         "is"           "comments"     "inaccuracies"
## [6] "Regardless"   "is"           "Holly's"

Show all words in the tokenized review in which there is an o, then any two other characters, and then a y.

review_tok[str_detect(review_tok, "o..y")]
## [1] "Holly"   "Holly's"

Show the same words using a quantifier, i.e. all words in the tokenized review in which there is an o, then exactly two other characters, and then a y.

review_tok[str_detect(review_tok, "o.{2,2}y")]
## [1] "Holly"   "Holly's"

Show all words that consist of exactly three alphabetical characters in the tokenized review.

review_tok[str_detect(review_tok, "^[:alpha:]{3,3}$")]
## [1] "and" "The" "may" "fun" "and"

Show all words that consist of six or more alphabetical characters in the tokenized review.

review_tok[str_detect(review_tok, "^[:alpha:]{6,}$")]
##  [1] "superb"       "musical"      "biography"    "singing"      "excellent"   
##  [6] "soundtrack"   "better"       "reading"      "comments"     "historical"  
## [11] "inaccuracies" "Regardless"   "introduction"

Replace all lower case as with upper case Es in the review.

str_replace_all(review, "a", "E")
## [1] "GEry Busey is superb in this musicEl biogrEphy. GreEt singing End excellent soundtrEck. The Buddy Holly Story is E much better movie thEn LE BEmbE. From reEding other comments, there mEy be some historicEl inEccurEcies. RegErdless, it is E fun toe-tEpping film, End E good introduction to Buddy Holly's music.\r\n"

Remove all non-word characters (i.e. everything except letters, digits, and the underscore) in the tokenized review.

str_remove_all(review_tok, "\\W")
##  [1] "Gary"         "Busey"        "is"           "superb"       "in"          
##  [6] "this"         "musical"      "biography"    ""             "Great"       
## [11] "singing"      "and"          "excellent"    "soundtrack"   ""            
## [16] "The"          "Buddy"        "Holly"        "Story"        "is"          
## [21] "a"            "much"         "better"       "movie"        "than"        
## [26] "La"           "Bamba"        ""             "From"         "reading"     
## [31] "other"        "comments"     ""             "there"        "may"         
## [36] "be"           "some"         "historical"   "inaccuracies" ""            
## [41] "Regardless"   ""             "it"           "is"           "a"           
## [46] "fun"          "toetapping"   "film"         ""             "and"         
## [51] "a"            "good"         "introduction" "to"           "Buddy"       
## [56] "Hollys"       "music"        ""
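The class \W is broader than "non-alphabetical", which only shows once digits or underscores are involved. The sketch below uses an invented token vector (stringr assumed to be installed) to contrast \\W with a negated [:alpha:] class and to drop the empty strings that remain where a token consisted of punctuation only.

```r
library(stringr)

toks <- c("Holly's", "toe-tapping", ".", "R2D2", "snake_case")

# \\W keeps letters, digits, and the underscore
no_nonword <- str_remove_all(toks, "\\W")
no_nonword
# "Hollys" "toetapping" "" "R2D2" "snake_case"

# a negated letter class keeps letters only
letters_only <- str_remove_all(toks, "[^[:alpha:]]")
letters_only
# "Hollys" "toetapping" "" "RD" "snakecase"

# drop tokens that are empty after cleaning
non_empty <- no_nonword[no_nonword != ""]
non_empty
# "Hollys" "toetapping" "R2D2" "snake_case"
```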

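Remove all spaces in the review. Because the pattern only matches the space character, the line break (\r\n) at the end of the review is kept.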

str_remove_all(review, " ")
## [1] "GaryBuseyissuperbinthismusicalbiography.Greatsingingandexcellentsoundtrack.TheBuddyHollyStoryisamuchbettermoviethanLaBamba.Fromreadingothercomments,theremaybesomehistoricalinaccuracies.Regardless,itisafuntoe-tappingfilm,andagoodintroductiontoBuddyHolly'smusic.\r\n"
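To remove line breaks and tabs as well, the white space class \s can be used instead of a literal space. This is a minimal sketch on an invented string, assuming that stringr is installed.

```r
library(stringr)

x <- "a fun film\r\n"

# a literal space leaves the carriage return and line feed in place
spaces_removed <- str_remove_all(x, " ")
spaces_removed
# "afunfilm\r\n"

# \\s matches any white space symbol, including \r and \n
whitespace_removed <- str_remove_all(x, "\\s")
whitespace_removed
# "afunfilm"
```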