Section 5 Basic String Processing
Before turning to more advanced string processing using the stringr package (in the context of computation, texts are referred to as strings), let us focus on some basic functions that are extremely useful when working with texts.
To practice working with texts, we will select a short positive IMDB review and also generate a tokenized version of this review.
Tokenization: Tokenization is the process of breaking down text into smaller units called tokens. These tokens can be symbols, words, phrases, sentences, or other meaningful elements (e.g. paragraphs). Tokenization is a fundamental step in natural language processing (NLP) and text analysis, as it allows the text to be analysed and processed systematically.
review <- reviews_pos[4]
review_tok <- quanteda::tokens(review) %>% unlist() %>% as.vector()
# inspect
review; str(review_tok)
## textpos1000
## "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n"
## chr [1:58] "Gary" "Busey" "is" "superb" "in" "this" "musical" "biography" ...
We can now start to check out functions that are useful and frequently applied to text data. We start by splitting text data.
To split texts, we can use the str_split function. However, there are two issues when using this (very useful) function:
the pattern that we want to split on disappears
the output is a list (a special type of data format)
To remedy these issues, we
combine the str_split function with the unlist function
add something right at the beginning of the pattern that we use to split the text
To add something to the beginning of the pattern that we want to split the text by, we use the str_replace_all function. The str_replace_all function takes three arguments: 1. the text, 2. the pattern that should be replaced, 3. the replacement. In the example below, we add ~~~ to the sequence movie and then split on the ~~~ rather than on the sequence “movie” (in other words, we replace movie with ~~~movie and then split on ~~~).
# split text
str_split(
# attach ~~~ right before where we want to split (we want to split before the token "movie")
stringr::str_replace_all(reviews_pos, "movie", "~~~movie"),
# define that we want to split by ~~~
pattern = "~~~") %>%
# unlist results
unlist() -> reviews_pos_split
# inspect data
nchar(reviews_pos_split); str(reviews_pos_split)
## [1] 1763 641 584 458 127 184 3847 152 685 30 105 92 43 27 11
## [16] 716 465 417 262 107 59 747 370 1550 122 1693 714 45 726 80
## [31] 224 906 250 521 1939 2708 1205 2572 819 717 28 109 211 31 554
## [46] 289 914 440 65 561 62 196 33 55 40 200 121 616 82 135
## [61] 293 47 280 21 232 89 99 353 2003 379 42 36 42 315 751
## [76] 33 53 220 1147 1766 11 221 99 214 610 72 27 792 254 131
## [91] 253 88 250 111 64 84 465 103 10 372
## [ reached getOption("max.print") -- omitted 2747 entries ]
## chr [1:2847] "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right"| __truncated__ ...
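The marker trick described above can be tried on a toy text before applying it to the full data. The following is a minimal sketch; the example sentence and the object names txt, marked, and parts are illustrative assumptions and not part of the review data.

```r
library(stringr)

# toy text (hypothetical example, not taken from the IMDB data)
txt <- "A good movie. A better movie than most."

# step 1: attach the marker ~~~ right before each "movie"
marked <- str_replace_all(txt, "movie", "~~~movie")

# step 2: split on the marker and turn the resulting list into a vector
parts <- unlist(str_split(marked, pattern = "~~~"))

# "movie" is retained at the start of the second and third element
parts
```

Because the split happens on the marker rather than on "movie" itself, the pattern survives the split, and unlist returns a plain character vector instead of a list.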
A very useful function is tolower, which converts everything to lower case.
## textpos1000
## "gary busey is superb in this musical biography. great singing and excellent soundtrack. the buddy holly story is a much better movie than la bamba. from reading other comments, there may be some historical inaccuracies. regardless, it is a fun toe-tapping film, and a good introduction to buddy holly's music.\r\n"
Conversely, toupper converts everything to upper case.
## textpos1000
## "GARY BUSEY IS SUPERB IN THIS MUSICAL BIOGRAPHY. GREAT SINGING AND EXCELLENT SOUNDTRACK. THE BUDDY HOLLY STORY IS A MUCH BETTER MOVIE THAN LA BAMBA. FROM READING OTHER COMMENTS, THERE MAY BE SOME HISTORICAL INACCURACIES. REGARDLESS, IT IS A FUN TOE-TAPPING FILM, AND A GOOD INTRODUCTION TO BUDDY HOLLY'S MUSIC.\r\n"
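Case conversion as described above can be sketched with base R alone. The short title string and the object names below are assumptions chosen for illustration.

```r
# toy text (hypothetical example)
txt <- "The Buddy Holly Story"

# convert everything to lower case
lower <- tolower(txt)

# convert everything to upper case
upper <- toupper(txt)

lower; upper
```

Lower-casing is often applied before counting words so that "The" and "the" are treated as the same word type.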
The stringr package (see here) is part of the so-called tidyverse, a collection of packages that allows users to write R code in a readable manner, and it is the most widely used package for string processing in R. The advantage of using stringr is that it makes string processing very easy. All stringr functions share a common structure:
str_function(string, pattern)
The two arguments in the structure of stringr functions are: string, which is the character string to be processed, and pattern, which is either a simple sequence of characters, a regular expression, or a combination of both. Because the string comes first, the stringr functions are ideal for piping and thus for use in tidyverse-style R.
All function names of stringr begin with str, then an underscore and then the name of the action to be performed. For example, to replace the first occurrence of a pattern in a string, we should use str_replace(). In the following, we will use stringr functions to perform various operations on the example text. As we have already loaded the tidyverse package, we can start right away with using stringr functions as shown below.
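Like nchar in base R, str_count provides the number of characters of a text when it is called without a pattern. When a pattern is supplied, it counts the matches of that pattern instead.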
## [1] 1762 640 1041 310 3846
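The two uses of str_count can be compared on a small vector. This is a sketch under the assumption that stringr is installed; the toy texts and object names are made up for illustration.

```r
library(stringr)

# toy vector of texts (hypothetical example)
texts <- c("a fun film", "toe-tapping")

# number of characters: base R and stringr (no pattern supplied)
n_base    <- nchar(texts)
n_stringr <- str_count(texts)

# number of matches of a pattern: how often does "p" occur in each text?
n_p <- str_count(texts, "p")

n_base; n_stringr; n_p
```

For counting characters, str_length is the more explicit stringr function; str_count is mainly intended for counting pattern matches.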
The function str_detect informs about whether a pattern is present in a text and outputs a logical vector with TRUE if the pattern occurs and FALSE if it does not.
## [1] TRUE
The function str_extract_all extracts all occurrences of a pattern, if that pattern is present in a text.
## [[1]]
## [1] "and" "and"
The function str_locate_all provides the start and end positions of the matches of the pattern in a text and displays the result in matrix form.
## [[1]]
## start end
## [1,] 63 65
## [2,] 263 265
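Detecting, extracting, and locating a pattern can be sketched together on one toy sentence. The sentence and the object names are assumptions for illustration; the pattern "and" mirrors the outputs shown above.

```r
library(stringr)

# toy text (hypothetical example)
txt <- "Great singing and excellent soundtrack, and a fun film."

# is the pattern present? (logical vector)
has_and <- str_detect(txt, "and")

# all matches of the pattern (list with one element per text)
all_and <- str_extract_all(txt, "and")

# start and end positions of all matches (list of matrices)
loc_and <- str_locate_all(txt, "and")

has_and; all_and; loc_and
```

Both str_extract_all and str_locate_all return lists because each input text can contain a different number of matches.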
The function str_remove removes the first occurrence of a pattern in a text.
## [1] "Gary Busey is superb in this musical biography. Great singing excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n"
The function str_remove_all removes all occurrences of a pattern from a text.
## [1] "Gary Busey is superb in this musical biography. Great singing excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, a good introduction to Buddy Holly's music.\r\n"
The function str_replace_all replaces all occurrences of a pattern with something else in a text.
## [1] "Gary Busey is superb in this musical biography. Great singing AND excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, AND a good introduction to Buddy Holly's music.\r\n"
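The difference between removing the first match, removing all matches, and replacing all matches can be seen on a toy text. The text and the object names are illustrative assumptions; the pattern includes the trailing space so that no double spaces remain.

```r
library(stringr)

# toy text (hypothetical example)
txt <- "singing and dancing and acting"

# remove only the first occurrence of "and "
first_removed <- str_remove(txt, "and ")

# remove every occurrence of "and "
all_removed <- str_remove_all(txt, "and ")

# replace every occurrence of "and" with "AND"
all_replaced <- str_replace_all(txt, "and", "AND")

first_removed; all_removed; all_replaced
```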
Like strsplit, the function str_split splits a text when a given pattern occurs. If the pattern is an empty string (""), then the text is split into individual symbols.
## [[1]]
## [1] "Gary Busey is superb in this musical biography. Great singing "
## [2] " excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, "
## [3] " a good introduction to Buddy Holly's music.\r\n"
The function str_subset returns those elements of a vector of texts that contain a certain pattern.
## [1] "and" "and"
The function str_which provides a vector with the indices of the texts that contain a certain pattern.
## [1] 12 50
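Splitting, subsetting, and indexing can be sketched as follows. The toy text and token vector are assumptions for illustration and stand in for the review and its tokenized version.

```r
library(stringr)

# toy text and toy token vector (hypothetical examples)
txt    <- "one and two and three"
tokens <- c("one", "and", "two", "and", "three")

# split the text wherever " and " occurs (returns a list)
pieces <- str_split(txt, " and ")

# an empty pattern splits a text into individual symbols
symbols <- str_split("abc", "")

# elements of the vector that contain the pattern
hits <- str_subset(tokens, "and")

# positions of the elements that contain the pattern
idx <- str_which(tokens, "and")

pieces; symbols; hits; idx
```

Note that str_subset and str_which operate on the elements of a vector, which is why they are applied to the tokens rather than to the single text.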
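The function str_view_all shows the locations of all instances of a pattern in a text or vector of texts. In stringr 1.5.0 and later, str_view_all is deprecated and str_view highlights all matches instead.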
## [1] │ Gary Busey is superb in this musical biography. Great singing <and> excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, <and> a good introduction to Buddy Holly's music.{\r}
## │
The function str_pad adds white spaces to a text or vector of texts so that they reach a given number of symbols.
## [1] " this is a text "
The function str_trim removes white spaces from the beginning(s) and end(s) of a text or vector of texts.
## [1] "this is a text"
The function str_squish removes white spaces from the beginning(s) and end(s) of a text or vector of texts and reduces repeated white spaces within the text to a single space.
## [1] "this is a text"
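The three white-space functions can be compared on small inputs. The input strings, the target width of 20, and the object names are assumptions chosen for illustration.

```r
library(stringr)

# pad a 14-character text to 20 characters, adding spaces on both sides
padded <- str_pad("this is a text", width = 20, side = "both")

# remove white space at the beginning and end only
trimmed <- str_trim("  this is a text  ")

# remove outer white space and collapse repeated inner white space
squished <- str_squish("  this    is a   text  ")

padded; trimmed; squished
```

str_trim leaves white space inside the text untouched, whereas str_squish also normalises it.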
The function str_order provides a vector of indices that represents the alphabetical order of the texts in a vector of texts.
## [1] 1522 2196 170 2011 1320 2161 231 1771 2525 1876 272 936 1413 287 1988
## [16] 1276 40 1630 530 1727 168 127 2676 1826 2457 1420 438 2332 2696 528
## [31] 910 2808 2279 828 2162 1580 2282 942 2081 2029 1811 2440 1036 2178 258
## [46] 2024 622 1728 469 1196 1577 1606 772 1059 2227 1336 1963 313 1888 745
## [61] 1202 1713 270 883 2126 2500 1111 2220 1965 1353 331 1558 1659 552 2339
## [76] 1683 904 1144 323 2666 1803 698 70 1928 2016 2090 2747 1167 1007 1137
## [91] 1175 25 2408 1663 2359 2576 1238 364 882 177
## [ reached getOption("max.print") -- omitted 2747 entries ]
The function str_sort sorts a vector of texts alphabetically.
## [1] "...And there were quite a few of these. <br /><br />I do not like this cartoon as much as many others, partly because it was made in its period. I much prefer cartoons with Daffy and Bugs which are fifteen or so years before-hand. Many people will like this, particularly people who always find violence funny, cartoon or not.<br /><br />The basic plot is a pretty well known one for Looney Tunes: Elmer goes out hunting, Daffy leads him to Bugs and Daffy ends up being shot instead. Also inserted are quite clever and highly entertaining jokes (some do not enhance the episode), ugly shooting and animation which is slightly mediocre. The plot is mainly geared by jokes - each joke keeps the episode going. This way of plot-going is not all that unusual in Looney Tunes (of course if you are pretty much a Looney Tunes boffin - or an eager one - like me, then you'll know this already).<br /><br />For people who love everything about Looney Tunes and Daffy Duck and like the sound of what I have said about it, enjoy \"Rabbit Seasoning\"!<br /><br />7 and a half out of ten.\r\n"
## [2] "'Had Ned Kelly been born later he probably would have won a Victoria Cross at Gallipolli'. such was Ned's Bravery.<br /><br />In Australia and especially country Victoria the name Ned Kelly can be said and immediately recognised. In Greta he is still a Hero, the life Blood of the Town of Jerilderie depends on the tourism he created, but in Mansfield they still haven't forgotten that the three policeman that he 'murdered' were from there.<br /><br />Many of the buildings he visited in his life are still standing. From the Old Melbourne Gaol where he was hanged, to the Post office he held up in Jerilderie. A cell he was once held in in Greta is on display in Benella and the site of Ann Jones' Hotel, the station and even the logs where he was captured in Glenrowan can be visited.<br /><br />Evidence of all the events in the "
## [3] "'War "
## [4] "'What I Like About You' is definitely a show that I couldn't wait to see each day. Amanda Bynes is such an excellent actress and I grew up watching her show: 'The Amanda Show.' She's a very funny person and seems to be down to earth. \"Holly\" is such a like-able person and has an \"out-there\" personality. I enjoyed how she always seemed to turn things around and upside down, so she messed herself up at times. But that's what made the show so great.<br /><br />I especially loved the show when the character 'Vince' came along. Nick Zano is very HOT and funny, as well as 'Gary', Wesley Jonathan. The whole cast was great, each character had their own personality and charm. Jennie Garth, Allison Munn, and Leslie Grossman were all very interesting. I especially loved 'Lauren'; she's the best! She helped make the show extra funny and you never know what she's gonna do or say next! Overall the show is really nice but the reason I didn't give it a 10 was because there's no more new episodes and because the episodes could've been longer and more deep.\r\n"
## [5] "\" Så som i himmelen \" .. as above so below.. that very special point where Divine and Human meet. I ADORE this film ! A gem. YES amazing grace !<br /><br />I was so deeply moved by its very HUMAN quality. I laughed and cried through a whole register , indeed several octaves of emotions.<br /><br />Mikael Nyqvist ís BRILLIANT as Daniel , a first rate passionate performance, charismatic and powerful. His inner light and exceptional talent shines through in every scene, every interaction ,in every meeting. I was totally mesmerised, enchanted and caught up the story, which is our collective story, the story of life itself.<br /><br />The film was also so inclusive of many archetypes, messiah, wounded child ,magical child, artist, teacher, priest, abuser, abused, victim, bully, divine fool - ALL the characters so real and true to life - all awakened great fondness and compassion in me. <br /><br />It is a real treat to see such a thought provoking yet thoroughly enjoyable, entertaining film. Oh ..mustn't forget the heavenly choir of angels and breathtakingly beautiful sound. <br /><br />THANK YOU ALL - This Swedish film will surely captivate people world-wide. BRILLIANT !\r\n"
## [6] "\"Ah Ritchie's made another gangster film with Statham\" thought the average fan, expecting another Snatch/Lock Stock; expecting perhaps a couple of temporal shifts, but none too hard for \"me and the lads\" to swallow after a few beers.<br /><br />Ah, pay attention, you do need to watch this film. No cups of tea, no extra diet cokes from the counter, no \"keep it running\" shouts as you nip to the fridge - watch the film! No laughs other than those you may make yourself from the considerable violence (and if that floats your boat, so be it) but sharp solid direction, excellent dialogue, and great performances.<br /><br />My favourite - Big Pussy from The Sopranos, always a reliable hood.\r\n"
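That str_order and str_sort work alphabetically rather than by length can be checked on a toy vector. The vector and the object names are illustrative assumptions; sorting by length is added only for contrast.

```r
library(stringr)

# toy vector of texts (hypothetical example)
x <- c("pear", "fig", "banana")

# indices that would put the texts into alphabetical order
ord <- str_order(x)

# the texts in alphabetical order (same as x[ord])
sorted <- str_sort(x)

# for comparison: ordering by length needs str_length explicitly
by_length <- x[order(str_length(x))]

ord; sorted; by_length
```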
The function str_c combines texts into one text.
## [1] "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\nI think this is one hell of a movie...........We can see Steven fighting around with his martial art stuff again and like in all Segal movies there's a message in it, without the message it would be one of many action/fighting movies but the message is what makes segal movies great and special.\r\n"
The function str_conv declares the encoding in which a text is stored, e.g. UTF-8 or Latin1, and re-encodes the text as UTF-8 accordingly.
## [1] "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n"
The function str_flatten combines a vector of texts into one text. The argument collapse defines the symbol that occurs between the combined texts. If the argument collapse is left out, the texts will be combined without any symbol between the combined texts.
## [1] "Gary Busey is superb in this musical biography. Great singing and excellent soundtrack. The Buddy Holly Story is a much better movie than La Bamba. From reading other comments, there may be some historical inaccuracies. Regardless, it is a fun toe-tapping film, and a good introduction to Buddy Holly's music.\r\n I think this is one hell of a movie...........We can see Steven fighting around with his martial art stuff again and like in all Segal movies there's a message in it, without the message it would be one of many action/fighting movies but the message is what makes segal movies great and special.\r\n"
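The difference between str_c and str_flatten can be sketched on toy inputs. The strings and object names below are assumptions for illustration.

```r
library(stringr)

# str_c joins several texts element by element
joined <- str_c("Buddy", "Holly", sep = " ")
pairs  <- str_c(c("a", "b"), c("x", "y"))

# str_flatten collapses one vector of texts into a single text
flat_space <- str_flatten(c("a", "b", "c"), collapse = " ")
flat_plain <- str_flatten(c("a", "b", "c"))

joined; pairs; flat_space; flat_plain
```

str_c is vectorised over its arguments, so two vectors of length two yield two joined texts, whereas str_flatten always returns a single text.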
The function str_length provides the length of texts in characters.
## [1] 311
The function str_sub extracts a string from a text from a start position to an end position (expressed as character positions).
## [1] " Busey is superb in t"
The function word extracts words from a text (expressed as word positions).
## [1] "Busey" "is" "superb" "in" "this" "musical"
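Length, character positions, and word positions can be sketched on the first sentence of the review. The positions 5 to 25 and 2 to 7 are assumptions chosen so that the results resemble the outputs shown above; the object names are illustrative.

```r
library(stringr)

# first sentence of the example review
txt <- "Gary Busey is superb in this musical biography."

# length of the text in characters
len <- str_length(txt)

# characters 5 to 25
chunk <- str_sub(txt, start = 5, end = 25)

# words 2 to 7 as separate elements
single_words <- word(txt, 2:7)

# words 2 to 3 as one string
word_span <- word(txt, start = 2, end = 3)

len; chunk; single_words; word_span
```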
5.1 Extracting frequency information from text
Frequency lists are very basic but also important when analysing text. Fortunately, it is very easy to extract frequency information and to create frequency lists with R. We can do this by first using the unnest_tokens function, which splits texts into individual words, and then using the count function to get the raw frequencies of all word types in a text.
reviews_pos %>%
# convert to data frame
as.data.frame()%>%
# give name column with text
dplyr::rename(text = 1) %>%
# tokenise
tidytext::unnest_tokens(word, text) %>%
# count tokens
dplyr::count(word, sort = TRUE) %>%
# inspect 20 most frequent tokens
head(20)
## word n
## 1 the 13395
## 2 and 6869
## 3 a 6487
## 4 of 5998
## 5 to 5153
## 6 is 4351
## 7 br 3968
## 8 in 3827
## 9 it 2982
## 10 i 2935
## 11 this 2845
## 12 that 2552
## 13 as 2025
## 14 with 1817
## 15 was 1709
## 16 for 1699
## 17 but 1571
## 18 film 1558
## 19 movie 1527
## 20 on 1436
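The logic behind such a frequency list (lower-case, tokenise, count, sort) can also be sketched in base R without tidytext. This is a simplified alternative and not the pipeline used above; the toy texts and object names are assumptions, and the tokenisation rule (split on anything that is not a letter from a to z) is deliberately crude.

```r
# toy texts (hypothetical examples)
texts <- c("A fun film.", "A good film, a fun one.")

# lower-case the texts and split them on non-letter sequences
tokens <- unlist(strsplit(tolower(texts), "[^a-z]+"))

# drop empty strings that can arise at the start of a text
tokens <- tokens[tokens != ""]

# count the word types and sort by frequency
freq <- sort(table(tokens), decreasing = TRUE)

freq
```

Unlike this sketch, unnest_tokens handles non-ASCII letters and apostrophes more carefully, so the tidytext pipeline remains preferable for real data.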
Extracting n-grams is also very easy as the unnest_tokens function can take an argument called token in which we can specify that we want to extract n-grams. If we do this, then we need to specify the n as a separate argument. Below we specify that we want the frequencies of all 3-grams (trigrams).
reviews_pos %>%
# convert to data frame
as.data.frame()%>%
# give name column with text
dplyr::rename(text = 1) %>%
# clean data
dplyr::mutate(text = str_remove_all(text, "<.*?>")) %>%
# tokenise and extract trigrams
tidytext::unnest_tokens(word, text, token = "ngrams", n = 3) %>%
# count tokens
dplyr::count(word, sort = TRUE) %>%
# inspect ten most frequent tri-grams
head(10)
## word n
## 1 one of the 219
## 2 this is a 127
## 3 some of the 96
## 4 is one of 92
## 5 of the film 92
## 6 this movie is 89
## 7 a lot of 85
## 8 this film is 70
## 9 of the best 68
## 10 of the movie 68
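What an n-gram is can be made concrete with a small base R sketch that slides a window of n tokens over a token vector. This is an illustration of the idea and not how unnest_tokens is implemented; the token vector and object names are assumptions.

```r
# toy token vector (hypothetical example)
tokens <- c("one", "of", "the", "best", "films")

# size of the n-grams
n <- 3

# start positions of all complete windows of n tokens
starts <- seq_len(length(tokens) - n + 1)

# paste the n tokens of each window into one string
trigrams <- vapply(
  starts,
  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
  character(1)
)

trigrams
```

A text with k tokens yields k - n + 1 n-grams, so five tokens produce three trigrams. The sketch assumes that the text has at least n tokens.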
5.2 Regular Expressions
In this section, we focus on regular expressions (to learn more about regular expressions, have a look at this highly recommended tutorial). Regular expressions are powerful tools used to search and manipulate text patterns. They provide a way to find specific sequences of characters within larger bodies of text.
There are two basic types of regular expressions:
regular expressions that stand for frequencies (quantifiers)
regular expressions that stand for classes of symbols (types)
The regular expressions below show the first type of regular expressions, i.e. quantifiers.
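The behaviour of the common quantifiers can be sketched with str_detect. The test words are made up for illustration, and the anchors ^ and $ are added as an assumption so that each pattern has to match the whole word.

```r
library(stringr)

# toy words with zero, one, two, and three a's (hypothetical examples)
x <- c("wlk", "walk", "waalk", "waaalk")

q_optional <- str_detect(x, "^wa?lk$")      # ?     : zero or one a
q_star     <- str_detect(x, "^wa*lk$")      # *     : zero or more a's
q_plus     <- str_detect(x, "^wa+lk$")      # +     : one or more a's
q_exact    <- str_detect(x, "^wa{2}lk$")    # {2}   : exactly two a's
q_min      <- str_detect(x, "^wa{2,}lk$")   # {2,}  : two or more a's
q_range    <- str_detect(x, "^wa{1,2}lk$")  # {1,2} : one or two a's

data.frame(x, q_optional, q_star, q_plus, q_exact, q_min, q_range)
```

Each quantifier applies to the element directly before it, here the single character a.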
The regular expressions below show the second type of regular expressions, i.e. types.
Types can be expanded to include structural properties as shown below.
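Classes of symbols and structural properties can be sketched in the same way. The examples below use a few POSIX character classes and the anchors ^ (beginning) and $ (end); the toy vector and object names are assumptions, and the selection of classes is not exhaustive.

```r
library(stringr)

# toy vector (hypothetical examples)
x <- c("Holly", "1959", "toe-tapping", "a")

# classes of symbols
has_digit  <- str_detect(x, "[[:digit:]]")     # any digit
has_upper  <- str_detect(x, "[[:upper:]]")     # any upper case letter
has_punct  <- str_detect(x, "[[:punct:]]")     # any punctuation symbol
only_alpha <- str_detect(x, "^[[:alpha:]]+$")  # letters only

# structural properties
starts_t <- str_detect(x, "^t")  # begins with t
ends_y   <- str_detect(x, "y$")  # ends in y

data.frame(x, has_digit, has_upper, has_punct, only_alpha, starts_t, ends_y)
```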
5.3 Practice: regular expressions
We now want to show all words in the tokenized review that contain y.
## [1] "Gary" "Busey" "biography" "Buddy" "Holly" "Story"
## [7] "may" "Buddy" "Holly's"
Show all words in the tokenized review that begin with a lower case a.
## [1] "and" "a" "a" "and" "a"
Show all words in the tokenized review that end in a lower case s.
## [1] "is" "this" "is" "comments" "inaccuracies"
## [6] "Regardless" "is" "Holly's"
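The first three tasks can be approached as sketched below. The token vector is a made-up stand-in for review_tok, and str_subset is one possible way to select matching tokens; this is not necessarily the code that produced the outputs above.

```r
library(stringr)

# toy token vector (hypothetical stand-in for review_tok)
tokens <- c("Gary", "is", "a", "fan", "and", "Holly's", "music", "plays")

# words that contain y
with_y <- str_subset(tokens, "y")

# words that begin with a lower case a
starts_a <- str_subset(tokens, "^a")

# words that end in a lower case s
ends_s <- str_subset(tokens, "s$")

with_y; starts_a; ends_s
```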
Show all words in the tokenized review in which there is an e, then any other character, and then an n.
## [1] "Holly" "Holly's"
Show all words in the tokenized review in which there is an e, then two other characters, and then an n.
## [1] "Holly" "Holly's"
Show all words that consist of exactly three alphabetical characters in the tokenized review.
## [1] "and" "The" "may" "fun" "and"
Show all words that consist of six or more alphabetical characters in the tokenized review.
## [1] "superb" "musical" "biography" "singing" "excellent"
## [6] "soundtrack" "better" "reading" "comments" "historical"
## [11] "inaccuracies" "Regardless" "introduction"
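Patterns for these four tasks can be sketched on a toy token vector. The dot stands for any single character, and the quantifiers {3} and {6,} restrict the number of letters; the tokens and object names are assumptions and not the tokens of the review.

```r
library(stringr)

# toy token vector (hypothetical examples)
tokens <- c("seen", "even", "fun", "The", "may", "superb", "excellent", "film")

# an e, then any one character, then an n
e_1_n <- str_subset(tokens, "e.n")

# an e, then any two characters, then an n
e_2_n <- str_subset(tokens, "e..n")

# exactly three alphabetical characters
three_letters <- str_subset(tokens, "^[[:alpha:]]{3}$")

# six or more alphabetical characters
six_plus <- str_subset(tokens, "^[[:alpha:]]{6,}$")

e_1_n; e_2_n; three_letters; six_plus
```

Without the anchors ^ and $, the length patterns would also match longer words that merely contain three (or six) letters in a row.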
Replace every lower case a with an upper case E in the review.
## [1] "GEry Busey is superb in this musicEl biogrEphy. GreEt singing End excellent soundtrEck. The Buddy Holly Story is E much better movie thEn LE BEmbE. From reEding other comments, there mEy be some historicEl inEccurEcies. RegErdless, it is E fun toe-tEpping film, End E good introduction to Buddy Holly's music.\r\n"
Remove all non-alphabetical characters in the tokenized review.
## [1] "Gary" "Busey" "is" "superb" "in"
## [6] "this" "musical" "biography" "" "Great"
## [11] "singing" "and" "excellent" "soundtrack" ""
## [16] "The" "Buddy" "Holly" "Story" "is"
## [21] "a" "much" "better" "movie" "than"
## [26] "La" "Bamba" "" "From" "reading"
## [31] "other" "comments" "" "there" "may"
## [36] "be" "some" "historical" "inaccuracies" ""
## [41] "Regardless" "" "it" "is" "a"
## [46] "fun" "toetapping" "film" "" "and"
## [51] "a" "good" "introduction" "to" "Buddy"
## [56] "Hollys" "music" ""
Remove all spaces in the review.
## [1] "GaryBuseyissuperbinthismusicalbiography.Greatsingingandexcellentsoundtrack.TheBuddyHollyStoryisamuchbettermoviethanLaBamba.Fromreadingothercomments,theremaybesomehistoricalinaccuracies.Regardless,itisafuntoe-tappingfilm,andagoodintroductiontoBuddyHolly'smusic.\r\n"
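The last three tasks can be sketched as follows. The toy text and tokens are assumptions for illustration. The final line shows that removing " " only deletes blanks, whereas the class \\s also deletes tabs and line breaks such as the \r\n at the end of the review.

```r
library(stringr)

# toy text and toy tokens (hypothetical examples)
txt    <- "A fun, toe-tapping film."
tokens <- c("toe-tapping", "Holly's", ".")

# replace every lower case a with an upper case E
a_to_E <- str_replace_all(txt, "a", "E")

# remove all non-alphabetical characters from the tokens
letters_only <- str_remove_all(tokens, "[^[:alpha:]]")

# remove all spaces
no_spaces <- str_remove_all(txt, " ")

# remove all white space, including tabs and line breaks
no_whitespace <- str_remove_all("a b\tc\r\n", "\\s")

a_to_E; letters_only; no_spaces; no_whitespace
```

Tokens that consist only of punctuation become empty strings after the removal, which explains the "" entries in the output of the non-alphabetical task above.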