Week 5 Getting Started with R and RStudio 2
This week, we will continue to explore R and RStudio.
5.1 Working with tables
5.1.1 Tabulating data
We can use the table function to create basic tables that extract raw frequency information. The following command tells us how many instances there are of each level of the variable date in the icebio data set.
TIP
To access a specific column of a data frame, type the name of the data set followed by a $ symbol and then the name of the column (or variable).
table(icebio$date)
##
## 1990-1994 1995-2001 2002-2005
##       905        67       270
Alternatively, you could index the column by its position in the data set like this: icebio[, 6]. The results of table(icebio[, 6]) and table(icebio$date) are the same! Note that we leave the row index (the part before the comma) empty to tell R that we want all rows.
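If you want to convince yourself that both ways of accessing a column give the same counts, you can try the following minimal sketch. The toy data frame and its values are invented for illustration; they are not the icebio data.

```r
# Toy data frame standing in for icebio (invented values, for illustration only)
toy <- data.frame(
  id   = 1:5,
  date = c("1990-1994", "2002-2005", "1990-1994", "1995-2001", "1990-1994"),
  sex  = c("male", "female", "female", "male", "male")
)

# access the column by name
by_name <- table(toy$date)
# access the same column by position (date is the 2nd column of the toy data);
# the empty slot before the comma means "all rows"
by_position <- table(toy[, 2])

# check that both tables contain the same frequencies
all(by_name == by_position)
```

Accessing columns by name is usually the safer choice, because the code keeps working if the order of the columns changes.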
When you want to cross-tabulate columns, it is often better to use the ftable function (ftable stands for flat table, i.e. a flat contingency table).
ftable(icebio$age, icebio$sex)
##        female male
##
## 0-18        5    7
## 19-25     163   65
## 26-33      83   36
## 34-41      35   58
## 42-49      35   97
## 50+        63  138
EXERCISE TIME!
- Using the table function, how many women are in the data collected between 2002 and 2005?
Answer
table(icebio$date, icebio$sex)
##
##             female male
##   1990-1994    338  562
##   1995-2001      4   58
##   2002-2005    186   84
- Using the ftable function, how many men are from Northern Ireland in the data collected between 1990 and 1994?
Answer
ftable(icebio$date, icebio$zone, icebio$sex)
##                                    female male
##
## 1990-1994 mixed between ni and roi     18   13
##           non-corpus speaker            7   22
##           northern ireland            104  289
##           republic of ireland         209  238
## 1995-2001 mixed between ni and roi      0    0
##           non-corpus speaker            1    1
##           northern ireland              2   36
##           republic of ireland           1   21
## 2002-2005 mixed between ni and roi     19    7
##           non-corpus speaker            7    9
##           northern ireland            122   41
##           republic of ireland          38   27
5.1.2 Saving data to your computer
To save tabular data on your computer, you can use the write.table function. This function takes the data that you want to save as its first argument and the location where you want to save the data as its second argument; the type of delimiter is set with the sep argument.
write.table(icebio, here::here("data", "icebio.txt"), sep = "\t")
A word about paths
In the code chunk above, the sequence here::here("data", "icebio.txt") is a handy way to define a path. A path is simply the location where a file is stored on your computer or on the internet (which typically is a server - which is just a fancy term for a computer - somewhere on the globe). The here function from the here package allows you to simply state in which folder a certain file is and what file you are talking about.
In this case, we want to access the file icebio (which is a txt file and thus has the extension .txt) in the data folder. With here, R starts looking in the folder in which your project is stored. If you want to access a file that is stored somewhere else on your computer, you can also define the full path to the folder in which the file is. In my case, this would be D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data. However, as the data folder is in the folder where my Rproj file is, I only need to specify that the file is in the data folder within the folder in which my Rproj file is located.
A word about package naming
Another thing that is notable in the sequence here::here("data", "icebio.txt") is that I specified that the here function is part of the here package. This is what I meant by writing here::here, which simply means: use the here function from the here package (package::function). This may appear somewhat redundant, but it happens quite frequently that different packages have functions with the same name. In such cases, R will simply choose the function from the package that was loaded last. To prevent R from using the wrong function, it makes sense to specify the package AND the function (as I did in the sequence here::here). I only use functions without specifying the package if the function is part of base R.
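The following minimal sketch illustrates why the package::function notation is useful. Instead of loading two packages with clashing names, it simply defines its own function called sd, which then hides the sd function from the stats package. This is a simplified stand-in for a name clash between packages, and the replacement function is purely hypothetical.

```r
# Define our own function that happens to share its name with stats::sd
# (a hypothetical stand-in for a name clash between two packages)
sd <- function(x) "my own sd function"

values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# without a package prefix, R uses the most recently defined sd
own_result <- sd(values)
# with the prefix, R is told explicitly to use sd from the stats package
pkg_result <- stats::sd(values)

own_result
pkg_result

# clean up so that sd refers to stats::sd again
rm(sd)
```

With the prefix, the code behaves the same no matter what else has been loaded or defined in the session.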
5.1.3 Loading data from your computer
To load tabular data from within your project folder (if it is in a tab-separated txt-file), you can also use the read.delim function. The only difference to loading from the web is that you use a path instead of a URL. If the txt-file is in the folder called data in your project folder, you would load the data as shown below.
icebio <- read.delim(here::here("data", "icebio.txt"), sep = "\t", header = TRUE)
However, you can always just use the full path (and you must do this if the data is not in your project folder).
NOTE
You may have to change the path to the data!
icebio <- read.delim(here::here("data", "icebio.txt"),
                     sep = "\t", header = TRUE)
To check if this has worked, we will use the head function to see the first 6 rows of the data.
head(icebio)
## id file.speaker.id text.id spk.ref zone date sex age
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994 male 34-41
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994 female 34-41
## 3 3 <S1A-002$?> S1A-002 ? <NA> <NA> <NA> <NA>
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005 female 26-33
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005 female 19-25
## 6 6 <S1A-002$C> S1A-002 C northern ireland 2002-2005 male 50+
## word.count
## 1 765
## 2 1298
## 3 23
## 4 391
## 5 47
## 6 200
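If you would like to practise saving and loading without touching your project folder, the following sketch writes a small table to a temporary file and reads it back in. The toy data and the use of tempfile are assumptions made for illustration. The argument row.names = FALSE is also an addition here: it stops write.table from writing the row numbers as an extra column.

```r
# Toy table with invented values (not the real icebio data)
toy <- data.frame(
  id         = 1:3,
  zone       = c("northern ireland", "republic of ireland", "northern ireland"),
  word.count = c(765, 1298, 23)
)

# a temporary file that R creates for this session only
tmp <- tempfile(fileext = ".txt")

# save the table as a tab-separated text file
write.table(toy, tmp, sep = "\t", row.names = FALSE)

# load the table again from the same path
toy_loaded <- read.delim(tmp, sep = "\t", header = TRUE)

# inspect the loaded data
head(toy_loaded)
```

In your own project, you would replace tmp with a path such as here::here("data", "icebio.txt").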
5.1.4 Loading Excel data
To load Excel spreadsheets, you can use the read_excel function from the readxl package as shown below. However, it may be necessary to install and activate the readxl package first.
icebio <- readxl::read_excel(here::here("data", "ICEdata.xlsx"))
We now briefly check column names to see if the loading of the data has worked.
colnames(icebio)
## [1] "id" "file.speaker.id" "text.id" "spk.ref"
## [5] "zone" "date" "sex" "age"
## [9] "word.count"
5.1.5 Loading text data
There are many functions that we can use to load text data into R. For example, we can use the readLines function as shown below.
text <- readLines(here::here("data", "text2.txt"))
# inspect first text element
text[1]
## [1] "The book is presented as a manuscript written by its protagonist, a middle-aged man named Harry Haller, who leaves it to a chance acquaintance, the nephew of his landlady. The acquaintance adds a short preface of his own and then has the manuscript published. The title of this \"real\" book-in-the-book is Harry Haller's Records (For Madmen Only)."
To load many texts, we can use a loop to read all texts in a folder as shown below. In a first step, we define the paths of the texts and then we use the sapply function to loop over the paths and read them into R.
# define paths
paths <- list.files(here::here("data/testcorpus"), full.names = TRUE)
# load texts
texts <- sapply(paths, function(x){ readLines(x) })
# inspect first text element
texts[1]
## $<NA>
## NULL
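To see what the sapply approach returns when the folder actually contains text files, you can run the following self-contained sketch. It creates two small text files in a temporary folder first; the folder name toycorpus and the file contents are invented for illustration.

```r
# create a small toy corpus in a temporary folder (invented example texts)
corpus_dir <- file.path(tempdir(), "toycorpus")
dir.create(corpus_dir, showWarnings = FALSE)
writeLines("This is the first text.", file.path(corpus_dir, "text1.txt"))
writeLines(c("This is the second text.", "It has two lines."),
           file.path(corpus_dir, "text2.txt"))

# define paths (only files ending in .txt)
paths <- list.files(corpus_dir, full.names = TRUE, pattern = "\\.txt$")

# load texts: read each file and collapse its lines into one string
texts <- sapply(paths, function(x) paste(readLines(x), collapse = " "))

# use the file names (rather than the full paths) as element names
names(texts) <- basename(paths)

# inspect first text element
texts[1]
```

Collapsing the lines with paste means that each file ends up as exactly one element, so texts is a simple character vector with one entry per file.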
A method achieving the same result which uses piping (more on what that is below) and tidyverse R code is shown below.
# define paths and load texts
texts <- list.files(here::here("data/testcorpus"), full.names = TRUE, pattern = ".*txt") %>%
  purrr::map_chr(~ readr::read_file(.))
# inspect first text element
texts[1]
## [1] NA
5.1.6 Renaming, Piping, and Filtering
To rename existing columns in a table, you can use the rename function from the dplyr package. It takes the table as the first argument, followed by the new name, an equal sign (=), and the old name. For example, renaming a column OldName as NewName in a table called MyTable would look like this: rename(MyTable, NewName = OldName).
Piping is done using the %>% sequence and it can be translated as and then. In the example below, we create a new object (icebio_edit) from the existing object (icebio) and then we rename the columns in the new object. When we use piping, we do not need to name the data we are using as this is provided by the previous step.
icebio_edit <- icebio %>%
  dplyr::rename(Id = id,
                FileSpeakerId = file.speaker.id,
                File = colnames(icebio)[3],
                Speaker = colnames(icebio)[4])
# inspect data
icebio_edit[1:5, 1:6]
## # A tibble: 5 × 6
## Id FileSpeakerId File Speaker zone date
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994
## 3 3 <S1A-002$?> S1A-002 ? NA NA
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005
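To make the and then reading of the pipe more concrete, the sketch below performs the same two steps (renaming a column, then keeping the first two rows) once with nested function calls and once with the pipe. The toy data frame is invented for illustration, and the sketch assumes that the dplyr package is installed.

```r
library(dplyr)

# toy data with invented values
toy <- data.frame(id = 1:3, spk.ref = c("A", "B", "C"))

# without piping: the calls are nested and have to be read from the inside out
nested <- head(rename(toy, Speaker = spk.ref), 2)

# with piping: take toy, and then rename, and then keep the first two rows
piped <- toy %>%
  rename(Speaker = spk.ref) %>%
  head(2)

# both versions produce the same result
identical(nested, piped)
```

The piped version lists the steps in the order in which they happen, which is why it tends to be easier to read once more than two steps are involved.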
A very handy way to rename many columns simultaneously is the str_to_title function, which capitalizes the first letter of each word and converts the remaining letters to lower case. In the example below, we apply it to all column names of our current data.
colnames(icebio_edit) <- stringr::str_to_title(colnames(icebio_edit))
# inspect data
icebio_edit[1:5, 1:6]
## # A tibble: 5 × 6
## Id Filespeakerid File Speaker Zone Date
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994
## 3 3 <S1A-002$?> S1A-002 ? NA NA
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005
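To remove rows based on values in columns you can use the filter function. In the example below, we keep only those rows where the speaker is known, the zone is not missing, the date is 2002-2005, and the word count is higher than 5.
icebio_edit2 <- icebio_edit %>%
  dplyr::filter(Speaker != "?",
                !is.na(Zone),
                Date == "2002-2005",
                Word.count > 5)
# inspect data
head(icebio_edit2)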
## # A tibble: 6 × 9
## Id Filespeakerid File Speaker Zone Date Sex Age Word.count
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 4 <S1A-002$A> S1A-002 A northern ire… 2002… fema… 26-33 391
## 2 5 <S1A-002$B> S1A-002 B northern ire… 2002… fema… 19-25 47
## 3 6 <S1A-002$C> S1A-002 C northern ire… 2002… male 50+ 200
## 4 7 <S1A-002$D> S1A-002 D northern ire… 2002… fema… 50+ 464
## 5 8 <S1A-002$E> S1A-002 E mixed betwee… 2002… male 34-41 639
## 6 9 <S1A-002$F> S1A-002 F northern ire… 2002… fema… 26-33 308
To select specific columns you can use the select function.
icebio_selection <- icebio_edit2 %>%
  dplyr::select(File, Speaker, Word.count)
# inspect data
head(icebio_selection)
## # A tibble: 6 × 3
## File Speaker Word.count
## <chr> <chr> <dbl>
## 1 S1A-002 A 391
## 2 S1A-002 B 47
## 3 S1A-002 C 200
## 4 S1A-002 D 464
## 5 S1A-002 E 639
## 6 S1A-002 F 308
You can also use the select function to remove specific columns.
icebio_selection2 <- icebio_edit2 %>%
  dplyr::select(-Id, -File, -Speaker, -Date, -Zone, -Age)
# inspect data
head(icebio_selection2)
## # A tibble: 6 × 3
## Filespeakerid Sex Word.count
## <chr> <chr> <dbl>
## 1 <S1A-002$A> female 391
## 2 <S1A-002$B> female 47
## 3 <S1A-002$C> male 200
## 4 <S1A-002$D> female 464
## 5 <S1A-002$E> male 639
## 6 <S1A-002$F> female 308
5.1.7 Ordering data
To order data, for instance in ascending order according to a specific column, you can use the arrange function.
icebio_ordered_asc <- icebio_selection2 %>%
  dplyr::arrange(Word.count)
# inspect data
head(icebio_ordered_asc)
## # A tibble: 6 × 3
## Filespeakerid Sex Word.count
## <chr> <chr> <dbl>
## 1 <S1B-009$D> female 6
## 2 <S1B-005$C> female 7
## 3 <S1B-009$C> male 7
## 4 <S1B-020$F> male 7
## 5 <S1B-006$G> female 9
## 6 <S2A-050$B> male 9
To order data in descending order you can also use the arrange function and simply add a - before the column according to which you want to order the data.
icebio_ordered_desc <- icebio_selection2 %>%
  dplyr::arrange(-Word.count)
# inspect data
head(icebio_ordered_desc)
## # A tibble: 6 × 3
## Filespeakerid Sex Word.count
## <chr> <chr> <dbl>
## 1 <S2A-055$A> female 2355
## 2 <S2A-047$A> male 2340
## 3 <S2A-035$A> female 2244
## 4 <S2A-048$A> male 2200
## 5 <S2A-015$A> male 2172
## 6 <S2A-054$A> female 2113
The output shows that the female speaker in file S2A-055 with the speaker identity A has the highest word count with 2,355 words.
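Note that the - shortcut only works for numeric columns. For text columns, the desc function from dplyr can be used instead, and it works for numeric columns, too. The following sketch shows both cases; the toy data is invented for illustration and the sketch assumes that dplyr is installed.

```r
library(dplyr)

# toy data with invented values
toy <- data.frame(Speaker    = c("A", "B", "C"),
                  Word.count = c(391, 47, 200))

# descending order by a numeric column
by_count <- toy %>%
  dplyr::arrange(dplyr::desc(Word.count))

# descending (reverse alphabetical) order by a text column
by_speaker <- toy %>%
  dplyr::arrange(dplyr::desc(Speaker))

# inspect data
by_count
by_speaker
```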
EXERCISE TIME!
- Using the data called icebio, create a new data set called ICE_Ire_ordered and arrange the data in descending order by the number of words that each speaker has uttered. Who is the speaker with the highest word count?
Answer
ICE_Ire_ordered <- icebio %>%
  dplyr::arrange(-word.count)
# inspect data
head(ICE_Ire_ordered)
## # A tibble: 6 × 9
## id file.speaker.id text.id spk.ref zone date sex age word.count
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 956 <S2A-037$A> S2A-037 A republic o… 1990… male NA 2565
## 2 919 <S2A-016$A> S2A-016 A republic o… 1995… fema… 34-41 2482
## 3 933 <S2A-023$A> S2A-023 A northern i… 1990… male 50+ 2367
## 4 992 <S2A-055$A> S2A-055 A northern i… 2002… fema… 42-49 2355
## 5 979 <S2A-047$A> S2A-047 A republic o… 2002… male 50+ 2340
## 6 997 <S2A-059$A> S2A-059 A republic o… 1990… fema… NA 2305
5.1.8 Creating and changing variables
New columns are created, and existing columns can be changed, by using the mutate function. The mutate function takes two arguments (if the data does not have to be specified): the first argument is the (new) name of the column that you want to create and the second is what you want to store in that column. The = tells R that the new column will contain the result of the second argument.
In the example below, we create a new column called Texttype. This new column should contain
- the value PrivateDialoge if Filespeakerid contains the sequence S1A,
- the value PublicDialogue if Filespeakerid contains the sequence S1B,
- the value UnscriptedMonologue if Filespeakerid contains the sequence S2A,
- the value ScriptedMonologue if Filespeakerid contains the sequence S2B,
- the value of Filespeakerid if Filespeakerid contains neither S1A, S1B, S2A, nor S2B.
icebio_texttype <- icebio_selection2 %>%
  dplyr::mutate(Texttype =
                  dplyr::case_when(stringr::str_detect(Filespeakerid, "S1A") ~ "PrivateDialoge",
                                   stringr::str_detect(Filespeakerid, "S1B") ~ "PublicDialogue",
                                   stringr::str_detect(Filespeakerid, "S2A") ~ "UnscriptedMonologue",
                                   stringr::str_detect(Filespeakerid, "S2B") ~ "ScriptedMonologue",
                                   TRUE ~ Filespeakerid))
# inspect data
head(icebio_texttype)
## # A tibble: 6 × 4
## Filespeakerid Sex Word.count Texttype
## <chr> <chr> <dbl> <chr>
## 1 <S1A-002$A> female 391 PrivateDialoge
## 2 <S1A-002$B> female 47 PrivateDialoge
## 3 <S1A-002$C> male 200 PrivateDialoge
## 4 <S1A-002$D> female 464 PrivateDialoge
## 5 <S1A-002$E> male 639 PrivateDialoge
## 6 <S1A-002$F> female 308 PrivateDialoge
5.1.9 If-statements
We should briefly talk about if-statements (or case_when in the present case). The case_when function is both very powerful and extremely helpful as it allows you to assign values based on a test. As such, case_when-statements can be read as:
When/If X is the case, then do A and if X is not the case do B! (When/If -> Then -> Else)
The nice thing about ifelse or case_when-statements is that they can be used in succession as we have done above. This can then be read as:
If X is the case, then do A, if Y is the case, then do B, else do Z
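The chained reading above can be sketched with the base R ifelse function. In this minimal example, the first test handles missing values, the second test assigns old, and everything else becomes young. The age values are invented for illustration.

```r
# toy vector of age groups (invented values), including one missing value
age <- c("0-18", "26-33", "42-49", "50+", NA)

# If age is missing, then keep it missing;
# if age is 42-49 or 50+, then assign "old";
# else assign "young"
age_group <- ifelse(is.na(age), NA_character_,
                    ifelse(age %in% c("42-49", "50+"), "old", "young"))

# inspect result
age_group
```

Testing for missing values first matters here: without that first step, the missing value would fail the second test and would therefore end up being labelled young.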
EXERCISE TIME!
- Using the data called icebio, create a new data set called ICE_Ire_AgeGroup in which you create a column called AgeGroup where all speakers who are younger than 42 have the value young and all speakers aged 42 and over have the value old.
Tip: use if-statements to assign the old and young values.
Answer
ICE_Ire_AgeGroup <- icebio %>%
  dplyr::mutate(AgeGroup = dplyr::case_when(age == "42-49" ~ "old",
                                            age == "50+" ~ "old",
                                            age == "0-18" ~ "young",
                                            age == "19-25" ~ "young",
                                            age == "26-33" ~ "young",
                                            age == "34-41" ~ "young",
                                            TRUE ~ age))
# inspect data
head(ICE_Ire_AgeGroup); table(ICE_Ire_AgeGroup$AgeGroup)
## # A tibble: 6 × 10
## id file.speaker.id text.id spk.ref zone date sex age word.count
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 <S1A-001$A> S1A-001 A northern i… 1990… male 34-41 765
## 2 2 <S1A-001$B> S1A-001 B northern i… 1990… fema… 34-41 1298
## 3 3 <S1A-002$?> S1A-002 ? NA NA NA NA 23
## 4 4 <S1A-002$A> S1A-002 A northern i… 2002… fema… 26-33 391
## 5 5 <S1A-002$B> S1A-002 B northern i… 2002… fema… 19-25 47
## 6 6 <S1A-002$C> S1A-002 C northern i… 2002… male 50+ 200
## # … with 1 more variable: AgeGroup <chr>
##
## NA old young
## 547 333 452
5.1.10 Summarizing data
To condense a column into a single summary value (for example a sum or a mean), you can use the summarise function.
icebio_summary1 <- icebio_texttype %>%
  dplyr::summarise(Words = sum(Word.count))
# inspect data
head(icebio_summary1)
## # A tibble: 1 × 1
## Words
## <dbl>
## 1 141876
To get summaries of sub-groups or by variable level, we can use the group_by function and then use the summarise function.
icebio_summary2 <- icebio_texttype %>%
  dplyr::group_by(Texttype, Sex) %>%
  dplyr::summarise(Speakers = dplyr::n(),
                   Words = sum(Word.count))
# inspect data
head(icebio_summary2)
## # A tibble: 6 × 4
## # Groups: Texttype [3]
## Texttype Sex Speakers Words
## <chr> <chr> <int> <dbl>
## 1 PrivateDialoge female 105 60024
## 2 PrivateDialoge male 18 9628
## 3 PublicDialogue female 63 24647
## 4 PublicDialogue male 41 16783
## 5 UnscriptedMonologue female 3 6712
## 6 UnscriptedMonologue male 16 24082
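The grouping logic is easier to check by hand on a very small table. The sketch below uses invented values and assumes that dplyr is installed. It also sets the .groups argument of summarise to "drop", which returns an ungrouped table; this is an optional addition that is not used in the example above.

```r
library(dplyr)

# toy data with invented values
toy <- data.frame(
  Texttype   = c("PrivateDialogue", "PrivateDialogue",
                 "PublicDialogue", "PublicDialogue", "PublicDialogue"),
  Sex        = c("female", "male", "female", "female", "male"),
  Word.count = c(391, 200, 47, 464, 639)
)

toy_summary <- toy %>%
  # form one group for each combination of Texttype and Sex
  dplyr::group_by(Texttype, Sex) %>%
  # count the rows and sum the words within each group
  dplyr::summarise(Speakers = dplyr::n(),
                   Words = sum(Word.count),
                   .groups = "drop")

# inspect data
toy_summary
```

Here, the two female speakers in PublicDialogue are combined into one row with 2 speakers and 47 + 464 = 511 words.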
EXERCISE TIME!
- Use the icebio data and determine the number of words uttered by female speakers from Northern Ireland aged 50 and over.
Answer
words_fni50 <- icebio %>%
  dplyr::select(zone, sex, age, word.count) %>%
  dplyr::group_by(zone, sex, age) %>%
  dplyr::summarize(Words = sum(word.count)) %>%
  dplyr::filter(sex == "female",
                age == "50+",
                zone == "northern ireland")
## `summarise()` has grouped output by 'zone', 'sex'. You can override using the
## `.groups` argument.
# inspect data
words_fni50
## # A tibble: 1 × 4
## # Groups: zone, sex [1]
## zone sex age Words
## <chr> <chr> <chr> <dbl>
## 1 northern ireland female 50+ 23210
- Load the file exercisedata.txt and determine the mean scores of groups A and B.
Tip: to extract the mean, combine the summarise function with the mean function.
Answer
exercisedata <- read.delim("data/exercisedata.txt", sep = "\t", header = TRUE) %>%
  dplyr::group_by(Group) %>%
  dplyr::summarize(Mean = mean(Score))
# inspect data
exercisedata
## # A tibble: 2 × 2
## Group Mean
## <chr> <dbl>
## 1 A 14.9
## 2 B 11.8
5.1.11 Gathering and spreading data
The tidyr package has two very useful functions for gathering and spreading data that can be used to transform data to long and wide formats (you will see what this means below). The functions are called gather and spread.
We will use the data set called icebio_summary2, which we created above, to demonstrate how this works.
We will first check out the spread function to create different columns for women and men that show how many of them are represented in the different text types.
icebio_summary_wide <- icebio_summary2 %>%
  dplyr::select(-Words) %>%
  tidyr::spread(Sex, Speakers)
# inspect
icebio_summary_wide
## # A tibble: 3 × 3
## # Groups: Texttype [3]
## Texttype female male
## <chr> <int> <int>
## 1 PrivateDialoge 105 18
## 2 PublicDialogue 63 41
## 3 UnscriptedMonologue 3 16
The data is now in what is called a wide format as values are distributed across columns.
To reformat this back to a long format where each column represents exactly one variable, we use the gather function:
icebio_summary_long <- icebio_summary_wide %>%
  tidyr::gather(Sex, Speakers, female:male)
# inspect
icebio_summary_long
## # A tibble: 6 × 3
## # Groups: Texttype [3]
## Texttype Sex Speakers
## <chr> <chr> <int>
## 1 PrivateDialoge female 105
## 2 PublicDialogue female 63
## 3 UnscriptedMonologue female 3
## 4 PrivateDialoge male 18
## 5 PublicDialogue male 41
## 6 UnscriptedMonologue male 16
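The tidyr documentation describes gather and spread as superseded by the newer functions pivot_longer and pivot_wider. The following sketch shows the same round trip with these newer functions. The toy data is invented for illustration and the sketch assumes that tidyr is installed.

```r
library(tidyr)

# toy data in long format (invented values)
long <- data.frame(
  Texttype = c("PrivateDialogue", "PrivateDialogue",
               "PublicDialogue", "PublicDialogue"),
  Sex      = c("female", "male", "female", "male"),
  Speakers = c(105, 18, 63, 41)
)

# long to wide: one column per level of Sex (the counterpart of spread)
wide <- pivot_wider(long, names_from = Sex, values_from = Speakers)

# wide to long: collect the female and male columns again (the counterpart of gather)
long_again <- pivot_longer(wide, cols = c(female, male),
                           names_to = "Sex", values_to = "Speakers")

# inspect
wide
long_again
```

The argument names (names_from, values_from, names_to, values_to) spell out which column supplies the new column names and which supplies the values, which some people find easier to remember than the argument order of gather and spread.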
5.2 Ending R sessions
At the end of each session, you can extract information about the session itself (e.g. which R version you used and which versions of packages). This can help others (or even your future self) to reproduce the analysis that you have done.
5.2.1 Extracting session information
You can extract the session information by running the sessionInfo function (without any arguments).
sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
## [9] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.2 xfun_0.30 bslib_0.3.1 haven_2.5.0
## [5] colorspace_2.0-3 vctrs_0.4.1 generics_0.1.2 htmltools_0.5.2
## [9] yaml_2.3.5 utf8_1.2.2 rlang_1.0.2 jquerylib_0.1.4
## [13] pillar_1.7.0 withr_2.5.0 glue_1.6.2 DBI_1.1.2
## [17] dbplyr_2.1.1 readxl_1.4.0 modelr_0.1.8 lifecycle_1.0.1
## [21] cellranger_1.1.0 munsell_0.5.0 gtable_0.3.0 rvest_1.0.2
## [25] evaluate_0.15 knitr_1.39 tzdb_0.3.0 fastmap_1.1.0
## [29] fansi_1.0.3 broom_0.8.0 renv_0.15.4 backports_1.4.1
## [33] scales_1.2.0 jsonlite_1.8.0 fs_1.5.2 hms_1.1.1
## [37] digest_0.6.29 stringi_1.7.6 bookdown_0.26 rprojroot_2.0.3
## [41] grid_4.2.0 here_1.0.1 cli_3.3.0 tools_4.2.0
## [45] magrittr_2.0.3 sass_0.4.1 crayon_1.5.1 pkgconfig_2.0.3
## [49] ellipsis_0.3.2 xml2_1.3.3 reprex_2.0.1 lubridate_1.8.0
## [53] assertthat_0.2.1 rmarkdown_2.14 httr_1.4.3 rstudioapi_0.13
## [57] R6_2.5.1 compiler_4.2.0
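If you want to keep the session information together with your analysis, one option is to write it to a text file. The sketch below is only one possible way of doing this; it saves the file in a temporary folder, and the file name session_info.txt is an arbitrary choice.

```r
# path of the file in which the session information will be stored
# (temporary folder and file name are arbitrary choices for this example)
session_file <- file.path(tempdir(), "session_info.txt")

# capture the printed output of sessionInfo() and write it to the file
writeLines(capture.output(sessionInfo()), session_file)

# inspect the first line of the saved file
readLines(session_file)[1]
```

In a real project, you would replace the temporary path with a location inside your project folder so that the file is kept alongside your scripts.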
5.3 Going further
If you want to know more, would like to get some more practice, or would like to have another approach to R, please check out the workshops and resources on R provided by the UQ library. In addition, there are various online resources available to learn R (you can check out a very recommendable introduction here).
Here are also some additional resources that you may find helpful:
- Grolemund, G., and Wickham, H., R for Data Science, 2017.
- Highly recommended! (especially chapters 1, 2, 4, 6, and 8)
- Stat545 - Data wrangling, exploration, and analysis with R. University of British Columbia. http://stat545.com/
- Swirlstats, a package that teaches you R and statistics within R: https://swirlstats.com/
- DataCamp’s (free) Intro to R interactive tutorial: https://www.datacamp.com/courses/free-introduction-to-r
- DataCamp’s advanced R tutorials require a subscription.
- Twitter:
- Explore RStudio Tips https://twitter.com/rstudiotips
- Explore #rstats, #rstudioconf