Week 5 Working with Tables in R and RStudio
This week, we will continue to explore R and RStudio.
5.1 Working with tables
We will now start working with data in R. As most of the data that we work with comes in tables.
We will download a tab-separated txt-file. Translated to prose, the code below means Create an object called w5dat and in that object, store the result of the read.delim
function.
library(tidyverse)
<- read.delim(here::here("data", "week5data1.txt"), sep = "\t", header = T) w5dat
read.delim
stands for read delimited file and it takes the URL from which to load the data (or the path to the data on your computer) as its first argument. The sep
stand for separator and the \t
stands for tab-separated and represents the second argument that the read.delim
function takes. The third argument, header
, can take either T(RUE) or F(ALSE) and it tells R if the data has column names (headers) or not.
5.2 Functions and Objects
In R, functions always have the following form: function(argument1, argument2, ..., argumentN)
. Typically a function does something to an object (e.g. a table), so that the first argument typically specifies the data to which the function is applied. Other arguments then allow to add some information. Just as a side note, functions are also objects that do not contain data but instructions.
To assign content to an object, we use <-
or =
so that the we provide a name for an object, and then assign some content to it. For example, MyObject <- 1:3
means Create an object called MyObject
. this object should contain the numbers 1 to 3.
<- 1:3
MyObject # inspect
MyObject
## [1] 1 2 3
5.3 Inspecting data
There are many ways to inspect data. We will briefly go over the most common ways to inspect data.
The head
function takes the data-object as its first argument and automatically shows the first 6 elements of an object (or rows if the data-object has a table format).
head(w5dat)
## id file.speaker.id text.id spk.ref zone date sex age
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994 male 34-41
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994 female 34-41
## 3 3 <S1A-002$?> S1A-002 ? <NA> <NA> <NA> <NA>
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005 female 26-33
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005 female 19-25
## 6 6 <S1A-002$C> S1A-002 C northern ireland 2002-2005 male 50+
## word.count
## 1 765
## 2 1298
## 3 23
## 4 391
## 5 47
## 6 200
We can also use the head
function to inspect more or less elements and we can specify the number of elements (or rows) that we want to inspect as a second argument. In the example below, the 4
tells R that we only want to see the first 4 rows of the data.
head(w5dat, 4)
## id file.speaker.id text.id spk.ref zone date sex age
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994 male 34-41
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994 female 34-41
## 3 3 <S1A-002$?> S1A-002 ? <NA> <NA> <NA> <NA>
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005 female 26-33
## word.count
## 1 765
## 2 1298
## 3 23
## 4 391
5.4 Accessing individual cells in a table
If you want to access specific cells in a table, you can do so by typing the name of the object and then specify the rows and columns in square brackets (i.e. data[row, column]). For example, w5dat[2, 4]
would show the value of the cell in the second row and fourth column of the object w5dat
. We can also use the colon to define a range (as shown below, where 1:5 means from 1 to 5 and 1:3 means from 1 to 3) The command w5dat[1:5, 1:3]
thus means:
Show me the first 5 rows and the first 3 columns of the data-object that is called w5dat.
1:5, 1:3] w5dat[
## id file.speaker.id text.id
## 1 1 <S1A-001$A> S1A-001
## 2 2 <S1A-001$B> S1A-001
## 3 3 <S1A-002$?> S1A-002
## 4 4 <S1A-002$A> S1A-002
## 5 5 <S1A-002$B> S1A-002
Inspecting the structure of data
You can use the str
function to inspect the structure of a data set. This means that this function will show the number of observations (rows) and variables (columns) and tell you what type of variables the data consists of
- int = integer
- chr = character string
- num = numeric
- fct = factor
str(w5dat)
## 'data.frame': 1332 obs. of 9 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ file.speaker.id: chr "<S1A-001$A>" "<S1A-001$B>" "<S1A-002$?>" "<S1A-002$A>" ...
## $ text.id : chr "S1A-001" "S1A-001" "S1A-002" "S1A-002" ...
## $ spk.ref : chr "A" "B" "?" "A" ...
## $ zone : chr "northern ireland" "northern ireland" NA "northern ireland" ...
## $ date : chr "1990-1994" "1990-1994" NA "2002-2005" ...
## $ sex : chr "male" "female" NA "female" ...
## $ age : chr "34-41" "34-41" NA "26-33" ...
## $ word.count : int 765 1298 23 391 47 200 464 639 308 78 ...
The summary
function summarizes the data.
summary(w5dat)
## id file.speaker.id text.id spk.ref
## Min. : 1.0 Length:1332 Length:1332 Length:1332
## 1st Qu.: 333.8 Class :character Class :character Class :character
## Median : 666.5 Mode :character Mode :character Mode :character
## Mean : 666.5
## 3rd Qu.: 999.2
## Max. :1332.0
## zone date sex age
## Length:1332 Length:1332 Length:1332 Length:1332
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## word.count
## Min. : 0.0
## 1st Qu.: 66.0
## Median : 240.5
## Mean : 449.9
## 3rd Qu.: 638.2
## Max. :2565.0
5.5 Tabulating data
We can use the table
function to create basic tables that extract raw frequency information. The following command tells us how many instances there are of each level of the variable date
in the w5dat
.
table(w5dat$date)
##
## 1990-1994 1995-2001 2002-2005
## 905 67 270
Alternatively, you could, of course, index the column by using its position in the data set like this: w5dat[, 6]
- the result of table(w5dat[, 6])
and table(w5dat$date)
are the same! Also note that here we leave out indexes for rows to tell R that we want all rows.
When you want to cross-tabulate columns, it is often better to use the ftable
function (ftable
stands for frequency table).
ftable(w5dat$age, w5dat$sex)
## female male
##
## 0-18 5 7
## 19-25 163 65
## 26-33 83 36
## 34-41 35 58
## 42-49 35 97
## 50+ 63 138
5.6 Saving data to your computer
To save tabular data on your computer, you can use the write.table
function. This function requires the data that you want to save as its first argument, the location where you want to save the data as the second argument and the type of delimiter as the third argument.
write.table(w5dat, here::here("data", "w5dat.txt"), sep = "\t")
A word about paths
In the code chunk above, the sequence here::here("data", "w5dat.txt")
is a handy way to define a path. A path is simply the location where a file is stored on your computer or on the internet (which typically is a server - which is just a fancy term for a computer - somewhere on the globe). The here
function from thehere
package allows to simply state in which folder a certain file is and what file you are talking about.
In this case, we want to access the file w5dat
(which is a txt
file and thus has the appendix .txt
) in the data
folder. R will always start looking in the folder in which your project is stored. If you want to access a file that is stored somewhere else on your computer, you can also define the full path to the folder in which the file is. In my case, this would be D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data
. However, as the data
folder in in the folder where my Rproj file is, I only need to specify that the file is in the data
folder within the folder in which my Rproj file is located.
A word about package naming
Another thing that is notable in the sequence here::here("data", "w5dat.txt")
is that I specified that the here
function is part of the here
package. This is what I meant by writing here::here
which simply means use the here
function from here
package (package::function
). This may appear to be somewhat redundant but it happens quite frequently, that different packages have functions that have the same names. In such cases, R will simply choose the function from the package that was loaded last. To prevent R from using the wrong function, it makes sense to specify the package AND the function (as I did in the sequence here::here
). I only use functions without specify the package if the function is part of base R.
5.7 Loading data from your computer
To load tabular data from within your project folder (if it is in a tab-separated txt-file) you can also use the read.delim
function. The only difference to loading from the web is that you use a path instead of a URL. If the txt-file is in the folder called data in your project folder, you would load the data as shown below.
<- read.delim(here::here("data", "week5data1.txt"), sep = "\t", header = T) w5dat
To if this has worked, we will use the head
function to see first 6 rows of the data
head(w5dat)
## id file.speaker.id text.id spk.ref zone date sex age
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994 male 34-41
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994 female 34-41
## 3 3 <S1A-002$?> S1A-002 ? <NA> <NA> <NA> <NA>
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005 female 26-33
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005 female 19-25
## 6 6 <S1A-002$C> S1A-002 C northern ireland 2002-2005 male 50+
## word.count
## 1 765
## 2 1298
## 3 23
## 4 391
## 5 47
## 6 200
5.8 Loading Excel data
To load Excel spreadsheets, you can use the read_excel
function from the readxl
package as shown below. However, it may be necessary to install and activate the readxl
package first.
<- readxl::read_excel(here::here("data", "week5data2.xlsx")) w5dat
We now briefly check column names to see if the loading of the data has worked.
colnames(w5dat)
## [1] "id" "file.speaker.id" "text.id" "spk.ref"
## [5] "zone" "date" "sex" "age"
## [9] "word.count"
5.9 Renaming, Piping, and Filtering
To rename existing columns in a table, you can use the rename
command which takes the table as the first argument, the new name as the second argument, the an equal sign (=), and finally, the old name es the third argument. For example, renaming a column OldName as NewName in a table called MyTable would look like this: rename(MyTable, NewName = OldName)
.
Piping is done using the %>%
sequence and it can be translated as and then. In the example below, we create a new object (w5dat_edit) from the existing object (w5dat) and then we rename the columns in the new object. When we use piping, we do not need to name the data we are using as this is provided by the previous step.
<- w5dat %>%
w5dat_edit ::rename(Id = id,
dplyrFileSpeakerId = file.speaker.id,
File = colnames(w5dat)[3],
Speaker = colnames(w5dat)[4])
# inspect data
1:5, 1:6] w5dat_edit[
## # A tibble: 5 × 6
## Id FileSpeakerId File Speaker zone date
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994
## 3 3 <S1A-002$?> S1A-002 ? NA NA
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005
A very handy way to rename many columns simultaneously, you can use the str_to_title
function which capitalizes first letter of a word. In the example below, we capitalize all first letters of the column names of our current data.
colnames(w5dat_edit) <- stringr::str_to_title(colnames(w5dat_edit))
# inspect data
1:5, 1:6] w5dat_edit[
## # A tibble: 5 × 6
## Id Filespeakerid File Speaker Zone Date
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 1 <S1A-001$A> S1A-001 A northern ireland 1990-1994
## 2 2 <S1A-001$B> S1A-001 B northern ireland 1990-1994
## 3 3 <S1A-002$?> S1A-002 ? NA NA
## 4 4 <S1A-002$A> S1A-002 A northern ireland 2002-2005
## 5 5 <S1A-002$B> S1A-002 B northern ireland 2002-2005
To remove rows based on values in columns you can use the filter
function.
<- w5dat_edit %>%
w5dat_edit2 ::filter(Speaker != "?",
dplyr!= is.na(Zone),
Zone == "2002-2005",
Date > 5)
Word.count # inspect data
head(w5dat_edit2)
## # A tibble: 6 × 9
## Id Filespeakerid File Speaker Zone Date Sex Age Word.…¹
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 4 <S1A-002$A> S1A-002 A northern ireland 2002… fema… 26-33 391
## 2 5 <S1A-002$B> S1A-002 B northern ireland 2002… fema… 19-25 47
## 3 6 <S1A-002$C> S1A-002 C northern ireland 2002… male 50+ 200
## 4 7 <S1A-002$D> S1A-002 D northern ireland 2002… fema… 50+ 464
## 5 8 <S1A-002$E> S1A-002 E mixed between n… 2002… male 34-41 639
## 6 9 <S1A-002$F> S1A-002 F northern ireland 2002… fema… 26-33 308
## # … with abbreviated variable name ¹Word.count
To select specific columns you can use the select
function.
<- w5dat_edit2 %>%
w5dat_selection ::select(File, Speaker, Word.count)
dplyr# inspect data
head(w5dat_selection)
## # A tibble: 6 × 3
## File Speaker Word.count
## <chr> <chr> <dbl>
## 1 S1A-002 A 391
## 2 S1A-002 B 47
## 3 S1A-002 C 200
## 4 S1A-002 D 464
## 5 S1A-002 E 639
## 6 S1A-002 F 308
You can also use the select
function to remove specific columns.
<- w5dat_edit2 %>%
w5dat_selection2 ::select(-Id, -File, -Speaker, -Date, -Zone, -Age)
dplyr# inspect data
head(w5dat_selection2)
## # A tibble: 6 × 3
## Filespeakerid Sex Word.count
## <chr> <chr> <dbl>
## 1 <S1A-002$A> female 391
## 2 <S1A-002$B> female 47
## 3 <S1A-002$C> male 200
## 4 <S1A-002$D> female 464
## 5 <S1A-002$E> male 639
## 6 <S1A-002$F> female 308
5.10 Ordering data
To order data, for instance, in ascending order according to a specific column you can use the arrange
function.
<- w5dat_selection2 %>%
w5dat_ordered_asc ::arrange(Word.count)
dplyr# inspect data
head(w5dat_ordered_asc)
## # A tibble: 6 × 3
## Filespeakerid Sex Word.count
## <chr> <chr> <dbl>
## 1 <S1B-009$D> female 6
## 2 <S1B-005$C> female 7
## 3 <S1B-009$C> male 7
## 4 <S1B-020$F> male 7
## 5 <S1B-006$G> female 9
## 6 <S2A-050$B> male 9
To order data in descending order you can also use the arrange
function and simply add a - before the column according to which you want to order the data.
<- w5dat_selection2 %>%
w5dat_ordered_desc ::arrange(-Word.count)
dplyr# inspect data
head(w5dat_ordered_desc)
## # A tibble: 6 × 3
## Filespeakerid Sex Word.count
## <chr> <chr> <dbl>
## 1 <S2A-055$A> female 2355
## 2 <S2A-047$A> male 2340
## 3 <S2A-035$A> female 2244
## 4 <S2A-048$A> male 2200
## 5 <S2A-015$A> male 2172
## 6 <S2A-054$A> female 2113
The output shows that the female speaker in file S2A-005 with the speaker identity A has the highest word count with 2,355 words.
5.11 Creating and changing variables
New columns are created, and existing columns can be changed, by using the mutate
function. The mutate
function takes two arguments (if the data does not have to be specified): the first argument is the (new) name of column that you want to create and the second is what you want to store in that column. The = tells R that the new column will contain the result of the second argument.
In the example below, we create a new column called Texttype.
This new column should contain
the value PrivateDialoge if Filespeakerid contains the sequence S1A,
the value PublicDialogue if Filespeakerid contains the sequence S1B,
the value UnscriptedMonologue if Filespeakerid contains the sequence S2A,
the value ScriptedMonologue if Filespeakerid contains the sequence S2B,
the value of Filespeakerid if Filespeakerid neither contains S1A, S1B, S2A, nor S2B.
<- w5dat_selection2 %>%
w5dat_texttype ::mutate(Texttype =
dplyr::case_when(stringr::str_detect(Filespeakerid ,"S1A") ~ "PrivateDialoge",
dplyr::str_detect(Filespeakerid ,"S1B") ~ "PublicDialogue",
stringr::str_detect(Filespeakerid ,"S2A") ~ "UnscriptedMonologue",
stringr::str_detect(Filespeakerid ,"S2B") ~ "ScriptedMonologue",
stringrTRUE ~ Filespeakerid))
# inspect data
head(w5dat_texttype)
## # A tibble: 6 × 4
## Filespeakerid Sex Word.count Texttype
## <chr> <chr> <dbl> <chr>
## 1 <S1A-002$A> female 391 PrivateDialoge
## 2 <S1A-002$B> female 47 PrivateDialoge
## 3 <S1A-002$C> male 200 PrivateDialoge
## 4 <S1A-002$D> female 464 PrivateDialoge
## 5 <S1A-002$E> male 639 PrivateDialoge
## 6 <S1A-002$F> female 308 PrivateDialoge
5.12 If-statements
We should briefly talk about if-statements (or case_when
in the present case). The case_when
function is both very powerful and extremely helpful as it allows you to assign values based on a test. As such, case_when
-statements can be read as:
When/If X is the case, then do A and if X is not the case do B! (When/If -> Then -> Else)
The nice thing about ifelse
or case_when
-statements is that they can be used in succession as we have done above. This can then be read as:
If X is the case, then do A, if Y is the case, then do B, else do Z
5.13 Summarizing data
Summarizing is really helpful and can be done using the summarise
function.
<- w5dat_texttype %>%
w5dat_summary1 ::summarise(Words = sum(Word.count))
dplyr# inspect data
head(w5dat_summary1)
## # A tibble: 1 × 1
## Words
## <dbl>
## 1 141876
To get summaries of sub-groups or by variable level, we can use the group_by
function and then use the summarise
function.
<- w5dat_texttype %>%
w5dat_summary2 ::group_by(Texttype, Sex) %>%
dplyr::summarise(Speakers = n(),
dplyrWords = sum(Word.count))
# inspect data
head(w5dat_summary2)
## # A tibble: 6 × 4
## # Groups: Texttype [3]
## Texttype Sex Speakers Words
## <chr> <chr> <int> <dbl>
## 1 PrivateDialoge female 105 60024
## 2 PrivateDialoge male 18 9628
## 3 PublicDialogue female 63 24647
## 4 PublicDialogue male 41 16783
## 5 UnscriptedMonologue female 3 6712
## 6 UnscriptedMonologue male 16 24082
5.14 Gathering and spreading data
The tidyr
package has two very useful functions for gathering and spreading data that can be sued to transform data to long and wide formats (you will see what this means below). The functions are called gather
and spread
.
We will use the data set called w5dat_summary2
, which we created above, to demonstrate how this works.
We will first check out the spread
-function to create different columns for women and men that show how many of them are represented in the different text types.
<- w5dat_summary2 %>%
w5dat_summary_wide ::select(-Words) %>%
dplyr::spread(Sex, Speakers)
tidyr# inspect
w5dat_summary_wide
## # A tibble: 3 × 3
## # Groups: Texttype [3]
## Texttype female male
## <chr> <int> <int>
## 1 PrivateDialoge 105 18
## 2 PublicDialogue 63 41
## 3 UnscriptedMonologue 3 16
The data is now in what is called a wide
-format as values are distributed across columns.
To reformat this back to a long
-format where each column represents exactly one variable, we use the gather
-function:
<- w5dat_summary_wide %>%
w5dat_summary_long ::gather(Sex, Speakers, female:male)
tidyr# inspect
w5dat_summary_long
## # A tibble: 6 × 3
## # Groups: Texttype [3]
## Texttype Sex Speakers
## <chr> <chr> <int>
## 1 PrivateDialoge female 105
## 2 PublicDialogue female 63
## 3 UnscriptedMonologue female 3
## 4 PrivateDialoge male 18
## 5 PublicDialogue male 41
## 6 UnscriptedMonologue male 16
5.15 Ending R sessions
At the end of each session, you can extract information about the session itself (e.g. which R version you used and which versions of packages). This can help others (or even your future self) to reproduce the analysis that you have done.
5.16 Extracting session information
You can extract the session information by running the sessionInfo
function (without any arguments)
sessionInfo()
## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Germany.utf8 LC_CTYPE=German_Germany.utf8
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C
## [5] LC_TIME=German_Germany.utf8
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.3.6
## [9] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] lubridate_1.8.0 here_1.0.1 rprojroot_2.0.3
## [4] assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
## [7] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
## [10] reprex_2.0.2 evaluate_0.16 httr_1.4.4
## [13] pillar_1.8.1 rlang_1.0.5 googlesheets4_1.0.1
## [16] readxl_1.4.1 rstudioapi_0.14 jquerylib_0.1.4
## [19] rmarkdown_2.16 googledrive_2.0.0 munsell_0.5.0
## [22] broom_1.0.1 compiler_4.2.1 modelr_0.1.9
## [25] xfun_0.32 pkgconfig_2.0.3 htmltools_0.5.3
## [28] tidyselect_1.1.2 bookdown_0.28 fansi_1.0.3
## [31] crayon_1.5.1 tzdb_0.3.0 dbplyr_2.2.1
## [34] withr_2.5.0 grid_4.2.1 jsonlite_1.8.0
## [37] gtable_0.3.1 lifecycle_1.0.1 DBI_1.1.3
## [40] magrittr_2.0.3 scales_1.2.1 cli_3.4.0
## [43] stringi_1.7.8 cachem_1.0.6 renv_0.15.5
## [46] fs_1.5.2 xml2_1.3.3 bslib_0.4.0
## [49] ellipsis_0.3.2 generics_0.1.3 vctrs_0.4.1
## [52] tools_4.2.1 glue_1.6.2 hms_1.1.2
## [55] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
## [58] gargle_1.2.1 rvest_1.0.3 knitr_1.40
## [61] haven_2.5.1 sass_0.4.2
5.17 Going further
If you want to know more, would like to get some more practice, or would like to have another approach to R, please check out the workshops and resources on R provided by the UQ library. In addition, there are various online resources available to learn R (you can check out a very recommendable introduction here).
Here are also some additional resources that you may find helpful:
- Grolemund. G., and Wickham, H., R 4 Data Science, 2017.
- Highly recommended! (especially chapters 1, 2, 4, 6, and 8)
- Stat545 - Data wrangling, exploration, and analysis with R. University of British Columbia. http://stat545.com/
- Swirlstats, a package that teaches you R and statistics within R: https://swirlstats.com/
- DataCamp’s (free) Intro to R interactive tutorial: https://www.datacamp.com/courses/free-introduction-to-r
- DataCamp’s advanced R tutorials require a subscription. *Twitter:
- Explore RStudio Tips https://twitter.com/rstudiotips
- Explore #rstats, #rstudioconf