Week 5 Working with Tables in R and RStudio

This week, we will continue to explore R and RStudio.

5.1 Working with tables

We will now start working with data in R. As most of the data that we work with comes in tables.

We will download a tab-separated txt-file. Translated to prose, the code below means Create an object called w5dat and in that object, store the result of the read.delim function.

library(tidyverse)
w5dat <- read.delim(here::here("data", "week5data1.txt"), sep = "\t", header = T)

read.delim stands for read delimited file and it takes the URL from which to load the data (or the path to the data on your computer) as its first argument. The sep stand for separator and the \t stands for tab-separated and represents the second argument that the read.delim function takes. The third argument, header, can take either T(RUE) or F(ALSE) and it tells R if the data has column names (headers) or not.

5.2 Functions and Objects

In R, functions always have the following form: function(argument1, argument2, ..., argumentN). Typically a function does something to an object (e.g. a table), so that the first argument typically specifies the data to which the function is applied. Other arguments then allow to add some information. Just as a side note, functions are also objects that do not contain data but instructions.

To assign content to an object, we use <- or = so that the we provide a name for an object, and then assign some content to it. For example, MyObject <- 1:3 means Create an object called MyObject. this object should contain the numbers 1 to 3.

MyObject <- 1:3
# inspect
MyObject

## [1] 1 2 3

5.3 Inspecting data

There are many ways to inspect data. We will briefly go over the most common ways to inspect data.

The head function takes the data-object as its first argument and automatically shows the first 6 elements of an object (or rows if the data-object has a table format).

head(w5dat)

##   id file.speaker.id text.id spk.ref             zone      date    sex   age
## 1  1     <S1A-001$A> S1A-001       A northern ireland 1990-1994   male 34-41
## 2  2     <S1A-001$B> S1A-001       B northern ireland 1990-1994 female 34-41
## 3  3     <S1A-002$?> S1A-002       ?             <NA>      <NA>   <NA>  <NA>
## 4  4     <S1A-002$A> S1A-002       A northern ireland 2002-2005 female 26-33
## 5  5     <S1A-002$B> S1A-002       B northern ireland 2002-2005 female 19-25
## 6  6     <S1A-002$C> S1A-002       C northern ireland 2002-2005   male   50+
##   word.count
## 1        765
## 2       1298
## 3         23
## 4        391
## 5         47
## 6        200

We can also use the head function to inspect more or less elements and we can specify the number of elements (or rows) that we want to inspect as a second argument. In the example below, the 4 tells R that we only want to see the first 4 rows of the data.

head(w5dat, 4)

##   id file.speaker.id text.id spk.ref             zone      date    sex   age
## 1  1     <S1A-001$A> S1A-001       A northern ireland 1990-1994   male 34-41
## 2  2     <S1A-001$B> S1A-001       B northern ireland 1990-1994 female 34-41
## 3  3     <S1A-002$?> S1A-002       ?             <NA>      <NA>   <NA>  <NA>
## 4  4     <S1A-002$A> S1A-002       A northern ireland 2002-2005 female 26-33
##   word.count
## 1        765
## 2       1298
## 3         23
## 4        391

5.4 Accessing individual cells in a table

If you want to access specific cells in a table, you can do so by typing the name of the object and then specify the rows and columns in square brackets (i.e. data[row, column]). For example, w5dat[2, 4] would show the value of the cell in the second row and fourth column of the object w5dat. We can also use the colon to define a range (as shown below, where 1:5 means from 1 to 5 and 1:3 means from 1 to 3) The command w5dat[1:5, 1:3] thus means:

Show me the first 5 rows and the first 3 columns of the data-object that is called w5dat.

w5dat[1:5, 1:3]

##   id file.speaker.id text.id
## 1  1     <S1A-001$A> S1A-001
## 2  2     <S1A-001$B> S1A-001
## 3  3     <S1A-002$?> S1A-002
## 4  4     <S1A-002$A> S1A-002
## 5  5     <S1A-002$B> S1A-002

Inspecting the structure of data

You can use the str function to inspect the structure of a data set. This means that this function will show the number of observations (rows) and variables (columns) and tell you what type of variables the data consists of

int = integer
chr = character string
num = numeric
fct = factor

str(w5dat)

## 'data.frame':    1332 obs. of  9 variables:
##  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ file.speaker.id: chr  "<S1A-001$A>" "<S1A-001$B>" "<S1A-002$?>" "<S1A-002$A>" ...
##  $ text.id        : chr  "S1A-001" "S1A-001" "S1A-002" "S1A-002" ...
##  $ spk.ref        : chr  "A" "B" "?" "A" ...
##  $ zone           : chr  "northern ireland" "northern ireland" NA "northern ireland" ...
##  $ date           : chr  "1990-1994" "1990-1994" NA "2002-2005" ...
##  $ sex            : chr  "male" "female" NA "female" ...
##  $ age            : chr  "34-41" "34-41" NA "26-33" ...
##  $ word.count     : int  765 1298 23 391 47 200 464 639 308 78 ...

The summary function summarizes the data.

summary(w5dat)

##        id         file.speaker.id      text.id            spk.ref         
##  Min.   :   1.0   Length:1332        Length:1332        Length:1332       
##  1st Qu.: 333.8   Class :character   Class :character   Class :character  
##  Median : 666.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 666.5                                                           
##  3rd Qu.: 999.2                                                           
##  Max.   :1332.0                                                           
##      zone               date               sex                age           
##  Length:1332        Length:1332        Length:1332        Length:1332       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    word.count    
##  Min.   :   0.0  
##  1st Qu.:  66.0  
##  Median : 240.5  
##  Mean   : 449.9  
##  3rd Qu.: 638.2  
##  Max.   :2565.0

5.5 Tabulating data

We can use the table function to create basic tables that extract raw frequency information. The following command tells us how many instances there are of each level of the variable date in the w5dat.

table(w5dat$date)

## 
## 1990-1994 1995-2001 2002-2005 
##       905        67       270

Alternatively, you could, of course, index the column by using its position in the data set like this: w5dat[, 6] - the result of table(w5dat[, 6]) and table(w5dat$date) are the same! Also note that here we leave out indexes for rows to tell R that we want all rows.

When you want to cross-tabulate columns, it is often better to use the ftable function (ftable stands for frequency table).

ftable(w5dat$age, w5dat$sex)

##        female male
##                   
## 0-18        5    7
## 19-25     163   65
## 26-33      83   36
## 34-41      35   58
## 42-49      35   97
## 50+        63  138

5.6 Saving data to your computer

To save tabular data on your computer, you can use the write.table function. This function requires the data that you want to save as its first argument, the location where you want to save the data as the second argument and the type of delimiter as the third argument.

write.table(w5dat, here::here("data", "w5dat.txt"), sep = "\t")

A word about paths

In the code chunk above, the sequence here::here("data", "w5dat.txt") is a handy way to define a path. A path is simply the location where a file is stored on your computer or on the internet (which typically is a server - which is just a fancy term for a computer - somewhere on the globe). The here function from thehere package allows to simply state in which folder a certain file is and what file you are talking about.

In this case, we want to access the file w5dat (which is a txt file and thus has the appendix .txt) in the data folder. R will always start looking in the folder in which your project is stored. If you want to access a file that is stored somewhere else on your computer, you can also define the full path to the folder in which the file is. In my case, this would be D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data. However, as the data folder in in the folder where my Rproj file is, I only need to specify that the file is in the data folder within the folder in which my Rproj file is located.

A word about package naming

Another thing that is notable in the sequence here::here("data", "w5dat.txt") is that I specified that the here function is part of the here package. This is what I meant by writing here::here which simply means use the here function from here package (package::function). This may appear to be somewhat redundant but it happens quite frequently, that different packages have functions that have the same names. In such cases, R will simply choose the function from the package that was loaded last. To prevent R from using the wrong function, it makes sense to specify the package AND the function (as I did in the sequence here::here). I only use functions without specify the package if the function is part of base R.

5.7 Loading data from your computer

To load tabular data from within your project folder (if it is in a tab-separated txt-file) you can also use the read.delim function. The only difference to loading from the web is that you use a path instead of a URL. If the txt-file is in the folder called data in your project folder, you would load the data as shown below.

w5dat <- read.delim(here::here("data", "week5data1.txt"), sep = "\t", header = T)

To if this has worked, we will use the head function to see first 6 rows of the data

head(w5dat)

##   id file.speaker.id text.id spk.ref             zone      date    sex   age
## 1  1     <S1A-001$A> S1A-001       A northern ireland 1990-1994   male 34-41
## 2  2     <S1A-001$B> S1A-001       B northern ireland 1990-1994 female 34-41
## 3  3     <S1A-002$?> S1A-002       ?             <NA>      <NA>   <NA>  <NA>
## 4  4     <S1A-002$A> S1A-002       A northern ireland 2002-2005 female 26-33
## 5  5     <S1A-002$B> S1A-002       B northern ireland 2002-2005 female 19-25
## 6  6     <S1A-002$C> S1A-002       C northern ireland 2002-2005   male   50+
##   word.count
## 1        765
## 2       1298
## 3         23
## 4        391
## 5         47
## 6        200

5.8 Loading Excel data

To load Excel spreadsheets, you can use the read_excel function from the readxl package as shown below. However, it may be necessary to install and activate the readxl package first.

w5dat <- readxl::read_excel(here::here("data", "week5data2.xlsx"))

We now briefly check column names to see if the loading of the data has worked.

colnames(w5dat)

## [1] "id"              "file.speaker.id" "text.id"         "spk.ref"        
## [5] "zone"            "date"            "sex"             "age"            
## [9] "word.count"

5.9 Renaming, Piping, and Filtering

To rename existing columns in a table, you can use the rename command which takes the table as the first argument, the new name as the second argument, the an equal sign (=), and finally, the old name es the third argument. For example, renaming a column OldName as NewName in a table called MyTable would look like this: rename(MyTable, NewName = OldName).

Piping is done using the %>% sequence and it can be translated as and then. In the example below, we create a new object (w5dat_edit) from the existing object (w5dat) and then we rename the columns in the new object. When we use piping, we do not need to name the data we are using as this is provided by the previous step.

w5dat_edit <- w5dat %>%
  dplyr::rename(Id = id,
         FileSpeakerId = file.speaker.id,
         File = colnames(w5dat)[3],
         Speaker = colnames(w5dat)[4])
# inspect data
w5dat_edit[1:5, 1:6]

## # A tibble: 5 × 6
##      Id FileSpeakerId File    Speaker zone             date     
##   <dbl> <chr>         <chr>   <chr>   <chr>            <chr>    
## 1     1 <S1A-001$A>   S1A-001 A       northern ireland 1990-1994
## 2     2 <S1A-001$B>   S1A-001 B       northern ireland 1990-1994
## 3     3 <S1A-002$?>   S1A-002 ?       NA               NA       
## 4     4 <S1A-002$A>   S1A-002 A       northern ireland 2002-2005
## 5     5 <S1A-002$B>   S1A-002 B       northern ireland 2002-2005

A very handy way to rename many columns simultaneously, you can use the str_to_title function which capitalizes first letter of a word. In the example below, we capitalize all first letters of the column names of our current data.

colnames(w5dat_edit) <- stringr::str_to_title(colnames(w5dat_edit))
# inspect data
w5dat_edit[1:5, 1:6]

## # A tibble: 5 × 6
##      Id Filespeakerid File    Speaker Zone             Date     
##   <dbl> <chr>         <chr>   <chr>   <chr>            <chr>    
## 1     1 <S1A-001$A>   S1A-001 A       northern ireland 1990-1994
## 2     2 <S1A-001$B>   S1A-001 B       northern ireland 1990-1994
## 3     3 <S1A-002$?>   S1A-002 ?       NA               NA       
## 4     4 <S1A-002$A>   S1A-002 A       northern ireland 2002-2005
## 5     5 <S1A-002$B>   S1A-002 B       northern ireland 2002-2005

To remove rows based on values in columns you can use the filter function.

w5dat_edit2 <- w5dat_edit %>%
  dplyr::filter(Speaker != "?",
         Zone != is.na(Zone),
         Date == "2002-2005",
         Word.count > 5)
# inspect data
head(w5dat_edit2)

## # A tibble: 6 × 9
##      Id Filespeakerid File    Speaker Zone             Date  Sex   Age   Word.…¹
##   <dbl> <chr>         <chr>   <chr>   <chr>            <chr> <chr> <chr>   <dbl>
## 1     4 <S1A-002$A>   S1A-002 A       northern ireland 2002… fema… 26-33     391
## 2     5 <S1A-002$B>   S1A-002 B       northern ireland 2002… fema… 19-25      47
## 3     6 <S1A-002$C>   S1A-002 C       northern ireland 2002… male  50+       200
## 4     7 <S1A-002$D>   S1A-002 D       northern ireland 2002… fema… 50+       464
## 5     8 <S1A-002$E>   S1A-002 E       mixed between n… 2002… male  34-41     639
## 6     9 <S1A-002$F>   S1A-002 F       northern ireland 2002… fema… 26-33     308
## # … with abbreviated variable name ¹Word.count

To select specific columns you can use the select function.

w5dat_selection <- w5dat_edit2 %>%
  dplyr::select(File, Speaker, Word.count)
# inspect data
head(w5dat_selection)

## # A tibble: 6 × 3
##   File    Speaker Word.count
##   <chr>   <chr>        <dbl>
## 1 S1A-002 A              391
## 2 S1A-002 B               47
## 3 S1A-002 C              200
## 4 S1A-002 D              464
## 5 S1A-002 E              639
## 6 S1A-002 F              308

You can also use the select function to remove specific columns.

w5dat_selection2 <- w5dat_edit2 %>%
  dplyr::select(-Id, -File, -Speaker, -Date, -Zone, -Age)
# inspect data
head(w5dat_selection2)

## # A tibble: 6 × 3
##   Filespeakerid Sex    Word.count
##   <chr>         <chr>       <dbl>
## 1 <S1A-002$A>   female        391
## 2 <S1A-002$B>   female         47
## 3 <S1A-002$C>   male          200
## 4 <S1A-002$D>   female        464
## 5 <S1A-002$E>   male          639
## 6 <S1A-002$F>   female        308

5.10 Ordering data

To order data, for instance, in ascending order according to a specific column you can use the arrange function.

w5dat_ordered_asc <- w5dat_selection2 %>%
  dplyr::arrange(Word.count)
# inspect data
head(w5dat_ordered_asc)

## # A tibble: 6 × 3
##   Filespeakerid Sex    Word.count
##   <chr>         <chr>       <dbl>
## 1 <S1B-009$D>   female          6
## 2 <S1B-005$C>   female          7
## 3 <S1B-009$C>   male            7
## 4 <S1B-020$F>   male            7
## 5 <S1B-006$G>   female          9
## 6 <S2A-050$B>   male            9

To order data in descending order you can also use the arrange function and simply add a - before the column according to which you want to order the data.

w5dat_ordered_desc <- w5dat_selection2 %>%
  dplyr::arrange(-Word.count)
# inspect data
head(w5dat_ordered_desc)

## # A tibble: 6 × 3
##   Filespeakerid Sex    Word.count
##   <chr>         <chr>       <dbl>
## 1 <S2A-055$A>   female       2355
## 2 <S2A-047$A>   male         2340
## 3 <S2A-035$A>   female       2244
## 4 <S2A-048$A>   male         2200
## 5 <S2A-015$A>   male         2172
## 6 <S2A-054$A>   female       2113

The output shows that the female speaker in file S2A-005 with the speaker identity A has the highest word count with 2,355 words.

5.11 Creating and changing variables

New columns are created, and existing columns can be changed, by using the mutate function. The mutate function takes two arguments (if the data does not have to be specified): the first argument is the (new) name of column that you want to create and the second is what you want to store in that column. The = tells R that the new column will contain the result of the second argument.

In the example below, we create a new column called Texttype.

This new column should contain

the value PrivateDialoge if Filespeakerid contains the sequence S1A,
the value PublicDialogue if Filespeakerid contains the sequence S1B,
the value UnscriptedMonologue if Filespeakerid contains the sequence S2A,
the value ScriptedMonologue if Filespeakerid contains the sequence S2B,
the value of Filespeakerid if Filespeakerid neither contains S1A, S1B, S2A, nor S2B.

w5dat_texttype <- w5dat_selection2 %>%
  dplyr::mutate(Texttype = 
                  dplyr::case_when(stringr::str_detect(Filespeakerid ,"S1A") ~ "PrivateDialoge",
                                   stringr::str_detect(Filespeakerid ,"S1B") ~ "PublicDialogue",
                                   stringr::str_detect(Filespeakerid ,"S2A") ~ "UnscriptedMonologue",
                                   stringr::str_detect(Filespeakerid ,"S2B") ~ "ScriptedMonologue",
                                   TRUE ~ Filespeakerid))
# inspect data
head(w5dat_texttype)

## # A tibble: 6 × 4
##   Filespeakerid Sex    Word.count Texttype      
##   <chr>         <chr>       <dbl> <chr>         
## 1 <S1A-002$A>   female        391 PrivateDialoge
## 2 <S1A-002$B>   female         47 PrivateDialoge
## 3 <S1A-002$C>   male          200 PrivateDialoge
## 4 <S1A-002$D>   female        464 PrivateDialoge
## 5 <S1A-002$E>   male          639 PrivateDialoge
## 6 <S1A-002$F>   female        308 PrivateDialoge

5.12 If-statements

We should briefly talk about if-statements (or case_when in the present case). The case_when function is both very powerful and extremely helpful as it allows you to assign values based on a test. As such, case_when-statements can be read as:

When/If X is the case, then do A and if X is not the case do B! (When/If -> Then -> Else)

The nice thing about ifelse or case_when-statements is that they can be used in succession as we have done above. This can then be read as:

If X is the case, then do A, if Y is the case, then do B, else do Z

5.13 Summarizing data

Summarizing is really helpful and can be done using the summarise function.

w5dat_summary1 <- w5dat_texttype %>%
  dplyr::summarise(Words = sum(Word.count))
# inspect data
head(w5dat_summary1)

## # A tibble: 1 × 1
##    Words
##    <dbl>
## 1 141876

To get summaries of sub-groups or by variable level, we can use the group_by function and then use the summarise function.

w5dat_summary2 <- w5dat_texttype %>%
  dplyr::group_by(Texttype, Sex) %>%
  dplyr::summarise(Speakers = n(),
            Words = sum(Word.count))
# inspect data
head(w5dat_summary2)

## # A tibble: 6 × 4
## # Groups:   Texttype [3]
##   Texttype            Sex    Speakers Words
##   <chr>               <chr>     <int> <dbl>
## 1 PrivateDialoge      female      105 60024
## 2 PrivateDialoge      male         18  9628
## 3 PublicDialogue      female       63 24647
## 4 PublicDialogue      male         41 16783
## 5 UnscriptedMonologue female        3  6712
## 6 UnscriptedMonologue male         16 24082

5.14 Gathering and spreading data

The tidyr package has two very useful functions for gathering and spreading data that can be sued to transform data to long and wide formats (you will see what this means below). The functions are called gather and spread.

We will use the data set called w5dat_summary2, which we created above, to demonstrate how this works.

We will first check out the spread-function to create different columns for women and men that show how many of them are represented in the different text types.

w5dat_summary_wide <- w5dat_summary2 %>%
  dplyr::select(-Words) %>%
  tidyr::spread(Sex, Speakers)
# inspect
w5dat_summary_wide

## # A tibble: 3 × 3
## # Groups:   Texttype [3]
##   Texttype            female  male
##   <chr>                <int> <int>
## 1 PrivateDialoge         105    18
## 2 PublicDialogue          63    41
## 3 UnscriptedMonologue      3    16

The data is now in what is called a wide-format as values are distributed across columns.

To reformat this back to a long-format where each column represents exactly one variable, we use the gather-function:

w5dat_summary_long <- w5dat_summary_wide %>%
  tidyr::gather(Sex, Speakers, female:male)
# inspect
w5dat_summary_long

## # A tibble: 6 × 3
## # Groups:   Texttype [3]
##   Texttype            Sex    Speakers
##   <chr>               <chr>     <int>
## 1 PrivateDialoge      female      105
## 2 PublicDialogue      female       63
## 3 UnscriptedMonologue female        3
## 4 PrivateDialoge      male         18
## 5 PublicDialogue      male         41
## 6 UnscriptedMonologue male         16

5.15 Ending R sessions

At the end of each session, you can extract information about the session itself (e.g. which R version you used and which versions of packages). This can help others (or even your future self) to reproduce the analysis that you have done.

5.16 Extracting session information

You can extract the session information by running the sessionInfo function (without any arguments)

sessionInfo()

## R version 4.2.1 RC (2022-06-17 r82510 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10    purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.1     tibble_3.1.8    ggplot2_3.3.6  
## [9] tidyverse_1.3.2
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.8.0     here_1.0.1          rprojroot_2.0.3    
##  [4] assertthat_0.2.1    digest_0.6.29       utf8_1.2.2         
##  [7] R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
## [10] reprex_2.0.2        evaluate_0.16       httr_1.4.4         
## [13] pillar_1.8.1        rlang_1.0.5         googlesheets4_1.0.1
## [16] readxl_1.4.1        rstudioapi_0.14     jquerylib_0.1.4    
## [19] rmarkdown_2.16      googledrive_2.0.0   munsell_0.5.0      
## [22] broom_1.0.1         compiler_4.2.1      modelr_0.1.9       
## [25] xfun_0.32           pkgconfig_2.0.3     htmltools_0.5.3    
## [28] tidyselect_1.1.2    bookdown_0.28       fansi_1.0.3        
## [31] crayon_1.5.1        tzdb_0.3.0          dbplyr_2.2.1       
## [34] withr_2.5.0         grid_4.2.1          jsonlite_1.8.0     
## [37] gtable_0.3.1        lifecycle_1.0.1     DBI_1.1.3          
## [40] magrittr_2.0.3      scales_1.2.1        cli_3.4.0          
## [43] stringi_1.7.8       cachem_1.0.6        renv_0.15.5        
## [46] fs_1.5.2            xml2_1.3.3          bslib_0.4.0        
## [49] ellipsis_0.3.2      generics_0.1.3      vctrs_0.4.1        
## [52] tools_4.2.1         glue_1.6.2          hms_1.1.2          
## [55] fastmap_1.1.0       yaml_2.3.5          colorspace_2.0-3   
## [58] gargle_1.2.1        rvest_1.0.3         knitr_1.40         
## [61] haven_2.5.1         sass_0.4.2

5.17 Going further

If you want to know more, would like to get some more practice, or would like to have another approach to R, please check out the workshops and resources on R provided by the UQ library. In addition, there are various online resources available to learn R (you can check out a very recommendable introduction here).

Here are also some additional resources that you may find helpful:

Grolemund. G., and Wickham, H., R 4 Data Science, 2017.
- Highly recommended! (especially chapters 1, 2, 4, 6, and 8)
Stat545 - Data wrangling, exploration, and analysis with R. University of British Columbia. http://stat545.com/
Swirlstats, a package that teaches you R and statistics within R: https://swirlstats.com/
DataCamp’s (free) Intro to R interactive tutorial: https://www.datacamp.com/courses/free-introduction-to-r
- DataCamp’s advanced R tutorials require a subscription. *Twitter:
- Explore RStudio Tips https://twitter.com/rstudiotips
- Explore #rstats, #rstudioconf