Week 5 Working with Tables in R and RStudio 2

This week, we will continue to explore R and RStudio.

5.1 Working with tables

We will now start working with data in R. As most of the data that we work with comes in tables, we will focus on this first before moving on to working with text data.

5.2 Loading data from the web

To show, how data can be downloaded from the web, we will download a tab-separated txt-file. Translated to prose, the code below means Create an object called icebio and in that object, store the result of the read.delim function.

read.delim stands for read delimited file and it takes the URL from which to load the data (or the path to the data on your computer) as its first argument. The sep stand for separator and the \t stands for tab-separated and represents the second argument that the read.delim function takes. The third argument, header, can take either T(RUE) or F(ALSE) and it tells R if the data has column names (headers) or not.

5.3 Functions and Objects

In R, functions always have the following form: function(argument1, argument2, ..., argumentN). Typically a function does something to an object (e.g. a table), so that the first argument typically specifies the data to which the function is applied. Other arguments then allow to add some information. Just as a side note, functions are also objects that do not contain data but instructions.

To assign content to an object, we use <- or = so that the we provide a name for an object, and then assign some content to it. For example, MyObject <- 1:3 means Create an object called MyObject. this object should contain the numbers 1 to 3.

# load data
icebio <- read.delim("https://slcladal.github.io/data/BiodataIceIreland.txt", 
                      sep = "\t", header = T)

5.4 Inspecting data

There are many ways to inspect data. We will briefly go over the most common ways to inspect data.

The head function takes the data-object as its first argument and automatically shows the first 6 elements of an object (or rows if the data-object has a table format).

head(icebio)

##   id file.speaker.id text.id spk.ref             zone      date    sex   age
## 1  1     <S1A-001$A> S1A-001       A northern ireland 1990-1994   male 34-41
## 2  2     <S1A-001$B> S1A-001       B northern ireland 1990-1994 female 34-41
## 3  3     <S1A-002$?> S1A-002       ?             <NA>      <NA>   <NA>  <NA>
## 4  4     <S1A-002$A> S1A-002       A northern ireland 2002-2005 female 26-33
## 5  5     <S1A-002$B> S1A-002       B northern ireland 2002-2005 female 19-25
## 6  6     <S1A-002$C> S1A-002       C northern ireland 2002-2005   male   50+
##   word.count
## 1        765
## 2       1298
## 3         23
## 4        391
## 5         47
## 6        200

We can also use the head function to inspect more or less elements and we can specify the number of elements (or rows) that we want to inspect as a second argument. In the example below, the 4 tells R that we only want to see the first 4 rows of the data.

head(icebio, 4)

##   id file.speaker.id text.id spk.ref             zone      date    sex   age
## 1  1     <S1A-001$A> S1A-001       A northern ireland 1990-1994   male 34-41
## 2  2     <S1A-001$B> S1A-001       B northern ireland 1990-1994 female 34-41
## 3  3     <S1A-002$?> S1A-002       ?             <NA>      <NA>   <NA>  <NA>
## 4  4     <S1A-002$A> S1A-002       A northern ireland 2002-2005 female 26-33
##   word.count
## 1        765
## 2       1298
## 3         23
## 4        391

EXERCISE TIME!

Download and inspect the first 7 rows of the data set that you can find under this URL: https://slcladal.github.io/data/lmmdata.txt. Can you guess what the data is about?

Answer

  ex1data <- read.delim("https://slcladal.github.io/data/lmmdata.txt", sep = "\t")
  head(ex1data, 7)

  ##   Date         Genre    Text Prepositions Region
  ## 1 1736       Science   albin       166.01  North
  ## 2 1711     Education    anon       139.86  North
  ## 3 1808 PrivateLetter  austen       130.78  North
  ## 4 1878     Education    bain       151.29  North
  ## 5 1743     Education barclay       145.72  North
  ## 6 1908     Education  benson       120.77  North
  ## 7 1906         Diary  benson       119.17  North

The data is about texts and the different columns provide information about the texts such as when the texts were written (Date), the genre the texts represent (Genre), the name of the texts (Text), the relative frequencies of prepositions the texts contain (Prepositions), and the region where the author was from (Region).

5.5 Accessing individual cells in a table

If you want to access specific cells in a table, you can do so by typing the name of the object and then specify the rows and columns in square brackets (i.e. data[row, column]). For example, icebio[2, 4] would show the value of the cell in the second row and fourth column of the object icebio. We can also use the colon to define a range (as shown below, where 1:5 means from 1 to 5 and 1:3 means from 1 to 3) The command icebio[1:5, 1:3] thus means:

Show me the first 5 rows and the first 3 columns of the data-object that is called icebio.

icebio[1:5, 1:3]

##   id file.speaker.id text.id
## 1  1     <S1A-001$A> S1A-001
## 2  2     <S1A-001$B> S1A-001
## 3  3     <S1A-002$?> S1A-002
## 4  4     <S1A-002$A> S1A-002
## 5  5     <S1A-002$B> S1A-002

EXERCISE TIME!

How would you inspect the content of the cells in 4^th column, rows 3 to 5 of the icebio data set?

Answer

  icebio[3:5, 4]

  ## [1] "?" "A" "B"

Inspecting the structure of data

You can use the str function to inspect the structure of a data set. This means that this function will show the number of observations (rows) and variables (columns) and tell you what type of variables the data consists of

int = integer
chr = character string
num = numeric
fct = factor

str(icebio)

## 'data.frame':    1332 obs. of  9 variables:
##  $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ file.speaker.id: chr  "<S1A-001$A>" "<S1A-001$B>" "<S1A-002$?>" "<S1A-002$A>" ...
##  $ text.id        : chr  "S1A-001" "S1A-001" "S1A-002" "S1A-002" ...
##  $ spk.ref        : chr  "A" "B" "?" "A" ...
##  $ zone           : chr  "northern ireland" "northern ireland" NA "northern ireland" ...
##  $ date           : chr  "1990-1994" "1990-1994" NA "2002-2005" ...
##  $ sex            : chr  "male" "female" NA "female" ...
##  $ age            : chr  "34-41" "34-41" NA "26-33" ...
##  $ word.count     : int  765 1298 23 391 47 200 464 639 308 78 ...

The summary function summarizes the data.

summary(icebio)

##        id         file.speaker.id      text.id            spk.ref         
##  Min.   :   1.0   Length:1332        Length:1332        Length:1332       
##  1st Qu.: 333.8   Class :character   Class :character   Class :character  
##  Median : 666.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 666.5                                                           
##  3rd Qu.: 999.2                                                           
##  Max.   :1332.0                                                           
##      zone               date               sex                age           
##  Length:1332        Length:1332        Length:1332        Length:1332       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    word.count    
##  Min.   :   0.0  
##  1st Qu.:  66.0  
##  Median : 240.5  
##  Mean   : 449.9  
##  3rd Qu.: 638.2  
##  Max.   :2565.0

5.6 Tabulating data

We can use the table function to create basic tables that extract raw frequency information. The following command tells us how many instances there are of each level of the variable date in the icebio.

TIP

In order to access specific columns of a data frame, you can first type the name of the data set followed by a $ symbol and then the name of the column (or variable).

table(icebio$date)

## 
## 1990-1994 1995-2001 2002-2005 
##       905        67       270

Alternatively, you could, of course, index the column by using its position in the data set like this: icebio[, 6] - the result of table(icebio[, 6]) and table(icebio$date) are the same! Also note that here we leave out indexes for rows to tell R that we want all rows.

When you want to cross-tabulate columns, it is often better to use the ftable function (ftable stands for frequency table).

ftable(icebio$age, icebio$sex)

##        female male
##                   
## 0-18        5    7
## 19-25     163   65
## 26-33      83   36
## 34-41      35   58
## 42-49      35   97
## 50+        63  138

EXERCISE TIME!

Using the table function, how many women are in the data collected between 2002 and 2005?

Answer

  table(icebio$date, icebio$sex)

  ##            
  ##             female male
  ##   1990-1994    338  562
  ##   1995-2001      4   58
  ##   2002-2005    186   84

Using the ftable function, how many men are are from northern Ireland in the data collected between 1990 and 1994?

Answer

  ftable(icebio$date, icebio$zone, icebio$sex)

  ##                                     female male
  ##                                                
  ## 1990-1994 mixed between ni and roi      18   13
  ##           non-corpus speaker             7   22
  ##           northern ireland             104  289
  ##           republic of ireland          209  238
  ## 1995-2001 mixed between ni and roi       0    0
  ##           non-corpus speaker             1    1
  ##           northern ireland               2   36
  ##           republic of ireland            1   21
  ## 2002-2005 mixed between ni and roi      19    7
  ##           non-corpus speaker             7    9
  ##           northern ireland             122   41
  ##           republic of ireland           38   27

5.7 Saving data to your computer

To save tabular data on your computer, you can use the write.table function. This function requires the data that you want to save as its first argument, the location where you want to save the data as the second argument and the type of delimiter as the third argument.

write.table(icebio, here::here("data", "icebio.txt"), sep = "\t")

A word about paths

In the code chunk above, the sequence here::here("data", "icebio.txt") is a handy way to define a path. A path is simply the location where a file is stored on your computer or on the internet (which typically is a server - which is just a fancy term for a computer - somewhere on the globe). The here function from thehere package allows to simply state in which folder a certain file is and what file you are talking about.

In this case, we want to access the file icebio (which is a txt file and thus has the appendix .txt) in the data folder. R will always start looking in the folder in which your project is stored. If you want to access a file that is stored somewhere else on your computer, you can also define the full path to the folder in which the file is. In my case, this would be D:/Uni/UQ/SLC/LADAL/SLCLADAL.github.io/data. However, as the data folder in in the folder where my Rproj file is, I only need to specify that the file is in the data folder within the folder in which my Rproj file is located.

A word about package naming

Another thing that is notable in the sequence here::here("data", "icebio.txt") is that I specified that the here function is part of the here package. This is what I meant by writing here::here which simply means use the here function from here package (package::function). This may appear to be somewhat redundant but it happens quite frequently, that different packages have functions that have the same names. In such cases, R will simply choose the function from the package that was loaded last. To prevent R from using the wrong function, it makes sense to specify the package AND the function (as I did in the sequence here::here). I only use functions without specify the package if the function is part of base R.

5.8 Loading data from your computer

To load tabular data from within your project folder (if it is in a tab-separated txt-file) you can also use the read.delim function. The only difference to loading from the web is that you use a path instead of a URL. If the txt-file is in the folder called data in your project folder, you would load the data as shown below.

icebio <- read.delim(here::here("data", "icebio.txt"), sep = "\t", header = T)

However, you can always just use the full path (and you must do this is the data is not in your project folder).

NOTE
You may have to change the path to the data!

icebio <- read.delim(here::here("data", "icebio.txt"), 
                      sep = "\t", header = T)

To if this has worked, we will use the head function to see first 6 rows of the data

head(icebio)

##   id file.speaker.id text.id spk.ref             zone      date    sex   age
## 1  1     <S1A-001$A> S1A-001       A northern ireland 1990-1994   male 34-41
## 2  2     <S1A-001$B> S1A-001       B northern ireland 1990-1994 female 34-41
## 3  3     <S1A-002$?> S1A-002       ?             <NA>      <NA>   <NA>  <NA>
## 4  4     <S1A-002$A> S1A-002       A northern ireland 2002-2005 female 26-33
## 5  5     <S1A-002$B> S1A-002       B northern ireland 2002-2005 female 19-25
## 6  6     <S1A-002$C> S1A-002       C northern ireland 2002-2005   male   50+
##   word.count
## 1        765
## 2       1298
## 3         23
## 4        391
## 5         47
## 6        200

5.9 Loading Excel data

To load Excel spreadsheets, you can use the read_excel function from the readxl package as shown below. However, it may be necessary to install and activate the readxl package first.

icebio <- readxl::read_excel(here::here("data", "ICEdata.xlsx"))

We now briefly check column names to see if the loading of the data has worked.

colnames(icebio)

## [1] "id"              "file.speaker.id" "text.id"         "spk.ref"        
## [5] "zone"            "date"            "sex"             "age"            
## [9] "word.count"

5.10 Renaming, Piping, and Filtering

To rename existing columns in a table, you can use the rename command which takes the table as the first argument, the new name as the second argument, the an equal sign (=), and finally, the old name es the third argument. For example, renaming a column OldName as NewName in a table called MyTable would look like this: rename(MyTable, NewName = OldName).

Piping is done using the %>% sequence and it can be translated as and then. In the example below, we create a new object (icebio_edit) from the existing object (icebio) and then we rename the columns in the new object. When we use piping, we do not need to name the data we are using as this is provided by the previous step.

icebio_edit <- icebio %>%
  dplyr::rename(Id = id,
         FileSpeakerId = file.speaker.id,
         File = colnames(icebio)[3],
         Speaker = colnames(icebio)[4])
# inspect data
icebio_edit[1:5, 1:6]

## # A tibble: 5 × 6
##      Id FileSpeakerId File    Speaker zone             date     
##   <dbl> <chr>         <chr>   <chr>   <chr>            <chr>    
## 1     1 <S1A-001$A>   S1A-001 A       northern ireland 1990-1994
## 2     2 <S1A-001$B>   S1A-001 B       northern ireland 1990-1994
## 3     3 <S1A-002$?>   S1A-002 ?       NA               NA       
## 4     4 <S1A-002$A>   S1A-002 A       northern ireland 2002-2005
## 5     5 <S1A-002$B>   S1A-002 B       northern ireland 2002-2005

A very handy way to rename many columns simultaneously, you can use the str_to_title function which capitalizes first letter of a word. In the example below, we capitalize all first letters of the column names of our current data.

colnames(icebio_edit) <- stringr::str_to_title(colnames(icebio_edit))
# inspect data
icebio_edit[1:5, 1:6]

## # A tibble: 5 × 6
##      Id Filespeakerid File    Speaker Zone             Date     
##   <dbl> <chr>         <chr>   <chr>   <chr>            <chr>    
## 1     1 <S1A-001$A>   S1A-001 A       northern ireland 1990-1994
## 2     2 <S1A-001$B>   S1A-001 B       northern ireland 1990-1994
## 3     3 <S1A-002$?>   S1A-002 ?       NA               NA       
## 4     4 <S1A-002$A>   S1A-002 A       northern ireland 2002-2005
## 5     5 <S1A-002$B>   S1A-002 B       northern ireland 2002-2005

To remove rows based on values in columns you can use the filter function.

icebio_edit2 <- icebio_edit %>%
  dplyr::filter(Speaker != "?",
         Zone != is.na(Zone),
         Date == "2002-2005",
         Word.count > 5)
# inspect data
head(icebio_edit2)

## # A tibble: 6 × 9
##      Id Filespeakerid File    Speaker Zone          Date  Sex   Age   Word.count
##   <dbl> <chr>         <chr>   <chr>   <chr>         <chr> <chr> <chr>      <dbl>
## 1     4 <S1A-002$A>   S1A-002 A       northern ire… 2002… fema… 26-33        391
## 2     5 <S1A-002$B>   S1A-002 B       northern ire… 2002… fema… 19-25         47
## 3     6 <S1A-002$C>   S1A-002 C       northern ire… 2002… male  50+          200
## 4     7 <S1A-002$D>   S1A-002 D       northern ire… 2002… fema… 50+          464
## 5     8 <S1A-002$E>   S1A-002 E       mixed betwee… 2002… male  34-41        639
## 6     9 <S1A-002$F>   S1A-002 F       northern ire… 2002… fema… 26-33        308

To select specific columns you can use the select function.

icebio_selection <- icebio_edit2 %>%
  dplyr::select(File, Speaker, Word.count)
# inspect data
head(icebio_selection)

## # A tibble: 6 × 3
##   File    Speaker Word.count
##   <chr>   <chr>        <dbl>
## 1 S1A-002 A              391
## 2 S1A-002 B               47
## 3 S1A-002 C              200
## 4 S1A-002 D              464
## 5 S1A-002 E              639
## 6 S1A-002 F              308

You can also use the select function to remove specific columns.

icebio_selection2 <- icebio_edit2 %>%
  dplyr::select(-Id, -File, -Speaker, -Date, -Zone, -Age)
# inspect data
head(icebio_selection2)

## # A tibble: 6 × 3
##   Filespeakerid Sex    Word.count
##   <chr>         <chr>       <dbl>
## 1 <S1A-002$A>   female        391
## 2 <S1A-002$B>   female         47
## 3 <S1A-002$C>   male          200
## 4 <S1A-002$D>   female        464
## 5 <S1A-002$E>   male          639
## 6 <S1A-002$F>   female        308

5.11 Ordering data

To order data, for instance, in ascending order according to a specific column you can use the arrange function.

icebio_ordered_asc <- icebio_selection2 %>%
  dplyr::arrange(Word.count)
# inspect data
head(icebio_ordered_asc)

## # A tibble: 6 × 3
##   Filespeakerid Sex    Word.count
##   <chr>         <chr>       <dbl>
## 1 <S1B-009$D>   female          6
## 2 <S1B-005$C>   female          7
## 3 <S1B-009$C>   male            7
## 4 <S1B-020$F>   male            7
## 5 <S1B-006$G>   female          9
## 6 <S2A-050$B>   male            9

To order data in descending order you can also use the arrange function and simply add a - before the column according to which you want to order the data.

icebio_ordered_desc <- icebio_selection2 %>%
  dplyr::arrange(-Word.count)
# inspect data
head(icebio_ordered_desc)

## # A tibble: 6 × 3
##   Filespeakerid Sex    Word.count
##   <chr>         <chr>       <dbl>
## 1 <S2A-055$A>   female       2355
## 2 <S2A-047$A>   male         2340
## 3 <S2A-035$A>   female       2244
## 4 <S2A-048$A>   male         2200
## 5 <S2A-015$A>   male         2172
## 6 <S2A-054$A>   female       2113

The output shows that the female speaker in file S2A-005 with the speaker identity A has the highest word count with 2,355 words.

EXERCISE TIME!

Using the data called icebio, create a new data set called ICE_Ire_ordered and arrange the data in descending order by the number of words that each speaker has uttered. Who is the speaker with the highest word count?

Answer

  ICE_Ire_ordered <- icebio %>%
  dplyr::arrange(-word.count)
  # inspect data
  head(ICE_Ire_ordered)

  ## # A tibble: 6 × 9
  ##      id file.speaker.id text.id spk.ref zone        date  sex   age   word.count
  ##   <dbl> <chr>           <chr>   <chr>   <chr>       <chr> <chr> <chr>      <dbl>
  ## 1   956 <S2A-037$A>     S2A-037 A       republic o… 1990… male  NA          2565
  ## 2   919 <S2A-016$A>     S2A-016 A       republic o… 1995… fema… 34-41       2482
  ## 3   933 <S2A-023$A>     S2A-023 A       northern i… 1990… male  50+         2367
  ## 4   992 <S2A-055$A>     S2A-055 A       northern i… 2002… fema… 42-49       2355
  ## 5   979 <S2A-047$A>     S2A-047 A       republic o… 2002… male  50+         2340
  ## 6   997 <S2A-059$A>     S2A-059 A       republic o… 1990… fema… NA          2305

5.12 Creating and changing variables

New columns are created, and existing columns can be changed, by using the mutate function. The mutate function takes two arguments (if the data does not have to be specified): the first argument is the (new) name of column that you want to create and the second is what you want to store in that column. The = tells R that the new column will contain the result of the second argument.

In the example below, we create a new column called Texttype.

This new column should contain

the value PrivateDialoge if Filespeakerid contains the sequence S1A,
the value PublicDialogue if Filespeakerid contains the sequence S1B,
the value UnscriptedMonologue if Filespeakerid contains the sequence S2A,
the value ScriptedMonologue if Filespeakerid contains the sequence S2B,
the value of Filespeakerid if Filespeakerid neither contains S1A, S1B, S2A, nor S2B.

icebio_texttype <- icebio_selection2 %>%
  dplyr::mutate(Texttype = 
                  dplyr::case_when(stringr::str_detect(Filespeakerid ,"S1A") ~ "PrivateDialoge",
                                   stringr::str_detect(Filespeakerid ,"S1B") ~ "PublicDialogue",
                                   stringr::str_detect(Filespeakerid ,"S2A") ~ "UnscriptedMonologue",
                                   stringr::str_detect(Filespeakerid ,"S2B") ~ "ScriptedMonologue",
                                   TRUE ~ Filespeakerid))
# inspect data
head(icebio_texttype)

## # A tibble: 6 × 4
##   Filespeakerid Sex    Word.count Texttype      
##   <chr>         <chr>       <dbl> <chr>         
## 1 <S1A-002$A>   female        391 PrivateDialoge
## 2 <S1A-002$B>   female         47 PrivateDialoge
## 3 <S1A-002$C>   male          200 PrivateDialoge
## 4 <S1A-002$D>   female        464 PrivateDialoge
## 5 <S1A-002$E>   male          639 PrivateDialoge
## 6 <S1A-002$F>   female        308 PrivateDialoge

5.13 If-statements

We should briefly talk about if-statements (or case_when in the present case). The case_when function is both very powerful and extremely helpful as it allows you to assign values based on a test. As such, case_when-statements can be read as:

When/If X is the case, then do A and if X is not the case do B! (When/If -> Then -> Else)

The nice thing about ifelse or case_when-statements is that they can be used in succession as we have done above. This can then be read as:

If X is the case, then do A, if Y is the case, then do B, else do Z

EXERCISE TIME!

1.Using the data called icebio, create a new data set called ICE_Ire_AgeGroup in which you create a column called AgeGroup where all speakers who are younger than 42 have the value young and all speakers aged 42 and over old.

Tip: use if-statements to assign the old and young values.

Answer

  ICE_Ire_AgeGroup <- icebio %>%
  dplyr::mutate(AgeGroup = dplyr::case_when(age == "42-49" ~ "old",
                                            age == "50+" ~ "old", 
                                            age == "0-18" ~ "young",
                                            age == "19-25" ~ "young",
                                            age == "26-33" ~ "young",
                                            age == "34-41" ~ "young", 
                                            TRUE ~age))
  # inspect data
  head(ICE_Ire_AgeGroup); table(ICE_Ire_AgeGroup$AgeGroup)

  ## # A tibble: 6 × 10
  ##      id file.speaker.id text.id spk.ref zone        date  sex   age   word.count
  ##   <dbl> <chr>           <chr>   <chr>   <chr>       <chr> <chr> <chr>      <dbl>
  ## 1     1 <S1A-001$A>     S1A-001 A       northern i… 1990… male  34-41        765
  ## 2     2 <S1A-001$B>     S1A-001 B       northern i… 1990… fema… 34-41       1298
  ## 3     3 <S1A-002$?>     S1A-002 ?       NA          NA    NA    NA            23
  ## 4     4 <S1A-002$A>     S1A-002 A       northern i… 2002… fema… 26-33        391
  ## 5     5 <S1A-002$B>     S1A-002 B       northern i… 2002… fema… 19-25         47
  ## 6     6 <S1A-002$C>     S1A-002 C       northern i… 2002… male  50+          200
  ## # … with 1 more variable: AgeGroup <chr>

  ## 
  ##    NA   old young 
  ##   547   333   452

5.14 Summarizing data

Summarizing is really helpful and can be done using the summarise function.

icebio_summary1 <- icebio_texttype %>%
  dplyr::summarise(Words = sum(Word.count))
# inspect data
head(icebio_summary1)

## # A tibble: 1 × 1
##    Words
##    <dbl>
## 1 141876

To get summaries of sub-groups or by variable level, we can use the group_by function and then use the summarise function.

icebio_summary2 <- icebio_texttype %>%
  dplyr::group_by(Texttype, Sex) %>%
  dplyr::summarise(Speakers = n(),
            Words = sum(Word.count))
# inspect data
head(icebio_summary2)

## # A tibble: 6 × 4
## # Groups:   Texttype [3]
##   Texttype            Sex    Speakers Words
##   <chr>               <chr>     <int> <dbl>
## 1 PrivateDialoge      female      105 60024
## 2 PrivateDialoge      male         18  9628
## 3 PublicDialogue      female       63 24647
## 4 PublicDialogue      male         41 16783
## 5 UnscriptedMonologue female        3  6712
## 6 UnscriptedMonologue male         16 24082

EXERCISE TIME!

Use the icebio and determine the number of words uttered by female speakers from Northern Ireland above an age of 50.

Answer

  words_fni50 <- icebio %>%
  dplyr::select(zone, sex, age, word.count) %>%
  dplyr::group_by(zone, sex, age) %>%
  dplyr::summarize(Words = sum(word.count)) %>%
  dplyr::filter(sex == "female",
         age == "50+",
         zone == "northern ireland")

  ## `summarise()` has grouped output by 'zone', 'sex'. You can override using the
  ## `.groups` argument.

  # inspect data
  words_fni50

  ## # A tibble: 1 × 4
  ## # Groups:   zone, sex [1]
  ##   zone             sex    age   Words
  ##   <chr>            <chr>  <chr> <dbl>
  ## 1 northern ireland female 50+   23210

Load the file exercisedata.txt and determine the mean scores of groups A and B.

Tip: to extract the mean, combine the summary function with the mean function.

Answer

  exercisedata <- read.delim("data/exercisedata.txt", sep = "\t", header = T) %>%
  dplyr::group_by(Group) %>%
  dplyr::summarize(Mean = mean(Score))
  # inspect data
  exercisedata

  ## # A tibble: 2 × 2
  ##   Group  Mean
  ##   <chr> <dbl>
  ## 1 A      14.9
  ## 2 B      11.8

5.15 Gathering and spreading data

The tidyr package has two very useful functions for gathering and spreading data that can be sued to transform data to long and wide formats (you will see what this means below). The functions are called gather and spread.

We will use the data set called icebio_summary2, which we created above, to demonstrate how this works.

We will first check out the spread-function to create different columns for women and men that show how many of them are represented in the different text types.

icebio_summary_wide <- icebio_summary2 %>%
  dplyr::select(-Words) %>%
  tidyr::spread(Sex, Speakers)
# inspect
icebio_summary_wide

## # A tibble: 3 × 3
## # Groups:   Texttype [3]
##   Texttype            female  male
##   <chr>                <int> <int>
## 1 PrivateDialoge         105    18
## 2 PublicDialogue          63    41
## 3 UnscriptedMonologue      3    16

The data is now in what is called a wide-format as values are distributed across columns.

To reformat this back to a long-format where each column represents exactly one variable, we use the gather-function:

icebio_summary_long <- icebio_summary_wide %>%
  tidyr::gather(Sex, Speakers, female:male)
# inspect
icebio_summary_long

## # A tibble: 6 × 3
## # Groups:   Texttype [3]
##   Texttype            Sex    Speakers
##   <chr>               <chr>     <int>
## 1 PrivateDialoge      female      105
## 2 PublicDialogue      female       63
## 3 UnscriptedMonologue female        3
## 4 PrivateDialoge      male         18
## 5 PublicDialogue      male         41
## 6 UnscriptedMonologue male         16

5.16 Ending R sessions

At the end of each session, you can extract information about the session itself (e.g. which R version you used and which versions of packages). This can help others (or even your future self) to reproduce the analysis that you have done.

5.17 Extracting session information

You can extract the session information by running the sessionInfo function (without any arguments)

sessionInfo()

## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9     purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.7    ggplot2_3.3.6  
## [9] tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2 xfun_0.30        bslib_0.3.1      haven_2.5.0     
##  [5] colorspace_2.0-3 vctrs_0.4.1      generics_0.1.2   htmltools_0.5.2 
##  [9] yaml_2.3.5       utf8_1.2.2       rlang_1.0.2      jquerylib_0.1.4 
## [13] pillar_1.7.0     withr_2.5.0      glue_1.6.2       DBI_1.1.2       
## [17] dbplyr_2.1.1     readxl_1.4.0     modelr_0.1.8     lifecycle_1.0.1 
## [21] cellranger_1.1.0 munsell_0.5.0    gtable_0.3.0     rvest_1.0.2     
## [25] evaluate_0.15    knitr_1.39       tzdb_0.3.0       fastmap_1.1.0   
## [29] fansi_1.0.3      broom_0.8.0      renv_0.15.4      backports_1.4.1 
## [33] scales_1.2.0     jsonlite_1.8.0   fs_1.5.2         hms_1.1.1       
## [37] digest_0.6.29    stringi_1.7.6    bookdown_0.26    rprojroot_2.0.3 
## [41] grid_4.2.0       here_1.0.1       cli_3.3.0        tools_4.2.0     
## [45] magrittr_2.0.3   sass_0.4.1       crayon_1.5.1     pkgconfig_2.0.3 
## [49] ellipsis_0.3.2   xml2_1.3.3       reprex_2.0.1     lubridate_1.8.0 
## [53] assertthat_0.2.1 rmarkdown_2.14   httr_1.4.3       rstudioapi_0.13 
## [57] R6_2.5.1         compiler_4.2.0

5.18 Going further

If you want to know more, would like to get some more practice, or would like to have another approach to R, please check out the workshops and resources on R provided by the UQ library. In addition, there are various online resources available to learn R (you can check out a very recommendable introduction here).

Here are also some additional resources that you may find helpful:

Grolemund. G., and Wickham, H., R 4 Data Science, 2017.
- Highly recommended! (especially chapters 1, 2, 4, 6, and 8)
Stat545 - Data wrangling, exploration, and analysis with R. University of British Columbia. http://stat545.com/
Swirlstats, a package that teaches you R and statistics within R: https://swirlstats.com/
DataCamp’s (free) Intro to R interactive tutorial: https://www.datacamp.com/courses/free-introduction-to-r
- DataCamp’s advanced R tutorials require a subscription. *Twitter:
- Explore RStudio Tips https://twitter.com/rstudiotips
- Explore #rstats, #rstudioconf