Part 2 Working with Tables

2.1 Preparation and session set up

This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the packages so you do not need to worry if it takes some time).

# install packages
install.packages("dplyr")
install.packages("tidyr")
install.packages("readxl")
install.packages("here")

Now that we have installed the packages, we can activate them as shown below.

# load packages
library(dplyr)
library(tidyr)
library(readxl)
library(here)

Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.

2.2 Getting started

Tables are one of the most common and important ways to process, handle, and describe data. This tutorial introduces different types of tables, how to load and save different types of tables, as well as how to modify and process tables and tabulated data.

When working with R, there are different kinds or types of tables that have different characteristics. The most common types of tables in R are:

matrices
data frames
tibbles

Matrices can only contain one type of data and all data points will be converted to the type of scale with the lowest information value. For instance, if at least one variables in a table represents characters (or strings), then all other variables are also converted into characters (although they may be numbers).

Data frames can contain different types of scales and variables are not converted automatically. In this sense, data frames are more flexible and are preferable over matrices.

Tibbles are the tidyverse equivalent of data frames which offer new functions and possibilities of handling and inspecting the data. .

2.3 Loading tables into R

Loading rda-files

There are several different functions that allow us to read tabulated data into R. In our case, we use the readRDS function which loads Rdata sets.

# load data with read.delim
mytable <- base::readRDS(here::here("data", "rdadat.rda"))
# inspect
head(mytable)

##   ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1       R1   002SIN preschool               s              2              DE
## 2       R2   002SIN preschool               s              2              DE
## 3       R3   002SIN preschool               s              2              DE
## 4       R4   002SIN preschool               s              2              DE
## 5       R5   002SIN preschool               s              2              DE
## 6       R6   002SIN preschool               s              2              DE
##   age_months  family
## 1         64 bil-rus
## 2         64 bil-rus
## 3         64 bil-rus
## 4         64 bil-rus
## 5         64 bil-rus
## 6         64 bil-rus

Loading txt-files

If the data is stored as a txt-file, there are various functions to read in the data. The most common functions are read.delim and read.table.

# load data with read.delim
txtdat <- read.delim(here::here("data", "txtdat.txt"), sep = "\t", header = TRUE)
# load data from a server/website 
txtdat2 <- read.table("https://raw.githubusercontent.com/MartinSchweinberger/acqvatabletree/master/data/txtdat.txt", sep = "\t", header = T)
# inspect
head(txtdat); head(txtdat2)

##         status    attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47

##         status    attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47

Loading xlsx-files

To load excel data you can use the read_excel function from the readxl package (which is part of the tidyverse).

# load data
excelcomp <- readxl::read_excel(here::here("data", "xlsxdat.xlsx"), sheet = 1)
# inspect
head(excelcomp)

## # A tibble: 6 × 9
##   ...1  ID_rater ID_child age_group accent_response accent_numeric
##   <chr> <chr>    <chr>    <chr>     <chr>                    <dbl>
## 1 1     R1       002SIN   preschool s                            2
## 2 2     R2       002SIN   preschool s                            2
## 3 3     R3       002SIN   preschool s                            2
## 4 4     R4       002SIN   preschool s                            2
## 5 5     R5       002SIN   preschool s                            2
## 6 6     R6       002SIN   preschool s                            2
## # … with 3 more variables: rating_language <chr>, age_months <dbl>,
## #   family <chr>

Loading csv-files

# load data
csvdat <- read.csv(here::here("data", "csvdat.csv"))
# inspect
head(csvdat)

##   ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1       R1   002SIN preschool               s              2              DE
## 2       R2   002SIN preschool               s              2              DE
## 3       R3   002SIN preschool               s              2              DE
## 4       R4   002SIN preschool               s              2              DE
## 5       R5   002SIN preschool               s              2              DE
## 6       R6   002SIN preschool               s              2              DE
##   age_months  family
## 1         64 bil-rus
## 2         64 bil-rus
## 3         64 bil-rus
## 4         64 bil-rus
## 5         64 bil-rus
## 6         64 bil-rus

Excercise Time!

Load your own data into R!

# load own data

2.4 Inspecting tables

The most common functions that are used to inspect tabular data are the head() and the str() functions. The head() function shows the first 6 lines (by default) but we can also specify the number of rows. The str() function provides a summary of the structure of the data.

Inspecting the first 6 rows of a table.

head(csvdat)

##   ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1       R1   002SIN preschool               s              2              DE
## 2       R2   002SIN preschool               s              2              DE
## 3       R3   002SIN preschool               s              2              DE
## 4       R4   002SIN preschool               s              2              DE
## 5       R5   002SIN preschool               s              2              DE
## 6       R6   002SIN preschool               s              2              DE
##   age_months  family
## 1         64 bil-rus
## 2         64 bil-rus
## 3         64 bil-rus
## 4         64 bil-rus
## 5         64 bil-rus
## 6         64 bil-rus

Inspecting the first 10 rows of a table.

head(csvdat, 10)

##    ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1        R1   002SIN preschool               s              2              DE
## 2        R2   002SIN preschool               s              2              DE
## 3        R3   002SIN preschool               s              2              DE
## 4        R4   002SIN preschool               s              2              DE
## 5        R5   002SIN preschool               s              2              DE
## 6        R6   002SIN preschool               s              2              DE
## 7        R7   002SIN preschool               s              2              DE
## 8        R8   002SIN preschool               s              2              DE
## 9        R9   002SIN preschool               s              2              DE
## 10      R10   002SIN preschool               w              1              DE
##    age_months  family
## 1          64 bil-rus
## 2          64 bil-rus
## 3          64 bil-rus
## 4          64 bil-rus
## 5          64 bil-rus
## 6          64 bil-rus
## 7          64 bil-rus
## 8          64 bil-rus
## 9          64 bil-rus
## 10         64 bil-rus

Checking the structure of tabulated data.

str(csvdat)

## 'data.frame':    777 obs. of  8 variables:
##  $ ID_rater       : chr  "R1" "R2" "R3" "R4" ...
##  $ ID_child       : chr  "002SIN" "002SIN" "002SIN" "002SIN" ...
##  $ age_group      : chr  "preschool" "preschool" "preschool" "preschool" ...
##  $ accent_response: chr  "s" "s" "s" "s" ...
##  $ accent_numeric : int  2 2 2 2 2 2 2 2 2 1 ...
##  $ rating_language: chr  "DE" "DE" "DE" "DE" ...
##  $ age_months     : int  64 64 64 64 64 64 64 64 64 64 ...
##  $ family         : chr  "bil-rus" "bil-rus" "bil-rus" "bil-rus" ...

We can also extract a summary of the data using the summary function.

summary(csvdat)

##    ID_rater           ID_child          age_group         accent_response   
##  Length:777         Length:777         Length:777         Length:777        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  accent_numeric   rating_language      age_months        family         
##  Min.   :0.0000   Length:777         Min.   : 48.00   Length:777        
##  1st Qu.:0.0000   Class :character   1st Qu.: 61.00   Class :character  
##  Median :1.0000   Mode  :character   Median : 79.00   Mode  :character  
##  Mean   :0.8132                      Mean   : 80.19                     
##  3rd Qu.:1.0000                      3rd Qu.: 97.00                     
##  Max.   :2.0000                      Max.   :116.00                     
##  NA's   :22

It can make sense to convert the character variables into factors for a summary which we can do with the mutate_if function.

csvdat %>%
  dplyr::mutate_if(is.character, factor) %>%
  summary()

##     ID_rater      ID_child       age_group   accent_response accent_numeric  
##  R1     : 37   002SIN : 21   preschool:378   no  :309        Min.   :0.0000  
##  R10    : 37   003SIN : 21   school   :378   s   :168        1st Qu.:0.0000  
##  R11    : 37   005SIN : 21   NA's     : 21   w   :278        Median :1.0000  
##  R12    : 37   007SIN : 21                   NA's: 22        Mean   :0.8132  
##  R13    : 37   008SIN : 21                                   3rd Qu.:1.0000  
##  R14    : 37   010SIN : 21                                   Max.   :2.0000  
##  (Other):555   (Other):651                                   NA's   :22      
##  rating_language   age_months           family   
##  DE:777          Min.   : 48.00   bil-mixed:105  
##                  1st Qu.: 61.00   bil-rus  :399  
##                  Median : 79.00   mono-de  :273  
##                  Mean   : 80.19                  
##                  3rd Qu.: 97.00                  
##                  Max.   :116.00                  
##

The following section shows how to access and manipulate tables.

Excercise Time!

Summarize your own data (the data you loaded above)!

# summarise own data

2.5 Processing tabular data

If you have already loaded data into R and now want to process the data, you typically have to modify the data in some form or another to get the information or format you need. We now load and inspect the txtdat data set using the read.delim function. The new data is stored as a txt file and has 100 observations (rows) and 3 variables (status, attraction, and money). The data set represents how much money people have spend in someone they were interested in or not (attraction: Interested versus NotInterested) and their own relationship status (status: Single versus Relationship).

2.6 Piping

Piping, done with the sequence %>%, is a very easy, intuitive, quick, and handy way to process data (we have used it above when we converted the character variables into factors). Essentially piping means that we take an element that is to the left of the piping symbol and then do something to it; that way, the piping symbol can be translated as and then.

We could, for example, load data and then capitalize the column names and then group the data by status and attraction and then get the mean of money spend on deleting all observations except for the first one. A more formal way to write this would be:

load %>% capitalize %>% group %>% summarize.

In R this command would look like this:

# example of a data processing pipeline
pipeddata <- read.delim(here::here("data", "txtdat.txt"), sep = "\t", header = TRUE) %>%
  dplyr::rename(Status = status, Attraction = attraction, Money = money) %>%
  dplyr::group_by(Status, Attraction) %>%
  dplyr::summarise(Mean = mean(Money))
# inspect summarized data
pipeddata

## # A tibble: 4 × 3
## # Groups:   Status [2]
##   Status       Attraction     Mean
##   <chr>        <chr>         <dbl>
## 1 Relationship Interested     99.2
## 2 Relationship NotInterested  51.5
## 3 Single       Interested    157. 
## 4 Single       NotInterested  46.0

The pipe has worked and we get the resulting summary which shows the mean of the money spend based on Attraction and Status.

2.7 Selecting and filtering

Among the most frequent procedures in data processing is selecting certain columns or extracting rows based on variable levels. In the tidyverse, this is done by using the select and filter functions. While select allows to extract columns, filter is used to extract rows, e.g. to get only observations that have a certain feature. Have a look at the example below.

# select and filter
reduceddata <- txtdat %>%
  # select the columns attraction and money
  dplyr::select(attraction, money) %>%
  # extract rows which represent cases where the person was interested in someone
  dplyr::filter(attraction == "Interested")
# inspect new table
nrow(reduceddata); table(reduceddata$attraction)

## [1] 50

## 
## Interested 
##         50

We have now reduced the data by excluding status (we have only selected attraction and money) and we have removed those 50 data rows of people who were not interested. The select function (like most other tidyverse functions) can also be used together with a minus sign which causes a column to be removed, thus dplyr::select(-money) would remove the money column (see below).

# select and filter
datawithoutmoney <- txtdat %>%
  # remove money
  dplyr::select(-money) 
# inspect data
head(datawithoutmoney)

##         status    attraction
## 1 Relationship NotInterested
## 2 Relationship NotInterested
## 3 Relationship NotInterested
## 4 Relationship NotInterested
## 5 Relationship NotInterested
## 6 Relationship NotInterested

Selecting and filtering are extremely powerful functions that can also be combined with other functions. But before we discuss more complex issues, we will have a look at how we can change columns.

Excercise Time!

Select any three variables from your own data set (the data you loaded above)!

# selecting variables from own data

2.8 Changing data and adding columns

Changing and adding data is done with the mutate function. The mutate functions requires that we specify a column name - if we use the same name as the column we are changing, then we change the column but if we specify another column name, then a new column is created.

We will now create a new column (Spendalot) in which we encode if the person has spend a lot of money (100 AUD or more) on the present or not (less than 100 AUD).

# creating a new column
newdata <- txtdat %>%
  dplyr::mutate(Spendalot = ifelse(money >= 100, "Alot", "Alittle")) 
# inspect data
head(newdata)

##         status    attraction money Spendalot
## 1 Relationship NotInterested 86.33   Alittle
## 2 Relationship NotInterested 45.58   Alittle
## 3 Relationship NotInterested 68.43   Alittle
## 4 Relationship NotInterested 52.93   Alittle
## 5 Relationship NotInterested 61.86   Alittle
## 6 Relationship NotInterested 48.47   Alittle

The table now has a new column (Spendalot) because we have specified a column name that did not exist yet - had we written dplyr::mutate(money = ifelse(money >= 100, "Alot", "Alittle")) then we would have changed the money column and replaced the money values with the labels Alot and Alittle.

You can also use the traditional way of doing this shown below if you are more comfortable with that.

osdat <- txtdat
# creating a new column
osdat$Spendalot_oldschool <- ifelse(osdat$money >= 100, "Alot", "Alittle")
# inspect data
head(osdat)

##         status    attraction money Spendalot_oldschool
## 1 Relationship NotInterested 86.33             Alittle
## 2 Relationship NotInterested 45.58             Alittle
## 3 Relationship NotInterested 68.43             Alittle
## 4 Relationship NotInterested 52.93             Alittle
## 5 Relationship NotInterested 61.86             Alittle
## 6 Relationship NotInterested 48.47             Alittle

2.9 Renaming columns

Oftentimes, column names are not really meaningful or incoherent which makes it easier to wrap your head around what the values in a column refer to. The easiest way around this is rename columns which is, fortunately very simple in the tidyverse. While the column names of our example table are meaningful, I want to capitalize the first letter of each column name. This can be done as follows.

# renaming columns
txtdat <- txtdat  %>%
  dplyr::rename(Status = status, 
                Attraction = attraction, 
                Money = colnames(.)[3])
# inspect data
head(txtdat)

##         Status    Attraction Money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47

The renaming was successful as all column names now begin with a capital letter.

Excercise Time!

Rename any variables in your own data set (the data you loaded above)!

# rename own data

2.10 Grouping and summarising

In contrast to mutate, which retains the number of rows, summarizing creates new columns but collapses rows and only provides the summary value (or values if more than one summary is specified). Also, columns that are not grouping variables are removed.

Summarizing is particularly useful when we want to get summaries of groups. We will modify the example from above and extract the mean and the standard deviation of the money spend on presents by relationship status and whether the giver was attracted to the giv-ee.

#grouping and summarizing data 
datasummary <- txtdat %>%
  dplyr::group_by(Status, Attraction) %>%
  dplyr::summarise(Mean = round(mean(Money), 2), SD = round(sd(Money), 1))
# inspect summarized data
datasummary

## # A tibble: 4 × 4
## # Groups:   Status [2]
##   Status       Attraction     Mean    SD
##   <chr>        <chr>         <dbl> <dbl>
## 1 Relationship Interested     99.2  14.7
## 2 Relationship NotInterested  51.5  17  
## 3 Single       Interested    157.   23.2
## 4 Single       NotInterested  46.0  19.9

Excercise Time!

Group and summarise own data set (the data you loaded above)!

# group and summarise variables from own data

2.11 Gathering and Spreading

One very common problem is that data - or at least parts of it - have to be transformed from long to wide format or vice versa. In the tidyverse, this is done using the gather and spread function. We will convert the summary table shown above into a wide format (we also remove the SD column as it is no longer needed)

# converting data to wide format 
widedata <- datasummary %>%
  # remove SD column
  dplyr::select(-SD) %>% 
  # convert into wide format
  tidyr::spread(Attraction, Mean)
# inspect wide data
widedata

## # A tibble: 2 × 3
## # Groups:   Status [2]
##   Status       Interested NotInterested
##   <chr>             <dbl>         <dbl>
## 1 Relationship       99.2          51.5
## 2 Single            157.           46.0

We can re-convert the wide into a long format using the gather function.

# converting data to long format 
longdata <- widedata %>%
  # convert into long format
  tidyr::gather(Attraction, Money, Interested:NotInterested)
# inspect wide data
longdata

## # A tibble: 4 × 3
## # Groups:   Status [2]
##   Status       Attraction    Money
##   <chr>        <chr>         <dbl>
## 1 Relationship Interested     99.2
## 2 Single       Interested    157. 
## 3 Relationship NotInterested  51.5
## 4 Single       NotInterested  46.0

Excercise Time!

Spread two variables from your own data set (the data you loaded above)!

# spread your own data

There are many more useful functions for processing, handling, and summarizing tables but this should suffice to get you started.

2.12 Saving tables on your computer

There are different ways to save your data depending on the format in which you want to save your data.

Save data as csv

# save as csv
write.csv(txtdat, here::here("data", "csvdat_save.csv"))

Save data as txt

# save as txt
write.table(txtdat, here::here("data", "txtdat_save.txt"), sep = "\t")

Save data as rda

# save as rda
base::saveRDS(txtdat, file = here::here("data", "txtdat_save.rda"))

Excercise Time!

Save your own data as rda!

# save your own data

2.13 Outro

sessionInfo()

## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] here_1.0.1   readxl_1.4.0 tidyr_1.2.0  dplyr_1.0.9 
## 
## loaded via a namespace (and not attached):
##  [1] cellranger_1.1.0 pillar_1.7.0     bslib_0.3.1      compiler_4.2.0  
##  [5] jquerylib_0.1.4  highr_0.9        tools_4.2.0      digest_0.6.29   
##  [9] jsonlite_1.8.0   evaluate_0.15    lifecycle_1.0.1  tibble_3.1.7    
## [13] pkgconfig_2.0.3  rlang_1.0.2      cli_3.3.0        DBI_1.1.2       
## [17] rstudioapi_0.13  yaml_2.3.5       xfun_0.31        fastmap_1.1.0   
## [21] stringr_1.4.0    knitr_1.39       generics_0.1.2   vctrs_0.4.1     
## [25] sass_0.4.1       rprojroot_2.0.3  tidyselect_1.1.2 glue_1.6.2      
## [29] R6_2.5.1         fansi_1.0.3      rmarkdown_2.14   bookdown_0.27   
## [33] purrr_0.3.4      magrittr_2.0.3   htmltools_0.5.2  ellipsis_0.3.2  
## [37] assertthat_0.2.1 utf8_1.2.2       stringi_1.7.6    crayon_1.5.1