Part 2 Working with Tables
2.1 Preparation and session set up
This tutorial is based on R. If you have not installed R or are new to it, you will find an introduction to and more information how to use R here. For this tutorials, we need to install certain packages from an R library so that the scripts shown below are executed without errors. Before turning to the code below, please install the packages by running the code below this paragraph. If you have already installed the packages mentioned below, then you can skip ahead ignore this section. To install the necessary packages, simply run the following code - it may take some time (between 1 and 5 minutes to install all of the packages so you do not need to worry if it takes some time).
# install packages
install.packages("dplyr")
install.packages("tidyr")
install.packages("readxl")
install.packages("here")
Now that we have installed the packages, we can activate them as shown below.
# load packages
library(dplyr)
library(tidyr)
library(readxl)
library(here)
Once you have installed RStudio and initiated the session by executing the code shown above, you are good to go.
2.2 Getting started
Tables are one of the most common and important ways to process, handle, and describe data. This tutorial introduces different types of tables, how to load and save different types of tables, as well as how to modify and process tables and tabulated data.
When working with R, there are different kinds or types of tables that have different characteristics. The most common types of tables in R are:
- matrices
- data frames
- tibbles
Matrices can only contain one type of data and all data points will be converted to the type of scale with the lowest information value. For instance, if at least one variables in a table represents characters (or strings), then all other variables are also converted into characters (although they may be numbers).
Data frames can contain different types of scales and variables are not converted automatically. In this sense, data frames are more flexible and are preferable over matrices.
Tibbles are the tidyverse equivalent of data frames which offer new functions and possibilities of handling and inspecting the data. .
2.3 Loading tables into R
Loading rda-files
There are several different functions that allow us to read tabulated data into R. In our case, we use the readRDS
function which loads Rdata sets.
# load data with read.delim
<- base::readRDS(here::here("data", "rdadat.rda"))
mytable # inspect
head(mytable)
## ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1 R1 002SIN preschool s 2 DE
## 2 R2 002SIN preschool s 2 DE
## 3 R3 002SIN preschool s 2 DE
## 4 R4 002SIN preschool s 2 DE
## 5 R5 002SIN preschool s 2 DE
## 6 R6 002SIN preschool s 2 DE
## age_months family
## 1 64 bil-rus
## 2 64 bil-rus
## 3 64 bil-rus
## 4 64 bil-rus
## 5 64 bil-rus
## 6 64 bil-rus
Loading txt-files
If the data is stored as a txt-file, there are various functions to read in the data. The most common functions are read.delim
and read.table
.
# load data with read.delim
<- read.delim(here::here("data", "txtdat.txt"), sep = "\t", header = TRUE)
txtdat # load data from a server/website
<- read.table("https://raw.githubusercontent.com/MartinSchweinberger/acqvatabletree/master/data/txtdat.txt", sep = "\t", header = T)
txtdat2 # inspect
head(txtdat); head(txtdat2)
## status attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
## status attraction money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
Loading xlsx-files
To load excel data you can use the read_excel
function from the readxl
package (which is part of the tidyverse).
# load data
<- readxl::read_excel(here::here("data", "xlsxdat.xlsx"), sheet = 1)
excelcomp # inspect
head(excelcomp)
## # A tibble: 6 × 9
## ...1 ID_rater ID_child age_group accent_response accent_numeric
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 R1 002SIN preschool s 2
## 2 2 R2 002SIN preschool s 2
## 3 3 R3 002SIN preschool s 2
## 4 4 R4 002SIN preschool s 2
## 5 5 R5 002SIN preschool s 2
## 6 6 R6 002SIN preschool s 2
## # … with 3 more variables: rating_language <chr>, age_months <dbl>,
## # family <chr>
Loading csv-files
# load data
<- read.csv(here::here("data", "csvdat.csv"))
csvdat # inspect
head(csvdat)
## ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1 R1 002SIN preschool s 2 DE
## 2 R2 002SIN preschool s 2 DE
## 3 R3 002SIN preschool s 2 DE
## 4 R4 002SIN preschool s 2 DE
## 5 R5 002SIN preschool s 2 DE
## 6 R6 002SIN preschool s 2 DE
## age_months family
## 1 64 bil-rus
## 2 64 bil-rus
## 3 64 bil-rus
## 4 64 bil-rus
## 5 64 bil-rus
## 6 64 bil-rus
2.4 Inspecting tables
The most common functions that are used to inspect tabular data are the head()
and the str()
functions. The head()
function shows the first 6 lines (by default) but we can also specify the number of rows. The str()
function provides a summary of the structure of the data.
Inspecting the first 6 rows of a table.
head(csvdat)
## ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1 R1 002SIN preschool s 2 DE
## 2 R2 002SIN preschool s 2 DE
## 3 R3 002SIN preschool s 2 DE
## 4 R4 002SIN preschool s 2 DE
## 5 R5 002SIN preschool s 2 DE
## 6 R6 002SIN preschool s 2 DE
## age_months family
## 1 64 bil-rus
## 2 64 bil-rus
## 3 64 bil-rus
## 4 64 bil-rus
## 5 64 bil-rus
## 6 64 bil-rus
Inspecting the first 10 rows of a table.
head(csvdat, 10)
## ID_rater ID_child age_group accent_response accent_numeric rating_language
## 1 R1 002SIN preschool s 2 DE
## 2 R2 002SIN preschool s 2 DE
## 3 R3 002SIN preschool s 2 DE
## 4 R4 002SIN preschool s 2 DE
## 5 R5 002SIN preschool s 2 DE
## 6 R6 002SIN preschool s 2 DE
## 7 R7 002SIN preschool s 2 DE
## 8 R8 002SIN preschool s 2 DE
## 9 R9 002SIN preschool s 2 DE
## 10 R10 002SIN preschool w 1 DE
## age_months family
## 1 64 bil-rus
## 2 64 bil-rus
## 3 64 bil-rus
## 4 64 bil-rus
## 5 64 bil-rus
## 6 64 bil-rus
## 7 64 bil-rus
## 8 64 bil-rus
## 9 64 bil-rus
## 10 64 bil-rus
Checking the structure of tabulated data.
str(csvdat)
## 'data.frame': 777 obs. of 8 variables:
## $ ID_rater : chr "R1" "R2" "R3" "R4" ...
## $ ID_child : chr "002SIN" "002SIN" "002SIN" "002SIN" ...
## $ age_group : chr "preschool" "preschool" "preschool" "preschool" ...
## $ accent_response: chr "s" "s" "s" "s" ...
## $ accent_numeric : int 2 2 2 2 2 2 2 2 2 1 ...
## $ rating_language: chr "DE" "DE" "DE" "DE" ...
## $ age_months : int 64 64 64 64 64 64 64 64 64 64 ...
## $ family : chr "bil-rus" "bil-rus" "bil-rus" "bil-rus" ...
We can also extract a summary of the data using the summary
function.
summary(csvdat)
## ID_rater ID_child age_group accent_response
## Length:777 Length:777 Length:777 Length:777
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## accent_numeric rating_language age_months family
## Min. :0.0000 Length:777 Min. : 48.00 Length:777
## 1st Qu.:0.0000 Class :character 1st Qu.: 61.00 Class :character
## Median :1.0000 Mode :character Median : 79.00 Mode :character
## Mean :0.8132 Mean : 80.19
## 3rd Qu.:1.0000 3rd Qu.: 97.00
## Max. :2.0000 Max. :116.00
## NA's :22
It can make sense to convert the character variables into factors for a summary which we can do with the mutate_if
function.
%>%
csvdat ::mutate_if(is.character, factor) %>%
dplyrsummary()
## ID_rater ID_child age_group accent_response accent_numeric
## R1 : 37 002SIN : 21 preschool:378 no :309 Min. :0.0000
## R10 : 37 003SIN : 21 school :378 s :168 1st Qu.:0.0000
## R11 : 37 005SIN : 21 NA's : 21 w :278 Median :1.0000
## R12 : 37 007SIN : 21 NA's: 22 Mean :0.8132
## R13 : 37 008SIN : 21 3rd Qu.:1.0000
## R14 : 37 010SIN : 21 Max. :2.0000
## (Other):555 (Other):651 NA's :22
## rating_language age_months family
## DE:777 Min. : 48.00 bil-mixed:105
## 1st Qu.: 61.00 bil-rus :399
## Median : 79.00 mono-de :273
## Mean : 80.19
## 3rd Qu.: 97.00
## Max. :116.00
##
The following section shows how to access and manipulate tables.
2.5 Processing tabular data
If you have already loaded data into R and now want to process the data, you typically have to modify the data in some form or another to get the information or format you need. We now load and inspect the txtdat
data set using the read.delim
function. The new data is stored as a txt file and has 100 observations (rows) and 3 variables (status, attraction, and money). The data set represents how much money people have spend in someone they were interested in or not (attraction: Interested versus NotInterested) and their own relationship status (status: Single versus Relationship).
2.6 Piping
Piping, done with the sequence %>%
, is a very easy, intuitive, quick, and handy way to process data (we have used it above when we converted the character variables into factors). Essentially piping means that we take an element that is to the left of the piping symbol and then do something to it; that way, the piping symbol can be translated as and then.
We could, for example, load data and then capitalize the column names and then group the data by status and attraction and then get the mean of money spend on deleting all observations except for the first one. A more formal way to write this would be:
load %>% capitalize %>% group %>% summarize.
In R this command would look like this:
# example of a data processing pipeline
<- read.delim(here::here("data", "txtdat.txt"), sep = "\t", header = TRUE) %>%
pipeddata ::rename(Status = status, Attraction = attraction, Money = money) %>%
dplyr::group_by(Status, Attraction) %>%
dplyr::summarise(Mean = mean(Money))
dplyr# inspect summarized data
pipeddata
## # A tibble: 4 × 3
## # Groups: Status [2]
## Status Attraction Mean
## <chr> <chr> <dbl>
## 1 Relationship Interested 99.2
## 2 Relationship NotInterested 51.5
## 3 Single Interested 157.
## 4 Single NotInterested 46.0
The pipe has worked and we get the resulting summary which shows the mean of the money spend based on Attraction and Status.
2.7 Selecting and filtering
Among the most frequent procedures in data processing is selecting certain columns or extracting rows based on variable levels. In the tidyverse, this is done by using the select
and filter
functions. While select
allows to extract columns, filter
is used to extract rows, e.g. to get only observations that have a certain feature. Have a look at the example below.
# select and filter
<- txtdat %>%
reduceddata # select the columns attraction and money
::select(attraction, money) %>%
dplyr# extract rows which represent cases where the person was interested in someone
::filter(attraction == "Interested")
dplyr# inspect new table
nrow(reduceddata); table(reduceddata$attraction)
## [1] 50
##
## Interested
## 50
We have now reduced the data by excluding status (we have only selected attraction and money) and we have removed those 50 data rows of people who were not interested. The select
function (like most other tidyverse functions) can also be used together with a minus sign which causes a column to be removed, thus dplyr::select(-money)
would remove the money column (see below).
# select and filter
<- txtdat %>%
datawithoutmoney # remove money
::select(-money)
dplyr# inspect data
head(datawithoutmoney)
## status attraction
## 1 Relationship NotInterested
## 2 Relationship NotInterested
## 3 Relationship NotInterested
## 4 Relationship NotInterested
## 5 Relationship NotInterested
## 6 Relationship NotInterested
Selecting and filtering are extremely powerful functions that can also be combined with other functions. But before we discuss more complex issues, we will have a look at how we can change columns.
2.8 Changing data and adding columns
Changing and adding data is done with the mutate
function. The mutate
functions requires that we specify a column name - if we use the same name as the column we are changing, then we change the column but if we specify another column name, then a new column is created.
We will now create a new column (Spendalot) in which we encode if the person has spend a lot of money (100 AUD or more) on the present or not (less than 100 AUD).
# creating a new column
<- txtdat %>%
newdata ::mutate(Spendalot = ifelse(money >= 100, "Alot", "Alittle"))
dplyr# inspect data
head(newdata)
## status attraction money Spendalot
## 1 Relationship NotInterested 86.33 Alittle
## 2 Relationship NotInterested 45.58 Alittle
## 3 Relationship NotInterested 68.43 Alittle
## 4 Relationship NotInterested 52.93 Alittle
## 5 Relationship NotInterested 61.86 Alittle
## 6 Relationship NotInterested 48.47 Alittle
The table now has a new column (Spendalot) because we have specified a column name that did not exist yet - had we written dplyr::mutate(money = ifelse(money >= 100, "Alot", "Alittle"))
then we would have changed the money column and replaced the money values with the labels Alot and Alittle.
You can also use the traditional way of doing this shown below if you are more comfortable with that.
<- txtdat
osdat # creating a new column
$Spendalot_oldschool <- ifelse(osdat$money >= 100, "Alot", "Alittle")
osdat# inspect data
head(osdat)
## status attraction money Spendalot_oldschool
## 1 Relationship NotInterested 86.33 Alittle
## 2 Relationship NotInterested 45.58 Alittle
## 3 Relationship NotInterested 68.43 Alittle
## 4 Relationship NotInterested 52.93 Alittle
## 5 Relationship NotInterested 61.86 Alittle
## 6 Relationship NotInterested 48.47 Alittle
2.9 Renaming columns
Oftentimes, column names are not really meaningful or incoherent which makes it easier to wrap your head around what the values in a column refer to. The easiest way around this is rename columns which is, fortunately very simple in the tidyverse. While the column names of our example table are meaningful, I want to capitalize the first letter of each column name. This can be done as follows.
# renaming columns
<- txtdat %>%
txtdat ::rename(Status = status,
dplyrAttraction = attraction,
Money = colnames(.)[3])
# inspect data
head(txtdat)
## Status Attraction Money
## 1 Relationship NotInterested 86.33
## 2 Relationship NotInterested 45.58
## 3 Relationship NotInterested 68.43
## 4 Relationship NotInterested 52.93
## 5 Relationship NotInterested 61.86
## 6 Relationship NotInterested 48.47
The renaming was successful as all column names now begin with a capital letter.
2.10 Grouping and summarising
In contrast to mutate, which retains the number of rows, summarizing creates new columns but collapses rows and only provides the summary value (or values if more than one summary is specified). Also, columns that are not grouping variables are removed.
Summarizing is particularly useful when we want to get summaries of groups. We will modify the example from above and extract the mean and the standard deviation of the money spend on presents by relationship status and whether the giver was attracted to the giv-ee.
#grouping and summarizing data
<- txtdat %>%
datasummary ::group_by(Status, Attraction) %>%
dplyr::summarise(Mean = round(mean(Money), 2), SD = round(sd(Money), 1))
dplyr# inspect summarized data
datasummary
## # A tibble: 4 × 4
## # Groups: Status [2]
## Status Attraction Mean SD
## <chr> <chr> <dbl> <dbl>
## 1 Relationship Interested 99.2 14.7
## 2 Relationship NotInterested 51.5 17
## 3 Single Interested 157. 23.2
## 4 Single NotInterested 46.0 19.9
2.11 Gathering and Spreading
One very common problem is that data - or at least parts of it - have to be transformed from long to wide format or vice versa. In the tidyverse, this is done using the gather
and spread
function. We will convert the summary table shown above into a wide format (we also remove the SD column as it is no longer needed)
# converting data to wide format
<- datasummary %>%
widedata # remove SD column
::select(-SD) %>%
dplyr# convert into wide format
::spread(Attraction, Mean)
tidyr# inspect wide data
widedata
## # A tibble: 2 × 3
## # Groups: Status [2]
## Status Interested NotInterested
## <chr> <dbl> <dbl>
## 1 Relationship 99.2 51.5
## 2 Single 157. 46.0
We can re-convert the wide into a long format using the gather
function.
# converting data to long format
<- widedata %>%
longdata # convert into long format
::gather(Attraction, Money, Interested:NotInterested)
tidyr# inspect wide data
longdata
## # A tibble: 4 × 3
## # Groups: Status [2]
## Status Attraction Money
## <chr> <chr> <dbl>
## 1 Relationship Interested 99.2
## 2 Single Interested 157.
## 3 Relationship NotInterested 51.5
## 4 Single NotInterested 46.0
2.12 Saving tables on your computer
There are different ways to save your data depending on the format in which you want to save your data.
Save data as csv
# save as csv
write.csv(txtdat, here::here("data", "csvdat_save.csv"))
Save data as txt
# save as txt
write.table(txtdat, here::here("data", "txtdat_save.txt"), sep = "\t")
Save data as rda
# save as rda
::saveRDS(txtdat, file = here::here("data", "txtdat_save.rda")) base
2.13 Outro
sessionInfo()
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Australia.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] here_1.0.1 readxl_1.4.0 tidyr_1.2.0 dplyr_1.0.9
##
## loaded via a namespace (and not attached):
## [1] cellranger_1.1.0 pillar_1.7.0 bslib_0.3.1 compiler_4.2.0
## [5] jquerylib_0.1.4 highr_0.9 tools_4.2.0 digest_0.6.29
## [9] jsonlite_1.8.0 evaluate_0.15 lifecycle_1.0.1 tibble_3.1.7
## [13] pkgconfig_2.0.3 rlang_1.0.2 cli_3.3.0 DBI_1.1.2
## [17] rstudioapi_0.13 yaml_2.3.5 xfun_0.31 fastmap_1.1.0
## [21] stringr_1.4.0 knitr_1.39 generics_0.1.2 vctrs_0.4.1
## [25] sass_0.4.1 rprojroot_2.0.3 tidyselect_1.1.2 glue_1.6.2
## [29] R6_2.5.1 fansi_1.0.3 rmarkdown_2.14 bookdown_0.27
## [33] purrr_0.3.4 magrittr_2.0.3 htmltools_0.5.2 ellipsis_0.3.2
## [37] assertthat_0.2.1 utf8_1.2.2 stringi_1.7.6 crayon_1.5.1