Conditional Inference Trees
Conditional Inference Trees (CITs) are much better at determining the true effect of a predictor, i.e. the effect of a predictor when all other effects are considered simultaneously. In contrast to CARTs, CITs use p-values to determine splits in the data. Below is a conditional inference tree which shows which factors contribute to the use of discourse like, and how they do so. In conditional inference trees, predictors are only included if they are significant (i.e. if these predictors are necessary).
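To make the idea of p-value-based splitting more concrete, here is a minimal sketch of how a split variable could be selected. Each predictor is tested for an association with the dependent variable, the p-values are corrected for multiple testing, and a split is made on the predictor with the smallest p-value only if that value is below the significance level. The data frame toy and all object names are hypothetical. chisq.test is used as a simple stand-in: the ctree function itself relies on a permutation test framework rather than on this exact test.

```r
# hypothetical toy data (NOT the tutorial's data set)
toy <- data.frame(
  Age      = factor(rep(c("15-40", "41-80"), each = 40)),
  Gender   = factor(rep(c("female", "male"), times = 40)),
  LikeUser = factor(c(rep(c("yes", "no"), times = c(30, 10)),
                      rep(c("yes", "no"), times = c(10, 30))))
)

# test each predictor for an association with the dependent variable
# (chi-squared test as a simplified stand-in for ctree's permutation tests)
predictors <- c("Age", "Gender")
p_values <- sapply(predictors, function(v) {
  chisq.test(table(toy[[v]], toy$LikeUser))$p.value
})

# correct for the number of tests and pick the strongest predictor
p_adjusted <- p.adjust(p_values, method = "bonferroni")
best       <- names(which.min(p_adjusted))
split_here <- min(p_adjusted) < 0.05

best        # candidate split variable
split_here  # TRUE only if the candidate is significant
```

In this toy data, Age is strongly associated with LikeUser while Gender is not, so only Age qualifies as a split variable.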
Example 1: discourse like
In a first step, we load the data (in this case, the data is stored online and made available via the LADAL github repo).
citdata <- read.delim("https://slcladal.github.io/data/treedata.txt", header = T, sep = "\t")
# inspect the data (head shows the first 6 lines)
head(citdata)
## Age Gender Status LikeUser
## 1 15-40 female high no
## 2 15-40 female high no
## 3 15-40 male high no
## 4 41-80 female low yes
## 5 41-80 male high no
## 6 41-80 male low no
We can also inspect the structure of the data using the str() function as shown below.
# inspect the data (str shows the structure of the data)
str(citdata)
## 'data.frame': 251 obs. of 4 variables:
## $ Age : chr "15-40" "15-40" "15-40" "41-80" ...
## $ Gender : chr "female" "female" "male" "female" ...
## $ Status : chr "high" "high" "high" "low" ...
## $ LikeUser: chr "no" "no" "no" "yes" ...
We can see that all variables are character variables (indicated by the chr next to the variable name). However, CITs require factors (not character strings), so we need to convert the variables into factors, which we can do using mutate_if(is.character, factor).
# convert character strings to factors
citdata <- citdata %>%
  dplyr::mutate_if(is.character, factor)
# inspect the data (str shows the structure of the data)
str(citdata)
## 'data.frame': 251 obs. of 4 variables:
## $ Age : Factor w/ 2 levels "15-40","41-80": 1 1 1 2 2 2 2 1 2 2 ...
## $ Gender : Factor w/ 2 levels "female","male": 1 1 2 1 2 2 1 2 2 2 ...
## $ Status : Factor w/ 2 levels "high","low": 1 1 1 2 1 2 2 1 2 2 ...
## $ LikeUser: Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 1 ...
Now we can implement the CIT. We start by defining a so-called control, which allows us to account for the fact that we are running several tests in one go. Setting a control in this case means that we lower the significance level of each individual test so that the overall significance level (of all tests combined) is still at \(\alpha\) = .05 (this guards against errors in our analysis called false positives).
# apply Bonferroni correction (1 minus alpha divided by n of predictors;
# ncol(citdata) - 1 because one column is the dependent variable)
control = partykit::ctree_control(mincriterion = 1 - (.05 / (ncol(citdata) - 1)))
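The arithmetic behind a Bonferroni correction can be checked by hand. The sketch below assumes three predictors (Age, Gender, and Status) and an overall significance level of .05; the object names are made up for illustration.

```r
alpha        <- 0.05
n_predictors <- 3                      # Age, Gender, Status

# Bonferroni: each individual test is run at alpha / number of tests
alpha_per_test <- alpha / n_predictors # roughly .0167
mincriterion   <- 1 - alpha_per_test   # roughly .9833

# the combined error bound across all tests stays at alpha
fwer_bound <- n_predictors * alpha_per_test

# equivalently, p-values can be multiplied by the number of tests
# and compared against the uncorrected alpha
p_raw      <- c(0.01, 0.02, 0.04)
p_adjusted <- p.adjust(p_raw, method = "bonferroni")
p_adjusted < alpha                     # only the first test stays significant
```

Both views lead to the same decisions: a raw p-value of .02 is below .05 but not below .0167, so it does not survive the correction.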
After setting the control, we implement the CIT by using the ctree function from the partykit package. This function requires the formula (the dependent variable, followed by ~ and the independent variables) and the specification of the data. Then, we plot the resulting CIT using the plot function.
# create initial conditional inference tree model
citd.ctree <- partykit::ctree(LikeUser ~ Age + Gender + Status,
                              data = citdata)
plot(citd.ctree, gp = grid::gpar(fontsize = 8)) # plot final ctree
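Besides being plotted, a fitted tree can also be queried. The following sketch uses a small hypothetical data set (toy, not the tutorial's data) to show how a control object might be passed to ctree via its control argument, and how the terminal nodes and a prediction for a new observation could be retrieved. It assumes that the partykit package is installed; the object names are illustrative only.

```r
# hypothetical toy data (NOT the tutorial's data set)
toy <- data.frame(
  Age      = factor(rep(c("15-40", "41-80"), each = 40)),
  Gender   = factor(rep(c("female", "male"), times = 40)),
  LikeUser = factor(c(rep(c("yes", "no"), times = c(30, 10)),
                      rep(c("yes", "no"), times = c(10, 30))))
)

if (requireNamespace("partykit", quietly = TRUE)) {
  # stricter criterion for two predictors (assumed setting for this sketch)
  toy_control <- partykit::ctree_control(mincriterion = 1 - (.05 / 2))
  toy_tree <- partykit::ctree(LikeUser ~ Age + Gender,
                              data = toy, control = toy_control)

  # ids of the terminal nodes (the leaves of the tree)
  terminal_ids <- partykit::nodeids(toy_tree, terminal = TRUE)

  # predicted class for a new, hypothetical speaker
  new_speaker <- data.frame(
    Age    = factor("15-40", levels = levels(toy$Age)),
    Gender = factor("female", levels = levels(toy$Gender))
  )
  pred <- predict(toy_tree, newdata = new_speaker, type = "response")
  print(pred)
}
```

Using type = "prob" instead of type = "response" would return the class proportions of the terminal node rather than the most frequent class.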
Example 2: Prepositions
We now proceed with a different example that uses a numeric dependent variable. In this example, we want to see which factors impact the use of prepositions across time in historical texts. The analysis is based on data extracted from the Penn Corpora of Historical English (see http://www.ling.upenn.edu/hist-corpora/), which consist of 603 texts written between 1125 and 1900.
We start again by loading the data.
pvd <- base::readRDS(url("https://slcladal.github.io/data/pvd.rda", "rb")) %>%
  dplyr::mutate_if(is.character, factor) %>%
  # remove columns we do not need
  dplyr::select(-Date, -Genre, -Text)
# inspect
head(pvd); str(pvd)
## Prepositions Region GenreRedux DateRedux
## 1 166.01 North NonFiction 1700-1799
## 2 139.86 North NonFiction 1700-1799
## 3 130.78 North Conversational 1800-1913
## 4 151.29 North NonFiction 1800-1913
## 5 145.72 North NonFiction 1700-1799
## 6 120.77 North NonFiction 1800-1913
## 'data.frame': 537 obs. of 4 variables:
## $ Prepositions: num 166 140 131 151 146 ...
## $ Region : Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...
## $ GenreRedux : Factor w/ 5 levels "Conversational",..: 4 4 1 4 4 4 1 4 4 1 ...
## $ DateRedux : Factor w/ 5 levels "1150-1499","1500-1599",..: 4 4 5 5 4 5 5 5 4 4 ...
Next, we set the control just like we did before.
# apply Bonferroni correction (1 minus alpha divided by n of predictors;
# ncol(pvd) - 1 because one column is the dependent variable)
control = partykit::ctree_control(mincriterion = 1 - (.05 / (ncol(pvd) - 1)))
Now, we implement the CIT and plot the results.
# create initial conditional inference tree model
citd.ctree2 <- partykit::ctree(Prepositions ~ DateRedux + Region + GenreRedux,
                               data = pvd)
plot(citd.ctree2, gp = grid::gpar(fontsize = 8)) # plot final ctree
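With a numeric dependent variable, the logic stays the same, but the association tests compare distributions of values rather than class frequencies, and each terminal node is summarized by the values that fall into it (for example, by their mean). The sketch below illustrates this on hypothetical toy data. kruskal.test is used only as a simple stand-in for the tests that ctree performs internally, and all object names are made up.

```r
# hypothetical toy data (NOT the Penn Corpora data)
toy_prep <- data.frame(
  Prepositions = c(120, 125, 130, 135, 140, 145,
                   150, 155, 160, 165, 170, 175),
  DateRedux    = factor(rep(c("1500-1599", "1800-1913"), each = 6)),
  Region       = factor(rep(c("North", "South"), times = 6))
)

# test each predictor for an association with the numeric response
predictors <- c("DateRedux", "Region")
p_values <- sapply(predictors, function(v) {
  kruskal.test(toy_prep$Prepositions, toy_prep[[v]])$p.value
})
p_adjusted <- p.adjust(p_values, method = "bonferroni")
best <- names(which.min(p_adjusted))

# after splitting on the best predictor, each group is summarized by its mean
node_means <- tapply(toy_prep$Prepositions, toy_prep[[best]], mean)
node_means
```

Here the two periods differ clearly in their preposition frequencies while the regions do not, so a split on DateRedux would be the only candidate.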
Prettifying your CIT tree
The easiest and most common way to visualize CITs is to simply use the plot function from base R. However, this function only allows some very basic parameters of the visualization to be adapted and customized. The ggparty function allows us to use the ggplot syntax to customize CITs, which permits more adjustments and produces more aesthetically pleasing results.
To generate this customized CIT, we activate the ggparty package and extract the significant p-values from the CIT object. We then plot the CIT and define the nodes, edges, and text elements as shown below.
# extract p-values
pvals <- unlist(nodeapply(citd.ctree, ids = nodeids(citd.ctree), function(n) info_node(n)$p.value))
pvals <- pvals[pvals < .05]
# plotting
ggparty(citd.ctree) +
geom_edge() +
geom_edge_label() +
geom_node_label(line_list = list(aes(label = splitvar),
aes(label = paste0("N=", nodesize, ", p",
ifelse(pvals < .001, "<.001", paste0("=", round(pvals, 3)))),
size = 10)),
line_gpar = list(list(size = 13),
list(size = 10)),
ids = "inner") +
geom_node_label(aes(label = paste0("Node ", id, ", N = ", nodesize)),
ids = "terminal", nudge_y = -0.0, nudge_x = 0.01) +
geom_node_plot(gglist = list(
geom_bar(aes(x = "", fill = LikeUser),
position = position_fill(), color = "black"),
theme_minimal(),
scale_fill_manual(values = c("gray50", "gray80"), guide = "none"),
scale_y_continuous(breaks = c(0, 1)),
xlab(""),
ylab("Probability"),
geom_text(aes(x = "", group = LikeUser,
label = after_stat(count)),
stat = "count", position = position_fill(), vjust = 1.1)),
shared_axis_labels = TRUE)
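The label logic used for the inner nodes above can be tried out in isolation. The following sketch applies the same ifelse/paste0 pattern to a hypothetical vector of p-values (pvals_toy is made up for illustration), so that very small values are reported as "<.001" and all others are rounded to three decimals.

```r
# hypothetical p-values (NOT extracted from the tutorial's tree)
pvals_toy <- c(0.00004, 0.0123, 0.0489)

# same pattern as in the node labels: "<.001" for very small values,
# otherwise "=" followed by the value rounded to three decimals
p_labels <- paste0("p",
                   ifelse(pvals_toy < .001,
                          "<.001",
                          paste0("=", round(pvals_toy, 3))))
p_labels
```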
We can also use position_dodge (instead of position_fill) to display frequencies rather than probabilities as shown below.
# plotting
ggparty(citd.ctree) +
geom_edge() +
geom_edge_label() +
geom_node_label(line_list = list(aes(label = splitvar),
aes(label = paste0("N=", nodesize, ", p",
ifelse(pvals < .001, "<.001", paste0("=", round(pvals, 3)))),
size = 10)),
line_gpar = list(list(size = 13),
list(size = 10)),
ids = "inner") +
geom_node_label(aes(label = paste0("Node ", id, ", N = ", nodesize)),
ids = "terminal", nudge_y = 0.01, nudge_x = 0.01) +
geom_node_plot(gglist = list(
geom_bar(aes(x = "", fill = LikeUser),
position = position_dodge(), color = "black"),
theme_minimal(),
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank()),
scale_fill_manual(values = c("gray50", "gray80"), guide = "none"),
scale_y_continuous(breaks = seq(0, 100, 20),
limits = c(0, 100)),
xlab(""),
ylab("Frequency"),
geom_text(aes(x = "", group = LikeUser,
label = after_stat(count)),
stat = "count",
position = position_dodge(0.9), vjust = -0.7)),
shared_axis_labels = TRUE)
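The difference between the two displays comes down to showing proportions within each node versus raw counts. As a rough illustration with hypothetical node memberships (toy_nodes is made up and not taken from the tree above), the two kinds of values can be computed in base R as follows.

```r
# hypothetical terminal-node memberships (NOT taken from the tutorial's tree)
toy_nodes <- data.frame(
  Node     = factor(c(rep("Node 2", 10), rep("Node 3", 20))),
  LikeUser = factor(c(rep("no", 8), rep("yes", 2),
                      rep("no", 5), rep("yes", 15)))
)

# raw frequencies per node (what a dodged bar chart displays)
counts <- table(toy_nodes$Node, toy_nodes$LikeUser)

# proportions within each node (what a filled bar chart displays)
props <- prop.table(counts, margin = 1)

counts
props
```

Proportions make nodes of different sizes comparable, whereas frequencies show how many observations each node actually contains.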
Extensions of Conditional Inference Trees
One extension of conditional inference trees is to use a so-called ensemble method, which grows many varied trees rather than a single tree. The most common ensemble method is the random forest. Unfortunately, we cannot deal with random forests here (but if you are interested, check out this tutorial).