For all examples the movies data set contained in the package will be used.
library(UpSetR)
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), header = TRUE, sep = ";")
The set.metadata parameter is broken up into 3 fields: data, ncols, and plots.
data: takes a data frame where the first column is the set names, and the following columns are attributes of the sets.
plots: is a list of lists, where each inner list holds the parameters used to generate one plot. These parameters include column, type, assign, and colors.
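The nesting described above can be sketched as follows. This is only an illustration of the list structure: the toy data frame and its values are made up, and only the data and plots fields and the type, column, and assign parameters named in the text are used.

```r
# Toy set metadata: the first column holds the set names, the remaining
# columns are attributes. The values are invented for illustration.
toy_metadata <- data.frame(
  sets = c("Action", "Comedy", "Drama"),
  score = c(73, 34, 36),
  stringsAsFactors = FALSE
)

# set.metadata is a list with a data field and a plots field;
# plots is itself a list holding one parameter list per metadata plot.
toy_set_metadata <- list(
  data = toy_metadata,
  plots = list(
    list(type = "hist", column = "score", assign = 20)
  )
)

# Inspect the top two levels of the structure.
str(toy_set_metadata, max.level = 2)
```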
column: is the column of the data frame that should be used for the specified plot.
type: is the type of plot used to display the data from the specified column. If the data in the column is numeric, the plot type can be either a bar plot ("hist") or a heat map ("heat"). If the data in the column is boolean, the plot type can be a "bool" heat map. If the data in the column is categorical (character), the plot type can be either a heat map ("heat") or text ("text"). Additionally, if the data in the column is ordinal (factor), the plot type can be either a heat map or text. There is also a type called "matrix_rows", which applies colors to the matrix background using categorical data. This type is useful for identifying characteristics of sets using the matrix.
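The mapping from column data type to plot type described above can be summarized in a small lookup function. The helper candidate_plot_types() below is hypothetical and not part of UpSetR; it only restates the rules from the text, with the added assumption that a numeric column containing nothing but 0 and 1 is also a candidate for "bool".

```r
# Hypothetical helper (not part of UpSetR): list the plot types that the
# text describes for a given metadata column.
candidate_plot_types <- function(x) {
  if (is.logical(x)) {
    return("bool")
  }
  if (is.numeric(x)) {
    types <- c("hist", "heat")
    # Assumption: numeric columns coded only as 0 and 1 can also use "bool".
    if (all(x %in% c(0, 1))) types <- c(types, "bool")
    return(types)
  }
  if (is.character(x)) {
    # Categorical data; "matrix_rows" colors the matrix background.
    return(c("heat", "text", "matrix_rows"))
  }
  if (is.factor(x)) {
    # Ordinal data.
    return(c("heat", "text"))
  }
  character(0)
}

candidate_plot_types(c(73, 34, 36))
candidate_plot_types(c(1, 0, 1))
candidate_plot_types(c("LA", "NYC", "Boston"))
```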
assign: is the number of columns that should be assigned to the specific plot. For instance, if you’re plotting 2 set metadata plots, you may choose one plot to take up 20 columns and the other plot 10 columns. Since the UpSet plot is typically plotted on a 100 by 100 grid, the grid will now be 100 by 130, where roughly 1/4 of the plot (30 of 130 columns) is assigned to the metadata plots.
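The arithmetic behind that example can be checked directly. This sketch assumes the 100-column base grid mentioned above; the variable names are illustrative only.

```r
base_cols <- 100          # columns of the typical UpSet grid
assign_cols <- c(20, 10)  # columns assigned to the two metadata plots

# Total grid width once the metadata plots are added.
total_cols <- base_cols + sum(assign_cols)
total_cols                # 130

# Share of the full width taken up by the metadata plots.
metadata_share <- sum(assign_cols) / total_cols
round(metadata_share, 2)  # 0.23
```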
colors: is used to specify the colors used in the metadata plots. If the plot type is a bar plot, the parameter takes only one color for the whole plot. If the plot type is "heat" or "bool", a vector of colors can be provided where there is one color for each unique category (character). However, if the data type is ordinal (factor), there is no colors input, and the heat map works on a color gradient rather than applying different colors to each level. Lastly, if the plot type is "text", a vector of colors can be provided where there is one color for each unique string. If no colors are provided, a color palette will be provided for you.
In this example, the average Rotten Tomatoes movie ratings for each set will be used as the set metadata. This may help us draw more conclusions from the visualization by knowing how professional movie reviewers typically rate movies in these categories.
sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min=0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")
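Because runif() draws random numbers, the values you get will differ from the ones printed later in this document. If reproducible values matter, one option is to fix the random seed before drawing. A small sketch, where the seed value 42 is an arbitrary choice:

```r
set.seed(42)  # arbitrary seed, chosen only for this illustration
first_draw <- round(runif(17, min = 0, max = 90))

set.seed(42)  # resetting the seed reproduces the same sequence
second_draw <- round(runif(17, min = 0, max = 90))

identical(first_draw, second_draw)  # TRUE
```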
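When generating a bar plot using set metadata information, it is important to make sure the specified column is numeric.
is.numeric(metadata$avgRottenTomatoesScore)
## [1] FALSE
The column is not numeric! Because cbind() combined the scores with the set names into a character matrix, the column is now a factor (or a character vector in R 4.0.0 and later), so we must coerce it to characters and then to numbers.
metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))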
upset(movies, set.metadata = list(data = metadata, plots = list(list(type="hist", column="avgRottenTomatoesScore", assign=20))))
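The reason for converting to characters before converting to numbers is that calling as.numeric() directly on a factor returns the internal level codes rather than the values. The sketch below shows the difference on a toy factor, along with one alternative, building the data frame with data.frame() so the column stays numeric; the toy values are made up.

```r
scores <- factor(c("73", "34", "36"))

# Direct conversion returns the internal level codes, not the scores.
as.numeric(scores)                 # 3 1 2

# Going through character first recovers the original values.
as.numeric(as.character(scores))   # 73 34 36

# Building the data frame with data.frame() instead of cbind() keeps the
# numeric column numeric from the start.
toy <- data.frame(sets = c("Action", "Comedy", "Drama"),
                  avgRottenTomatoesScore = c(73, 34, 36),
                  stringsAsFactors = FALSE)
is.numeric(toy$avgRottenTomatoesScore)   # TRUE
```

Converting through as.character() is harmless when the column is already a character vector, so the same line works in either case.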
In this example we will make our own data on which major cities these genres were most popular in. Since this is categorical and not ordinal, we must remember to change the column to characters (it may be a factor again). To assign specific colors to each category, you can specify the name of each category in the color vector, as shown below. If you don’t care what color is assigned to each category, you don’t have to specify the category names in the color vector; R will just apply the colors to each category in the order they occur. Additionally, if you don’t supply anything for the colors parameter, a default color palette will be provided for you.
Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = TRUE)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)
metadata[which(metadata$sets %in% c("Drama", "Comedy", "Action", "Thriller", "Romance")), ]
##        sets avgRottenTomatoesScore Cities
## 1    Action                     73     LA
## 4    Comedy                     34     LA
## 7     Drama                     36    NYC
## 13  Romance                     26     LA
## 15 Thriller                     11 Boston
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "heat", column = "Cities", assign = 10, colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple")))))
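When using a named color vector, it can be worth checking that every category in the column has a matching name. The sketch below uses plain R on a toy vector of cities; it illustrates how named vectors are looked up in R and is not a description of how UpSetR handles colors internally.

```r
cities <- c("LA", "LA", "NYC", "LA", "Boston")
city_colors <- c("Boston" = "green", "NYC" = "navy", "LA" = "purple")

# Categories that have no entry in the named color vector.
uncolored <- setdiff(unique(cities), names(city_colors))
uncolored                       # character(0)

# Looking colors up by name does not depend on the order of appearance.
unname(city_colors[cities])     # "purple" "purple" "navy" "purple" "green"
```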
Now let’s also use our numeric critic values!
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "heat", column = "Cities", assign = 10, colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple")), list(type = "heat", column = "avgRottenTomatoesScore", assign = 10))))
As a side note, the way the numerical heat map is handled is similar to how the ordinal heat maps are handled.
Now suppose we have metadata that tells us whether or not these genres are well accepted overseas. This could be used as a categorical column where there are only two categories, but for this example we will assume that your data is coded in 1’s and 0’s. It is important to keep in mind that if you run a “heat” with 0’s and 1’s instead of a “bool” the binary data will be treated as numerical values, and a color gradient will be used to show the relative differences.
accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)
metadata[which(metadata$sets %in% c("Drama", "Comedy", "Action", "Thriller", "Romance")), ]
##        sets avgRottenTomatoesScore Cities accepted
## 1    Action                     73     LA        1
## 4    Comedy                     34     LA        1
## 7     Drama                     36    NYC        1
## 13  Romance                     26     LA        0
## 15 Thriller                     11 Boston        1
upset(movies, set.metadata = list(data = metadata, plots = list(list(type="bool", column= "accepted", assign = 5, colors = c("#FF3333", "#006400")))))
Let’s see what happens when we choose a “heat” instead of a “bool” for our binary data column.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type="heat", column= "accepted", assign = 5, colors = c("red", "green")))))
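As noted earlier, a binary attribute could also be treated as a categorical column with two categories. One way to do that, sketched here with made-up labels, is to recode the 0/1 values as strings, which could then be used with a "heat" or "text" plot. The sketch also shows a quick check that a column really is binary before choosing "bool".

```r
accepted <- c(1, 1, 1, 0, 1)

# The labels "accepted" and "not accepted" are illustrative choices.
accepted_label <- ifelse(accepted == 1, "accepted", "not accepted")
accepted_label

# Check that the column contains only 0 and 1 before using type = "bool".
all(accepted %in% c(0, 1))   # TRUE
```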
Let’s say we prefer to show text instead of a heat map for the cities these genres were most popular in.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "text", column = "Cities", assign = 10, colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple")))))
In some cases we may just want to incorporate categorical set metadata directly into the UpSet plot to easily identify characteristics of the sets via the matrix. To do this we need to specify the type as "matrix_rows", what column we’re using to categorize the sets, and the colors to apply to each category. There is also an option to change the opacity of the matrix background using alpha. To change the opacity of the matrix background without applying set metadata, see the shade.alpha parameter in the upset() function documentation.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type="hist", column="avgRottenTomatoesScore", assign=20),list(type="matrix_rows", column = "Cities", colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple"), alpha = 0.5))))
Now let’s bring all of our metadata information together on one plot!
upset(movies, set.metadata = list(data = metadata, plots = list(list(type="hist", column="avgRottenTomatoesScore", assign=20),list(type="bool", column= "accepted", assign = 5, colors = c("#FF3333", "#006400")), list(type = "text", column = "Cities", assign = 5, colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple")))))
Finally, let’s include functionalities discussed in all of the other UpSetR vignettes! This gives us a very in-depth look at information about our sets, intersections, and specific elements.
upset(movies,
      set.metadata = list(data = metadata, plots = list(
        list(type = "hist", column = "avgRottenTomatoesScore", assign = 20),
        list(type = "bool", column = "accepted", assign = 5, colors = c("#FF3333", "#006400")),
        list(type = "text", column = "Cities", assign = 5,
             colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple")),
        list(type = "matrix_rows", column = "Cities",
             colors = c("Boston" = "green", "NYC" = "navy", "LA" = "purple"), alpha = 0.5))),
      queries = list(
        list(query = intersects, params = list("Drama"), color = "red", active = FALSE),
        list(query = intersects, params = list("Action", "Drama"), active = TRUE),
        list(query = intersects, params = list("Drama", "Comedy", "Action"), color = "orange", active = TRUE)),
      attribute.plots = list(gridrows = 45, plots = list(
        list(plot = scatter_plot, x = "ReleaseDate", y = "AvgRating", queries = TRUE),
        list(plot = scatter_plot, x = "AvgRating", y = "Watches", queries = FALSE)), ncols = 2),
      query.legend = "bottom")