DATA 101: Making Prediction with Data
University of British Columbia Okanagan
R includes at least three graphical systems: base graphics, lattice, and ggplot2.
Today we’ll take a look at ggplot2.
Data visualization is the graphical representation of data to uncover insights, patterns, and trends.
Visualizations can take various forms, including charts, graphs, maps, and diagrams.
The primary goal is to communicate data clearly, aiding decision-making and storytelling.
To facilitate that, you want to reduce visual noise as much as possible.
John Snow
Colours: use sparingly; too many can be distracting, create false patterns, and detract from the message.
Overplotting: when multiple data points overlap to the extent that individual points cannot be distinguished.
Size: Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small.
Axis manipulation: don’t adjust the axes to zoom in on small differences.
Do you think it is fair to say that Germans are more motivated and work more hours than do workers in other EU nations?
How about now?
ggplot2 is a core tidyverse package, written by Hadley Wickham and others (available on CRAN).
ggplot2 implements the Grammar of Graphics and enables us to concisely describe the components of a graphic.
ggplot2 handles much of the formatting automatically, while still providing the buildable and customizable features described for base graphics.
There is a lot to unpack with this graphic method and it may be helpful to keep a cheatsheet nearby.
We can think of the base plotting model as a blank canvas on which we can draw but not erase.
We may start with a plot (boxplot, stripchart, histogram, etc.).
Upon viewing, we might decide we want to superimpose a line (e.g. abline(), lines()), points (e.g. points()), or text (e.g. text(), axis(), title()).
In this way, we have a series of R commands which “build-up” our graphic until we are satisfied with it.
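As a minimal sketch of this build-up style, assuming the built-in mtcars data used later in these slides (the choice of variables, colours, and annotation position is illustrative only):

```r
# Start with a basic scatter plot of the built-in mtcars data
plot(mtcars$disp, mtcars$mpg,
     xlab = "Displacement (cu.in.)", ylab = "Miles/(US) gallon")

# Superimpose a least-squares line on the existing plot
fit <- lm(mpg ~ disp, data = mtcars)
abline(fit, lty = 2)

# Redraw the 4-cylinder cars as filled blue points on top
four_cyl <- mtcars[mtcars$cyl == 4, ]
points(four_cyl$disp, four_cyl$mpg, pch = 19, col = "blue")

# Add annotation text and a title
text(400, 30, "blue = 4 cylinders")
title(main = "mpg vs. disp, built up step by step")
```

Each call draws on top of what is already there; nothing drawn earlier can be removed without starting the plot again.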
Consistency and Clarity: produces clearer and more organized code, making it easier to understand and reproduce your plots.
Layered Approach: uses a layered approach where you add different components (geometric objects, statistical transformations, facets) to build a plot step by step.
Data-Driven Aesthetics: You can map data variables to aesthetics like color, size, and shape, creating dynamic visualizations where the plot adapts to changes in your data.
Faceting: provides built-in support for faceting, allowing you to split data into multiple subplots based on one or more categorical variables.
Community and Ecosystem: a large and active user community, which means that you can find ample resources, tutorials, and support.
Reproducibility: The structured nature of ggplot2 code and the fact that it’s based on R means that your plots can be part of reproducible workflows.
qplot() is a shortcut designed to be familiar if you're used to base plot().
A plot can be produced with qplot() using the following code (output on next slide).
As you may notice, plots produced in ggplot2 have a very distinct look from the ones made in base.
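A minimal sketch of such a call, assuming the built-in mtcars data used later in these slides. Note that qplot() has been deprecated since ggplot2 3.4.0, so it may emit a warning, but it is still available:

```r
library(ggplot2)

# qplot(x, y, data = ...) mirrors the base call plot(x, y)
p <- qplot(disp, mpg, data = mtcars)

# Printing the object draws the plot
print(p)
```

The grey background and white grid lines are the default ggplot2 theme, which is what gives these plots their distinct look.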
qplot() can create multi-panel plots using facets. We can specify our desired groups using a formula of the form y~x:
y~. creates one row of panels for each unique level of y (i.e. the row faceting variable), stacked in a single column.
.~x creates one column of panels for each unique level of x (i.e. the column faceting variable), laid out in a single row.
y~x forms a matrix of plots whose rows and columns represent the combinations of the levels of y and x.
Facets are a powerful feature in data visualization that allow you to split a single plot into multiple smaller subplots based on one or more categorical variables.
Faceting enables you to compare and contrast different subsets of your data within the same visualization.
Faceting in qplot() accepts a formula.
It uses facet_wrap() or facet_grid() depending on whether the formula is one- or two-sided.
Let’s plot side-by-side scatter plots of mpg (Miles/(US) gallon) vs. disp (Displacement (cu.in.)) for each cyl (Number of cylinders: 4, 6, 8) in the mtcars data set.
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
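One way to produce the side-by-side panels is sketched below. The facets argument of qplot() takes the formula described earlier; the object names p_cols and p_rows are illustrative, and qplot() may emit a deprecation warning in ggplot2 3.4.0 and later:

```r
library(ggplot2)

# cyl on the right-hand side: one column of panels per cylinder count,
# so the three scatter plots sit side by side
p_cols <- qplot(disp, mpg, data = mtcars, facets = . ~ cyl)

# cyl on the left-hand side: one row of panels per cylinder count,
# so the three scatter plots are stacked vertically
p_rows <- qplot(disp, mpg, data = mtcars, facets = cyl ~ .)

print(p_cols)
```

Both versions share axis scales across panels by default, which makes the groups directly comparable.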
Rather than representing this information in distinct side-by-side plots, I may want to create a single plot and distinguish the groups of 4, 6, and 8 cylinder cars using colours or shapes.
In ggplot2 terms, we could change the colour aesthetic (later referred to as aes) according to the factor cyl.
ggplot2 will automatically pick the colours and display the legend.
A scatter plot of mpg vs. disp (displacement) where points are coloured according to cyl type.
We could have distinguished our cylinders by shapes instead.
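A minimal sketch of both variants, assuming cyl is converted with factor() inside the call so that it is treated as a discrete grouping variable (the object names are illustrative; qplot() may emit a deprecation warning in recent ggplot2 versions):

```r
library(ggplot2)

# Colour the points by number of cylinders; a legend is added automatically
p_colour <- qplot(disp, mpg, data = mtcars, colour = factor(cyl))

# Alternatively, distinguish the groups by point shape
p_shape <- qplot(disp, mpg, data = mtcars, shape = factor(cyl))

print(p_colour)
```

Without factor(), cyl would be treated as a continuous number: colour would fall back to a gradient scale, and shape would fail because a continuous variable cannot be mapped to shape.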
Note: to control the size and aspect ratio of plots, use the fig.width and fig.height chunk options. The out.width chunk option controls the display width of the generated plot when it is embedded in the final document.
What might you expect the following code to produce?
Compare with row-wise histogram
Convert the cyl variable to a factor data type.
The transmission variable (am) has levels 0 and 1; we should rename them "automatic" and "manual".
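These two clean-up steps could be sketched as follows. The copy named cars is a hypothetical name chosen here so that the built-in mtcars stays unmodified; the coding 0 = automatic, 1 = manual comes from the mtcars documentation:

```r
# Work on a copy so the built-in data set is left untouched
cars <- mtcars

# Treat the number of cylinders as a categorical variable
cars$cyl <- factor(cars$cyl)

# Recode transmission: 0 = automatic, 1 = manual
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

str(cars[, c("cyl", "am")])
```

With the variables stored as factors, ggplot2 treats them as discrete and uses the new labels in legends and facet strips.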
ggplot()
The ggplot() function is the workhorse function in ggplot2 and will be able to do a lot of things that qplot() can’t.
ggplot() is based on the Grammar of Graphics.
Rather than specifying components as arguments in a function, we will add them (literally by using +) to a ggplot object layer by layer.
The Grammar of Graphics is a foundational framework for creating data visualizations.
It provides a structured approach to visualizing data, emphasizing the importance of consistency and repeatability in creating graphics.
The concept was introduced by Leland Wilkinson and is implemented in various data visualization libraries, including ggplot2 in R.
Required:
Optional:
Key Components:
data: The tidy dataset containing the variables you want to visualize.
geoms: geometric objects (e.g. geom_point, geom_bar).
facets: facet_wrap() or facet_grid()
aes: aesthetic attributes, i.e. how data are mapped (e.g. colour, shape, size, more…)
geoms: geometric objects (e.g. points, lines, bars, more…)
facets: for forming multi-panel plots; see faceting
stats: for statistical transformation (e.g. smoothing)
co-ordinate system: (e.g. \(x\) and \(y\) axis)
Create a ggplot object
Identify your data and basic aesthetics (identify \(x\) and \(y\) variables, for example).
Save this to an R object (which will be of class ggplot); standard convention is to call this object g.
Let’s save this to an object:
At minimum a ggplot requires: data, a geom function, and aes mapping.
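A minimal sketch of this first step, assuming the mtcars variables disp and mpg from the earlier examples:

```r
library(ggplot2)

# Identify the data and the basic aesthetics (x and y variables)
g <- ggplot(data = mtcars, mapping = aes(x = disp, y = mpg))

# The object carries the data and mapping but has no layers yet
class(g)

# Printing it draws only an empty panel with axes: no geom has been added
print(g)
```

Because nothing is drawn until a geom is added, g acts as a reusable base to which different layers can be attached.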
Now we will need to add geometric markings on this plot using some <GEOM_FUNCTION>. Examples include:
geom_point() creates geometric points
geom_bar() creates barplots
geom_boxplot() creates a boxplot
geom_histogram() creates a histogram
geom_density() creates a smoothed density estimate
To add the geometric object layers we could write:
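For instance, a sketch that adds a point layer to a base object g (defined here again so the example is self-contained; the variables are assumed from the earlier mtcars examples):

```r
library(ggplot2)

# Base object: data plus x/y mapping, no layers yet
g <- ggplot(mtcars, aes(x = disp, y = mpg))

# Add a layer of points with +
p <- g + geom_point()

print(p)
```

The + operator returns a new plot object, so g itself is unchanged and can be combined with a different geom later.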
We can keep adding on layers, e.g. let’s add a smoothed line or curve to a scatter plot, helping to visualize trends or relationships in the data.
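A sketch of stacking a second layer, assuming the same mtcars scatter plot. With no arguments, geom_smooth() chooses a smoothing method itself (loess for a data set this small) and prints a message saying which one it used:

```r
library(ggplot2)

# Scatter plot plus a smoothed trend curve with its confidence band
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_smooth()

print(p)
```

Layers are drawn in the order they are added, so here the curve is drawn on top of the points.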
We can change the theme from gray to black and white.
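A sketch of the theme change, assuming the same mtcars scatter plot; theme_bw() is one of the complete themes shipped with ggplot2:

```r
library(ggplot2)

# theme_bw() replaces the default grey background with a white panel
# and a dark border
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  theme_bw()

print(p)
```

The theme only affects non-data elements such as background, grid lines, and fonts; the points themselves are unchanged.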
To create panels we need the faceting functions: facet_wrap() or facet_grid().
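A sketch of both faceting functions on the mtcars scatter plot. The choice of cyl and am as faceting variables is an assumption made for illustration:

```r
library(ggplot2)

g <- ggplot(mtcars, aes(x = disp, y = mpg)) + geom_point()

# facet_wrap(): one panel per level of cyl, wrapped into rows as needed
p_wrap <- g + facet_wrap(~ cyl)

# facet_grid(): rows defined by am, columns defined by cyl
p_grid <- g + facet_grid(am ~ cyl)

print(p_grid)
```

facet_wrap() suits a single variable with many levels, while facet_grid() lays out every row and column combination, including empty ones.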
We can override the default axis labels or legend keys using the following helper functions:
xlab(), ylab(), ggtitle() to modify axis, legend, and plot labels
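A sketch of these helpers in use; the label text is made up for illustration. The legend title is set with labs(), which is an addition not named in the list above:

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  xlab("Displacement (cu.in.)") +
  ylab("Miles/(US) gallon") +
  ggtitle("Fuel economy vs. displacement") +
  labs(colour = "Cylinders")  # legend title for the colour aesthetic

print(p)
```

Each helper is just another component added with +, so labels can be adjusted without touching the data or geom layers.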
We can adjust geom objects using arguments to the geom function, e.g. alpha, which controls the transparency.
When you specify color within a geom_*() function, you are setting a static, constant color for the entire layer.
When you specify color within the aes() function, you are mapping a variable to color, which makes the color a function of the data.
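The difference between setting and mapping can be sketched as follows, assuming the mtcars scatter plot (the object names p_set and p_map are illustrative):

```r
library(ggplot2)

# Setting: colour is a constant outside aes(), so every point is blue
# and no legend is drawn
p_set <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(colour = "blue")

# Mapping: colour is inside aes(), so it depends on the data
# and a legend is drawn
p_map <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

print(p_map)
```

A common slip is writing aes(colour = "blue"), which maps the constant string as if it were data and does not produce blue points.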
In ggplot2, you can specify the aesthetics (aes) at various levels of your plot creation, including within individual geom_*() functions (layers) …
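Where the mapping is placed changes which layers see it. A sketch, assuming a linear trend line is added with geom_smooth(method = "lm") purely to make the difference visible:

```r
library(ggplot2)

# Global mapping: colour is inherited by every layer,
# so we get one trend line per cylinder group
p_global <- ggplot(mtcars, aes(x = disp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Layer-level mapping: only the points are coloured,
# so a single trend line is fitted to all cars
p_layer <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(aes(colour = factor(cyl))) +
  geom_smooth(method = "lm", se = FALSE)

print(p_global)
```

Aesthetics given in ggplot() act as defaults for all layers; aesthetics given in a geom apply to that layer only.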
When our variable is continuous, ggplot2 uses a gradient color scale instead.
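A sketch of the continuous case, assuming horsepower (hp) as the mapped variable:

```r
library(ggplot2)

# hp is numeric, so colour is drawn from a continuous gradient
# and the legend becomes a colour bar
p <- ggplot(mtcars, aes(x = disp, y = mpg, colour = hp)) +
  geom_point()

print(p)
```

Wrapping the variable in factor() would switch back to a discrete palette with one legend key per distinct value.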
We change the transparency of points using alpha (0 = see-through, 1 = opaque). Notice how coincident (i.e. overlapping) points are more obvious when the points are more transparent.
size is used to make the points (5x) bigger.
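A sketch of both arguments together. The specific values alpha = 0.3 and size = 5 are assumptions chosen for illustration:

```r
library(ggplot2)

# alpha and size are set as constants outside aes(),
# so they apply to every point in the layer
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(alpha = 0.3, size = 5)

print(p)
```

Where several semi-transparent points overlap, the region appears darker, which helps reveal overplotting.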
Notice how labels were added on a separate line of code.
Comments
Notice that when we just specify one variable, qplot() plots a histogram rather than a scatter plot.
When cyl appears on the left-hand side of the tilde the subplots are plotted row-wise (i.e. along the \(y\)-axis).
When cyl appears on the right-hand side of the tilde the subplots are plotted column-wise (i.e. along the \(x\)-axis).
We can facet by two discrete variables using the y~x formula.