Lab 0

A refresher on R and introduction to Quarto documents

Author

Dr. Irene Vrbik

Introduction

This lab session focused on refreshing your R skills and guiding you through the creation of reproducible documents using Quarto. Reproducibility is a cornerstone of modern data analysis, ensuring that results can be consistently reproduced and verified by others. In this lab, we will revisit key R programming concepts and learn how to leverage Quarto for creating dynamic, shareable reports that integrate code, data, and narrative.

Lab Objectives:

By the end of this lab you will be able to:

  1. Execute fundamental R operations, including data manipulation, visualization, and basic statistical analysis.
  2. Describe what Quarto is and explain its importance in reproducible research.
  3. Create a Quarto document that integrates narrative, code, and results, following a step-by-step guide to achieve a fully functional and reproducible report in HTML format. Specifically, you will:
    • Use basic Markdown for formatting text.
    • Navigate and utilize the visual editor in RStudio.
    • Embed R code into Quarto documents using code chunks.
    • Apply helpful chunk options for both input and output.
    • Explore different output formats including HTML, PDF, and Word.

Resources

Quarto

What

Quarto (https://quarto.org/) an open-source publishing system and the next-generation version of R Markdown. Quarto allows you to create dynamic content using multiple programming languages (including Python, Julia, and Observable) across various formats (such as articles, presentations, dashboards, websites, blogs, and books). In this course, we will show you how to use RStudio with Quarto specifically to create assignment documents that integrate R.

Why

Quarto will serve as a great tool for creating well-structured, professional-looking reports that are easy to read and follow. Furthermore, quarto promotes reproducibility by combining narrative, code, and results in one document, ensuring that analyses are transparent and can be replicated by others; a crucial component in scientific research and data analysis. Quarto documents can be easily shared with others, making it simple to collaborate on group projects or receive feedback from peers or instructors. While we won’t cover this aspect in this course, it is worth noting that Quarto works well with version control systems like Git, allowing students to track changes, collaborate on code, and maintain a history of their work.

How

To get setup follow these instructions:

  1. Make sure you have the latest version of R: https://www.r-project.org/
  2. Download the latest versions of RStudio1: https://posit.co/download/rstudio-desktop/
  3. Download Quarto from: https://quarto.org/docs/get-started/

Now you should have everything you need to get started.

Getting Started

Open R studio and create a new quarto document by navigating to: File > New File > R Script

How to create a new Quarto document. This screenshot was taken in RStudio on a Mac. Depending on your RStudio settings and your computer, your interface might look slightly different.

Give your document a meaningrul title, and get in the habbit of using your First and Last name along with your student number in the Author field. Leave the default settings of HTML output and visual editior and press Create.

qmd Overview

This section will cover the various elements commonly found in a typical Quarto document.

YAML

At the top of you document you will see a YAML header demarcated by three dashes (---) on either end. This is is where you specify the metadata and key settings that control how your document is rendered and formatted.

---
title: "Lab 0"
author: "Irene Vrbik (Student number)"
format: html
editor: visual
---

You can customize your YAML by adding various options. For example, you can include a subtitle and enable a table of contents with the following changes to your YAML (please note that nested indentation when setting HTML options):

Viewing HTML in external brower

Depending on how big your screen is you might not be able to see your table of contents. If you press the “Show in new window button” () the HTML document will open into your default browser and you should be able to view it on the right-hand-side2.

---
title: "Lab 0"
subtitle: "A refresher on R and introduction to Quarto documents"
author: "Irene Vrbik (Student number)"
format:
  html:
    toc: true
    embed-resources: true
editor: visual
---

I’ve also set the HTML option embed-resources: true. This option embeds external resources such as images, CSS, JavaScript, and fonts directly into the self-contained HTML file instead of linking to these resources as separate files. Since I will only be requesting the HTML file for assignment submissions, this setting is essential; without it, resources like images will not be visible in your submission because they would remain as external links. By embedding these resources, you ensure that everything needed to display your document correctly is included within the HTML file, allowing me to see your images and any other embedded content without additional files.

Embedding Resources

Be sure to include embed-resources: true in your assignment Quarto documents. This enables the embedding of external resources (like images) directly into the rendered HTML document.

Code Chunks

R code chunks identified with {r} with (optional) chunk options, in YAML style, identified by #| at the beginning of the line.

```{r}
#| label: fig-code-chunk-example
#| fig-cap: "Scatter plot of Miles Per Gallon vs. Horsepower"

# Basic scatter plot
plot(mtcars$hp, mtcars$mpg,
     main = "Miles Per Gallon vs. Horsepower",
     xlab = "Horsepower",
     ylab = "Miles Per Gallon",
     pch = 19,      # Solid circle point type
     col = "blue")  # Set color of points
```
Figure 1: Scatter plot of Miles Per Gallon vs. Horsepower
Cross Referencing Figures

Pro tip: Quarto enables you to create cross-references3 to figures, tables, equations, sections, code listings, theorems, proofs, and more.

To create this cross-reference: Figure 1 I simply need tlabel the code chunk that generates the image with a reserved prefix, fig-xxx, where xxx is replaced by a meaningful label (avoid using underscores). If you type @ in your .qmd source file Rstudio’s autocomplete or code completion feature will attemp to assist you by suggesting possible completions. In this case, since I labeled my code chunk fig-code-chunk-example I would cross-reference it using @fig-code-chunk-example.

When RStudio provides a list of possible options as you type, it’s called autocomplete or code completion. This feature is designed to assist you by suggesting possible completions based on what you’ve started typing, which can include function names, variable names, code chunk labels, or references, like in your case with cross-references.

Code chunk options

Below are some code chunk options which I most commonly use. For more options: https://quarto.org/docs/reference/cells/cells-knitr.html#code-output

Select chunk options
Option Description
echo Include the source code in output
eval Evaluate code cells. If false just echos the code into output (this is handy for code chunks that take a long time to execute).
code-fold

Collapse code into an HTML <details> tag so the user can display it on-demand.

  • true: collapse code
  • false (default): do not collapse code
  • show: use the <details> tag, but show the expanded code initially.
code-summary Summary text to use for code blocks collapsed using code-fold
cache Whether to cache a code chunk. When evaluating code chunks for the second time, the cached chunks are skipped (unless they have been modified), but the objects created in these chunks are loaded from previously saved databases (.rdb and .rdx files), and these files are saved when a chunk is evaluated for the first time, or when cached files are not found (e.g., you may have removed them by hand). Note that the filename consists of the chunk label with an MD5 digest of the R code and chunk options of the code chunk, which means any changes in the chunk will produce a different MD5 digest, and hence invalidate the cache.
warning Include warnings in rendered output.
error Include errors in the output (note that this implies that errors executing code will not halt processing of the document).
label Unique4 label for code cell. Used when other code needs to refer to the cell (e.g. for cross references fig-samples or tbl-summary)
fig-width: width for figures; the default width for figures is 7 inches.
fig-height height for figures; the default height for figures is 5 inches.
fig-cap Figure caption
out-width Width of the plot in the output document, which can be different from its physical fig-width, i.e., plots can be scaled in the output document. When used without a unit, the unit is assumed to be pixels. However, any of the following unit identifiers can be used: px, cm, mm, in, inch and %, for example, 3in, 8cm, 300px or 50%.
out-height Height of the plot in the output document, which can be different from its physical fig-height, i.e., plots can be scaled in the output document. Depending on the output format, this option can take special values. For example, for LaTeX output, it can be 3in, or 8cm; for HTML, it can be 300px.

For example, see Figure 2 for the rendered output as produced by the following:

```{r}
#| echo: false
#| label: fig-chunk-opt-example
#| fig-cap: "Notice how in this figure 1. you cannot see the souce code 2. we have resized the outputed image to 4.5 x 7 inches (rather than the default 7 x 5)."
#| fig-width: 4.5
#| fig-height: 7

# Basic scatter plot
plot(mtcars$hp, mtcars$mpg,
     main = "Miles Per Gallon vs. Horsepower",
     xlab = "Horsepower",
     ylab = "Miles Per Gallon",
     pch = 19,      # Solid circle point type
     col = "blue")  # Set color of points
```
Figure 2: Notice how in this figure 1. you cannot see the souce code 2. we have resized the outputed image to 4.5 x 7 inches (rather than the default 7 x 5).

Question: What do you suppose will happen if you change the default out-width?

out-width controls how the figure is displayed within the document relative to the surrounding text or page layout; typically specified as a percentage (%), pixels (px), or other CSS units (like cm, in). For instance, “50%”` means that the figure will be rescaled to occupy the 50% of the full width of the text block or container in which it is displayed.

out-width: "50%" means that the figure or output will occupy the 50% of the full width of the text block or container in which it is displayed.

Inline R code

In Quarto, you can include inline R code within your document to dynamically insert the results of R calculations directly into your text. This is particularly useful for including calculated values, summaries, or other outputs directly in the narrative of your document. For insance you could write something like:

The mean of the numbers 11, 20, and 40 is `r mean(c(11, 20, 40))`.

which renders to:

The mean of the numbers 11, 22, and 40 is 23.6666667. If you wanted to format this to control the number of digits to the right of the decimal place use the format() function with nsmall set to the desired number.

The mean of the numbers 11, 20, and 40 is `r format(mean(c(11, 20, 40)), nsmall=2)`.

which renders to:

The mean of the numbers 11, 20, and 40 is 23.66667.

This feature is handy also to call variables you have defined in previous code. For example I can define:

x <- c(11, 20, 40)
xbar <- mean(x)

and use:

The mean of `r x` is `r xbar`.

which renders to:

The mean of 11, 20, 40 is 23.6666667.

Markdown

Markdown is a lightweight markup language that allows you to format text easily using plain text syntax. It’s commonly used in Quarto documents for creating headings, lists, links, tables, code blocks, and more. While Markdown is powerful and easy to learn, Quarto also offers a Visual Editor that simplifies document creation without needing to memorize Markdown syntax. The Visual Editor provides a WYSIWYG (What You See Is What You Get) interface that allows you to format text using toolbar buttons, similar to word processors like Microsoft Word or Google Docs. By using the Visual Editor, you can quickly format your document without delving into the underlying Markdown code, allowing you to focus more on content and less on syntax.

That said, understanding basic Markdown can greatly enhance the readability and organization of your documents. To see whats happening in markdown, you can switch to Source editor; see Figure 3 how to easily toggle back to Visual editor.

(a) Visual Editor Mode
(b) Source Editor Mode
Figure 3: These buttons will help you toggle between Visual editor mode (which is typically the default and in RStudio) and the Source Editor. The Source Editor mode allows you to write and edit your document directly using Markdown, code, and YAML. The Visual Editor mode provides a WYSIWYG (What You See Is What You Get) interface, similar to traditional word processors like Microsoft Word or Google Docs.

Basic Markdown Syntax

Here are some of the most commonly used Markdown elements:

  • Headings: Create headings by using the # symbol followed by a space. The number of # symbols indicates the heading level.

    # Heading 1
    ## Second-level Heading
    ### Third-level Heading
  • Bold and Italics: Use asterisks or underscores to emphasize text.

    **Bold Text** or __Bold Text__
    *Italic Text* or _Italic Text_
  • Lists:

    • Unordered Lists: Use *, -, or + followed by a space.

      - item 1
      - item 2
        - sub-item
    • Ordered Lists: Use numbers followed by a period.

      1. first item
      2. second item
      3. third item
  • Links: Create links using square brackets followed by parentheses.

    [Quarto Website](https://quarto.org)
  • Images: Similar to links but prefixed with an exclamation mark.

    ![Alt Text](path/to/image.png)

R Basics for Data Wrangling

Data wrangling, also known as data cleaning or data manipulation, is an essential skill in data analysis that involves transforming and preparing raw data into a usable format. In R, several functions and packages facilitate efficient data wrangling. Here, we’ll cover some basic techniques using base R and the dplyr package from the tidyverse, which provides a consistent and user-friendly syntax for data manipulation.

Loading Data

Before you can wrangle data, you need to load it into R. Common functions for loading data include:

  • Reading CSV files:

    data <- read.csv("path/to/your/data.csv")
    Tip 1: Relative paths

    When working with data files or images in R, it’s best practice to use relative paths instead of absolute paths. A relative path specifies the location of a file relative to your project’s root directory, rather than the full path from your computer’s root directory. This approach makes your code more portable, shareable, and easier to manage within a project.

    When loading data or saving outputs, use paths relative to your project directory. For example, after Setting up your folder system in my R projects, if your data file is stored in a data/ subfolder, I would use the following relative path:

    data <- read.csv("data/mydata.csv")

    rather than the following absolute path:

    data <- read.csv("C:/Users/YourName/Documents/Data_311/data/mydata.csv")

Loading packages

We will often make use of various packages in R. To download package from CRAN, open your R console or RStudio and install the required packages manually using:

install.packages("xxx")

where xxx is replaced by the package name (case sensitive). For example, a very common package used in Data Science is the tidyverse package.

A screenshot showing how you can download a package in the R console. You should never include `include.packages()` in your qmd script. More generally, you should use the R Console for quick, exploratory analyses or trying out code snippets that you don’t want to save to your document (e.g. debugging or testing).

Download this package (if you don’t have it already) using

install.packages("tidyverse")
Installing packages

Never use install.packages() directly within your Quarto (.qmd) script.

Question: Why should you avoid including install.packages() code in a .qmd script.

Each time the script is rendered or re-run, it will attempt to install the package again, even if it is already installed. This redundancy is unnecessary and can significantly slow down the rendering process. Additionally, if you are not connected to the internet, the script will fail to render because it cannot download the package from CRAN.

After you have downloaded the package (you should do this directly

Viewing and Exploring Data

  • View the structure of the data:

    str(data)
  • Preview the first few rows:

    head(data)
  • Data Summary and checking for missing values

    For numerical data (e.g., integers, doubles), summary(data) displays statistics like the minimum, first quartile, median, mean, third quartile, and maximum. It also shows the number of NA values (missing values) in the column as part of the summary. E.g.

    summary(airquality)
         Ozone           Solar.R           Wind             Temp      
     Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
     1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
     Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
     Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
     3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
     Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
     NA's   :37       NA's   :7                                       
         Month            Day      
     Min.   :5.000   Min.   : 1.0  
     1st Qu.:6.000   1st Qu.: 8.0  
     Median :7.000   Median :16.0  
     Mean   :6.993   Mean   :15.8  
     3rd Qu.:8.000   3rd Qu.:23.0  
     Max.   :9.000   Max.   :31.0  
    

Selecting Columns

Use the select() function from the dplyr package (this is included in the tidyverse package) to choose specific columns from a data frame:

library(dplyr) # or library(tidyverse) 
selected_data <- select(data, column1, column2)
R comments

Note my use of th # symbol in the above code chunk. In R, comments are created using the # symbol. Anything that follows the # on a line is treated as a comment and is ignored by the R interpreter. Comments are crucial for making your code readable, understandable, and maintainable, both for yourself and for others who may work with your code in the future.

If we instead wanted to exclude certain columns we use:

excluded_data <- select(data, -column3)

Filtering Rows

Filtering allows you to subset rows based on conditions using the filter() function:

  • Filter rows based on a condition:

    filtered_data <- filter(data, column1 > 100)
  • Filter with multiple conditions:

    filtered_data <- filter(data, column1 > 100 & column2 == "CategoryA")

Mutating (Adding or Modifying Columns)

The mutate() function allows you to create new columns or modify existing ones:

  • Create a new column:

    data <- mutate(data, new_column = column1 * 2)
  • Modify an existing column:

    data <- mutate(data, column1 = column1 / 100)

Arranging (Sorting) Rows

Use the arrange() function to sort data by one or more columns:

  • Sort data in ascending order:

    sorted_data <- arrange(data, column1)
  • Sort data in descending order:

    sorted_data <- arrange(data, desc(column1))

Summarizing Data

The summarise() function is used to generate summary statistics:

  • Calculate summary statistics:

    summary_stats <- summarise(data, mean_value = mean(column1, na.rm = TRUE))
  • Group and summarize data:

    grouped_summary <- data %>%   group_by(column2) %>%   summarise(mean_value = mean(column1, na.rm = TRUE))
Note

The following sections may not make sense to you if you didn’t take COSC 304 and/or DATA 101. Don’t worry too much if you do not understand what the following code is doing. We will introduce these concepts as needed throughout the course.

Joining Data

Combining multiple data frames is a common task in data wrangling:

  • Inner join:

    joined_data <- inner_join(data1, data2, by = "common_column")
  • Left join:

    left_joined_data <- left_join(data1, data2, by = "common_column")

Pivoting Data

Reshape data using pivot_longer() and pivot_wider() from the tidyr package:

  • Convert wide data to long format:

    long_data <- pivot_longer(data, cols = c(column1, column2), names_to = "variable", values_to = "value")
  • Convert long data to wide format:

    wide_data <- pivot_wider(long_data, names_from = variable, values_from = value)

This marks the end of Lab 0. Pay attention to Canvas to see a release of corresponding Assignment (to be released shortly).

Footnotes

  1. Quarto to also supported by other tools such as VS Codes, Jupyter, Neovim, and Editor. To learn more visit https://quarto.org/docs/get-started/ . Please note that support for this course will be provided exclusively in RStudio.↩︎

  2. If you prefer it on the left-hand side, add the option toc-location: left.↩︎

  3. Cross-referencing involves creating a link or reference to another part of the document. For example, you might refer to “Figure 1” when discussing data visualizations, and clicking on this reference could take the reader directly to that figure.↩︎

  4. you will get an error when you render if you try and reuse a chunk label↩︎