Information

Introduction

This tutorial introduces you to the R language. Our approach is inspired by R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You will learn how to work with data sets using the tidyverse meta-package, how to direct the result of one function to another using the pipe, |>, and how to make a plot using the ggplot() function.

This tutorial assumes that you have already completed the “Getting Started” tutorial in the tutorial.helpers package. If you haven’t, do so now. It is quick!

From the main Positron menu, start a new window with File -> New Window. This new window is the location in which you will do all the work for the tutorial. The current window, the one in which you are reading these words, is just used to run this tutorial.

Working with data

Learn how to explore a data set using functions like summary(), glimpse(), and slice_sample().

Exercise 1

Before you start doing data science, you must load the packages you are going to use. Use the function library() to load the tidyverse package. Click “Run Code.” The check mark which appears next to “Exercise 1” above indicates that you have submitted your answer. It doesn’t verify that you have answered the question correctly.

library(...)

“Library” and “package” mean the same thing in R. We have different words for historical reasons. However, only the library() command will load a package/library, giving us access to the functions and data which it contains.

Exercise 2

In this tutorial, you will sometimes enter code into the exercise blocks, as you did above. But we will also ask you to run code in the Console. (You will do this in the other Positron window, since the Console in this window is currently busy running this tutorial.) Example:

In the Console, run library(tidyverse).

With Console questions, we will usually ask you to Copy/Paste the Command/Response into an answer block, like the one below. We usually shorten those instructions as CP/CR. Do that now.

Your answer should look like:

> library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package to force all conflicts to become errors
>

Your answer never needs to match ours perfectly. Our goal is just to ensure that you are actually following the instructions.

Exercise 3

Data frames, also referred to as “tibbles,” are spreadsheet-type data sets.

In the Console, run diamonds.

CP/CR.

diamonds
## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows

Whenever we show outputs like this after a question, then we are showing our answer to the previous question, even if we do not label it as such.

Exercise 4

In the Console, run summary() on diamonds.

CP/CR.

summary(diamonds)
##      carat               cut        color        clarity          depth           table           price      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00   Min.   :  326  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00   Median : 2401  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46   Mean   : 3933  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00   Max.   :18823  
##                                     J: 2808   (Other): 2531                                                  
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800  
## 

This function provides a quick statistical overview of each variable in the data set. In some cases, as here, the tutorial displays the same object differently from what you were able to copy/paste. That is OK! Your answer does not need to match our answer.

Exercise 5

In the Console, run slice_sample() on diamonds. This selects a random row from the data set.

CP/CR.

slice_sample(diamonds)
## # A tibble: 1 × 10
##   carat cut   color clarity depth table price     x     y     z
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.54 Good  E     SI2      63.8    54  1163  5.17  5.18   3.3

Your answer will differ from this answer because of the inherent randomness in functions like slice_sample().
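If you ever want a random draw to be repeatable, one common approach is to fix the random seed first. The sketch below assumes a fresh session; the seed value 42 and the names first and second are arbitrary choices for illustration, not part of this tutorial's exercises.

```r
library(tidyverse)  # only needed in a fresh session

set.seed(42)                      # fix the random number generator
first <- slice_sample(diamonds)   # one random row

set.seed(42)                      # reset to the same seed
second <- slice_sample(diamonds)  # the same random row again

identical(first, second)          # TRUE: same seed, same draw
```

Without the calls to set.seed(), the two draws would almost always differ.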

Exercise 6

In the Console, hit the Up Arrow to retrieve the previous command. Edit it to add the argument n = 4 to slice_sample(diamonds). This will return 4 random rows from the diamonds data set.

CP/CR.

slice_sample(diamonds, n = 4)
## # A tibble: 4 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.41 Fair      E     SI2      64.7    55   818  4.7   4.67  3.03
## 2  1.06 Very Good G     SI2      61      57  4769  6.58  6.61  4.02
## 3  0.9  Very Good J     VS1      62.3    56  3231  6.19  6.23  3.87
## 4  1.01 Very Good E     VS2      62.7    60  6128  6.31  6.38  3.98

Editing code directly in the Console quickly becomes annoying. See the positron.tutorials package for tutorials about using Positron to write and organize your code.

Exercise 7

In the Console, run print() on diamonds. This returns the same result as typing diamonds.

CP/CR.

print(diamonds)
## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows

You can choose how many rows to display by using the n argument in the print() function, and how many columns to display by using the width argument.
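As a rough illustration of those two arguments, the sketch below prints 5 rows and limits the output to 50 characters across. The specific values are arbitrary. With a narrow width, columns which do not fit are typically listed below the tibble instead of being displayed.

```r
library(tidyverse)  # only needed in a fresh session

# Show 5 rows, using at most 50 characters of width
print(diamonds, n = 5, width = 50)

# width = Inf asks print() to display every column
print(diamonds, n = 5, width = Inf)
```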

Exercise 8

In the Console, run print() on diamonds with the argument n = 3. This returns the first 3 rows of the diamonds data set.

CP/CR.

print(diamonds, n = 3)
## # A tibble: 53,940 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
## # ℹ 53,937 more rows

print(), by default, gives the top of the tibble, so your answer should match our answer. slice_sample(), on the other hand, picks random rows to return. But, in both cases, the result is a tibble.

A central organizing principle of the Tidyverse is that most functions take a tibble as their first argument and return a tibble. This allows us to “chain” commands together, one after the other.
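Here is a small sketch of that "tibble in, tibble out" idea, using only slice_sample() and the is_tibble() function from the tibble package. The names sampled and smaller are made up for this example.

```r
library(tidyverse)  # only needed in a fresh session

# slice_sample() takes a tibble and returns a tibble ...
sampled <- slice_sample(diamonds, n = 100)
is_tibble(sampled)   # TRUE

# ... so its result can be handed straight to another function
smaller <- slice_sample(sampled, n = 5)
is_tibble(smaller)   # TRUE
nrow(smaller)        # 5
```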

Exercise 9

In the Console, run ?diamonds.

This will look up the help page for the diamonds tibble from the ggplot2 package, which is one of the core packages in the Tidyverse. The help page will appear on the right side of your Positron window, in the Secondary Activity Bar, which you might need to activate in order to see.

Copy/paste the Description section of the help page below.

You can find help about an entire package with help(package = "ggplot2"). It is confusing, but unavoidable, that package names are sometimes unquoted, as in library(ggplot2), and sometimes quoted, as in help(package = "ggplot2"). If one does not work, try the other.
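To make the quoting issue concrete, here is a short sketch. library() happens to accept both forms, while packageVersion(), a base R function not otherwise used in this tutorial, expects the name as a quoted string.

```r
library(ggplot2)     # unquoted package name
library("ggplot2")   # quoted package name also works with library()

# packageVersion() needs the quoted form; the unquoted form would
# make R look for an object named ggplot2 and fail
packageVersion("ggplot2")
```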

Exercise 10

In the Console, run glimpse() on diamonds. CP/CR.

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, …
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very …
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, …
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, …
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 33…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, …
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, …
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, …

glimpse() displays columns running down the page and the data running across. Note how the “type” of each variable is listed next to the variable name. For example, price is listed as <int>, meaning that it is an integer variable. To learn more about the glimpse() function, run ?glimpse.
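You can also check the type of a single variable yourself. The sketch below uses the $ operator to pull one column out of the tibble and the base R function class() to report its type. Neither is covered elsewhere in this tutorial, so treat this as an optional aside.

```r
library(tidyverse)  # only needed in a fresh session

class(diamonds$price)   # "integer", shown as <int> by glimpse()
class(diamonds$carat)   # "numeric", shown as <dbl>
class(diamonds$cut)     # "ordered" "factor", shown as <ord>
```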

view() is another useful function, but, because it is interactive, we should not use it within a tutorial.

Exercise 11

In the Console, run sqrt(144).

CP/CR.

sqrt(144)
## [1] 12

The square root function is one of many built-in functions in R. Most return their result, which R then, by default, prints out.

Exercise 12

In the Console, run x <- sqrt(144).

CP/CR.

x <- sqrt(144)

The symbol <- is the assignment operator. In this case, we are assigning the value of sqrt(144) to the variable x. Nothing is printed out because of that assignment.

Also, you can see x in the “Variables” tab under the “Session” pane in the Secondary Activity Bar on the right-hand side of the Positron window.

Exercise 13

In the Console, run x.

CP/CR.

x
## [1] 12

Now that x has been defined in the Console, it is available for your use. Above, we just print it out. But we could also use it in other calculations, e.g., x + 5.
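For instance, a follow-up calculation might look like the sketch below. The variable name y is just an illustration.

```r
x <- sqrt(144)   # x is 12

x + 5            # 17, printed but not saved anywhere

y <- x + 5       # save the result under a new name
y                # 17
```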

Pipes and plots

Although the Tidyverse includes hundreds of commands for data manipulation, the most important are filter(), select(), arrange(), mutate(), and summarize().
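The exercises below practice filter(), select(), and arrange(). For a taste of the other two functions, here is a minimal sketch using the diamonds tibble and the |> pipe, which the exercises below explain. The column names price_per_carat and avg_price_per_carat are invented for this example.

```r
library(tidyverse)  # only needed in a fresh session

diamonds |>
  # mutate() adds a new column computed from existing ones
  mutate(price_per_carat = price / carat) |>
  # summarize() collapses all the rows into a single summary row
  summarize(avg_price_per_carat = mean(price_per_carat))
```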

Exercise 1

Let’s warm up by examining the gss_cat tibble from the forcats package. Since forcats is a core tidyverse package, you have already loaded it. Type gss_cat and hit “Run Code.”

Instead of using the Console, we will be doing the exercises in this section using exercise blocks.

...

As the help page notes, gss_cat is a “sample of categorical variables from the General Social Survey.”

Exercise 2

Run summary() on gss_cat.

summary(...)

Note that there are missing values in some columns. The word NA stands for “Not Available” and is used to represent missing data in R.
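Missing values matter because they tend to spread through calculations. The sketch below uses a small made-up vector, not data from gss_cat, to show the behavior in base R.

```r
hours <- c(2, NA, 4)        # hypothetical values, one of them missing

is.na(hours)                # FALSE  TRUE FALSE
mean(hours)                 # NA, because one value is unknown
mean(hours, na.rm = TRUE)   # 3, after ignoring the missing value
```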

Exercise 3

Pipe gss_cat to drop_na(). This function removes rows with missing values. The pipe symbol, |>, allows us to chain R commands together, one after the other, with each one connected to the next by the pipe. In this case, we want:

gss_cat |> 
  drop_na()
... |> 
  drop_na()

Note the number of rows in the tibble after drop_na(). Since drop_na() removes rows with missing values, the number of rows in the tibble will be less than the original number of rows.

We could achieve the same result by running drop_na(gss_cat). The symbol |> just “pipes” gss_cat into drop_na() as its first argument.
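You can confirm that the two forms are equivalent with the base R function identical(). This is a quick check, not something you need to do in practice; the names piped and nested are made up.

```r
library(tidyverse)  # only needed in a fresh session

piped  <- gss_cat |> drop_na()   # pipe form
nested <- drop_na(gss_cat)       # ordinary function-call form

identical(piped, nested)         # TRUE
```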

Exercise 4

Pipe gss_cat to filter(). Within filter(), use the argument year == 2014.

gss_cat |> 
  ...(year == 2014)

This workflow — in which we pipe a tibble to a function, which then outputs another tibble, which we can then pipe to another function, and so on — is very common in R programming.

The resulting tibble has the same number of columns as gss_cat because filter() only affects the rows. But there are many fewer rows.
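The same pattern holds for any tibble. As a sketch, here is filter() applied to diamonds instead of gss_cat, keeping only the rows where cut is "Ideal". The name ideal is arbitrary.

```r
library(tidyverse)  # only needed in a fresh session

ideal <- diamonds |>
  filter(cut == "Ideal")

ncol(ideal) == ncol(diamonds)   # TRUE: filter() leaves the columns alone
nrow(ideal)                     # 21551, matching the count from summary(diamonds)
nrow(diamonds)                  # 53940
```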

Exercise 5

Continue the code and pipe with select(), using the argument age, marital, race, relig, tvhours. Note that you do not need to retype the code from the last exercise. You can just click the “Copy Code” button.

... |> 
  select(age, ..., race, ..., tvhours)

Note how the Hint only gives the most recent line of the pipe. Because select() does not affect the rows, we have the same number as after filter(). But we only have 5 columns now, consistent with what we told select() to do.
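To see the same effect on a different tibble, the sketch below applies select() to diamonds. The three columns chosen are arbitrary, as is the name three_cols.

```r
library(tidyverse)  # only needed in a fresh session

three_cols <- diamonds |>
  select(carat, cut, price)

dim(three_cols)   # 53940 rows and 3 columns: same rows, fewer columns
```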

Exercise 6

Copy previous code. Continue the pipe with summary()

... |> 
  summary()

Note that there are missing values in the tvhours column. Let’s remove them.

Exercise 7

Copy previous code. Replace the summary() with drop_na().

... |> 
  drop_na()

The number of rows has decreased because we removed rows with missing values. drop_na() removes all rows which have a missing value for any of the variables. If we wanted to just remove the rows which are missing tvhours, we would use drop_na(tvhours).
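The difference is easiest to see on a tiny tibble. The data below are invented for illustration and are not taken from gss_cat.

```r
library(tidyverse)  # only needed in a fresh session

# Hypothetical data: one missing age and one missing tvhours
toy <- tibble(age     = c(30, NA, 45),
              tvhours = c(2, 3, NA))

toy |> drop_na() |> nrow()          # 1: only the first row is complete
toy |> drop_na(tvhours) |> nrow()   # 2: the row with a missing age is kept
```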

Exercise 8

Continue the pipe with arrange(), using tvhours as the argument.

... |> 
  arrange(...)

The arrange() function sorts the rows of a tibble. By default, it sorts in ascending order.
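As a sketch of both sort directions on a different tibble, the code below sorts diamonds by carat. The names smallest_first and largest_first are made up for this example.

```r
library(tidyverse)  # only needed in a fresh session

# Ascending (the default): smallest diamonds first
smallest_first <- diamonds |>
  arrange(carat)

# Descending: wrap the variable in desc()
largest_first <- diamonds |>
  arrange(desc(carat))

print(smallest_first, n = 3)
print(largest_first, n = 3)
```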

Exercise 9

Copy the previous code. Put desc() around tvhours to sort in descending order.

... |> 
  arrange(desc(...))

Got to respect someone who watches TV 24 hours a day!

Exercise 10

Let’s make a plot. Copy the previous code, and pipe to ggplot(). Set aes(x = age, y = tvhours).

... |> 
  ggplot(aes(x = ..., y = ...))

This returns a blank plot: the axes show age and tvhours, but no data appear because we have not yet added a geom layer to draw them.

Exercise 11

Add another layer with geom_jitter() using the + sign. Plotting code in the ggplot2 package uses +, not |>, to connect different commands together. This difference comes from the fact that ggplot2 was written 10+ years before the pipe was invented.

... + 
  geom_jitter()

This is a scatterplot of age versus tvhours. The x-axis is age, and the y-axis is the number of hours of TV watched per day.
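The switch from |> to + is easy to forget, so here is a separate sketch using diamonds. The sample size of 500 is an arbitrary choice to keep the plot light, and p is just a name for the saved plot.

```r
library(tidyverse)  # only needed in a fresh session

p <- diamonds |>                        # |> passes data from step to step
  slice_sample(n = 500) |>
  ggplot(aes(x = carat, y = price)) +   # + adds layers once the plot begins
  geom_point()

p   # printing the object draws the plot
```

Everything before ggplot() is connected with |>. Everything after it is connected with +.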

Exercise 12

Finally, add a title, a subtitle, and labels for the x and y axes using labs(). The subtitle should be the one sentence of information about the graph with which you would hope a reader walks away. What is the most important fact demonstrated in the graphic?

Consider this example graph:

You can make yours look like ours, if you like.

... + 
  labs(title = "...", 
       subtitle = "...", 
       x = "...", 
       y = "...")

Note that the code in the exercise block is not saved. If you want to save the code, you can copy/paste it into an R script file.

Generative AI

Generative AI tools, like ChatGPT, Grok, Claude, DeepSeek and so on, are the future of data science and of everything else. The more you use these tools, the better off you will be. Unfortunately, the tools are changing so much that it is hard for a tutorial like this to stay up-to-date. This section provides some general advice and practice exercises.

Exercise 1

Using any AI you like, ask it to write a one-sentence summary about the R programming language. Copy the answer below.

Our answer:

R is a free, open-source programming language and software environment designed specifically for statistical computing, data analysis, and creating data visualizations.

If you do not want to pay for an AI service, then you will probably need to have free accounts with several different services. That way, if one service cuts you off for the day, you can switch to another.

Exercise 2

Type diamonds in the Console. Hit Enter. Copy/paste the command and the first few lines of the tibble.

diamonds
## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # ℹ 53,930 more rows

When working with AI, you often need to tell it about the data set. The easiest way to do that is often to just copy/paste over the first few lines. That shows the AI what the variable names and types are, which is the key information it needs for creating plots and pipes.

Exercise 3

Copy/paste the top of the diamonds tibble into your AI interface and ask it to create a pipe which calculates the average value of carat for all combinations of cut and color. Run the provided code in the Console. If it fails, show the AI the error and ask for better code.

CP/CR.

Claude gave us this answer:

diamonds %>%
  group_by(cut, color) %>%
  summarise(avg_carat = mean(carat), .groups = "drop")
## # A tibble: 35 × 3
##    cut   color avg_carat
##    <ord> <ord>     <dbl>
##  1 Fair  D         0.920
##  2 Fair  E         0.857
##  3 Fair  F         0.905
##  4 Fair  G         1.02 
##  5 Fair  H         1.22 
##  6 Fair  I         1.20 
##  7 Fair  J         1.34 
##  8 Good  D         0.745
##  9 Good  E         0.745
## 10 Good  F         0.776
## # ℹ 25 more rows

But we don’t like this answer! First, it uses %>%, an older version of the pipe symbol, instead of |>. Second, it uses group_by(), which necessitates the addition of the confusing .groups argument. We would have preferred:

## # A tibble: 35 × 3
##    cut       color avg_carat
##    <ord>     <ord>     <dbl>
##  1 Ideal     E         0.578
##  2 Premium   E         0.718
##  3 Good      E         0.745
##  4 Premium   I         1.14 
##  5 Good      J         1.10 
##  6 Very Good J         1.13 
##  7 Very Good I         1.05 
##  8 Very Good H         0.916
##  9 Fair      E         0.857
## 10 Ideal     J         1.06 
## # ℹ 25 more rows
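The code which produced this output is not shown above. One pipe which avoids both %>% and group_by(), and which should return groups in order of first appearance as in the output above, uses the .by argument of summarize(). This is our reconstruction, and it assumes dplyr 1.1.0 or later, where .by is available. The name avg_by_group is made up for the example.

```r
library(tidyverse)  # only needed in a fresh session

avg_by_group <- diamonds |>
  # .by groups the calculation without a separate group_by() step
  summarize(avg_carat = mean(carat), .by = c(cut, color))

avg_by_group
```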

Using AI is good! But intelligent use — use in which you understand what the AI has done and try to improve/clarify its answer — is even better.

Exercise 4

Ask AI to create a beautiful plot using diamonds and the tidyverse package. Run the provided code in the Console. If it fails, show the AI the error and ask for better code.

DeepSeek provides this code:

library(tidyverse)

# Create a sophisticated diamond price analysis plot
diamonds %>%
  sample_frac(0.3) %>% # Sample 30% for better performance
  mutate(price_per_carat = price / carat) %>%
  ggplot(aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.6, size = 1.5, position = position_jitter(width = 0.02)) +
  geom_smooth(formula = 'y ~ x', method = "loess", se = FALSE, linewidth = 1) +
  scale_y_continuous(labels = scales::dollar_format()) +
  scale_color_manual(
    values = c("Ideal" = "#E69F00", 
               "Premium" = "#56B4E9", 
               "Very Good" = "#009E73", 
               "Good" = "#F0E442", 
               "Fair" = "#0072B2")
  ) +
  facet_wrap(~color, labeller = labeller(color = ~paste("Color", .))) +
  labs(
    title = "Diamond Price Analysis: Carat vs Price by Cut Quality and Color",
    subtitle = "Relationship between diamond characteristics and market pricing",
    x = "Carat Weight",
    y = "Price (USD)",
    color = "Cut Quality",
    caption = "Data: ggplot2 diamonds dataset | Visualization: Tidyverse"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
    legend.position = "bottom",
    strip.background = element_rect(fill = "gray95", color = NA),
    strip.text = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    plot.caption = element_text(color = "gray50", size = 9),
    legend.key.size = unit(0.8, "lines")
  )

Note:

  • DeepSeek provided the library(tidyverse) even though that is not necessary. We have already loaded the tidyverse package. Loading it again is sloppy. It is up to you to keep track of which libraries are loaded.

  • DeepSeek first provided code that used library(ggthemes). This is a nice package which we have not loaded. But we also haven’t installed it yet. Again, AI is good at generating code that will work if all its dependencies have been installed. You need to keep track. It is also useful to tell the AI which packages to use and which to avoid.

  • AI tends to generate too much code. For example, our question does not require or suggest the use of theme(), much less all the specific settings which DeepSeek has provided. So, we often tell the AI to be “concise.”

  • AI will often make mistakes. The first version which DeepSeek provided used size instead of linewidth as an argument to geom_smooth() because size was the correct argument for line thickness until ggplot2 3.4.0 introduced linewidth. DeepSeek also forgot the formula = 'y ~ x' argument. It is your job to read, understand and fix any messages/warnings/errors generated by your code.
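To illustrate what a more concise answer might look like, here is a trimmed-down sketch in the spirit of the notes above. It is our own simplification, not DeepSeek's output: it keeps linewidth and the formula argument, drops theme() entirely, and uses an arbitrary sample of 2,000 rows to keep the loess smoother fast.

```r
library(tidyverse)  # only needed in a fresh session

p <- diamonds |>
  slice_sample(n = 2000) |>   # arbitrary sample size, for speed
  ggplot(aes(x = carat, y = price, color = cut)) +
  geom_point(alpha = 0.5) +
  # linewidth, not size, sets line thickness in current ggplot2
  geom_smooth(method = "loess", formula = y ~ x,
              se = FALSE, linewidth = 1) +
  labs(title = "Diamond price by carat and cut",
       x = "Carat",
       y = "Price (USD)",
       color = "Cut")

p
```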

The key is practice. Use AI every day!

Summary

This tutorial introduced you to the R language. Our approach was inspired by R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. You learned how to work with data sets using the tidyverse meta-package, how to direct the result of one function to another using the pipe, |>, and how to make a plot using the ggplot() function.

Download answers

When you have completed this tutorial, follow these steps:
  1. Click the button to download a file containing your answers.
  2. Save the file onto your computer in a convenient location.
Download HTML

(If no file seems to download, try right-clicking the download button and choose "Save link as...")

Introduction to R

David Kane