Last updated: 2021-04-27 18:25:43 EST
Set-up
Load the tidyverse:
library(tidyverse)
The tidyverse provides the mpg data set (it ships with ggplot2) to demonstrate methods. mpg records fuel economy in miles per gallon (cty for city, hwy for highway) for a range of car models:
data("mpg") # load the data mpg
# view the first 5 rows
mpg %>%
  slice_head(n = 5)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p |
Summarizing Data
The pipe and summarise()
All summarizing operations work like this:
data %>% # the pipe: "and then"
  summarise()
Inside summarise() you can use functions like mean(), median(), and so on.
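To make the "and then" reading concrete, here is a small sketch showing that the piped call and the ordinary nested call do the same thing. The names piped and nested are only illustrative.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

# with the pipe: take mpg, THEN summarise
piped <- mpg %>%
  summarise(avg_cty = mean(cty))

# without the pipe: mpg is passed as the first argument
nested <- summarise(mpg, avg_cty = mean(cty))

identical(piped, nested) # both calls build the same one-row table
```

The pipe simply hands whatever is on its left to the function on its right as the first argument, which is why long chains read top to bottom.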
mean
mean()
mpg %>% # take the data, THEN
  summarise(mean(cty)) # summarize: mean
Note. cty is a numerical variable.
standard deviation
sd()
mpg %>% # take the data, THEN
  summarise(sd(cty)) # summarize: standard deviation
counting observations
n()
(counting)
mpg %>% # take the data, THEN
  summarise(n()) # summarize: count number of observations
Why doesn’t n() take an argument? Because it counts rows (of the whole data frame, or of each group after group_by()), not the values in a particular column.
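As a quick check of what n() counts, this sketch compares it with base R's nrow(). The column name rows and the object counted are made up for the example.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

counted <- mpg %>%
  summarise(rows = n()) # n() needs no argument: it counts rows

counted$rows == nrow(mpg) # same count as base R's nrow()
```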
multiple summaries
You can put as many functions as you want inside summarise():
mpg %>% # take the data, THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize: mean, median, standard deviation, count
naming variables inside summarise()
You can name the summary columns inside summarise() with name = value:
mpg %>% # take the data, THEN
  summarise(avg_cty = mean(cty), median_cty = median(cty)) # summarize: mean, median
Grouping (group_by())
group_by():
mpg %>% # take the data, THEN
  group_by(class) %>% # group it by car class (e.g., compact, pickup) THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize
class | mean(cty) | median(cty) | sd(cty) | n() |
---|---|---|---|---|
2seater | 15.40000 | 15 | 0.5477226 | 5 |
compact | 20.12766 | 20 | 3.3854999 | 47 |
midsize | 18.75610 | 18 | 1.9465416 | 41 |
minivan | 15.81818 | 16 | 1.8340219 | 11 |
pickup | 13.00000 | 13 | 2.0463382 | 33 |
subcompact | 20.37143 | 19 | 4.6023377 | 35 |
suv | 13.50000 | 13 | 2.4208791 | 62 |
Note. class is a categorical variable.
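Grouping and naming combine naturally. The sketch below gives the grouped summary readable column names; avg_cty, sd_cty, cars, and by_class are illustrative names, not anything required by dplyr.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

by_class <- mpg %>%
  group_by(class) %>%            # one group per car class
  summarise(avg_cty = mean(cty), # mean city mpg within each class
            sd_cty = sd(cty),    # spread within each class
            cars = n())          # number of rows in each class

by_class # one row per class
```

The result has one row per value of class, and the counts in cars add up to the number of rows in mpg.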
Manipulating data
Selecting columns
select() certain columns:
mpg %>% # all the columns
  slice_head(n = 5) # view the first 5 rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p |
mpg %>% # all the columns
  select(manufacturer, model, hwy) %>% # select some of the columns
  slice_head(n = 5) # view the first 5 rows
manufacturer | model | hwy |
---|---|---|
audi | a4 | 29 |
audi | a4 | 29 |
audi | a4 | 31 |
audi | a4 | 30 |
audi | a4 | 26 |
Slicing
All slice operators begin with slice_ and take an optional n= argument that specifies the number of rows you want to see. More here.
Subsets with slice_head():
mpg %>%
  slice_head(n = 10) # view the first 10 rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p |
audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p |
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p |
audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p |
audi | a4 quattro | 2.0 | 2008 | 4 | manual(m6) | 4 | 20 | 28 | p |
Maximums with slice_max():
mpg %>%
  slice_max(hwy, n = 5) # the five highest hwy values; tied rows are all kept, so you can get more than 5 rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
volkswagen | jetta | 1.9 | 1999 | 4 | manual(m5) | f | 33 | 44 | d |
volkswagen | new beetle | 1.9 | 1999 | 4 | manual(m5) | f | 35 | 44 | d |
volkswagen | new beetle | 1.9 | 1999 | 4 | auto(l4) | f | 29 | 41 | d |
toyota | corolla | 1.8 | 2008 | 4 | manual(m5) | f | 28 | 37 | r |
honda | civic | 1.8 | 2008 | 4 | auto(l5) | f | 25 | 36 | r |
honda | civic | 1.8 | 2008 | 4 | auto(l5) | f | 24 | 36 | c |
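The table has six rows even though n = 5, because two cars tie at 36 hwy. If you want exactly n rows, slice_max() and slice_min() accept with_ties = FALSE. A small sketch (the object names are illustrative):

```r
library(dplyr)
library(ggplot2) # provides the mpg data

with_ties <- mpg %>%
  slice_max(hwy, n = 5) # default: keep every row tied for a top-5 value

exactly_five <- mpg %>%
  slice_max(hwy, n = 5, with_ties = FALSE) # drop extra tied rows

nrow(with_ties)    # more than 5 when there are ties
nrow(exactly_five) # always 5
```

Which tied row survives with with_ties = FALSE depends on row order, so only use it when you do not care which of the tied rows you get.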
Minimums with slice_min():
mpg %>%
  slice_min(hwy, n = 2) # the two lowest hwy values; tied rows are all kept
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy |
---|---|---|---|---|---|---|---|---|
dodge | dakota pickup 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
dodge | durango 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
dodge | ram 1500 pickup 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
dodge | ram 1500 pickup 4wd | 4.7 | 2008 | 8 | manual(m6) | 4 | 9 | 12 |
jeep | grand cherokee 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
Random samples with slice_sample():
mpg %>%
  slice_sample(n = 3) # pick three rows at random
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy |
---|---|---|---|---|---|---|---|---|
pontiac | grand prix | 3.8 | 1999 | 6 | auto(l4) | f | 16 | 26 |
dodge | ram 1500 pickup 4wd | 4.7 | 2008 | 8 | manual(m6) | 4 | 9 | 12 |
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 |
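Because the rows are drawn at random, you will get different cars each time you run this. If you need the same sample twice (say, for a homework answer), one option is to set the random seed first with base R's set.seed(). The seed value 42 below is arbitrary.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

set.seed(42) # fix the random number generator
first_draw <- mpg %>%
  slice_sample(n = 3)

set.seed(42) # same seed again
second_draw <- mpg %>%
  slice_sample(n = 3)

identical(first_draw, second_draw) # same seed, same three rows
```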
Slicing and grouping
When you group_by() a data frame, the slice_*() functions return a subset of each group:
mpg %>%
  group_by(class) %>% # group by class
  slice_min(hwy, n = 1) # lowest hwy within each class; tied rows are all kept
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy |
---|---|---|---|---|---|---|---|---|
chevrolet | corvette | 5.7 | 1999 | 8 | auto(l4) | r | 15 | 23 |
volkswagen | jetta | 2.8 | 1999 | 6 | auto(l4) | f | 16 | 23 |
audi | a6 quattro | 4.2 | 2008 | 8 | auto(s6) | 4 | 16 | 23 |
dodge | caravan 2wd | 3.3 | 2008 | 6 | auto(l4) | f | 11 | 17 |
dodge | dakota pickup 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
dodge | ram 1500 pickup 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
dodge | ram 1500 pickup 4wd | 4.7 | 2008 | 8 | manual(m6) | 4 | 9 | 12 |
ford | mustang | 5.4 | 2008 | 8 | manual(m6) | r | 14 | 20 |
dodge | durango 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
jeep | grand cherokee 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 |
If a data frame is already grouped then slice_*() will always subset by group under the hood. If you want subsets that ignore the groups, you have to ungroup() first. This often happens when you group a data frame to create a new variable; see the section on mutate() below.
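A small sketch of the difference, counting how many rows come back with and without the grouping (object names are illustrative):

```r
library(dplyr)
library(ggplot2) # provides the mpg data

grouped <- mpg %>%
  group_by(class) # the grouping sticks to the data frame

per_group <- grouped %>%
  slice_head(n = 1) # first row of EACH class

overall <- grouped %>%
  ungroup() %>%     # drop the grouping
  slice_head(n = 1) # first row of the whole data frame

nrow(per_group) # one row per class
nrow(overall)   # a single row
```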
Filtering
filter() with Boolean logic:
==: “equal to”
!=: “not equal to”
>: “greater than”
>=: “greater than or equal to”
<: “less than”
<=: “less than or equal to”
&: “and”
|: “or”
Boolean logic is any test that returns true or false:
[1] FALSE
[1] TRUE
[1] TRUE
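The tests that produced the output above are not shown, so here are a few illustrative ones of my own. Each line is a test that R answers with TRUE or FALSE; the last one shows that a test on a vector is answered element by element, which is what filter() relies on.

```r
1 == 2           # FALSE: 1 is not equal to 2
"a" != "b"       # TRUE: the two strings differ
3 >= 3 & 2 < 5   # TRUE: both sides of & are true
3 > 5 | 2 < 5    # TRUE: at least one side of | is true

hwy_values <- c(18, 26, 31) # a made-up vector of highway mileages
hwy_values >= 25            # FALSE TRUE TRUE: one answer per element
```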
Some examples:
mpg %>%
  filter(year == 1999) %>% # keep only cars made in 1999 (year is a number, so no quotes)
  slice_head(n = 5) # view the first 5 rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p |
audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p |
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p |
mpg %>%
  filter(hwy >= 25) %>% # keep only cars with at least 25 mpg highway
  slice_head(n = 5) # view the first 5 rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p |
audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p |
audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p |
audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p |
audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p |
mpg %>%
  filter(year == 1999 & model != "a4" & hwy >= 25) %>% # cars made in 1999, that aren't a4's, and have at least 25 mpg highway
  slice_head(n = 5) # view the first 5 rows
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl |
---|---|---|---|---|---|---|---|---|---|
audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p |
audi | a4 quattro | 1.8 | 1999 | 4 | auto(l5) | 4 | 16 | 25 | p |
audi | a4 quattro | 2.8 | 1999 | 6 | auto(l5) | 4 | 15 | 25 | p |
audi | a4 quattro | 2.8 | 1999 | 6 | manual(m5) | 4 | 17 | 25 | p |
chevrolet | corvette | 5.7 | 1999 | 8 | manual(m6) | r | 16 | 26 | p |
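The examples above all use & ("and"). For completeness, here is a sketch with | ("or"), which keeps a row when at least one of the tests is true. The name small_cars is illustrative.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

small_cars <- mpg %>%
  filter(class == "compact" | class == "subcompact") # either class qualifies

nrow(small_cars) # compact rows plus subcompact rows
```

With & between the two tests you would get zero rows, since no car belongs to both classes at once.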
Mutating
Create and modify variables (data frame columns) with mutate()
Create a new variable
For example, create a column called mean_hwy that holds the average highway miles per gallon across all cars (the same value repeated on every row):
mpg %>%
  mutate(mean_hwy = mean(hwy)) %>%
  select(manufacturer, model, hwy, mean_hwy) %>%
  slice_head(n = 5)
manufacturer | model | hwy | mean_hwy |
---|---|---|---|
audi | a4 | 29 | 23.44017 |
audi | a4 | 29 | 23.44017 |
audi | a4 | 31 | 23.44017 |
audi | a4 | 30 | 23.44017 |
audi | a4 | 26 | 23.44017 |
Create a new variable and store it as a column
To save the new variable you have to re-assign the data frame:
mpg = mpg %>%
  mutate(mean_hwy = mean(hwy))
Now mpg has a new column called mean_hwy.
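mutate() pairs well with ifelse(), which builds a column from a true/false test. A minimal sketch, where the column name above_avg and the labels "above" and "below" are made up for the example:

```r
library(dplyr)
library(ggplot2) # provides the mpg data

flagged <- mpg %>%
  mutate(above_avg = ifelse(hwy > mean(hwy), "above", "below")) %>% # test, value if TRUE, value if FALSE
  select(manufacturer, model, hwy, above_avg)

flagged %>%
  slice_head(n = 5)
```

This does not re-assign mpg, so the original data stays untouched; add mpg = in front if you want to keep the column.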
Mutating while grouping
When you create a new variable and save it to the data you re-assign the data.
When grouping with group_by() to create a new column, the re-assigned data will be grouped, which can affect the slice_*() functions.
The solution is to ungroup() after mutate().
For instance, add a column called mean_hwy_class that calculates average highway miles per gallon by class:
# first create the variable
mpg = mpg %>% # take the data, THEN
  group_by(class) %>% # group by class, THEN
  mutate(mean_hwy_class = mean(hwy)) %>% # create the new variable, THEN
  ungroup() # ungroup the data
Why ungroup()? Because otherwise slice_*() and other functions keep working within each class and won’t return the output you expect. Better safe than sorry: when you group to create a variable, make sure you ungroup at the end.
Now we can slice in general:
mpg %>%
  select(manufacturer, model, hwy, mean_hwy_class) %>%
  slice_head(n = 2)
manufacturer | model | hwy | mean_hwy_class |
---|---|---|---|
audi | a4 | 29 | 28.29787 |
audi | a4 | 29 | 28.29787 |
Mutating and frequency tables
Use n() to calculate frequency and then use mutate() to calculate relative frequency:
mpg %>%
  group_by(class) %>%
  summarise(frequency = n()) %>%
  mutate(relative_frequency = frequency / sum(frequency))
class | frequency | relative_frequency |
---|---|---|
2seater | 5 | 0.02136752 |
compact | 47 | 0.20085470 |
midsize | 41 | 0.17521368 |
minivan | 11 | 0.04700855 |
pickup | 33 | 0.14102564 |
subcompact | 35 | 0.14957265 |
suv | 62 | 0.26495726 |
Visualizing data
Plotting distributions
Histograms
Histograms with ggplot and geom_histogram():
mpg %>%
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() # add geom layer: distribution
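By default geom_histogram() splits the data into 30 bins and prints a message suggesting you pick something better. One way to do that is the binwidth argument. The sketch below uses a width of 1 mile per gallon, which is just a reasonable guess for this variable.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

p <- mpg %>%
  ggplot(aes(x = cty)) +        # blank canvas: data and x-axis variable
  geom_histogram(binwidth = 1)  # each bar covers 1 mile per gallon

p # print the plot
```

Try a few widths: very narrow bins look noisy, very wide bins hide the shape of the distribution.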

Faceted histograms with facet_wrap():
mpg %>%
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() + # add geom layer: distribution
  facet_wrap(~class) # add faceting (note the "~" before the categorical variable)

Cumulative distributions
Percentiles and cumulative distributions with ggplot and stat_ecdf():
mpg %>%
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  stat_ecdf() # add stat layer: cumulative distribution

You can add a vertical line with geom_vline() to make it easier to read off a percentile. For instance, to see the percent of cars that get fewer than 15 city miles per gallon:
mpg %>%
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  stat_ecdf() + # add stat layer: cumulative distribution
  geom_vline(xintercept = 15, color = "red") # vertical line at cty = 15
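Reading a percent off a plot is approximate. If you want the number itself, one option is to take the mean of a true/false test, since TRUE counts as 1 and FALSE as 0. The column names below are illustrative.

```r
library(dplyr)
library(ggplot2) # provides the mpg data

shares <- mpg %>%
  summarise(share_below_15 = mean(cty < 15),        # strictly fewer than 15 mpg
            share_at_or_below_15 = mean(cty <= 15)) # what the ECDF curve shows at x = 15

shares
```

The curve drawn by stat_ecdf() gives the share at or below a value, so it matches the second column rather than the first.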

Box plots
geom_boxplot():
mpg %>%
  ggplot(aes(x = hwy, y = class)) + # x is continuous, y is categorical
  geom_boxplot()

or:
mpg %>%
  ggplot(aes(x = class, y = hwy)) + # x is categorical, y is continuous
  geom_boxplot()

Scatterplot
geom_point():
mpg %>%
  ggplot(aes(x = cty, y = hwy)) +
  geom_point()

Scatterplot with regression line
geom_smooth(method = "lm"):
mpg %>%
  ggplot(aes(x = cty, y = hwy)) +
  geom_point() + # scatter-plot
  geom_smooth(method = "lm") # regression line

Correlation and regression
Covariance
cov():
mpg %>%
  summarise(cov(x = cty, y = hwy))
Correlation
cor():
mpg %>%
  summarise(cor(x = cty, y = hwy))
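Correlation is just covariance rescaled by the two standard deviations, which is why it always lands between -1 and 1. A quick sketch that checks this on mpg (the column names are illustrative):

```r
library(dplyr)
library(ggplot2) # provides the mpg data

check <- mpg %>%
  summarise(r_builtin = cor(cty, hwy),                          # built-in correlation
            r_by_hand = cov(cty, hwy) / (sd(cty) * sd(hwy)))    # covariance / (sd_x * sd_y)

check # the two columns agree
```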
Regression
lm()
Simple linear regression (one x variable):
$$y = f(x) = \beta_0 + \beta_1 x + \epsilon$$
In lm() the y and x variables are separated by ~:
mpg %>%
  lm(formula = cty ~ hwy, data = .) # the "." means "use the data from the pipe %>%"
Call:
lm(formula = cty ~ hwy, data = .)
Coefficients:
(Intercept) hwy
0.8442 0.6832
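If you store the model in a variable you can pull out the coefficients and make predictions. This is a sketch using base R's coef() and predict(); the name fit and the value hwy = 30 are arbitrary choices for illustration.

```r
library(ggplot2) # provides the mpg data

fit <- lm(cty ~ hwy, data = mpg) # same model as above, stored in a variable

coef(fit) # intercept and slope

# predicted city mpg for a car that gets 30 highway mpg
new_car <- data.frame(hwy = 30)
predicted <- predict(fit, newdata = new_car)

# the same prediction by hand: intercept + slope * 30
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["hwy"]] * 30
```

The slope reads as: each extra highway mile per gallon goes with about 0.68 more city miles per gallon, on average.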
Multiple linear regression (multiple x variables):
$$y = f(X) = f(x_1, x_2, \dots) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \epsilon$$
In lm() the x variables are separated by +:
mpg %>%
  lm(formula = cty ~ hwy + cyl + displ, data = .)
Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)
Coefficients:
(Intercept) hwy cyl displ
6.08786 0.58092 -0.44827 -0.05935
summary()
View hypothesis tests and regression diagnostics with summary():
mpg %>%
  lm(formula = cty ~ hwy + cyl + displ, data = .) %>%
  summary()
Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)
Residuals:
Min 1Q Median 3Q Max
-3.0347 -0.6012 -0.0229 0.7397 5.2573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.08786 0.83226 7.315 4.25e-12 ***
hwy 0.58092 0.02010 28.900 < 2e-16 ***
cyl -0.44827 0.13010 -3.446 0.000677 ***
displ -0.05935 0.16351 -0.363 0.716971
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.148 on 230 degrees of freedom
Multiple R-squared: 0.9281, Adjusted R-squared: 0.9272
F-statistic: 989.9 on 3 and 230 DF, p-value: < 2.2e-16
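The printed summary is also an object you can pick apart, which is handy when you only need one number. A sketch, with fit3 and fit3_summary as illustrative names:

```r
library(ggplot2) # provides the mpg data

fit3 <- lm(cty ~ hwy + cyl + displ, data = mpg)
fit3_summary <- summary(fit3)

fit3_summary$r.squared    # the "Multiple R-squared" line
fit3_summary$coefficients # the coefficient table as a matrix
```

In the output above, displ has a large p-value (about 0.72), so once hwy and cyl are in the model there is little evidence that engine displacement adds anything.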
Plotting regression results
Use the package sjPlot. (If you forgot how to install a package, see here.)
The function you want is plot_model(). The baseline plot shows the coefficient estimates with their confidence intervals:
mpg %>%
  lm(formula = cty ~ hwy + class, data = .) %>%
  plot_model(model = .)
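The plotting function comes from a separate package, so it has to be loaded before the code above will run. A minimal sketch, assuming the package is already installed (the install line is commented out) and storing the model in a variable called fit for readability:

```r
# install.packages("sjPlot") # run once if the package is not installed yet
library(sjPlot)  # provides plot_model()
library(ggplot2) # provides the mpg data

fit <- lm(cty ~ hwy + class, data = mpg) # store the model first

p <- plot_model(fit) # default plot: coefficient estimates
p
```

Storing the model first means you can reuse it for several plots without re-typing the formula.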

To plot predicted values you can set type = "pred" and then choose which predictor to plot by setting terms:
mpg %>%
  lm(formula = cty ~ hwy + class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy"))

The terms argument accepts multiple terms:
mpg %>%
  lm(formula = cty ~ hwy + class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))

This is useful when you have an interaction effect (e.g., hwy*class):
mpg %>%
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))

sjPlot uses ggplot so you can add ggplot layers to it, like a different theme and titles:
mpg %>%
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy", "class")) +
  labs(x = "Highway miles per gallon", y = "Predicted city miles per gallon",
       title = "A very interesting linear model", subtitle = "So interesting") +
  theme_minimal()

The package has tons of great features. Check out the website!
Quick tables
Function | Description |
---|---|
%>% | Pipe operator (“and then”) |
summarise() | Summarize a vector or multiple vectors from a data frame |
mean() | Calculate the mean of a vector |
median() | Calculate the median of a vector |
sd() | Calculate the standard deviation of a vector |
n() | Count the number of observations. Takes no argument. |
group_by() | Group observations by a categorical variable |
select() | Select certain columns |
slice_*() | Slice rows from the data |
slice_head(n=5) | View the head (first five rows) of the data. n = can be any number. |
slice_max(var, n=5) | View the rows with the 5 highest values of column “var” |
slice_min(var, n=5) | View the rows with the 5 lowest values of column “var” |
slice_sample(n=5) | Draw 5 rows at random |
filter() | Keep the observations that pass a true/false test |
mutate() | Create or modify a column |
cov() | Calculate the covariance between two variables |
cor() | Calculate the correlation between two variables |
lm() | Estimate a linear regression |
ifelse() | Create a vector based on a “True/False” test |
ggplot() | Create a base plot |
geom_histogram() | Histogram |
stat_ecdf() | Cumulative distribution plot |
geom_boxplot() | Boxplot |
geom_point() | Scatter plot |
geom_smooth(method = "lm") | Regression line |

Symbol | Meaning | Pronounced | Formula |
---|---|---|---|
$x_i$ | data point i | “x i” | |
$n$ | sample size | | |
$N$ | population size | | |
$\bar{x}$ | the sample mean | “x bar” | $\frac{\sum_{i=1}^{n} x_i}{n}$ |
$\mu$ | the population mean | “mu” | |
$s^2$ | the sample variance | | $\frac{\sum (x_i - \bar{x})^2}{n - 1}$ |
$\sigma^2$ | the population variance | | |
$s$ | the sample standard deviation | | $\sqrt{s^2}$ |
$\sigma$ | the population standard deviation | “sigma” | |
$z_i$ | z-score for observation i | | $z_i = \frac{x_i - \bar{x}}{s}$ |
$\beta$ | regression coefficient | “beta” | $y = \beta_0 + \beta_1 x + \epsilon$ |
$\hat{\beta}$ | estimated value of regression coefficient | “beta hat” | |
$\epsilon$ | regression error | “epsilon” | |
$\sum$ | summation operator | “sum” | $\sum_{i=1}^{2} x_i = x_1 + x_2$ |
$s_{xy}$ | covariance between two variables x and y | | $\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \in (-\infty, \infty)$ |
$r_{xy}$ | correlation between two variables x and y | | $\frac{s_{xy}}{s_x s_y} \in [-1, 1]$ |
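The formulas in the table can be checked against R's built-in functions. The sketch below computes the sample mean, variance, standard deviation, and z-scores of cty by hand; the variable names mirror the symbols in the table and are otherwise arbitrary.

```r
library(ggplot2) # provides the mpg data

x <- mpg$cty   # the data points x_i
n <- length(x) # sample size

x_bar <- sum(x) / n                 # sample mean
s2 <- sum((x - x_bar)^2) / (n - 1)  # sample variance
s <- sqrt(s2)                       # sample standard deviation
z <- (x - x_bar) / s                # z-score for every observation

all.equal(x_bar, mean(x)) # matches mean()
all.equal(s2, var(x))     # matches var()
all.equal(s, sd(x))       # matches sd()
```

By construction the z-scores have mean 0 and standard deviation 1, which is what makes them comparable across variables.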
