  • Set-up
  • 1 Summarizing Data
    • 1.1 The pipe and summarise()
      • 1.1.1 mean
      • 1.1.2 median
      • 1.1.3 standard deviation
      • 1.1.4 counting observations
      • 1.1.5 multiple summaries
      • 1.1.6 naming variables inside summarise()
    • 1.2 Grouping (group_by())
  • 2 Manipulating data
    • 2.1 Selecting columns
    • 2.2 Slicing
      • 2.2.1 Slicing and grouping
    • 2.3 Filtering
    • 2.4 Mutating
      • 2.4.1 Create a new variable
      • 2.4.2 Create a new variable and store it as a column
      • 2.4.3 Mutating while grouping
      • 2.4.4 Mutating and frequency tables
  • 3 Visualizing data
    • 3.1 Plotting distributions
      • 3.1.1 Histograms
      • 3.1.2 Cumulative distributions
      • 3.1.3 Box plots
    • 3.2 Scatterplot
      • 3.2.1 Scatterplot with regression line
  • 4 Correlation and regression
    • 4.1 Covariance
    • 4.2 Correlation
    • 4.3 Regression
    • 4.4 summary()
    • 4.5 Plotting regression results
  • 5 Quick tables

Last updated: 2021-04-27 18:25:43 EST

Set-up

Load the tidyverse:

library(tidyverse) 

The tidyverse (through ggplot2) provides the data set mpg to demonstrate methods. mpg records fuel economy in miles per gallon (cty for city, hwy for highway) for 234 cars:

data("mpg") # load the data mpg

# view the first 5 rows
mpg %>% 
  slice_head(n=5)
manufacturer  model  displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>  <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
audi          a4     1.8    1999   4      auto(l5)    f      18     29     p
audi          a4     1.8    1999   4      manual(m5)  f      21     29     p
audi          a4     2.0    2008   4      manual(m6)  f      20     31     p
audi          a4     2.0    2008   4      auto(av)    f      21     30     p
audi          a4     2.8    1999   6      auto(l5)    f      16     26     p

1 Summarizing Data

1.1 The pipe and summarise()

All summarizing operations work like this:

data %>% # the pipe: "and then"
  summarise() 

Inside summarise() you can use functions like mean(), median() and so on.

1.1.1 mean

mean()

mpg %>% # take the data, THEN
  summarise(mean(cty)) # summarize: mean
mean(cty)
<dbl>
16.85897

Note. cty is a numerical variable.

1.1.2 median

median()

mpg %>% # take the data, THEN
  summarise(median(cty)) # summarize: median
median(cty)
<dbl>
17

1.1.3 standard deviation

sd()

mpg %>% # take the data, THEN
  summarise(sd(cty)) # summarize: standard deviation
sd(cty)
<dbl>
4.255946

1.1.4 counting observations

n() (counting)

mpg %>% # take the data, THEN
  summarise(n()) # summarize: count number of observations
n()
<int>
234

Why doesn’t n() take an argument? Because it counts the rows of the data frame (or of each group), not the values of a particular column.
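A minimal sketch of what n() counts, assuming a fresh copy of mpg. The column name n_rows is made up for illustration, and group_by() is covered in section 1.2:

```r
library(tidyverse)

# n() counts rows, so on the whole data frame it matches nrow()
total <- mpg %>%
  summarise(n_rows = n())

total$n_rows == nrow(mpg)

# after group_by(), n() counts the rows inside each group instead
by_class <- mpg %>%
  group_by(class) %>%
  summarise(n_rows = n())

# the group counts add back up to the total
sum(by_class$n_rows) == nrow(mpg)
```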

1.1.5 multiple summaries

You can put as many functions as you want inside summarise():

mpg %>% # take the data, THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize: mean, median, standard deviation, count
mean(cty)  median(cty)  sd(cty)   n()
<dbl>      <dbl>        <dbl>     <int>
16.85897   17           4.255946  234

1.1.6 naming variables inside summarise()

You can name the summary columns inside summarise() by writing name = value:

mpg %>% # take the data, THEN
  summarise(avg_cty = mean(cty), median_cty = median(cty)) # summarize: mean, median
avg_cty   median_cty
<dbl>     <dbl>
16.85897  17

1.2 Grouping (group_by())

group_by():

mpg %>% # take the data, THEN
  group_by(class) %>% # group it by car class (e.g., compact, pickup) THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize
class       mean(cty)  median(cty)  sd(cty)    n()
<chr>       <dbl>      <dbl>        <dbl>      <int>
2seater     15.40000   15           0.5477226  5
compact     20.12766   20           3.3854999  47
midsize     18.75610   18           1.9465416  41
minivan     15.81818   16           1.8340219  11
pickup      13.00000   13           2.0463382  33
subcompact  20.37143   19           4.6023377  35
suv         13.50000   13           2.4208791  62

Note. class is a categorical variable.

2 Manipulating data

2.1 Selecting columns

select() certain columns

mpg %>% # all the columns
  slice_head(n=5) # view the first 5 rows
manufacturer  model  displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>  <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
audi          a4     1.8    1999   4      auto(l5)    f      18     29     p
audi          a4     1.8    1999   4      manual(m5)  f      21     29     p
audi          a4     2.0    2008   4      manual(m6)  f      20     31     p
audi          a4     2.0    2008   4      auto(av)    f      21     30     p
audi          a4     2.8    1999   6      auto(l5)    f      16     26     p

mpg %>% # all the columns
  select(manufacturer, model, hwy) %>% # select some of the columns
  slice_head(n=5) # view the first 5 rows
manufacturer  model  hwy
<chr>         <chr>  <int>
audi          a4     29
audi          a4     29
audi          a4     31
audi          a4     30
audi          a4     26

2.2 Slicing

All slice operators begin with slice_ and take an optional n= argument that specifies the number of rows you want to see.

Subsets with slice_head():

mpg %>% 
  slice_head(n=10) # view the first 10 rows
manufacturer  model       displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>       <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
audi          a4          1.8    1999   4      auto(l5)    f      18     29     p
audi          a4          1.8    1999   4      manual(m5)  f      21     29     p
audi          a4          2.0    2008   4      manual(m6)  f      20     31     p
audi          a4          2.0    2008   4      auto(av)    f      21     30     p
audi          a4          2.8    1999   6      auto(l5)    f      16     26     p
audi          a4          2.8    1999   6      manual(m5)  f      18     26     p
audi          a4          3.1    2008   6      auto(av)    f      18     27     p
audi          a4 quattro  1.8    1999   4      manual(m5)  4      18     26     p
audi          a4 quattro  1.8    1999   4      auto(l5)    4      16     25     p
audi          a4 quattro  2.0    2008   4      manual(m6)  4      20     28     p

Maximums with slice_max():

mpg %>% 
  slice_max(hwy, n=5) # the five highest highway miles per gallon. if there are ties they all get printed
manufacturer  model       displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>       <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
volkswagen    jetta       1.9    1999   4      manual(m5)  f      33     44     d
volkswagen    new beetle  1.9    1999   4      manual(m5)  f      35     44     d
volkswagen    new beetle  1.9    1999   4      auto(l4)    f      29     41     d
toyota        corolla     1.8    2008   4      manual(m5)  f      28     37     r
honda         civic       1.8    2008   4      auto(l5)    f      25     36     r
honda         civic       1.8    2008   4      auto(l5)    f      24     36     c
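
Asking for five rows returned six above because two cars tie for fifth place. A small sketch of how to get exactly n rows, using the with_ties argument of slice_max() (slice_min() accepts it too); the object names are illustrative:

```r
library(tidyverse)

# default behaviour: ties are kept, so more than n rows can come back
with_ties <- mpg %>%
  slice_max(hwy, n = 5)

# with_ties = FALSE cuts the result off at exactly n rows
no_ties <- mpg %>%
  slice_max(hwy, n = 5, with_ties = FALSE)

nrow(with_ties)
nrow(no_ties)
```

Which of the tied rows survives depends on row order, so keep the default when the ties themselves matter.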

Minimums with slice_min():

mpg %>% 
  slice_min(hwy, n=2) # the two lowest highway miles per gallon. if there are ties they all get printed
manufacturer  model                displ  year   cyl    trans       drv    cty    hwy
<chr>         <chr>                <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>
dodge         dakota pickup 4wd    4.7    2008   8      auto(l5)    4      9      12
dodge         durango 4wd          4.7    2008   8      auto(l5)    4      9      12
dodge         ram 1500 pickup 4wd  4.7    2008   8      auto(l5)    4      9      12
dodge         ram 1500 pickup 4wd  4.7    2008   8      manual(m6)  4      9      12
jeep          grand cherokee 4wd   4.7    2008   8      auto(l5)    4      9      12

Random samples with slice_sample():

mpg %>% 
  slice_sample(n=3) # pick three rows at random
manufacturer  model                displ  year   cyl    trans       drv    cty    hwy
<chr>         <chr>                <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>
pontiac       grand prix           3.8    1999   6      auto(l4)    f      16     26
dodge         ram 1500 pickup 4wd  4.7    2008   8      manual(m6)  4      9      12
audi          a4 quattro           1.8    1999   4      manual(m5)  4      18     26

2.2.1 Slicing and grouping

When you group_by() a data frame, the slice_ functions return rows from each group:

mpg %>% 
  group_by(class) %>% # group by class 
  slice_min(hwy, n=1) # lowest hwy in each class (ties are all kept)
manufacturer  model                displ  year   cyl    trans       drv    cty    hwy
<chr>         <chr>                <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>
chevrolet     corvette             5.7    1999   8      auto(l4)    r      15     23
volkswagen    jetta                2.8    1999   6      auto(l4)    f      16     23
audi          a6 quattro           4.2    2008   8      auto(s6)    4      16     23
dodge         caravan 2wd          3.3    2008   6      auto(l4)    f      11     17
dodge         dakota pickup 4wd    4.7    2008   8      auto(l5)    4      9      12
dodge         ram 1500 pickup 4wd  4.7    2008   8      auto(l5)    4      9      12
dodge         ram 1500 pickup 4wd  4.7    2008   8      manual(m6)  4      9      12
ford          mustang              5.4    2008   8      manual(m6)  r      14     20
dodge         durango 4wd          4.7    2008   8      auto(l5)    4      9      12
jeep          grand cherokee 4wd   4.7    2008   8      auto(l5)    4      9      12

If a data frame is already grouped, the slice_ functions always subset within each group. To slice the data as a whole you have to ungroup() first. This often comes up when you group a data frame to create a new variable; see the section on mutate() below.
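
A short sketch of the difference, assuming a fresh copy of mpg; the object names are illustrative:

```r
library(tidyverse)

grouped <- mpg %>%
  group_by(class)

# grouped: slice_head(n = 1) returns the first row of every class
rows_grouped <- grouped %>%
  slice_head(n = 1) %>%
  nrow()

# ungrouped: the same call returns one row in total
rows_ungrouped <- grouped %>%
  ungroup() %>%
  slice_head(n = 1) %>%
  nrow()

rows_grouped
rows_ungrouped
```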

2.3 Filtering

filter() with Boolean logic:

  • ==: “equal to”
  • !=: “not equal to”
  • >: “greater than”
  • >=: “greater than or equal to”
  • <: “less than”
  • <=: “less than or equal to”
  • &: “and”
  • |: “or”

A Boolean test is any expression that returns TRUE or FALSE:

2 == 3
[1] FALSE
2 != 3
[1] TRUE
2 < 3
[1] TRUE

Some examples:

mpg %>% 
  filter(year == 1999) %>% # keep cars made in 1999 (year is an integer, so no quotes are needed)
  slice_head(n=5) # view the first 5 rows
manufacturer  model       displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>       <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
audi          a4          1.8    1999   4      auto(l5)    f      18     29     p
audi          a4          1.8    1999   4      manual(m5)  f      21     29     p
audi          a4          2.8    1999   6      auto(l5)    f      16     26     p
audi          a4          2.8    1999   6      manual(m5)  f      18     26     p
audi          a4 quattro  1.8    1999   4      manual(m5)  4      18     26     p

mpg %>% 
  filter(hwy >= 25) %>% # filter all cars with at least 25 mpg highway
  slice_head(n=5) # view the first 5 rows
manufacturer  model  displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>  <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
audi          a4     1.8    1999   4      auto(l5)    f      18     29     p
audi          a4     1.8    1999   4      manual(m5)  f      21     29     p
audi          a4     2.0    2008   4      manual(m6)  f      20     31     p
audi          a4     2.0    2008   4      auto(av)    f      21     30     p
audi          a4     2.8    1999   6      auto(l5)    f      16     26     p

mpg %>% 
  filter(year == 1999 & model != "a4" & hwy >= 25) %>% # keep cars made in 1999 that aren't a4's and have at least 25 mpg highway
  slice_head(n=5) # view the first 5 rows
manufacturer  model       displ  year   cyl    trans       drv    cty    hwy    fl
<chr>         <chr>       <dbl>  <int>  <int>  <chr>       <chr>  <int>  <int>  <chr>
audi          a4 quattro  1.8    1999   4      manual(m5)  4      18     26     p
audi          a4 quattro  1.8    1999   4      auto(l5)    4      16     25     p
audi          a4 quattro  2.8    1999   6      auto(l5)    4      15     25     p
audi          a4 quattro  2.8    1999   6      manual(m5)  4      17     25     p
chevrolet     corvette    5.7    1999   8      manual(m6)  r      16     26     p

2.4 Mutating

Create and modify variables (data frame columns) with mutate()

2.4.1 Create a new variable

For example, create a column called mean_hwy that calculates average highway miles per gallon:

mpg %>% 
  mutate(mean_hwy = mean(hwy)) %>% 
  select(manufacturer, model, hwy, mean_hwy) %>% 
  slice_head(n=5)
manufacturer  model  hwy    mean_hwy
<chr>         <chr>  <int>  <dbl>
audi          a4     29     23.44017
audi          a4     29     23.44017
audi          a4     31     23.44017
audi          a4     30     23.44017
audi          a4     26     23.44017

2.4.2 Create a new variable and store it as a column

To save the new variable you have to re-assign the data frame:

mpg = mpg %>% 
  mutate(mean_hwy = mean(hwy)) 

Now mpg has a new column called mean_hwy.

2.4.3 Mutating while grouping

Saving a new variable means re-assigning the data frame. If you used group_by() to create the column, the re-assigned data frame stays grouped, which changes what the slice_ functions return.

The solution is to ungroup() after mutate().

For instance, add a column called “mean_hwy_class” that calculates average highway miles per gallon by class:

# first create the variable
mpg = mpg %>% # take the data, THEN
  group_by(class) %>% # group by class, THEN
  mutate(mean_hwy_class = mean(hwy)) %>% # create the new variable, THEN
  ungroup() # ungroup the data

Why ungroup()? Because otherwise the slice_ functions and other verbs keep working within each class, which is usually not the output you expect. So better safe than sorry: when you group to create a variable, make sure you ungroup at the end.
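
If you are unsure whether a data frame is still grouped, dplyr's group_vars() lists its grouping columns. A minimal sketch, assuming a fresh copy of mpg and illustrative object names:

```r
library(tidyverse)

# grouping survives mutate()
still_grouped <- mpg %>%
  group_by(class) %>%
  mutate(mean_hwy_class = mean(hwy))

# ungroup() removes it
ungrouped <- still_grouped %>%
  ungroup()

group_vars(still_grouped)  # "class"
group_vars(ungrouped)      # character(0): no grouping left
```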

Now we can slice in general:

mpg %>% 
  select(manufacturer, model, hwy, mean_hwy_class) %>% 
  slice_head(n=2)
manufacturer  model  hwy    mean_hwy_class
<chr>         <chr>  <int>  <dbl>
audi          a4     29     28.29787
audi          a4     29     28.29787

2.4.4 Mutating and frequency tables

Use n() to calculate frequency and then use mutate() to calculate relative frequency:

mpg %>% 
  group_by(class) %>% 
  summarise(frequency = n()) %>% 
  mutate(relative_frequency = frequency / sum(frequency))
class       frequency  relative_frequency
<chr>       <int>      <dbl>
2seater     5          0.02136752
compact     47         0.20085470
midsize     41         0.17521368
minivan     11         0.04700855
pickup      33         0.14102564
subcompact  35         0.14957265
suv         62         0.26495726
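
The same table can be built with dplyr's count(), which combines group_by() and summarise(n()) in one step. This is an alternative sketch, not the approach used above; freq_table is an illustrative name:

```r
library(tidyverse)

freq_table <- mpg %>%
  count(class, name = "frequency") %>%  # one row per class with its row count
  mutate(relative_frequency = frequency / sum(frequency))

freq_table

# relative frequencies should add up to 1
sum(freq_table$relative_frequency)
```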

3 Visualizing data

3.1 Plotting distributions

3.1.1 Histograms

Histograms with ggplot and geom_histogram():

mpg %>% 
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() # add geom layer: distribution
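
Called without arguments, geom_histogram() falls back to 30 bins and prints a message suggesting you pick a better value. A minimal sketch that sets the bin width explicitly; the width of 1 mile per gallon is an illustrative choice:

```r
library(tidyverse)

p <- mpg %>%
  ggplot(aes(x = cty)) +         # blank canvas: data and x-axis variable
  geom_histogram(binwidth = 1)   # each bar covers 1 mile per gallon

p
```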

Faceted histograms with facet_wrap():

mpg %>% 
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() + # add geom layer: distribution
  facet_wrap(~class) # add faceting (note the "~" before the categorical variable)

3.1.2 Cumulative distributions

Percentiles and cumulative distributions with ggplot and stat_ecdf():

mpg %>% 
  ggplot(aes(x=cty)) +  # blank canvas: choose the data and the x-axis variable
  stat_ecdf() # add geom layer: cumulative distribution

You can add a vertical line with geom_vline() to make it easier to read off a percentile. For instance, view the percent of cars that get 15 city miles per gallon or less:

mpg %>% 
  ggplot(aes(x=cty)) +  # blank canvas: choose the data and the x-axis variable
  stat_ecdf() + # add geom layer: cumulative distribution
  geom_vline(xintercept = 15, color="red") # vertical line at cty = 15
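
The share marked by the red line can also be computed directly. A small sketch, with column names made up for illustration; note that the ECDF curve at x = 15 includes cars with exactly 15 miles per gallon:

```r
library(tidyverse)

shares <- mpg %>%
  summarise(
    share_below_15   = mean(cty < 15),   # strictly less than 15
    share_at_most_15 = mean(cty <= 15)   # the height of the ECDF curve at x = 15
  )

shares

# base R's ecdf() gives the same height as the second column
ecdf(mpg$cty)(15)
```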

3.1.3 Box plots

geom_boxplot()

mpg %>% 
  ggplot(aes(x = hwy, y = class)) + # x is continuous, y is categorical
  geom_boxplot() 

or:

mpg %>% 
  ggplot(aes(x = class, y = hwy)) + # x is categorical, y is continuous
  geom_boxplot() 

3.2 Scatterplot

geom_point():

mpg %>% 
  ggplot(aes(x = cty, y = hwy)) + 
  geom_point() 

3.2.1 Scatterplot with regression line

geom_smooth(method = "lm"):

mpg %>% 
  ggplot(aes(x = cty, y = hwy)) + 
  geom_point() + # scatter-plot
  geom_smooth(method = "lm") # regression line

4 Correlation and regression

4.1 Covariance

cov():

mpg %>% 
  summarise(cov(x=cty, y=hwy))
cov(x = cty, y = hwy)
<dbl>
24.22543

4.2 Correlation

cor():

mpg %>% 
  summarise(cor(x=cty, y=hwy))
cor(x = cty, y = hwy)
<dbl>
0.9559159
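
Correlation is covariance rescaled by the two standard deviations (the formula for rxy in the quick tables). A quick check of that relationship, with column names chosen for illustration:

```r
library(tidyverse)

check <- mpg %>%
  summarise(
    covariance  = cov(cty, hwy),
    correlation = cor(cty, hwy),
    by_hand     = cov(cty, hwy) / (sd(cty) * sd(hwy))  # s_xy / (s_x * s_y)
  )

check

# the built-in and by-hand correlations agree
all.equal(check$correlation, check$by_hand)
```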

4.3 Regression

lm()

Simple linear regression (one x variable):

$$y = f(x) = \beta_0 + \beta_1 x + \epsilon$$

In lm() the y and x variables are separated by ~:

mpg %>% 
  lm(formula = cty ~ hwy, data = .) # the "." means "use the data from the pipe %>%"

Call:
lm(formula = cty ~ hwy, data = .)

Coefficients:
(Intercept)          hwy  
     0.8442       0.6832  
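
To reuse the estimates, store the model in an object. A minimal sketch: fit is an illustrative name and the car with 30 highway miles per gallon is hypothetical; coef() and predict() are base R functions:

```r
library(tidyverse)

fit <- lm(cty ~ hwy, data = mpg)

coef(fit)  # named vector with (Intercept) and hwy

# predicted city mpg for a hypothetical car with 30 highway mpg
by_predict <- predict(fit, newdata = tibble(hwy = 30))

# the same prediction from the regression equation: intercept + slope * 30
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["hwy"]] * 30

by_predict
by_hand
```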

Multiple linear regression (multiple x variables):

$$y = f(X) = f(x_1, x_2, \dots) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon$$

In lm() the x variables are separated by +:

mpg %>% 
  lm(formula = cty ~ hwy + cyl + displ, data = .)

Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)

Coefficients:
(Intercept)          hwy          cyl        displ  
    6.08786      0.58092     -0.44827     -0.05935  

4.4 summary()

View hypothesis tests and regression diagnostics with summary():

mpg %>% 
  lm(formula = cty ~ hwy + cyl + displ, data = .) %>% 
  summary()

Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0347 -0.6012 -0.0229  0.7397  5.2573 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.08786    0.83226   7.315 4.25e-12 ***
hwy          0.58092    0.02010  28.900  < 2e-16 ***
cyl         -0.44827    0.13010  -3.446 0.000677 ***
displ       -0.05935    0.16351  -0.363 0.716971    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.148 on 230 degrees of freedom
Multiple R-squared:  0.9281,    Adjusted R-squared:  0.9272 
F-statistic: 989.9 on 3 and 230 DF,  p-value: < 2.2e-16
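
The printed summary is also a list, so individual pieces can be pulled out with $. A short sketch, with model_summary as an illustrative name:

```r
library(tidyverse)

model_summary <- mpg %>%
  lm(formula = cty ~ hwy + cyl + displ, data = .) %>%
  summary()

model_summary$coefficients   # matrix: estimates, standard errors, t values, p-values
model_summary$r.squared      # multiple R-squared
model_summary$adj.r.squared  # adjusted R-squared
```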

4.5 Plotting regression results

Use the package sjPlot. (If it is not installed yet, run install.packages("sjPlot") once.)

library(sjPlot)

The function you want is plot_model(). The baseline plot shows the coefficient estimates with their confidence intervals:

mpg %>% 
  lm(formula = cty ~ hwy + class, data = .) %>% 
  plot_model(model = .)

To plot predicted values, set type = "pred" and choose which predictor to plot by setting terms:

mpg %>% 
  lm(formula = cty ~ hwy + class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy"))

The terms argument accepts multiple terms:

mpg %>% 
  lm(formula = cty ~ hwy + class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))

This is useful when you have an interaction effect (e.g., hwy*class):

mpg %>% 
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))

sjPlot uses ggplot, so you can add ggplot layers to it, like titles and a different theme:

mpg %>% 
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy", "class")) + 
  labs(x = "Highway miles per gallon", y = "Predicted city miles per gallon",
       title = "A very interesting linear model", subtitle = "So interesting") + 
  theme_minimal()

The package has tons of great features. Check out the website!

5 Quick tables

| Function | Description |
| --- | --- |
| %>% | Pipe operator (“and then”) |
| summarise() | Summarize a vector or multiple vectors from a data frame |
| mean() | Calculate the mean of a vector |
| median() | Calculate the median of a vector |
| sd() | Calculate the standard deviation of a vector |
| n() | Count the number of observations. Takes no argument. |
| group_by() | Group observations by a categorical variable |
| select() | Select certain columns |
| slice_*() | Slice rows from the data |
| slice_head(n=5) | View the head (first five rows) of the data. n = can be any number. |
| slice_max(var, n=5) | View the rows with the 5 highest values of column “var” |
| slice_min(var, n=5) | View the rows with the 5 lowest values of column “var” |
| slice_sample(n=5) | Draw 5 rows at random |
| filter() | Filter observations |
| mutate() | Create or modify a column |
| cov() | Calculate the covariance between two variables |
| cor() | Calculate the correlation between two variables |
| lm() | Estimate a linear regression |
| ifelse() | Create a vector based on a “True/False” test |
| ggplot() | Create a base plot |
| geom_histogram() | Histogram |
| stat_ecdf() | Cumulative distribution plot |
| geom_boxplot() | Boxplot |
| geom_point() | Scatter plot |
| geom_smooth(method = "lm") | Regression line |

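| Term | Meaning | Pronunciation | Formula/Example |
| --- | --- | --- | --- |
| $x_i$ | data point i | “x i” | |
| $n$ | sample size | | |
| $N$ | population size | | |
| $\bar{x}$ | the sample mean | “x bar” | $\frac{\sum_{i=1}^{n} x_i}{n}$ |
| $\mu$ | the population mean | “mu” | |
| $s^2$ | the sample variance | | $\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ |
| $\sigma^2$ | the population variance | | |
| $s$ | the sample standard deviation | | $\sqrt{s^2}$ |
| $\sigma$ | the population standard deviation | “sigma” | |
| $z_i$ | z-score for observation i | | $z_i = \frac{x_i - \bar{x}}{s}$ |
| $\beta$ | regression coefficient | “beta” | $y = \beta_0 + \beta_1 x + \epsilon$ |
| $\hat{\beta}$ | estimated value of regression coefficient | “beta hat” | |
| $\epsilon$ | regression error | “epsilon” | |
| $\sum$ | summation operator | “sum” | $\sum_{i=1}^{2} x_i = x_1 + x_2$ |
| $s_{xy}$ | covariance between two variables x and y | | $\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \in (-\infty, \infty)$ |
| $r_{xy}$ | correlation between two variables x and y | | $\frac{s_{xy}}{s_x s_y} \in [-1, 1]$ |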
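
The z-score formula and ifelse() from the tables can be combined inside mutate(). A minimal sketch, assuming a fresh copy of mpg; the column names z_hwy and hwy_level and the labels are made up for illustration:

```r
library(tidyverse)

z_scores <- mpg %>%
  mutate(
    z_hwy = (hwy - mean(hwy)) / sd(hwy),                          # z-score: (x - x bar) / s
    hwy_level = ifelse(z_hwy > 0, "above mean", "at or below mean") # label from a TRUE/FALSE test
  ) %>%
  select(manufacturer, model, hwy, z_hwy, hwy_level)

z_scores %>%
  slice_head(n = 5)
```

By construction the z-scores have mean 0 and standard deviation 1, which is a handy check that the formula was typed correctly.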
