Last updated: 2021-04-27 18:25:43 EST
Load the tidyverse:
library(tidyverse)

The tidyverse provides the mpg data set to demonstrate methods. mpg records the mileage (cty for city, hwy for highway) of a range of different cars:
data("mpg") # load the data mpg
# view the first 5 rows
mpg %>% 
  slice_head(n=5)

summarise()
All summarizing operations work like this:
data %>% # the pipe (%>%): "and then"
  summarise()

Inside summarise() you can use functions like mean(), median() and so on.
mean()
mpg %>% # take the data, THEN
  summarise(mean(cty)) # summarize: mean

Note. cty is a numerical variable.

median()
mpg %>% # take the data, THEN
  summarise(median(cty)) # summarize: median

sd()
mpg %>% # take the data, THEN
  summarise(sd(cty)) # summarize: standard deviation

n()
mpg %>% # take the data, THEN
  summarise(n()) # summarize: count number of observations

Why doesn’t n() take an argument? Because it counts the rows of the data frame (or of each group), not the values of a particular column.
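A small check of that claim, assuming only that dplyr and ggplot2 (both part of the tidyverse) are installed. The names total_rows and by_class are made up for this sketch:

```r
library(dplyr)
mpg <- ggplot2::mpg # the mpg data ships with ggplot2

# n() takes no column: it counts the rows of the current data
total_rows <- mpg %>%
  summarise(count = n())

# after group_by(), n() counts the rows of each class separately
by_class <- mpg %>%
  group_by(class) %>%
  summarise(count = n())

total_rows$count == nrow(mpg)            # same as counting rows directly
sum(by_class$count) == total_rows$count  # group counts add up to the total
```

Both comparisons should come back TRUE, since n() and nrow() count the same rows.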
You can put as many functions as you want inside summarise():
mpg %>% # take the data, THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize: mean, median, standard deviation, count

You can name the output columns inside summarise() with name = value:
mpg %>% # take the data, THEN
  summarise(avg_cty = mean(cty), median_cty = median(cty)) # summarize: mean, median

group_by()
mpg %>% # take the data, THEN
  group_by(class) %>% # group it by car class (e.g., compact, pickup) THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize

Note. class is a categorical variable.
select() certain columns
mpg %>% # all the columns
  slice_head(n=5) # view the first 5 rows

mpg %>% # all the columns
  select(manufacturer, model, hwy) %>% # select some of the columns
  slice_head(n=5) # view the first 5 rows

All slice operators begin with slice_ and take an optional n= argument that specifies the number of rows you want to see.
Subsets with slice_head():
mpg %>% 
  slice_head(n=10) # view the first 10 rows

Maximums with slice_max():
mpg %>% 
  slice_max(hwy, n=5) # the five highest highway miles per gallon. if there are ties they all get printed

Minimums with slice_min():
mpg %>% 
  slice_min(hwy, n=2) # the two lowest highway miles per gallon. if there are ties they all get printed

Random samples with slice_sample():
mpg %>% 
  slice_sample(n=3) # pick three rows at random

When you group_by() a data frame, the slice_ functions return subsets of each group:
mpg %>% 
  group_by(class) %>% # group by class 
  slice_min(hwy, n=1) # lowest hwy in each class (ties included)

If a data frame is already grouped, the slice_ functions always subset by group under the hood. If you want subsets of the whole data frame instead, you have to ungroup() first. See below in the section on mutate(). This often comes up when you group a data frame to create a new variable.
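To see the difference between grouped and ungrouped slicing, here is a minimal sketch that compares row counts. It assumes dplyr and ggplot2 are installed; the name grouped is made up for the illustration:

```r
library(dplyr)
mpg <- ggplot2::mpg # the mpg data ships with ggplot2

grouped <- mpg %>%
  group_by(class)

# on grouped data, slice_head() returns the first row of every class
nrow(slice_head(grouped, n = 1)) == n_distinct(mpg$class)

# after ungroup(), slice_head() returns the first row of the whole data frame
nrow(slice_head(ungroup(grouped), n = 1)) == 1
```

Both comparisons should return TRUE: one row per class while grouped, a single row once ungrouped.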
filter() with Boolean logic:
==: “equal to”
!=: “not equal to”
>: “greater than”
>=: “greater than or equal to”
<: “less than”
<=: “less than or equal to”
&: “and”
|: “or”

Boolean logic is any test that returns true or false:
2 == 3
[1] FALSE
2 != 3
[1] TRUE
2 < 3
[1] TRUE
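The & and | operators combine tests like these. A short base R sketch (the vector hwy_example holds made-up values, not rows from mpg):

```r
# & is TRUE only when both tests are TRUE
(2 < 3) & (2 == 3)  # FALSE

# | is TRUE when at least one test is TRUE
(2 < 3) | (2 == 3)  # TRUE

# tests are vectorized: one TRUE/FALSE per element,
# which is what filter() uses to decide which rows to keep
hwy_example <- c(20, 25, 31) # hypothetical values
hwy_example >= 25  # FALSE TRUE TRUE
```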
Some examples:
mpg %>% 
  filter(year == 1999) %>% # filter all cars made in 1999 (year is stored as a number)
  slice_head(n=5) # view the first 5 rows

mpg %>% 
  filter(hwy >= 25) %>% # filter all cars with at least 25 mpg highway
  slice_head(n=5) # view the first 5 rows

mpg %>% 
  filter(year == 1999 & model != "a4" & hwy >= 25) %>% # filter cars made in 1999, that aren't a4's, and have at least 25 mpg highway
  slice_head(n=5) # view the first 5 rows

Create and modify variables (data frame columns) with mutate()
For example, create a column called mean_hwy that calculates average highway miles per gallon:
mpg %>% 
  mutate(mean_hwy = mean(hwy)) %>% 
  select(manufacturer, model, hwy, mean_hwy) %>% 
  slice_head(n=5)

To save the new variable you have to re-assign the data frame:
mpg = mpg %>% 
  mutate(mean_hwy = mean(hwy))

Now mpg has a new column called mean_hwy.
When you group with group_by() to create a new column, the re-assigned data frame stays grouped, which changes what the slice_ functions return.
The solution is to ungroup() after mutate().
For instance, add a column called “mean_hwy_class” that calculates average highway miles per gallon by class:
# first create the variable
mpg = mpg %>% # take the data, THEN
  group_by(class) %>% # group by class, THEN
  mutate(mean_hwy_class = mean(hwy)) %>% # create the new variable, THEN
  ungroup() # ungroup the data

Why ungroup()? Because otherwise slice_() and other functions keep working group by group and won’t return the output you expect. So better safe than sorry: when you group to create a variable, make sure you ungroup at the end.
Now we can slice the whole data frame:
mpg %>% 
  select(manufacturer, model, hwy, mean_hwy_class) %>% 
  slice_head(n=2)

Use n() to calculate frequency and then use mutate() to calculate relative frequency:
mpg %>% 
  group_by(class) %>% 
  summarise(frequency = n()) %>% 
  mutate(relative_frequency = frequency / sum(frequency))

Histograms with ggplot and geom_histogram():
mpg %>% 
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() # add geom layer: distribution

Faceted histograms with facet_wrap():
mpg %>% 
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() + # add geom layer: distribution
  facet_wrap(~class) # add faceting (note the "~" before the categorical variable)

Percentiles and cumulative distributions with ggplot and stat_ecdf():
mpg %>% 
  ggplot(aes(x=cty)) +  # blank canvas: choose the data and the x-axis variable
  stat_ecdf() # add stat layer: cumulative distribution

You can add a vertical line with geom_vline() to make it easier to read off a percentile. For instance, view the share of cars that get at most 15 city miles per gallon:
mpg %>% 
  ggplot(aes(x=cty)) +  # blank canvas: choose the data and the x-axis variable
  stat_ecdf() + # add stat layer: cumulative distribution
  geom_vline(xintercept = 15, color="red") # vertical line at 15 mpg

geom_boxplot():
mpg %>% 
  ggplot(aes(x = hwy, y = class)) + # x is continuous, y is categorical
  geom_boxplot()

or:
mpg %>% 
  ggplot(aes(x = class, y = hwy)) + # x is categorical, y is continuous
  geom_boxplot()

geom_point():
mpg %>% 
  ggplot(aes(x = cty, y = hwy)) + 
  geom_point()

mpg %>% 
  ggplot(aes(x = cty, y = hwy)) + 
  geom_point() + # scatter-plot
  geom_smooth(method = "lm") # regression line

cov():
mpg %>% 
  summarise(cov(x=cty, y=hwy))

cor():
mpg %>% 
  summarise(cor(x=cty, y=hwy))

lm()
Simple linear regression (one \(x\) variable):
\[
\begin{aligned}
y &= f(x) \\
  &= \beta_0 + \beta_1(x) + \epsilon
\end{aligned}
\]

In lm() the \(y\) and \(x\) variables are separated by ~, with \(y\) on the left:
mpg %>% 
  lm(formula = cty ~ hwy, data = .) # the "." means "use the data from the pipe %>%"
Call:
lm(formula = cty ~ hwy, data = .)
Coefficients:
(Intercept)          hwy  
     0.8442       0.6832  
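The printed coefficients are the estimates of \(\beta_0\) and \(\beta_1\). As a rough sketch of how they turn into a prediction, the code below pulls them out with coef() and compares a hand calculation with predict(). The object names (fit, b0, b1) and the value hwy = 30 are chosen only for illustration:

```r
mpg <- ggplot2::mpg # the mpg data ships with ggplot2

fit <- lm(cty ~ hwy, data = mpg)

coef(fit) # named vector with "(Intercept)" and "hwy"
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["hwy"]]

# predicted city mileage for a hypothetical car with hwy = 30
by_hand <- b0 + b1 * 30
by_predict <- predict(fit, newdata = data.frame(hwy = 30))

all.equal(by_hand, unname(by_predict)) # the two routes should agree
```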
Multiple linear regression (multiple \(x\) variables):
\[
\begin{aligned}
y &= f(\mathbf{X}) \\
  &= f(x_1, x_2, \dots) \\
  &= \beta_0 + \beta_1(x_1) + \beta_2(x_2) + \dots + \epsilon
\end{aligned}
\]

In lm() the \(x\) variables are separated by +:
mpg %>% 
  lm(formula = cty ~ hwy + cyl + displ, data = .)
Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)
Coefficients:
(Intercept)          hwy          cyl        displ  
    6.08786      0.58092     -0.44827     -0.05935  
summary()
View hypothesis tests and regression diagnostics with summary():
mpg %>% 
  lm(formula = cty ~ hwy + cyl + displ, data = .) %>% 
  summary()
Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)
Residuals:
    Min      1Q  Median      3Q     Max 
-3.0347 -0.6012 -0.0229  0.7397  5.2573 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.08786    0.83226   7.315 4.25e-12 ***
hwy          0.58092    0.02010  28.900  < 2e-16 ***
cyl         -0.44827    0.13010  -3.446 0.000677 ***
displ       -0.05935    0.16351  -0.363 0.716971    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.148 on 230 degrees of freedom
Multiple R-squared:  0.9281,    Adjusted R-squared:  0.9272 
F-statistic: 989.9 on 3 and 230 DF,  p-value: < 2.2e-16
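If you need one of these numbers in later code rather than on screen, the summary object can be indexed directly. A minimal sketch, with fit_summary as a made-up name:

```r
mpg <- ggplot2::mpg # the mpg data ships with ggplot2

fit_summary <- summary(lm(cty ~ hwy + cyl + displ, data = mpg))

fit_summary$r.squared     # the "Multiple R-squared" from the printout
fit_summary$adj.r.squared # the "Adjusted R-squared"

# coef() on a summary returns the full coefficient table as a matrix
coef(fit_summary)["hwy", "Estimate"]
coef(fit_summary)["hwy", "Pr(>|t|)"]
```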
Use the package sjPlot. (If you forgot how to install a package: install.packages("sjPlot").)
library(sjPlot)

The function you want is plot_model(). The baseline plot shows the coefficient estimates with their confidence intervals:
mpg %>% 
  lm(formula = cty ~ hwy + class, data = .) %>% 
  plot_model(model = .)

To plot predicted values you can set type = "pred" and then choose which predictor to plot by setting terms:
mpg %>% 
  lm(formula = cty ~ hwy + class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy"))

The terms argument accepts multiple terms:
mpg %>% 
  lm(formula = cty ~ hwy + class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))

This is useful when you have an interaction effect (e.g., hwy*class):
mpg %>% 
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))

sjPlot uses ggplot so you can add ggplot layers to it, like a different theme and titles:
mpg %>% 
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>% 
  plot_model(model = ., type = "pred", terms = c("hwy", "class")) + 
  labs(x = "Highway miles per gallon", y = "Predicted city miles per gallon",
       title = "A very interesting linear model", subtitle = "So interesting") + 
  theme_minimal()

The package has tons of great features. Check out the website!
| Function | Description |
|---|---|
| %>% | Pipe operator (“and then”) |
| summarise() | Summarize a vector or multiple vectors from a data frame |
| mean() | Calculate the mean of a vector |
| median() | Calculate the median of a vector |
| sd() | Calculate the standard deviation of a vector |
| n() | Count the number of observations. Takes no argument. |
| group_by() | Group observations by a categorical variable |
| select() | Select certain columns |
| slice_() | Slice rows from the data |
| slice_head(n=5) | View the head (first five rows) of the data. n = can be any number. |
| slice_max(var, n=5) | View the rows with the 5 highest values of column “var” |
| slice_min(var, n=5) | View the rows with the 5 lowest values of column “var” |
| slice_sample(n=5) | Draw 5 rows at random |
| filter() | Filter observations |
| mutate() | Create or modify a column |
| cov() | Calculate the covariance between two variables |
| cor() | Calculate the correlation between two variables |
| lm() | Estimate a linear regression |
| ifelse() | Create a vector based on a “True/False” test |
| ggplot() | Create a base plot |
| geom_histogram() | Histogram |
| stat_ecdf() | Cumulative distribution plot |
| geom_boxplot() | Boxplot |
| geom_point() | Scatter plot |
| geom_smooth(method = "lm") | Regression line |
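The table lists ifelse(), which has no example of its own in this guide. One way it might be used inside mutate() is sketched below. The column name efficient and the cutoff of 25 highway miles per gallon are assumptions made for the illustration:

```r
library(dplyr)
mpg <- ggplot2::mpg # the mpg data ships with ggplot2

# ifelse(test, value if TRUE, value if FALSE) works row by row
labelled <- mpg %>%
  mutate(efficient = ifelse(hwy >= 25, "yes", "no")) # hypothetical cutoff

labelled %>%
  select(manufacturer, model, hwy, efficient) %>%
  slice_head(n = 5) # view the first 5 rows
```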
| Term | Meaning | Pronunciation | Formula/Example |
|---|---|---|---|
| \(x_i\) | data point \(i\) | “x i” | |
| \(n\) | sample size | | |
| \(N\) | population size | | |
| \(\bar{x}\) | the sample mean | “x bar” | \(\frac{\sum_{i=1}^n x_i}{n}\) |
| \(\mu\) | the population mean | “mu” | |
| \(s^2\) | the sample variance | | \(\frac{\sum (x_i - \bar{x})^2}{n-1}\) |
| \(\sigma^2\) | the population variance | | |
| \(s\) | the sample standard deviation | | \(\sqrt{s^2}\) |
| \(\sigma\) | the population standard deviation | “sigma” | |
| \(z_i\) | z-score for observation \(i\) | | \(z_i = \frac{x_i - \bar{x}}{s}\) |
| \(\beta\) | regression coefficient | “beta” | \(y = \beta_0 + \beta_1 x + \epsilon\) |
| \(\hat{\beta}\) | estimated value of regression coefficient | “beta hat” | |
| \(\epsilon\) | regression error | “epsilon” | |
| \(\sum\) | summation operator | “sum” | \(\sum_{i=1}^2 x_i = x_1 + x_2\) |
| \(s_{xy}\) | covariance between two variables \(x\) and \(y\) | | \(\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} \in (-\infty,\infty)\) |
| \(r_{xy}\) | correlation between two variables \(x\) and \(y\) | | \(\frac{s_{xy}}{s_x s_y} \in [-1,1]\) |
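Two of the formulas in this table can be checked against R's built-in functions. The sketch below, with made-up object names (s_xy, r_by_hand, z), rebuilds the correlation from the covariance and computes z-scores for cty:

```r
mpg <- ggplot2::mpg # the mpg data ships with ggplot2

# r_xy = s_xy / (s_x * s_y)
s_xy <- cov(mpg$cty, mpg$hwy)
r_by_hand <- s_xy / (sd(mpg$cty) * sd(mpg$hwy))
all.equal(r_by_hand, cor(mpg$cty, mpg$hwy)) # should match cor()

# z_i = (x_i - x bar) / s
z <- (mpg$cty - mean(mpg$cty)) / sd(mpg$cty)
mean(z) # approximately 0
sd(z)   # approximately 1
```

By construction, z-scores have mean 0 and standard deviation 1, up to floating-point rounding.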