Last updated: 2021-04-27 18:25:43 EST
Load the tidyverse:
library(tidyverse)
The tidyverse provides the data set mpg to demonstrate methods. mpg has the mileage (cty for city, hwy for highway) for a bunch of different cars:
data("mpg") # load the data mpg
# view the first 5 rows
mpg %>%
  slice_head(n=5)
summarise()
All summarizing operations work like this:
data %>% # the pipe: "and then"
  summarise()
Read it as: take the data, and then (%>%) summarise().
Inside summarise() you can use functions like mean(), median() and so on.
mean()
mpg %>% # take the data, THEN
  summarise(mean(cty)) # summarize: mean
Note. cty is a numerical variable.
median()
mpg %>% # take the data, THEN
  summarise(median(cty)) # summarize: median
sd()
mpg %>% # take the data, THEN
  summarise(sd(cty)) # summarize: standard deviation
mpg %>% # take the data, THEN
  summarise(n()) # summarize: count number of observations
Why doesn’t n() take an argument? Because it counts the number of rows, not columns.
You can put as many functions as you want inside summarise():
mpg %>% # take the data, THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize: mean, median, standard deviation, count
summarise()
You can name the output columns inside summarise() with name = expression:
mpg %>% # take the data, THEN
  summarise(avg_cty = mean(cty), median_cty = median(cty)) # summarize: mean, median
group_by()
mpg %>% # take the data, THEN
  group_by(class) %>% # group it by car class (e.g., compact, pickup) THEN
  summarise(mean(cty), median(cty), sd(cty), n()) # summarize
Note. class is a categorical variable.
select() certain columns
mpg %>% # all the columns
  slice_head(n=5) # view the first 5 rows
mpg %>% # all the columns
  select(manufacturer, model, hwy) %>% # select some of the columns
  slice_head(n=5) # view the first 5 rows
All slice operators begin with slice_ and take an optional n= argument that specifies the number of rows you want to see.
Subsets with slice_head():
mpg %>%
  slice_head(n=10) # view the first 10 rows
Maximums with slice_max():
mpg %>%
  slice_max(hwy, n=5) # the five highest highway miles per gallon. if there are ties they all get printed
Minimums with slice_min():
mpg %>%
  slice_min(hwy, n=2) # the two lowest highway miles per gallon. if there are ties they all get printed
Random samples with slice_sample():
mpg %>%
  slice_sample(n=3) # pick three rows at random
When you group_by() a data frame, the slice_ functions return subsets of each group:
mpg %>%
  group_by(class) %>% # group by class
  slice_min(hwy, n=1) # lowest hwy in each class (ties all get printed)
If a data frame is already grouped, the slice_ functions always subset within each group. To subset the data as a whole you have to ungroup() first. This often comes up when you group a data frame to create a new variable; see the section on mutate() below.
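To see the difference grouping makes, the sketch below runs the same slice_head(n = 1) call on a grouped and an ungrouped version of mpg. The object names by_class and overall are made up for illustration.

```r
library(tidyverse)

# Grouped: slice_head(n = 1) returns the first row of every class
by_class <- mpg %>%
  group_by(class) %>%
  slice_head(n = 1)

# Ungrouped: the same call returns a single row for the whole data frame
overall <- mpg %>%
  group_by(class) %>%
  ungroup() %>%
  slice_head(n = 1)

nrow(by_class) # one row per class
nrow(overall)  # one row in total
```

If a slice returns more rows than you asked for, a leftover grouping is a likely cause.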
filter() with Boolean logic:
==: “equal to”
!=: “not equal to”
>: “greater than”
>=: “greater than or equal to”
<: “less than”
<=: “less than or equal to”
&: “and”
|: “or”
Boolean logic is any test that returns true or false:
2 == 3
[1] FALSE
2 != 3
[1] TRUE
2 < 3
[1] TRUE
Some examples:
mpg %>%
  filter(year == 1999) %>% # filter all cars made in 1999 (year is a number, so no quotes)
  slice_head(n=5) # view the first 5 rows
mpg %>%
  filter(hwy >= 25) %>% # filter all cars with at least 25 mpg highway
  slice_head(n=5) # view the first 5 rows
mpg %>%
  filter(year == 1999 & model != "a4" & hwy >= 25) %>% # filter cars made in 1999, that aren't a4's, and have at least 25 mpg highway
  slice_head(n=5) # view the first 5 rows
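The examples above combine tests with &. As a small sketch of the “or” operator from the list, the following keeps a car if it is a compact or if it gets at least 35 highway miles per gallon. The cutoff of 35 and the name compact_or_efficient are chosen only for illustration.

```r
library(tidyverse)

# keep rows where EITHER test is true
compact_or_efficient <- mpg %>%
  filter(class == "compact" | hwy >= 35)

compact_or_efficient %>%
  slice_head(n=5) # view the first 5 rows
```

With | a row only needs to pass one of the tests, so the result is never smaller than what either test would return on its own.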
Create and modify variables (data frame columns) with mutate()
For example, create a column called mean_hwy that calculates average highway miles per gallon:
mpg %>%
  mutate(mean_hwy = mean(hwy)) %>%
  select(manufacturer, model, hwy, mean_hwy) %>%
  slice_head(n=5)
To save the new variable you have to re-assign the data frame:
mpg = mpg %>%
  mutate(mean_hwy = mean(hwy))
Now mpg has a new column called mean_hwy.
When you create a new variable and save it to the data you re-assign the data.
When grouping with group_by() to create a new column, the re-assigned data will be grouped, which can affect the slice_() functions.
The solution is to ungroup() after mutate().
For instance, add a column called “mean_hwy_class” that calculates average highway miles per gallon by class:
# first create the variable
mpg = mpg %>% # take the data, THEN
  group_by(class) %>% # group by class, THEN
  mutate(mean_hwy_class = mean(hwy)) %>% # create the new variable, THEN
  ungroup() # ungroup the data
Why ungroup()? Because otherwise slice_() and other functions keep working within each class, which is usually not the output you expect. So better safe than sorry: when you group to create a variable, make sure you ungroup at the end.
Now we can slice in general:
mpg %>%
  select(manufacturer, model, hwy, mean_hwy_class) %>%
  slice_head(n=2)
Use n() to calculate frequency and then use mutate() to calculate relative frequency:
mpg %>%
  group_by(class) %>%
  summarise(frequency = n()) %>%
  mutate(relative_frequency = frequency / sum(frequency))
Histograms with ggplot and geom_histogram():
mpg %>%
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() # add geom layer: distribution
Faceted histograms with facet_wrap():
mpg %>%
  ggplot(aes(x = cty)) + # blank canvas: choose the data and the x-axis variable
  geom_histogram() + # add geom layer: distribution
  facet_wrap(~class) # add faceting (note the "~" before the categorical variable)
Percentiles and cumulative distributions with ggplot and stat_ecdf():
mpg %>%
  ggplot(aes(x=cty)) + # blank canvas: choose the data and the x-axis variable
  stat_ecdf() # add stat layer: cumulative distribution
You can add a vertical line with geom_vline() to make it easier to see a percentile. For instance, view the percent of cars that get 15 city miles per gallon or less:
mpg %>%
  ggplot(aes(x=cty)) + # blank canvas: choose the data and the x-axis variable
  stat_ecdf() + # add stat layer: cumulative distribution
  geom_vline(xintercept = 15, color="red") # vertical line at 15 miles per gallon
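The value you read off the plot can also be computed directly. The sketch below relies on the fact that mean() of a TRUE/FALSE test gives the share of rows where the test is true. The column names at_most_15 and below_15 are illustrative.

```r
library(tidyverse)

shares <- mpg %>%
  summarise(
    at_most_15 = mean(cty <= 15), # height of the cumulative distribution at 15
    below_15   = mean(cty < 15)   # strictly less than 15
  )

shares
```

The cumulative distribution at 15 counts cars with exactly 15 miles per gallon too, so it matches at_most_15 rather than below_15.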
mpg %>%
  ggplot(aes(x = hwy, y = class)) + # x is continuous, y is categorical
  geom_boxplot()
or:
mpg %>%
  ggplot(aes(x = class, y = hwy)) + # x is categorical, y is continuous
  geom_boxplot()
geom_point():
mpg %>%
  ggplot(aes(x = cty, y = hwy)) +
  geom_point()
mpg %>%
  ggplot(aes(x = cty, y = hwy)) +
  geom_point() + # scatter-plot
  geom_smooth(method = "lm") # regression line
cov():
mpg %>%
  summarise(cov(x=cty, y=hwy))
cor():
mpg %>%
  summarise(cor(x=cty, y=hwy))
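The two statistics are linked: correlation is covariance divided by the product of the two standard deviations. The sketch below checks that with the functions already introduced. The object name check and its column names are made up for illustration.

```r
library(tidyverse)

check <- mpg %>%
  summarise(
    covariance = cov(cty, hwy),
    by_hand    = cov(cty, hwy) / (sd(cty) * sd(hwy)), # covariance rescaled by both standard deviations
    built_in   = cor(cty, hwy)
  )

check # by_hand and built_in should agree
```

Rescaling is what makes correlation unit-free and bounded between -1 and 1, whereas covariance depends on the units of both variables.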
lm()
Simple linear regression (one \(x\) variable):
\[
\begin{aligned}
y &= f(x) \\
&= \beta_0 + \beta_1(x) + \epsilon
\end{aligned}
\]
In lm() the \(y\) and \(x\) variables are separated by ~:
mpg %>%
  lm(formula = cty ~ hwy, data = .) # the "." means "use the data from the pipe %>%"
Call:
lm(formula = cty ~ hwy, data = .)
Coefficients:
(Intercept) hwy
0.8442 0.6832
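As a worked example of reading these coefficients (the value of 30 highway miles per gallon is chosen purely for illustration), the fitted line predicts:
\[
\begin{aligned}
\widehat{\text{cty}} &= 0.8442 + 0.6832 \times 30 \\
&= 0.8442 + 20.496 \\
&\approx 21.34
\end{aligned}
\]
So a car that gets 30 highway miles per gallon is predicted to get about 21.3 city miles per gallon, and each additional highway mile per gallon is associated with about 0.68 more city miles per gallon.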
Multiple linear regression (multiple \(x\) variables):
\[
\begin{aligned}
y &= f(\mathbf{X}) \\
&= f(x_1, x_2, \dots) \\
&= \beta_0 + \beta_1(x_1) + \beta_2(x_2) + \dots + \epsilon
\end{aligned}
\]
In lm() the \(x\) variables are separated by +:
mpg %>%
  lm(formula = cty ~ hwy + cyl + displ, data = .)
Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)
Coefficients:
(Intercept) hwy cyl displ
6.08786 0.58092 -0.44827 -0.05935
summary()
View hypothesis tests and regression diagnostics with summary():
mpg %>%
  lm(formula = cty ~ hwy + cyl + displ, data = .) %>%
  summary()
Call:
lm(formula = cty ~ hwy + cyl + displ, data = .)
Residuals:
Min 1Q Median 3Q Max
-3.0347 -0.6012 -0.0229 0.7397 5.2573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.08786 0.83226 7.315 4.25e-12 ***
hwy 0.58092 0.02010 28.900 < 2e-16 ***
cyl -0.44827 0.13010 -3.446 0.000677 ***
displ -0.05935 0.16351 -0.363 0.716971
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.148 on 230 degrees of freedom
Multiple R-squared: 0.9281, Adjusted R-squared: 0.9272
F-statistic: 989.9 on 3 and 230 DF, p-value: < 2.2e-16
Use the package sjPlot. (If it is not installed yet, run install.packages("sjPlot") first.)
library(sjPlot)
The function you want is plot_model(). The baseline plot shows the coefficient estimates with their confidence intervals:
mpg %>%
  lm(formula = cty ~ hwy + class, data = .) %>%
  plot_model(model = .)
To plot predicted values you can set type = "pred" and then choose which predictor to plot by setting terms:
mpg %>%
  lm(formula = cty ~ hwy + class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy"))
The terms argument accepts multiple terms:
mpg %>%
  lm(formula = cty ~ hwy + class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))
This is useful when you have an interaction effect (e.g., hwy*class):
mpg %>%
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy", "class"))
sjPlot uses ggplot so you can add ggplot layers to it, like a different theme and titles:
mpg %>%
  lm(formula = cty ~ hwy + class + hwy*class, data = .) %>%
  plot_model(model = ., type = "pred", terms = c("hwy", "class")) +
  labs(x = "Highway miles per gallon", y = "Predicted city miles per gallon",
       title = "A very interesting linear model", subtitle = "So interesting") +
  theme_minimal()
The package has tons of great features. Check out the website!
Function | Description |
---|---|
%>% | Pipe operator (“and then”) |
summarise() | Summarize a vector or multiple vectors from a data frame |
mean() | Calculate the mean of a vector |
median() | Calculate the median of a vector |
sd() | Calculate the standard deviation of a vector |
n() | Count the number of observations. Takes no argument. |
group_by() | Group observations by a categorical variable |
select() | Select certain columns |
slice_() | Slice rows from the data |
slice_head(n=5) | View the head (first five rows) of the data. n = can be any number. |
slice_max(var, n=5) | View the rows with the 5 highest values of column “var” |
slice_min(var, n=5) | View the rows with the 5 lowest values of column “var” |
slice_sample(n=5) | Draw 5 rows at random |
filter() | Filter observations |
mutate() | Create or modify a column (variable) |
cov() | Calculate the covariance between two variables |
cor() | Calculate the correlation between two variables |
lm() | Estimate a linear regression |
ifelse() | Create a vector based on a “True/False” test |
ggplot() | Create a base plot |
geom_histogram() | Histogram |
stat_ecdf() | Cumulative distribution plot |
geom_boxplot() | Boxplot |
geom_point() | Scatter plot |
geom_smooth(method = "lm") | Regression line |
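The table lists ifelse(), which none of the examples above use. As a minimal sketch, it can be combined with mutate() to label each car by a true/false test. The column name efficient and the cutoff of 25 are chosen only for illustration.

```r
library(tidyverse)

labelled <- mpg %>%
  mutate(efficient = ifelse(hwy >= 25, "yes", "no")) # test, value if TRUE, value if FALSE

labelled %>%
  select(manufacturer, model, hwy, efficient) %>%
  slice_head(n=5) # view the first 5 rows
```

The first argument is the test, the second is the value used where the test is true, and the third is the value used where it is false.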
Term | Meaning | Pronunciation | Formula/Example |
---|---|---|---|
\(x_i\) | data point \(i\) | “x i” | |
\(n\) | sample size | | |
\(N\) | population size | | |
\(\bar{x}\) | the sample mean | “x bar” | \(\frac{\sum_{i=1}^n x_i}{n}\) |
\(\mu\) | the population mean | “mu” | |
\(s^2\) | the sample variance | | \(\frac{\sum (x_i - \bar{x})^2}{n-1}\) |
\(\sigma^2\) | the population variance | | |
\(s\) | the sample standard deviation | | \(\sqrt{s^2}\) |
\(\sigma\) | the population standard deviation | “sigma” | |
\(z_i\) | z-score for observation \(i\) | | \(z_i = \frac{x_i - \bar{x}}{s}\) |
\(\beta\) | regression coefficient | “beta” | \(y = \beta_0 + \beta_1 x + \epsilon\) |
\(\hat{\beta}\) | estimated value of regression coefficient | “beta hat” | |
\(\epsilon\) | regression error | “epsilon” | |
\(\sum\) | summation operator | “sum” | \(\sum_{i=1}^2 x_i = x_1 + x_2\) |
\(s_{xy}\) | covariance between two variables \(x\) and \(y\) | | \(\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} \in (-\infty,\infty)\) |
\(r_{xy}\) | correlation between two variables \(x\) and \(y\) | | \(\frac{s_{xy}}{s_x s_y} \in [-1,1]\) |
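The z-score formula in the table can be applied with mutate(). This is a minimal sketch; the names z_scores and z_hwy are made up for illustration.

```r
library(tidyverse)

z_scores <- mpg %>%
  mutate(z_hwy = (hwy - mean(hwy)) / sd(hwy)) %>% # (x_i - x bar) / s
  select(manufacturer, model, hwy, z_hwy)

z_scores %>%
  slice_max(z_hwy, n=3) # the cars furthest above the mean
```

By construction the z-scores have mean 0 and standard deviation 1, so a value of 2 means two standard deviations above the average highway mileage.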