Group by
A typical task for a data scientist involves examining summary statistics
for specific subsets of data. For instance, in the case of the iris
dataset, we learned how to extract data subsets using the
filter
function. If at some time we need to determine the
mean petal length for each of the three species, we would need to:
- Split the dataset into three separate subsets of data - once for each Species (setosa, versicolor, and virginica)
- And then calculate the mean for each subset of data.
However, due to the frequency of this procedure, the
group_by
function was introduced to streamline this process,
eliminating the need to create separate subsets of data.
In the following example we find the mean (a.k.a. average) Petal.Length for
each species by first using group_by
to make the three
datasets for each species, then add the summarise
function after the pipe operator to calculate the means.
Grouping might seem a bit tricky at first, but many find it quite
intuitive once they get the hang of it! There is a lot more we can do
with group_by
. Let's explore a few more ways
we can use this function.
A more complex example
In this example let's suppose we want to find the mean Petal.Length plus the mean Sepal.Length for each Species. This involves now three steps:
- Subset the data by Species.
- Calculate the mean Petal.Length andSepal.Length for each subset.
- Add the calculated means together.
Instead of doing these steps separately, we can simplify the process
by chaining the group_by
, summarise
and
mutate
functions.
Wow that's a lot of code! This may look intimidating at first
because it is 7 lines of code. However, these are the same
summarise
and mutate
functions we have already
learned, simply strung together with the pipe operator and white
spaces added.
Group by multiple variables
The group_by
function can take multiple arguments,
and it groups the data based on the order in which the column names
appear. To illustrate this, let's add another column to the iris
dataset for this second level grouping.
We will use the row_number
function to add row numbers
to the dataset. Then, we will use the >=
conditional expression, which we learned in the
Filter lesson, to create a boolean variable equal
to TRUE
for the row numbers equal to or
above 76, and FALSE
with row numbers below 76.
When we use group_by
on both Species and
top_half, we obtain four groups instead of six. Why is that?
This is because the versicolor species has two possible values for
top_half: TRUE
or FALSE
, resulting in two
groups for versicolor. However, the other two Species have
a constant value, either entirely TRUE
or FALSE
,
resulting in only one group for each of them.
Practice exercise
Use group_by
and summarise
to find the maximum Petal.Width and Sepal.Width of each
Species.