Group by

A typical task for a data scientist is examining summary statistics for specific subsets of data. With the iris dataset, for instance, we learned how to extract subsets using the filter function. If we wanted the mean petal length for each of the three species, we would need to:

  1. Split the dataset into three separate subsets of data, one for each Species (setosa, versicolor, and virginica).
  2. Calculate the mean for each subset of data.
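The two manual steps above can be sketched as follows. This is only an illustration, assuming the dplyr package is loaded and the `%>%` pipe is used; the object names `setosa`, `versicolor`, and `virginica` are chosen here for the example.

```r
library(dplyr)

# Step 1: split iris into one subset per species with filter()
# (the three object names are hypothetical, chosen for this sketch)
setosa     <- iris %>% filter(Species == "setosa")
versicolor <- iris %>% filter(Species == "versicolor")
virginica  <- iris %>% filter(Species == "virginica")

# Step 2: calculate the mean petal length of each subset separately
mean(setosa$Petal.Length)
mean(versicolor$Petal.Length)
mean(virginica$Petal.Length)
```

Every additional species or statistic means another line of nearly identical code, which is the repetition that grouping removes.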

Because this split-then-summarise procedure is so common, the group_by function was introduced to streamline it, eliminating the need to create separate subsets of data.

In the following example we find the mean (a.k.a. average) Petal.Length for each species by first using group_by to split the data into one group per species, then adding the summarise function after the pipe operator to calculate the mean of each group.
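A minimal sketch of this pattern, assuming dplyr is loaded and the `%>%` pipe is used. The names `species_means` and `mean_petal_length` are made up for this illustration.

```r
library(dplyr)

species_means <- iris %>%
  group_by(Species) %>%                              # one group per species
  summarise(mean_petal_length = mean(Petal.Length))  # one summary row per group

species_means
```

The result has one row per species, with no intermediate subsets to create or keep track of.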

Grouping might seem a bit tricky at first, but many find it quite intuitive once they get the hang of it! There is a lot more we can do with group_by. Let's explore a few more ways we can use this function.

A more complex example

In this example, suppose we want to find the sum of the mean Petal.Length and the mean Sepal.Length for each Species. This now involves three steps:

  • Subset the data by Species.
  • Calculate the mean Petal.Length and Sepal.Length for each subset.
  • Add the calculated means together.

Instead of doing these steps separately, we can simplify the process by chaining the group_by, summarise, and mutate functions.
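One way the chain could be written, as a sketch assuming dplyr is loaded and the `%>%` pipe is used. The column names `mean_petal`, `mean_sepal`, and `mean_sum`, and the object name `result`, are hypothetical.

```r
library(dplyr)

result <- iris %>%
  group_by(Species) %>%                        # subset the data by Species
  summarise(
    mean_petal = mean(Petal.Length),           # mean of each group
    mean_sepal = mean(Sepal.Length)
  ) %>%
  mutate(mean_sum = mean_petal + mean_sepal)   # add the two means together

result
```

Because summarise returns one row per species, the mutate step works on those three summary rows rather than on the original 150 observations.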

Wow, that's a lot of code! This may look intimidating at first because it is 7 lines of code. However, these are the same summarise and mutate functions we have already learned, simply strung together with the pipe operator, with white space added for readability.

Group by multiple variables

The group_by function can take multiple arguments, and it groups the data based on the order in which the column names appear. To illustrate this, let's add another column to the iris dataset to use as a second-level grouping.

We will use the row_number function to add row numbers to the dataset. Then, we will use the >= conditional expression, which we learned in the Filter lesson, to create a boolean variable, top_half, that is TRUE for row numbers equal to or above 76 and FALSE for row numbers below 76.
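A sketch of how the new column might be built, assuming dplyr is loaded. The names `iris_halves` and `row_num` are hypothetical; `top_half` is the column name used in this lesson.

```r
library(dplyr)

iris_halves <- iris %>%
  mutate(
    row_num  = row_number(),    # 1, 2, ..., 150
    top_half = row_num >= 76    # FALSE for rows 1 to 75, TRUE for rows 76 to 150
  )

head(iris_halves)
```

Within a single mutate call, later columns can refer to earlier ones, which is why `top_half` can use `row_num` straight away.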

When we use group_by on both Species and top_half, we obtain four groups instead of six. Why is that? The iris dataset has 150 rows ordered by species, so versicolor (rows 51 to 100) straddles the cutoff at row 76 and has two possible values for top_half, TRUE or FALSE, resulting in two groups for versicolor. The other two Species each have a constant value: setosa (rows 1 to 50) is entirely FALSE and virginica (rows 101 to 150) is entirely TRUE, resulting in only one group for each of them.
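The four groups can be checked by counting the rows in each one. This is a sketch assuming dplyr 1.0.0 or later (for the `.groups` argument of summarise); the names `group_sizes` and `n_rows` are hypothetical.

```r
library(dplyr)

group_sizes <- iris %>%
  mutate(top_half = row_number() >= 76) %>%    # rebuild the helper column
  group_by(Species, top_half) %>%              # group by Species, then top_half
  summarise(n_rows = n(), .groups = "drop")    # count rows, then drop grouping

group_sizes
```

Only combinations that actually occur in the data appear in the result, which is why setosa with TRUE and virginica with FALSE are absent.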

Practice exercise

Use group_by and summarise to find the maximum Petal.Width and Sepal.Width of each Species.