Group by
A typical task for a data scientist involves examining summary statistics
for specific subsets of data. For instance, in the case of the iris
dataset, we learned how to extract data subsets using the
query
function. If at some time we need to determine the
mean petal length for each of the three species, we would need to:
- Split the dataset into three separate subsets of data - once for each Species (setosa, versicolor, and virginica)
- And then calculate the mean for each subset of data.
However, due to the frequency of this procedure, the
groupby
method was introduced to streamline this process,
eliminating the need to create separate subsets of data.
In the following example we find the mean (a.k.a. average) Petal.Length for
each species by first using groupby
to make the three
datasets for each species, select just one column we want to calculate
the mean of, then use the mean
method to calculate the
means.
Grouping might seem a bit tricky at first, but many find it quite
intuitive once they get the hang of it! There is a lot more we can do
with groupby
. Let's explore a few more ways
we can use this method.
A more complex example
In this example let's suppose we want to find the mean Petal.Length plus the mean Sepal.Length for each Species. This involves now three steps:
- Subset the data by Species.
- Calculate the mean Petal.Length andSepal.Length for each subset.
- Add the calculated means together.
Instead of doing these steps separately, we can simplify the process
by chaining the groupby
, mean
and
assign
methods.
Wow that's a lot of code! This may look intimidating at first
because it is 16 lines of code. However, these are the same
mean
and assign
methods we have already
learned, simply strung together with method chaining and white
spaces added.
Group by multiple variables
The groupby
function can take multiple arguments,
and it groups the data based on the order in which the column names
appear. To illustrate this, let's add another column to the iris
dataset for this second level grouping.
We will use the index attribute to add row numbers
to the dataset. Then, we will use the >=
conditional expression, which we learned in the
Query lesson, to create a boolean variable equal
to True
for the row numbers equal to or
above 75, and False
with row numbers below 75.
When we use groupby
on both Species and
top_half, we obtain four groups instead of six. Why is that?
This is because the versicolor species has two possible values for
top_half: TRUE
or FALSE
, resulting in two
groups for versicolor. However, the other two Species have
a constant value, either entirely TRUE
or FALSE
,
resulting in only one group for each of them.
Practice exercise
Use groupby
and max
to find the maximum Petal.Width and Sepal.Width of each
Species.