Digi Cafe - Beginner Data Science in Python

A typical task for a data scientist involves examining summary statistics for specific subsets of data. For instance, in the case of the iris dataset, we learned how to extract data subsets using the query function. If at some time we need to determine the mean petal length for each of the three species, we would need to:

Split the dataset into three separate subsets of data - once for each Species (setosa, versicolor, and virginica)
And then calculate the mean for each subset of data.

However, due to the frequency of this procedure, the groupby method was introduced to streamline this process, eliminating the need to create separate subsets of data.

In the following example we find the mean (a.k.a. average) Petal.Length for each species by first using groupby to make the three datasets for each species, select just one column we want to calculate the mean of, then use the mean method to calculate the means.

Grouping might seem a bit tricky at first, but many find it quite intuitive once they get the hang of it! There is a lot more we can do with groupby. Let's explore a few more ways we can use this method.

A more complex example

In this example let's suppose we want to find the mean Petal.Length plus the mean Sepal.Length for each Species. This involves now three steps:

Subset the data by Species.
Calculate the mean Petal.Length andSepal.Length for each subset.
Add the calculated means together.

Instead of doing these steps separately, we can simplify the process by chaining the groupby, mean and assign methods.

Wow that's a lot of code! This may look intimidating at first because it is 16 lines of code. However, these are the same mean and assign methods we have already learned, simply strung together with method chaining and white spaces added.

Group by multiple variables

The groupby function can take multiple arguments, and it groups the data based on the order in which the column names appear. To illustrate this, let's add another column to the iris dataset for this second level grouping.

We will use the index attribute to add row numbers to the dataset. Then, we will use the >= conditional expression, which we learned in the Query lesson, to create a boolean variable equal to True for the row numbers equal to or above 75, and False with row numbers below 75.

When we use groupby on both Species and top_half, we obtain four groups instead of six. Why is that? This is because the versicolor species has two possible values for top_half: TRUE or FALSE, resulting in two groups for versicolor. However, the other two Species have a constant value, either entirely TRUE or FALSE, resulting in only one group for each of them.

Practice exercise

Use groupby and max to find the maximum Petal.Width and Sepal.Width of each Species.