Pivot Wider
Data can come in all sorts of arrangements. Two ways to describe a dataset are either as long or wide. We have actually seen an example of a long dataset already with the iris dataset. A long dataset contains the data stacked in a few columns across rows with one or more columns indicating a different group variable. A wide dataset spreads out the data across columns, where the group variable has its corresponding values in distinct columns.
The iris dataset is long because the group variable Species has its group indicator values of setosa, versicolor and virginica within its same column.
Sometimes a computation is easier done in wide form, and
sometimes it is easier done in long form. As such the
pivot
and melt
methods can be
used to transform our data between the two.
We will need to construct an index column, this is used as
a reference column to keep the data in the correct order.
To prepare for the long-to-wide transformation, let's make this
index column using the cumcount
method after a
groupby
. The cumcount
method cumulatively
adds 1 to each row, starting over at 0 when it reaches the next group.
Since this starts counting at 0, we will want to add 1 to the result
to get the index in the range of 1 to 50.
In the above we use the loc
method we learned in the
previous lesson to display rows 47 through 52. We look at these 6 rows
to show the transition from setosa to versicolor,
to showcase the row column starting over in count.
After the pivot, row 1 for
setosa will be lined up with row 1 for versicolor and row 1
for virginica. And so on for rows 2 through 50. This is why
we need an index column, to make this alignment possible.
Let's now add to this method chain by selecting only this new
row column,
Species, which we will use as the group
column and Petal.Length, which will be the values column.
Then we'll call the pivot
method which
requires three arguments:
- index: This parameter specifies the index column.
- columns: This parameter specifies the group column - this will be spread out as the column names across the new columns in the wide dataset.
- values: This parameter specifies the values column - this is the column consisting of the values to fill in for each of the new columns in the wide dataset.
As we can see there are now three columns: setosa, versicolor, and virginica. The row column is the identifying column which is used to line up the values coming from the group column and is kept as the index. And the setosa, versicolor, and virginica Petal.Length values fill in the rest of the dataframe.
Practice exercise
Use assign
, loc
and pivot
to transform the iris dataset from long to wide filling in the values from
Petal.Width