Pivot Wider
Data can come in all sorts of arrangements. Two ways to describe a dataset are either as long or wide. We have actually seen an example of a long dataset already with the iris dataset. A long dataset contains the data stacked in a few columns across rows with one or more columns indicating a different group variable. A wide dataset spreads out the data across columns, where the group variable has its corresponding values in distinct columns.
The iris dataset is long because the group variable Species has its group indicator values of setosa, versicolor and virginica within its same column.
Sometimes a computation is easier done in wide form, and
sometimes it is easier done in long form. As such the
pivot_wider
and pivot_longer
functions can be
used to transform our data between the two.
We will need to construct an index column, this is used as
a reference column to keep the data in the correct order.
To prepare for the long-to-wide transformation, let's make this
index column using the rep
function
(rep is an abbreviation for repeat) and call it row.
This row column will have the values 1 through 50 repeated 3 times;
corresponding for 1 through 50 for each species.
In the above we use the slice
function we learned in the
previous lesson to display rows 48 through 53. We look at these 6 rows
to show the transition from setosa to versicolor,
to showcase the row column starting over in count.
After the pivot, row 1 for
setosa will be lined up with row 1 for versicolor and row 1
for virginica. And so on for rows 2 through 50. This is why
we need an index column, to make this alignment possible.
Let's now add to this pipe chain by selecting only this new
row column,
Species, which we will use as the group
column and Petal.Length, which will be the values column.
Then we'll call the pivot_wider
function which
requires two arguments:
- names_from: This parameter specifies the group column - this will be spread out as the column names across the new columns in the wide dataset.
- values_from: This parameter specifies the values column - this is the column consisting of the values to fill in for each of the new columns in the wide dataset.
As we can see there are now four columns: row, setosa, versicolor, and virginica. The row column is the identifying column which is used to line up the values coming from the group column. And the setosa, versicolor, and virginica Petal.Length values fill in the rest of the dataframe.
Practice exercise
Use mutate
, select
and pivot_wider
to transform the iris dataset from long to wide filling in the values from
Petal.Width