Digi Cafe - Beginner Data Science in R

Data can come in all sorts of arrangements. Two ways to describe a dataset are either as long or wide. We have actually seen an example of a long dataset already with the iris dataset. A long dataset contains the data stacked in a few columns across rows with one or more columns indicating a different group variable. A wide dataset spreads out the data across columns, where the group variable has its corresponding values in distinct columns.

The iris dataset is long because the group variable Species has its group indicator values of setosa, versicolor and virginica within its same column.

Sometimes a computation is easier done in wide form, and sometimes it is easier done in long form. As such the pivot_wider and pivot_longer functions can be used to transform our data between the two.

We will need to construct an index column, this is used as a reference column to keep the data in the correct order. To prepare for the long-to-wide transformation, let's make this index column using the rep function (rep is an abbreviation for repeat) and call it row. This row column will have the values 1 through 50 repeated 3 times; corresponding for 1 through 50 for each species.

In the above we use the slice function we learned in the previous lesson to display rows 48 through 53. We look at these 6 rows to show the transition from setosa to versicolor, to showcase the row column starting over in count. After the pivot, row 1 for setosa will be lined up with row 1 for versicolor and row 1 for virginica. And so on for rows 2 through 50. This is why we need an index column, to make this alignment possible.

Let's now add to this pipe chain by selecting only this new row column, Species, which we will use as the group column and Petal.Length, which will be the values column. Then we'll call the pivot_wider function which requires two arguments:

names_from: This parameter specifies the group column - this will be spread out as the column names across the new columns in the wide dataset.
values_from: This parameter specifies the values column - this is the column consisting of the values to fill in for each of the new columns in the wide dataset.

As we can see there are now four columns: row, setosa, versicolor, and virginica. The row column is the identifying column which is used to line up the values coming from the group column. And the setosa, versicolor, and virginica Petal.Length values fill in the rest of the dataframe.

Practice exercise

Use mutate, select and pivot_wider to transform the iris dataset from long to wide filling in the values from Petal.Width

Pivot Wider

Practice exercise