Digi Cafe - Beginner Data Science in Python

Data can come in all sorts of arrangements. Two ways to describe a dataset are either as long or wide. We have actually seen an example of a long dataset already with the iris dataset. A long dataset contains the data stacked in a few columns across rows with one or more columns indicating a different group variable. A wide dataset spreads out the data across columns, where the group variable has its corresponding values in distinct columns.

The iris dataset is long because the group variable Species has its group indicator values of setosa, versicolor and virginica within its same column.

Sometimes a computation is easier done in wide form, and sometimes it is easier done in long form. As such the pivot and melt methods can be used to transform our data between the two.

We will need to construct an index column, this is used as a reference column to keep the data in the correct order. To prepare for the long-to-wide transformation, let's make this index column using the cumcount method after a groupby. The cumcount method cumulatively adds 1 to each row, starting over at 0 when it reaches the next group. Since this starts counting at 0, we will want to add 1 to the result to get the index in the range of 1 to 50.

In the above we use the loc method we learned in the previous lesson to display rows 47 through 52. We look at these 6 rows to show the transition from setosa to versicolor, to showcase the row column starting over in count. After the pivot, row 1 for setosa will be lined up with row 1 for versicolor and row 1 for virginica. And so on for rows 2 through 50. This is why we need an index column, to make this alignment possible.

Let's now add to this method chain by selecting only this new row column, Species, which we will use as the group column and Petal.Length, which will be the values column. Then we'll call the pivot method which requires three arguments:

index: This parameter specifies the index column.
columns: This parameter specifies the group column - this will be spread out as the column names across the new columns in the wide dataset.
values: This parameter specifies the values column - this is the column consisting of the values to fill in for each of the new columns in the wide dataset.

As we can see there are now three columns: setosa, versicolor, and virginica. The row column is the identifying column which is used to line up the values coming from the group column and is kept as the index. And the setosa, versicolor, and virginica Petal.Length values fill in the rest of the dataframe.

Practice exercise

Use assign, loc and pivot to transform the iris dataset from long to wide filling in the values from Petal.Width

Pivot Wider

Practice exercise