Summarise

Let's take a look at our first function in the Tidyverse. The summarise function computes summary statistics for a dataframe. To use it, we start with the name of the dataframe, follow it with the pipe operator, and then call summarise. Inside summarise we pass an argument of the form y = stat_function(column_name), where y is the name of the column to create in the returned dataframe, column_name is the column of the dataframe to the left of the pipe operator that we want to summarise, and stat_function is a statistical function such as one of the following:

  • mean: calculate the mean
  • sd: calculate the standard deviation
  • n: return the number of rows in the dataframe
  • max: calculate the maximum
  • min: calculate the minimum
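
As a minimal sketch of this pattern, we can compute the mean of one column. The example assumes the dplyr package (which provides summarise and the %>% pipe) and uses R's built-in iris dataframe; the column Sepal.Length and the output name sepal_mean are chosen here for illustration.

```r
library(dplyr)

# Assumption: iris (built into R) stands in for the dataframe,
# and Sepal.Length is the column we want to summarise.
sepal_mean_1 <- iris %>%
  summarise(sepal_mean = mean(Sepal.Length))

# A one-row dataframe with a single column named sepal_mean
print(sepal_mean_1)
```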

In the previous code editor we used summarise to calculate the mean. Another way to do this is to call the mean function directly on the column.
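
A short sketch of the direct approach, again assuming the built-in iris dataframe and its Sepal.Length column as the illustrative data:

```r
# Assumption: iris and Sepal.Length are used for illustration.
# The $ operator extracts the column as a plain vector,
# which mean() then reduces to a single number.
sepal_mean_2 <- mean(iris$Sepal.Length)

print(sepal_mean_2)
```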

Both methods compute the same value, but they differ in the type of object they return.

We see that the datatype of sepal_mean_1 is data.frame while sepal_mean_2 is numeric.
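
One way to check this is with the class function. The sketch below rebuilds both objects so that it runs on its own; it assumes dplyr and the built-in iris dataframe, with Sepal.Length as the illustrative column.

```r
library(dplyr)

# Assumption: iris and Sepal.Length are used for illustration.
sepal_mean_1 <- iris %>%
  summarise(sepal_mean = mean(Sepal.Length))
sepal_mean_2 <- mean(iris$Sepal.Length)

print(class(sepal_mean_1))  # "data.frame"
print(class(sepal_mean_2))  # "numeric"

# The number inside the one-row dataframe can be extracted with $
print(sepal_mean_1$sepal_mean)
```

If a plain number is needed for later arithmetic, the direct call or the $ extraction is usually more convenient; the dataframe form is handy when the result feeds into further piped steps.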

If we want the standard deviation, we can use sd.
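
A sketch of the same pattern with sd, assuming dplyr and the built-in iris dataframe; the output name sepal_sd is hypothetical.

```r
library(dplyr)

# Assumption: iris and Sepal.Length are used for illustration.
sepal_sd <- iris %>%
  summarise(sepal_sd = sd(Sepal.Length))

print(sepal_sd)
```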

The number of rows in the dataframe can be found with n. An interesting point here is that we do not pass a column name to n as we do with the other functions. This is because n counts the rows of the entire dataframe rather than summarising a single column.
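
A sketch of counting rows with n, assuming dplyr and the built-in iris dataframe; the output name n_rows is hypothetical. Note that n is called with empty parentheses.

```r
library(dplyr)

# Assumption: iris is used for illustration.
# n() takes no arguments and is only meaningful inside
# dplyr verbs such as summarise.
row_count <- iris %>%
  summarise(n_rows = n())

print(row_count)  # n_rows is 150 for iris
```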

Another way to find the number of rows is with the dim function, which returns two values: the first is the number of rows and the second is the number of columns.
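
For comparison, a sketch using dim on the built-in iris dataframe (assumed here as the example data):

```r
# Assumption: iris is used for illustration.
iris_dim <- dim(iris)

print(iris_dim)     # 150 5
print(iris_dim[1])  # number of rows
print(iris_dim[2])  # number of columns
```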

If we want the maximum or minimum, we can use max or min.
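
A sketch of both in one call, assuming dplyr and the built-in iris dataframe. summarise accepts several name = function(column) arguments at once, each producing one column in the result; the names sepal_max and sepal_min are hypothetical.

```r
library(dplyr)

# Assumption: iris and Sepal.Length are used for illustration.
sepal_range <- iris %>%
  summarise(
    sepal_max = max(Sepal.Length),
    sepal_min = min(Sepal.Length)
  )

# A one-row dataframe with two columns
print(sepal_range)
```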

Practice exercise

Use the pipe operator and the summarise function to find the mean of Petal.Length