Select

In this lesson we will learn how to subset the data by selecting columns in a dataframe.

The select function is a data manipulation function used to subset the data and is relatively straightforward to learn and use compared to the summarise and mutate functions covered in the previous two lessons.

select takes the names of the columns we want to retain and subsets the dataframe to only include those columns. For instance, in the following code block we retain only Sepal.Length and store it in a new dataframe called df.
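A minimal sketch of this step, assuming the dplyr package is installed and using R's built-in iris dataset (the lesson's own code block may differ in style, for example by using the pipe):

```r
library(dplyr)

# Keep only the Sepal.Length column of the built-in iris dataset
# and store the result in a new dataframe called df
df <- select(iris, Sepal.Length)

# Inspect the first few rows of the result
head(df)
```

The result is still a dataframe with all of the original rows, just with a single column.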

Multiple columns

If we want to keep multiple columns, such as both Sepal.Length and Species, we can write:
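As a sketch under the same assumptions (dplyr loaded, built-in iris dataset), the column names are simply listed one after another, separated by commas:

```r
library(dplyr)

# Keep two columns by listing their names as separate arguments
df <- select(iris, Sepal.Length, Species)

# Check which columns were retained
names(df)
```

The columns appear in the result in the order in which they are listed.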

Or we can collect the column names into a character vector using the c function. For example, with the same two columns, we can store the names in a vector called vars2keep and pass it to select wrapped in another function: all_of. This extra step is required in newer versions of the tidyverse to make it unambiguous that vars2keep is an external object holding column names rather than a column of the dataframe itself.
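The vector approach might look like the following sketch, again assuming dplyr and the built-in iris dataset:

```r
library(dplyr)

# Collect the names of the columns to keep in a character vector
vars2keep <- c("Sepal.Length", "Species")

# all_of() tells select that vars2keep is an external vector of column names
df <- select(iris, all_of(vars2keep))

names(df)
```

Note that all_of raises an error if any name in the vector is not a column of the dataframe, which helps catch typos early.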

Drop

Now suppose we have a dataframe with a hundred columns, which is not an unreasonable number in research or industry-level data science applications. And suppose we want to keep all but one column. It does not make sense to write out the names of 99 columns to keep.

Instead, we can use the minus sign (-) to drop a column. So if we want to omit only Sepal.Length, we can write:
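A minimal sketch of dropping a single column, assuming dplyr and the built-in iris dataset:

```r
library(dplyr)

# Prefix the column name with a minus sign to drop it and keep everything else
df <- select(iris, -Sepal.Length)

# The remaining columns keep their original order
names(df)
```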

If we want to omit both Sepal.Length and Species, we can similarly collect the two in a vector and negate that vector.
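Two ways this could be written are sketched below, assuming dplyr and the built-in iris dataset. The name vars2drop is a hypothetical one chosen for illustration; it does not come from the lesson.

```r
library(dplyr)

# Option 1: negate a vector of bare column names directly inside select
df <- select(iris, -c(Sepal.Length, Species))

# Option 2: store the names in an external character vector
# (vars2drop is an illustrative name) and negate all_of()
vars2drop <- c("Sepal.Length", "Species")
df2 <- select(iris, -all_of(vars2drop))

# Both approaches leave the same three columns
names(df)
names(df2)
```

The second form is convenient when the list of columns to drop is built elsewhere in a script.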

Practice exercise

Select both Petal.Length and Petal.Width from the iris dataset.