Select
In this lesson we will learn how to subset the data by selecting columns in a dataframe.
The select function is a data manipulation function used to subset the data and is relatively straightforward to learn and use compared to the summarise and mutate functions covered in the previous two lessons. select takes the names of the columns we want to retain and subsets the dataframe to only include those columns. For instance, in the following code block we retain only Sepal.Length and store it in a new dataframe called df.
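As a minimal sketch of the single-column case, assuming the dplyr package is installed and using R's built-in iris dataset (the pipe style is an assumption; calling select(iris, Sepal.Length) directly should behave the same way):

```r
library(dplyr)

# Keep only the Sepal.Length column of the built-in iris dataset
# and store the result in a new dataframe called df
df <- iris %>% select(Sepal.Length)

# Inspect the first few rows; df is still a dataframe, not a bare vector
head(df)
```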
Multiple columns
If we want to keep multiple columns, such as both Sepal.Length and Species, we can write:
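One way this might look, again assuming dplyr and the built-in iris dataset, is to list the column names separated by commas:

```r
library(dplyr)

# Keep two columns; they appear in the result in the order listed
df <- iris %>% select(Sepal.Length, Species)

head(df)
```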
Or we can collect the column names into a vector using the c function. For example, with the same two columns, we can collect these into a vector, which we will name vars2keep, and place it inside another function: all_of. This further step is required in newer versions of the Tidyverse to avoid ambiguity about whether we want dataframe columns or an external object.
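The vector approach could be sketched as follows, assuming dplyr is loaded. Note that the column names are quoted strings here, because vars2keep is an ordinary character vector living outside the dataframe:

```r
library(dplyr)

# Collect the column names as strings in an external character vector
vars2keep <- c("Sepal.Length", "Species")

# all_of() tells select that vars2keep is an external vector of names,
# not a column of iris that happens to be called vars2keep
df <- iris %>% select(all_of(vars2keep))

head(df)
```

all_of is strict: if any name in the vector is missing from the dataframe, it raises an error rather than silently skipping it.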
Drop
Now suppose we have a dataframe with a hundred columns, which is not an unreasonable number in research or industry-level data science applications, and suppose we want to keep all but one of them. It does not make sense to write out the names of the 99 columns we want to keep.
Instead, we can use the minus sign - to drop a column.
So if we want to omit only Sepal.Length we can write:
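A minimal sketch of dropping a single column, assuming dplyr and the built-in iris dataset:

```r
library(dplyr)

# The minus sign drops Sepal.Length and keeps every other column
df <- iris %>% select(-Sepal.Length)

# The four remaining columns keep their original order
names(df)
```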
If we want to omit both Sepal.Length and Species, we can similarly collect the two names in a vector and negate it.
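Dropping several columns at once might look like the sketch below, assuming dplyr is loaded. The names df_a, df_b and vars2drop are hypothetical, chosen only for this illustration. Two variants are shown: negating a vector of bare column names, and negating an external character vector wrapped in all_of.

```r
library(dplyr)

# Variant 1: negate a vector of bare column names inside select
df_a <- iris %>% select(-c(Sepal.Length, Species))

# Variant 2: negate an external character vector wrapped in all_of()
vars2drop <- c("Sepal.Length", "Species")  # hypothetical name
df_b <- iris %>% select(-all_of(vars2drop))

# Both variants leave the same three columns
names(df_a)
names(df_b)
```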
Practice exercise
Select both Petal.Length and Petal.Width from the iris dataset.