Selecting Columns
In this lesson we will learn how to subset the data by selecting columns in a dataframe.
The loc indexer is a data manipulation tool used to subset the data. It is relatively straightforward to learn and use compared to the describe and assign methods covered in the previous two lessons.
loc uses bracket notation, with the rows position and the columns position separated by a comma, to subset the dataframe to the rows and columns we want. We include a colon in the rows position to select all rows. Then in the columns position we include the name of the column we want to keep. For instance, we can retain only Sepal.Length and store it in a new dataframe called df.
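The lesson's original code is not reproduced here, so the following is a minimal sketch of the selection just described. It assumes the data is held in a dataframe named iris (a hypothetical name) and uses a three-row stand-in with the column names from this lesson. Note that passing a bare column name to loc returns a Series, while passing a one-item list keeps the result a dataframe.

```python
import pandas as pd

# Three-row stand-in for the iris data; the variable name `iris` is an assumption.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.2, 3.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Colon in the rows position keeps every row;
# the columns position names the column to keep.
sepal_length = iris.loc[:, "Sepal.Length"]   # bare name: a Series
df = iris.loc[:, ["Sepal.Length"]]           # one-item list: a dataframe

print(type(sepal_length).__name__)
print(df)
```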
Multiple columns
If we want to keep multiple columns, such as both Sepal.Length and Species, we can collect the column names into a list. Then we can use the list in the columns position of loc.
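As a sketch of the list-based selection, again assuming a stand-in dataframe named iris (the names iris and cols are illustrative, not taken from the lesson):

```python
import pandas as pd

# Three-row stand-in for the iris data; the variable name `iris` is an assumption.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.2, 3.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Collect the column names to keep into a list...
cols = ["Sepal.Length", "Species"]

# ...and pass the list in the columns position of loc.
df = iris.loc[:, cols]
print(df)
```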
Another option is to use double square bracket notation to select columns.
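A minimal sketch of the double square bracket form, under the same assumption of a stand-in dataframe named iris. The outer brackets index the dataframe and the inner brackets form the list of column names.

```python
import pandas as pd

# Three-row stand-in for the iris data; the variable name `iris` is an assumption.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.2, 3.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Double square brackets: a list of column names inside the indexing brackets.
df = iris[["Sepal.Length", "Species"]]

# This selects the same columns as the loc form with a colon for the rows.
same_as_loc = df.equals(iris.loc[:, ["Sepal.Length", "Species"]])
print(df)
```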
Drop
Now suppose we have a dataframe with a hundred columns, which is not an unreasonable number in research or industry-level data science applications. And suppose we want to keep all but one column. It does not make sense to write out 99 column names to keep.
Instead, we can drop a list of columns using the drop method. With this method we need to additionally specify the argument axis = 1 because Pandas orders rows as axis 0 and columns as axis 1, and this method's default axis is 0.
So if we want to omit only Sepal.Length we can write:
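A sketch of dropping a single column, assuming a stand-in dataframe named iris (a hypothetical name) with the column names used in this lesson:

```python
import pandas as pd

# Three-row stand-in for the iris data; the variable name `iris` is an assumption.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.2, 3.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

# axis=1 tells drop to look for the label among the columns, not the rows.
df = iris.drop("Sepal.Length", axis=1)
print(df.columns.tolist())
```

drop returns a new dataframe and leaves iris unchanged. pandas also accepts the columns keyword, as in iris.drop(columns="Sepal.Length"), as an alternative to axis=1.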
If we want to omit both Sepal.Length and Species, we can similarly collect the two in a list and drop that.
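Dropping several columns at once might look like the sketch below, again using an assumed stand-in dataframe named iris; the name to_drop is illustrative.

```python
import pandas as pd

# Three-row stand-in for the iris data; the variable name `iris` is an assumption.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 7.0, 6.3],
    "Sepal.Width": [3.5, 3.2, 3.3],
    "Petal.Length": [1.4, 4.7, 6.0],
    "Petal.Width": [0.2, 1.4, 2.5],
    "Species": ["setosa", "versicolor", "virginica"],
})

# Collect the unwanted column names into a list and drop them along axis 1.
to_drop = ["Sepal.Length", "Species"]
df = iris.drop(to_drop, axis=1)
print(df.columns.tolist())
```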
Practice exercise
Select both Petal.Length and Petal.Width from the iris dataset.