Selecting Columns

In this lesson we will learn how to subset the data by selecting columns in a dataframe.

loc is a pandas indexer used to subset data. It is relatively straightforward to learn and use compared to the describe and assign methods covered in the previous two lessons.

loc uses bracket notation, with the rows position and the columns position separated by a comma, to subset the dataframe to the rows and columns we want. We include a colon in the rows position to select all rows. Then in the columns position we include the name of the column we want to keep. For instance, in the following code block we retain only Sepal.Length and store the result in df.
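A minimal sketch of this selection. The lesson's data-loading code is not shown here, so the small iris dataframe below (first three rows only) and the variable name iris are assumptions made for illustration.

```python
import pandas as pd

# Stand-in for the iris dataset (first three rows only); how the lesson
# actually loads the data is not shown, so this construction is an assumption.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Petal.Length": [1.4, 1.4, 1.3],
    "Petal.Width": [0.2, 0.2, 0.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# The colon in the rows position keeps every row;
# the column name goes after the comma.
df = iris.loc[:, "Sepal.Length"]
print(df)
```

Note that when a single column name is passed as a string, pandas returns a Series. Wrapping the name in a list, as in iris.loc[:, ["Sepal.Length"]], returns a one-column dataframe instead.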

Multiple columns

If we want to keep multiple columns, such as both Sepal.Length and Species, we can collect the column names into a list. Then we can use the list in the columns position of loc.
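A short sketch of the list approach, again using a small stand-in for iris. The names iris and cols are illustrative assumptions, not taken from the lesson's own code.

```python
import pandas as pd

# Stand-in for the iris dataset (first three rows only); an assumption,
# since the lesson's loading code is not shown.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Petal.Length": [1.4, 1.4, 1.3],
    "Petal.Width": [0.2, 0.2, 0.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# Collect the column names we want to keep into a list ...
cols = ["Sepal.Length", "Species"]

# ... and pass the list in the columns position of loc.
df = iris.loc[:, cols]
print(df)
```

Because a list is passed, the result is a dataframe, and its columns appear in the order given in the list.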

Another option is to use double square bracket notation to select columns.
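A sketch of the double-bracket form, under the same assumed stand-in data. The outer brackets index the dataframe and the inner brackets create a list of column names, so this is the same list-based selection without loc.

```python
import pandas as pd

# Stand-in for the iris dataset (first three rows only); an assumption,
# since the lesson's loading code is not shown.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Petal.Length": [1.4, 1.4, 1.3],
    "Petal.Width": [0.2, 0.2, 0.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# Outer brackets: index the dataframe. Inner brackets: a list of column names.
df = iris[["Sepal.Length", "Species"]]
print(df)
```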

Drop

Now suppose we have a dataframe with a hundred columns, which is not an unreasonable number in research or industry-level data science applications, and we want to keep all but one of them. It does not make sense to write out the names of the 99 columns we want to keep.

Instead, we can drop a list of columns using the drop method. With this method we need to additionally specify the argument axis = 1 because Pandas orders rows as axis 0 and columns as axis 1, and this method's default axis is 0. So if we want to omit only Sepal.Length we can write:
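A minimal sketch of dropping one column, using the same assumed stand-in for iris as a hypothetical input.

```python
import pandas as pd

# Stand-in for the iris dataset (first three rows only); an assumption,
# since the lesson's loading code is not shown.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Petal.Length": [1.4, 1.4, 1.3],
    "Petal.Width": [0.2, 0.2, 0.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# axis=1 tells drop to look for the label among the columns, not the rows.
df = iris.drop("Sepal.Length", axis=1)
print(df.columns.tolist())
```

drop returns a new dataframe and leaves iris unchanged. Writing iris.drop(columns="Sepal.Length") is an equivalent form that avoids the axis argument.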

If we want to omit both Sepal.Length and Species, we can similarly collect the two names in a list and drop that, as in the following code.
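A sketch of dropping several columns at once, with the same assumed stand-in data. The name to_drop is illustrative.

```python
import pandas as pd

# Stand-in for the iris dataset (first three rows only); an assumption,
# since the lesson's loading code is not shown.
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 4.7],
    "Sepal.Width": [3.5, 3.0, 3.2],
    "Petal.Length": [1.4, 1.4, 1.3],
    "Petal.Width": [0.2, 0.2, 0.2],
    "Species": ["setosa", "setosa", "setosa"],
})

# Collect the columns to omit in a list and drop them along the column axis.
to_drop = ["Sepal.Length", "Species"]
df = iris.drop(to_drop, axis=1)
print(df.columns.tolist())
```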

Practice exercise

Select both Petal.Length and Petal.Width from the iris dataset.