Dropping Columns and Rows in Pandas

2021-01-14

df.drop()

The easiest way to drop rows and columns from a Pandas DataFrame is with the .drop() method, which accepts one or more labels passed in as index=<rows to drop> and/or columns=<cols to drop>:

import pandas as pd

df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv").head()
df

# Expected result
#          date region  revenue
# 0  1999-01-02   APAC      928
# 1  1999-01-03   AMER      526
# 2  1999-01-04   EMEA      497
# 3  1999-01-06   APAC      135
# 4  1999-01-07   APAC      829

df.drop(columns='region', index=[0, 2, 4])

# Expected result
#          date  revenue
# 1  1999-01-03      526
# 3  1999-01-06      135

The above command drops both the 'region' column and rows [0, 2, 4]. Notice the columns and index arguments both accept a single label or a “list-like” sequence of labels. A couple things to be aware of:

  • “list-like” does not include tuples. For reasons I don’t fully understand, if you pass a tuple into columns or index you’ll get a KeyError. You can pass in range objects.
  • Before version 0.21.0, you need to drop rows and columns separately using the axis argument, e.g. df.drop('region', axis=1). Probably better to upgrade Pandas 🙂

Dropping by index

Even if your axis is not labeled with an integer index, you can still drop rows and columns by index: just slice the labels.

For example, to drop the odd-numbered rows and the even-numbered columns:

df.drop(index=df.index[1::2], columns=df.columns[::2])

# Expected result
#   region
# 0   APAC
# 2   EMEA
# 4   APAC

The del operator

You can also drop individual columns with the del operator. Be careful because this actually deletes the column (which is similar to passing inplace=True to .drop()). Here, I’ll copy the DataFrame first so we can use it again later:

copied_df = df.copy()

del copied_df["date"]
copied_df

#   region  revenue
# 0   APAC      928
# 1   AMER      526
# 2   EMEA      497
# 3   APAC      135
# 4   APAC      829

Row-dropping utilities

Pandas also provides a couple convenience methods for dropping duplicate rows and rows with missing data.

You can drop duplicate rows with df.drop_duplicates():

df.drop_duplicates(subset='region')

# Expected result
#          date region  revenue
# 0  1999-01-02   APAC      928
# 1  1999-01-03   AMER      526
# 2  1999-01-04   EMEA      497

And you can use df.dropna() to drop rows with missing data. My dummy dataset doesn’t have any missing data so I’ll leave this one as an exercise for the reader! If you do this, you should think about:

  1. How am I going to insert a handful of missing values?
  2. How can I drop rows where only some columns have a missing value?
  3. How can I drop rows where any column has a missing value?

Happy tinkering!

Leave a Comment

Your email address will not be published. Required fields are marked *