Filtering DataFrames with the .query() Method in Pandas

2021-02-19

Pandas provides a .query() method on DataFrames with a convenient string syntax for filtering. Think of the .query() syntax like the where clause in SQL.

Here’s a basic example:

import pandas as pd

df = pd.read_csv("https://jbencook.s3.amazonaws.com/data/dummy-sales.csv")
df.query("region == 'APAC' and revenue < 300")

# Expected result
#           date region  revenue
# 3   1999-01-06   APAC      135
# 9   1999-01-18   APAC      147
# 11  1999-01-24   APAC      100
# 24  1999-03-20   APAC      108

The query string "region == 'APAC' and revenue < 300" selects the rows where region is 'APAC' and revenue is less than 300.

Pretty simple! You can also reference local variables by prefixing them with @. If we wanted to get examples where revenue was 1 standard deviation above the mean, we could first compute these values and then reference them in our query string:

avg_revenue = df.revenue.mean()
std_revenue = df.revenue.std()
df.query("revenue > @avg_revenue + @std_revenue")

# Expected result
#           date region  revenue
# 0   1999-01-02   APAC      928
# 8   1999-01-16   APAC      970
# 19  1999-02-16   EMEA      918
# 25  1999-03-23   AMER      972
# 26  1999-03-24   AMER      956
# 27  1999-03-24   EMEA      954
# 29  1999-03-26   AMER      994

But you can also call methods on the columns inside the string. Here’s the same query without pre-computing the mean and standard deviation:

df.query("revenue > revenue.mean() + revenue.std()")

# Expected result
#           date region  revenue
# 0   1999-01-02   APAC      928
# 8   1999-01-16   APAC      970
# 19  1999-02-16   EMEA      918
# 25  1999-03-23   AMER      972
# 26  1999-03-24   AMER      956
# 27  1999-03-24   EMEA      954
# 29  1999-03-26   AMER      994

It’s worth mentioning one other cool trick. You can check whether a column value is in a local list:

valid_dates = ["1999-01-02", "1999-01-03", "1999-01-04"]
df.query("date in @valid_dates")

# Expected result
#          date region  revenue
# 0  1999-01-02   APAC      928
# 1  1999-01-03   AMER      526
# 2  1999-01-04   EMEA      497

A few other things to be aware of:

  • You can’t reference columns if they share a name with Python keywords.
  • You can use backticks, e.g. hello world to reference a columns that aren’t valid Python variables.
  • The result is a new DataFrame, unless you pass inplace=True, in which case it modifies the existing DataFrame.
  • Performance of .query() will often be better than complex masking operations (such as df[(df.region == "APAC") & (df.revenue < 300)]), because .query() doesn’t create intermediate objects, leaving everything in C.

Check out my Jupyter notebook if you want to play around with the .query() method!

Leave a Comment

Your email address will not be published. Required fields are marked *