Selecting Columns And Filtering Rows In Pandas
Following on from my blog on working with dates in Pandas. I wanted to discuss filtering and selecting data in Python Pandas.
Pandas has a number of ways to filter, select and subset data. I will cover some of these options.
Lets lead some data
import seaborn as sns
import pandas as pd
taxis = sns.load_dataset('taxis')
taxis.drop(['passengers', 'distance', 'pickup_zone', 'dropoff_zone'], inplace=True, axis=1)
taxis.head(3)
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Manhattan | Manhattan |
1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Manhattan | Manhattan |
2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
Selecting Columns¶
Selecting one column¶
There are two way to reference a columns. Using dot notation or bracket notation. I would suggest using the bracket notation as dot notation raises and error if there is spaces in the column name.
# bracket notation
taxis['dropoff'].head(3)
0 2019-03-23 20:27:24 1 2019-03-04 16:19:00 2 2019-03-27 18:00:25 Name: dropoff, dtype: object
# dot notation
taxis.dropoff.head(3)
0 2019-03-23 20:27:24 1 2019-03-04 16:19:00 2 2019-03-27 18:00:25 Name: dropoff, dtype: object
Selecting multiple columns¶
Can be done by passing a list in to the brackets. It can also be done putting the list directly into the brackets, this is something I found confusing when learning, the below example keeps it simple and clear.
my_cols = ['fare', 'total', 'pickup_borough']
taxis[my_cols].head(3)
fare | total | pickup_borough | |
---|---|---|---|
0 | 7.0 | 12.95 | Manhattan |
1 | 5.0 | 9.30 | Manhattan |
2 | 7.5 | 14.16 | Manhattan |
Selecting columns using the columns parameter¶
The columns parameter returns a list of all the columns.
taxis.columns
Index(['pickup', 'dropoff', 'fare', 'tip', 'tolls', 'total', 'color', 'payment', 'pickup_borough', 'dropoff_borough'], dtype='object')
So if we pass the df and columns parameter an index number.
col7 = taxis.columns[7]
taxis[col7].head(3)
0 credit card 1 cash 2 credit card Name: payment, dtype: object
The same thing can be done using a slice
cols = taxis.columns[3:7]
taxis[cols].head(3)
tip | tolls | total | color | |
---|---|---|---|---|
0 | 2.15 | 0.0 | 12.95 | yellow |
1 | 0.00 | 0.0 | 9.30 | yellow |
2 | 2.36 | 0.0 | 14.16 | yellow |
Selecting columns using loc[]¶
The loc method can be used for selecting columns and filtering rows but for now I will focus on selecting columns and just get all rows.
Loc takes the columns names and one or more columns can be selected.
A conditional can also be used to make the boolean list.
my_cols = ['fare', 'total', 'pickup_borough']
taxis.loc[:, my_cols].head(3)
fare | total | pickup_borough | |
---|---|---|---|
0 | 7.0 | 12.95 | Manhattan |
1 | 5.0 | 9.30 | Manhattan |
2 | 7.5 | 14.16 | Manhattan |
Selecting columns using iloc[]¶
iloc uses integer-location based indexing and so slicing can be used if more than one column is needed.
taxis.iloc[:, 3:7].head(3)
tip | tolls | total | color | |
---|---|---|---|---|
0 | 2.15 | 0.0 | 12.95 | yellow |
1 | 0.00 | 0.0 | 9.30 | yellow |
2 | 2.36 | 0.0 | 14.16 | yellow |
Selecting columns, by values in the columns, using any()¶
Say you want to select all columns that have a value of 7 somewhere in the column. We can produce a boolean list looking for equality and using the any method.
(taxis == 7).any().values
array([False, False, True, True, False, True, False, False, False, False])
selection = (taxis == 7).any()
taxis.loc[:, selection].head(3)
fare | tip | total | |
---|---|---|---|
0 | 7.0 | 2.15 | 12.95 |
1 | 5.0 | 0.00 | 9.30 |
2 | 7.5 | 2.36 | 14.16 |
Same thing could be used for selecting with multiple conditions using and '&' or or '|'.
taxis_sel = taxis.select_dtypes(include=['float64']) # select floats only
selection = ((taxis_sel >= 10) | (taxis_sel <= 30)).any()
taxis_sel.loc[:, selection].head(3)
fare | tip | tolls | total | |
---|---|---|---|---|
0 | 7.0 | 2.15 | 0.0 | 12.95 |
1 | 5.0 | 0.00 | 0.0 | 9.30 |
2 | 7.5 | 2.36 | 0.0 | 14.16 |
Selecting columns by column name using isin() and columns¶
This can be useful if you have a large number of columns and have a list of column names that you want to compare with your dataframe.
cols_list = ['pickup_borough', 'other_col_name', 'fare']
taxis.loc[:, taxis.columns.isin(cols_list)].head(3)
fare | pickup_borough | |
---|---|---|
0 | 7.0 | Manhattan |
1 | 5.0 | Manhattan |
2 | 7.5 | Manhattan |
Filtering rows¶
Many of the same ways we select columns also work for rows.
The Dataframe indexing operator [] is designed for selecting columns and not rows. But can be used via slicing. I would suggest to not use it and rather use loc or iloc.
Let quickly customise the index
taxis['id'] = 'id'
taxis.reset_index(level=0, inplace=True)
taxis['index'] = taxis['id'] + '_' + taxis['index'].astype(str)
taxis.drop('id', axis=1, inplace=True)
taxis.set_index('index', drop=True, inplace=True)
taxis.head(3)
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Manhattan | Manhattan |
id_1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Manhattan | Manhattan |
id_2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
Selecting one row with loc¶
taxis.loc['id_2',:]
pickup 2019-03-27 17:53:01 dropoff 2019-03-27 18:00:25 fare 7.5 tip 2.36 tolls 0.0 total 14.16 color yellow payment credit card pickup_borough Manhattan dropoff_borough Manhattan Name: id_2, dtype: object
Selecting multiple rows with loc¶
taxis.loc[['id_2', 'id_6'],:]
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
id_6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Manhattan | Manhattan |
Selecting rows with iloc¶
taxis.iloc[2,:]
pickup 2019-03-27 17:53:01 dropoff 2019-03-27 18:00:25 fare 7.5 tip 2.36 tolls 0.0 total 14.16 color yellow payment credit card pickup_borough Manhattan dropoff_borough Manhattan Name: id_2, dtype: object
Selecting multiple rows with iloc¶
taxis.iloc[[2,6],:]
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
id_6 | 2019-03-26 21:07:31 | 2019-03-26 21:17:29 | 13.0 | 2.00 | 0.0 | 18.80 | yellow | credit card | Manhattan | Manhattan |
Selecting rows by the values in a column with one condition¶
There are a number of way of doing this but most use a boolean mask. There is also the query method.
mask = (taxis['payment'] == 'credit card')
taxis.loc[mask].head(3)
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Manhattan | Manhattan |
id_2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
id_3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Manhattan | Manhattan |
Selecting rows by the values in a column with multiple conditions¶
mask = (taxis['payment'] == 'credit card') & (taxis['color'] == 'green')
taxis.loc[mask].head(3)
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_5452 | 2019-03-14 22:04:04 | 2019-03-14 22:10:00 | 5.50 | 0.0 | 0.00 | 6.80 | green | credit card | Queens | Queens |
id_5453 | 2019-03-29 18:12:27 | 2019-03-29 18:20:40 | 7.50 | 1.2 | 0.00 | 10.50 | green | credit card | Manhattan | Manhattan |
id_5456 | 2019-03-12 21:11:03 | 2019-03-12 21:41:36 | 42.82 | 0.0 | 5.76 | 49.08 | green | credit card | Queens | Manhattan |
Using isin, multiple values in one column¶
A very handy function.
mask = taxis['pickup_borough'].isin(['Manhattan', 'Queens'])
taxis.loc[mask]
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Manhattan | Manhattan |
id_1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Manhattan | Manhattan |
id_2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
id_3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Manhattan | Manhattan |
id_4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Manhattan | Manhattan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
id_6423 | 2019-03-12 08:10:47 | 2019-03-12 08:35:35 | 18.5 | 0.00 | 0.0 | 19.30 | green | credit card | Queens | Queens |
id_6424 | 2019-03-30 20:52:15 | 2019-03-30 20:59:55 | 8.0 | 0.00 | 0.0 | 9.30 | green | cash | Manhattan | Manhattan |
id_6426 | 2019-03-28 08:04:47 | 2019-03-28 08:07:46 | 4.5 | 0.50 | 0.0 | 5.80 | green | credit card | Manhattan | Manhattan |
id_6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | Manhattan | Manhattan |
id_6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Queens | Bronx |
5925 rows × 10 columns
Selecting rows using query()¶
Query is nice to work with but just be careful with the inverted commas.
taxis.query("pickup_borough == 'Manhattan' | pickup_borough == 'Queens'")
pickup | dropoff | fare | tip | tolls | total | color | payment | pickup_borough | dropoff_borough | |
---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||
id_0 | 2019-03-23 20:21:09 | 2019-03-23 20:27:24 | 7.0 | 2.15 | 0.0 | 12.95 | yellow | credit card | Manhattan | Manhattan |
id_1 | 2019-03-04 16:11:55 | 2019-03-04 16:19:00 | 5.0 | 0.00 | 0.0 | 9.30 | yellow | cash | Manhattan | Manhattan |
id_2 | 2019-03-27 17:53:01 | 2019-03-27 18:00:25 | 7.5 | 2.36 | 0.0 | 14.16 | yellow | credit card | Manhattan | Manhattan |
id_3 | 2019-03-10 01:23:59 | 2019-03-10 01:49:51 | 27.0 | 6.15 | 0.0 | 36.95 | yellow | credit card | Manhattan | Manhattan |
id_4 | 2019-03-30 13:27:42 | 2019-03-30 13:37:14 | 9.0 | 1.10 | 0.0 | 13.40 | yellow | credit card | Manhattan | Manhattan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
id_6423 | 2019-03-12 08:10:47 | 2019-03-12 08:35:35 | 18.5 | 0.00 | 0.0 | 19.30 | green | credit card | Queens | Queens |
id_6424 | 2019-03-30 20:52:15 | 2019-03-30 20:59:55 | 8.0 | 0.00 | 0.0 | 9.30 | green | cash | Manhattan | Manhattan |
id_6426 | 2019-03-28 08:04:47 | 2019-03-28 08:07:46 | 4.5 | 0.50 | 0.0 | 5.80 | green | credit card | Manhattan | Manhattan |
id_6428 | 2019-03-31 09:51:53 | 2019-03-31 09:55:27 | 4.5 | 1.06 | 0.0 | 6.36 | green | credit card | Manhattan | Manhattan |
id_6429 | 2019-03-31 17:38:00 | 2019-03-31 18:34:23 | 58.0 | 0.00 | 0.0 | 58.80 | green | credit card | Queens | Bronx |
5925 rows × 10 columns
Conclusion¶
Pandas gives a number way to select columns/filter rows and it can be tricky to know which to use. If you are new to Pandas then .loc[] is your best bet. If you are coming from SQL then maybe query() might make you feel more at home.
Clearly these methods can be mixed and matched as needed. Thats all for now. Happy coding!