Grouping Data In Pandas
Following from my last blog on Transforming data with Pandas, today in my series on Pandas Python I am going to discuss one of the more powerful features of pandas, the groupby object.
It works best with categorical data as you can effectively subset the data in to groups.
Groupby is a class and has its own methods many use the same names as the series and dataframe methods making it quicker to learn.
One way to think about a groupby object is that it is a dataframe of dataframes.
Once you have made a groupby object you can then using aggregation to combine the results, this allow you to get statistics for the groups.
Import data - Cars.CSV¶
import pandas as pd
import numpy as np
cars = pd.read_csv('cars.csv')
cars.head()
Unnamed: 0 | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Lets split up the make and model so we have two categorise.
cars[['Unnamed: 0', 'model']] = cars['Unnamed: 0'].str.split(' ', expand=True, n=1)
cars = cars.rename(columns={'Unnamed: 0': 'make'})
cars.head(3)
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | RX4 |
1 | Mazda | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 | RX4 Wag |
2 | Datsun | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 710 |
print(cars.shape)
cars.dtypes
(32, 13)
make object mpg float64 cyl int64 disp float64 hp int64 drat float64 wt float64 qsec float64 vs int64 am int64 gear int64 carb int64 model object dtype: object
Lets produce a groupby object using the make as the category and check it. It will not show data but rather than pandas object memory space.
make_group = cars.groupby('make')
make_group
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000198E62F66D0>
Lets use the size method to get a count of the values. Note it does not sort the data so a sort_values() can be used.
Note that size() is much the same thing as value_counts() which we use on dataframes
make_group.size().sort_values(ascending=False).head(5)
make Merc 7 Hornet 2 Toyota 2 Mazda 2 Fiat 2 dtype: int64
We can get the first row from each group but using head(1).
Note head(2) will show the first two rows from each group.
I am using a second head() to just return the top five.
make_group.head(1).head()
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | RX4 |
2 | Datsun | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 710 |
3 | Hornet | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 4 Drive |
5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | None |
6 | Duster | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 | 360 |
Lets have a look at the first row from each of the categories in the groupby object using first().
This returns a dataframe with the 'make' as the index. It also sorted the values by this index.
make_group.first(1).head()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
make | |||||||||||
AMC | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Cadillac | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Camaro | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Chrysler | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Datsun | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Another way to get the first from each group is the nth() method.
The good thing about nth() is it return all columns not just numerical columns like first()
make_group.nth(0).head()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
make | ||||||||||||
AMC | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 | Javelin |
Cadillac | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 | Fleetwood |
Camaro | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 | Z28 |
Chrysler | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 | Imperial |
Datsun | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 710 |
We can use the groups attrabute to see the groups. It returns the index values for all rows in the group. Some makes only have one model in the dataset.
make_group.groups
{'AMC': [22], 'Cadillac': [14], 'Camaro': [23], 'Chrysler': [16], 'Datsun': [2], 'Dodge': [21], 'Duster': [6], 'Ferrari': [29], 'Fiat': [17, 25], 'Ford': [28], 'Honda': [18], 'Hornet': [3, 4], 'Lincoln': [15], 'Lotus': [27], 'Maserati': [30], 'Mazda': [0, 1], 'Merc': [7, 8, 9, 10, 11, 12, 13], 'Pontiac': [24], 'Porsche': [26], 'Toyota': [19, 20], 'Valiant': [5], 'Volvo': [31]}
So in a past post I discussed filtering. If want to filter for a single 'make' it could be done like this:-
filt = (cars['make'] == 'Merc')
cars[filt]
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | Merc | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 | 240D |
8 | Merc | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 | 230 |
9 | Merc | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 | 280 |
10 | Merc | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.9 | 1 | 0 | 4 | 4 | 280C |
11 | Merc | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.07 | 17.4 | 0 | 0 | 3 | 3 | 450SE |
12 | Merc | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.73 | 17.6 | 0 | 0 | 3 | 3 | 450SL |
13 | Merc | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.78 | 18.0 | 0 | 0 | 3 | 3 | 450SLC |
But if you want to do this for more than one model it would be slow.
So this is where groupby can come in handy.
Get all rows for the 'Merc' group by using the get_group() method.
Remember we made the groupby object by running: make_group = cars.groupby('make')
make_group.get_group('Merc').head(3)
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | Merc | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 | 240D |
8 | Merc | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 | 230 |
9 | Merc | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 | 280 |
But once you have the groupby object you can quickly get access to any/all the groups.
make_group.get_group('Fiat')
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | Fiat | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 | 128 |
25 | Fiat | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 | X1-9 |
Show a list of the names of all the groups. Like doing a nunique on a series.
make_group.groups.keys()
dict_keys(['AMC', 'Cadillac', 'Camaro', 'Chrysler', 'Datsun', 'Dodge', 'Duster', 'Ferrari', 'Fiat', 'Ford', 'Honda', 'Hornet', 'Lincoln', 'Lotus', 'Maserati', 'Mazda', 'Merc', 'Pontiac', 'Porsche', 'Toyota', 'Valiant', 'Volvo'])
Aggregation¶
So in the Pandas documentation it says "A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups."
This is referring to aggregation.
Lets use max on the 'mpg' column. This will give a pandas series for make_group, groupby object, and the max returns the largest values from 'mpg'
make_group['mpg'].max().head()
make AMC 15.2 Cadillac 10.4 Camaro 13.3 Chrysler 14.7 Datsun 22.8 Name: mpg, dtype: float64
Now lets make a groupby object with two columns
power = cars.groupby(['cyl', 'hp'])
If we use the max() function now it will bring back the numerical columns as max for a multi index dataframe for grouping of 'cyl' and 'hp'. This will use a multiIndex.
So this is giving us the max power by make and cylinders.
power.max().head(15)
make | mpg | disp | drat | wt | qsec | vs | am | gear | carb | model | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
cyl | hp | |||||||||||
4 | 52 | Honda | 30.4 | 75.7 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 | Civic |
62 | Merc | 24.4 | 146.7 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 | 240D | |
65 | Toyota | 33.9 | 71.1 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 | Corolla | |
66 | Fiat | 32.4 | 79.0 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 | X1-9 | |
91 | Porsche | 26.0 | 120.3 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 | 914-2 | |
93 | Datsun | 22.8 | 108.0 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 710 | |
95 | Merc | 22.8 | 140.8 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 | 230 | |
97 | Toyota | 21.5 | 120.1 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 | Corona | |
109 | Volvo | 21.4 | 121.0 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 | 142E | |
113 | Lotus | 30.4 | 95.1 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 | Europa | |
6 | 105 | Valiant | 18.1 | 225.0 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | NaN |
110 | Mazda | 21.4 | 258.0 | 3.90 | 3.215 | 19.44 | 1 | 1 | 4 | 4 | RX4 Wag | |
123 | Merc | 19.2 | 167.6 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 | 280C | |
175 | Ferrari | 19.7 | 145.0 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 | Dino | |
8 | 150 | Dodge | 15.5 | 318.0 | 3.15 | 3.520 | 17.30 | 0 | 0 | 3 | 2 | Javelin |
Agg Method¶
The agg method can be used to give more control and allows different aggregation(s) on each column. You can even pass a list.
power.agg({'mpg': 'max',
'hp': ['max', 'mean']}).head(3)
mpg | hp | |||
---|---|---|---|---|
max | max | mean | ||
cyl | hp | |||
4 | 52 | 30.4 | 52 | 52 |
62 | 24.4 | 62 | 62 | |
65 | 33.9 | 65 | 65 |
power.agg(['sum', 'max']).head(3)
make | mpg | disp | drat | wt | ... | vs | am | gear | carb | model | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sum | max | sum | max | sum | max | sum | max | sum | max | ... | sum | max | sum | max | sum | max | sum | max | sum | max | ||
cyl | hp | |||||||||||||||||||||
4 | 52 | Honda | Honda | 30.4 | 30.4 | 75.7 | 75.7 | 4.93 | 4.93 | 1.615 | 1.615 | ... | 1 | 1 | 1 | 1 | 4 | 4 | 2 | 2 | Civic | Civic |
62 | Merc | Merc | 24.4 | 24.4 | 146.7 | 146.7 | 3.69 | 3.69 | 3.190 | 3.190 | ... | 1 | 1 | 0 | 0 | 4 | 4 | 2 | 2 | 240D | 240D | |
65 | Toyota | Toyota | 33.9 | 33.9 | 71.1 | 71.1 | 4.22 | 4.22 | 1.835 | 1.835 | ... | 1 | 1 | 1 | 1 | 4 | 4 | 1 | 1 | Corolla | Corolla |
3 rows × 22 columns
Iterating over groups¶
Sometimes you find you have no choice but to iterate over groups. Lets see how to do this.
Lets make a new datframe and give it the same column names as the original cleaned dataframe
len(cars)
32
cars2 = pd.DataFrame(columns=cars.columns)
cars2
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model |
---|
Allows you to perform any action row by row. Note this will be slow on large dataframes.
I will use a Python for loop.
Using the cars dataframe again to get the highest 'mpg' form each group 'make'
# Make_group is the groupby object
# make takes one make group at a time
# data will then iterate the make group
for make, data in make_group:
highest_mpg_in_each_group = data.nlargest(1, 'mpg')
cars2 = cars2.append(highest_mpg_in_each_group)
cars2
# lets sort by mpg
cars2.sort_values(by='mpg', ascending=False)
make | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19 | Toyota | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 | Corolla |
17 | Fiat | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 | 128 |
27 | Lotus | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 | Europa |
18 | Honda | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 | Civic |
26 | Porsche | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 | 914-2 |
7 | Merc | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 | 240D |
2 | Datsun | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 | 710 |
3 | Hornet | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 | 4 Drive |
31 | Volvo | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 | 142E |
0 | Mazda | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 | RX4 |
29 | Ferrari | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 | Dino |
24 | Pontiac | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 | Firebird |
5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 | None |
28 | Ford | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 | Pantera L |
21 | Dodge | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 | Challenger |
22 | AMC | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 | Javelin |
30 | Maserati | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 | Bora |
16 | Chrysler | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 | Imperial |
6 | Duster | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 | 360 |
23 | Camaro | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 | Z28 |
14 | Cadillac | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 | Fleetwood |
15 | Lincoln | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 | Continental |
Right that is all for today. Hopefully this was interesting for you.