Grouping Data In Pandas

groupby

Following from my last blog on Transforming data with Pandas, today in my series on Pandas Python I am going to discuss one of the more powerful features of pandas, the groupby object.

It works best with categorical data as you can effectively subset the data in to groups.

Groupby is a class and has its own methods many use the same names as the series and dataframe methods making it quicker to learn.

One way to think about a groupby object is that it is a dataframe of dataframes.

Once you have made a groupby object you can then using aggregation to combine the results, this allow you to get statistics for the groups.

Import data - Cars.CSV¶

In [ ]:

import pandas as pd
import numpy as np
cars = pd.read_csv('cars.csv')
cars.head()

Out[ ]:

	Unnamed: 0	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
3	Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2

Lets split up the make and model so we have two categorise.

In [ ]:

cars[['Unnamed: 0', 'model']] = cars['Unnamed: 0'].str.split(' ', expand=True, n=1)
cars = cars.rename(columns={'Unnamed: 0': 'make'})

cars.head(3)

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	model
0	Mazda	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4	RX4
1	Mazda	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4	RX4 Wag
2	Datsun	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1	710

In [ ]:

print(cars.shape)
cars.dtypes

(32, 13)

Out[ ]:

make      object
mpg      float64
cyl        int64
disp     float64
hp         int64
drat     float64
wt       float64
qsec     float64
vs         int64
am         int64
gear       int64
carb       int64
model     object
dtype: object

Lets produce a groupby object using the make as the category and check it. It will not show data but rather than pandas object memory space.

In [ ]:

make_group = cars.groupby('make')
make_group

Out[ ]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000198E62F66D0>

Lets use the size method to get a count of the values. Note it does not sort the data so a sort_values() can be used.

Note that size() is much the same thing as value_counts() which we use on dataframes

In [ ]:

make_group.size().sort_values(ascending=False).head(5)

Out[ ]:

make
Merc      7
Hornet    2
Toyota    2
Mazda     2
Fiat      2
dtype: int64

We can get the first row from each group but using head(1).

Note head(2) will show the first two rows from each group.

I am using a second head() to just return the top five.

In [ ]:

make_group.head(1).head()

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	model
0	Mazda	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4	RX4
2	Datsun	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1	710
3	Hornet	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1	4 Drive
5	Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1	None
6	Duster	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4	360

Lets have a look at the first row from each of the categories in the groupby object using first().

This returns a dataframe with the 'make' as the index. It also sorted the values by this index.

In [ ]:

make_group.first(1).head()

Out[ ]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
make
AMC	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Cadillac	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4
Camaro	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Chrysler	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4
Datsun	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1

Another way to get the first from each group is the nth() method.

The good thing about nth() is it return all columns not just numerical columns like first()

In [ ]:

make_group.nth(0).head()

Out[ ]:

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	model
make
AMC	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2	Javelin
Cadillac	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4	Fleetwood
Camaro	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4	Z28
Chrysler	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4	Imperial
Datsun	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1	710

We can use the groups attrabute to see the groups. It returns the index values for all rows in the group. Some makes only have one model in the dataset.

In [ ]:

make_group.groups

Out[ ]:

{'AMC': [22], 'Cadillac': [14], 'Camaro': [23], 'Chrysler': [16], 'Datsun': [2], 'Dodge': [21], 'Duster': [6], 'Ferrari': [29], 'Fiat': [17, 25], 'Ford': [28], 'Honda': [18], 'Hornet': [3, 4], 'Lincoln': [15], 'Lotus': [27], 'Maserati': [30], 'Mazda': [0, 1], 'Merc': [7, 8, 9, 10, 11, 12, 13], 'Pontiac': [24], 'Porsche': [26], 'Toyota': [19, 20], 'Valiant': [5], 'Volvo': [31]}

So in a past post I discussed filtering. If want to filter for a single 'make' it could be done like this:-

In [ ]:

filt = (cars['make'] == 'Merc')
cars[filt]

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	gear	carb	model
7	Merc	24.4	4	146.7	62	3.69	3.19	20.0	1	4	2	240D
8	Merc	22.8	4	140.8	95	3.92	3.15	22.9	1	4	2	230
9	Merc	19.2	6	167.6	123	3.92	3.44	18.3	1	4	4	280
10	Merc	17.8	6	167.6	123	3.92	3.44	18.9	1	4	4	280C
11	Merc	16.4	8	275.8	180	3.07	4.07	17.4	0	3	3	450SE
12	Merc	17.3	8	275.8	180	3.07	3.73	17.6	0	3	3	450SL
13	Merc	15.2	8	275.8	180	3.07	3.78	18.0	0	3	3	450SLC

But if you want to do this for more than one model it would be slow.

So this is where groupby can come in handy.

Get all rows for the 'Merc' group by using the get_group() method.

Remember we made the groupby object by running: make_group = cars.groupby('make')

In [ ]:

make_group.get_group('Merc').head(3)

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	gear	carb	model
7	Merc	24.4	4	146.7	62	3.69	3.19	20.0	1	4	2	240D
8	Merc	22.8	4	140.8	95	3.92	3.15	22.9	1	4	2	230
9	Merc	19.2	6	167.6	123	3.92	3.44	18.3	1	4	4	280

But once you have the groupby object you can quickly get access to any/all the groups.

In [ ]:

make_group.get_group('Fiat')

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	model
17	Fiat	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1	128
25	Fiat	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1	X1-9

Show a list of the names of all the groups. Like doing a nunique on a series.

In [ ]:

make_group.groups.keys()

Out[ ]:

dict_keys(['AMC', 'Cadillac', 'Camaro', 'Chrysler', 'Datsun', 'Dodge', 'Duster', 'Ferrari', 'Fiat', 'Ford', 'Honda', 'Hornet', 'Lincoln', 'Lotus', 'Maserati', 'Mazda', 'Merc', 'Pontiac', 'Porsche', 'Toyota', 'Valiant', 'Volvo'])

Aggregation¶

So in the Pandas documentation it says "A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups."

This is referring to aggregation.

Lets use max on the 'mpg' column. This will give a pandas series for make_group, groupby object, and the max returns the largest values from 'mpg'

In [ ]:

make_group['mpg'].max().head()

Out[ ]:

make
AMC         15.2
Cadillac    10.4
Camaro      13.3
Chrysler    14.7
Datsun      22.8
Name: mpg, dtype: float64

Now lets make a groupby object with two columns

In [ ]:

power = cars.groupby(['cyl', 'hp'])

If we use the max() function now it will bring back the numerical columns as max for a multi index dataframe for grouping of 'cyl' and 'hp'. This will use a multiIndex.

So this is giving us the max power by make and cylinders.

In [ ]:

power.max().head(15)

Out[ ]:

		make	mpg	disp	drat	wt	qsec	vs	am	gear	carb	model
cyl	hp
4	52	Honda	30.4	75.7	4.93	1.615	18.52	1	1	4	2	Civic
	62	Merc	24.4	146.7	3.69	3.190	20.00	1	0	4	2	240D
	65	Toyota	33.9	71.1	4.22	1.835	19.90	1	1	4	1	Corolla
	66	Fiat	32.4	79.0	4.08	2.200	19.47	1	1	4	1	X1-9
	91	Porsche	26.0	120.3	4.43	2.140	16.70	0	1	5	2	914-2
	93	Datsun	22.8	108.0	3.85	2.320	18.61	1	1	4	1	710
	95	Merc	22.8	140.8	3.92	3.150	22.90	1	0	4	2	230
	97	Toyota	21.5	120.1	3.70	2.465	20.01	1	0	3	1	Corona
	109	Volvo	21.4	121.0	4.11	2.780	18.60	1	1	4	2	142E
	113	Lotus	30.4	95.1	3.77	1.513	16.90	1	1	5	2	Europa
6	105	Valiant	18.1	225.0	2.76	3.460	20.22	1	0	3	1	NaN
	110	Mazda	21.4	258.0	3.90	3.215	19.44	1	1	4	4	RX4 Wag
	123	Merc	19.2	167.6	3.92	3.440	18.90	1	0	4	4	280C
	175	Ferrari	19.7	145.0	3.62	2.770	15.50	0	1	5	6	Dino
8	150	Dodge	15.5	318.0	3.15	3.520	17.30	0	0	3	2	Javelin

Agg Method¶

The agg method can be used to give more control and allows different aggregation(s) on each column. You can even pass a list.

In [ ]:

power.agg({'mpg': 'max',
            'hp': ['max', 'mean']}).head(3)

Out[ ]:

		mpg	hp
		max	max	mean
cyl	hp
4	52	30.4	52	52
	62	24.4	62	62
	65	33.9	65	65

In [ ]:

power.agg(['sum', 'max']).head(3)

Out[ ]:

		make		mpg		disp		drat		wt		...	vs		am		gear		carb		model
		sum	max	sum	max	sum	max	sum	max	sum	max	...	sum	max	sum	max	sum	max	sum	max	sum	max
cyl	hp
4	52	Honda	Honda	30.4	30.4	75.7	75.7	4.93	4.93	1.615	1.615	...	1	1	1	1	4	4	2	2	Civic	Civic
	62	Merc	Merc	24.4	24.4	146.7	146.7	3.69	3.69	3.190	3.190	...	1	1	0	0	4	4	2	2	240D	240D
	65	Toyota	Toyota	33.9	33.9	71.1	71.1	4.22	4.22	1.835	1.835	...	1	1	1	1	4	4	1	1	Corolla	Corolla

3 rows × 22 columns

Iterating over groups¶

Sometimes you find you have no choice but to iterate over groups. Lets see how to do this.

Lets make a new datframe and give it the same column names as the original cleaned dataframe

In [ ]:

len(cars)

Out[ ]:

In [ ]:

cars2 = pd.DataFrame(columns=cars.columns)
cars2

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	model

Allows you to perform any action row by row. Note this will be slow on large dataframes.

I will use a Python for loop.

Using the cars dataframe again to get the highest 'mpg' form each group 'make'

In [ ]:

# Make_group is the groupby object
# make takes one make group at a time
# data will then iterate the make group

for make, data in make_group:
    highest_mpg_in_each_group = data.nlargest(1, 'mpg')
    cars2 = cars2.append(highest_mpg_in_each_group)
cars2

# lets sort by mpg
cars2.sort_values(by='mpg', ascending=False)

Out[ ]:

	make	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	model
19	Toyota	33.9	4	71.1	65	4.22	1.835	19.90	1	1	4	1	Corolla
17	Fiat	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1	128
27	Lotus	30.4	4	95.1	113	3.77	1.513	16.90	1	1	5	2	Europa
18	Honda	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2	Civic
26	Porsche	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2	914-2
7	Merc	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2	240D
2	Datsun	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1	710
3	Hornet	21.4	6	258.0	110	3.08	3.215	19.44	1	0	3	1	4 Drive
31	Volvo	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2	142E
0	Mazda	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4	RX4
29	Ferrari	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6	Dino
24	Pontiac	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2	Firebird
5	Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1	None
28	Ford	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4	Pantera L
21	Dodge	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2	Challenger
22	AMC	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2	Javelin
30	Maserati	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8	Bora
16	Chrysler	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4	Imperial
6	Duster	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4	360
23	Camaro	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4	Z28
14	Cadillac	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4	Fleetwood
15	Lincoln	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3	4	Continental

Right that is all for today. Hopefully this was interesting for you.

Transforming Datasets With Pandas

Joining Data In Pandas