Pandas Split

I use the open source Python pandas library frequently for data processing, analysis, plotting, and all the things.

A function I use often is the GroupBy function, which pandas creator Wes McKinney defines as a method to “split-apply-combine”. It allows the user to split the data into subsets according to the values in one or more selected columns, apply some function to each subset separately, then combine the results back into a single object holding the aggregated values.
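As a rough sketch of split-apply-combine: the toy DataFrame and its column names below are made up for illustration and are not the dataset used later in this post.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch
df = pd.DataFrame({
    "Year": [2013, 2013, 2014, 2014, 2015],
    "Unemployed": [10, 20, 30, 40, 50],
})

# Split the rows by Year, apply sum() to each subset,
# and combine the per-year results into one Series indexed by Year
totals = df.groupby("Year")["Unemployed"].sum()
print(totals)
```

The result has one entry per distinct year, which is the "combine" part: the separate subsets are gone and only the aggregated values remain.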

In essence, we’re performing a map-reduce, where the first two steps (split & apply) are the mapping step and the last step (combine) is the reduce step.

Alright. But the reason I’m writing this lightning post is to share my excitement about a GroupBy feature I had never seen before.

Here goes.

My use case: I’ll often have a dataset which I’ll want to subset according to values in one or more columns, and manipulate each of the subsets individually.

GroupBy solution: Create a dictionary from the GroupBy object!

Example: Unemployment data from Spain, years 2013-2015

(Source)

Here are the first 10 rows:

Let’s say I want to subset the data by year. There are a few ways to do this in pandas, but with GroupBy… ONE ELEGANT LINE:

year_subsets = dict(list(new_df.groupby('Year')))

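The resulting year_subsets is a dictionary whose keys are the years (2013, 2014, 2015) and whose values are DataFrames containing the rows from each year. For example, year_subsets[2014] returns a DataFrame object (use year_subsets['2014'] instead if the Year column holds strings rather than integers).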
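Here is a minimal, self-contained sketch of the same one-liner. The stand-in new_df below is invented for illustration (hypothetical Region and Unemployed columns, integer years) and does not reproduce the actual Spanish dataset.

```python
import pandas as pd

# Hypothetical stand-in for new_df; the real dataset is not reproduced here
new_df = pd.DataFrame({
    "Year": [2013, 2014, 2013, 2015, 2014],
    "Region": ["A", "A", "B", "B", "C"],
    "Unemployed": [10, 20, 30, 40, 50],
})

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs,
# so list() gives a list of pairs that dict() can consume directly
year_subsets = dict(list(new_df.groupby("Year")))

# Each value is a plain DataFrame that keeps the original row labels
subset_2014 = year_subsets[2014]
print(subset_2014)
print(list(subset_2014.index))

# Grouping on several columns works the same way; the keys become tuples
region_year_subsets = dict(list(new_df.groupby(["Year", "Region"])))
print(len(region_year_subsets))
```

The list() call is doing real work here: passing the GroupBy object straight to dict() fails, because dict() sees the object's keys attribute and tries to treat it as a mapping.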

There are alternative ways to subset data in pandas, but this is the only one-liner I’ve seen that returns the data in its original format, with its original indices.
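For comparison, here is a sketch of two of those alternatives for pulling out a single year, again on an invented stand-in for new_df. Both give the same rows and indices as the dictionary entry, but you have to repeat them once per year.

```python
import pandas as pd

# Hypothetical stand-in for new_df, invented for this sketch
new_df = pd.DataFrame({
    "Year": [2013, 2014, 2013, 2015, 2014],
    "Unemployed": [10, 20, 30, 40, 50],
})

# Alternative 1: boolean mask, one expression per year
mask_2014 = new_df[new_df["Year"] == 2014]

# Alternative 2: ask the GroupBy object for a single group
group_2014 = new_df.groupby("Year").get_group(2014)

# The dictionary approach builds every subset in one go
year_subsets = dict(list(new_df.groupby("Year")))

print(mask_2014.equals(group_2014))
print(mask_2014.equals(year_subsets[2014]))
```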

March 21, 2016