Pandas is Magic

Pandas is a hugely popular Python library for doing data analysis. After having used it for around a year now I can safely say that it deserves all off the hype that it gets.

Pandas is mostly know for its use in the data science space. Since I'm in no way a data scientist I initially avoided it–I'm not working with big data after all–not even close. But the fact is that Pandas is so incredibly useful for working with tabular data of any kind and any size. And it's perfect for doing the kinds of ad-hoc reporting and mashing-together of spreadsheets that comprises much of my daily work.

There are a few "killer" features of Pandas in my opinion and, of these, only one or two are technical in nature. Mostly it just does simple things very very well. Here are some of my favorite features:

1. It is dead simple to read in data of almost any kind (JSON, CSV, Excel, SQL..):

Read in data about co-sponsors of climate bills

2. Pandas displays data in clean, well-formatted HTML tables. And you can slice and dice the data however you like. I had no appreciation for how much mental energy I was expending trying to make sense of data smashed together in the console back what I was using Ruby to work with data like this. Here I'm looking at a few columns for just Republican cosponsors:

Select fields for Republican cosponsors

3. Pandas makes grouping and summarizing data dead simple. Say we want counts for each of our bills:

Total counts for each bill

Or counts by party for each bill:

Counts by bill and party

4. If you need to bring in data from another spreadsheet or table, it's very simple to join it to your dataset based on a similar key. In this case we have a table of congressional member data that includes how often a member of congress votes against their party. For members who are cosponsoring these climate bills, let's add that percentage, their Twitter handle, and the year that they are next up for election:

A sample of our member spreadsheet

The common field is the id field in the member table, which corresponds to the cosponsor_id field in the cosponsor dataset:

Join cosponsor with member data

In this case we use pd.merge and pass it the cosponsor dataset ds and the member dataset member_ds second. the left_on and right_on arguments are saying which ID fields we're using for the join and the how=left means that we want all of the cosponsor records, even if they don't have a corresponding record in the member table. I did that here becuase my member dataset only has house members.

With the output you can see that we now have those new fields from the member table. This would be similar to the vlookup function to join together Excel tables.

Now that we have this joined data we can do really cool stuff, like see how many Democrat vs Republican members are on these bills, and how often they vote against their party:

Aggregate our newly-joined fields

5. Lasty, Pandas is wicked fast. Python is not known to be the most highly-performant language out there, so how does it accomplish this speed and efficiency? The trick is that Pandas is based on NumPy, which is a numerical analysis library. NumPy does a lot of its actual work in C rather than Python. That means that it can be very memory-efficient, and can do highly concurrent operations. From what I understand (and we're outside my area of expertise here) when you do things like index on a Pandas column, or apply some operation on it, NumPy is vectorizing that process across all of the rows.

Since Pandas is so widely-used, there are a huge number of libraries and utilities that work with it. In particular graphing and data visualization libraries can easily take in Pandas dataframes and graph them out. One of my favorites is Altair. Let's graph out our co-sponsor dataset, looking at the count of cosponors on each bill:

Altair visualization of cosponsor counts for each bill

Not bad for one line right?

I could go on and on but this gives you a small idea of the kinds of things you can do with Pandas. As I mentioned above I'm no data expert–I use a tiny fraction of the features that this library offers–but you can see how quickly and easily we can read in, look at, and manipulate data for these simple scenarios. When searching for resources and tutorials on Pandas, I found that much of it was aimed at data scientists and was tough to follow. One of the best courses that I found was Move from Excel to Python with Pandas by Talk Python. It does a great job of walking through scenarios like this one, and shows the what the corresponding grouping/joining/pivoting operations would be within Excel.

Note: Data here was taken from the ProPublica Congress API