SQL → Pandas: Cheatsheet

Guha Ayan
6 min read · Dec 21, 2020

Here is a comprehensive list of common SQL operations and how to convert them to Pandas. I think in terms of SQL, and it comes naturally to me, so I keep this cheat sheet handy to translate my SQL thoughts to Pandas without much fuss.

There are lots of SQL-to-Pandas articles out there on Stack Overflow and elsewhere. However, I could not find one that covers all the common operations, hence this one.

So, first things first: let’s get a dataset. I live in Melbourne, Australia, so I used the Melbourne Pedestrian Sensor Data and Sensor Location Data.

import pandas as pd
import numpy as np
import os

Once downloaded, you will have two CSV files:

files = [f for f in os.listdir(".") if f.endswith('.csv')]
print(files)
['Pedestrian_Counting_System_-_Sensor_Locations.csv', 'Pedestrian_Counting_System_-_Monthly__counts_per_hour_.csv']

Now, let’s import the data and start rolling.

Import Data and Cast Timestamp

tdf = pd.read_csv("Pedestrian_Counting_System_-_Monthly__counts_per_hour_.csv")
# Date_Time arrives as text like "12/21/2020 05:00:00 PM", hence the format string
tdf['dt'] = pd.to_datetime(tdf.Date_Time, format="%m/%d/%Y %I:%M:%S %p")
tdf.head()

Also, read the sensor metadata:
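sdf = pd.read_csv("Pedestrian_Counting_System_-_Sensor_Locations.csv")
sdf.head()

Note the metadata columns are lowercase (sensor_id, sensor_name and so on), unlike the counts file; this matters for the joins later.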

Projection, i.e. SELECT Columns

(Remember to use two sets of square brackets; more on this below.)

ndf = tdf[['Date_Time', 'Sensor_Name', 'Hourly_Counts' ]]
ndf.head(10)

SQL

SELECT Date_Time, Sensor_Name, Hourly_Counts FROM T
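Why two brackets? A single bracket with a column name returns a Series, while a list inside the brackets returns a DataFrame:

s = tdf['Sensor_Name']    # a pandas Series
d = tdf[['Sensor_Name']]  # a single-column DataFrame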

Filter — Single/Multiple Clause

(Remember that equality filters use the Python-style double equals, ==, not SQL’s single =.)

wdf = tdf[tdf.Hourly_Counts > 300]
ndf = wdf[['dt', 'Sensor_Name', 'Hourly_Counts' ]]

Multiple Clauses.

(Remember to join them using & for AND and | for OR, and to wrap each clause in parentheses.)

wdf = tdf[(tdf.Hourly_Counts > 300) & (tdf.Hourly_Counts <= 1000)]
ndf = wdf[['Date_Time', 'Sensor_Name', 'Hourly_Counts' ]]
ndf.head()

SQL

SELECT Date_Time, Sensor_Name, Hourly_Counts 
FROM T
WHERE Hourly_Counts > 300
AND Hourly_Counts <= 1000
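If you want something even closer to a WHERE clause, df.query accepts the whole condition as a string:

ndf = tdf.query('Hourly_Counts > 300 and Hourly_Counts <= 1000')[['Date_Time', 'Sensor_Name', 'Hourly_Counts']]
ndf.head()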

Group By Distinct

It is a very common operation to get a distinct list of an attribute’s values using GROUP BY in SQL. In Pandas, there is a dedicated function for that:

u = tdf.Sensor_Name.unique()

SQL

SELECT Sensor_Name
FROM T
GROUP BY Sensor_Name
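unique works on a single column. For a distinct list over multiple columns (SELECT DISTINCT Sensor_ID, Sensor_Name), drop_duplicates does the job:

u2 = tdf[['Sensor_ID', 'Sensor_Name']].drop_duplicates()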

Group By — Aggregate

One of the key uses of SQL is to group by and aggregate. Here is how it is written in Pandas. Note that the equivalent of COUNT(DISTINCT) is named nunique.

gdf = tdf.groupby(tdf.Year)[['Hourly_Counts', 'Sensor_Name']]
gdf = gdf.agg(
    min_hc=pd.NamedAgg(column='Hourly_Counts', aggfunc='min'),
    max_hc=pd.NamedAgg(column='Hourly_Counts', aggfunc='max'),
    count_distinct_Sensor_Name=pd.NamedAgg(column='Sensor_Name', aggfunc='nunique')
)
gdf.head()

SQL

SELECT Year, 
MIN(Hourly_Counts) min_hc,
MAX(Hourly_Counts) max_hc,
COUNT(DISTINCT Sensor_Name) count_distinct_Sensor_Name
FROM T
GROUP BY Year
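One difference from the SQL result: groupby leaves Year in the index rather than as a column. A reset_index restores the SQL-like shape:

gdf = gdf.reset_index()  # Year becomes a regular column, as in the SQL output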

Order By — ASC/DESC

This one is pretty simple. Just remember to pass a list of columns to sort_values.

odf = tdf.sort_values(by=['dt'], ascending=True)   # ORDER BY dt ASC
odf = tdf.sort_values(by=['dt'], ascending=False)  # ORDER BY dt DESC
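To mix directions across columns, pass a matching list of booleans:

odf = tdf.sort_values(by=['Sensor_Name', 'dt'], ascending=[True, False])

SQL

SELECT * FROM T ORDER BY dt ASC
SELECT * FROM T ORDER BY dt DESC
SELECT * FROM T ORDER BY Sensor_Name ASC, dt DESC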

CASE WHEN Expressions

CASE WHEN statements are very useful for implementing inline business logic. To implement this in Pandas, we will take the help of NumPy.

Let us define the business logic. Assume we want to create a column “Time of Day”, defined as follows: hours 0–7 = TOD-1, 8–15 = TOD-2, 16–23 = TOD-3.

Here is the Pandas version. Essentially it is a mapping between conditions and outcomes. Pretty reasonable.

rdf = tdf[['dt', 'Year', 'Month', 'Mdate', 'Time', 'Sensor_ID', 'Sensor_Name', 'Hourly_Counts']]
rdf = rdf.sort_values(by=['dt'])
rdf['tod'] = np.select(
    [
        # On pandas >= 1.3, pass inclusive='both' instead of inclusive=True
        rdf['Time'].between(0, 7, inclusive=True),
        rdf['Time'].between(8, 15, inclusive=True),
        rdf['Time'].between(16, 23, inclusive=True)
    ],
    [
        'TOD-1',
        'TOD-2',
        'TOD-3'
    ],
    default='Unknown'
)
rdf.head()

SQL Version

SELECT dt, Year, Month, Mdate, Time, Sensor_ID, Sensor_Name,
Hourly_Counts,
CASE WHEN TIME BETWEEN 0 AND 7 THEN 'TOD-1'
WHEN TIME BETWEEN 8 AND 15 THEN 'TOD-2'
WHEN TIME BETWEEN 16 AND 23 THEN 'TOD-3'
ELSE 'Unknown' END tod
FROM T
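Since these particular conditions are just contiguous numeric ranges, pd.cut is a more compact alternative. The bin edges below are mine, chosen to reproduce the ranges above; note that cut yields NaN (not 'Unknown') for values outside the bins:

rdf['tod'] = pd.cut(rdf['Time'], bins=[-1, 7, 15, 23], labels=['TOD-1', 'TOD-2', 'TOD-3'])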

JOINs — INNER/OUTER

Pandas has a join API, but it is restrictive in terms of syntax. I find the df.merge syntax closer to SQL and thus relatively easier to grasp.

To understand the following code, first let us find out whether there is any sensor in the metadata that is not used in the dataset. (Depending on when you download the data, your result may differ a bit.)

u = tdf.Sensor_ID.unique()
u1 = sdf.sensor_id.unique()
missing = [x for x in u1 if x not in u]
print(missing)

As it turns out, Sensor ID = 72 is not used.
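The same check reads a little cleaner as a set difference:

missing = sorted(set(u1) - set(u))
print(missing)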

Now, let us aggregate the pedestrian count data at the Sensor ID level.

gdf = tdf.groupby(['Sensor_ID'])['Hourly_Counts'].sum()
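Note that gdf here is a Series named Hourly_Counts, indexed by Sensor_ID; pd.merge accepts a named Series and matches right_on against the index level. If you prefer to join on an ordinary column, reset the index first:

gdf = gdf.reset_index()  # Sensor_ID and Hourly_Counts become regular columns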

Now, let us join sensor metadata with pedestrian data

rdf = pd.merge(sdf,gdf,left_on='sensor_id', right_on='Sensor_ID', how='inner' ).sort_values(by=['sensor_id'])
rdf.head(80)

As expected, there is no Sensor ID = 72 in the result of the INNER join.

SQL Version

SELECT s.sensor_description, s.sensor_id, s.sensor_name, h.hc
FROM SENSORS s INNER JOIN
(SELECT Sensor_ID, SUM(Hourly_Counts) hc
FROM T
GROUP BY Sensor_ID) h
ON s.sensor_id = h.Sensor_ID

And, of course, it shows up in a LEFT OUTER join, with NaN for the aggregated count:

rdf = pd.merge(sdf,gdf,left_on='sensor_id', right_on='Sensor_ID', how='left' ).sort_values(by=['sensor_id'])
rdf.head(80)

SQL Version

SELECT s.sensor_description, s.sensor_id, s.sensor_name, h.hc
FROM SENSORS s LEFT OUTER JOIN
(SELECT Sensor_ID, SUM(Hourly_Counts) hc
FROM T
GROUP BY Sensor_ID) h
ON s.sensor_id = h.Sensor_ID
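To keep only the unmatched sensor, the Pandas equivalent of the LEFT JOIN ... WHERE h.hc IS NULL pattern is an isna filter on the joined column:

rdf[rdf['Hourly_Counts'].isna()]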

Analytical Query — Partition By

PARTITION BY queries are very useful in the SQL world. Being able to simulate them in Pandas is extremely liberating.

fdf = tdf[((tdf.Sensor_Name == 'Alfred Place') | (tdf.Sensor_Name == 'Queen St (West)')) & (tdf.Year == 2020) & (tdf.Month == 'November') & (tdf.Mdate == 1)]

## First, you have to sort the result
fdf = fdf.sort_values(by=['Year', 'Month', 'Mdate', 'Time']).drop(columns=['ID', 'Date_Time', 'Sensor_ID'])

## Lag/Lead
fdf['Lag'] = fdf.groupby(['Sensor_Name'])['Hourly_Counts'].shift(1)
fdf['Lead'] = fdf.groupby(['Sensor_Name'])['Hourly_Counts'].shift(-1)

## Cumulative SUM
fdf['cumsum'] = fdf.groupby(['Sensor_Name'])['Hourly_Counts'].cumsum()

## Total SUM
fdf['Total'] = fdf.groupby(['Sensor_Name'])['Hourly_Counts'].transform('sum')

## Cumulative SUM over 3 preceding rows plus the current row
## (a rolling window of 4 matches ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
fdf['cumsum_3p'] = fdf.groupby(['Sensor_Name'])['Hourly_Counts'].rolling(4, min_periods=1).sum()\
    .reset_index(drop=True, level=0)
fdf.head(20)

This one requires you to study the results a bit closely.

SQL Version

SELECT dt, Year, Month, Mdate, Time, Sensor_Name, Hourly_Counts,
LAG(Hourly_Counts) OVER (PARTITION BY Sensor_Name ORDER BY dt) lag,
LEAD(Hourly_Counts) OVER (PARTITION BY Sensor_Name ORDER BY dt) lead,
SUM(Hourly_Counts) OVER (PARTITION BY Sensor_Name ORDER BY dt) cumsum,
SUM(Hourly_Counts) OVER (PARTITION BY Sensor_Name) total,
SUM(Hourly_Counts) OVER (PARTITION BY Sensor_Name ORDER BY dt ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) cumsum_3p
FROM T
WHERE Sensor_Name IN ('Alfred Place', 'Queen St (West)')
AND Year = 2020
AND Month = 'November'
AND Mdate = 1

And I dare say that once you are more comfortable with the Pandas syntax, it starts to feel more natural. This section is heavily adapted from here.

So, here it is. If you are like me and write SQL for a living but want to explore something different, I hope this article helps. Please feel free to let me know if there is any other type of SQL you want me to cover or add.
