Skip to main content

JupySQL and chDB

JupySQL is a Python library that lets you run SQL in Jupyter notebooks and the iPython shell. In this guide, we're going to learn how to query data using chDB and JupySQL.

Setup

Let's first create a virtual environment:

python -m venv .venv
source .venv/bin/activate

And then, we'll install JupySQL, iPython, and Jupyter Lab:

pip install jupysql ipython jupyterlab

We can use JupySQL in iPython, which we can launch by running:

ipython

Or in Jupyter Lab, by running:

jupyter lab
Note

If you're using Jupyter Lab, you'll need to create a notebook before following the rest of the guide.

Downloading a dataset

We're going to use one of Jeff Sackmann's tennis_atp dataset, which contains metadata about players and their rankings over time. Let's start by downloading the rankings files:

from urllib.request import urlretrieve
files = ['00s', '10s', '20s', '70s', '80s', '90s', 'current']
base = "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master"
for file in files:
_ = urlretrieve(
f"{base}/atp_rankings_{file}.csv",
f"atp_rankings_{file}.csv",
)

Configuring chDB and JupySQL

Next, let's import chDB's dbapi module:

from chdb import dbapi

And we'll create a chDB connection. Any data that we persist will be saved to the atp.chdb directory:

conn = dbapi.connect(path="atp.chdb")

Let's now load the sql magic and create a connection to chDB:

%load_ext sql
%sql conn --alias chdb

Next, we'll display the display limit so that results of queries won't be truncated:

%config SqlMagic.displaylimit = None

## Querying data in CSV files

We've downloaded a bunch of files with the atp_rankings prefix. Let's use the DESCRIBE clause to understand the schema:

%%sql
DESCRIBE file('atp_rankings*.csv')
SETTINGS describe_compact_output=1,
schema_inference_make_columns_nullable=0
+--------------+-------+
| name | type |
+--------------+-------+
| ranking_date | Int64 |
| rank | Int64 |
| player | Int64 |
| points | Int64 |
+--------------+-------+

We can also write a SELECT query directly against these files to see what the data looks like:

%sql SELECT * FROM file('atp_rankings*.csv') LIMIT 1
+--------------+------+--------+--------+
| ranking_date | rank | player | points |
+--------------+------+--------+--------+
| 20000110 | 1 | 101736 | 4135 |
+--------------+------+--------+--------+

The format of the data is a bit weird. Let's clean that date up and use the REPLACE clause to return the cleaned up ranking_date:

%%sql
SELECT * REPLACE (
toDate(parseDateTime32BestEffort(toString(ranking_date))) AS ranking_date
)
FROM file('atp_rankings*.csv')
LIMIT 10
SETTINGS schema_inference_make_columns_nullable=0
+--------------+------+--------+--------+
| ranking_date | rank | player | points |
+--------------+------+--------+--------+
| 2000-01-10 | 1 | 101736 | 4135 |
| 2000-01-10 | 2 | 102338 | 2915 |
| 2000-01-10 | 3 | 101948 | 2419 |
| 2000-01-10 | 4 | 103017 | 2184 |
| 2000-01-10 | 5 | 102856 | 2169 |
| 2000-01-10 | 6 | 102358 | 2107 |
| 2000-01-10 | 7 | 102839 | 1966 |
| 2000-01-10 | 8 | 101774 | 1929 |
| 2000-01-10 | 9 | 102701 | 1846 |
| 2000-01-10 | 10 | 101990 | 1739 |
+--------------+------+--------+--------+

Importing CSV files into chDB

Now we're going to store the data from these CSV files in a table. The default database doesn't persist data on disk, so we need to create another database first:

%sql CREATE DATABASE atp

And now we're going to create a table called rankings whos schema will be derived from the structure of the data in the CSV files:

%%sql
CREATE TABLE atp.rankings
ENGINE=MergeTree
ORDER BY ranking_date AS
SELECT * REPLACE (
toDate(parseDateTime32BestEffort(toString(ranking_date))) AS ranking_date
)
FROM file('atp_rankings*.csv')
SETTINGS schema_inference_make_columns_nullable=0

Let's do a quick check on the data in our table:

%sql SELECT * FROM atp.rankings LIMIT 10
+--------------+------+--------+--------+
| ranking_date | rank | player | points |
+--------------+------+--------+--------+
| 2000-01-10 | 1 | 101736 | 4135 |
| 2000-01-10 | 2 | 102338 | 2915 |
| 2000-01-10 | 3 | 101948 | 2419 |
| 2000-01-10 | 4 | 103017 | 2184 |
| 2000-01-10 | 5 | 102856 | 2169 |
| 2000-01-10 | 6 | 102358 | 2107 |
| 2000-01-10 | 7 | 102839 | 1966 |
| 2000-01-10 | 8 | 101774 | 1929 |
| 2000-01-10 | 9 | 102701 | 1846 |
| 2000-01-10 | 10 | 101990 | 1739 |
+--------------+------+--------+--------+

Looks good - the output, as expected, is the same as when querying the CSV files directly.

We're going to follow the same process for the player metadata. This time the data is all in a single CSV file, so let's download that file:

_ = urlretrieve(
f"{base}/atp_players.csv",
"atp_players.csv",
)

And then create a table called players based on the content of the CSV file. We'll also clean up the dob field so that its a Date32 type.

In ClickHouse, the Date type only supports dates from 1970 onwards. Since the dob column contains dates from before 1970, we'll use the Date32 type instead.

%%sql
CREATE TABLE atp.players
Engine=MergeTree
ORDER BY player_id AS
SELECT * REPLACE (
makeDate32(
toInt32OrNull(substring(toString(dob), 1, 4)),
toInt32OrNull(substring(toString(dob), 5, 2)),
toInt32OrNull(substring(toString(dob), 7, 2))
)::Nullable(Date32) AS dob
)
FROM file('atp_players.csv')
SETTINGS schema_inference_make_columns_nullable=0

Once that's finished running, we can have a look at the data we've ingested:

%sql SELECT * FROM atp.players LIMIT 10
+-----------+------------+-----------+------+------------+-----+--------+-------------+
| player_id | name_first | name_last | hand | dob | ioc | height | wikidata_id |
+-----------+------------+-----------+------+------------+-----+--------+-------------+
| 100001 | Gardnar | Mulloy | R | 1913-11-22 | USA | 185 | Q54544 |
| 100002 | Pancho | Segura | R | 1921-06-20 | ECU | 168 | Q54581 |
| 100003 | Frank | Sedgman | R | 1927-10-02 | AUS | 180 | Q962049 |
| 100004 | Giuseppe | Merlo | R | 1927-10-11 | ITA | 0 | Q1258752 |
| 100005 | Richard | Gonzalez | R | 1928-05-09 | USA | 188 | Q53554 |
| 100006 | Grant | Golden | R | 1929-08-21 | USA | 175 | Q3115390 |
| 100007 | Abe | Segal | L | 1930-10-23 | RSA | 0 | Q1258527 |
| 100008 | Kurt | Nielsen | R | 1930-11-19 | DEN | 0 | Q552261 |
| 100009 | Istvan | Gulyas | R | 1931-10-14 | HUN | 0 | Q51066 |
| 100010 | Luis | Ayala | R | 1932-09-18 | CHI | 170 | Q1275397 |
+-----------+------------+-----------+------+------------+-----+--------+-------------+

Querying chDB

Data ingestion is done, now it's time for the fun part - querying the data!

Tennis players receive points based on how well they perform in the tournaments they play. The points for each player over a 52 week rolling period. We're going to write a query that finds the maximum points accumulate by each player along with their ranking at the time:

%%sql
SELECT name_first, name_last,
max(points) as maxPoints,
argMax(rank, points) as rank,
argMax(ranking_date, points) as date
FROM atp.players
JOIN atp.rankings ON rankings.player = players.player_id
GROUP BY ALL
ORDER BY maxPoints DESC
LIMIT 10
+------------+-----------+-----------+------+------------+
| name_first | name_last | maxPoints | rank | date |
+------------+-----------+-----------+------+------------+
| Novak | Djokovic | 16950 | 1 | 2016-06-06 |
| Rafael | Nadal | 15390 | 1 | 2009-04-20 |
| Andy | Murray | 12685 | 1 | 2016-11-21 |
| Roger | Federer | 12315 | 1 | 2012-10-29 |
| Daniil | Medvedev | 10780 | 2 | 2021-09-13 |
| Carlos | Alcaraz | 9815 | 1 | 2023-08-21 |
| Dominic | Thiem | 9125 | 3 | 2021-01-18 |
| Jannik | Sinner | 8860 | 2 | 2024-05-06 |
| Stefanos | Tsitsipas | 8350 | 3 | 2021-09-20 |
| Alexander | Zverev | 8240 | 4 | 2021-08-23 |
+------------+-----------+-----------+------+------------+

It's quite interesting that some of the players in this list accumulated a lot of points without being number 1 with that points total.

Saving queries

We can save queries using the --save parameter on the same line as the %%sql magic. The --no-execute parameter means that query execution will be skipped.

%%sql --save best_points --no-execute
SELECT name_first, name_last,
max(points) as maxPoints,
argMax(rank, points) as rank,
argMax(ranking_date, points) as date
FROM atp.players
JOIN atp.rankings ON rankings.player = players.player_id
GROUP BY ALL
ORDER BY maxPoints DESC

When we run a saved query it will be converted into a Common Table Expression (CTE) before executing. In the following query we compute the maximum points achieved by players when they were ranked 1:

%sql select * FROM best_points WHERE rank=1
+-------------+-----------+-----------+------+------------+
| name_first | name_last | maxPoints | rank | date |
+-------------+-----------+-----------+------+------------+
| Novak | Djokovic | 16950 | 1 | 2016-06-06 |
| Rafael | Nadal | 15390 | 1 | 2009-04-20 |
| Andy | Murray | 12685 | 1 | 2016-11-21 |
| Roger | Federer | 12315 | 1 | 2012-10-29 |
| Carlos | Alcaraz | 9815 | 1 | 2023-08-21 |
| Pete | Sampras | 5792 | 1 | 1997-08-11 |
| Andre | Agassi | 5652 | 1 | 1995-08-21 |
| Lleyton | Hewitt | 5205 | 1 | 2002-08-12 |
| Gustavo | Kuerten | 4750 | 1 | 2001-09-10 |
| Juan Carlos | Ferrero | 4570 | 1 | 2003-10-20 |
| Stefan | Edberg | 3997 | 1 | 1991-02-25 |
| Jim | Courier | 3973 | 1 | 1993-08-23 |
| Ivan | Lendl | 3420 | 1 | 1990-02-26 |
| Ilie | Nastase | 0 | 1 | 1973-08-27 |
+-------------+-----------+-----------+------+------------+

Querying with parameters

We can also use parameters in our queries. Parameters are just normal variables:

rank = 10

And then we can use the {{variable}} syntax in our query. The following query finds the players who had the least number of days between when they first had a ranking in the top 10 and last had a ranking in the top 10:

%%sql
SELECT name_first, name_last,
MIN(ranking_date) AS earliest_date,
MAX(ranking_date) AS most_recent_date,
most_recent_date - earliest_date AS days,
1 + (days/7) AS weeks
FROM atp.rankings
JOIN atp.players ON players.player_id = rankings.player
WHERE rank <= {{rank}}
GROUP BY ALL
ORDER BY days
LIMIT 10
+------------+-----------+---------------+------------------+------+-------+
| name_first | name_last | earliest_date | most_recent_date | days | weeks |
+------------+-----------+---------------+------------------+------+-------+
| Alex | Metreveli | 1974-06-03 | 1974-06-03 | 0 | 1 |
| Mikael | Pernfors | 1986-09-22 | 1986-09-22 | 0 | 1 |
| Felix | Mantilla | 1998-06-08 | 1998-06-08 | 0 | 1 |
| Wojtek | Fibak | 1977-07-25 | 1977-07-25 | 0 | 1 |
| Thierry | Tulasne | 1986-08-04 | 1986-08-04 | 0 | 1 |
| Lucas | Pouille | 2018-03-19 | 2018-03-19 | 0 | 1 |
| John | Alexander | 1975-12-15 | 1975-12-15 | 0 | 1 |
| Nicolas | Massu | 2004-09-13 | 2004-09-20 | 7 | 2 |
| Arnaud | Clement | 2001-04-02 | 2001-04-09 | 7 | 2 |
| Ernests | Gulbis | 2014-06-09 | 2014-06-23 | 14 | 3 |
+------------+-----------+---------------+------------------+------+-------+

Plotting histograms

JupySQL also has limited charting functionality. We can create box plots or histograms.

We're going to create a histogram, but first let's write (and save) a query that computes the rankings within the top 100 that each player has achieved. We'll be able to use this to create a histogram that counts how many players achieved each ranking:

%%sql --save players_per_rank --no-execute
select distinct player, rank
FROM atp.rankings
WHERE rank <= 100

We can then create a histogram by running the following:

from sql.ggplot import ggplot, geom_histogram, aes

plot = (
ggplot(
table="players_per_rank",
with_="players_per_rank",
mapping=aes(x="rank", fill="#69f0ae", color="#fff"),
) + geom_histogram(bins=100)
)
Migrating Self-managed ClickHouse