We can’t be waiting around for five seconds for a database to respond.
When we are young we believe the SQL execution engine will handle everything and make sure we get our results on time. Sadly, this is not the case, at least in my experience.
I’ll walk through the evolution of one SQL query, from its initial formulation through optimization and finally caching with Redis, observing the improvement we get at each step along the way.
Background on the data
The US Jobs Data project is my open-source data-cleaning/interface project for the Bureau of Labor Statistics (BLS) Occupational Employment Statistics (OES). It provides a script to generate a SQLite database of all of the OES data going back to 1999, which is otherwise stored in disparate Excel files with slightly different formats. You can generate the SQLite database using the repo linked above, or download the completed file here.
We want to select all the jobs data:
- For a specific location (Arizona, in this example)
- For occupation categories, not specific occupations (occupation code ends in four 0’s)
- All years
- All industries (industry code is ‘000000’)
In this case, the ‘input’ to our query is a URL slug (“arizona.html”), which exists in a sitemap table. We can join the area_code field of the sitemap table to our other dimension tables, and ultimately to our fact table (‘value’), to get what we want.
```sql
SELECT o.code, o.name, v.year, v.value
FROM value v, series_code sc, occupation_code o, sitemap sm
WHERE v.series_code = sc.code
  AND sc.occupation_code LIKE '__0000'
  AND sm.area_code = sc.area_code
  AND sm.slug = "arizona.html"
  AND sc.industry_code = '000000'
  AND sc.data_type = '01'
  AND o.code = sc.occupation_code
ORDER BY 1 ASC, 2 ASC;
```
Using the query planner
The query planner tells us how the SQL execution engine will carry out this query and is accessed as follows:
```sql
EXPLAIN QUERY PLAN [rest of query goes here]
```
And this is the result:
```
QUERY PLAN
|--SCAN TABLE value AS v
|--SEARCH TABLE series_code AS sc USING INDEX sqlite_autoindex_series_code_1 (code=?)
|--SEARCH TABLE occupation_code AS o USING AUTOMATIC COVERING INDEX (code=?)
|--SEARCH TABLE sitemap AS sm USING AUTOMATIC PARTIAL COVERING INDEX (slug=? AND area_code=?)
`--USE TEMP B-TREE FOR ORDER BY
```
Here, SCAN means the whole table must be scanned, without the benefit of any INDEX, while SEARCH indicates the use of an index. In this case, the query planner is not using an index on the value table, which is our fact table and by far the largest table (~4.5 million rows).
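You can experiment with the query planner directly from Python’s built-in sqlite3 module. This toy example (the table and index names here are made up for illustration, not taken from the jobs database) shows a SCAN turning into a SEARCH once an index exists:

```python
import sqlite3

# In-memory toy database; schema is illustrative only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (series_code TEXT, value REAL)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows end with a human-readable detail column.
    return [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql)]

# Without an index, the whole table must be scanned.
print(plan("SELECT * FROM fact WHERE series_code = 'X'"))  # SCAN ...

con.execute("CREATE INDEX fact_code ON fact (series_code)")

# With the index in place, the planner switches to a SEARCH.
print(plan("SELECT * FROM fact WHERE series_code = 'X'"))  # SEARCH ... USING INDEX ...
```

Note that the exact wording of the plan text varies between SQLite versions (e.g. “SCAN TABLE fact” vs. “SCAN fact”), so match on SCAN/SEARCH rather than the full string.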
There is an INDEX over value (on the series_code column), and I was not sure why the query planner does not engage it, so I consulted the SQLite documentation about indexes.
Refactoring the query
If I interpret this correctly, an index cannot be engaged for a comparison between two columns, which is not something I would have expected.
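A quick same-table illustration of that rule (toy schema, assumed for the example): comparing an indexed column to a constant engages the index, but comparing it to another column of the same row forces a scan.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INT, b INT)")
con.execute("CREATE INDEX t_a ON t (a)")

def detail(sql):
    # Join the readable detail column of each EXPLAIN QUERY PLAN row.
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

print(detail("SELECT * FROM t WHERE a = 5"))  # SEARCH ... USING INDEX t_a
print(detail("SELECT * FROM t WHERE a = b"))  # SCAN ... (the index cannot help)
```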
In order to engage the index on the value table, I use a more explicit IN (subquery) formulation:
```sql
EXPLAIN QUERY PLAN
SELECT o.code, o.name, v.year, v.value
FROM value v, series_code sc, occupation_code o
WHERE v.series_code IN (
    SELECT sc.code
    FROM series_code sc, sitemap sm
    WHERE sc.occupation_code LIKE '__0000'
      AND sm.area_code = sc.area_code
      AND sm.slug = "arizona.html"
      AND sc.industry_code = '000000'
      AND sc.data_type = '01'
)
  AND v.series_code = sc.code
  AND o.code = sc.occupation_code
ORDER BY 1 ASC, 2 ASC;
```

```
QUERY PLAN
|--SEARCH TABLE series_code AS sc USING INDEX sqlite_autoindex_series_code_1 (code=?)
|--LIST SUBQUERY 1
|  |--SCAN TABLE series_code AS sc
|  `--SEARCH TABLE sitemap AS sm USING AUTOMATIC PARTIAL COVERING INDEX (area_code=? AND slug=?)
|--SEARCH TABLE value AS v USING INDEX value_code_index (series_code=?)
|--REUSE LIST SUBQUERY 1
|--SEARCH TABLE occupation_code AS o USING AUTOMATIC COVERING INDEX (code=?)
`--USE TEMP B-TREE FOR ORDER BY
```
And we can see that we are now using the index over the value table.
Improvement in performance
The initial formulation, as a single flat query, runs in about 5 seconds. The second formulation, with the IN (subquery) structure, runs in roughly 0.5 seconds.
The faster query returns exactly the same information in one tenth the time.
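Timings like these are easy to reproduce with a small wall-clock harness; this helper is mine, not part of the project, and simply takes the best of a few runs to smooth out noise:

```python
import sqlite3
import time

def time_query(con, sql, repeats=3):
    """Run `sql` a few times and return the best wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        con.execute(sql).fetchall()  # fetchall() forces full materialization
        best = min(best, time.perf_counter() - start)
    return best

# Toy demonstration on an in-memory table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INT)")
con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
print(f"{time_query(con, 'SELECT COUNT(*) FROM t'):.6f}s")
```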
In the context of an application, 0.5 seconds is not a fatal delay, but if there is a way to make it faster, we should.
In our case, Redis caching is one option. Our page-builder function takes the slug (“arizona.html”) as an argument, queries our SQLite database and returns JSON-like nested dicts and lists to build the page.
```python
@json_formatter
@redis_cache
@string_formatter
def route_build_page(slug):
    # [some code which is going to query our SQLite database,
    #  return a JSON-like object of nested dicts and lists]
```
The decorators have the following functions:
- string_formatter: serialize the Python dicts and lists into a JSON-compatible string.
- redis_cache: store the serialized string in redis, keyed by the page slug.
- json_formatter: convert the string version back into the Python objects, for use in a Flask page template.
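The post doesn’t show the two formatter decorators themselves; a minimal sketch, assuming they simply wrap json.dumps/json.loads (the exact bodies in the project may differ):

```python
import functools
import json

def string_formatter(fn):
    # Innermost decorator: serialize the Python result into a JSON string,
    # so it can be stored in Redis as-is.
    @functools.wraps(fn)
    def wrapper(*args):
        return json.dumps(fn(*args))
    return wrapper

def json_formatter(fn):
    # Outermost decorator: turn the (possibly cached) string back into
    # nested dicts and lists for the Flask template.
    @functools.wraps(fn)
    def wrapper(*args):
        return json.loads(fn(*args))
    return wrapper
```

Because decorators apply bottom-up, string_formatter runs closest to the query, redis_cache sees only the serialized string, and json_formatter undoes the serialization last.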
The caching layer has a conditional: if the key for that page exists, we query Redis and return the cached result, rather than referring to the actual SQLite database. Here is the actual function:
```python
import redis

pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
_r = redis.Redis(connection_pool=pool)

def redis_cache(query_fun):
    def wrapper_fun(*args):
        key = make_key(query_fun, args)
        if _r.exists(key):
            print('cached', key)
            return _r.get(key)
        else:
            ret = query_fun(*args)
            _r.set(key, ret)
            print('uncached', key)
            return ret
    return wrapper_fun
```
Using the caching layer speeds up the response even more, down to about 0.05 seconds.
Performant queries are essential for any application. In my case, understanding that a column-to-column comparison in SQLite can’t engage the indexes was the critical factor.
In addition, effective use of caching can speed up your application even more.