Query optimization with SQLite and Redis caching

We can’t be waiting around for five seconds for a database to respond.

When we are young we believe the SQL execution engine will handle everything and make sure we get our results on time. Sadly, this is not the case, at least in my experience.

I’ll walk through the evolution of one SQL query, from initial steps to optimization and finally caching with Redis to observe the improvements we get each step along the way.

Background on the data

The US Jobs Data project is my open-source data-cleaning/interface project for the Bureau of Labor Statistics (BLS) Occupational Employment Statistics (OES). It provides a script to generate a SQLite database of all of the OES data going back to 1999, which is otherwise stored in disparate Excel files with slightly different formats. You can generate the SQLite database using the repo linked above, or download the completed file here.
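
If you want to poke around the data before writing any queries, the built-in sqlite3 module is enough. This is only a sketch; 'oes.sqlite' is a placeholder for wherever you saved the generated or downloaded file:

import sqlite3

# 'oes.sqlite' is a placeholder; use the path of your generated/downloaded file.
conn = sqlite3.connect('oes.sqlite')

# List the tables in the database (sitemap, series_code, occupation_code, value, ...).
for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"):
    print(name)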

The query

We want to select all the jobs data:

  • For a specific location (Arizona, in this example)
  • For occupation categories, not specific occupations (occupation code ends in four 0’s)
  • All years
  • All industries (industry code is '000000')

In this case, the 'input' to our query is a URL slug ("arizona.html") that exists in a sitemap table. We can join the area_code field of the sitemap table to our other dimension tables and ultimately to our fact table ('value') to get what we want.

SELECT o.code, o.name, v.year, v.value
    FROM value v, series_code sc, occupation_code o, sitemap sm
    WHERE v.series_code = sc.code
    AND sc.occupation_code LIKE '__0000'
    AND sm.area_code = sc.area_code
    AND sm.slug = 'arizona.html'
    AND sc.industry_code = '000000'
    AND sc.data_type = '01'
    AND o.code = sc.occupation_code
    ORDER BY 1 ASC, 2 ASC;

Using the query planner

The query planner tells us how the SQL execution engine will carry out this query and is accessed as follows:

EXPLAIN QUERY PLAN
[rest of query goes here]

And this is the result:

QUERY PLAN
|--SCAN TABLE value AS v
|--SEARCH TABLE series_code AS sc USING INDEX sqlite_autoindex_series_code_1 (code=?)
|--SEARCH TABLE occupation_code AS o USING AUTOMATIC COVERING INDEX (code=?)
|--SEARCH TABLE sitemap AS sm USING AUTOMATIC PARTIAL COVERING INDEX (slug=? AND area_code=?)
`--USE TEMP B-TREE FOR ORDER BY

Here, SCAN means the whole table must be scanned without the benefit of any index; SEARCH indicates that an index is used. In this case, the query planner is not using an index on the value table, which is the fact table and the largest table (~4.5 million rows).

There is an index over value (on the series_code column), and I was not sure why the query planner does not engage it, so I consulted the SQLite documentation on indexes.
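
For reference, that index covers the series_code column of value and shows up as value_code_index in the second query plan below. A quick sanity check from Python (the filename is a placeholder) to confirm which indexes exist, or to create this one if you built the database yourself:

import sqlite3

conn = sqlite3.connect('oes.sqlite')  # placeholder filename

# Create the index if it is missing; value_code_index is the name that
# appears in the second query plan below.
conn.execute('CREATE INDEX IF NOT EXISTS value_code_index ON value(series_code)')

# List the indexes SQLite knows about on the value table.
for row in conn.execute("PRAGMA index_list('value')"):
    print(row)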

Refactoring the query

If I interpret this correctly, an index cannot be engaged for a comparison between two columns, which is not something I would have expected.

In order to engage the index on the value table, I use a more explicit IN (subquery) formulation:

EXPLAIN QUERY PLAN
SELECT o.code, o.name, v.year, v.value
    FROM value v, series_code sc, occupation_code o
    WHERE v.series_code IN (
        SELECT sc.code FROM series_code sc, sitemap sm
        WHERE sc.occupation_code LIKE '__0000'
        AND sm.area_code = sc.area_code
        AND sm.slug = 'arizona.html'
        AND sc.industry_code = '000000'
        AND sc.data_type = '01'
    )
    AND v.series_code = sc.code
    AND o.code = sc.occupation_code
    ORDER BY 1 ASC, 2 ASC;

QUERY PLAN
|--SEARCH TABLE series_code AS sc USING INDEX sqlite_autoindex_series_code_1 (code=?)
|--LIST SUBQUERY 1
|  |--SCAN TABLE series_code AS sc
|  `--SEARCH TABLE sitemap AS sm USING AUTOMATIC PARTIAL COVERING INDEX (area_code=? AND slug=?)
|--SEARCH TABLE value AS v USING INDEX value_code_index (series_code=?)
|--REUSE LIST SUBQUERY 1
|--SEARCH TABLE occupation_code AS o USING AUTOMATIC COVERING INDEX (code=?)
`--USE TEMP B-TREE FOR ORDER BY

And we can see that we are now using the index on the value table.

Improvement in performance

The initial single-query formulation runs in about 5 seconds. The second formulation with the IN (subquery) structure runs in roughly 0.5 seconds.

The faster query returns exactly the same information in one tenth the time.
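
If you want to reproduce these timings, the sqlite3 shell's .timer on command works, or a small helper like this sketch (the filename is again a placeholder):

import sqlite3
import time

conn = sqlite3.connect('oes.sqlite')  # placeholder filename

def time_query(sql, params=()):
    # Run the query, fetch every row, and report the elapsed wall-clock time.
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    print(f'{len(rows)} rows in {time.perf_counter() - start:.3f}s')
    return rows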

Redis caching

In the context of an application, 0.5 seconds is not a fatal delay, but if there is a way to make it faster, we should.

In our case, Redis caching is one option. Our page-builder function takes the slug (“arizona.html”) as an argument, queries our SQLite database and returns JSON-like nested dicts and lists to build the page.

@json_formatter
@redis_cache
@string_formatter
def route_build_page(slug):
    # [some code which is going to query our SQLite database and return a
    #  JSON-like object of nested dicts and lists]

The decorators have the following functions:

  • string_formatter: serialize the Python dicts and lists into a JSON-compatible string.
  • redis_cache: store the serialized string in redis, keyed by the page slug.
  • json_formatter: convert the string version back into the Python objects, for use in a Flask page template.
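
The two formatting decorators are not shown here; a minimal sketch of what they might look like, assuming the standard json module does the (de)serialization:

import json
from functools import wraps

def string_formatter(query_fun):
    # Serialize the nested dicts/lists returned by the query into a JSON string.
    @wraps(query_fun)
    def wrapper_fun(*args):
        return json.dumps(query_fun(*args))
    return wrapper_fun

def json_formatter(query_fun):
    # Turn the (possibly cached) JSON string back into Python objects.
    @wraps(query_fun)
    def wrapper_fun(*args):
        return json.loads(query_fun(*args))
    return wrapper_fun

Conveniently, json.loads also accepts the bytes that redis-py returns on a cache hit, so the same outer decorator works for both the cached and uncached paths.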

The caching layer has a conditional: if the key for that page exists in Redis, we return the cached result rather than querying the SQLite database again. Here is the actual function:

import redis

pool = redis.ConnectionPool(host='localhost', port=6379, db=0)
_r = redis.Redis(connection_pool=pool)

def redis_cache(query_fun):
    def wrapper_fun(*args):
        # Build a cache key from the wrapped function and its arguments.
        key = make_key(query_fun, args)
        print(key)
        if _r.exists(key):
            # Cache hit: return the serialized page straight from Redis.
            print('cached', key)
            return _r.get(key)
        else:
            # Cache miss: run the SQLite query, then store the result.
            ret = query_fun(*args)
            _r.set(key, ret)
            print('uncached', key)
            return ret
    return wrapper_fun
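
make_key is not shown above; a minimal sketch of what it could be, assuming the key is simply the function name plus its arguments (e.g. route_build_page:arizona.html):

def make_key(query_fun, args):
    # Hypothetical key scheme: function name plus the stringified arguments.
    return ':'.join([query_fun.__name__] + [str(a) for a in args])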

Using the caching layer cuts the response time even further, to about 0.05 seconds.

Conclusion

Performant queries are essential for any application. In my case, understanding that a column-to-column comparison in SQLite would not engage the index was the critical factor.

In addition, effective use of caching can speed up your application even more.
