User Tools

Site Tools



SQLite Scaling

The following code snippet is intended to crudely measure the ability of SQLite to perform on large tables assuming a database with a single table of simple financial transaction records. This test is highly simplistic and shouldn't be taken as a general benchmark - for example, the filesystem cache should be full of the database table after the insertions so the queries that follow may be faster than on a system with more contention for system resources. Also, the inclusion of the call to fetchall() might be a little contentious as it incurs some Python overhead as well as the SQLite load. However, I wasn't sure whether SQLite would actually perform the entire query without it, and I thought it was fairest to fetch all results at once rather than incurring additional pure Python overhead iterating over results.
import os
import random
import sqlite3
import string
import time
NUM_RECORDS = 100000
conn = sqlite3.connect("/tmp/testdb")
cur = conn.cursor()
cur.execute("CREATE TABLE transactions ("
            " id INTEGER PRIMARY KEY,"
            " time INTEGER NOT NULL,"
            " recipient TEXT NOT NULL,"
            " amount INTEGER NOT NULL)")
# Uncomment to add index to "amount" column
# cur.execute("CREATE INDEX transactions_amount_idx ON transactions (amount)")
print "Generating recipients..."
recipients = ["".join(random.choice(string.ascii_uppercase)
                      for i in xrange(random.randint(5, 10)))
              for j in xrange(100)]
print "Generating transactions..."
data = [(int(time.time()) + i, random.choice(recipients),
         int(random.triangular(1, 10000))) for i in xrange(NUM_RECORDS)]
print "Inserting %d records in blocks of 100..." % (NUM_RECORDS,)
start = time.time()
for i, (tr_time, recipient, amount) in enumerate(data):
    cur.execute("INSERT INTO transactions (time, recipient, amount)"
                " VALUES (?, ?, ?)", (tr_time, recipient, amount))
    if i % 100 == 0:
print "Took %.2f secs" % (time.time() - start,)
print "Generating limits..."
limits = [random.triangular(100, 1000) for i in xrange(NUM_RECORDS)]
print "Performing %d random queries..." % (NUM_RECORDS,)
start = time.time()
for limit in limits:
    cur.execute("SELECT COUNT(*) FROM transactions WHERE amount > ?", (limit,))
print "Took %.2f secs" % (time.time() - start,)

On a standard (not overpowered) Linux desktop, the following representative results were observed:

Number of operations Time to insert (secs) Time to select (secs)
1000 1.49 0.15
10000 14.84 13.99
100000 159.72 1539.56

It appears that the insertion time is scaling roughly linearly with the number of operations, but the query time is increasing at a steeper rate. This probably isn't surprising since table scans are required.

Uncommenting the line to create an index in the file above has the following effect (the percentage comparisons are relative to no index):

Number of operations Time to insert (secs) Time to select (secs)
1000 1.82 +22% 0.14 -7%)
10000 21.68 +46% 11.66 -17%
100000 373.52 +134% 1150.85 -25%

As expected the insertion time increases somewhat and the query time reduces somewhat, but the effects aren't particularly pronounced until the table becomes full, and even then the acceleration of queries is somewhat limited compared to the increased overhead of inserts. In many cases, however, queries are likely to be significantly more frequent than inserts, so the trade-off may still be reasonable.

notes/sqlite_scaling.txt · Last modified: 2013/03/13 16:37 by andy