Published: Thu Oct 27 2022

Debugging database deadlocks by focusing on long transactions

In general, in a highly concurrent system (e.g. most web applications), you want your database transactions to be short and sweet. The reason for this is that long transactions hold onto database locks longer and are more likely to cause conflicts with other transactions in the form of deadlocks and lock timeouts. This causes unpredictable and seemingly random errors for end users which tend to be a pain to debug.

Aiming for short transactions is something of a balance between problem that long transactions tend to decrease overall reliability and the fact that transactions are supposed to be groups of data modifications that will be rolled back in totality (which is largely independent of how long they take to execute). There isn't a one size fits all answer to this; sometimes you need a long transaction because that's the nature of the transaction and you need it to be safely rolled back if any part of it fails. Regardless, it's still useful to know which transactions are taking long periods of time.

In even a mildly complex system, keeping mental track of the transactions your code is creating can be hard, because the lower level of the database transaction becomes a bit abstract when you're focusing on the higher level of application level logic. For the same reason, debugging deadlocks at the database level can be extremely difficult, because what's happening in raw SQL often doesn't obviously correspond to higher level application logic (especially if you've got an ORM in there generating queries for you and populating object properties on demand that look like they already exist in memory). As such, transactions can become big and slow and include waiting for things they probably shouldn't (like network traffic for API calls or disk access for file read/writes) without it always being obvious.

So you end up in a situation where you have all these long running transactions and there's a deadlock happening frequently, and it would probably be really easy to fix if you knew where it was happening, but you have a lot of code and routines accessing those tables and you don't have much idea which ones are the problem.

Therefore it's usually prudent to build something into your application that checks and logs long running transactions.

So, what does this look like? Here is an example in Node.js using MySQL.

We keep a set of active transactions and set an interval to look over them frequently and write a warning if any of them have been ongoing for more than 10 seconds. The warning includes a stack trace of where the transaction was created, which makes it very easy to determine where the problems lie.

This module exposes a single function called tx(func), through which all transactions in the application should be created. This is a very minimal implementation of a transaction wrapper function - in reality, such a wrapper should provide a retry strategy for deadlocks as well as handling nested transactions (which might not be valid in your DBMS, and therefore may lead to early commits - which is the case in MySQL).

As always, I tend to write asynchronous Node using promises and async/await because I think it makes the code easier to follow, but if you don't like that you can always use callbacks.

// 'Long' is defined here as 10s
const longTxDuration = 10 * 1000;

// A map of { id : txInfo }, where txInfo is itself a map of:
// { id: (string), start: (date), trace: (string) }
const txInfos = {};

// Checks the currently executing to find and log any that have been running for longer
// than longTxDuration
function checkLongTxs() {
    for (const [key, txInfo] in txInfos) {
        const duration = new Date() - txInfo.start;
        if (duration > longTxDuration) {
            console.warn('Long tx: ' + txInfo.id + ' ' + txInfo.trace);
        }
    }
}

setInterval(checkLongTxs, longTxDuration);


// DB code:

const mysql = require('mysql');
const { randomUUID } = require('crypto');
const { promisify } = require('util');

// Initialise connection pool
var pool = mysql.createPool({
    connectionLimit: 10,
    host: secrets.db.host,
    user: secrets.db.user,
    password: secrets.db.password,
    database: secrets.db.database,
    multipleStatements: true
});

// Utility wrapper function to get a DB connection on the pool
const getConnection = promisify(pool.getConnection.bind(pool));

// Transaction function
async function tx(func) {

    // Transaction information for debugging long transactions
    const txInfo = {
        id: randomUUID,
        start: null,
        trace: new Error().stack,
    }

    return new Promise(async (res, rej) => {
        let c = await getConnection();
        let commit = promisify(c.commit.bind(c));
        let rollback = promisify(c.rollback.bind(c));
        c.beginTransaction(async () => {
            txInfo.start = new Date();
            txInfos[txInfo.id] = txInfo;
            try {
                let ret = await func(c);
                await commit();
                res(ret);
            }
            catch (err) {
                await rollback();
                rej(err);
            }
            finally {
                delete txInfos[txInfo.id];
                c.release();
            }
        });
    });
}

exports.tx = tx;

Then usage is something like:

tx(async (connection) => {
    const query = promisify(connection.query.bind(connection));
    let myData = await query('SELECT * FROM table');

    // do something with myData, run more queries, etc
});

Mark Watkinson