If you've ever seen a transactional system freeze up under heavy load, you know the kind of quiet panic that sets in. Queries stack up, CPU usage looks fine, logs seem normal, yet everything's stuck. Deadlocks are sneaky problems that don't show themselves until they cause real damage. They don't corrupt data, but they kill uptime and shake trust in the system, forcing teams to scramble to get ahead of them before things go south.
A deadlock occurs when two or more transactions wait on each other's locks in a cycle. The database detects the cycle and terminates one transaction to break it. In systems handling a high volume of online transactions, that killed transaction often triggers retries, timeouts, and a cascade of failures visible to users. That's why preventing deadlocks isn't just a nice-to-have for performance; it's crucial for system resilience.
Experts who design high-throughput systems agree that deadlocks usually aren’t random. Martin Kleppmann points out that subtle inconsistencies in the order locks are acquired cause most deadlocks. Pat Helland argues that systems that treat ordering and idempotency as first-class concerns rarely face persistent deadlocks. Tammy Butow adds that many companies underestimate how retry storms caused by poor backoff strategies make deadlocks worse. Their advice boils down to three things: predictable lock ordering, limited retries, and a clear understanding of how locks are granted.
So why do deadlocks happen? They aren't accidents. Four conditions must all hold at once: mutual exclusion (only one transaction can hold a lock at a time), hold and wait (transactions hold locks while waiting for others), no preemption (locks can't be forcibly taken away), and circular wait (each transaction waits for a lock held by another, forming a cycle). Relational databases have the first three by design, so circular wait is the condition developers can realistically break. Common causes include inconsistent lock ordering, long-running transactions, overly strict isolation levels, and conflicting access patterns during traffic spikes. For example, if one transaction updates rows 1 then 2, and another updates rows 2 then 1, they can deadlock when their timing overlaps in just the wrong way, as the sketch below shows.
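Here is a minimal sketch of that exact interleaving, assuming a hypothetical accounts table and psycopg2 against Postgres; any engine with row-level locking behaves the same way, and one of the two sessions will be aborted with a deadlock error.
```python
import threading
import time

import psycopg2  # assumption: any DB-API driver with row-level locking shows the same behaviour

DSN = "dbname=app"  # hypothetical connection string

def session(first_id, second_id):
    conn = psycopg2.connect(DSN)
    try:
        with conn, conn.cursor() as cur:  # commit on success, rollback on error
            cur.execute("UPDATE accounts SET balance = balance - 1 WHERE id = %s", (first_id,))
            time.sleep(0.1)  # widen the window so the two sessions interleave
            # By now the other session already holds the lock on second_id: circular wait.
            cur.execute("UPDATE accounts SET balance = balance + 1 WHERE id = %s", (second_id,))
    finally:
        conn.close()

# Session A locks row 1 then row 2; session B locks row 2 then row 1.
# The database detects the cycle and aborts one of the two transactions.
threading.Thread(target=session, args=(1, 2)).start()
threading.Thread(target=session, args=(2, 1)).start()
```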
The real damage from a deadlock isn't the single killed transaction; it's the retry storm that follows. Clients typically retry immediately, with no delay or randomness, so the retries collide again and again. This can snowball into outages lasting minutes. In one case, a payments API turned an 80 ms deadlock into a five-minute partial outage because retries kept colliding. At scale, even a tiny percentage of deadlocks can produce a flood of failures that saturates connection pools. It's far easier to stop the deadlock from happening than to contain the storm afterward.
To prevent deadlocks, start by enforcing strict and documented lock ordering. Every transaction type must acquire locks in the same order—by primary key, resource type, or hierarchy. Don’t let user input or dynamic conditions change that order on the fly, or you’ll invite cycles. If conditional writes are necessary, do a simple metadata read first to pick the correct ordered path.
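As a sketch of what that rule can look like in code, again assuming the hypothetical accounts table from earlier: a transfer always locks rows in ascending primary-key order, no matter which direction the money moves, so two overlapping transfers can never form a cycle.
```python
def transfer(cur, from_id, to_id, amount):
    # Lock both rows in sorted primary-key order before touching either one.
    for row_id in sorted((from_id, to_id)):
        cur.execute("SELECT 1 FROM accounts WHERE id = %s FOR UPDATE", (row_id,))
    # Both locks are held; the updates can run in whatever order the business logic needs.
    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (amount, from_id))
    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (amount, to_id))
```
With this in place, transfer(cur, 7, 3, 50) and transfer(cur, 3, 7, 50) both lock row 3 before row 7, which removes the circular wait from the earlier example.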
Keeping transactions short also helps. Long transactions don't cause deadlocks directly, but they widen the window in which conflicts can occur. Move unnecessary queries and expensive operations, such as external API calls, outside the transaction. Treat atomicity like a scalpel rather than a sledgehammer: only wrap the work that actually needs transactional guarantees.
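A minimal sketch of that idea, using hypothetical payments and ledger tables and a stand-in for a slow third-party call: the slow call runs before the transaction opens, so locks are held only for the two quick writes.
```python
def call_payment_provider(payment_id, amount):
    # Stand-in for a slow external API call (hypothetical).
    return f"receipt-{payment_id}"

def settle_payment(conn, payment_id, amount):
    receipt = call_payment_provider(payment_id, amount)  # slow work happens outside the transaction
    with conn, conn.cursor() as cur:  # transaction starts here, commits on exit
        cur.execute("UPDATE payments SET status = 'settled', receipt = %s WHERE id = %s",
                    (receipt, payment_id))
        cur.execute("INSERT INTO ledger (payment_id, amount) VALUES (%s, %s)",
                    (payment_id, amount))
```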
Use the right isolation level instead of the strongest one. Serializable isolation might feel safest but often means more locks and higher deadlock risk. Pick the weakest isolation level that still protects correctness, like snapshot isolation for reads or read committed for writes with explicit checks. Higher isolation isn’t always better—just more restrictive.
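As a sketch of picking isolation per transaction rather than globally, assuming psycopg2 against Postgres, where REPEATABLE READ gives snapshot-style reads; the table names are the hypothetical ones from earlier.
```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

# Read-mostly reporting: snapshot-style reads, no write locks taken.
conn.set_session(isolation_level="REPEATABLE READ", readonly=True)
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM payments WHERE status = 'settled'")
    settled = cur.fetchone()[0]

# Write path: READ COMMITTED plus an explicit check instead of SERIALIZABLE.
conn.set_session(isolation_level="READ COMMITTED", readonly=False)
with conn, conn.cursor() as cur:
    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s AND balance >= %s",
                (10, 1, 10))  # the WHERE clause is the explicit correctness check
```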
Even with all of that, some deadlocks will still happen, so always retry with exponential backoff and jitter. Don't retry immediately after a failure. Randomize the delays so retries don't bunch up and collide again. Even a small amount of jitter drastically reduces retry collisions, and it is often the difference between a minor hiccup and a full-blown incident.
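A minimal retry wrapper along those lines, assuming psycopg2, whose DeadlockDetected error corresponds to SQLSTATE 40P01; with another driver, swap in its equivalent error type.
```python
import random
import time

import psycopg2
from psycopg2 import errors

def run_with_retry(conn, do_work, max_attempts=5, base_delay=0.05):
    for attempt in range(max_attempts):
        try:
            with conn, conn.cursor() as cur:  # commit on success, rollback on error
                return do_work(cur)
        except errors.DeadlockDetected:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt instead of retrying forever
            # Full jitter: sleep a random amount up to the exponential cap,
            # so colliding clients spread out instead of retrying in lockstep.
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```
For example, run_with_retry(conn, lambda cur: transfer(cur, 3, 7, 50)) retries the ordered transfer from above a handful of times, with each retry delayed by a random slice of an ever-larger window.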
Lastly, track lock contention as a first-class metric. Most teams watch overall query latency but ignore how long queries spend waiting on locks. Measuring blocking time, deadlock counts, and the top lock waiters helps you spot trouble before it turns into an outage.
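One way to sample these numbers, assuming Postgres: pg_blocking_pids() (9.6 and later) shows who is blocking whom, and pg_stat_database keeps a cumulative deadlock counter. The sketch below just collects both; feed the results into whatever metrics pipeline you already run.
```python
# Queries are Postgres-specific; other engines expose similar views under different names.
BLOCKED_QUERIES = """
    SELECT pid,
           pg_blocking_pids(pid) AS blocked_by,
           now() - query_start   AS running_for,
           left(query, 80)       AS query
    FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0
"""

DEADLOCK_COUNT = "SELECT deadlocks FROM pg_stat_database WHERE datname = current_database()"

def sample_lock_contention(cur):
    cur.execute(BLOCKED_QUERIES)
    blocked = cur.fetchall()          # one row per session currently waiting on a lock
    cur.execute(DEADLOCK_COUNT)
    (deadlocks,) = cur.fetchone()     # cumulative deadlocks since stats were last reset
    return blocked, deadlocks
```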