
The Silent Database Killer: Why Your Transactions Shouldn't Wait on the Network

14 years of experience taught me this: the network is a liar. Learn why wrapping API calls in DB transactions is a recipe for a production meltdown.

In this edition of Real-World Engineering, we aren’t looking at a textbook problem. We are looking at a "production-is-on-fire" problem.

I’ve spent 14 years building distributed systems and moving bits across clouds, and if there is one thing I’ve learned, it’s this: The network is a liar, and your database is a jealous lover.

Yesterday, a junior engineer on my team, talented but still earning the scars that high-scale systems leave, came to me with a puzzle.

The Situation: The "Safe" Transaction

We had a service that needed to process a user’s premium upgrade. The requirements were simple:

  1. Update the user's status in the DB.
  2. Call a 3rd party Billing API to charge the card.
  3. If the charge fails, revert the DB change.

The junior, wanting to be responsible, wrapped the whole thing in a single @Transactional block.

The Logic:

  • Open Transaction.
  • Update is_premium = true.
  • Heavy Network Call to Billing Provider.
  • If success → Commit.
  • If error → Rollback.
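To make the shape of that logic concrete, here is a minimal sketch in plain Java. It is not the team's actual code: the Semaphore is a hypothetical stand-in for a connection pool (one permit per connection), and `UpgradeInsideTransaction`, `billingCall`, and `heldDuringBilling` are names invented for illustration. What it shows is that the connection stays borrowed for the entire duration of the billing call.

```java
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;

// Sketch of the anti-pattern. The Semaphore is a hypothetical stand-in for a
// connection pool: one permit is one database connection.
class UpgradeInsideTransaction {
    private final int poolSize;
    private final Semaphore pool;
    boolean premium = false;     // stand-in for the is_premium column
    int heldDuringBilling = 0;   // connections checked out while waiting on the network

    UpgradeInsideTransaction(int poolSize) {
        this.poolSize = poolSize;
        this.pool = new Semaphore(poolSize);
    }

    boolean upgrade(BooleanSupplier billingCall) {
        pool.acquireUninterruptibly();             // open transaction: borrow a connection
        boolean charged = false;
        try {
            premium = true;                        // update is_premium = true (uncommitted)
            heldDuringBilling = poolSize - pool.availablePermits();
            charged = billingCall.getAsBoolean();  // slow network call, connection still held
            return charged;                        // success: commit
        } finally {
            if (!charged) {
                premium = false;                   // decline or exception: rollback
            }
            pool.release();                        // only now does the connection go back
        }
    }

    int idleConnections() {
        return pool.availablePermits();
    }
}
```

The business logic is correct: a declined charge leaves `premium` false. The cost is hidden in the ordering, because `pool.release()` cannot run until the billing provider answers.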

The Problem? The system started crawling. Connection pools were exhausted. The database CPU was spiking, but there was almost no traffic.

How the Junior Approached It

When we sat down, he showed me the code with a look of pure confusion.

"I’m using a transaction to ensure data integrity," he said. "If the billing fails, I don't want the user to have premium status for free. It has to be one atomic operation."

His mental model was correct from a business perspective, but dangerous from a system perspective.

He was treating a distributed system like a local monolith. He thought he was being safe, but he was actually holding a row lock and a pooled connection for as long as it took a packet to travel across the Atlantic, wait for a 3rd party server to wake up, and travel back.

How I Explained It: The Restaurant Analogy

I told him: "Imagine you go to a busy restaurant. You sit at a table (The DB Row), and the waiter (The Transaction) takes your order. But instead of going to the kitchen, the waiter stands at your table and calls the vegetable supplier to see if they have carrots."

"While he’s on the phone for 10 minutes, he can't serve anyone else. The table is occupied. The line outside gets longer. Eventually, the restaurant goes out of business because every waiter is just standing at a table holding a phone."

The Engineering Reality:

  • Database connections are finite.
  • When you start a transaction, you hold a connection.
  • If you make a network call inside that transaction, that connection sits idle but still counts as "active": it does no database work, yet it stays checked out of the pool.
  • If the network call takes 2 seconds (not uncommon for 3rd party APIs), and you have 50 concurrent users, you’ve just locked up 50 DB connections for 2 seconds each.
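The arithmetic behind those bullets is worth spelling out. The sketch below is a back-of-the-envelope calculation, not a benchmark: the pool size of 10 is an assumed figure (the story above never states the real one), and `PoolMath` and its method names are invented for illustration. It treats every request as holding exactly one connection for the full hold time.

```java
// Rough capacity math for a connection pool. Assumes every request holds
// exactly one connection for holdMillis and nothing else limits throughput.
class PoolMath {
    // Upper bound on requests per second the pool can serve.
    static double maxRequestsPerSecond(int poolSize, long holdMillis) {
        return poolSize * 1000.0 / holdMillis;
    }

    // If `requests` arrive at the same instant, the time at which the last
    // one finishes: requests are served in waves of poolSize.
    static long lastRequestDoneMillis(int requests, int poolSize, long holdMillis) {
        long waves = (requests + poolSize - 1) / poolSize; // ceiling division
        return waves * holdMillis;
    }
}
```

With the assumed pool of 10, `maxRequestsPerSecond(10, 2000)` caps the service at 5 requests per second, and `lastRequestDoneMillis(50, 10, 2000)` puts the last of 50 simultaneous users at 10 seconds. Shrink the hold time to 5 ms and the same pool could, on paper, serve up to 2,000 requests per second.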

How We Resolved It: The Outbox Pattern (or Post-Commit Logic)

We didn't need a database transaction to span the network. We needed Eventual Consistency.

We refactored the logic:

  1. Update the DB first with a "Pending" status.
  2. Commit the transaction immediately (releasing the connection).
  3. Make the network call outside the transaction.
  4. Update the DB again based on the result.

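If the network call fails or the system crashes before step 4? A background worker (the Outbox Pattern idea) retries the call or reconciles the rows stuck in "Pending." The retry has to be idempotent, or a crash between the charge and step 4 turns into a double charge.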
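The four steps and the background worker can be sketched as follows. This is a simplified illustration under stated assumptions, not our production code: the `ConcurrentHashMap` stands in for the users table, each `put` stands in for its own short committed transaction, and `billing` is a hypothetical client that returns true on a successful charge and throws on a network failure. A real billing client would also need an idempotency key so that a retried charge cannot bill twice, which this sketch does not model.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

class PremiumUpgradeFlow {
    enum Status { FREE, PENDING, PREMIUM }

    // Stand-in for the users table; each put() represents one short transaction.
    final Map<String, Status> users = new ConcurrentHashMap<>();

    // Hypothetical billing client: true = charged, false = declined,
    // RuntimeException = network failure or timeout.
    private final Predicate<String> billing;

    PremiumUpgradeFlow(Predicate<String> billing) {
        this.billing = billing;
    }

    void upgrade(String userId) {
        users.put(userId, Status.PENDING); // steps 1-2: write "Pending" and commit at once
        settle(userId);                    // steps 3-4 happen with no transaction open
    }

    private void settle(String userId) {
        boolean charged;
        try {
            charged = billing.test(userId); // step 3: network call, no connection held
        } catch (RuntimeException networkFailure) {
            return;                         // row stays PENDING for the worker to pick up
        }
        // Step 4: second short transaction records the outcome.
        users.put(userId, charged ? Status.PREMIUM : Status.FREE);
    }

    // Background worker: retry every row still stuck in PENDING.
    void reconcile() {
        users.forEach((userId, status) -> {
            if (status == Status.PENDING) {
                settle(userId);
            }
        });
    }
}
```

If the first billing attempt times out, the user simply stays in PENDING until the next `reconcile()` run settles the row one way or the other. In a real service, `reconcile()` would run on a schedule and query only the pending rows rather than scanning everything.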

How Not to Make This Mistake

If you take one thing away from my 14 years of breaking things, let it be this:

1. The Golden Rule

Never, ever perform slow external I/O (network calls, file system access, external APIs) inside a database transaction.

2. Keep Transactions Short

A transaction should be "Get in, update bits, get out." It should be measured in milliseconds, not seconds.

3. Embrace "Pending" States

Instead of trying to make everything happen "now," design your state machine to handle "In-Progress" states. It makes your system resilient to network flickers.
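One way to make "In-Progress" a first-class citizen is to write the allowed transitions down in code. The enum below is a minimal sketch for the premium upgrade example; the state names and the `canMoveTo` helper are assumptions made for illustration, and a real system would likely carry more states (refunds, cancellations, and so on).

```java
import java.util.EnumSet;
import java.util.Set;

// Explicit state machine for the upgrade. Every legal transition is listed;
// anything else is rejected instead of silently overwriting state.
enum UpgradeState {
    FREE, PENDING, PREMIUM;

    Set<UpgradeState> next() {
        switch (this) {
            case FREE:
                return EnumSet.of(PENDING);        // upgrade requested
            case PENDING:
                return EnumSet.of(PREMIUM, FREE);  // charge succeeded or was declined
            default:
                return EnumSet.noneOf(UpgradeState.class); // downgrades are out of scope here
        }
    }

    boolean canMoveTo(UpgradeState target) {
        return next().contains(target);
    }
}
```

The useful property is what the enum forbids: `FREE.canMoveTo(PREMIUM)` is false, so no code path can hand out premium status without passing through PENDING and a billing result.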

Final Thoughts

My junior engineer didn't just fix a bug; he changed how he thinks about distributed time. In a local function, time is cheap. In a distributed system, time is the most expensive resource you have.

Don't let your database wait on the internet. The internet doesn't care about your connection pool.

I'll see you in the next one.

Happy Coding.