Cloud Infrastructure: Why the Cheapest Option Costs the Most

One team spent $45,000 a year on a managed load balancer. When they replaced it with HAProxy (decades-old technology), they didn't just save the money. They handled ten times the traffic too. The managed service had hidden throughput limits buried in a pricing tier they'd never thought to question.

Another engineering leader spent an uncomfortable board meeting explaining why queries that used to be "fast" suddenly took 30 seconds. The cause: years of treating the database like a dumping ground instead of a strategic asset. No indexes on the columns that mattered. No thought given to query patterns. Just tables filling up, silently degrading.

Then there's AWS, taken down by its own AI tooling. The outages weren't hardware or network failures - they came from the company's own AI-assisted deployment process malfunctioning. The company that promises 99.9% availability couldn't keep its own lights on.

Three stories that look different on the surface but share a common root: decisions that look simple in the short term are usually the most expensive in the long term.

When "Convenience" Becomes a Prison

The managed services conversation always starts the same way: "We'll just run it on AWS, minimal maintenance, and our two developers can focus on features".

In year one, that sounds right.

By year two, the real costs appear: monthly bills creeping upward, the team gradually forgetting how the underlying systems actually work, and flexibility shrinking - every upgrade or change requires navigating the vendor's interface. By year three, when you want to move, you discover the entire infrastructure is built around that vendor's proprietary API.

The deeper problem: a managed service doesn't just buy you maintenance. Through lock-in, it also buys you the vendor's way of thinking. Every trade-off they made in designing their service becomes your trade-off, whether it fits your situation or not.

The team that switched from AWS's Application Load Balancer (ALB) to HAProxy wasn't making a bold contrarian bet. They had simply decided to understand what they were building.

Understanding your infrastructure matters beyond cost: when things break, and they will, you need engineers who know what to do. A team that never touched a load balancer because the managed service handled it invisibly is a team that can't diagnose why latency spikes at 2 AM. The cost of that ignorance shows up exactly when you can least afford it.

A Database Is Not a Dump Site

There's a mental model that developers fall into on almost every project at some point: the database is "the box where we put stuff".

A collection of tables. A place to push data.

The result: queries that get progressively slower as data grows. Indexes added reactively, after the problem has already hit production. Architecture that's hard to change because "everything is connected to everything".

Data really is the new oil, but as with oil, the value comes from a smart drilling strategy, not from the volume sitting in the ground.

You wouldn't build an oil rig without geological surveys. So why design a schema without understanding usage patterns?

The questions worth asking at the start of every project:

  • What data will be queried most frequently?
  • Where do you need strong consistency, and where can you relax it?
  • What does this data look like when there's ten times as much of it?
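To make the third question concrete, here is a toy model in Python. It assumes binary search over sorted keys is a fair stand-in for an index lookup; real databases use B-trees, and their real costs also involve disk pages and caching. The function names are hypothetical. It counts how many keys each strategy touches to find the last row, at 100,000 rows and again at 1,000,000.

```python
from __future__ import annotations


def linear_probes(keys: list[int], target: int) -> int:
    """Rows examined by an unindexed lookup: a scan until the match."""
    probes = 0
    for key in keys:
        probes += 1
        if key == target:
            break
    return probes


def binary_probes(sorted_keys: list[int], target: int) -> int:
    """Keys compared by a binary search, a simplified stand-in for an index."""
    lo, hi, probes = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return probes


for rows in (100_000, 1_000_000):
    keys = list(range(rows))
    target = rows - 1  # worst case for the scan: the very last row
    print(f"{rows:>9,} rows: scan={linear_probes(keys, target):,} "
          f"index={binary_probes(keys, target)}")
```

At ten times the data, the scan does ten times the work, while the search needs only a few more comparisons. That gap is why the question belongs at design time rather than in a production incident.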

Indexes aren't an implementation detail: they're part of the architecture. And schema design isn't a one-day task - it's a process that needs to evolve as your domain understanding deepens.
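One cheap way to treat indexes as architecture is to ask the query planner how it will run your most frequent queries before they ship. The sketch below uses Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN; the orders table, its columns, the index name, and the "hot" query are all hypothetical.

```python
import sqlite3


def plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); detail is the readable part.
    return "; ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " total_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total_cents) VALUES (?, ?)",
    [(i % 1000, i) for i in range(10_000)],
)

# Assumed hot query pattern, for illustration: all orders for one customer.
query = "SELECT id, total_cents FROM orders WHERE customer_id = ?"

before = plan(conn, query, (42,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(conn, query, (42,))

print("before:", before)  # a full-table SCAN of orders
print("after: ", after)   # a SEARCH using idx_orders_customer
```

The exact wording of the plan varies between SQLite versions, but the shift from a full scan to an index search is what to look for. Other databases have their own equivalents (usually an EXPLAIN statement), and running this check against your real query patterns during design is far cheaper than finding the scan in production.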

For the engineering leader who spent that board meeting explaining 30-second queries, the root cause predated the problem by two years. The team had added columns reactively, never thought about index coverage, and assumed the database would "handle it". By the time it couldn't, the schema was too entangled to fix quickly. Unpacking it took a quarter.

When AWS Takes Itself Down

A few weeks ago, AWS suffered multiple outages caused by its own AI tooling. This is the same company that promises 99.9% availability.

Amazon's response: "Human error, not AI error". Technically true - humans added AI tools to deployment processes before anyone had established what careful usage even looked like.

But that misses the point: when you use a managed service, you're trading control for imaginary convenience. When things break (and things always break), you're left explaining to customers and the board why your systems went down for a reason entirely outside your control.

This doesn't mean you shouldn't use managed services. It means you should know exactly what you're exchanging for what, and plan accordingly: an abstraction layer that lets you switch providers if needed, monitoring that catches problems before your vendor even knows they exist, and a backup plan that doesn't assume the vendor's uptime guarantee is ironclad.
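A minimal sketch of that kind of abstraction layer, in Python. BlobStore, InMemoryStore, and FailoverStore are hypothetical names; the in-memory class stands in for an adapter that would wrap a real provider's SDK, so the example runs on its own. The application depends only on the two-method interface, so switching or adding a provider means writing one new adapter, not rewriting the callers.

```python
from __future__ import annotations

from typing import Protocol


class BlobStore(Protocol):
    """The narrow interface the application codes against (hypothetical)."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for one provider's adapter; a real one would wrap its SDK."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._check()
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        self._check()
        return self._objects[key]

    def _check(self) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")


class FailoverStore:
    """Writes to every backend; reads from the first backend that answers."""

    def __init__(self, *backends: BlobStore) -> None:
        self.backends = backends

    def put(self, key: str, data: bytes) -> None:
        # A write fails loudly if any backend is down, so copies never
        # silently diverge.
        for backend in self.backends:
            backend.put(key, data)

    def get(self, key: str) -> bytes:
        last_error: Exception | None = None
        for backend in self.backends:
            try:
                return backend.get(key)
            except ConnectionError as exc:
                last_error = exc  # try the next provider
        raise ConnectionError("all backends failed") from last_error


primary = InMemoryStore("provider-a")
secondary = InMemoryStore("provider-b")
store = FailoverStore(primary, secondary)
store.put("invoice-1", b"paid")
primary.available = False       # simulate the vendor outage
print(store.get("invoice-1"))   # b'paid', served by provider-b
```

Writing to every backend is only one possible policy; an asynchronous replica or a periodic export would be cheaper. The point is that the policy lives in your code, not in the vendor's.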

The practical version: your monitoring should tell you about a vendor problem before the vendor's status page does. If you learn about an outage from a customer, you've already failed. An abstraction layer doesn't need to be complex - it needs to be enough that swapping one component doesn't require rewriting the entire system.
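One way to get that early warning is a synthetic probe: your own code calls the vendor's API on a schedule and keeps a short history of the results. The sketch below is a minimal Python version. SyntheticProbe, the window size, and the thresholds are illustrative assumptions; in practice the check would make a real request, a scheduler would call run_once, and should_alert would page someone.

```python
from __future__ import annotations

import time
from collections import deque
from collections.abc import Callable


class SyntheticProbe:
    """Calls a dependency repeatedly and alerts on your own evidence.

    check is any zero-argument callable that raises on failure, e.g. a
    function that performs a real request against the vendor's API.
    """

    def __init__(self, check: Callable[[], None], window: int = 10,
                 max_failure_rate: float = 0.3,
                 slow_seconds: float = 1.0) -> None:
        self.check = check
        self.results: deque[bool] = deque(maxlen=window)  # rolling window
        self.max_failure_rate = max_failure_rate
        self.slow_seconds = slow_seconds

    def run_once(self) -> bool:
        """Run one probe; a slow success counts as a failure."""
        start = time.monotonic()
        try:
            self.check()
            ok = time.monotonic() - start <= self.slow_seconds
        except Exception:  # any error from the dependency is a failure
            ok = False
        self.results.append(ok)
        return ok

    def should_alert(self) -> bool:
        """Alert once the window is full and failures exceed the threshold."""
        if len(self.results) < self.results.maxlen:
            return False
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_failure_rate


# Simulated vendor: healthy for the first 6 calls, then failing.
calls = {"n": 0}


def flaky_vendor() -> None:
    calls["n"] += 1
    if calls["n"] > 6:
        raise ConnectionError("vendor timeout")


probe = SyntheticProbe(flaky_vendor, window=10, max_failure_rate=0.3)
for _ in range(10):
    probe.run_once()
print(probe.should_alert())  # True: 4 of the last 10 probes failed
```

Counting a slow success as a failure is a deliberate choice: degraded latency often appears before a hard outage, and well before a status page changes color.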

Takeaways

In infrastructure, the real trade-off isn't between fast and expensive. It's between control and ease of operation.

  1. Managed services are a trade, not a gift. You get maintenance off your plate. You give up control, flexibility, and the institutional knowledge your team would have built. Know exactly what you're exchanging before you sign up - and make sure you can exit without rewriting everything.

  2. Your database is a strategic asset, not a filing cabinet. Schema design decisions made in year one compound into constraints in year three. Treat indexes and data modeling as architecture, not implementation details - and revisit them as your domain understanding grows.

  3. Vendor reliability is not your reliability. The AWS outages were caused by AWS's own tooling. Your SLA with your customers doesn't carry an asterisk for vendor failures. Monitor independently, abstract your dependencies, and plan for failures you don't control.

The "convenient" solution you adopt today can become your growth bottleneck three years from now. Before adopting a new managed service, ask: what do you gain? What do you give up? And what happens on the day you need to switch?
