A thread recently blew up on Reddit’s r/devops asking a simple question: “What’s the most expensive DevOps mistake you’ve seen or made?” The answers were painful. Not because they described exotic edge cases, but because nearly every engineer reading them thought, “Yeah, that’s us.”
Cloud waste isn’t a fringe problem. It’s the norm. Gartner estimates that organisations waste 30% or more of their cloud spend. But the Reddit thread put real numbers and real stories behind that statistic — and the patterns are worth examining if you’d rather not repeat them.
One commenter dropped a number that made people do a double-take: over $400k per year on CI/CD, with the majority of that spend going to automated smoke tests that did nothing more than check whether a website was live. Four hundred thousand dollars. To ping a URL.
Another described CI/CD runners that scaled up on demand but never scaled back down. The auto-scaling worked perfectly in one direction. Nobody had configured — or even thought about — the scale-down policy. So runners accumulated like browser tabs nobody closes, each one quietly billing by the hour.
This pattern is staggeringly common. Teams invest heavily in automation, then forget that automation itself has running costs. The pipeline becomes infrastructure, and infrastructure left unmanaged becomes waste.
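The missing piece in that story is trivially small. Here's a minimal sketch of the scale-down decision nobody wrote — assuming each runner reports the timestamp of its last job (the field names and thresholds are illustrative, not from any specific CI system):

```python
from datetime import datetime, timedelta

def runners_to_terminate(runners, now, idle_after=timedelta(minutes=30), keep_warm=2):
    """Return IDs of runners idle longer than `idle_after`.

    Always keeps `keep_warm` runners alive so the next pipeline
    doesn't pay a cold-start penalty. Oldest-idle go first.
    """
    idle = [r for r in runners if now - r["last_job"] > idle_after]
    idle.sort(key=lambda r: r["last_job"])  # oldest-idle first
    surplus = max(0, len(runners) - keep_warm)
    return [r["id"] for r in idle[:surplus]]
```

A dozen lines, run on a timer, and the one-directional auto-scaling problem disappears. The hard part was never the code — it was remembering that the code needed to exist.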
Feature branches get environments. Feature branches get merged. Environments stay running.
Multiple engineers in the thread described this exact scenario. Old preview and staging environments, spun up for branches that were merged weeks or months ago, left running because no one owned the teardown process. There was no TTL policy, no cleanup automation, no audit. Just a slowly growing fleet of zombie infrastructure.
One commenter put it bluntly: with no TTL policy for test infrastructure, they had dozens of environments running that nobody could even explain the purpose of. Each one cost money every hour of every day.
The fix seems obvious — tag resources, set expiry dates, automate cleanup. But “obvious” and “done” are different things entirely, and most teams learn the difference when the bill arrives.
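For the curious, the "obvious" fix is roughly this — a sketch assuming every environment carries an `expires-at` tag in ISO date format (the tag name and data shape are hypothetical; the point is the logic):

```python
from datetime import datetime

def expired_environments(resources, now):
    """Return names of environments whose 'expires-at' tag is in the past.

    Untagged resources are flagged too — infrastructure nobody bothered
    to tag is exactly what turns into an unexplainable zombie environment.
    """
    expired = []
    for res in resources:
        tag = res.get("tags", {}).get("expires-at")
        if tag is None or datetime.fromisoformat(tag) <= now:
            expired.append(res["name"])
    return expired
```

Wire the output into your teardown automation and the zombie fleet stops growing. The catch, as the thread makes clear, is that someone has to own this script, run it, and keep tagging honest.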
Someone left logging levels on “debug” in production. That alone wouldn’t be remarkable — it happens — except the resulting log volume drove their analytics bill to $120k per year. And when they finally audited what was being logged? Nothing important. Megabytes of noise per second, indexed and stored at premium rates, read by nobody.
Debug logging in production is one of those mistakes that compounds quietly. Each microservice adds its own stream. Each stream gets ingested, parsed, indexed, and retained. By the time someone notices, you’re paying more for logs than for the infrastructure generating them.
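The back-of-envelope maths is worth doing, because the numbers escalate faster than intuition suggests. A rough annual-cost estimate, using an assumed illustrative ingest rate of $0.50/GB (real pricing varies by vendor and usually adds indexing and retention charges on top):

```python
def annual_log_cost(mb_per_second, price_per_gb=0.50):
    """Back-of-envelope annual log ingestion cost for a sustained throughput.

    price_per_gb is an assumed illustrative rate, not any vendor's actual
    pricing — indexing and retention typically cost extra.
    """
    seconds_per_year = 365 * 24 * 3600
    gb_per_year = mb_per_second * seconds_per_year / 1024
    return gb_per_year * price_per_gb
```

At that assumed rate, a sustained ~8 MB/s of debug noise works out to roughly $123k a year — right in the neighbourhood of the bill described in the thread.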
One engineer discovered 16TB of data sitting in standard S3 storage, doing absolutely nothing. Not archived. Not lifecycle-managed. Just sitting in the most expensive storage tier, accumulating charges month after month. It had been there long enough that nobody on the current team knew what it was or who created it.
This is the cloud equivalent of a storage unit you forgot you’re renting. Except the storage unit bills you by the gigabyte and never sends a reminder.
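The remedy here is a lifecycle policy, which S3 has supported for years. A sketch of one, in the structure boto3's `put_bucket_lifecycle_configuration` expects — the 30/90/365-day thresholds are illustrative, not a recommendation for any particular dataset:

```python
# Transition objects to cheaper tiers as they age, then expire them.
# The day thresholds below are illustrative — tune them to how the
# data is actually accessed.
lifecycle = {
    "Rules": [
        {
            "ID": "archive-then-expire",
            "Filter": {"Prefix": ""},  # applies to the whole bucket
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}
```

One rule like this, applied when the bucket was created, and that 16TB would have drifted into archival storage on its own instead of sitting in the most expensive tier for years.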
Cross-AZ data transfer is one of those costs that surprises teams who thought they understood their cloud bill. One commenter reported spending $4-5k per month on internal traffic between availability zones — services talking to each other across AZ boundaries, each request quietly incurring a transfer charge.
This isn’t a misconfiguration in the traditional sense. It’s an architectural decision (or non-decision) that happens when teams deploy across AZs for redundancy without accounting for the network cost of inter-service communication. The redundancy is correct. The bill is just larger than anyone expected.
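The sneaky part is that cross-AZ transfer is typically billed in both directions — around $0.01/GB each way on AWS at the time of writing (check current pricing), so every request path pays twice. The arithmetic:

```python
def monthly_cross_az_cost(gb_per_month, rate_each_way=0.01):
    """Estimate monthly cross-AZ transfer cost.

    rate_each_way reflects AWS's roughly $0.01/GB charge per direction
    (verify against current pricing); traffic is billed both ways.
    """
    return gb_per_month * rate_each_way * 2
```

At those rates, around 225TB a month of internal chatter produces the $4-5k bill from the thread — easy to hit when every service call crosses an AZ boundary by default.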
Auto-scaling is only useful if you let it work. One of the more frustrating stories in the thread described teams manually overriding autoscaling, effectively disabling it. They’d bump instance counts up during a busy period and never set them back, or they’d set minimum counts so high that the “auto” in autoscaling became meaningless.
The result? Infrastructure permanently sized for peak load, paying peak prices during off-peak hours. All the cost of manual provisioning, none of the benefits of elastic scaling.
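Spotting this state is simple enough that it's worth automating. A sketch of the check — the inputs are hypothetical names, but the logic is just two comparisons:

```python
def autoscaling_effectively_disabled(min_size, max_size, typical_load_instances):
    """Flag scaling groups where manual overrides have pinned capacity.

    If min == max the group can't scale at all; if the floor already
    covers typical load, the group sits at peak size around the clock.
    """
    return min_size >= max_size or min_size >= typical_load_instances
```

Run it across your scaling groups periodically and the "we bumped it up during Black Friday and forgot" problem at least becomes visible, even if fixing it still takes a human.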
Machine learning teams and GPU instances are a particularly expensive combination when oversight is thin. One commenter described a six-figure AWS bill caused by ML developers leaving GPU instances running after experiments. GPU instances can cost $10-30+ per hour. Leave a handful running over a weekend and you’ve spent more than a month of someone’s salary.
The broader lesson: any resource that’s expensive per-hour needs aggressive lifecycle management. If spinning it up is easy but spinning it down is manual, the default outcome is waste.
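To make the weekend scenario concrete, here's the arithmetic for instances forgotten from Friday evening to Monday morning — the $24/hour rate is illustrative, sitting mid-range of the $10-30+ figures above:

```python
def weekend_gpu_bill(instances, hourly_rate=24.0, hours=64):
    """Cost of GPU instances left running Friday 6pm to Monday 10am.

    hourly_rate is an illustrative mid-range figure; large GPU
    instances run roughly $10-30+ per hour depending on type.
    """
    return instances * hourly_rate * hours
```

Five forgotten instances at that rate is $7,680 for a single weekend — which is why a cron job that stops idle GPU instances pays for itself the first time it fires.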
Perhaps the most jaw-dropping example wasn’t technical at all. Someone described a self-hosted SharePoint deployment costing $450k per year — built not because the organisation needed it, but because someone wanted the project on their CV for a promotion. Nearly half a million dollars annually for a vanity project dressed up as infrastructure.
And then there was the team spending $100k per month on cloud data processing that would have paid for itself in a few months if run on-prem. The cloud was the wrong tool, but nobody had the political capital (or the inclination) to argue for moving it.
Not every expensive mistake is a technical one. Some are organisational. The infrastructure just happens to be where the cost shows up.
Read through enough of these stories and a pattern emerges. The waste isn’t caused by incompetent engineers. It’s caused by infrastructure that demands constant attention and doesn’t get it. Runners that need scale-down policies. Environments that need cleanup automation. Storage that needs lifecycle rules. Logging that needs level management. Instances that need shutdown schedules.
Every one of those requirements is reasonable. The problem is that they multiply. A team running their own CI/CD, managing their own environments, configuring their own scaling, and auditing their own storage has a full-time job just keeping costs under control — before they write a single line of product code.
The real question isn’t “how do we fix each of these problems individually?” It’s “why are we managing all of this ourselves?”
This is where platform-as-a-service stops being an abstract concept and starts being a financial decision.
Code Capsules takes a different approach to deployment. You push code, it runs. There are no CI/CD runners to manage or forget about. There are no orphaned environments accumulating charges — environments are tied to your deployment lifecycle, not to forgotten branches. Scaling is handled by the platform, not by humans who might override it and walk away.
The things that cause the most expensive mistakes in the Reddit thread — runners, environments, scaling, infrastructure sprawl — simply aren’t your problem on a PaaS that’s designed properly. You’re not configuring auto-scaling policies and hoping someone remembers to set the scale-down threshold. You’re not writing cleanup scripts for test environments and hoping they actually run. The platform handles it.
That doesn’t mean PaaS is right for every workload. If you’re running custom ML pipelines on GPU instances, you need direct infrastructure access. But for the application deployments that make up the bulk of most organisations’ cloud spend? The overhead of self-managing that infrastructure is where the waste lives.
Every hour an engineer spends debugging a CI/CD runner scaling issue or tracking down an orphaned environment is an hour not spent on the product. Every dollar spent on infrastructure that exists only because nobody cleaned it up is a dollar that could have gone elsewhere.
The Reddit thread is worth reading in full — partly for the cautionary tales, partly for the catharsis of knowing you’re not alone. But the takeaway is straightforward: most cloud waste comes from infrastructure that teams manage themselves and manage imperfectly. Reduce the surface area of what you manage, and you reduce the surface area for expensive mistakes.
If you’re tired of babysitting infrastructure and watching money evaporate on resources nobody asked for, give Code Capsules a look. Push your code, let the platform handle the rest, and spend your budget on things that actually matter.