Recently I read a post by Girish Mathrubootham on Build vs Buy in AI. His point was simple. All the building blocks are now available. Every enterprise is rolling their own stack because there is nothing to buy yet. But this DIY phase will not last. A platform will come, absorb the complexity, and the teams who kept building their own will be stuck maintaining old systems for many years.
I want to apply this thinking to one specific area where the build instinct is naturally high: AI agents for SRE, Cloud Ops, AIOps and Support.
The typical in-house build goes like this. A team takes the OpenAI or Anthropic API, wires up some MCP servers, pushes it into Slack, and calls it an agent platform. It works in the demo. It breaks in production. Here are 10 reasons why this approach is at a disadvantage compared to dedicated platform companies that combine deep domain understanding with AI product engineering.
1. Everyone reduces agents to ChatOps. First thing teams do is put a bot in Slack. Ask it about the cluster, ask it to summarize an incident. This is not agentic. This is a smarter search bar. Real value is in autonomous loops. Detect, diagnose, correlate signals, propose fix, execute with guardrails, learn from outcome. Chat is one interface only.
2. Script mindset, not product mindset. SRE teams are trained to write scripts and point solutions. One Lambda here, one Terraform module there. Building an agent platform needs different skills. Versioning, multi-tenancy, eval harness, telemetry, permissions, rollback. This is product engineering. Most internal teams give this work to the same people running production. Output is ten clever scripts with an LLM bolted on, not a platform.
3. Quality depends on who is typing. When the agent is chat-driven, output quality depends on the engineer's prompting skill and what context they remember to include. A senior SRE will get useful answers. A new joiner in week two will get hallucinations because she does not know what context to provide. A real platform encodes expertise so the new joiner gets senior leverage. DIY chat agents make the gap worse.
4. Nobody has full enterprise visibility. To debug a real incident, the agent needs context across application, workload, infra, network, data pipeline, deploy history, cost and dependencies. No single team has all this. App team knows the service. Platform team knows the cluster. FinOps knows the spend. Security knows IAM. Internal build starts from whatever the building team can see, which is always partial.
5. Knowledge stays in silos. The most valuable operational knowledge is in old Slack threads, postmortem docs nobody reads, senior engineers' heads, and a wiki page three reorgs old. Unifying this needs a proper knowledge layer with ingestion, entity resolution, freshness scoring, access control. Internal projects rarely build this. They stub a vector DB, dump some docs, move on.
6. The narrow POC trap. Every internal project starts the same. Pick one workflow, prove it in controlled environment, demo to leadership. Demo works because data is clean, scope is narrow, prompts were tuned last week. Then production hits. Ambiguous tickets, partial telemetry, conflicting signals, multi-tenant data, ten naming conventions for the same service. Accuracy falls, latency goes up, and the POC suddenly proved almost nothing.
7. No enterprise context layer. The biggest predictor of agent accuracy is the quality of the enterprise context layer. Services, dependencies, topology, ownership, runbooks, past incidents, historical fixes. Building this properly is a multi-quarter engineering effort. Internal teams do not have bandwidth for it. They stub something basic, and accuracy suffers from day one.
8. When accuracy drops, the model gets the blame. This is the most damaging point. When accuracy is 50:50, adoption falls fast. Leadership asks why, and the answer is always "the model is not good enough yet, let us wait for the next release". But the model is doing very little of the work that determines accuracy. Retrieval quality, context engineering, tool definitions, eval coverage and grounding strategy are doing most of the work. These are engineering investments nobody made. So the model takes the blame.
9. Cost economics catch up fast. Starter setup is always the same. Call a frontier reasoning model in a loop, send big context, hope for the best. Works in demo. Will not work at enterprise volume. Not on cost, not on latency, not even on accuracy for routine tasks where reasoning models overthink. You need model routers, fine-tuned small models doing 80-90% of routine work, frontier models only for hard reasoning, prompt caching, eval-gated upgrades. Different engineering discipline altogether.
10. Tech moves every quarter, internal builds get stuck. Models change every 3 months. Agent frameworks faster. Eval tooling, memory layers, MCP standards, all moving. Orchestration written against GPT-4 looks nothing like what you write today. Platform team treats this as core competence. Internal team treats it as painful maintenance. Six months later, upgrades break too many things, so the team freezes. One year later they are two generations behind. Enthusiasm dies, the agent gets quietly deprecated.
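To make the contrast in reason 1 concrete, here is a minimal sketch of an autonomous loop with a guardrail, as opposed to a chat bot that only answers questions. Everything in it is hypothetical and for illustration only: the `Proposal` and `AgentLoop` names, the "low"/"high" risk labels, the `restart_pod` allowlist, and the stubbed `fake_diagnose` function that stands in for an LLM-backed diagnosis step.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Proposal:
    action: str   # e.g. "restart_pod" (hypothetical action name)
    target: str   # service the action applies to
    risk: str     # "low" or "high" (hypothetical risk labels)


@dataclass
class AgentLoop:
    diagnose: Callable[[dict], Proposal]   # alert -> proposed fix (an LLM call in practice)
    execute: Callable[[Proposal], bool]    # returns True if the fix worked
    auto_approve: frozenset = frozenset({"restart_pod"})  # guardrail allowlist
    history: list = field(default_factory=list)           # outcomes kept for later learning

    def handle(self, alert: dict) -> str:
        proposal = self.diagnose(alert)
        # Guardrail: only low-risk, allowlisted actions run without a human.
        if proposal.risk != "low" or proposal.action not in self.auto_approve:
            self.history.append((proposal, "escalated"))
            return "escalated"
        outcome = "resolved" if self.execute(proposal) else "failed"
        self.history.append((proposal, outcome))
        return outcome


def fake_diagnose(alert: dict) -> Proposal:
    # Stand-in for signal correlation plus model reasoning.
    if alert["signal"] == "crashloop":
        return Proposal("restart_pod", alert["service"], "low")
    return Proposal("rollback_deploy", alert["service"], "high")


loop = AgentLoop(diagnose=fake_diagnose, execute=lambda p: True)
first = loop.handle({"service": "checkout-api", "signal": "crashloop"})
second = loop.handle({"service": "checkout-api", "signal": "error_rate"})
print(first, second)
```

The first alert is fixed without a human and the second is escalated, and both outcomes land in `history`. Nobody typed a prompt at any point, which is the difference between an agent and a smarter search bar.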
We have seen this pattern many times.
Monitoring. Earlier we wrote our own scripts with grep and awk, built homegrown dashboards. It did not scale. Datadog, Splunk, New Relic, Grafana Cloud came in and DIY monitoring became a joke.
Kubernetes. Before managed K8s, every team ran their own control plane. Today running your own is a cost problem and a hiring problem.
Coding with LLMs. Earlier we copy-pasted from ChatGPT in the browser. Productivity gain was small. Then Claude Code, Cursor and Codex shipped real agentic harnesses around the model and productivity doubled. Notice that nobody tried to build their own Claude Code in-house. Everyone could see the gap between calling a model and engineering an agent.
The same realisation is coming to SRE and Cloud Ops. The DIY phase will end. What replaces it is platforms with a proper enterprise context layer, the right mix of frontier and small models, accurate context engineering, token economics built in, latency as a product requirement, and continuous evals across model releases.
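The "right mix of frontier and small models" and the token economics from reason 9 can be shown with a small routing sketch. All names, intents and prices here are assumptions chosen for the arithmetic, not real model names or vendor pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_1k_tokens: float  # illustrative price, not a real quote


SMALL = Model("small-finetuned", 0.0005)
FRONTIER = Model("frontier-reasoning", 0.03)

# Hypothetical set of intents considered routine.
ROUTINE_INTENTS = {"summarize_alert", "lookup_runbook", "classify_ticket"}


def route(intent: str, signals_conflict: bool) -> Model:
    """Send routine, unambiguous work to the small model; escalate the rest."""
    if intent in ROUTINE_INTENTS and not signals_conflict:
        return SMALL
    return FRONTIER


def cost(requests: list[tuple[str, bool, int]], router=route) -> float:
    """Total spend in USD for (intent, signals_conflict, tokens) requests."""
    return sum(router(i, c).usd_per_1k_tokens * t / 1000 for i, c, t in requests)


# 9 routine requests and 1 hard one, 2,000 tokens each.
workload = [("summarize_alert", False, 2000)] * 9 + [("root_cause", True, 2000)]
routed = cost(workload)
frontier_only = cost(workload, router=lambda i, c: FRONTIER)
print(f"routed: ${routed:.3f}, frontier only: ${frontier_only:.3f}")
```

With these made-up prices, the routed workload costs about $0.07 against $0.60 for sending everything to the frontier model. A real router would also need evals to confirm the small model is accurate enough on the routine intents before traffic is shifted.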
The teams building their own today are doing real work. The question is not whether the work is good. It is. The question is whether they will still be maintaining it three years from now, while their competitors have moved on.
History tells us they will not.