CLI vs MCP: What We Use Where, and Why

Every AI agent platform hits the same question sooner or later. How should the agent actually do its work? Should it run command line tools the way a human engineer would? Should we wrap everything inside MCP servers and call them as functions? Or some mix of both?

At Nudgebee we answer this across 30+ pre-built agents and 30+ integrations: Kubernetes, Prometheus, Datadog, Splunk, Jira, ServiceNow, Slack, GitHub, Postgres, Redis, and a long list of cloud services on top of that. There is no single answer that works for all of them. Different tools need different bridges. This post is about how we decide, with examples from the agents we ship today.

Why SRE makes this harder than most domains

An on-call engineer's day is mostly small, fast decisions. List the failing pods. Check the recent deployments. Grep the logs. Compare metrics with last week. Restart a service. File a ticket. Most of these already have very good CLIs: kubectl, aws, gcloud, az, git, jq, grep, helm. The CLI ecosystem represents a decade of codified SRE workflows.

This changes the math. In a domain where no good CLI exists, you would build MCP servers for everything. But in SRE, the world is already full of capable command line tools that the LLM has seen thousands of examples of in its training data. Ignoring that is wasteful. At the same time, the world also has proprietary observability APIs, stateful ticketing systems, and dangerous destructive operations where raw shell access is the wrong answer. So the right architecture is somewhere in the middle.

Where CLI wins

There is a class of tasks where letting the agent run a real command line tool is just better. The tool exists, it works, the model knows it, and wrapping it adds no value.

Kubernetes inspection. Our Kubectl Agent runs things like kubectl get pods -n payments --field-selector=status.phase!=Running -o json against the cluster. No API we could expose through MCP would be more expressive, or better understood by the model, than kubectl itself.

Cloud account exploration. When the AI-FinOps Assistant lists unattached EBS volumes or stale snapshots, aws ec2 describe-volumes is the right tool. Each of gcloud, aws, and az is larger than any MCP server we would realistically build, and each already handles pagination, auth, region routing, and rate limit backoff.

Git history and code inspection. Root cause work often comes back to "what changed". git log --since='2 hours ago', git blame, git diff. Read only, fast, easy for the model to parse. No MCP server adds value here.

Filesystem and log triage. grep, find, tail | jq. General purpose primitives that compose into a thousand different one-off queries. Pre-building a search_logs_for_pattern MCP tool gets you only one of those thousand. The composition itself is what you actually want.

Common pattern: read only or low risk operations on tools the model already understands well, where the value is the full expressive range of the CLI.

Where MCP wins

For a different set of integrations, the same argument flips. These are most of the high stakes paths through Nudgebee.

Proprietary observability APIs. Datadog, Splunk, Loki and the rest do have CLIs, but they are either thin or awkward. PromQL is powerful but unforgiving. The model writes wrong queries often enough that giving it raw curl access to /api/v1/query is a step backward. Our Prometheus Agent validates the query before sending, injects defaults for time range and step size, and formats the response in a shape the model can read easily. This is query shaping and output shaping. Without it, the model spends every call reasoning about raw response formats, and gets it wrong some of the time.
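
A minimal sketch of the query shaping and output shaping described above, not the Prometheus Agent's actual code. The function names, the one-hour default window, the 60s step, and the summary fields are all assumptions made for illustration. The bracket check is a cheap stand-in for a real PromQL parser. The parameter names (query, start, end, step) and the response layout follow the Prometheus HTTP API for /api/v1/query_range.

```python
import time

DEFAULT_WINDOW_SECONDS = 3600  # assumed default: look back one hour
DEFAULT_STEP = "60s"           # assumed default resolution


def validate_promql(query):
    """Cheap pre-flight check (not a real PromQL parser)."""
    if not query.strip():
        raise ValueError("empty query")
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    in_quotes = False
    for ch in query:
        if ch == '"':
            in_quotes = not in_quotes
        elif in_quotes:
            continue  # ignore brackets inside label values
        elif ch in "([{":
            stack.append(ch)
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                raise ValueError(f"unbalanced {ch!r} in query")
    if stack or in_quotes:
        raise ValueError("unclosed bracket or quote in query")


def build_range_params(query, start=None, end=None, step=None, now=None):
    """Validate the query and fill in defaults the model left out."""
    validate_promql(query)
    if end is None:
        end = now if now is not None else time.time()
    if start is None:
        start = end - DEFAULT_WINDOW_SECONDS
    if start >= end:
        raise ValueError("start must be before end")
    return {"query": query, "start": start, "end": end, "step": step or DEFAULT_STEP}


def shape_matrix(response, max_series=5):
    """Collapse a query_range response into a few compact rows per series."""
    if response.get("status") != "success":
        return [{"error": response.get("error", "unknown error")}]
    rows = []
    for series in response["data"]["result"][:max_series]:
        values = [float(v) for _, v in series["values"]]
        if not values:
            continue
        rows.append({
            "labels": series["metric"],
            "points": len(values),
            "min": min(values),
            "max": max(values),
            "last": values[-1],
        })
    return rows
```

The model then sees a handful of summary rows per series instead of every raw sample, and a malformed query fails before it reaches the backend.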

Ticketing with state. Jira and ServiceNow are not single shot APIs. Creating a ticket means resolving a project key, validating issue type, applying workflow transitions, attaching the right reporter, and often updating the same ticket later as the incident progresses. The MCP server holds that context. It knows which Jira project maps to which Kubernetes namespace, remembers the ticket ID it just created, and enforces required fields. This belongs on the server side, not in the prompt.
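
To make the "server holds the context" idea concrete, here is a rough sketch. Everything in it is hypothetical: the TicketSession class, the required-field list, and the InMemoryBackend, which stands in for a real Jira or ServiceNow client so the example runs on its own.

```python
REQUIRED_FIELDS = ("summary", "issue_type", "reporter")  # assumed required fields


class InMemoryBackend:
    """Stand-in for a real ticketing client (hypothetical interface)."""

    def __init__(self):
        self.count = 0
        self.comments = []

    def create(self, project, fields):
        self.count += 1
        return f"{project}-{self.count}"

    def comment(self, ticket_id, text):
        self.comments.append((ticket_id, text))


class TicketSession:
    """Server-side state: namespace-to-project mapping and open tickets."""

    def __init__(self, namespace_to_project, backend):
        self._projects = dict(namespace_to_project)
        self._backend = backend
        self._ticket_by_incident = {}

    def upsert(self, incident_id, namespace, fields):
        # Later updates for the same incident land on the same ticket.
        if incident_id in self._ticket_by_incident:
            ticket_id = self._ticket_by_incident[incident_id]
            self._backend.comment(ticket_id, fields.get("summary", ""))
            return ticket_id
        missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        if namespace not in self._projects:
            raise ValueError(f"no project mapped for namespace {namespace!r}")
        ticket_id = self._backend.create(self._projects[namespace], fields)
        self._ticket_by_incident[incident_id] = ticket_id
        return ticket_id
```

The agent only ever passes an incident ID, a namespace, and fields. The project key and the ticket ID never have to live in the prompt.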

Messaging with credentials. Posting to Slack or Teams is not hard, but we do not want the agent holding the webhook URL or bot token in its prompt context. The MCP server keeps the credential, enforces a channel allowlist (so the agent can post to #sre-alerts but not #leadership), and stamps every message with an audit trail ID.
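
A small sketch of that boundary, under stated assumptions: MessagingGateway and its send callable are hypothetical names, and the audit ID format (a random UUID appended to the message) is just one way to stamp a message. The point is that the token lives only inside the server object and the allowlist is checked before anything is sent.

```python
import uuid


class MessagingGateway:
    """Holds the credential server-side and enforces a channel allowlist."""

    def __init__(self, token, allowed_channels, send):
        self._token = token                      # never returned to the agent
        self._allowed = frozenset(allowed_channels)
        self._send = send                        # callable(token, channel, text)

    def post(self, channel, text):
        if channel not in self._allowed:
            raise PermissionError(f"channel {channel!r} is not on the allowlist")
        audit_id = uuid.uuid4().hex
        self._send(self._token, channel, f"{text}\n[audit:{audit_id}]")
        # The result the agent sees contains no credential material.
        return {"channel": channel, "audit_id": audit_id}
```

In use, the gateway is constructed once with the real sender, and the agent is only given the post operation.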

Database access. Our Postgres and Redis Agents are MCP services for one reason. A raw psql shell is dangerous. But read only inspection ("which queries are slow", "what is in this key") is extremely useful during incidents. So the server exposes explain_query, top_slow_queries, current_locks, get_key_ttl. Narrow tools that physically cannot run a DROP TABLE or FLUSHALL. The MCP boundary itself becomes the safety boundary.
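
One way to get that property is to let the agent pick a tool name and nothing else, with the SQL fixed on the server. The sketch below is illustrative only: the statements are assumptions, not the Postgres Agent's actual queries. They rely on the pg_stat_statements extension (the mean_exec_time column exists in PostgreSQL 13 and later) and the pg_locks view, and use the %s placeholder style of psycopg-like drivers. explain_query is left out of the sketch because it accepts agent-supplied SQL and needs separate handling.

```python
MAX_ROWS = 100  # assumed upper bound on any row-count argument

# Tool name -> (fixed SQL, names of integer parameters). The agent never
# supplies SQL text, so there is no statement it can turn into DROP TABLE.
TOOLS = {
    "top_slow_queries": (
        "SELECT query, calls, mean_exec_time "
        "FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT %s",
        ("limit",),
    ),
    "current_locks": (
        "SELECT locktype, relation::regclass, mode, granted, pid "
        "FROM pg_locks WHERE NOT granted",
        (),
    ),
}


def plan_call(tool, **args):
    """Map a tool call to (sql, values) for a driver's execute(sql, values)."""
    if tool not in TOOLS:
        raise PermissionError(f"unknown tool: {tool}")
    sql, param_names = TOOLS[tool]
    if set(args) != set(param_names):
        raise ValueError(f"{tool} expects exactly these arguments: {param_names}")
    # Coerce to int and clamp, so arguments cannot carry SQL or huge limits.
    values = tuple(max(1, min(MAX_ROWS, int(args[name]))) for name in param_names)
    return sql, values
```

Running the returned statement under a database role with read-only privileges would add a second layer behind the tool boundary.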

GitHub mutations. Reading PRs with gh pr list is fine over CLI. But creating a PR that auto-remediates a misconfigured deployment, or commenting on an issue with a runbook link, goes through an MCP server. It enforces that only the bot user can act, templates the PR description, and logs every mutation centrally. Read with CLI, write with MCP.

Common pattern: state, credentials, guardrails, output shaping, or no good CLI exists in the first place.

The middle path: MCP that wraps a CLI

This is where most of our agents actually live, and it is the part most CLI vs MCP debates miss. The two patterns are not mutually exclusive. In production, we use a Policy-Aware Wrapper: a secure execution layer that sits between the agent and the shell. The agent thinks it is talking to a tool, and that tool is a hardened wrapper that shells out to the CLI.

Our Kubectl Agent works exactly this way. The model emits something like "check whether the api-gateway pods are healthy". It reasons its way to a kubectl invocation. But that invocation does not go straight to os.exec. It goes through a wrapper that parses the command to make sure it is read only, scopes it to the agent's allowed namespaces and clusters, adds -o json if the agent forgot, truncates large output before it hits the model's context window, and logs everything to the audit trail.

In other words, we do not give our agents a raw shell. When an agent decides to run kubectl, it is actually calling a Nudgebee-managed binary. This Policy-Aware Wrapper acts as a circuit breaker. It intercepts the command, checks it against the user's RBAC, injects safety flags (like --dry-run or -o json), and truncates the output. By the time the command reaches the cluster, it has been vetted. This gives us the power of the CLI with the safety of a managed API.
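
The steps above can be sketched roughly as follows. This is a simplified illustration, not the Nudgebee wrapper: the verb allowlist, the 4000-character limit, and the function names are assumptions, and a production wrapper would need much stricter parsing (for example, rejecting flags such as --kubeconfig, --context, or --as, and checking which resource kinds may be read). The sketch returns the vetted argument list instead of executing it. Passing that list to a process launcher without a shell means characters like ; and | are never interpreted.

```python
import shlex

READ_ONLY_VERBS = {"get", "describe", "logs", "top"}  # assumed allowlist
MAX_OUTPUT_CHARS = 4000                               # assumed context budget


def vet_kubectl(command, allowed_namespaces):
    """Return a vetted argv for a read-only, namespace-scoped kubectl call."""
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "kubectl":
        raise ValueError("not a kubectl command")
    verb = argv[1]
    if verb not in READ_ONLY_VERBS:
        raise PermissionError(f"verb {verb!r} is not read only")
    if "-A" in argv or "--all-namespaces" in argv:
        raise PermissionError("cluster-wide queries are not allowed")

    namespaces = []
    for i, arg in enumerate(argv):
        if arg in ("-n", "--namespace"):
            if i + 1 >= len(argv):
                raise ValueError("namespace flag without a value")
            namespaces.append(argv[i + 1])
        elif arg.startswith("--namespace="):
            namespaces.append(arg.split("=", 1)[1])
        elif arg.startswith("-n"):
            # Forms like -nfoo or -n=foo are refused to keep parsing unambiguous.
            raise PermissionError("unsupported namespace flag form")
    if len(namespaces) != 1 or namespaces[0] not in allowed_namespaces:
        raise PermissionError("exactly one allowed namespace is required")

    # Add -o json only for get, and only if no output flag was given.
    has_output = any(a.startswith("-o") or a.startswith("--output") for a in argv)
    if verb == "get" and not has_output:
        argv += ["-o", "json"]
    return argv


def truncate(output, limit=MAX_OUTPUT_CHARS):
    """Cap output size before it reaches the model's context window."""
    if len(output) <= limit:
        return output
    return output[:limit] + f"\n... [truncated {len(output) - limit} chars]"
```

For example, vet_kubectl("kubectl get pods -n payments", {"payments"}) yields the original arguments plus -o json, while a delete or a query against a namespace outside the set raises PermissionError. An audit log entry would be written at the same point, before the vetted command runs.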

Heuristics we use when adding a new integration

When a new tool needs wiring in, the team walks through this list:

Does a mature CLI exist for the tool? If yes, the LLM probably already knows it. Lean toward shell out.

Is the operation read only? If yes, the risk of letting the model drive a CLI is low.

Does the tool have state across calls (tickets, sessions, cursors)? Lean toward MCP.

Are credentials sensitive enough that the agent should not see them? Always MCP.

Does the output need shaping for the model? Lean toward MCP.

Anything destructive? MCP, or MCP wrapping a CLI, with human-in-the-loop.

In practice, most of our production agents end up as MCP services that internally shell out for the parts where the CLI is the best implementation anyway. The boundary is policy and shape. The engine is whatever is already battle tested.
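
The checklist above can be written down as a small decision function. This is only a sketch: the field names and the returned labels are made up for illustration, and the order in which the checks are applied is an assumption, since the list states leanings rather than a strict precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Integration:
    mature_cli: bool
    read_only: bool
    stateful: bool
    sensitive_credentials: bool
    needs_output_shaping: bool
    destructive: bool


def choose_bridge(tool):
    """Apply the heuristics in an assumed order, strictest first."""
    if tool.destructive:
        return "mcp + human approval"
    if tool.sensitive_credentials:
        return "mcp"
    if tool.stateful or tool.needs_output_shaping:
        return "mcp"
    if tool.mature_cli and tool.read_only:
        return "cli"
    if tool.mature_cli:
        return "mcp wrapping cli"
    return "mcp"


# Example inputs (hypothetical classifications of two integrations).
kubectl_get = Integration(True, True, False, False, False, False)
slack_post = Integration(False, False, False, True, False, False)
```

Under these assumptions, choose_bridge(kubectl_get) returns "cli" and choose_bridge(slack_post) returns "mcp".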

The most underrated option: the enterprise context layer

So far this post has been about how the agent makes the call. CLI, MCP, or some mix of the two. But there is a question that comes before any of these, and it is the one most teams skip.

Does the agent even need to make a call?

The cheapest tool call is the one you do not have to make. For an SRE assistant working on an incident, the same five facts get asked again and again. What does this service depend on? Who deployed it last? Which alert fired first? What was the metric trend yesterday? These do not need a fresh kubectl call every single time. They need a place to live.

We've found that an underlying Semantic Knowledge Graph solves the N+1 tool-call problem. The graph holds the linkages between services, deployments, alerts, tickets, logs, metrics, traces, configs, secrets, and code. Every agent run writes back into it. So by the time an incident hits, most of the context the model needs is already correlated and ready to read, not fetched live.
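
As a toy illustration of the graph-first pattern (the real graph is certainly richer than this), the sketch below answers from stored linkages when it can and only falls back to a live fetch on a miss, writing the result back for the next run. ContextGraph, resolve, and the fetch_live callable are hypothetical names.

```python
from collections import defaultdict


class ContextGraph:
    """Tiny in-memory store of (subject, relation) -> related objects."""

    def __init__(self):
        self._edges = defaultdict(set)

    def link(self, subject, relation, obj):
        self._edges[(subject, relation)].add(obj)

    def lookup(self, subject, relation):
        return sorted(self._edges.get((subject, relation), ()))


def resolve(graph, subject, relation, fetch_live):
    """Read from the graph first; call the tool only when nothing is known."""
    known = graph.lookup(subject, relation)
    if known:
        return known, "graph"
    for obj in fetch_live(subject, relation):  # e.g. a kubectl or API call
        graph.link(subject, relation, obj)     # write back for later runs
    return graph.lookup(subject, relation), "live"
```

The second time the same question is asked, no tool call is made at all. A real system would also need a freshness policy for stored facts, which this sketch leaves out.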

The effects stack up quickly:

Fewer model calls. If the answer is in the graph, there is no tool call, no second LLM round trip to interpret the tool output, no third round trip to summarize. One call instead of five.
Lower latency. A graph lookup takes milliseconds. A kubectl call against a remote cluster takes hundreds of milliseconds at best. An MCP call into a third party API often takes a full second or more. Over a real incident, this is the difference between a 4 second response and a 40 second one.
Better accuracy. The model is not guessing relationships between a pod, a deployment, a recent PR, and a Jira ticket. Those linkages are already resolved in the graph. The model just reads them.
SLMs become viable. Smaller, cheaper models work well when the context is curated and tight. They fall over when they have to reason about raw API responses. The graph does the heavy lifting of pre-correlation, so an SLM is often enough.
Lower cost. Fewer tokens in, fewer tokens out, cheaper models doing the work. This adds up fast when you run thousands of incidents a month.

Most teams putting AI agents into production are focused on the tool layer. They are building MCP servers, integrating CLIs, arguing over which one is better. The teams that get good unit economics are the ones who built a context layer underneath. CLI vs MCP is a real debate. But it is the easy one. The harder, and far more valuable, decision is what your enterprise context layer looks like before any tool call is made. That is where Nudgebee actually scores.

FAQs

Q1. When should an AI agent use a CLI vs an MCP server?

Use a CLI for read-only operations on mature tools the LLM already knows (kubectl, aws, git, grep). Use an MCP server when state, credentials, output shaping, or destructive operations are involved. Most production agents combine both, with MCP wrapping CLI execution behind a policy layer.

Q2. How do you safely let an AI agent run commands like kubectl in production?

Not by giving it a raw shell. Every kubectl call flows through a Policy-Aware Wrapper that scopes commands to RBAC, injects safety flags like --dry-run, truncates output, and logs to an audit trail. The agent thinks it is calling the CLI; it is actually calling a hardened binary that does.

Q3. Why doesn't wrapping every integration in MCP work?

The LLM already knows a decade of CLI workflows from its training data, and a fixed tool like search_logs_for_pattern collapses that into one query shape. The real value of grep | jq is the composition the model invents on the fly. For read-only work on mature CLIs, the CLI is usually more expressive than any wrapper.

Q4. What about high-stakes integrations like databases or messaging?

Those go through MCP servers, where the boundary itself is the safety boundary. The Postgres Agent exposes narrow tools like explain_query and top_slow_queries that physically cannot run DROP TABLE. Slack posting runs server-side behind a channel allowlist, so the agent can post to #sre-alerts but not #leadership.

Q5. How does a knowledge graph reduce AI agent latency and cost?

By eliminating the N+1 tool-call problem. Nudgebee's Semantic Knowledge Graph pre-correlates services, deployments, alerts, tickets, and metrics so most context is ready to read instead of fetched live. The result: fewer model calls, lookups in milliseconds instead of seconds, and smaller, cheaper models becoming viable.
