Multi‑Agent AI and Network Knowledge Graphs

Author: Ptrck Brgr

In network operations, change is a double‑edged sword: necessary for progress, but often the root cause of costly downtime when it goes wrong. Cisco’s Outshift group set out to tackle this with a pragmatic mix of AI agents and a real‑time network knowledge graph.

Instead of chasing blanket automation, the team zeroed in on precise workflow friction points. The result is a system that integrates with existing IT service management tools, understands the live network state, and coordinates specialized agents to assess, test, and approve changes before they hit production.

Main Story

Ola Mabadeje framed the challenge simply: too many production failures followed network changes. The default industry response—static, rule‑based automation—was not enough. Outshift’s answer was to embed AI into the flow where it could deliver measurable value.

The architecture rests on three pillars (a minimal sketch of how they fit together follows the list):

  1. A natural language interface for engineers and systems, including ServiceNow integration
  2. A multi‑agent system with clear roles such as impact assessment, testing, and reasoning
  3. A dynamic network knowledge graph acting as a digital twin of the production environment
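
Taken together, the pillars might be wired up roughly as follows. This is a minimal sketch under assumed names; the classes and methods are illustrative, not Outshift's actual interfaces.

```python
from dataclasses import dataclass

# Illustrative only: these interfaces mirror the three pillars described above,
# not Outshift's actual implementation.

@dataclass
class ChangeRequest:
    ticket_id: str          # e.g. a ServiceNow change ticket (assumed field)
    description: str        # natural-language summary of the proposed change
    proposed_config: str    # candidate device configuration

class KnowledgeGraph:
    """Digital twin of the production network (pillar 3)."""
    def query(self, graph_query: str) -> list:
        raise NotImplementedError

class Agent:
    """Specialized worker: impact assessment, testing, reasoning (pillar 2)."""
    def run(self, request: ChangeRequest, graph: KnowledgeGraph) -> dict:
        raise NotImplementedError

class NaturalLanguageInterface:
    """Entry point for engineers and ITSM systems such as ServiceNow (pillar 1)."""
    def __init__(self, agents: list[Agent], graph: KnowledgeGraph):
        self.agents, self.graph = agents, graph

    def handle(self, request: ChangeRequest) -> list[dict]:
        # Each agent contributes one piece of the pre-deployment assessment.
        return [agent.run(request, self.graph) for agent in self.agents]
```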

“There are specific spots in the workflow where AI agents can actually help address a problem.”

Building the knowledge graph meant taming a sprawl of heterogeneous data sources: controllers, devices, configuration systems, each emitting data in YANG, JSON, or streaming telemetry. A custom ingestion pipeline normalizes these into the OpenConfig schema—chosen for its networking focus, documentation quality, and compatibility with large language models.
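
As a rough illustration of what that normalization step involves, the sketch below maps a hypothetical vendor-native interface record into an OpenConfig-shaped structure. The input field names are assumptions; only the OpenConfig paths follow the published openconfig-interfaces model.

```python
# Illustrative normalization step: map a vendor-native interface record into an
# OpenConfig-style structure. Input field names are hypothetical.

def normalize_interface(vendor_record: dict) -> dict:
    """Convert one vendor-specific interface blob to an OpenConfig-shaped dict."""
    return {
        "openconfig-interfaces:interface": {
            "name": vendor_record["ifName"],
            "config": {
                "name": vendor_record["ifName"],
                "description": vendor_record.get("descr", ""),
                "enabled": vendor_record.get("adminStatus", "up") == "up",
                "mtu": vendor_record.get("mtu", 1500),
            },
        }
    }

# Example record as it might arrive from a controller's REST API (assumed shape).
raw = {"ifName": "GigabitEthernet0/0/1", "descr": "uplink to core",
       "adminStatus": "up", "mtu": 9000}
print(normalize_interface(raw))
```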

The graph supports multimodal data, instant querying, a unified schema for operational flexibility, vector indexing for semantic search, and multi‑vendor support. ArangoDB powers the MVP, with Neo4j under consideration for future scenarios.
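
To make the querying concrete, here is a minimal sketch of a one-hop topology lookup using the python-arango client. The database name, credentials, graph name, and collection layout are placeholders, not the team's actual schema.

```python
# Minimal sketch with the python-arango client; database name, credentials, and
# the "devices"/"topology" naming are placeholders.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("network_kg", username="root", password="example")

# One-hop traversal: which devices are directly connected to a given switch?
cursor = db.aql.execute(
    """
    FOR v, e IN 1..1 ANY @start GRAPH "topology"
        RETURN {neighbor: v.hostname, link: e.interface}
    """,
    bind_vars={"start": "devices/leaf-01"},
)
print(list(cursor))
```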

Layering was key. Agents query only as deep as needed—raw config for drift detection, multiple layers for reachability testing—minimizing traversal and speeding responses.
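
A sketch of what that depth-matching might look like in practice; the layer names, collections, and queries are hypothetical.

```python
# Illustrative only: each task queries just the layers it needs, so traversal
# depth (and therefore latency and token cost) stays proportional to the task.

QUERIES_BY_TASK = {
    # Drift detection only needs the raw-config layer: no traversal at all.
    "config_drift": """
        FOR d IN device_configs
            FILTER d.hostname == @host
            RETURN d.running_config
    """,
    # Reachability testing walks topology plus routing layers, several hops deep.
    "reachability": """
        FOR v, e, p IN 1..4 OUTBOUND @src GRAPH "topology"
            FILTER v._id == @dst
            RETURN p.edges[*].interface
    """,
}

def query_for(task: str) -> str:
    return QUERIES_BY_TASK[task]
```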

On the agent side, Cisco collaborates with an open standards group to define interoperable agent identities, skills, and composition patterns. In practice, five agents work under a planner/assistant. A standout is the Query Agent, fine‑tuned with schema and query examples to cut token use and latency.
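
The orchestration pattern might look something like the sketch below: a planner routes steps to specialized agents, one of which is a query specialist. The agent classes, plan format, and outputs are illustrative assumptions, not the actual system.

```python
# Hypothetical planner/assistant loop: the roles mirror those described in the
# talk, but the orchestration code and outputs are placeholders.

class QueryAgent:
    """Fine-tuned to translate questions into graph queries with minimal context."""
    def run(self, step: dict) -> dict:
        return {"aql": "FOR d IN devices RETURN d.hostname"}  # placeholder output

class ImpactAssessmentAgent:
    """Estimates the blast radius of a proposed change."""
    def run(self, step: dict) -> dict:
        return {"impacted_devices": ["leaf-01", "leaf-02"]}   # placeholder output

class Planner:
    """Routes each step of a change ticket to the right specialized agent."""
    def __init__(self):
        self.agents = {"query": QueryAgent(), "impact": ImpactAssessmentAgent()}

    def execute(self, steps: list[dict]) -> list[dict]:
        return [self.agents[s["agent"]].run(s) for s in steps]

plan = [{"agent": "query", "goal": "list devices in the blast radius"},
        {"agent": "impact", "goal": "assess the proposed change"}]
print(Planner().execute(plan))
```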

In a live demo, a ServiceNow ticket triggers a chain: summarization, impact assessment, test plan generation, execution in a digital twin using tools like Batfish, and result attachment back to the ticket. The system pulls proposed configs from GitHub, snapshots the live network via the graph, runs tests, and reports pass/fail outcomes with remediation advice.
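
The talk names Batfish as one of the validation tools. As a sketch, the digital-twin test step could be driven through the pybatfish client along these lines; the host, network and snapshot names, and the simple pass/fail rule are assumptions.

```python
# Sketch of the digital-twin validation step, assuming pybatfish and a locally
# running Batfish service; paths and network/snapshot names are placeholders.
from pybatfish.client.session import Session

def validate_candidate_config(snapshot_dir: str) -> dict:
    bf = Session(host="localhost")            # Batfish service endpoint (assumed)
    bf.set_network("change-validation")
    bf.init_snapshot(snapshot_dir, name="candidate", overwrite=True)

    # Parse/initialization issues are a cheap first gate before deeper tests.
    issues = bf.q.initIssues().answer().frame()
    passed = issues.empty
    return {"passed": passed, "issues": issues.to_dict(orient="records")}

# The orchestrator would attach a result like this back to the ServiceNow ticket.
result = validate_candidate_config("snapshots/chg-candidate")
print("PASS" if result["passed"] else "FAIL", result["issues"][:3])
```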

Technical Considerations

For engineering leaders, several constraints and trade‑offs emerge:

  • Data normalization is non‑negotiable. Without a unified schema like OpenConfig, reasoning across multi‑vendor, multi‑format data is brittle
  • Layered graph design reduces compute and token costs by limiting query depth to task needs
  • Fine‑tuning agents for high‑frequency queries improves latency and reduces API costs (see the training‑set sketch after this list)
  • Vendor choices matter. ArangoDB met MVP needs for performance and flexibility; alternative graph stores may suit other workloads
  • Integration paths should embrace incumbents—ITSM tools, config repos, telemetry sources—to avoid workflow disruption
  • Latency is shaped by both graph query performance and LLM context limits; schema‑aware agents help keep context windows small
  • Security and privacy demand careful handling of production network data, especially when interfacing with external LLM services
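
On the fine‑tuning point, a schema‑aware Query Agent is typically trained on pairs of schema context, natural‑language question, and target query. The sketch below shows what such training examples could look like in a chat‑style JSONL format; the schema, question, and query are hypothetical.

```python
# Illustrative training examples for a schema-aware Query Agent: pair a small
# schema slice and a natural-language question with the target graph query.
# Schema fields and queries are hypothetical; the format is OpenAI-style chat JSONL.
import json

examples = [
    {
        "messages": [
            {"role": "system",
             "content": "Schema: devices(hostname, role), links(interface, speed)"},
            {"role": "user", "content": "Which devices are spine switches?"},
            {"role": "assistant",
             "content": "FOR d IN devices FILTER d.role == 'spine' RETURN d.hostname"},
        ]
    },
]

# JSONL is the usual on-disk format for chat-style fine-tuning datasets.
with open("query_agent_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```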

Business Impact & Strategy

The approach delivers measurable operational benefits:

  • Reduced downtime from failed changes through pre‑deployment impact analysis
  • Faster approval cycles by automating assessments and attaching test results directly to tickets
  • Improved test planning and execution quality via coordinated agent roles
  • Better use of engineering time by removing repetitive, manual checks

Evaluation focuses on extrinsic metrics tied to customer outcomes—not model accuracy scores. This aligns incentives with the real cost vectors: downtime, approval delays, and manual rework.

Organizationally, success depends on cross‑functional buy‑in. Network engineers, ITSM operators, and AI/ML teams must collaborate around shared workflows and agreed data models. Risk mitigation includes starting with an MVP in a controlled environment, validating against production‑like conditions, and expanding scope as trust builds.

Key Insights

  • Target AI to specific workflow pain points, not the entire process
  • Normalize heterogeneous data into a schema that supports consistent reasoning
  • Layer graph data to match query depth to task needs
  • Fine‑tune agents for frequent, high‑cost operations
  • Integrate seamlessly with existing operational systems

Why It Matters

For technical leaders, the combination of agentic reasoning and a live, queryable network model represents a shift from reactive to predictive operations. It enables teams to test changes in a safe, simulated environment that mirrors production, reducing the blast radius of errors.

For business leaders, the model ties AI investment directly to operational resilience and customer outcomes. It avoids the trap of AI for AI’s sake, instead anchoring innovation in measurable gains: uptime, speed, and quality.

Conclusion

Cisco’s Outshift team shows that the real value of AI in network operations lies in precision: deploying agents where they matter most, feeding them a live, structured view of the network, and embedding them into existing workflows. The result is less downtime, faster approvals, and a stronger operational posture.

Watch the full talk here: https://www.youtube.com/watch?v=m0dxZ-NDKHo