Multi‑Agent AI and Network Knowledge Graphs

Author: Ptrck Brgr

In network operations, change is a double‑edged sword: necessary for progress, but often the root cause of costly downtime when it goes wrong. Cisco’s Outshift group set out to tackle this with a pragmatic mix of AI agents and a real‑time network knowledge graph.

Instead of chasing blanket automation, the team zeroed in on precise workflow friction points. The result is a system that integrates with existing IT service management tools, understands the live network state, and coordinates specialized agents to assess, test, and approve changes before they hit production.

Main Story

Ola Mabadeje framed the challenge simply: too many production failures followed network changes. The default industry response—static, rule‑based automation—was not enough. Outshift’s answer was to embed AI into the flow where it could deliver measurable value.

The architecture rests on three pillars (a minimal sketch of how they fit together follows the list):

  1. A natural language interface for engineers and systems, including ServiceNow integration
  2. A multi‑agent system with clear roles such as impact assessment, testing, and reasoning
  3. A dynamic network knowledge graph acting as a digital twin of the production environment
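
Taken together, the pillars might be wired up roughly as follows. This is a minimal sketch under assumed names; the classes and methods are illustrative, not Outshift's actual interfaces.

```python
from dataclasses import dataclass

# Illustrative only: these interfaces mirror the three pillars described above,
# not Outshift's actual implementation.

@dataclass
class ChangeRequest:
    ticket_id: str          # e.g. a ServiceNow change ticket (assumed field)
    description: str        # natural-language summary of the proposed change
    proposed_config: str    # candidate device configuration

class KnowledgeGraph:
    """Digital twin of the production network (pillar 3)."""
    def query(self, graph_query: str) -> list:
        raise NotImplementedError

class Agent:
    """Specialized worker: impact assessment, testing, reasoning (pillar 2)."""
    def run(self, request: ChangeRequest, graph: KnowledgeGraph) -> dict:
        raise NotImplementedError

class NaturalLanguageInterface:
    """Entry point for engineers and ITSM systems such as ServiceNow (pillar 1)."""
    def __init__(self, agents: list[Agent], graph: KnowledgeGraph):
        self.agents, self.graph = agents, graph

    def handle(self, request: ChangeRequest) -> list[dict]:
        # Each agent contributes one piece of the pre-deployment assessment.
        return [agent.run(request, self.graph) for agent in self.agents]
```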

“There are specific spots in the workflow where AI agents can actually help address a problem.”

Building the knowledge graph meant taming a sprawl of heterogeneous data sources: controllers, devices, configuration systems, each emitting data in YANG, JSON, or streaming telemetry. A custom ingestion pipeline normalizes these into the OpenConfig schema—chosen for its networking focus, documentation quality, and compatibility with large language models.
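
As a rough illustration of what that normalization step involves, the sketch below maps a hypothetical vendor-native interface record into an OpenConfig-shaped structure. The input field names are assumptions; only the OpenConfig paths follow the published openconfig-interfaces model.

```python
# Illustrative normalization step: map a vendor-native interface record into an
# OpenConfig-style structure. Input field names are hypothetical.

def normalize_interface(vendor_record: dict) -> dict:
    """Convert one vendor-specific interface blob to an OpenConfig-shaped dict."""
    return {
        "openconfig-interfaces:interface": {
            "name": vendor_record["ifName"],
            "config": {
                "name": vendor_record["ifName"],
                "description": vendor_record.get("descr", ""),
                "enabled": vendor_record.get("adminStatus", "up") == "up",
                "mtu": vendor_record.get("mtu", 1500),
            },
        }
    }

# Example record as it might arrive from a controller's REST API (assumed shape).
raw = {"ifName": "GigabitEthernet0/0/1", "descr": "uplink to core",
       "adminStatus": "up", "mtu": 9000}
print(normalize_interface(raw))
```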

The graph supports multimodal data, instant querying, a unified schema for operational flexibility, vector indexing for semantic search, and multi‑vendor support. ArangoDB powers the MVP, with Neo4j under consideration for future scenarios.
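
To make the querying concrete, here is a minimal sketch of a one-hop topology lookup using the python-arango client. The database name, credentials, graph name, and collection layout are placeholders, not the team's actual schema.

```python
# Minimal sketch with the python-arango client; database name, credentials, and
# the "devices"/"topology" naming are placeholders.
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("network_kg", username="root", password="example")

# One-hop traversal: which devices are directly connected to a given switch?
cursor = db.aql.execute(
    """
    FOR v, e IN 1..1 ANY @start GRAPH "topology"
        RETURN {neighbor: v.hostname, link: e.interface}
    """,
    bind_vars={"start": "devices/leaf-01"},
)
print(list(cursor))
```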

Layering was key. Agents query only as deep as needed—raw config for drift detection, multiple layers for reachability testing—minimizing traversal and speeding responses.
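
A sketch of what that depth-matching might look like in practice; the layer names, collections, and queries are hypothetical.

```python
# Illustrative only: each task queries just the layers it needs, so traversal
# depth (and therefore latency and token cost) stays proportional to the task.

QUERIES_BY_TASK = {
    # Drift detection only needs the raw-config layer: no traversal at all.
    "config_drift": """
        FOR d IN device_configs
            FILTER d.hostname == @host
            RETURN d.running_config
    """,
    # Reachability testing walks topology plus routing layers, several hops deep.
    "reachability": """
        FOR v, e, p IN 1..4 OUTBOUND @src GRAPH "topology"
            FILTER v._id == @dst
            RETURN p.edges[*].interface
    """,
}

def query_for(task: str) -> str:
    return QUERIES_BY_TASK[task]
```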

On the agent side, Cisco collaborates with an open standards group to define interoperable agent identities, skills, and composition patterns. In practice, five agents work under a planner/assistant. A standout is the Query Agent, fine‑tuned with schema and query examples to cut token use and latency.
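
The orchestration pattern might look something like the sketch below: a planner routes steps to specialized agents, one of which is a query specialist. The agent classes, plan format, and outputs are illustrative assumptions, not the actual system.

```python
# Hypothetical planner/assistant loop: the roles mirror those described in the
# talk, but the orchestration code and outputs are placeholders.

class QueryAgent:
    """Fine-tuned to translate questions into graph queries with minimal context."""
    def run(self, step: dict) -> dict:
        return {"aql": "FOR d IN devices RETURN d.hostname"}  # placeholder output

class ImpactAssessmentAgent:
    """Estimates the blast radius of a proposed change."""
    def run(self, step: dict) -> dict:
        return {"impacted_devices": ["leaf-01", "leaf-02"]}   # placeholder output

class Planner:
    """Routes each step of a change ticket to the right specialized agent."""
    def __init__(self):
        self.agents = {"query": QueryAgent(), "impact": ImpactAssessmentAgent()}

    def execute(self, steps: list[dict]) -> list[dict]:
        return [self.agents[s["agent"]].run(s) for s in steps]

plan = [{"agent": "query", "goal": "list devices in the blast radius"},
        {"agent": "impact", "goal": "assess the proposed change"}]
print(Planner().execute(plan))
```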

In a live demo, a ServiceNow ticket triggers a chain: summarization, impact assessment, test plan generation, execution in a digital twin using tools like Batfish, and result attachment back to the ticket. The system pulls proposed configs from GitHub, snapshots the live network via the graph, runs tests, and reports pass/fail outcomes with remediation advice.
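
The talk names Batfish as one of the validation tools. As a sketch, the digital-twin test step could be driven through the pybatfish client along these lines; the host, network and snapshot names, and the simple pass/fail rule are assumptions.

```python
# Sketch of the digital-twin validation step, assuming pybatfish and a locally
# running Batfish service; paths and network/snapshot names are placeholders.
from pybatfish.client.session import Session

def validate_candidate_config(snapshot_dir: str) -> dict:
    bf = Session(host="localhost")            # Batfish service endpoint (assumed)
    bf.set_network("change-validation")
    bf.init_snapshot(snapshot_dir, name="candidate", overwrite=True)

    # Parse/initialization issues are a cheap first gate before deeper tests.
    issues = bf.q.initIssues().answer().frame()
    passed = issues.empty
    return {"passed": passed, "issues": issues.to_dict(orient="records")}

# The orchestrator would attach a result like this back to the ServiceNow ticket.
result = validate_candidate_config("snapshots/chg-candidate")
print("PASS" if result["passed"] else "FAIL", result["issues"][:3])
```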

Technical Considerations

For engineering leaders, several constraints and trade‑offs emerge:

  • Data normalization is non‑negotiable. Without a unified schema like OpenConfig, reasoning across multi‑vendor, multi‑format data is brittle
  • Layered graph design reduces compute and token costs by limiting query depth to task needs
  • Fine‑tuning agents for high‑frequency queries improves latency and reduces API costs (see the training‑set sketch after this list)
  • Vendor choices matter. ArangoDB met MVP needs for performance and flexibility; alternative graph stores may suit other workloads
  • Integration paths should embrace incumbents—ITSM tools, config repos, telemetry sources—to avoid workflow disruption
  • Latency is shaped by both graph query performance and LLM context limits; schema‑aware agents help keep context windows small
  • Security and privacy demand careful handling of production network data, especially when interfacing with external LLM services
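
On the fine‑tuning point, a schema‑aware Query Agent is typically trained on pairs of schema context, natural‑language question, and target query. The sketch below shows what such training examples could look like in a chat‑style JSONL format; the schema, question, and query are hypothetical.

```python
# Illustrative training examples for a schema-aware Query Agent: pair a small
# schema slice and a natural-language question with the target graph query.
# Schema fields and queries are hypothetical; the format is OpenAI-style chat JSONL.
import json

examples = [
    {
        "messages": [
            {"role": "system",
             "content": "Schema: devices(hostname, role), links(interface, speed)"},
            {"role": "user", "content": "Which devices are spine switches?"},
            {"role": "assistant",
             "content": "FOR d IN devices FILTER d.role == 'spine' RETURN d.hostname"},
        ]
    },
]

# JSONL is the usual on-disk format for chat-style fine-tuning datasets.
with open("query_agent_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```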

Business Impact & Strategy

The approach delivers measurable operational benefits:

  • Reduced downtime from failed changes through pre‑deployment impact analysis
  • Faster approval cycles by automating assessments and attaching test results directly to tickets
  • Improved test planning and execution quality via coordinated agent roles
  • Better use of engineering time by removing repetitive, manual checks

Evaluation focuses on extrinsic metrics tied to customer outcomes—not model accuracy scores. This aligns incentives with the real cost vectors: downtime, approval delays, and manual rework.

Organizationally, success depends on cross‑functional buy‑in. Network engineers, ITSM operators, and AI/ML teams must collaborate around shared workflows and agreed data models. Risk mitigation includes starting with an MVP in a controlled environment, validating against production‑like conditions, and expanding scope as trust builds.

Key Insights

  • Target AI to specific workflow pain points, not the entire process
  • Normalize heterogeneous data into a schema that supports consistent reasoning
  • Layer graph data to match query depth to task needs
  • Fine‑tune agents for frequent, high‑cost operations
  • Integrate seamlessly with existing operational systems

Why It Matters

For technical leaders, the combination of agentic reasoning and a live, queryable network model represents a shift from reactive to predictive operations. It enables teams to test changes in a safe, simulated environment that mirrors production, reducing the blast radius of errors.

For business leaders, the model ties AI investment directly to operational resilience and customer outcomes. It avoids the trap of AI for AI’s sake, instead anchoring innovation in measurable gains: uptime, speed, and quality.

Conclusion

Cisco’s Outshift team shows that the real value of AI in network operations lies in precision: deploying agents where they matter most, feeding them a live, structured view of the network, and embedding them into existing workflows. The result is less downtime, faster approvals, and a stronger operational posture.

Watch the full talk here: https://www.youtube.com/watch?v=m0dxZ-NDKHo