Agentic AI-Enhanced Network Performance Monitoring and Diagnostic Analysis for Site Reliability Engineering
Abstract
Network performance monitoring and diagnostic analysis (NPMD) is becoming a core reliability discipline for modern distributed systems because cloud applications, hybrid connectivity, software-defined networking, and multi-region dependency chains can turn small network degradations into visible service incidents. The original paper explained NPMD through Site Reliability Engineering (SRE) concepts such as service level indicators (SLIs), service level objectives (SLOs), and non-functional requirements. This updated version expands the work with a data-driven operating model, stronger references, explicit table and figure captions, and an Agentic AI solution pattern for bounded autonomous diagnosis and remediation. The proposed approach combines telemetry pipelines, SLO evaluation, topology and change evidence, retrieval-augmented diagnostic reasoning, runbook-constrained tool execution, and human approval controls. The paper treats AI as an operational assistant rather than an uncontrolled replacement for SRE judgment: the agent can summarize evidence, correlate probable causes, recommend remediation, and execute only low-risk approved actions while preserving auditability. The result is a practical framework for reducing alert noise, improving time to detect, accelerating incident triage, and strengthening post-incident learning without relying on unsupported claims or unverifiable performance numbers.
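The bounded-execution idea in the abstract (runbook-constrained tool execution, human approval controls, and auditability) can be illustrated with a minimal sketch. It is not the paper's implementation. The action names, the two-level risk model, and the `ActionGate` class are hypothetical, and the sketch assumes that low-risk runbook actions are pre-approved while anything riskier waits for a human.

```python
from dataclasses import dataclass, field
from enum import Enum


class Risk(Enum):
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class RunbookAction:
    name: str
    risk: Risk


# Hypothetical runbook allowlist; action names are illustrative assumptions.
RUNBOOK = {
    "collect_traceroute": RunbookAction("collect_traceroute", Risk.LOW),
    "restart_bgp_session": RunbookAction("restart_bgp_session", Risk.HIGH),
}


@dataclass
class ActionGate:
    """Decides whether an agent-proposed action may run, and records why."""

    audit_log: list = field(default_factory=list)

    def decide(self, action_name: str, human_approved: bool = False) -> str:
        action = RUNBOOK.get(action_name)
        if action is None:
            # Anything outside the runbook is refused, even with approval.
            decision = "rejected"
        elif action.risk is Risk.LOW or human_approved:
            # Assumption: low-risk runbook actions count as pre-approved.
            decision = "execute"
        else:
            decision = "needs_approval"
        # Every decision is recorded so the outcome stays auditable.
        self.audit_log.append((action_name, human_approved, decision))
        return decision


gate = ActionGate()
print(gate.decide("collect_traceroute"))
print(gate.decide("restart_bgp_session"))
print(gate.decide("restart_bgp_session", human_approved=True))
```

In this sketch the agent never widens its own permissions: an action missing from the runbook is rejected regardless of approval, and each decision is appended to the audit log for post-incident review.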
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.