Aswanth Choyan

KFON · OPERATIONAL INTELLIGENCE · SYSTEM SIMULATION

Adjust inputs. Watch all four roles react — simultaneously.

Each slider changes a real variable in the FPS formula. The score updates live, and all four operator views respond to the same incident — showing exactly what each role sees and when they're triggered to act.

District

Fault type

Signal strengths — raw inputs to FPS

How strong is each incoming signal? Weak signals may not reach thresholds even with high weights.

T · Temporal overlap: signals within time window

G · Geographic proximity: spatial clustering

I · Infrastructure overlap: shared path/node

H · Historical similarity: matches past incidents

FPS weight calibration

Σ 100%

Redistribute weight between dimensions. The weights do not need to sum to 100%; the system auto-normalises them.

T · Temporal weight · 25%

G · Geographic weight · 30%

I · Infrastructure weight · 30%

H · Historical weight · 15%


Ernakulam · Fiber cut · 24 nodes
42,000 households in range
Dispatch triggers at FPS ≥ 56 · Current: 77 ✓ dispatched

FAILURE PROBABILITY SCORE: 77 / 100 · CRITICAL

Scale: 0 · 31 MONITOR · 56 WARNING · 76 CRITICAL · 100

25%×72 + 30%×86 + 30%×84 + 15%×53 = 76.95 ≈ 77

0–30 · Info

31–55 · Monitor

56–75 · Warning

76+ · Critical

T1 · Tier 1 Agent

▲ INC-2850 entered alert queue · FPS 77 · CRITICAL

→ 14,700 households at risk · Fiber cut pattern

◎ Signals: T 72 · G 86 · I 84 · H 53

↑ Escalation recommended immediately

SUP · Supervisor

▲ Escalation received · CRITICAL · FPS 77

◎ Blast radius: ~16,170 households · Ernakulam

→ Dispatch ERN-Team-2 · available · 12km from site

◷ SLA risk: avg resolution ~6h at this FPS · limit 4h

LEAD · NOC Lead

◉ Ernakulam · CRITICAL · FPS 77 · 16,170 HH affected

▦ Blast radius: 24 nodes monitored · 4 infra segments at risk

✓ SLA compliance: AT RISK · breach likely

↻ FPS model: T25% G30% I30% H15% (current weights)

FLD · Field Tech

▲ Dispatched to Ernakulam · Fiber cut

→ Site: ERN-C1 junction · Gate: K-4439

◎ Materials: splice kit · OTDR · cable markers

! Site note: J3 splice flagged as repeat failure point; check first

! CRITICAL: all roles on alert. Field team dispatched. SLA at risk.

16,170 households potentially affected · avg historical resolution at this FPS: 6h 28min

KFON Operational Intelligence · Speculative Design · Aswanth Choyan

FPS = (T×25%) + (G×30%) + (I×30%) + (H×15%)
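The formula and the auto-normalisation note can be sketched in a few lines of Python. This is a minimal illustration, not the simulator's actual code: the function names are hypothetical, the band boundaries are taken from the legend above, and rounding to the nearest whole number is an assumption based on the worked example (76.95 shown as 77).

```python
from typing import Dict

# Lower bounds of each band, copied from the score legend (highest first).
BANDS = [(76, "CRITICAL"), (56, "WARNING"), (31, "MONITOR"), (0, "INFO")]
DISPATCH_THRESHOLD = 56  # "Dispatch triggers at FPS >= 56"


def failure_probability_score(signals: Dict[str, float], weights: Dict[str, float]) -> int:
    """Weighted mean of the four signal strengths, rounded to a score out of 100."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    weighted_sum = sum(signals[key] * weights[key] for key in weights)
    # Dividing by the total weight is the auto-normalisation step,
    # so the weights do not have to sum to 100.
    return round(weighted_sum / total_weight)


def band(score: int) -> str:
    """Map a score to its band name using the legend's lower bounds."""
    for lower_bound, name in BANDS:
        if score >= lower_bound:
            return name
    return "INFO"


signals = {"T": 72, "G": 86, "I": 84, "H": 53}
weights = {"T": 25, "G": 30, "I": 30, "H": 15}
score = failure_probability_score(signals, weights)
print(score, band(score), score >= DISPATCH_THRESHOLD)  # 77 CRITICAL True
```

Because the score is a weighted mean, doubling every weight leaves the result unchanged; only the ratio between the weights matters.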

PROCESS & LOGIC

OPERATIONAL WORKFLOW

The system moves issues through structured operational stages, ensuring clear ownership, escalation logic, and resolution tracking from detection to closure.

Detection / Issue Intake

Issues detected proactively or reported manually

Ticket Created

Detect unusual network behavior

Initial Triage

Determine severity, impact, and routing path

Specialist Review

Investigate complex technical incidents

Escalation

Assign to responsible roles

Supervisor Review

Handle escalations and decision routing

Direct Field Dispatch

Send technician for physical network issues

On-site Investigation

Agent Review

Attempt quick resolution for known issues

Issue Found

Action/Fix

Agent

Specialist

Technician

Resolution

Issue addressed and validated

Validation

Ticket Closure

Feedback

Capture resolution insights

Knowledge Base

System Learning

Improve future detection and routing

Diagnosis

Decision & Routing

Issue Fixed

ROLE BASED INTERFACES

The system provides role-specific interfaces designed around operational responsibilities, ensuring clarity, efficiency, and accountability across the incident lifecycle.

Agent

Specialist

Supervisor

Technician

Incident Resolution

Ticket / Incident System

Agent Dashboard

Ticket List

Ticket Detail

Escalation

Diagnosis Tools

Investigation View

Root Cause Analysis

Resolution Actions

Field Assignment

Site Investigation

Repair Actions

Report Upload

System Learning

Operations Dashboard

Escalation Control

SLA Monitoring

Decision Routing

ROUTING LOGIC

Early support workflows relied on linear escalation paths, where issues moved sequentially across roles regardless of complexity. This created delays and unnecessary handoffs.

To address this, the system evolved to support flexible, role-based routing, allowing issues to be resolved at the appropriate operational level.

Linear Escalation

Issues followed a fixed escalation path, leading to delays and unnecessary dependencies.

Agent

↓

Supervisor

↓

Specialist

↓

Technician

↓

Resolution

Flexible Routing

The system introduced non-linear routing, allowing issues to be resolved at different operational levels based on complexity.

Issue

↓

Agent | Specialist | Technician

↓

Resolution
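A rough sketch of what flexible routing could look like in code. The two flags and the decision order are assumptions made for illustration; they loosely follow the workflow descriptions (technicians for physical network issues, agents for known issues, specialists for complex incidents) and are not the system's documented rules.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    physical_fault: bool  # hypothetical flag: needs someone on site (e.g. a fiber cut)
    known_issue: bool     # hypothetical flag: matches a documented quick fix


def route_issue(issue: Issue) -> str:
    """Pick the first role to handle an issue instead of walking a fixed chain."""
    if issue.physical_fault:
        return "Technician"
    if issue.known_issue:
        return "Agent"
    return "Specialist"


print(route_issue(Issue(physical_fault=True, known_issue=False)))  # Technician
```

In the linear model every issue would start at the Agent; here the entry point depends on the issue itself.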


TELEMETRY PROTOCOLS

SNMP (Simple Network Management Protocol)

Devices expose MIBs (Management Information Bases). The monitoring system polls at configurable intervals (default: 30 seconds for edge routers, 60 seconds for aggregation switches).

Key MIB objects:

ifInErrors / ifOutErrors: interface error counters

ifInDiscards / ifOutDiscards: packet discard rates (early congestion signal)

sysUpTime: device uptime (reset = reboot event = flag)

ifOperStatus: link operational state

cpmCPUTotal5sec (Cisco CISCO-PROCESS-MIB): CPU utilization over the last 5 seconds (sustained high CPU can cause downstream drops)

SNMP traps are ingested asynchronously alongside polled data: polling supports trend analysis, while traps support event detection.
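Polled counters such as ifInErrors only become useful once consecutive samples are turned into rates. The sketch below shows one way to do that, assuming Counter32 values and the 30-second edge-router interval; the function names are hypothetical and the SNMP fetch itself is left out.

```python
COUNTER32_MODULUS = 2 ** 32  # IF-MIB error and discard counters are Counter32


def counter_delta(previous: int, current: int) -> int:
    """Increase between two polls, allowing for a single Counter32 wrap-around."""
    if current >= previous:
        return current - previous
    return current + COUNTER32_MODULUS - previous


def rebooted(previous_uptime: int, current_uptime: int) -> bool:
    """sysUpTime counts up in hundredths of a second; a decrease suggests a restart.

    Note: sysUpTime itself wraps after roughly 497 days, which this check
    would also report as a restart.
    """
    return current_uptime < previous_uptime


def errors_per_second(prev_errors, curr_errors, prev_uptime, curr_uptime, poll_interval_s=30):
    """Return an error rate, or None when the device restarted and counters reset."""
    if rebooted(prev_uptime, curr_uptime):
        return None
    return counter_delta(prev_errors, curr_errors) / poll_interval_s


print(errors_per_second(100, 160, 1000, 4000))  # 2.0
```

Returning None on a restart keeps a counter reset from being misread as a wrap-around and producing a huge false error rate.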

Syslog — RFC 5424

Continuous real-time event stream from all network devices.

format: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID STRUCTURED-DATA MSG

Severity 0–4 (Emergency → Warning) ingested into correlation pipeline.

Severity 5–7 (Notice → Debug) stored for audit, not processed for alerting.
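The severity split can be illustrated by decoding the PRI field, which RFC 5424 defines as facility × 8 + severity. This is a minimal sketch: the destination names "correlation" and "audit" and the sample message are invented for illustration.

```python
import re

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")
ALERTING_MAX_SEVERITY = 4  # severities 0-4 are alerted on, 5-7 are audit-only


def decode_pri(message: str):
    """Split the PRI value into (facility, severity): PRI = facility * 8 + severity."""
    match = PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        raise ValueError("PRI out of range")
    return divmod(pri, 8)


def route(message: str) -> str:
    """Return a (hypothetical) destination name for a raw syslog line."""
    _, severity = decode_pri(message)
    return "correlation" if severity <= ALERTING_MAX_SEVERITY else "audit"


# Made-up sample line: PRI 187 = facility 23 (local7), severity 3 (Error).
sample = "<187>1 2024-01-01T00:00:00Z edge-router-1 linkd - LINKDOWN - Interface down"
print(decode_pri(sample), route(sample))  # (23, 3) correlation
```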

NetFlow / IPFIX / sFlow

Flow-level traffic data enabling:

Per-link baseline traffic volume (sudden drops = upstream failure signal)

Traffic matrix analysis for routing change impact assessment

Top-talker identification during congestion events

sFlow is used on high-throughput optical segments: its packet sampling gives lower precision but also lower performance overhead.
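The "sudden drop against baseline" signal could be approximated as below. This is a sketch under assumptions: the baseline is a plain mean of recent samples and the 50% threshold is an arbitrary illustrative value, not a figure from the system.

```python
from statistics import mean


def is_sudden_drop(history_bps, current_bps, drop_fraction=0.5):
    """True when current volume falls below drop_fraction of the recent mean."""
    if not history_bps:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(history_bps)
    return baseline > 0 and current_bps < baseline * drop_fraction


recent = [100, 110, 90]  # made-up per-link volumes, mean 100
print(is_sudden_drop(recent, 40), is_sudden_drop(recent, 80))  # True False
```

A production baseline would likely account for time-of-day patterns; a flat mean is used here only to keep the idea visible.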

BGP (Border Gateway Protocol) Monitoring

Captured via OpenBMP or ExaBGP route collectors. BGP UPDATE messages feed the correlation engine.

Events monitored:

Route flapping: BGP keepalive (hold timer) timeout / interface instability

Route withdrawal: destination unreachable (user-visible disruption within 30–60 seconds)

AS path changes: upstream provider routing issue

Session resets: loss of peering session

BGP events carry the highest individual confidence weight (25 points) due to their direct correlation with user-visible failures.
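As an illustration of how route flapping might be recognised from a stream of UPDATE-derived events, the sketch below counts announce/withdraw transitions per prefix in a sliding window. The class name, window length, and transition limit are all assumptions; the collector integration (OpenBMP or ExaBGP) is not shown.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Counts announce/withdraw transitions per prefix inside a sliding time window."""

    def __init__(self, window_s=60.0, max_transitions=4):
        self.window_s = window_s
        self.max_transitions = max_transitions
        self._last_state = {}
        self._transitions = defaultdict(deque)

    def observe(self, prefix, state, timestamp):
        """Record an 'announce' or 'withdraw' event; return True if the prefix is flapping."""
        times = self._transitions[prefix]
        # Only a change of state counts; the first event for a prefix does not.
        if self._last_state.get(prefix) not in (None, state):
            times.append(timestamp)
        self._last_state[prefix] = state
        # Drop transitions that have aged out of the window.
        while times and timestamp - times[0] > self.window_s:
            times.popleft()
        return len(times) >= self.max_transitions


detector = FlapDetector(window_s=60, max_transitions=3)
events = [("announce", 0), ("withdraw", 10), ("announce", 20), ("withdraw", 30)]
results = [detector.observe("192.0.2.0/24", state, t) for state, t in events]
print(results)  # [False, False, False, True]
```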

NORMALIZED EVENT SCHEMA


Research and Understanding

Method 01

Desk Research

Network Operations Centre (NOC) Design Standards

NOC environments are documented in operations literature as high-stress, cognitively demanding workplaces. Design considerations unique to NOC contexts include high ambient screen time (operators may monitor 8–16 displays simultaneously), time-pressured decision making, shift-based rotations (which introduce handover risk), and the critical importance of information hierarchy: displaying the most urgent information first, not the most information first.

SLA Terminology and Regulatory Obligations

Signal Correlation Theory

Alert Fatigue Research

ITIL v4 Incident Management Framework

Method 02

ISP Operational Research

Network Event Classification

Fiber Optic-Specific Failure Modes

Traffic Pattern Baselines

Last-Mile vs. Backbone Failures

Method 03

NOC Workflow Analysis

The Handover Problem

Cognitive Load and Information Hierarchy

Escalation Decision Latency

Field Technician Information Needs

Method 04

Systems Mapping

The Signal Propagation Map

The Role Dependency Map

The Blast Radius Model

The Feedback Loop Audit

Method 05

Market Research

PagerDuty

Event Intelligence (AIOps Layer)

Urgency and Severity Scoring

On-Call Schedule Management

Post-Incident Reviews


Datadog

Metrics

Logs

Traces

Map-Based Infrastructure Visualisation

Anomaly Detection Widgets


Google Site Reliability Engineering (SRE)

The Four Golden Signals

Service Level Objectives (SLOs)

Error Budgets

Toil Reduction

Blameless Postmortems


Atlassian Incident Management

Incident Classification and Routing

SLA Tracking and Breach Prevention

Opsgenie's Alert Routing Logic

Statuspage for Stakeholder Communication