Detecting Internet Issues Before They Affect People
Redesigning how Kerala's public internet infrastructure detects and responds to failures, before users notice them.
KFON
Kerala Fiber Optic Network
DURATION
5 Weeks
ROLE
Solo - End to End
DOMAIN
Public Infrastructure / ISP
TYPE
Service Design / Enterprise UX
Status
Speculative Design
At a Glance
CHALLENGE
KFON's Network Operations Centre (NOC) teams operated reactively, identifying infrastructure failures only after users experienced service disruption.
APPROACH
Designed a proactive incident intelligence system combining signal correlation, Failure Probability Score (FPS) scoring, and role-based operational dashboards to support faster, more coordinated incident response.
OUTCOME
A speculative operational intelligence system that transforms fragmented monitoring signals into proactive, coordinated infrastructure response workflows.

Infrastructure Context
THE SCALE
20 Lakh
Households
Families for whom KFON is the only affordable path to the internet
375+
Network Nodes
Each one a point of dependency. Each one a potential point of failure.
14
Districts
From Kasaragod in the north to Thiruvananthapuram in the south
30,000+ Institutions
Every Govt. hospital, school, panchayat office, and police station in Kerala
NETWORK ARCHITECTURE OVERVIEW
Core
State-level backbone connecting district headquarters and major exchange points
Distribution
District-level nodes routing traffic to block and panchayat level infrastructure
Access
Last-mile fibre and associated CPE connecting households and institutions
Problem Space
THE PROBLEM
KFON, like most public infrastructure networks in India, operates a support system that is fundamentally built around one trigger:
"The Complaint Call"
The Failure Timeline (Current State)

3–6 hrs
Total Time From Failure To Field Action
6–14 hrs
Total Time From Failure To Resolution
WHY DO REACTIVE OPERATIONS PERSIST?
Signal Volume Without Correlation
Tool Silos
No Predictive or Probabilistic Scoring
Escalation Path Rigidity
PROBLEM STATEMENT
Network incidents are detected late and handled reactively, due to fragmented visibility and unstructured escalation across teams.
HOW MIGHT WE?

Reduce operational response latency by removing manual correlation overhead from NOC operators
Surface correlated weak signals before outages escalate, enabling intervention before user impact begins


Replace complaint-driven escalation with system-initiated, evidence-grounded incident intelligence
Create role-based visibility that delivers the right info to the right person at the right time


Design for high-stress operational environments where cognitive clarity is a product requirement
Research and Understanding
WHAT THE EVIDENCE SAYS
“Major outages are often preceded by weak signals—latency spikes, partial failures, or localized disruptions—that go unnoticed.”
(SRE Workbook)
“Collecting metrics is not the same as understanding system health; lack of correlation leads to delayed detection.”
— Datadog
(Monitoring & Observability Guide)
“Lack of end-to-end visibility across network layers significantly delays root cause identification during outages.”
— IBM
(IT Operations Analytics / AIOps)
“Escalation paths are often reactive and depend on manual intervention, leading to increased mean time to resolution.”
— Atlassian
(Incident Management Handbook)
“Incident response is frequently slowed by siloed teams and unclear ownership across operational domains.”
— PagerDuty
(Incident Response Report)
“In several major outages, initial anomalies were observed minutes before escalation, but were not flagged due to fragmented monitoring systems.”
— Google SRE Book
(Monitoring Distributed Systems)
PHASE 1
DOMAIN IMMERSION
01 - Desk Research
Domain fluency
Revealed
ITIL Incident Lifecycle
Alert Fatigue Research
TRAI SLA Obligations
02 - ISP Operational Research
Fiber Failure Modes
Revealed
Physical Cut Patterns
Alert Fatigue Research
TRAI SLA Obligations
03 - NOC Workflow Analysis
Domain fluency
Revealed
Handover Risk
Escalation Latency of 12-18 min
Field Dispatch Info Gaps
04 - Systems Mapping
Fiber Failure Modes
Revealed
Signal Fragmentation between monitoring and Ops Stack
Invisible SLA Gap
PHASE 2
INDUSTRY BENCHMARKING

PagerDuty
Adapted
Event Grouping Model +
FPS Tier Logic +
Escalation Policy Structure

Datadog
Adapted
3-pillar Observability (metrics, logs and traces) +
Dashboard Hierarchy

Google SRE
Adapted
4 Golden Signals +
Error Budget as Reliability Indicator +
Toil Reduction

Atlassian JSM
Adapted
Persistent SLA Clock Patterns +
Opsgenie Routing Rule Logic
PHASE 3
SYNTHESIS
CRITICAL INSIGHT
"The issue is not lack of data. It is lack of correlation."
Revealed
KFON's monitoring stack already generates sufficient signal volume to detect failures proactively. What it lacks is the intelligence layer that connects those signals, weighing temporal overlap, geographic proximity, infrastructure topology, and historical similarity, and turns them into actionable incidents. That gap became the entire design brief.
Role-based Dashboard
Failure Probability Scoring
Signal Correlation Engine
Dynamic Escalation Routing
Persistent SLA Visibility
PHASE 4
Comparative Analysis
Capability | PagerDuty | Datadog | Atlassian JSM | KFON | Proposed
Proactive signal detection | Partial (via integrations) | Yes (monitoring-native) | No | No | Native Correlation Engine
Multi-signal correlation | Yes (ML-based grouping) | Yes (metrics + logs + traces) | No | No | Temporal + Geo + Infra + Historical
Geographic proximity scoring | No | Limited (region tags) | No | No | District-level Blast Radius Model
Infrastructure topology awareness | No | Partial (SNMP maps) | No | No | Tiered Node Hierarchy
Role-differentiated dashboards | Partial (team views) | Partial (custom dashboards) | Partial (agent/supervisor views) | No | 5 distinct role views
Persistent SLA visibility | No (per-incident only) | No | Yes (per-ticket) | No | Globally Persistent
Dynamic escalation routing | Yes (on-call policies) | No | Partial (routing rules) | No | FPS + Infra Type + Availability
Field technician mobile view | Yes (mobile app) | Limited | Yes (mobile app) | No | Designed as Future Scope
Complaint-driven fallback | N/A | N/A | Yes (ITSM ticketing) | No | Ingest Pathway Maintained
Designed for public infra/ISP | No | No | No | No | Purpose-built
Failure Probability Scoring | No (binary alert model) | Partial (anomaly scoring) | No | No | 0–100 Composite FPS
Environmental signal integration | No | Limited | No | No | Future Scope (weather, power grid)
THE EXISTING KFON OPERATIONAL STACK
Monitoring Layer

SNMP Traps
Automated alerts triggered when a device crosses a threshold (CPU, temperature, link status)

Syslog Collection
Continuous log stream from routers, switches, and servers that records every state change

Network Monitoring System
Tracks live status of nodes, links, and devices across the network
Operational Layer

Trouble Ticketing System
Logs complaints, creates tickets, tracks escalation status

SLA Monitoring Dashboard
Tracks service availability against committed uptime targets

Historical Incident Archive
Past incidents with metadata such as region, time, fault type, resolution time
RESEARCH INSIGHT
The issue is not lack of data. The issue is lack of correlation between existing operational signals.
KFON's existing monitoring stack generates sufficient signal volume to detect most network failures proactively. The infrastructure for proactive operations already exists. What is missing is the intelligence layer that connects those signals, weights their significance, and surfaces actionable incidents to the right operational role at the right time.
WHO WE'RE DESIGNING FOR
Design Process

THE DESIGN BRIEF
Faster, Smarter Incident Response
"A system that helps detect issues early, guide decisions, and resolve them faster."

Structure
Transform call-based requests into structured, traceable support tickets.
Visibility
Enable real-time insight into issue status, ownership, and operational workload.
Accountability
Define clear escalation paths and SLA expectations across support roles.
System Logic
FROM WAITING FOR COMPLAINTS TO TACKLING THEM EARLY
The length of each bar represents the time taken. These durations are not fully measured: the reactive timeline is based on available TRAI data, and the proactive timeline is speculative.
WHY IS THE SHIFT SO SIGNIFICANT?
Incidents detected proactively are resolved 60–70% faster than incidents detected through user complaints, primarily because investigation can begin before the problem worsens.
— Average ISP network fault resolution time in India is 6.2 hours
TRAI Annual Report 2022–23
CORRELATION ENGINE
The Four Dimensions
Multiple small issues are compared together to understand whether they are part of a larger network failure.
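As a rough illustration of how small issues might be compared together, the sketch below scores three of the four dimensions for a cluster of signals on a 0–100 scale. Everything here is an assumption made for the example: the `Signal` fields, the 10-minute window, the "share of signals that agree" scoring rule, and the node IDs are hypothetical, and historical similarity is left out because it would need the incident archive.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    """One normalised monitoring event (all field names are hypothetical)."""
    source: str            # e.g. "snmp_trap" or "syslog"
    district: str          # district the reporting node sits in
    upstream_node: str     # distribution node the device hangs off
    observed_at: datetime


def temporal_overlap(signals, window=timedelta(minutes=10)):
    """Percentage of signals seen within `window` of the earliest signal."""
    start = min(s.observed_at for s in signals)
    inside = sum(1 for s in signals if s.observed_at - start <= window)
    return 100.0 * inside / len(signals)


def _dominant_share(values):
    """Percentage of items that carry the single most common value."""
    values = list(values)
    return 100.0 * Counter(values).most_common(1)[0][1] / len(values)


def geographic_proximity(signals):
    """Percentage of signals that come from the most affected district."""
    return _dominant_share(s.district for s in signals)


def infrastructure_overlap(signals):
    """Percentage of signals that share the most common upstream node."""
    return _dominant_share(s.upstream_node for s in signals)


# Made-up cluster: three signals close together, one straggler elsewhere.
t0 = datetime(2024, 6, 1, 9, 0)
cluster = [
    Signal("snmp_trap", "Idukki", "DIST-07", t0),
    Signal("syslog", "Idukki", "DIST-07", t0 + timedelta(minutes=3)),
    Signal("snmp_trap", "Idukki", "DIST-07", t0 + timedelta(minutes=8)),
    Signal("syslog", "Kottayam", "DIST-07", t0 + timedelta(minutes=25)),
]

print(temporal_overlap(cluster))        # 75.0
print(geographic_proximity(cluster))    # 75.0
print(infrastructure_overlap(cluster))  # 100.0
```

A production correlation engine would likely use real topology distance and geographic distance rather than simple majority shares; the point of the sketch is only that each dimension reduces to a comparable number.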
FAILURE PROBABILITY SCORE
Failure Probability Score (FPS)

Geographic Proximity: 30%
Infrastructure Overlap: 30%
Temporal Overlap: 25%
Historical Incident Similarity: 15%
Geographic Proximity and Infrastructure Overlap carry the highest weights (30% each) because the predominant failure patterns are physical. Temporal Overlap (25%) reflects that signal clustering within a short window is a strong but non-decisive indicator, since coincidental spikes also occur. Historical Incident Similarity (15%) is weighted lowest because KFON's incident archive is likely sparse and may not yet be reliable enough to carry heavy predictive weight at launch; this weight is designed to increase over time as the model is calibrated.
FPS = (0.25 × T) + (0.30 × G) + (0.30 × I) + (0.15 × H)
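The formula can be read as a plain weighted sum. The sketch below assumes each sub-score (T, G, I, H) is already expressed on a 0–100 scale, which is what makes the composite land in the 0–100 range; the function name and the sample inputs are illustrative only.

```python
# Weights taken from the FPS formula; they sum to 1.0.
WEIGHTS = {"T": 0.25, "G": 0.30, "I": 0.30, "H": 0.15}


def failure_probability_score(t, g, i, h):
    """Combine four 0-100 sub-scores into a 0-100 composite FPS."""
    scores = {"T": t, "G": g, "I": i, "H": h}
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    total = sum(WEIGHTS[name] * value for name, value in scores.items())
    return round(total, 1)


# Worked example with made-up sub-scores:
# 0.25*80 + 0.30*90 + 0.30*70 + 0.15*40 = 20 + 27 + 21 + 6 = 74
print(failure_probability_score(t=80, g=90, i=70, h=40))  # 74.0
```

Because the weights sum to 1, a cluster that scores 100 on every dimension yields an FPS of exactly 100, and the low historical weight means a missing or weak archive match costs at most 15 points.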
Escalation Thresholds
FPS Range | Classification | System Behaviour | Operational Interpretation
0–30 | Informational | Logged silently to audit trail; no active notification | Signal cluster does not meet incident threshold. May indicate minor device events, routine maintenance triggers, or baseline noise.
31–55 | Monitoring | Surfaced on Tier 1 Agent dashboard as a tracked signal cluster; no escalation | Emerging pattern worth monitoring. Tier 1 can tag as "monitoring" and watch for FPS progression. Does not consume SLA clock.
56–75 | Warning | Incident created; Supervisor notified; SLA clock starts; Tier 1 assigned | Pattern indicates likely service degradation. Requires active investigation. Field dispatch may be warranted depending on investigation findings.
76–100 | Critical | Incident escalated directly to NOC Lead; all relevant roles notified; priority field dispatch triggered | High-confidence infrastructure failure with significant blast radius likely. All-hands operational response. SLA breach risk high.
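The thresholds amount to a lookup from score to tier. The sketch below is one way to express that, under two assumptions that the table does not settle: each band's upper bound is inclusive (so a score of exactly 75 is treated as Warning), and fractional scores just above a bound fall into the next tier. The names `TIERS` and `classify` are hypothetical.

```python
from bisect import bisect_left

# (inclusive upper bound, classification, system behaviour)
TIERS = [
    (30, "Informational", "log to audit trail"),
    (55, "Monitoring", "show on Tier 1 dashboard"),
    (75, "Warning", "create incident, notify Supervisor, start SLA clock"),
    (100, "Critical", "escalate to NOC Lead, trigger priority dispatch"),
]
UPPER_BOUNDS = [upper for upper, _, _ in TIERS]


def classify(fps):
    """Return (classification, system behaviour) for a 0-100 FPS."""
    if not 0 <= fps <= 100:
        raise ValueError(f"FPS must be between 0 and 100, got {fps}")
    # bisect_left finds the first band whose upper bound is >= fps.
    _, label, behaviour = TIERS[bisect_left(UPPER_BOUNDS, fps)]
    return label, behaviour


print(classify(74.0)[0])  # Warning
print(classify(82)[0])    # Critical
```

Keeping the bands in a single table-like structure means the cut-offs can be recalibrated without touching the routing logic.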
This is not an AI making autonomous decisions. It is a system that helps experienced operators see connections they would see themselves if they had time to look at everything simultaneously.
SYSTEM ARCHITECTURE

WHO USES THIS SYSTEM
Absence of Proactive Detection
The system has no mechanism to act on early signals. Monitoring tools generate data but nothing connects that data to intent. By the time an alert becomes a ticket, the window for early intervention has already closed.
[Figure: role access matrix. Rows: Tier 1 Agent, Supervisor, Field Technician, NOC Lead, Admin. Columns: Live Alerts, Escalate, Dispatch Field, Regional View, Configure. Legend: Full, View Only, None.]
Fragmented Operational Visibility
Each team (support agents, supervisors, field technicians) operates with a partial view. There is no shared source of truth. Decisions get made on incomplete information. Handoffs happen over phone calls and Slack messages, not structured workflows.
Tier 1 Agent
Supervisor
Field Technician
NOC Lead
Operations Admin
Interface Design
CORE DESIGN PRINCIPLES
Operational Clarity
Progressive Disclosure
Reduced Cognitive Load
Priority-Based Colour Semantics
Dark Operational Environment
High Scanability Under Stress
TYPOGRAPHY
Inter
Inter - Bold
Critical Incident IDs
Inter
Inter - Semibold
Section Headers
Inter
Inter - Medium
Operational Metrics
Inter
Inter - Regular
Supporting Context
Inter
Inter - Neutral
Passive Telemetry
OPERATIONAL COLOR LOGIC
#FF5A5A
Critical
Confirmed outage, urgent escalation
#F5A623
Warning
SLA risk, degradation, investigation state
#22C55E
Stable
Recovery, success verification
#3B82F6
Informational
Infrastructure context, neutral visibility
#0B1220
Ambient Surface
Passive monitoring regions
#111827
Elevated Surface
Active investigation containers
ALERT STATES - STATUS CHIPS

Passive
Muted visibility for stable monitoring and non-actionable telemetry.

Active
Elevated visibility for investigation, degradation, and operational attention.

Critical
Persistent visibility for confirmed outages and urgent escalation.
OPERATIONAL WALKTHROUGH
The following walkthrough simulates how a potential infrastructure failure progresses through the proposed operational system.
Type Hierarchy

KEY DESIGN DECISIONS
Role-Based Views Over Role-Based Permissions
AI as Suggestion, Not Automation
Persistent SLA Visibility
Flexible Routing Over Linear Escalation
EXPECTED IMPACT
Metric | Current State | Projected State
Mean Time To Detect (MTTD) | 35–240 minutes | 5–15 minutes
Mean Time To Respond (MTTR) | 2–6 hours | 1–3 hours
Proactive Detection Rate | ~8% of incidents detected before user complaint | 65–75% of incidents detected before first user complaint within 12 months of full deployment
SLA Breach Rate | Unknown (true SLA performance masked by complaint-initiated clock) | 30–40% reduction in actual SLA breaches through earlier detection and escalation
NOC Operator Alert Volume | High volume, low signal-to-noise ratio | 90%+ reduction in alert volume surfaced to operators through FPS filtering
Field Dispatch Efficiency | Technicians dispatched with incomplete information; high on-site diagnostic time | 25–35% reduction in average on-site resolution time through pre-staged investigation briefs
Projections are based on PagerDuty's published deployment case studies showing 55–70% proactive detection improvement in enterprise NOC environments within 6–18 months of AIOps layer deployment, adjusted upward for KFON's smaller, more geographically concentrated network topology.

