Detecting Internet Issues Before They Affect People

Redesigning how Kerala's public internet infrastructure detects and responds to failures, before users notice them.

KFON

Kerala Fiber Optic Network

DURATION

5 Weeks

ROLE

Solo - End to End

DOMAIN

Public Infrastructure / ISP

TYPE

Service Design / Enterprise UX

Status

Speculative Design

At a Glance

CHALLENGE

KFON's Network Operations Centre (NOC) teams operated reactively, identifying infrastructure failures only after users experienced service disruption.

APPROACH

Designed a proactive incident intelligence system combining signal correlation, Failure Probability Score (FPS) scoring, and role-based operational dashboards to support faster, more coordinated incident response.

OUTCOME

A speculative operational intelligence system that transforms fragmented monitoring signals into proactive, coordinated infrastructure response workflows.

Infrastructure Context

THE SCALE

20 Lakh
Households

Families for whom KFON is the only affordable path to the internet

375+
Network Nodes

Each one a point of dependency. Each one a potential point of failure.

14
Districts

From Kasaragod in the north to Thiruvananthapuram in the south

30,000+ Institutions

Every Govt. hospital, school, panchayat office, and police station in Kerala

NETWORK ARCHITECTURE OVERVIEW

Core

State-level backbone connecting district headquarters and major exchange points

Distribution

District-level nodes routing traffic to block and panchayat level infrastructure

Access

Last-mile fibre and associated CPE connecting households and institutions

Problem Space

THE PROBLEM

KFON, like most public infrastructure networks in India, operates a support system that is fundamentally built around one trigger:

"The Complaint Call"

The Failure Timeline (Current State)

3–6 hrs

Total Time From Failure To Field Action

6–14 hrs

Total Time From Failure To Resolution

WHY DO REACTIVE OPERATIONS PERSIST?

Signal Volume Without Correlation

A single KFON network node may generate hundreds of SNMP traps and syslog entries per day under normal operating conditions. Without a correlation layer, NOC operators face an impossible signal-to-noise problem. Research from IT operations management firm EMA found that the average enterprise NOC receives over 2,000 alerts per day per operator, of which over 99% require no immediate action. Operators develop "alert fatigue" — a well-documented cognitive phenomenon in which high alert volumes cause critical signals to be missed or dismissed.

Tool Silos

KFON's monitoring stack spans multiple independent systems: SNMP trap receivers, syslog aggregators, and NMS dashboards. These tools do not share data or context with each other, nor with the ticketing system. An operator investigating a complaint must manually correlate findings across three or four separate interfaces — a cognitively expensive process that slows investigation and increases the risk of missed connections.

No Predictive or Probabilistic Scoring

Current alerting is binary: a threshold is crossed, an alert fires. There is no mechanism for evaluating whether a pattern of sub-threshold signals — minor latency increases across several nodes in a geographic cluster — represents an emerging failure. The predictive layer that converts weak signals into early warnings does not exist.
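
To make the gap concrete, the sketch below contrasts a threshold-only check with a simple cluster-level check in Python. All thresholds, node names, and latency values are hypothetical and chosen only for illustration; the case study does not specify KFON's actual alerting rules.

```python
# Hypothetical values, not taken from KFON's real configuration.
LATENCY_ALERT_MS = 100.0   # per-node hard threshold (binary alert model)
BASELINE_MS = 40.0         # assumed normal latency for this cluster
DRIFT_RATIO = 1.25         # 25% above baseline counts as a weak signal
MIN_DRIFTING_NODES = 3     # nodes drifting together needed for an early warning


def binary_alerts(latencies):
    """Nodes that would fire under a threshold-only model."""
    return [node for node, ms in latencies.items() if ms >= LATENCY_ALERT_MS]


def cluster_warning(latencies):
    """True when enough nodes in one geographic cluster drift above baseline together."""
    drifting = [node for node, ms in latencies.items()
                if ms >= BASELINE_MS * DRIFT_RATIO]
    return len(drifting) >= MIN_DRIFTING_NODES


# One geographic cluster: no node is individually alarming.
cluster = {"node-a": 55.0, "node-b": 58.0, "node-c": 61.0, "node-d": 41.0}

print(binary_alerts(cluster))    # [] : nothing crosses the hard threshold
print(cluster_warning(cluster))  # True : three nodes are drifting together
```

In this toy case the binary model stays silent while the cluster-level check raises an early warning, which is the kind of sub-threshold pattern the paragraph above describes.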

Escalation Path Rigidity

Escalation follows a linear path: Tier 1 → Supervisor → NOC Lead. This structure is appropriate for complaint-driven workflows but is poorly suited to a proactive model, where multiple incident types may need to route simultaneously and differently depending on infrastructure type, geographic scope, and severity. Linear escalation creates bottlenecks and delay.

PROBLEM STATEMENT

Network incidents are detected late and handled reactively, due to fragmented visibility and unstructured escalation across teams.

HOW MIGHT WE?

Reduce operational response latency by removing manual correlation overhead from NOC operators

Surface correlated weak signals before outages escalate, enabling intervention before user impact begins

Replace complaint-driven escalation with system-initiated, evidence-grounded incident intelligence

Create role-based visibility that delivers the right info to the right person at the right time

Design for high-stress operational environments where cognitive clarity is a product requirement

Research and Understanding

WHAT THE EVIDENCE SAYS

“Major outages are often preceded by weak signals—latency spikes, partial failures, or localized disruptions—that go unnoticed.”

— Google

(SRE Workbook)

“Collecting metrics is not the same as understanding system health; lack of correlation leads to delayed detection.”

— Datadog

(Monitoring & Observability Guide)

“Lack of end-to-end visibility across network layers significantly delays root cause identification during outages.”

— IBM

(IT Operations Analytics / AIOps)

“Escalation paths are often reactive and depend on manual intervention, leading to increased mean time to resolution.”

— Atlassian

(Incident Management Handbook)

“Incident response is frequently slowed by siloed teams and unclear ownership across operational domains.”

— PagerDuty

(Incident Response Report)

“In several major outages, initial anomalies were observed minutes before escalation, but were not flagged due to fragmented monitoring systems.”

— Google SRE Book

(Monitoring Distributed Systems)

PHASE 1

DOMAIN IMMERSION

01 - Desk Research

Domain fluency

Revealed

ITIL Incident Lifecycle
Alert Fatigue Research
TRAI SLA Obligations

02 - ISP Operational Research

Fiber Failure Modes

Revealed

Physical Cut Patterns
Alert Fatigue Research
TRAI SLA Obligations

03 - NOC Workflow Analysis

Domain fluency

Revealed

Handover Risk
Escalation Latency of 12-18 min
Field Dispatch Info Gaps

04 - Systems Mapping

Fiber Failure Modes

Revealed

Signal Fragmentation between monitoring and Ops Stack
Invisible SLA Gap

PHASE 2

INDUSTRY BENCHMARKING

PagerDuty

Adapted

Event Grouping Model +

FPS Tier Logic +

Escalation Policy Structure

Datadog

Adapted

3-pillar Observability (metrics, logs and traces) +

Dashboard Hierarchy

Google SRE

Adapted

4 Golden Signals +

Error Budget as Reliability Indicator +

Toil Reduction

Atlassian JSM

Adapted

Persistent SLA Clock Patterns +

Opsgenie Routing Rule Logic

PHASE 3

SYNTHESIS

CRITICAL INSIGHT

"The issue is not lack of data. It is lack of correlation."

Revealed

KFON's monitoring stack already generates sufficient signal volume to detect failures proactively. What it lacks is the intelligence layer that connects those signals into actionable incidents by weighing temporal overlap, geographic proximity, infrastructure topology, and historical similarity. That gap became the entire design brief.

Role-based Dashboard

Failure Probability Scoring

Signal Correlation Engine

Dynamic Escalation Routing

Persistent SLA Visibility

PHASE 4

Comparative Analysis

Capability

PagerDuty

Datadog

Atlassian JSM

KFON

Proposed

Proactive signal detection

Partial (via integrations)

Yes (monitoring-native)

Native Correlation Engine

Multi-signal correlation

Yes (ML-based grouping)

Yes (metrics + logs + traces)

Temporal + Geo + Infra + Historical

Geographic proximity scoring

Limited (region tags)

District-level Blast Radius Model

Infrastructure topology awareness

Partial (SNMP maps)

Tiered Node Hierarchy

Role-differentiated dashboards

Partial (team views)

Partial (custom dashboards)

Partial (agent/supervisor views)

5 distinct role views

Persistent SLA visibility

No (per-incident only)

Yes (per-ticket)

Globally Persistent

Dynamic escalation routing

Yes (on-call policies)

Partial (routing rules)

FPS + Infra Type + Availability

Field technician mobile view

Yes (mobile app)

Limited

Yes (mobile app)

Designed as Future Scope

Complaint-driven fallback

N/A

Yes (ITSM ticketing)

Ingest Pathway Maintained

Designed for public infra/ISP

Proactive signal detection

Purpose-built

Failure Probability Scoring

No (binary alert model)

Partial (anomaly scoring)

0–100 Composite FPS

Environmental signal integration

Limited

Future Scope (weather, power grid)

THE EXISTING KFON OPERATIONAL STACK

Monitoring Layer

SNMP Traps

Automated alerts triggered when a device crosses a threshold (CPU, temperature, link status)

Syslog Collection

Continuous log stream from routers, switches, and servers that records every state change

Network Monitoring System (NMS)

Tracks live status of nodes, links, and devices across the network

Operational Layer

Trouble Ticketing System

Logs complaints, creates tickets, tracks escalation status

SLA Monitoring Dashboard

Tracks service availability against committed uptime targets

Historical Incident Archive

Past incidents with metadata such as region, time, fault type, resolution time

RESEARCH INSIGHT

The issue is not lack of data. The issue is lack of correlation between existing operational signals.

KFON's existing monitoring stack generates sufficient signal volume to detect most network failures proactively. The infrastructure for proactive operations already exists. What is missing is the intelligence layer that connects those signals, weights their significance, and surfaces actionable incidents to the right operational role at the right time.

WHO WE'RE DESIGNING FOR

Arjun

Tier 1 NOC Operator

"I see 200 alerts and I don't know which one is about to become a crisis."

Mental model: reactive triage, complaint-volume as severity signal.

Constraints: alert fatigue, shift handover gaps

Ramesh

NOC Lead

"I need to know the blast radius before I pick up the phone to brief the director."

Mental model: strategic triage, geography as primary lens.

Ajith

Field Technician

"I only find out what's actually broken when I'm already there."

Constraints: intermittent connectivity in field.

Design Process

THE DESIGN BRIEF

Faster, Smarter Incident Response

"A system that helps detect issues early, guide decisions, and resolve them faster."

Structure

Visibility

Accountability

Transform call-based requests into structured, traceable support tickets.

Enable real-time insight into issue status, ownership, and operational workload.

Define clear escalation paths & SLA expectations across support roles.

System Logic

FROM WAITING FOR COMPLAINTS TO TACKLING ISSUES EARLY

The length of each bar represents the time taken. These figures are not fully measured: the reactive values are based on available data from TRAI, and the proactive values are speculative.

Early Warning Signs

Reactive

5h 31m

Proactive

2h 6m

Instead of waiting for user complaints, the system detects unusual network behavior early and groups related signals before disruption spreads.

Finding Connected Issues

Reactive

5h 42m

Proactive

1h 41m

Instead of manually searching through tickets, the system automatically connects related issues across regions and infrastructure.

When Investigation Begins

Reactive

4h 19m

Proactive

2h 53m

Instead of waiting for escalation, investigation begins while the issue is still developing.

Collecting Information

Reactive

4h 41m

Proactive

2h 6m

Instead of manually collecting logs and alerts from different tools, important information is attached automatically during escalation.

Escalation Process

Reactive

5h 24m

Proactive

2h 31m

Instead of waiting for complaint volume and manual coordination, escalation begins early using confidence from connected signals.

Field Team Dispatch

Reactive

5h 38m

Proactive

2h 46m

Instead of waiting for widespread disruption, field teams receive earlier warnings and clearer operational context.

Service Disruption

Reactive

Proactive

1h 26m

Instead of reacting after users face service disruption, intervention begins before most users notice the issue.

WHY IS THE SHIFT SO SIGNIFICANT?

Incidents detected proactively are resolved 60–70% faster than incidents detected through user complaints, primarily because investigation can begin before the problem worsens.

— Average ISP network fault resolution time in India is 6.2 hours
TRAI Annual Report 2022–23

CORRELATION ENGINE

The Four Dimensions

Multiple small issues are compared together to understand whether they are part of a larger network failure.

Temporal Overlap

Multiple issues appearing together

Geographic Proximity

Multiple issues appearing in nearby locations

Infrastructure Overlap

Multiple issues affecting shared infrastructure

Historical Incident Similarity

Issues resembling past incidents
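
The case study does not specify how each dimension is measured. As a rough sketch only, three of the four dimensions could be scored on a 0 to 100 scale from a batch of raw signals as shown below. The `Signal` fields, the 15-minute window, and the "share of signals in the largest group" measure are all assumptions made for illustration; historical similarity is left out because it depends on an incident archive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    node: str
    district: str   # geographic grouping (hypothetical field)
    upstream: str   # parent distribution node (hypothetical field)
    minute: int     # minutes since an arbitrary reference time


def temporal_overlap(signals, window=15):
    """Share (0-100) of signals within `window` minutes of the earliest one."""
    start = min(s.minute for s in signals)
    close = sum(1 for s in signals if s.minute - start <= window)
    return 100.0 * close / len(signals)


def largest_group_share(values):
    """Share (0-100) of items that carry the most common value."""
    values = list(values)
    top = max(values.count(v) for v in set(values))
    return 100.0 * top / len(values)


signals = [
    Signal("n1", "Kasaragod", "dist-1", 0),
    Signal("n2", "Kasaragod", "dist-1", 4),
    Signal("n3", "Kasaragod", "dist-2", 9),
    Signal("n4", "Thiruvananthapuram", "dist-9", 40),
]

T = temporal_overlap(signals)                          # 3 of 4 signals -> 75.0
G = largest_group_share(s.district for s in signals)   # 3 of 4 signals -> 75.0
I = largest_group_share(s.upstream for s in signals)   # 2 of 4 signals -> 50.0
print(T, G, I)
```

A production correlation engine would likely use real topology and distance data rather than simple label matching; the point here is only that each dimension reduces to a comparable number.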

FAILURE PROBABILITY SCORE

Multiple small issues are compared together to understand whether they are part of a larger network failure.

Failure Probability Score (FPS)

Geographic Proximity and Infrastructure Overlap carry the highest weights (30% each) because the predominant failure patterns are physical. Temporal Overlap (25%) reflects that signal clustering within a short window is a strong but non-decisive indicator, since coincidental spikes also occur. Historical Incident Similarity (15%) is weighted lowest because KFON's incident archive is likely sparse and may not yet be reliable enough to carry heavy predictive weight at launch; this weight is designed to increase over time as the model is calibrated.

Temporal Overlap: 25%
Geographic Proximity: 30%
Infrastructure Overlap: 30%
Historical Incident Similarity: 15%

FPS = (0.25 × T) + (0.30 × G) + (0.30 × I) + (0.15 × H)
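
As a worked illustration of the formula above, the Python sketch below computes a composite score. It assumes each dimension (T, G, I, H) has already been scored on a 0 to 100 scale, which the 0 to 100 composite range implies but the case study does not spell out. The function name and the sample inputs are hypothetical.

```python
# Weights taken from the FPS formula; the 0-100 input scale is an assumption.
WEIGHTS = {"T": 0.25, "G": 0.30, "I": 0.30, "H": 0.15}


def failure_probability_score(t, g, i, h):
    """Weighted sum of the four correlation dimensions, each expected in 0-100."""
    scores = {"T": t, "G": g, "I": i, "H": h}
    for name, value in scores.items():
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    total = sum(WEIGHTS[name] * value for name, value in scores.items())
    return round(total, 1)


# Hypothetical cluster: tight in time, strongly co-located, sparse history.
fps = failure_probability_score(t=80, g=90, i=70, h=40)
print(fps)  # 20 + 27 + 21 + 6 = 74.0
```

Because the weights sum to 1, the composite stays within 0 to 100 whenever the inputs do, so it can be compared directly against the escalation bands.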

Escalation Thresholds

| FPS Range | Classification | System Behaviour | Operational Interpretation |
| --- | --- | --- | --- |
| 0-30 | Informational | Logged silently to audit trail; no active notification | Signal cluster does not meet incident threshold. May indicate minor device events, routine maintenance triggers, or baseline noise. |
| 31-55 | Monitoring | Surfaced on Tier 1 Agent dashboard as a tracked signal cluster; no escalation | Emerging pattern worth monitoring. Tier 1 can tag as "monitoring" and watch for FPS progression. Does not consume SLA clock. |
| 56-75 | Warning | Incident created; Supervisor notified; SLA clock starts; Tier 1 assigned | Pattern indicates likely service degradation. Requires active investigation. Field dispatch may be warranted depending on investigation findings. |
| 76-100 | Critical | Incident escalated directly to NOC Lead; all relevant roles notified; priority field dispatch triggered | High-confidence infrastructure failure with significant blast radius likely. All-hands operational response. SLA breach risk high. |
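
The escalation bands can be expressed as a small lookup. The sketch below is one possible reading, not the system's actual implementation: it treats a score of exactly 75 as Warning and assigns fractional scores to the band whose upper bound they do not exceed, both of which are assumptions. The function name is hypothetical; the behaviour strings are copied from the thresholds listed above.

```python
# Upper bound of each band (inclusive), in ascending order.
BANDS = [
    (30, "Informational"),
    (55, "Monitoring"),
    (75, "Warning"),      # assumption: exactly 75 is treated as Warning
    (100, "Critical"),
]

SYSTEM_BEHAVIOUR = {
    "Informational": "Logged silently to audit trail; no active notification",
    "Monitoring": "Surfaced on Tier 1 Agent dashboard as a tracked signal cluster; no escalation",
    "Warning": "Incident created; Supervisor notified; SLA clock starts; Tier 1 assigned",
    "Critical": "Incident escalated directly to NOC Lead; all relevant roles notified; "
                "priority field dispatch triggered",
}


def classify(fps):
    """Map a 0-100 Failure Probability Score to its escalation classification."""
    if not 0 <= fps <= 100:
        raise ValueError(f"FPS must be between 0 and 100, got {fps}")
    for upper, label in BANDS:
        if fps <= upper:
            return label
    raise AssertionError("unreachable: the last band covers up to 100")


for score in (12, 48, 74, 91):
    label = classify(score)
    print(score, label, "->", SYSTEM_BEHAVIOUR[label])
```

Keeping the bands in a single ordered list means the thresholds can be recalibrated in one place as the model matures.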

This is not an AI making autonomous decisions. It is a system that helps experienced operators see the connections they would see themselves if they had time to look at everything simultaneously.

SYSTEM ARCHITECTURE

Check Out the Simulated Workflow Using Claude

WHO USES THIS SYSTEM

Absence of Proactive Detection

The system has no mechanism to act on early signals. Monitoring tools generate data but nothing connects that data to intent. By the time an alert becomes a ticket, the window for early intervention has already closed.

Role access matrix
Column labels: Latency Spike, Live Alerts, Escalate, Dispatch Field, Regional View, Configure
Roles: Tier 1 Agent, Supervisor, Field Technician, NOC Lead, Admin
Access levels: Full, View Only, None

Fragmented Operational Visibility

Each team (support agents, supervisors, field technicians) operates with a partial view. There is no shared source of truth. Decisions get made on incomplete information. Handoffs happen over phone calls and Slack messages, not structured workflows.

Tier 1 Agent

Supervisor

Field Technician

NOC Lead

Operations Admin

Interface Design

CORE DESIGN PRINCIPLES

Operational Clarity

Every visual element in the interface must earn its place by serving a specific operational function. Decorative elements, complex gradients, animation beyond status indication, and non-functional visual hierarchy are eliminated. Research from NASA human factors studies of mission control environments found that operators in visually complex environments show measurably higher error rates and slower decision times than those in high-contrast, simplified environments. KFON's NOC environment applies this finding: the visual system is stripped to functional essentials.

Progressive Disclosure

Information is presented in layers corresponding to the depth of attention the operator can give it at any moment. The primary display layer contains only the information needed for ambient scanning — the operational state at a glance. Clicking into any element reveals the secondary layer — investigation context and decision support. Further drilling reveals the full data layer — raw signals, timelines, historical records. This three-layer model prevents information overload at the ambient level while ensuring full data access is never more than two interactions away.

Reduced Cognitive Load

The design systematically identifies decisions that the operator was previously making manually — and automates them or pre-stages them. The most significant examples: the correlation engine eliminates the manual signal-correlation task; the historical similarity display eliminates the manual incident history review; the pre-staged technician brief eliminates the manual information-gathering step before field dispatch. The interface is designed around what the operator needs to decide, not what they need to know in aggregate.

Priority-Based Colour Semantics

Colour is used as a primary information carrier, not a decorative layer. The four-level colour system maps directly to incident severity:
1) Critical (Red #DC2626)
2) Warning (Amber #D97706)
3) Stable (Green #16A34A)
4) Informational (Blue #0EA5E9)
This four-level system deliberately avoids purple, pink, or other colours that carry no established operational semantic meaning and may be ambiguous for colour-deficient operators. The semantic assignments are consistent with international standards for control room display design (IEC 60073: Basic and Safety Principles for Man-Machine Interface, Marking and Identification).

Dark Operational Environment

The interface uses a dark background environment — a deliberate choice for operational display contexts. Dark mode for NOC interfaces is well-supported in the literature:
Reduced eye strain during long shifts (operators may spend 8–12 hours monitoring screens)
Higher contrast ratio for critical alerts on dark backgrounds versus light
Reduced light emission in shared NOC spaces where multiple operators share a darkened environment
Alignment with operator expectation — NOC professionals expect dark-environment tools

High Scanability Under Stress

Under high-pressure conditions, the human visual system shifts toward peripheral scanning rather than focused reading. Interface elements must be designed to communicate their operational status before the operator focuses on them. This is achieved through size hierarchy (larger = more important), position consistency (critical elements are always in the same position), and pre-attentive attributes — visual properties that the brain processes before conscious attention is applied (colour, motion, shape contrast, spatial grouping).

TYPOGRAPHY

Inter

Inter - Bold

Critical Incident IDs

Inter

Inter - Semibold

Section Headers

Inter

Inter - Medium

Operational Metrics

Inter

Inter - Regular

Supporting Context

Inter

Inter - Neutral

Passive Telemetry

OPERATIONAL COLOR LOGIC

#FF5A5A

Critical

Confirmed outage, urgent escalation

#F5A623

Warning

SLA risk, degradation, investigation state

#22C55E

Stable

Recovery, success verification

#3B82F6

Informational

Infrastructure context, neutral visibility

#0B1220

Ambient Surface

Passive monitoring regions

#111827

Elevated Surface

Active investigation containers

ALERT STATES - STATUS CHIPS

Passive

Muted visibility for stable monitoring and non-actionable telemetry.

Active

Elevated visibility for investigation, degradation, and operational attention.

Critical

Persistent visibility for confirmed outages and urgent escalation.

OPERATIONAL WALKTHROUGH

The following walkthrough simulates how a potential infrastructure failure progresses through the proposed operational system.

Type Hierarchy

KEY DESIGN DECISIONS

Role-Based Views Over Role-Based Permissions

The standard enterprise approach to multi-role systems is a shared interface with permission layers: the same dashboard, with some features hidden for lower-permission roles. This approach is efficient to build but operationally costly, because it forces every operator to navigate an interface designed for the broadest use case while mentally filtering out information irrelevant to their role.

Research from human factors studies of complex operational environments consistently shows that mental filtering under time pressure is unreliable — operators under stress tend to attend to familiar or prominent elements regardless of their role relevance, and miss unfamiliar or subtly presented elements regardless of their operational importance. Role-specific interfaces eliminate this problem by only presenting role-relevant information.

The cost of this decision is design and engineering complexity — five interfaces rather than one. The operational benefit — reduced cognitive load, faster decision times, role-appropriate information hierarchy — justifies this cost. The shared data architecture (correlation engine, FPS model, incident registry) ensures consistency across all five views without requiring five separate data systems.

AI as Suggestion, Not Automation

Early design explorations included a fully automated routing model — the system not only correlates signals and scores incidents, but also dispatches field technicians and manages the incident lifecycle without human decision points. This was rejected on three grounds:

Operational Accountability: In public infrastructure operations, decisions about resource dispatch have operational and financial consequences. Automated dispatch without human oversight removes the accountability that operational governance requires.
Model Reliability: An ML-based correlation engine will make errors, particularly in early deployment when historical data is limited. Operators must be able to override, reject, or modify system suggestions. An automation-first model reduces operator engagement with the system's outputs and atrophies the human judgment that catches model errors.
Trust and Adoption: Research in human-AI collaboration (particularly in high-stakes operational contexts) consistently shows that operators are more likely to adopt and trust systems that support their decisions rather than replace them. A suggestion-based AI layer builds operator familiarity with the system's reasoning; a fully automated layer may be accepted during normal operations but rejected during incidents, exactly when the system is most needed.

The designed AI layer suggests: preliminary incident classification, historical analogue identification, field dispatch recommendations, and resolution pathway options. All suggestions are labelled as system-generated and require operator confirmation. The system tracks operator agreement/disagreement with suggestions, which feeds the model improvement loop.
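
The confirmation requirement and the agreement tracking described above can be illustrated with a minimal sketch. It is not the system's actual data model: the names `Suggestion`, `Decision`, and `agreement_rate` are hypothetical, and counting only unmodified acceptances as agreement is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Decision(Enum):
    ACCEPTED = "accepted"
    MODIFIED = "modified"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    incident_id: str
    kind: str                             # e.g. "classification" or "dispatch"
    proposal: str
    source: str = "system-generated"      # label shown next to every suggestion
    decision: Optional[Decision] = None   # None until an operator responds

    @property
    def actionable(self) -> bool:
        # Nothing is acted on until an operator accepts or modifies it.
        return self.decision in (Decision.ACCEPTED, Decision.MODIFIED)


def agreement_rate(suggestions: List[Suggestion]) -> Optional[float]:
    """Share of reviewed suggestions accepted without changes (assumed metric)."""
    reviewed = [s for s in suggestions if s.decision is not None]
    if not reviewed:
        return None
    accepted = sum(1 for s in reviewed if s.decision is Decision.ACCEPTED)
    return accepted / len(reviewed)
```

In this sketch an unreviewed suggestion is never actionable, which mirrors the rule that every suggestion requires operator confirmation; the agreement rate is one possible input to the model improvement loop.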

Persistent SLA Visibility

SLA countdown timers in conventional ITSM tools are per-incident displays — visible when you are viewing that specific ticket. This means an operator focused on a complex incident investigation may not notice that a second, less severe incident is approaching its SLA breach threshold.

KFON's design makes SLA status globally persistent: regardless of what view the operator is in or what incident they are investigating, the system-wide SLA risk state is continuously visible. The top-right region of every role interface displays a real-time SLA status indicator — green (all incidents within SLA), amber (one or more incidents at risk), red (one or more incidents in breach). This requires only 20–30 pixels of persistent display space but eliminates the category of SLA breach caused by operator attention being elsewhere.
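
The green, amber, and red roll-up can be sketched as a worst-state aggregation over all open incidents. This is an illustration only: the `OpenIncident` type and the 80% "at risk" threshold are assumptions, since the case study does not define when an incident counts as at risk.

```python
from dataclasses import dataclass
from typing import Iterable

# Assumption: an incident is "at risk" once 80% of its SLA window has elapsed.
AT_RISK_FRACTION = 0.8


@dataclass
class OpenIncident:
    incident_id: str
    elapsed_minutes: float
    sla_minutes: float


def incident_state(incident: OpenIncident) -> str:
    """Classify a single incident against its own SLA window."""
    if incident.elapsed_minutes >= incident.sla_minutes:
        return "red"
    if incident.elapsed_minutes >= AT_RISK_FRACTION * incident.sla_minutes:
        return "amber"
    return "green"


def global_sla_status(incidents: Iterable[OpenIncident]) -> str:
    """Return the worst state across all open incidents."""
    states = {incident_state(i) for i in incidents}
    if "red" in states:
        return "red"
    if "amber" in states:
        return "amber"
    return "green"
```

Because the indicator reports the worst state across every open incident, a single approaching breach turns it amber even while the operator is working on a different ticket.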

Flexible Routing Over Linear Escalation

Linear escalation (Tier 1 → Supervisor → NOC Lead) is simple to understand but operationally inefficient. In a linear model, every incident passes through Tier 1 even if its FPS score at first detection already warrants NOC Lead attention. Critical infrastructure failures, which frequently require immediate expert response, are delayed by mandatory staging through lower escalation tiers.

KFON's dynamic routing logic allows incidents to skip escalation tiers based on FPS score, infrastructure type, and operational risk index. A critical FPS (76–100) incident affecting hospital infrastructure routes directly to the NOC Lead and field engineering simultaneously, bypassing Tier 1 and Supervisor routing entirely. This is consistent with PagerDuty's multi-tier escalation policy model and Google SRE's incident command structure, which both allow direct escalation to technical leads for P1-level incidents.

Linear routing is retained as a fallback for incidents without sufficient metadata to determine an appropriate dynamic route — ensuring no incident falls through the system.
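
A minimal sketch of this routing rule, under stated assumptions, might look like the following. Only the critical-band hospital rule and the linear fallback come from the design described above; the `Incident` type and the `route` function are hypothetical, the operational risk index is left out for brevity, and sending every other case to Tier 1 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Fallback entry point into the Tier 1 -> Supervisor -> NOC Lead chain.
LINEAR_START = ["Tier 1"]


@dataclass
class Incident:
    incident_id: str
    fps: Optional[int] = None                  # Failure Probability Score, 0-100
    infrastructure_type: Optional[str] = None  # e.g. "hospital"


def route(incident: Incident) -> List[str]:
    """Return the roles that should receive the incident first."""
    # Missing metadata: fall back to linear escalation so nothing is dropped.
    if incident.fps is None or incident.infrastructure_type is None:
        return list(LINEAR_START)
    # Critical band (76-100) on hospital infrastructure skips Tier 1 and Supervisor.
    if incident.fps >= 76 and incident.infrastructure_type == "hospital":
        return ["NOC Lead", "Field Engineering"]
    # Assumption: all remaining cases start at Tier 1.
    return list(LINEAR_START)
```

For example, `route(Incident("INC-1", fps=82, infrastructure_type="hospital"))` returns both the NOC Lead and field engineering, while an incident with no FPS score enters the linear chain at Tier 1.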

EXPECTED IMPACT

Metric | Current State | Projected State
--- | --- | ---
Mean Time To Detect (MTTD) | 35–240 minutes | 5–15 minutes
Mean Time To React (MTTR) | 2–6 hours | 1–3 hours
Proactive Detection Rate | ~8% of incidents detected before user complaint | 65–75% of incidents detected before first user complaint within 12 months of full deployment
SLA Breach Rate | Unknown (true SLA performance masked by complaint-initiated clock) | 30–40% reduction in actual SLA breaches through earlier detection and escalation
NOC Operator Alert Volume | High volume, low signal-to-noise ratio | 90%+ reduction in actionable alert volume through FPS filtering
Field Dispatch Efficiency | Technicians dispatched with incomplete information; high on-site diagnostic time | 25–35% reduction in average on-site resolution time through pre-staged investigation briefs

The proactive detection projection is based on PagerDuty's published deployment case studies showing 55–70% proactive detection improvement in enterprise NOC environments within 6–18 months of AIOps layer deployment, adjusted upward for KFON's smaller, more geographically concentrated network topology.

FUTURE SCOPE

Phase 02
Validation and Refinement (Months 1–12 Post-Launch)

Real NOC Operator Testing

FPS Model Calibration

Integration Engineering

Phase 03
Intelligence Deepening (Months 12–24)

Technician Mobile Workflows

Explainable AI Reasoning Layer

Environmental Signal Integration

Phase 04
Predictive Operations (24+ Months)

Preventive Maintenance Intelligence

Feedback Learning Systems

Cross-Operator Incident Intelligence

KEY LEARNINGS

Learning 01:
Enterprise UX is Fundamentally About Operational Clarity

Learning 02:
Visibility Without Context Creates Noise

Learning 03:
Systems Thinking is a Design Prerequisite, Not an Add-On

Learning 04:
Correlation is More Valuable Than Volume

Learning 05:
High-Pressure Interfaces Require Prioritisation Over Completeness

PROJECT REFLECTION

This project started as a simple website redesign idea for my academic work. Conversations with KFON officials revealed how the network actually operates, how it engages with user issues, and how it tackles them, which helped me frame the project as a whole. Coming from a planning background, working on and for public infrastructure was not new to me, but doing so in the design field was. That is how this project began.

Public infrastructure operates at the intersection of operational complexity and civic obligation. KFON is not trying to improve an experience; it is trying to maintain a service that schools, hospitals, and households depend on. When KFON's network fails, children lose examination sessions, patients lose connected care, and citizens lose access to government services. The stakes are not abstract.

This project never went exactly as I wanted. Data collection, design, and the process itself were always messy, but that helped me learn many things the hard way. Thanks to my seniors working in the field, I received feedback and suggestions that helped me reach the current state. The project is not yet complete, and I am trying to evolve it as far as possible. The aim was to see what the ideal dashboard and design system would be: one that could solve a problem, or at least provide a resolution, before it ever became one. I would love to hear thoughts on this and, in part, learn how to build it. Thank you.

View the Process and Mapping

View the Detailed Technical Document
