
Observability Platform — Grafana + Zabbix

Unified monitoring, alerting, and dashboards for infrastructure and service health.

Repository

View on GitHub

Problem

Monitoring was fragmented across tools and teams, creating alert noise, poor visibility, and slow incident response.

Scope: 50+ hosts, 12 critical services
Timeline: 4 weeks
Stack: Zabbix, Grafana, Linux, Alerting
Role: DevOps Engineer

Architecture Diagram

This platform centers on a Zabbix server that collects metrics from lightweight agents, feeds Grafana dashboards, and routes alerts to the on-call team via email and chat.

Architecture: hosts and services → Zabbix agents → Zabbix server → Grafana dashboards, with alerts routed from the Zabbix server to email, chat, and on-call.
Hosts / Services
      |
Zabbix Agents
      |
Zabbix Server ----> Alerting (Email / Chat / On-call)
      |
Grafana (Zabbix Datasource)
      |
Dashboards + SLA Reports
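
To make the data path concrete, the sketch below reads the latest CPU utilization value Zabbix has stored for a host, the same series the Grafana dashboards chart. It is a minimal sketch rather than part of the deployed platform: the pyzabbix client, the endpoint zabbix.example.com, the credentials, the host name web-01, and the item key system.cpu.util are illustrative assumptions.

# Minimal data-path check: agent -> Zabbix server -> the series Grafana charts.
# Assumptions: pyzabbix installed, Zabbix frontend at zabbix.example.com,
# host "web-01" linked to a Linux agent template exposing system.cpu.util.
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.com")
zapi.login("api_user", "api_password")  # or an API token, depending on version

# Resolve the host, then read the latest stored value of its CPU utilization item.
hosts = zapi.host.get(filter={"host": ["web-01"]}, output=["hostid", "host"])
if not hosts:
    raise SystemExit("Host web-01 not found; check agent registration and templates")

items = zapi.item.get(
    hostids=hosts[0]["hostid"],
    search={"key_": "system.cpu.util"},
    output=["name", "key_", "lastvalue", "lastclock"],
)
for item in items:
    print(f'{item["name"]} ({item["key_"]}): {item["lastvalue"]}')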
      

Setup Steps

  1. Deploy the Zabbix server and agents with standardized templates.
  2. Connect Grafana to the Zabbix datasource and validate that metrics flow end to end (see the sketch after this list).
  3. Create dashboards for infrastructure health, critical services, and SLA reporting.
  4. Configure alert rules, routing, and maintenance windows.
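
For step 2, one quick validation is to ask Grafana's HTTP API whether a Zabbix datasource is registered before trusting any dashboard. A hedged sketch follows; the Grafana URL, the service-account token, and the loose match on a type string containing "zabbix" are assumptions to adapt to the actual install.

# Sketch for setup step 2: confirm Grafana has a Zabbix datasource registered.
# Assumptions: Grafana at grafana.example.com and a service-account token with
# permission to list datasources; the type check on "zabbix" is deliberately loose.
import requests

GRAFANA_URL = "https://grafana.example.com"   # assumption: replace with the real URL
TOKEN = "service-account-token"               # assumption: replace with a real token

resp = requests.get(
    f"{GRAFANA_URL}/api/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()

zabbix_sources = [ds for ds in resp.json() if "zabbix" in ds.get("type", "").lower()]
if zabbix_sources:
    for ds in zabbix_sources:
        print(f'Found Zabbix datasource: {ds["name"]} (type={ds["type"]})')
else:
    print("No Zabbix datasource registered; check plugin install and provisioning")

If nothing comes back, the usual culprits are a missing Zabbix plugin in Grafana or a datasource that was never provisioned.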

Dashboard Walkthrough

Alert Example
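
A single alert cannot show the whole rule set, but the sketch below illustrates the kind of critical trigger the platform relies on, created here through the Zabbix API. The host name web-01, the 90% threshold, and the 5-minute window are assumptions, and the expression uses Zabbix 6.x syntax; older servers use a different trigger format.

# Sketch: define a critical CPU-saturation trigger via the Zabbix API.
# Assumptions: Zabbix 6.x expression syntax, pyzabbix installed, a host "web-01"
# with the agent item system.cpu.util; the threshold and window are illustrative.
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.com")
zapi.login("api_user", "api_password")

zapi.trigger.create(
    description="High CPU utilization on {HOST.NAME}",
    expression="avg(/web-01/system.cpu.util,5m)>90",  # 5-minute average above 90%
    priority=4,  # 4 = High severity in Zabbix
)

In the platform itself such triggers belong in the standardized templates rather than on individual hosts; the direct API call is only meant to make the alert definition concrete.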

Incident Response Runbook (Example)

  1. Alert received: A critical trigger fires (for example, CPU saturation or a service down).
  2. Triage: Check the Grafana dashboard, confirm host status, and inspect recent changes (a triage query sketch follows this runbook).
  3. Escalation: Page the service owner if the SLA is at risk or the incident persists for more than 15 minutes.
  4. Resolution: Apply the mitigation, validate recovery, and close the alert with notes.
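
As referenced in step 2, triage can start from the API as well as the dashboard: the sketch below lists the currently active high-severity problems straight from Zabbix. It assumes pyzabbix and the same illustrative endpoint and credentials as the earlier sketches; severities 4 and 5 correspond to High and Disaster.

# Triage sketch for runbook step 2: list active high-severity Zabbix problems.
# Assumptions: pyzabbix and the same illustrative endpoint/credentials as above.
from datetime import datetime
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.com")
zapi.login("api_user", "api_password")

problems = zapi.problem.get(
    output=["eventid", "name", "severity", "clock"],
    severities=[4, 5],        # 4 = High, 5 = Disaster
    sortfield=["eventid"],
    sortorder="DESC",
)
for p in problems:
    started = datetime.fromtimestamp(int(p["clock"]))
    print(f'[severity {p["severity"]}] {p["name"]} (since {started})')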

Incident Story

Zabbix triggered a critical alert for elevated latency on the checkout service during peak traffic. Grafana dashboards showed CPU saturation and a sharp rise in database wait time. The on-call engineer scaled the database read replicas and restarted a stuck worker queue, restoring normal latency within 20 minutes. A follow-up action adjusted alert thresholds and added a dashboard panel to track queue depth.

Screenshots

Dashboard Placeholder
Grafana dashboards: service SLOs, latency, and error rates in one view.
Infrastructure Placeholder
Zabbix host overview: inventory health, agent status, and key metrics.
Alert Placeholder
Alert configuration: trigger thresholds and escalation paths.

Metrics

Suggested validation sources: Zabbix alert history, Grafana dashboards, incident reports.
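
One way to back the alert-volume figures below with data is to count trigger problem events per day from Zabbix's event history. The sketch below does that for the last seven days; it assumes pyzabbix and the illustrative endpoint used earlier, and the exact filters would need to match however "alerts/day" is defined for this platform.

# Sketch: count trigger problem events per day from Zabbix event history,
# as one validation source for the alert-volume figures below.
# Assumptions: pyzabbix and the illustrative endpoint/credentials used earlier.
import time
from pyzabbix import ZabbixAPI

zapi = ZabbixAPI("https://zabbix.example.com")
zapi.login("api_user", "api_password")

DAY = 86400
now = int(time.time())
for days_ago in range(7, 0, -1):
    start = now - days_ago * DAY
    count = zapi.event.get(
        countOutput=True,
        source=0,             # 0 = trigger events
        value=1,              # 1 = problem events (recoveries excluded)
        time_from=start,
        time_till=start + DAY,
    )
    print(f"{days_ago} day(s) ago: {count} problem events")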

Before vs After

Monitoring Setup

Before: Multiple tools with isolated views and no shared dashboards.

After: Unified Zabbix + Grafana stack with standardized dashboards.

Alert Volume

Before: ~120 alerts/day with frequent duplicates.

After: ~75 alerts/day with tuned thresholds and routing.

Response Time

Before: MTTR ~45 minutes.

After: MTTR ~25 minutes.

Lessons Learned