How would you design monitoring for an application from scratch?

Question

Accepted Answer

Start **top-down, from what users actually experience**, rather than bottom-up from infrastructure. Even the most reliable host fleet is worthless if requests are failing, so begin with user-facing **SLIs** (**latency**, **error rate**, **availability**), then add the four golden signals, and leave infra metrics for last.

## Layer by layer, from the user inward

```text
1. USER-FACING SLIs   → what the user experiences (latency, errors, availability)
2. GOLDEN SIGNALS     → latency, traffic, errors, saturation per service
3. INFRA METRICS      → CPU, memory, disk, network (causes, not symptoms)
```

If you watch only CPU and disk (bottom-up), everything can look perfectly green while users are getting 500s. Looking at SLIs first (top-down) means you **alert on the symptoms users actually feel**, then drill down into golden signals and infra to find the cause.

## The pipeline: instrument → collect → dashboard → alert

```text
INSTRUMENT  app emits metrics/logs/traces (e.g. request_duration_seconds histogram)
   ↓
COLLECT     a collector scrapes/ingests them into a TSDB (Prometheus, Datadog agent)
   ↓
DASHBOARD   visualize SLIs + golden signals (Grafana) for humans to read
   ↓
ALERT       fire on SLO violations / burn rate, routed to on-call
```
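To make the INSTRUMENT step concrete, here is a minimal Python sketch of what an app might record per request: a counter keyed by status code and a latency histogram. `RequestMetrics`, its methods, and the bucket bounds are hypothetical, stdlib-only stand-ins for illustration. In practice a metrics client library would do this work and expose the values for scraping.

```python
import bisect

# Hypothetical bucket upper bounds in seconds; tune them around your latency SLO.
BUCKETS = [0.05, 0.1, 0.3, 1.0, float("inf")]


class RequestMetrics:
    """Toy in-process stand-in for a metrics client (not a real library API)."""

    def __init__(self):
        self.requests_total = {}                  # status code -> request count
        self.bucket_counts = [0] * len(BUCKETS)   # per-bucket, non-cumulative

    def observe(self, status, duration_seconds):
        """Record one finished request."""
        self.requests_total[status] = self.requests_total.get(status, 0) + 1
        # bisect_left returns the first bucket whose upper bound >= duration.
        index = bisect.bisect_left(BUCKETS, duration_seconds)
        self.bucket_counts[index] += 1

    def cumulative_buckets(self):
        """Return (upper_bound, cumulative_count) pairs, as histograms expose them."""
        running, pairs = 0, []
        for upper_bound, count in zip(BUCKETS, self.bucket_counts):
            running += count
            pairs.append((upper_bound, running))
        return pairs


# Usage: call observe() once per request, e.g. from a middleware.
metrics = RequestMetrics()
for status, duration in [(200, 0.02), (200, 0.2), (500, 0.8)]:
    metrics.observe(status, duration)
```

Recording by status code and by latency bucket is what lets the later stages derive both an availability ratio and a latency percentile from the same two series.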

## A concrete starting point

```promql
# Availability SLI: fraction of requests that succeed
sum(rate(http_requests_total{status!~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Latency SLI: p99 request latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```
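For intuition, the arithmetic behind these two queries can be sketched in plain Python. This is an approximation under stated assumptions, not Prometheus's implementation: `availability` and `bucket_quantile` are hypothetical helper names, the inputs are raw counts rather than per-second rates, and the quantile uses linear interpolation within a bucket, which is roughly how `histogram_quantile` estimates its result.

```python
import math


def availability(counts_by_status):
    """Fraction of requests whose status is not 5xx (mirrors status!~"5..")."""
    total = sum(counts_by_status.values())
    if total == 0:
        return math.nan  # no traffic: the ratio is undefined
    good = sum(n for status, n in counts_by_status.items()
               if not 500 <= status <= 599)
    return good / total


def bucket_quantile(q, cumulative):
    """Estimate quantile q from sorted (upper_bound, cumulative_count) pairs.

    The last pair is assumed to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in cumulative:
        if count >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into +Inf; fall back to the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Illustrative numbers only: 1000 requests, one of them a 500.
sli_availability = availability({200: 990, 404: 9, 500: 1})
# Illustrative histogram: 90 requests under 100ms, 99 under 300ms, 100 under 1s.
sli_p99 = bucket_quantile(0.99, [(0.1, 90), (0.3, 99), (1.0, 100), (math.inf, 100)])
```

Because the estimate is interpolated inside a bucket, it is only as precise as the bucket boundaries, so it helps to place a boundary at or near the latency threshold you care about.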

Define an **SLO** for each SLI (for example, 99.9% availability, p99 < 300ms), put them on a dashboard, and alert when the SLO is at risk, not on every minor blip.
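One common way to express "the SLO is at risk" is a burn rate: how fast the error budget (1 minus the SLO target) is being spent relative to a pace that would use it up exactly over the SLO period. The sketch below is illustrative only. The function names are hypothetical, and the 14.4 threshold with a short-window plus long-window check is an assumption borrowed from common multi-window alerting practice, not something prescribed here.

```python
def burn_rate(observed_availability, slo_target):
    """How many times faster than sustainable the error budget is being spent.

    1.0 means the budget would last exactly the SLO period; 10.0 means it
    would be gone in a tenth of that time.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target
    observed_error_ratio = 1.0 - observed_availability
    return observed_error_ratio / error_budget


def should_page(short_window_availability, long_window_availability,
                slo_target, threshold=14.4):
    """Page only if both windows burn fast (assumed threshold, tune per team).

    The long window shows the problem is significant; the short window shows
    it is still happening, which filters out brief blips.
    """
    return (burn_rate(short_window_availability, slo_target) >= threshold
            and burn_rate(long_window_availability, slo_target) >= threshold)


# Illustrative values: 2% errors against a 99.9% SLO burns the budget ~20x too fast.
current_burn = burn_rate(0.98, 0.999)
sustained_outage = should_page(0.98, 0.985, 0.999)   # both windows are bad
brief_blip = should_page(0.98, 0.9995, 0.999)        # long window is healthy
```

Alerting on burn rate rather than on raw error counts keeps pages tied to the SLO: a short spike that barely dents the budget stays quiet, while a sustained problem pages quickly.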

## Why this matters

Monitoring built bottom-up tells you a disk is 80% full, but not that customers cannot check out. Starting from user-facing SLIs ties every dashboard and alert to real user impact, keeps noise low, and gives you a clear drill-down path when something breaks (symptom → golden signal → infra cause).