How do you design systems that handle failures gracefully?

Question

Accepted Answer

At scale, **failures are inevitable**: servers crash, networks fail, and dependencies become unavailable. Designing for failure means building systems that **tolerate and recover from failures gracefully** instead of assuming everything works. This is essential for reliable systems.

## Designing for failure (the mindset)

```text
ASSUME things WILL fail → at scale, failures are NORMAL, not exceptional:
  → servers crash, networks partition, disks fail, dependencies go down, traffic spikes
  → design systems to EXPECT and HANDLE failures gracefully (not assume everything works)
→ "everything fails all the time" → build resilience in.
```

## Resilience techniques

```text
✓ REDUNDANCY → multiple instances, no single point of failure (failover to healthy ones)
✓ RETRIES (with backoff) → retry transient failures (with exponential backoff + jitter to
  avoid overwhelming a recovering service)
✓ TIMEOUTS → don't wait forever for a failing dependency (fail fast)
✓ CIRCUIT BREAKERS → stop calling a failing service temporarily (prevent cascading failures;
  give it time to recover) → fail fast and fall back
✓ GRACEFUL DEGRADATION → reduced functionality vs total failure (e.g. show cached/partial
  data if a service is down)
✓ FALLBACKS → a default/alternative when something fails
✓ BULKHEADS / isolation → contain failures (one part failing doesn't sink everything)
```
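As an illustration of the retry item above, here is a minimal Python sketch of retries with exponential backoff and "full jitter". The names `TransientError`, `backoff_delays`, and `call_with_retries` are hypothetical, as are the default values for `base` and `cap`; a real client would decide which of its own errors count as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a timeout or a 503)."""


def backoff_delays(retries, base=0.1, cap=5.0, rng=random.random):
    """Yield one jittered delay per retry.

    The n-th delay is rng() * min(cap, base * 2**n), so the upper bound
    doubles each time (exponential backoff) and the random factor spreads
    clients out so they do not all retry at the same instant (jitter).
    """
    for n in range(retries):
        yield rng() * min(cap, base * (2 ** n))


def call_with_retries(fn, attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn(); on TransientError, wait a jittered backoff and try again.

    Makes at most `attempts` calls in total, then re-raises the last error.
    Any other exception type propagates immediately without a retry.
    """
    delays = backoff_delays(attempts - 1, base, cap)
    while True:
        try:
            return fn()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the failure to the caller
            sleep(delay)
```

Retrying is only safe when the operation is idempotent or the server deduplicates requests; otherwise a retry after an ambiguous failure can apply the same change twice.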

## Avoiding cascading failures

```text
⚠️ CASCADING failures → one failure triggers others (e.g. a slow service exhausts callers'
  resources → they fail too → spreads)
→ prevent with: timeouts, circuit breakers, isolation/bulkheads, load shedding, backpressure
✓ MONITORING/alerting → detect failures fast; test failure scenarios (chaos engineering)
```
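Isolation and load shedding can be combined in a few lines. The sketch below is one possible shape, not a standard API: `Bulkhead` and `BulkheadFull` are hypothetical names. It caps the number of in-flight calls to a single dependency and rejects the excess immediately instead of letting requests queue up and exhaust the caller's threads.

```python
import threading


class BulkheadFull(Exception):
    """Raised when the compartment is full and the call is shed."""


class Bulkhead:
    """Cap concurrent calls to one dependency; reject extras instead of queueing."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: if no slot is free, shed the request right away
        # rather than tying up another thread behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("too many in-flight calls")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()  # always free the slot, even if fn raised
```

Giving each downstream dependency its own `Bulkhead` instance means a slow dependency can occupy at most its own slots, so calls to healthy dependencies keep working.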

## Why this matters

Knowing how to design systems that handle failures gracefully is core system-design knowledge because **failures are inevitable at scale**, and designing for them is what makes a system reliable.

The underlying mindset is to **assume that things will fail**. At scale, failures are normal, not exceptional: servers crash, networks partition, and dependencies go down. Designing a system to expect failures and handle them gracefully, rather than assuming everything works, is the foundation of reliable systems, rooted in the principle that "everything fails all the time". The main practical knowledge is the set of **resilience techniques**: **redundancy** (no single point of failure), **retries with backoff** (handling transient failures, with exponential backoff and jitter so a recovering service is not overwhelmed), **timeouts** (failing fast instead of waiting forever), **circuit breakers** (temporarily stopping calls to a failing service to prevent cascading failures and give it time to recover), **graceful degradation** (reduced functionality instead of total failure, such as showing cached data), **fallbacks**, and **bulkheads/isolation** (containing failures).

These techniques are how systems tolerate and recover from the failures that will inevitably occur.
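To make the circuit breaker and fallback ideas concrete, here is a minimal single-threaded sketch. The class name `CircuitBreaker`, its parameters, and the default thresholds are assumptions for illustration; production libraries typically add locking, per-error-type policies, and metrics.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> one trial call after a cool-down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast and degrade without touching the dependency.
                return fallback()
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            return fallback()
        # Success: close the circuit and forget earlier failures.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might use it as `breaker.call(fetch_recommendations, lambda: cached_recommendations)`, where both names are placeholders: while the circuit is open, users see cached data instead of an error, which is graceful degradation in practice.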

Knowing **how to avoid cascading failures** is especially important. In a cascading failure, one failure triggers others: a slow service exhausts its callers' resources, and the failure spreads. Timeouts, circuit breakers, isolation, load shedding, and backpressure prevent this, which matters because cascading failures turn small problems into large outages.
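A timeout is the simplest of these defenses. The sketch below bounds how long a caller waits on a slow dependency and returns a fallback value when the deadline passes. `call_with_timeout` and the pool size of 4 are hypothetical choices. One caveat is stated in the code as well: Python cannot interrupt a running thread, so the slow call keeps occupying a worker until it finishes. In real clients, prefer the timeout parameters of the network library itself.

```python
import concurrent.futures
import time

# A small, bounded pool: slow calls can occupy at most these workers.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Run fn() in the pool and wait at most timeout_s seconds for its result.

    Returns `fallback` if the deadline passes. Exceptions raised by fn
    propagate to the caller unchanged.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the task has not started yet; a task that is
        # already running continues in the background until fn returns.
        future.cancel()
        return fallback


def slow_dependency():
    """Stand-in for a degraded service that takes far too long to answer."""
    time.sleep(0.5)
    return "fresh"
```

With these definitions, `call_with_timeout(slow_dependency, 0.05, "cached")` gives up after roughly 50 ms and returns `"cached"`, so the caller's own request thread is released instead of waiting on the slow service.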

Monitoring and testing failure scenarios (chaos engineering) complete the picture.

In short, designing for failure combines a mindset (assume things will fail), a set of resilience techniques (redundancy, retries, circuit breakers, graceful degradation), and deliberate prevention of cascading failures. Because failures are inevitable at scale, this design-for-failure approach is what separates robust systems from fragile ones.