Best Practices for Incident Response

In complex products and systems, failures are inevitable. When incidents occur, we must not only act swiftly to restore business continuity but also continually improve and extract lessons to prevent recurrence. This article summarizes a practical “best response strategy” aimed at providing actionable emergency handling and post-mortem frameworks for development teams.

1. Golden Rules of Incident Handling

1.1 Stopping the Bleeding Takes Top Priority

In emergency response, the primary goal is to restore product functionality as quickly as possible—similar to the “stop the bleeding” principle in first aid. Root cause analysis can wait; the priority is immediate recovery.

1.2 Identify the Triggering Variables

Response measures must support phased rollouts to avoid expanding the scope of the problem. The execution of the plan should be efficient yet cautious, ensuring no additional risks are introduced.

Variables are often the trigger point of failures: These are typically the first suspects and relatively easy to spot.
Analyze variables for quick containment: Focus your investigation on the variables to locate the issue and take immediate action.

1.3 Careful and Efficient Execution of Containment Plans

While executing containment measures, avoid making the situation worse. Balance speed with thoroughness.

2. Strengthening Incident Response Capabilities

2.1 Effective Communication

During emergency handling, the product owner should oversee the entire situation, while team members must quickly synchronize their findings and divide responsibilities to narrow down the problem scope.

2.2 Sharpen the Basics

Improve familiarity with business logic
Build a toolkit of handy scripts and utilities
Establish streamlined troubleshooting processes

2.3 Proactive Measures in Feature Development

It’s essential to enforce the following during development:

Gray release support
Monitoring capability
Rollback readiness

Avoid “wishful thinking” and “low-value tasks”; even if it requires extra effort, product quality and safety must not be compromised.

2.4 Learn from Excellent Postmortems

Study postmortems from leading companies like Cloudflare, to inspire fresh thinking and continuous improvement.

2.5 Mindset Adjustment

Incident response is not an exam. Teams should maintain a constructive mindset, focusing on problem-solving and learning valuable lessons from each event.

3. Postmortem Analysis

3.1 Core Objectives

Prevent recurrence of the incident

3.2 Key Considerations

Ensure thorough resolution of the issue
Document the incident timeline and root causes
Implement targeted improvement actions
Establish guidelines and systems to guard against similar problems
Maintain a holistic view
Ensure high-quality execution of action items
Integrate temporary fixes into long-term improvements

4. Accountability

4.1 Attending the Incident Review Meeting

The purpose of the review is to acknowledge issues and extract lessons—not simply to assign blame. Taking responsibility is both a duty and a growth opportunity.

4.2 Mindset Adjustment

The team should maintain a proactive attitude, learn from mistakes, and avoid repeating them. And if worst comes to worst and the issue proves unsolvable—well, sometimes you have to “grab your bucket and leave” (just kidding!).