<CrowdStrike> Navigating System Failures

Kovid Batra - Aug 1 - Dev Community

The only problem with troubleshooting is that sometimes trouble shoots back.

This week, we spoke with Nitish Goyal, Principal Engineer at AWS, who shared his insights on navigating the challenges of troubleshooting. It's the 'sometimes' that causes most of the worry for engineering leaders. System failures are inevitable, but how we respond to them can define our success as tech leaders.

Now, over to him to delve deeper into this topic.

We all know what recently happened with CrowdStrike: a faulty update rolled out to Windows machines, bringing the dreaded ‘Blue Screen of Death’ and causing a global outage. So I thought it would be a perfect time to share my own experience of technical crises and the strategies that worked for navigating system failures.

What to do during a System Failure?
Stay calm

When a system failure occurs, it's crucial to remain calm. Panic can lead to hasty decisions that may aggravate the problem. Take a deep breath and assess the situation. Gather as much information as possible to understand the scope and impact of the failure.

Over communicate

Communication is key. Inform your team and stakeholders about the issue promptly. Transparency builds trust and ensures everyone is on the same page. It might require holding a quick all-hands meeting to explain the situation, your immediate action plan, and expected timelines.

Have a response plan ready

Having a predefined incident response plan is essential. This plan should outline roles, responsibilities, and steps to be taken during a system failure. It may include notifying your support team, initiating your backup procedures, and communicating with your customers.

Execute efficiently

Don't let people get lost in the heat of the moment. Quickly prioritize tasks based on their impact and urgency, and delegate responsibilities to your team members so that everyone knows their role in the resolution process.
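One way to make "impact and urgency" concrete is to score each task and sort by that score. The sketch below is only illustrative: the `Task` fields, the 1-to-5 scales, the example tasks, and the impact-times-urgency rule are all assumptions made for this example, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    impact: int   # assumed scale: 1 (minor) to 5 (customer-facing outage)
    urgency: int  # assumed scale: 1 (can wait) to 5 (needed right now)
    owner: str    # the team member the task is delegated to

    @property
    def score(self) -> int:
        # Hypothetical scoring rule: impact multiplied by urgency.
        return self.impact * self.urgency


def prioritize(tasks: list[Task]) -> list[Task]:
    """Return tasks ordered from highest to lowest score."""
    return sorted(tasks, key=lambda t: t.score, reverse=True)


# Made-up tasks for illustration only.
tasks = [
    Task("Update status page", impact=3, urgency=4, owner="support"),
    Task("Roll back faulty release", impact=5, urgency=5, owner="on-call engineer"),
    Task("Collect logs for the review", impact=2, urgency=2, owner="SRE"),
]

for task in prioritize(tasks):
    print(f"{task.score:>2}  {task.name} -> {task.owner}")
```

Keeping the owner on each task means the sorted list doubles as a delegation sheet: everyone can see what comes first and who is handling it.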

Document everything

This is the key! Documenting every action taken during a system failure is crucial for post-incident analysis. A record of who did what, and when, helps in identifying what went wrong and how similar issues can be prevented in the future.
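A lightweight way to document actions as they happen is a timestamped timeline that can be exported for the post-incident review. This is a minimal sketch under assumptions: the `IncidentLog` class, its field names, and the sample entries are hypothetical, and a real team may prefer its chat tool or incident management platform for the same purpose.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    title: str
    entries: list[dict] = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        """Append one timestamped entry to the timeline."""
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
        })

    def to_json(self) -> str:
        """Serialize the whole timeline for the post-incident review."""
        return json.dumps(asdict(self), indent=2)


# Made-up incident for illustration only.
log = IncidentLog("Checkout API returning errors")
log.record("on-call engineer", "Acknowledged alert and started investigation")
log.record("on-call engineer", "Rolled back the latest deployment")
print(log.to_json())
```

Timestamps are recorded in UTC so that entries from team members in different time zones line up in one ordered timeline.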

Voila! These practices keep you going during a system failure and help you tackle it objectively.

But, is there a better way to prepare ourselves for system failures?

  1. Audit systems regularly: Conduct regular audits of your systems to identify potential vulnerabilities. Routine maintenance can prevent many issues from escalating into full-blown failures.

  2. Have a backup of a backup: Ensure you have a robust backup and recovery plan. Regularly test your backups to make sure they can be restored quickly and without issues.

  3. Run simulations: Conduct regular incident response drills to ensure your team is prepared for real-world scenarios. These drills can highlight gaps in your plan and improve overall readiness.

  4. Implement real-time monitoring/alerting tools: Invest in comprehensive monitoring and alerting tools that can detect issues early and alert the relevant teams promptly.

  5. Analyse with brutal honesty: Be brutally honest when analysing a failure. Conduct post-incident reviews to identify improvements and share learnings with the entire team. Promote a culture where failure is seen as a learning opportunity.
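To illustrate the point about testing backups, the sketch below restores a backup into a scratch directory and compares checksums with the source. It is an assumption-laden example, not a real backup tool: the file names are invented, and `shutil.copy2` stands in for both the backup job and the restore procedure. In practice you would more likely compare against a checksum recorded when the backup was taken, since live data keeps changing.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source: Path, backup: Path, restore_dir: Path) -> bool:
    """Restore the backup into restore_dir and compare it with the source."""
    restored = restore_dir / source.name
    shutil.copy2(backup, restored)  # stand-in for a real restore procedure
    return sha256_of(restored) == sha256_of(source)


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "orders.db"        # hypothetical file name
    source.write_bytes(b"example data")
    backup = tmp_path / "orders.db.bak"
    shutil.copy2(source, backup)           # stand-in for the backup job
    restore_dir = tmp_path / "restore"
    restore_dir.mkdir()
    ok = restore_matches_source(source, backup, restore_dir)
    print("restore verified" if ok else "restore FAILED")
```

Running a check like this on a schedule turns "we have backups" into "we have backups that restore", and a failed comparison can feed the same alerting channel as any other incident.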

Conclusion
System failures are part and parcel of the tech world. By staying calm, communicating effectively, and having a solid plan in place, you can navigate these challenges successfully. Preparing in advance through regular audits, backups, and simulations can mitigate risks and ensure your team is ready for any eventuality.

Remember, it's not avoiding failures altogether but how you respond to and learn from them that defines your success as a CTO.
