Microsoft Meltdown and Fragile Systems: Lessons Learned

The global outage that hit Microsoft systems around the world on 20th July cannot be dismissed as a simple error, an isolated failure or an outlier event.

It was in fact an inevitable consequence of the increasingly complex interactions and interdependencies that characterise the global networks on which we rely, and which grow more complex and more interdependent by the day.

A complex system will usually fail for one of three reasons: systems failure, human error or malicious attack. Whilst we cannot influence the likelihood of these events happening, we should be aware of how they will impact us at a local, networked and global level.

The ISRM has been dedicated to exploring these issues from three perspectives: the academic, the policy and the practitioner. This global event, characterised by instantaneous failure, cascading consequences, highly impactful disruption and significant time to recovery, is exactly the sort of systemic failure we have been discussing in multiple forums.

From an academic standpoint, this event and others like it will be immediately recognisable in terms of Perrow’s concept of Normal Accident Theory, Weick’s models of High Reliability Organizations, Lagadec’s approach to extreme complexity and the need for a new cosmology of risks and crises, Klein’s model of Recognition-Primed Decision Making, Rittel and Webber’s description of Wicked Problems and much more.

You are invited to join us for an exploration of what the Microsoft outage means for all of us from a resilience and business continuity perspective, and of the lessons that need to be learned about managing our globally interconnected systems. It is also an opportunity to be part of an ongoing global dialogue that will only grow in urgency and importance.