The primary reason we should push for a focus on operability is that poorly operating software costs a huge amount of money, both in unplanned development activity (bug fixing) and in repetitive operational ‘fire-fighting’. In his influential book Professional Software Development, Steve McConnell documents the massive returns on investment seen by large companies that decided to invest in higher-quality, more operable software: ROIs of 150-200% within one year are common, and some organisations achieved an ROI of 700% or more.
The book Patterns for Performance and Operability contains more detailed suggestions and patterns for making the business case for operability (pp. 214-23).
Towards a culture of operability
Shared goals and incentives
A sustainable approach to operability (and better-working software systems) likely requires changes to the way in which the development, testing, and operations teams relate: changes to their incentives, to their performance targets, and certainly to expectations about how they interact. Teams must be able to communicate regularly and openly without seeking prior approval; barriers to this must be removed so that teams can collaborate on operational issues when needed.
Responsibility for responding to and resolving incidents and problems in Production must be shared between development, testing, and operations teams, even if operations still leads the incident management. If operations remain the people who ‘get it in the neck’ when the software system goes wrong, then the developers have little incentive to improve it. Conversely, if the operations teams work around a problem by (say) putting in a load balancer fix but do not tell the development team, then the development team has no chance to improve the software. A feedback loop from Production back to development needs to be in place.
In practice, a decent alignment in goals between the development and operations teams/departments is needed; if development is rewarded largely on user story delivery, and operations is rewarded largely on uptime of the existing systems, the space and time to collaborate on operational criteria are going to be difficult to find.
Organisational response to ‘failure’
The way in which an organisation treats ‘failures’ can have a marked effect on the effectiveness of the software delivery and operation effort. If every failure in Production leads to ever-increasing additional checks, tests, and (most destructively) blame, then future failures end up being more (not less) likely, as people retreat into the ‘safety’ of minimal effort and fear of change.
The PPO authors rightly urge us to treat failures as ‘canaries in a coal mine’, alerting us to bigger problems. W. Edwards Deming advised us to avoid a blame culture based on fear of failure, and so to set up our delivery processes and practices so that we treat failures as an opportunity for learning, not for retribution and blame. The true failure is not allowing teams to learn from incidents; the blameless post-mortem review is a crucial part of helping that organisational learning to take place (p. 272, Patterns for Performance and Operability).