Last century we used to do this thing called waterfall. A software developer who was building a feature was responsible for making it work according to a specification on her machine. Then some magic was supposed to happen: the feature would get tested, released and loved by customers. Well, sometimes this happened and sometimes it did not.
With the introduction of services and a faster release cadence, many teams shifted to a more agile process that made developers responsible for quality and for the live site. This era is called DevOps or combined engineering. It was quite a change for many developers, and teams are still learning how to do it right.
What if we build high-quality software with 99.999999% uptime that is released every minute… and nobody uses it? This is a problem. We need to push the boundary further. A successful feature must be used by hundreds, thousands, millions or billions of people (depending on your ambition) and must be fast and reliable in production. There are so many different hardware configurations, network conditions and customer scenarios that it is impractical to simulate them all in the lab. This is where data-driven engineering comes in: it requires developers to analyze the telemetry data that the software collects as customers use it.
And even this is not enough to stay competitive in a world where telemetry data is fast and cheap and statistical wisdom is packaged in a set of convenient tools. Long-term success is built on measuring how much each released feature impacts the metrics that the team cares about. Not estimating, not guessing, but actually measuring. This is done by running A/B experiments that give different experiences to random subsets of users and measure the resulting changes in user behavior. This is the only way to distinguish between correlation and causation and to prove that releasing a feature changes metric X by Y.
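To make the idea concrete, here is a minimal sketch of the two pieces of an A/B experiment: splitting users into random subsets and measuring whether a metric changed. Everything in it is an assumption made for illustration, not a description of any particular experimentation platform: the function names, the experiment name `new_feature`, the 50/50 split, the choice of a conversion rate as the metric, and the conversion counts are all hypothetical. It uses a standard two-proportion z-test; real platforms typically add much more (sample size planning, multiple metrics, guardrails).

```python
import hashlib
import math


def assign_variant(user_id: str, experiment: str = "new_feature") -> str:
    """Bucket a user into 'control' or 'treatment' with a 50/50 split.

    Hashing the experiment name together with the user id keeps the
    assignment stable for a given user and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of control (a) and treatment (b).

    Returns (absolute lift, z statistic, two-sided p-value).
    Assumes the pooled conversion rate is strictly between 0 and 1.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: users who completed the action the team cares about.
lift, z, p_value = two_proportion_z_test(
    conv_a=1000, n_a=10_000,  # control
    conv_b=1100, n_b=10_000,  # treatment
)
print(f"user-42 sees: {assign_variant('user-42')}")
print(f"lift={lift:.3f}, z={z:.2f}, p={p_value:.3f}")
```

Because users are assigned by a hash rather than by anything about their behavior, a difference between the two groups that is unlikely under chance can be attributed to the feature itself, which is exactly the "changes metric X by Y" statement the team is after.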
Many teams think that they are at stage 3 or 4 (data-driven engineering or experimentation) while in reality they are still at stage 2 (DevOps). Having some data and throwing statistics around is not the same thing as using the data effectively. Here are some questions to help diagnose which stage your team is at.
- How many users does this feature have?
- What are the main entry points and segments?
- Do we measure success based on telemetry data?
- Do we learn about issues from telemetry data and not from upset customers?
- How do we measure feature impact?
- How do we make release/cut decisions?
- How do we prioritize performance against feature work?