The Eternal Sunshine of the Toil-less Prod

A presentation at QConSF in October 2022 in San Francisco, CA, USA by Sasha Czarkowski (Rosenbaum)

Slide 1

Toil Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 1

Slide 2

@DivineOps SRE teams usually aim for under 50% toil

Slide 3

@DivineOps If an employee is told that 50% of their work has no enduring value, how does this affect their productivity and job satisfaction? - Byron Miller

Slide 4

The eternal sunshine of the toil-less prod Sasha Rosenbaum @DivineOps

Slide 5

Dev B.Sc. in C.S. Ops MBA Dev + Ops Cloud Consulting DevRel Technical Sales Sasha Rosenbaum @DivineOps

Slide 6

@DivineOps And you?

Slide 7

Slide 8

Red Hat OpenShift - Kubernetes Experience 2014 2015 2016 Nearly 100 Kubernetes Distributions In the Market OpenShift v4 (2019-2020) OpenShift v3 (2015) Data Center 8 Industry Leader 2nd Largest Contributor Most Enterprise Customers Most Multi-Cloud Deployments

Slide 9

A consistent experience no matter where you run it Developer Efficiency Business Productivity Enterprise Ready Red Hat OpenShift On-premises Red Hat OpenShift Service on AWS Azure Red Hat OpenShift Red Hat Managed Joint offerings with Cloud Provider 9 Red Hat OpenShift on IBM Cloud Red Hat OpenShift Dedicated Red Hat Managed OpenShift Container Platform OCP Customer Managed

Slide 10

Our Scale Our Experience 4 Public Clouds 100,000+ Clusters 60+ Regions 2M+ Develop ers 6000+ New Users 2 Million+ Hours of SRE Every Week experience

Slide 11

@DivineOps Products to Services

Slide 12

@DivineOps Products and Services

Slide 13

@DivineOps SRE

Slide 14

@DivineOps What is the most important and innovative thing about SRE discipline?

Slide 15

@DivineOps SRE is about explicit agreements that align incentives

Slide 16

@DivineOps SLA, SLI, SLO

Slide 17

@DivineOps SLA = Financially-backed availability

Slide 18

@DivineOps Monthly downtime > 1.5 days means 100% refund

Slide 19

@DivineOps SLAs are about aligning incentives between Vendor & Customer

Slide 20

@DivineOps 99% => 99.99%

Slide 21

@DivineOps • SLA usually includes a single metric • For financial and reputational reasons, companies prefer to under promise and overdeliver

Slide 22

@DivineOps SLO = Targeted reliability

Slide 23

Our Journey Service Level Objectives: What we care about? Availability OAuth Server Availability Registry Success rate Cluster installation time (includes external dependencies) Builds Cluster Provisioning Success rate Critical rollout time Availability Cluster Upgrade Availability measured from customer perspective (closed box monitoring) API Server Can customers run their CI/CD? Periodically run synthetic builds Router OpenShift Console Availability measured from customer perspective (closed box monitoring)

Slide 24

@DivineOps SLI = Actual reliability

Slide 25

@DivineOps Monitoring

Slide 26

@DivineOps Without monitoring, you have no way to tell whether your service even works!

Slide 27

@DivineOps Good Monitoring

Slide 28

@DivineOps Without good monitoring, you don’t know that the service does what users expect it to do!

Slide 29

@DivineOps Signal to noise ratio

Slide 30

@DivineOps Early on, one of the major monitoring problems we had is alerts on customer clusters that were intentionally taken offline

Slide 31

@DivineOps Without good monitoring, your SRE is potentially overloaded with unwarranted emergencies and blindsided by real incidents

Slide 32

@DivineOps Periodically incidents may be caught by internal users, rather than the monitoring system We aim to implement monitoring improvements that will catch future problems of the same kind

Slide 33

@DivineOps SLO

Slide 34

@DivineOps SLO = Business-approved reliability

Slide 35

100% reliability is… •Unattainable •Unnecessary •Extremely expensive

Slide 36

The five nines 99.999% 5.26 mins / year

Slide 37

@DivineOps Will your users even notice?

Slide 38

@DivineOps The ISP background error rate is 0.01% - 1%

Slide 39

@DivineOps SLOs are about explicitly aligning incentives between Business & Engineering

Slide 40

Error Budgets Acceptable level of unreliability Error budget = 1 - SLO EB = 1 – 99.99% = 0.01% ≃ 13 mins /quarter

Slide 41

@DivineOps Error budgets are about aligning incentives between Dev & Ops

Slide 42

@DivineOps If developers are measured on the same SLO, then when the error budget is drained developers shift focus from delivering new features to improving reliability

Slide 43

@DivineOps So, we’ve written things down

Slide 44

@DivineOps Are we there yet?

Slide 45

The future is already here. It’s just not evenly distributed ~ William Gibson

Slide 46

@DivineOps Awkward Segue

Slide 47

@DivineOps What we all got wrong

Slide 48

@DivineOps

Slide 49

@DivineOps “SRE is what happens when you ask a software engineer to design an operations team.” - Google SRE book, 2017

Slide 50

@DivineOps Is it though?

Slide 51

@DivineOps DevOps

Slide 52

DevOpsDays Ghent 2009

Slide 53

@DivineOps Automate ourselves out of a job!

Slide 54

@DivineOps So why didn’t we do it?

Slide 55

@DivineOps Effective automation requires consistent APIs

Slide 56

@DivineOps OS-level APIs

Slide 57

2000 27% of server market 41% of server market • File-based OS • Maintains configuration in files • Every device is a file • Executable-based OS • Maintains configuration in registry • Every device has a different driver mechanism

Slide 58

@DivineOps PowerShell (Windows) configuration management framework, CLI, and scripting language GA: 2006 Jeffrey Snover

Slide 59

@DivineOps DevOpsDays Seattle 2019: Thriving Through Transitions by Jeffrey Snover

Slide 60

@DivineOps Every wave of automation Enables the next wave of automation

Slide 61

@DivineOps Infrastructure-level APIs

Slide 62

@DivineOps Borg cluster manager “Central to its success - and its conception - was the notion of turning cluster management into an entity for which API calls could be issued” - Google SRE book, 2017

Slide 63

@DivineOps Amazon Web Services: 2002 Amazon Cloud Computing: 2006 Azure Cloud Services: 2008

Slide 64

@DivineOps 2005 Infrastructure as code 2012 2009

Slide 65

@DivineOps We did NOT suddenly get the idea of infrastructure & platform automation 65

Slide 66

@DivineOps We did NOT suddenly get the idea of infrastructure & platform automation We gradually built the tools required to make it happen 66

Slide 67

@DivineOps Why does this matter?

Slide 68

@DivineOps If we get the origin story wrong, we end up working to solve the wrong problem!

Slide 69

@DivineOps Corollary 1: Hiring developers to do operations work ≠ effective SRE

Slide 70

@DivineOps We have seen success from hiring the skillsets across the entire landscape, hiring well-rounded folks with understanding of Ops and Dev concepts, as opposed to just Dev experience

Slide 71

@DivineOps Corollary 2: The desire to automate the infrastructure & platform operations is insufficient

Slide 72

@DivineOps Corollary 2: The desire to automate the infrastructure & platform operations is insufficient We need consistent APIs and reliable monitoring to unblock the automation

Slide 73

@DivineOps Early on, we had to move the Cloud Services Build system from on-prem to the Cloud, because it was not meeting our reliability and agility targets

Slide 74

@DivineOps What we all got wrong

Slide 75

@DivineOps Toil

Slide 76

Toil Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. 76

Slide 77

@DivineOps SRE teams usually aim for under 50% toil

Slide 78

@DivineOps So, are we striving for a human-less system?

Slide 79

@DivineOps Second law of thermodynamics

Slide 80

@DivineOps With time, the net entropy (degree of disorder) of any isolated system will increase

Slide 81

@DivineOps

Slide 82

Entropy always wins

Slide 83

Slide 84

People working above the line of representation continuously build and refresh their models of what lies below the line. That activity is critical to the resilience of Internetfacing systems and the principal source of adaptive capacity. - Dr. Richard Cook

Slide 85

Slide 86

Resilience Velocity 2012: Richard Cook, “How Complex Systems Fail”

Slide 87

@DivineOps What we call toil is a major part of resilience and adaptive capacity

Slide 88

@DivineOps Perhaps we need a better way to look at toil

Slide 89

@DivineOps SRE folks worry that if they spend significant parts of their day focusing on toil, it will negatively affect their bonuses, chances of promotions etc.

Slide 90

@DivineOps If an employee is told that 50% of their work has no enduring value, how does this affect their productivity and job satisfaction? - Byron Miller

Slide 91

@DivineOps SRE Work Allocation

Slide 92

@DivineOps Work Allocations: SRE P and SRE O

Slide 93

Traditional IT dev ops wall of confusion

Slide 94

@DivineOps Work Allocations: On-call once a month

Slide 95

@DivineOps SRE teams asked management for more on-call, as they were losing their “Ops muscle”

Slide 96

@DivineOps Work Allocations: Rotate engineers working on toilreduction tasks

Slide 97

@DivineOps Lack of continuity severely impacted team’s ability to deliver

Slide 98

@DivineOps Work Allocations: The search for the perfect system is still in progress!

Slide 99

@DivineOps Where do we go from here?

Slide 100

@DivineOps Let’s look at some of the insights from the talk:

Slide 101

@DivineOps Effective automation requires consistent APIs

Slide 102

@DivineOps Cloud

Slide 103

@DivineOps Cloud provides an industry standard for consistent infrastructure-level APIs

Slide 104

@DivineOps Are you in the datacenter management business?

Slide 105

@DivineOps Kubernetes

Slide 106

85% of global IT leaders agree that Kubernetes is key to cloud-native application strategies Source: Red Hat State of Open Source Report 2021 Source: Red Hat State of Open Source Report 2021

Slide 107

@DivineOps Kubernetes could provide the industry standard for consistent platform-level APIs

Slide 108

@DivineOps If building PaaS isn’t your company’s core business

Slide 109

@DivineOps If building PaaS isn’t your company’s core business Allow your provider to toil for you

Slide 110

A consistent experience no matter where you run it Developer Efficiency Business Productivity Enterprise Ready Red Hat OpenShift On-premises Red Hat OpenShift Service on AWS Azure Red Hat OpenShift Red Hat OpenShift on IBM Cloud Joint offerings with Cloud Provider Offered as a Native Console offering on equal parity with cloud provider Kubernetes service 110 or OCP Customer Managed Red Hat OpenShift Dedicated Red Hat Managed OpenShift Container Platform OCP Customer Managed