Welcome to Modern Digital Business!
Dec. 14, 2023

ModernOps with Beth Long: Operational Ownership

Welcome to another episode of Modern Digital Business, the podcast that helps you navigate the ever-changing landscape of modernizing your applications and digital business. In this episode, we continue our exploration of modern operations with our special guest, Beth Long. Today's discussion is all about operational ownership and how it plays a crucial role in the success of modern organizations. We dive into the importance of service ownership, the measurement of SLAs, and the need for specific, measurable, attainable, relevant, and time-bound goals. Join us as we unravel the complexities of modern ops with Beth Long in this enlightening episode of Modern Digital Business. Let's dive in!



Today on Modern Digital Business

Thank you for tuning in to Modern Digital Business. We typically release new episodes on Thursdays. We also occasionally release short-topic episodes on Tuesdays, which we call Tech Tapas Tuesdays.

If you enjoy what you hear, will you please leave a review on Apple Podcasts, Podchaser, or directly on our website at mdb.fm/reviews?

If you'd like to suggest a topic for an episode or you are interested in being a guest, please contact me directly by sending me a message at mdb.fm/contact.

And if you’d like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode!

To ensure you get every new episode when it becomes available, please subscribe from your favorite podcast player. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com.

Thank you for listening, and welcome to the modern world of the modern digital business!


About Lee

Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.

Take a look at Lee's many books, courses, and articles by going to leeatchison.com.

Looking to modernize your application organization?

Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications, or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com or other retailers.

Don't Miss Out!

Subscribe here to catch each new episode as it becomes available.

Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books, and courses from Lee. Don't worry, we won't send you spam, and you can unsubscribe anytime.

Mentioned in this episode:

Architecting for Scale

What does it take to operate a modern organization running a modern digital application? Read more in my O’Reilly Media book Architecting for Scale, now in its second edition. Go to: leeatchison.com/books or mdb.fm/afs.


Transcript
Speaker:
Modern applications require modern operations, and modern operations requires a new definition for ownership that most classical organizations must provide. Today I continue my discussion on modern ops with Beth Long. Are you ready? Let's go.

Speaker:
This is the Modern Digital Business Podcast, the technical leader's guide to modernizing your applications and digital business. Whether you're a business technology leader or a small business innovator, keeping up with the digital business revolution is a must. Here to help make it easier with actionable insights and recommendations, as well as thoughtful interviews with industry experts: Lee Atchison.

Speaker:
In this episode of Modern Digital Business, I continue my conversation on modern operations with my good friend, SRE engineer and operations manager Beth Long. This conversation, which focuses on service ownership and measurement, is a continuation of our conversation on SLAs in modern applications.

Speaker:
In a previous episode, we talked about STOSA, and this fits very much into that idea: the idea of how you organize your teams so that each team has a certain set of responsibilities. We won't go into all the details of STOSA, but the bottom line is that ownership is critical to the STOSA model. Ownership is critical to all DevOps models. If you own a service, you're responsible for how that service performs, because other teams are depending on you to perform, and on a shared definition of what it means to perform. The definition of what it means to perform is what an SLA is all about.

Speaker:
Yeah. So what does a good SLA look like, Beth?

Speaker:
That's a great question. Let's get to the measurement.

Speaker:
It does get into measurement. That is always a hard question to answer. If you look at the textbook discussions of SLIs and SLOs, and SLAs in particular, you'll often see references to a lot of the things that are measurable. So you'll have your golden signals of error rate, latency, saturation. You have these things that allow you to say, okay, we're going to tolerate this many errors, or this many of this type of error, or this much latency. But all of that is trying to distill the customer experience down into things that can be measured and put on a dashboard.
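
To make those golden signals concrete, here is a minimal sketch in Python, assuming requests arrive as simple records with a latency and a status code (the names and numbers are illustrative, not from the episode):

# Illustrative sketch: computing two golden signals (an error rate and
# a p90 latency) over a window of request records.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code

def error_rate(requests: list[Request]) -> float:
    """Fraction of requests in the window that returned a server error."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)

def p90_latency_ms(requests: list[Request]) -> float:
    """Approximate latency that 90% of requests stayed at or under."""
    if not requests:
        return 0.0
    latencies = sorted(r.latency_ms for r in requests)
    index = max(0, int(len(latencies) * 0.9) - 1)
    return latencies[index]

window = [Request(3.2, 200), Request(4.8, 200), Request(120.0, 500)]
print(error_rate(window))      # one of three requests errored
print(p90_latency_ms(window))  # 4.8 in this tiny window

Dashboards then put thresholds on numbers like these, which is exactly the distillation being described: the customer's experience reduced to a few measurable signals.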

Speaker:
The term SMART goals comes to mind, right? That, I think, is a good measure. I know the idea of SMART goals really hasn't been tied to SLAs too closely, but I think there's a lot of similarities here. SMART goals have five specific criteria: they're specific, measurable, attainable, relevant, and time-bound. Now, I think all five of those actually apply here as well. Right? When you create your SLAs, they have to be specific. You can't say, "Yeah, we'll meet your needs." That's not a good experience. In my mind, a good measurement is something like: we will maintain five milliseconds latency on average for 90% of all requests that come in. And I also like to put in an "assuming". Assuming you meet these criteria, such as: the traffic load is less than X requests, or whatever the criteria is. So in my mind, it's a specific measurement, with bounds for what that means, under assumptions: and here are the assumptions. So, something like five milliseconds average latency for 90% of requests, assuming the request rate is less than 5,000 requests per second. And you could also have: assuming the request rate is at least 100 per second, because warming caches can have an effect there too. The commitment assumes both those things hold, so you can have bounds on both ends. Something like that is very specific. It's measurable: all of those numbers I specified are things you could measure, something you could see. Specific, measurable. You want to make sure they're attainable within the service. That's your responsibility as the owner of a service. If another team says, "I need this level of performance," it is your responsibility as the owner, before you accept that, to say, "Yes, I can do that." So they have to be attainable to you.
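
As a sketch of what that kind of bounded commitment might look like written down (the shape and names here are hypothetical; the numbers are the ones from this conversation):

# Illustrative sketch of the SLA described above: a specific latency
# commitment that only applies while traffic stays inside agreed bounds.
from dataclasses import dataclass

@dataclass
class LatencySLA:
    max_avg_latency_ms: float        # the commitment: average latency ceiling
    for_fraction_of_requests: float  # e.g. 0.90 means 90% of requests
    min_request_rate: float          # assumption: at least this much traffic
    max_request_rate: float          # assumption: no more than this much traffic

    def applies(self, request_rate: float) -> bool:
        """The commitment only holds inside the agreed traffic bounds."""
        return self.min_request_rate <= request_rate <= self.max_request_rate

# Five milliseconds average latency for 90% of requests, assuming
# between 100 and 5,000 requests per second.
sla = LatencySLA(
    max_avg_latency_ms=5.0,
    for_fraction_of_requests=0.90,
    min_request_rate=100.0,
    max_request_rate=5000.0,
)
print(sla.applies(request_rate=3200.0))  # True: the commitment is in force

The applies() check is the "assuming" clause: the latency commitment is only meaningful while traffic stays inside the bounds both sides agreed to.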

Speaker:
And this actually gets at something very important in implementing these sorts of things, which is to make sure that you are starting with goals that are near what you're currently, actually doing, and step your way towards improvement, instead of setting impossible goals and then punishing teams when they don't achieve something that was so far outside of their ability.

Speaker:
Oh, absolutely. There's two things that make a goal bad. One is when the goal is so easy that it's irrelevant. The other one is when it's so difficult that it's never hit. In the case of SLAs, your goal needs to hit the SLA 100% of the time, but it can't be three times what you are ever going to see, because giving yourself plenty of room to have all sorts of problems doesn't make it relevant to the consumer of the goal. They need something better than that. That's where attainable, and that's where relevant, comes in.

Speaker:
And relevant is so important, because it's so tempting. This is where, when it's the engineers that set those goals, those objectives, in isolation, you tend to get things that are measurable and specific and attainable, but not relevant, right?

Speaker:
"I will guarantee my service will have a latency of less than 37 seconds for this simple request. Guaranteed. I can promise you that." Right? And the consumer will say, "Well, I'm sorry, I need ten milliseconds." 37 seconds sounds like an absurd number, but you and I have both heard numbers like that, right? Numbers so far out of bounds that they're totally irrelevant, not worth even discussing.

Speaker:
Yes, and a sneakier example would be something like setting an objective around how your infrastructure is behaving, in ways that don't translate directly to the benefit to the customer. Say you own a web service that is serving directly to end users, and your primary measures of system health are around CPU and I/O. Well, those might tell you something about what's happening, but they are not directly relevant to the customer. You need to have those on your dashboards for when you're troubleshooting, when there is a problem, but that's not what indicates the health of the system.

Speaker:
Right. So: specific, measurable, attainable, relevant. Relevant means the consumer of your service has to find them to be useful. Attainable means that you, as provider of the service, need to be able to meet them. Measurable means they need to be measurable. And specific: they can't be general-purpose and ambiguous; they have to be very specific. So all those make sense. Does time-bound really apply here?

Speaker:
I think it does, but in the sense that when you're setting these agreements, you tend to say, "This is my commitment," and you tend to measure over a span of time, and there is a sense of the clock getting reset.

Speaker:
That's true. "We'll handle this much traffic over this period of time." You're right, that's a form of time-bound. I think when you talk about SMART goals, they're really talking about the time when you'll accomplish the goal. And what we're saying is the time you accomplish the goal is now. It's not really a goal; it's an agreement. It's a habit, rather than a goal.

Speaker:
And that's actually a good point. These aren't goals, as in "I'm going to try to make this." No, this is what you're going to be performing to.

Speaker:
And you can change them and improve them over time. You can have a goal that says, "I'm going to improve my SLA over time and make my SLA twice as good by this date." That's a perfectly fine goal. But that's what a goal is, versus an SLA. Your SLA is something like five-millisecond latency with less than 10,000 requests. And you can say, "That's great. I have a goal to make it a two-millisecond latency with 5,000 requests by this time next quarter." And at that point in time, your SLA is now two milliseconds. But the SLA is what it is, what you're agreeing to, committing to, now. It's a failure if you don't meet it right now. As opposed to a goal, which is what you're striving towards.

Speaker:
Yeah, towards completing something. Right. One anecdote, a well-known anecdote that I think is interesting to talk about here, is the example that Google gave. This is in the SRE book: actually overshooting and having a service that was too reliable. I can't remember which service it was off the top of my head, but they actually had a service that they did not want to guarantee 100% uptime, yet they ended up over-delivering on quality for a while. And when that service did fail, users were incensed, because there was sort of this implicit SLA: well, it's been performing so well. And what I love about that story is that they ended up deliberately introducing failures into the system so that users would not become accustomed to too high of a performance level. What this underscores is how much this is about, ultimately, the experience of whatever person it is that needs to use your service. This is not a purely technical problem. This is very much about understanding how your system can be maximally healthy and maximally serve whoever it is that's using it.

Speaker:
So I love that story. I didn't know that story before, but it plays very well into the Netflix Chaos Monkey approach to testing. And that is the idea that the way you ensure your system as a whole keeps performing is you keep causing it to fail on a regular basis, to make sure that you can handle those failures. So what the Chaos Monkey does, and I'm sure at some point in time we're going to do an episode on Chaos Monkey (matter of fact, we should add it to our list), what Chaos Monkey is all about is the idea that you intentionally insert faults into your system at irregular times, so that the response your application is supposed to have, to self-heal around the problems that are occurring, can be tested and verified. Now, you don't do this in staging, you don't do this in dev; you do it in production. But you do it in production during times when people are around. That way, if it does cause a real problem, if you turn off a service and that causes a real problem and customers are really affected, everyone's on board and you can solve the problem right away, as opposed to the exact same thing happening by happenstance at 2:00 in the morning when everyone's drowsy and sleeping and not knowing what's going on. You can address the problem right there, right then, as opposed to later on.
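
As a rough illustration of that idea (this is not Netflix's actual Chaos Monkey code; the names, hours, and probability are invented), a fault injector that only runs while people are around might look like this:

# Illustrative sketch of Chaos-Monkey-style fault injection: at random
# moments, and only during working hours when people are around to
# respond, deliberately terminate a service instance.
import random
from datetime import datetime

WORKING_HOURS = range(9, 17)  # only inject faults while the team is awake

def terminate(instance: str) -> None:
    # Hypothetical stand-in for a call into your orchestration layer.
    print(f"injecting fault: terminating {instance}")

def maybe_inject_fault(instances: list[str], probability: float = 0.05) -> str | None:
    """Occasionally kill one instance so failure handling gets exercised."""
    if datetime.now().hour not in WORKING_HOURS:
        return None  # never surprise the on-call at 2:00 in the morning
    if random.random() < probability:
        victim = random.choice(instances)
        terminate(victim)
        return victim
    return None

maybe_inject_fault(["service-b-1", "service-b-2"], probability=1.0)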

Speaker:
And the other thing it helps with is this problem that you were addressing, which is getting too used to things working. So say you deploy a new change. Let's say I own a service, service A, and I call service B, and I need to expect that service B will fail occasionally. Well, I'm going to write code into service A to do different things if service B doesn't work. Now, what if I introduce an error in that code that I'm not aware of, and then I deploy my code? Well, it's going to function, it's going to work, everything's going to be fine, until service B fails, and then service A is also going to fail. But if service B is regularly failing, you're going to notice that a lot sooner, perhaps immediately after deployment, and you're going to be able to fix that problem, roll it back if necessary, or roll forward with a fix, to get the situation resolved. The more chaotic the system you put code into, the more stable the code is going to be. It's a weird thought to think that way, but the more chaotic a system, the more stable the code that's in that system behaves over the long term.
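
The service A and service B example above is exactly the kind of defensive path that regular, deliberate failures keep honest. A minimal sketch, with hypothetical names standing in for the real call and fallback:

# Illustrative sketch of service A calling service B defensively.
# If service B fails regularly (as chaos testing ensures), a bug hiding
# in this fallback path surfaces quickly instead of lying dormant.
import time

class ServiceBError(Exception):
    pass

def call_service_b(request_id: str) -> str:
    # Hypothetical stand-in for the real remote call; here it always fails.
    raise ServiceBError("service B is down")

def handle_request(request_id: str, retries: int = 2) -> str:
    for attempt in range(retries):
        try:
            return call_service_b(request_id)
        except ServiceBError:
            time.sleep(0.1 * (attempt + 1))  # brief backoff, then retry
    # Fallback path: this is the code an injected failure exercises.
    return f"cached-response-for-{request_id}"

print(handle_request("req-42"))  # falls back, because B always fails here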

Speaker:
I'm so glad you bring this up. And what I love about this is that we're really touching on similar themes in different contexts, because both chaos engineering and the DevOps approach are really about understanding that we don't just have a technical system; we have a sociotechnical system. We have this intertwined human-and-technology system. And so with DevOps, one of the advantages is that it changes the behavior of the people who are creating the system itself. Because, again, if you're going to deploy code and you know that if something goes wrong it's going to wake up that person over there that you don't even know, you just build your services differently. You're not as rigorous as when you know you're going to be the one woken up at 2:00 a.m. And similarly with chaos engineering: if you know that service B is absolutely going to fail in the coming week, you're just going to be like, "Well, I may as well deal with this now." As opposed to, "Well, I'm under deadline. Service B is usually stable. I'm just going to run the risk and we'll deal with it later." So it really drives the behavior that gets built into these systems.

Speaker:
Right. And the other thing I love about how you kind of unpacked chaos engineering is it does work on this very counterintuitive idea that you should be running towards incidents and problems instead of running away from them; you should embrace them. And that will actually help you, as you said, make the system more stable, because you are proactively encountering those issues rather than letting them come to you.

Speaker:
Yeah, that's absolutely great. That's great. Yeah, you're right. We're not talking about coding. We're talking about social systems here. We're talking about systems of people that happen to include code, as opposed to systems of code. The vast majority of incidents that happen have a social component to them, not just a code problem. It's someone who said "this is good enough," or someone who didn't spend the time to think about whether or not it would be good enough, and therefore missed something. Right? And these aren't bad people doing bad things. These are good people that are making mistakes caused by the environment in which they're working. And that's why environment, and systems of people, and how they're structured and how they're organized, is so important. I keep hearing people say how you organize your company is irrelevant. Right? It shouldn't matter. Nothing could be further from the truth. It matters, the way you organize a company. I hate saying it this way, because I don't always live up to it myself, but how clean your desk is is a good indication of how clean the system is. And I don't mean that literally, because I've had dirty desks too, but it really is a good indication here. How well you organize your environment, how well you organize your team, how well you organize your organization gives an indication of how well you're going to perform as a company.

Speaker:
Yes. When we look at the realm of incidents, which are messy and frustrating and scary and expensive, every tech company knows that they are probably one really bad incident away from going out of business. Every company knows that there's that really bad thing that could collapse the whole structure. So incidents are really high stakes, and that drives us to look for certainty and look for clarity. And so we look to a lot of these things that people have been talking about for years around incident metrics. You've got your mean time metrics: what's your mean time to resolution, or your mean time between failures? It's this attempt to bring some kind of order and sense to this very scary and chaotic world of incidents. But so many of those, what are now often being called shallow incident metrics, end up giving short shrift to what we were just talking about, which is that this is a very complex system. The technology itself is very complex. The sociotechnical system is complex. We're trying to get a handle on how you surface those complexities and make them intelligible and make them sensible, without falling back to some of these shallow metrics. Niall Murphy, going back to SRE, one of the authors of the original SRE book, had a paper out recently where he unpacks the ways that these mean time and other shallow metrics aren't statistically meaningful and aren't helping us make good decisions in the wake of these incidents. And so much of what we're talking about with SLAs is: how do you make decisions about what work you're going to do, and how much you invest in reliability versus new features? And incident follow-up is so much about: what decisions do we make based on what we learned in this event?
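
To make the "shallow metrics" criticism concrete, a toy calculation with invented numbers shows how a mean hides the distribution that decisions actually depend on:

# Toy illustration (invented numbers): two quarters with the same mean
# time to resolution describe very different operational realities.
from statistics import mean, median

quarter_a = [30, 35, 40, 45, 50]  # minutes: consistently mid-sized incidents
quarter_b = [5, 5, 5, 5, 180]     # minutes: quick fixes plus one catastrophe

print(mean(quarter_a), median(quarter_a))  # 40 40
print(mean(quarter_b), median(quarter_b))  # 40 5

Identical means, wildly different stories, and very different reliability investments called for: the mean alone supports no good decision.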

Speaker:
Yeah, you add a whole new dimension here to the metric discussion, because it's so easy to think about metrics along the lines of how we're performing, and when we don't perform, it's a failure. Oops. But there's a lot of data in the "oops," and you're right: things like mean time to detect and mean time to resolution are important, but they're very superficial compared to the depth that you can get. And I'm not talking about "Joe's team caused five incidents last week; that's a problem for Joe." I'm not talking about that. I'm talking about uncovering the sophisticated connections between things that can cause problems to occur.

Speaker:
Thank you for tuning in to Modern Digital Business. This podcast exists because of the support of you, my listeners. If you enjoy what you hear, will you please leave a review on Apple Podcasts or directly on our website at mdb.fm/reviews? If you'd like to suggest a topic for an episode, or you are interested in becoming a guest, please contact me directly by sending me a message at mdb.fm/contact. And if you'd like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode. To make sure you get every new episode when it becomes available, click subscribe in your favorite podcast player or check out our website at mdb.fm. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com. All of these links are included in the show notes. Thank you for listening, and welcome to the world of the modern digital business.