Welcome to Modern Digital Business!
Nov. 9, 2023

ModernOps with Beth Long: Talking SLAs

Welcome back to another episode of Modern Digital Business! In today's episode, we delve deeper into the world of modern operations with our special guest, Beth Long. We explore the essential role of service level agreements (SLAs) in managing complex, multi-service modern applications.

As we unravel the differences between DevOps and SREs (Site Reliability Engineers), Beth sheds light on the origins and practices behind these two distinct approaches. We also discuss the significance of SLIs (Service Level Indicators), SLOs (Service Level Objectives), and SLAs in ensuring the stability and reliability of large-scale web operations.

Join us as we navigate the complexities of modern operations and gain valuable insights and recommendations from Beth, a seasoned SRE engineer and Operations manager. Stay tuned for an enlightening conversation on SLAs in our quest to modernize your applications and thrive in the digital business revolution. Let's dive in!

 

Today on Modern Digital Business

Thank you for tuning in to Modern Digital Business. We typically release new episodes on Thursdays. We also occasionally release short-topic episodes on Tuesdays, which we call Tech Tapas Tuesdays.

If you enjoy what you hear, will you please leave a review on Apple Podcasts, Podchaser, or directly on our website at mdb.fm/reviews?

If you'd like to suggest a topic for an episode or you are interested in being a guest, please contact me directly by sending me a message at mdb.fm/contact.

And if you’d like to record a quick question or comment, click the microphone icon in the lower right-hand corner of our website. Your recording might be featured on a future episode!

To ensure you get every new episode as soon as it becomes available, please subscribe from your favorite podcast player. If you want to learn more from me, then check out one of my books, courses, or articles by going to leeatchison.com.

Thank you for listening, and welcome to the modern world of the modern digital business!

 


About Lee

Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.

Take a look at Lee's many books, courses, and articles by going to leeatchison.com.

 

Looking to modernize your application organization?

Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications, or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com or other retailers.

 

Don't Miss Out!

Subscribe here to catch each new episode as it becomes available.

Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books, and courses from Lee. Don't worry, we won't send you spam, and you can unsubscribe anytime.

Mentioned in this episode:

Architecting for Scale

What does it take to operate a modern organization running a modern digital application? Read more in my O’Reilly Media book Architecting for Scale, now in its second edition. Go to: leeatchison.com/books or mdb.fm/afs.

Transcript

Speaker:

What are service level agreements and why are they absolutely

 

Speaker:

essential to managing complex, multi-service

 

Speaker:

modern applications? Today I continue my discussion on

 

Speaker:

modern ops with Beth Long. Are you ready? Let's

 

Speaker:

go. This is the Modern

 

Speaker:

Digital Business podcast, the technical leader's guide to

 

Speaker:

modernizing your applications and digital business. Whether you're a

 

Speaker:

business technology leader or a small business innovator, keeping

 

Speaker:

up with the digital business revolution is a must. Here to help make

 

Speaker:

it easier with actionable insights and recommendations, as well as

 

Speaker:

thoughtful interviews with industry experts: Lee Atchison.

 

Speaker:

In this episode of Modern Digital Business, I continue my conversation on

 

Speaker:

modern operations with my good friend, SRE

 

Speaker:

engineer and operations manager Beth Long. So, Beth,

 

Speaker:

great to see you again today. And today we wanted to

 

Speaker:

talk about SRE terminology and

 

Speaker:

measurements, and it's fantastic that we have an

 

Speaker:

SRE in our midst in order to do that. So I'm glad you're

 

Speaker:

here. Great. Let's get started on

 

Speaker:

this. So SRE, site reliability engineering,

 

Speaker:

is tied very closely to the concept of DevOps,

 

Speaker:

but they're really not the same thing. Can you start out

 

Speaker:

by telling us what's the difference between DevOps and SREs?

 

Speaker:

I love this question. I've talked about this a number of times. And I'm going

 

Speaker:

to get back at you for asking me this by flipping it around and asking

 

Speaker:

you the same thing in a minute, but I'll take a stab at it. So

 

Speaker:

SRE, site reliability engineering, originated out of

 

Speaker:

Google, gee, almost 20 years ago now, I

 

Speaker:

guess. Yeah. And it was

 

Speaker:

really a discipline that was a

 

Speaker:

response to the

 

Speaker:

pressures of managing technology

 

Speaker:

at Google scale. And so

 

Speaker:

a lot of the practices that are associated with site reliability

 

Speaker:

engineering are the things that Google

 

Speaker:

developed internally to help them manage their scale as

 

Speaker:

they grew and then began to evangelize out to the wider

 

Speaker:

community. And so now a lot of those practices have been adopted more

 

Speaker:

widely and have been iterated upon. But

 

Speaker:

that's the origins of site reliability engineering.

 

Speaker:

And the origins of DevOps are

 

Speaker:

a little bit more

 

Speaker:

cross-cutting, a little bit more democratic, I

 

Speaker:

guess, and came out of

 

Speaker:

people around the same time

 

Speaker:

realizing that the

 

Speaker:

siloing of development and operations

 

Speaker:

was leading to unhealthy

 

Speaker:

patterns in the software engineering.

 

Speaker:

So people like John Allspaw, who we'll talk about a

 

Speaker:

little bit later probably, if we touch on incidents at all, were

 

Speaker:

prominent in kind of saying, let's rethink how we're doing

 

Speaker:

the software engineering practice. So DevOps really focused

 

Speaker:

on integrating development and operations so

 

Speaker:

that those functions were

 

Speaker:

shared more as opposed to completely siloed. And

 

Speaker:

site reliability engineering was a set of practices

 

Speaker:

around maintaining stability and reliability

 

Speaker:

of large scale web operations. And so there

 

Speaker:

are some foundational topics that I'd like to ask you about actually,

 

Speaker:

around things like service level indicators, objectives and

 

Speaker:

agreements and a wide number of other

 

Speaker:

practices. So this is a very wandering answer to say

 

Speaker:

that the major difference between the two I think is really kind

 

Speaker:

of one of ancestry and how they started and

 

Speaker:

SRE sort of being a set of

 

Speaker:

practices and DevOps being more of a

 

Speaker:

philosophy and an approach to the development

 

Speaker:

environment. Yeah, it's almost like the

 

Speaker:

SRE is a practice that occurs within a DevOps model, but

 

Speaker:

it exists independently as well too. But it's a role

 

Speaker:

within DevOps. But not the only role within DevOps. Right?

 

Speaker:

Yeah. Now what's interesting is you

 

Speaker:

hear both DevOps and SRE talked about as

 

Speaker:

practices but also you'll hear about SREs talked

 

Speaker:

about as a profession, but yet

 

Speaker:

you don't talk about DevOps as a profession.

 

Speaker:

And in fact people do, but usually it's considered a

 

Speaker:

negative. I'm a DevOps engineer. No, there's no such thing as a

 

Speaker:

DevOps engineer. So is that

 

Speaker:

also in part coming from the historical

 

Speaker:

nature of where it came from, or is there really a difference there that

 

Speaker:

matters? This is a great question and something I hoped we

 

Speaker:

would touch on because I still kind of cringe a little bit when I see

 

Speaker:

DevOps engineer, but I've come to understand why

 

Speaker:

that job title has meaning. Because

 

Speaker:

there are organizations that for a number of reasons,

 

Speaker:

including the size of the organization, the history

 

Speaker:

of it, its composition, sometimes it does make

 

Speaker:

sense to have people focus

 

Speaker:

on the kinds of things that happen at the

 

Speaker:

boundary of development and operations.

 

Speaker:

And so you'll get DevOps engineers who focus on internal

 

Speaker:

tooling, build and deploy pipelines,

 

Speaker:

that kind of

 

Speaker:

activity. Yeah, I always hate the word DevOps engineer applying

 

Speaker:

to that as opposed to like infrastructure engineer or tooling

 

Speaker:

engineering. But you're right, you do hear that, you hear that

 

Speaker:

apply there. What it almost seems like though is you

 

Speaker:

hear a large organization say DevOps is good,

 

Speaker:

we need to go to DevOps. Okay, you and you are now DevOps

 

Speaker:

engineers. Exactly. And that's not the way it's

 

Speaker:

done, of course. And often they become the ones then that

 

Speaker:

focus on the tooling and kind of become those tooling engineers and keep

 

Speaker:

the DevOps title. And it's not always a good

 

Speaker:

history that brings you to that situation.

 

Speaker:

Yeah. And to answer your original question, I

 

Speaker:

think

 

Speaker:

there's a little bit of a crisper definition around

 

Speaker:

what a site reliability engineer does,

 

Speaker:

but there's still a lot of fuzz in the definition and there's a lot

 

Speaker:

of range in if someone says they're an SRE, what they actually do. It's

 

Speaker:

still going to be quite a wide range of options. But

 

Speaker:

the origins of site reliability engineering go back to

 

Speaker:

bringing the software engineering discipline

 

Speaker:

into the operations realm. And so again you see this sort

 

Speaker:

of both SRE and DevOps are really about

 

Speaker:

crossing that boundary, but it's

 

Speaker:

almost in the opposite direction of what... Exactly. Yeah, exactly.

 

Speaker:

DevOps is more about bringing Ops into dev and

 

Speaker:

SRE is more about bringing the processes

 

Speaker:

of development into operations. Right. And so you are much more

 

Speaker:

likely to end up with an SRE group

 

Speaker:

that is sort of helping the whole organization level up with those

 

Speaker:

things. Whereas a DevOps organization,

 

Speaker:

at least in the way that I tend to use DevOps, and I think you

 

Speaker:

and I are similar in this, a DevOps organization is going to

 

Speaker:

be you're on call for your own services rather than having

 

Speaker:

an operations center and some of those things that are more at the

 

Speaker:

organizational scale as opposed to SRE tending

 

Speaker:

to be more likely that you're going to have a group of

 

Speaker:

people that are bringing those things to the organization. Yeah,

 

Speaker:

in that manner, SRE group or SRE

 

Speaker:

engineers is more akin to like an architecture group and

 

Speaker:

architects, they're assigned to individual parts of the

 

Speaker:

project, but they also have some global responsibilities as

 

Speaker:

well and shared knowledge. And whether they're in

 

Speaker:

one group or distributed is a

 

Speaker:

much more fluid question. That depends on the

 

Speaker:

organization versus a clear cut who should be in which group.

 

Speaker:

Sort of a model that is more akin to what

 

Speaker:

happens in DevOps. I like that distinction. So now

 

Speaker:

we know SRE is not the same as DevOps and we understand the difference

 

Speaker:

between them. That's great. So you were going to get me back on something now,

 

Speaker:

you had said. I'm not really looking

 

Speaker:

forward to that, whatever that is.

 

Speaker:

If there's one thing that's iconically associated with

 

Speaker:

SRE, I think it's fair to say that it's service level indicators

 

Speaker:

and service level objectives and service level agreements.

 

Speaker:

SLIs, SLOs, SLAs. The acronyms

 

Speaker:

confuse everybody, even those who have been using them for years.

 

Speaker:

And I know that you have a very pragmatic approach

 

Speaker:

to kind of tackling some of these questions. So I'd love first for folks that

 

Speaker:

aren't deeply aware of those, maybe kind of set the scene and then I'd

 

Speaker:

love to kind of hear your take on how you can implement those.

 

Speaker:

Well, sure. I even confuse SLIs and

 

Speaker:

SLOs. And so I'm going to need help with the definition if we're going

 

Speaker:

to define what the three are. But I'd almost prefer to avoid the

 

Speaker:

definitions and talk about what the problem is that's going on there. What the

 

Speaker:

problem is, is what all of them are trying to indicate

 

Speaker:

is the health of something, the health of a code base,

 

Speaker:

the health of a service, the health of an application.

 

Speaker:

Now, historically, the word SLA, service

 

Speaker:

level.

 

Speaker:

Agreement. Agreement? Yeah.

 

Speaker:

SLA, service level agreement, comes

 

Speaker:

from inter customer connections.

 

Speaker:

So you have a provider of a service, of an

 

Speaker:

application that has a customer, and that customer says, we'll buy

 

Speaker:

your service, but I need an SLA, a service level

 

Speaker:

agreement that specifies how well or

 

Speaker:

what your service is going to do for me. And often those

 

Speaker:

agreements are around things like uptime,

 

Speaker:

latency, how fast the application will work,

 

Speaker:

how many users can be connected to it. There could be a

 

Speaker:

thousand different dimensions on how it's measured, but it's usually some form of

 

Speaker:

measurement of a guarantee to the customer

 

Speaker:

of what the service or application that

 

Speaker:

the provider of that will guarantee in

 

Speaker:

exchange for usually money in the case of a customer

 

Speaker:

relationship. So an SLA has a very long history.

 

Speaker:

It's been around for a long time. The word SLA probably goes back

 

Speaker:

long before either one of us were born, because it applies

 

Speaker:

to contract work in general, not just software or

 

Speaker:

computer work. And so it's been around for a long time. But

 

Speaker:

what's happened in I believe it was Google

 

Speaker:

is the one who started the SLO or the SLI model. I believe they're

 

Speaker:

the ones that did part of the SRE revolution. It included with

 

Speaker:

them. But what was decided was

 

Speaker:

we need some way to at a smaller scale as we

 

Speaker:

take this large application and now internally

 

Speaker:

divide it into services and into microservices and into

 

Speaker:

its various components. And especially in DevOps

 

Speaker:

models, we needed a way to say this part of the

 

Speaker:

service has requirements that it must

 

Speaker:

perform to. It has obligations it

 

Speaker:

needs to be able to handle in order to serve the needs

 

Speaker:

of the other services around it.

 

Speaker:

And so Google created new terms called SLIs and SLOs

 

Speaker:

in order to distinguish them from

 

Speaker:

SLAs for how you

 

Speaker:

measure those parts of the application. And the

 

Speaker:

idea is SLIs and SLOs are internal

 

Speaker:

measurements for internal customers, and SLAs were

 

Speaker:

external measures for external customers. That's where I have

 

Speaker:

my problem, because in my mind, in a service-oriented

 

Speaker:

architecture, in a service-oriented

 

Speaker:

team model, if you own a service

 

Speaker:

and other services depend on you, those other teams

 

Speaker:

are your customers. The fact that they sit down the hall from

 

Speaker:

you or right next to you, or on another floor, but in the same

 

Speaker:

company is irrelevant. They're still your customers.

 

Speaker:

Whether they're an internal customer or an external customer doesn't matter.

 

Speaker:

They're your customers. You need to keep them happy for

 

Speaker:

your application to perform as expected. So when

 

Speaker:

you provide a service that someone else is depending

 

Speaker:

on, and you specify what the requirements are for running that

 

Speaker:

service, those are service level agreements. Those are the

 

Speaker:

agreements that you have with the other service owners

 

Speaker:

of how your service will behave. There's no

 

Speaker:

difference between those SLAs and the external ones. So

 

Speaker:

don't call them something different, because that implies there's something less.

 

Speaker:

Right? An SLO implies an

 

Speaker:

internal agreement, which of course, internal agreements aren't

 

Speaker:

official agreements. Well, they're not as important, right?

 

Speaker:

SLA implies an external agreement, which is important because we're

 

Speaker:

talking about customers here. They're all customers. They're all

 

Speaker:

external. They're all SLAs. When you make an

 

Speaker:

agreement that your application performs a certain way, there's

 

Speaker:

no difference in whether or not that agreement is made with another team within your

 

Speaker:

organization or to an external customer. They're just as important

 

Speaker:

because guess what? If you break your agreement

 

Speaker:

for how your service performs with another team, that's not going to

 

Speaker:

just affect the other team. That's going to affect all the teams that they depend

 

Speaker:

on, and ultimately it's going to affect the customer. So it all

 

Speaker:

matters. They're all just as important. So let's not invent

 

Speaker:

new terms to describe them. In my mind, they're all

 

Speaker:

SLAs. So if you have 100 service teams

 

Speaker:

within your organization and they have their

 

Speaker:

criteria for how they are expected to perform

 

Speaker:

to support the other service owners, those

 

Speaker:

are SLAs. Those expectations are

 

Speaker:

service level agreements. They need to be treated at the same level of

 

Speaker:

importance as the customer level service level

 

Speaker:

agreements. I find that really interesting because you're getting

 

Speaker:

at the fact that words matter and what we

 

Speaker:

call things matter, because I think there are a lot of

 

Speaker:

really interesting organizational challenges with implementing

 

Speaker:

SLOs and SLAs effectively. And one of them sort of on the

 

Speaker:

flip side, is that when teams

 

Speaker:

talk about service level objectives,

 

Speaker:

they often sort of set them arbitrarily based

 

Speaker:

on, okay, these are the things that I can measure and these are

 

Speaker:

my objectives as the owner. And

 

Speaker:

what you're getting at is the fact that these really need to be

 

Speaker:

agreements. They need to be hashed out with product

 

Speaker:

owners and technical leads and people who are

 

Speaker:

deeply familiar with the customer, whether that customer is internal or

 

Speaker:

external. Stay tuned for our next Modern Ops segment

 

Speaker:

when Beth and I continue our discussion on modern application

 

Speaker:

operations by talking about ownership in a modern

 

Speaker:

operations world. Thank you for

 

Speaker:

tuning in to Modern Digital Business. This podcast exists

 

Speaker:

because of the support of you, my listeners. If you enjoy what you

 

Speaker:

hear, will you please leave a review on Apple Podcasts or

 

Speaker:

directly on our website at mdb.fm/reviews?

 

Speaker:

If you'd like to suggest a topic for an episode or

 

Speaker:

you're interested in becoming a guest, please contact me directly by

 

Speaker:

sending me a message at mdb.fm/contact.

 

Speaker:

And if you'd like to record a quick question or comment, click the

 

Speaker:

microphone icon in the lower right-hand corner of our website.

 

Speaker:

Your recording might be featured on a future episode. To

 

Speaker:

make sure you get every new episode when they become available, click

 

Speaker:

subscribe in your favorite podcast player, or check out our website at

 

Speaker:

mdb.fm. If you want to learn more from me,

 

Speaker:

then check out one of my books, courses or articles by going to

 

Speaker:

leeatchison.com, and all of these links are included in the show

 

Speaker:

notes. Thank you for listening and welcome to the world of the

 

Speaker:

Modern Digital Business.
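
The uptime and latency agreements discussed in the transcript can be made concrete with a small sketch. The Python below is illustrative only: the class and function names, the 99.9% and 300 ms thresholds, and the "checkout" service are all assumptions made up for this example, not anything stated in the episode. It computes two indicators (availability and 95th-percentile latency), checks them against an agreement, and shows how availability compounds when a request has to pass through several services.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Hypothetical agreement between a service owner and its consumers."""
    service: str
    min_availability: float    # e.g. 0.999 means 99.9% of requests succeed
    max_p95_latency_ms: float  # ceiling for 95th-percentile latency


def availability(successes: int, total: int) -> float:
    """Indicator: fraction of requests that succeeded."""
    if total == 0:
        return 1.0  # assumption: no traffic means nothing was violated
    return successes / total


def p95(latencies_ms: list[float]) -> float:
    """Indicator: nearest-rank 95th percentile of observed latencies."""
    if not latencies_ms:
        return 0.0
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def meets_agreement(sla: ServiceLevelAgreement, successes: int,
                    total: int, latencies_ms: list[float]) -> bool:
    """True when both measured indicators are within the agreed limits."""
    return (availability(successes, total) >= sla.min_availability
            and p95(latencies_ms) <= sla.max_p95_latency_ms)


def chained_availability(availabilities: list[float]) -> float:
    """Availability of a request that needs every service in the list,
    assuming the services fail independently of one another."""
    return math.prod(availabilities)


if __name__ == "__main__":
    sla = ServiceLevelAgreement("checkout", 0.999, 300.0)
    samples = [100.0] * 95 + [250.0] * 5
    print(meets_agreement(sla, 9995, 10000, samples))
    print(chained_availability([0.999, 0.999, 0.999]))
```

Treating every team's target as an agreement, as Lee argues, means the same check applies whether the consumer is another team down the hall or a paying customer. The last function also hints at why a miss anywhere matters: three services that each just meet 99.9% give a request that needs all three roughly 99.7% at best, assuming their failures are independent.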

 

 

Beth Long

Product Manager

I write stories for humans and code for machines. I'm preoccupied with the entire ecosystem of modern technology: code, data, infrastructure, and the clever, perplexed humans who make it all work.