Welcome to Modern Digital Business!
Sept. 26, 2022

ModernOps with Beth Long: Transferring Operational Expertise to the Cloud

Today on Modern Digital Business, we continue our highly successful series called ModernOps. ModernOps is a series of interviews co-hosted with a good friend of mine, Beth Long, who is the head of product at jeli.io, an incident analysis company.

This will be our second in a series of episodes. In the first episode, we talked about how the experience using the cloud varies from large companies to small companies. In this episode, we talk about transferring operational expertise from an on-premise data center to a cloud-centric infrastructure.

Today on Modern Digital Business.

 

About Lee

Lee Atchison is a software architect, author, public speaker, and recognized thought leader on cloud computing and application modernization. His most recent book, Architecting for Scale (O’Reilly Media), is an essential resource for technical teams looking to maintain high availability and manage risk in their cloud environments. Lee has been widely quoted in multiple technology publications, including InfoWorld, Diginomica, IT Brief, Programmable Web, CIO Review, and DZone, and has been a featured speaker at events across the globe.

Take a look at Lee's many books, courses, and articles by going to leeatchison.com.

Looking to modernize your application organization?

Check out Architecting for Scale. Currently in its second edition, this book, written by Lee Atchison and published by O'Reilly Media, will help you build high-scale, highly available web applications or modernize your existing applications. Check it out! Available in paperback or on Kindle from Amazon.com or other retailers.

 

Don't Miss Out!

Subscribe here to catch each new episode as it becomes available.

Want more from Lee? Click here to sign up for our newsletter. You'll receive information about new episodes, new articles, new books and courses from Lee. Don't worry, we won't send you spam and you can unsubscribe at any time.

Mentioned in this episode:

LinkedIn Learning Courses

Are you looking to become an architect? Or perhaps you are looking to learn how to drive your organization towards better utilization of the cloud? Are you looking for ways to utilize a Cloud Center of Excellence in your organization? I have a whole series of cloud and architecture courses available on LinkedIn Learning. For more information, please go to leeatchison.com/courses or mdb.fm/courses.

Courses by Lee Atchison

O'Reilly Media - Building a Cloud Roadmap

Have you struggled with cloud migration? Then you'll appreciate my live training course, Building a Cloud Roadmap, presented by O'Reilly Media. Live on October 5th at 9:00 AM PDT. For more information, go to mdb.fm/roadmap or leeatchison.com/roadmap. But hurry, seats are limited.

Transcript

Lee:

Today on Modern Digital Business, we continue our highly successful series called ModernOps. ModernOps is a series of interviews co-hosted with a good friend of mine, Beth Long, who is the head of product at jeli.io, an incident analysis company. This will be our second in a series of episodes. In the first episode, we talked about how the experience using the cloud varies from large companies to small companies. In this episode, we talk about transferring operational expertise from an on-premise data center to a cloud-centric infrastructure. Are you ready? Let's go.

My co-host for this series is Beth Long. Beth is head of product at jeli.io, an incident analysis and management platform that combines comprehensive data from multiple sources to help identify problems and proactive solutions. Beth and I worked together at New Relic, where Beth was heavily involved in the product operational management of a highly scaled and fast-growing application. Beth has a strong background in IT operations. Given my cloud and IT management expertise and her IT operations expertise, Beth and I joined together to create this series of episodes that we call ModernOps. We recorded this content back in the spring of 2021, right in the middle of the pandemic, but we never published it until now.

In this second episode, Beth and I are talking about operational expertise and how that changes when you move from an on-premise data center to a cloud-centric infrastructure. How do you manage changing expertise requirements across your organization as you make the transition to the cloud? I hope you enjoy.

So a related topic here, I think, is the question of what AWS, or GCP, or Azure, or whoever, provides to you as a small company compared to a larger company like New Relic, or even larger companies that are cloud native. What type of support do they provide to you? And does that impact your ability to use the cloud or to leverage what AWS provides?

Beth:

Certainly there is a different stated offering in terms of level of support. As an engineer at a tiny startup, I probably can't get a rep on the phone if something goes wrong. I'm gonna be looking on Stack Overflow. That's probably gonna be my primary support, as opposed to being able to actually talk to someone when there is an issue.

Lee:

So when you were at New Relic, you were able to talk to someone directly if you needed to?

Beth:

To an extent. There were people who could talk directly to AWS if there was a major problem, but I think...

Lee:

So were there cases where that...

Beth:

Helped? That's what I was gonna get to. I'm trying to think of specific examples and outcomes. My memory of some of the events where that happened was that it helped us to identify what we could do better on our end to mitigate the issue, but it typically didn't lead to a change in what was happening on the AWS side. We typically had to just wait out whatever the issue was. So we did get more visibility, but didn't necessarily get an accelerated resolution on the AWS side. And I'm thinking of big issues, like issues with network links and that sort of thing.

Lee:

So there is a perceived value, but not necessarily a practical value. Interesting question, then. In the realm where you are now, at Jeli, where you have to depend on Stack Overflow because you're not getting the support from AWS, are you getting better support from Stack Overflow than you got directly from AWS? Is this a blessing in disguise, I guess is what I'm asking.

Beth:

Yeah. It certainly means that you're planning for reality a little bit more. There's not this idea that you can call someone and get help, so you know that you have to plan for that. It's hard to compare, because the scale of issue that we're dealing with is itself so different, right? The kinds of problems that we tend to have are the sorts of things where, if you're having an issue with your database instance, you could fail over to another one. And at New Relic scale, when you run into big issues, they tend to be so thorny that you don't have a lot of escape hatches.

Lee:

But that was as much because of the New Relic architecture and scale as it was because of AWS. So was it the architecture and scale? And this isn't a knock on the architecture. It's about starting over again with a company like Jeli, which has a young, immature architecture, compared to an architecture like New Relic's, which has been established for many years. There are pros and cons to that, right? Obviously maturity has value, but maturity also has cruft, and you have to deal with all of those sorts of issues. How much of the complexity at New Relic was because of AWS and the complexity of using the cloud, how much of it was the cruft of the architecture from its maturity, and how much of it was the scale involved?

Beth:

Certainly there's an element that was scale, but any architecture is gonna, by definition, make various trade-offs. And when your architecture was designed to optimize for the trade-offs of a different world, then as you try to bend that into the cloud, I think sometimes you run up against spots where you've optimized for a different landscape.

Lee:

So that actually brings us back to talking about the infrastructure you know versus the infrastructure you don't know. When you are building your infrastructure from scratch, like you are with Jeli, you build knowledge of the infrastructure, and that knowledge grows as your application grows and scales. In theory, your knowledge grows at the same rate and at the same time. As you need more expertise, you have more expertise, and everything's good and wonderful. I make it sound a lot more simple than it really is, I get that, but I think there's a generalization that does apply.

The same process occurred, by the way, with New Relic as they grew from a small company into a larger company with their on-prem data centers. The knowledge and expertise grew, and the maturity grew. They knew how things worked and how different things would respond, and when a given type of problem happened, they knew how to respond to it. All that expertise grew to the point where they were, relatively speaking, a large company with that level of expertise.

Now take that and move to a completely different infrastructure. All that expertise goes out the window and you have to start over again, yet you still have the same scale and level of responsibility that you did previously. That makes migration hard. Obviously, when you were at New Relic, you ran into that, I'm sure. Talk about that a little bit.

Beth:

Yeah, that is a great point. And even at Jeli, even though we have access to all of the AWS managed infrastructure and services, we've still elected to use the things that we know how to use. We're big proponents of boring tech. So even within the set of options, we're not using things that folks haven't used at least some version of in production.

As New Relic moved into the cloud, a number of things happened. One is, for example, we had a team of really strong network engineers who knew how to manage networks and had really deep expertise in that. Their expertise became redundant, and they all either changed roles or moved on. Where before we had people who could trace a packet all the way through the entire system, a lot of that expertise became redundant, and we shifted to teams that, say, had previously owned bare metal having to learn how to manage AWS infrastructure.

So a lot of the expertise in managing what was in the cloud kind of centered on the Container Fabric team, which owned our containerization platform. They were juggling both owning the old stuff and keeping that running while also spinning up the new stuff and learning the new ecosystem. So there was the challenge not just of building up that new expertise, but also of pulling a critical team very thin during that whole phase.

Lee:

So when you think about layered expertise, right, you had network engineers who knew how packets worked, all the way up to application experts who knew how UIs worked. I'm just making this up, but there's a whole level of expertise in the middle there. And as you made this move, you were making several changes at the same time. You were containerizing and virtualizing where containers were stored, along with moving to the cloud. You can do those two things independently, but they were both going on at the same time.

So the net result of all that was you were adding required layers of expertise in the stack, as well as completely removing specific layers, right? You didn't need network engineers anymore, or at least not network engineers doing the same thing that they were doing before. They were dealing with a different level of issues, maybe more security or routing issues than just packet tracking and "is this cable broken?" sorts of issues. So it's not saying that all network engineering went away. It absolutely didn't. But a certain class of network engineering went away and another class was heightened, and a certain class of service fabric went away and another class of service fabric appeared.

Overall, do you think the size and complexity of the layers decreased, stayed the same, or increased when you moved to the cloud? That's a great change, but did you really need less expertise, and fewer people implementing that expertise, once you moved to the cloud than you did before?

Beth:

We didn't need less expertise. I'd say, if anything, we needed more. It just changed shape. We needed expertise in new areas, and we needed expertise that was more about navigating the AWS ecosystem and the things that were built on top of it. I'd say that we had more total layers, and what happened was that some of the layers at the very bottom we no longer had access to directly. So for the people who were dealing solely with those bottom layers, their expertise became redundant to the organization, because that was hidden beneath these other layers. But then we added this whole new set of layers, I shouldn't say at the top, but we added this whole new set of layers, and the complexity of interaction between those layers increased.

Lee:

The complexity of interaction between layers increases, so there's more need for fungible expertise.

Beth:

Exactly.

Lee:

So you actually needed more experts, but more fungible experts, than you ever did before. That makes total sense. Before, you needed someone who knew exactly the size and shape of Ethernet packets. You don't need that at all now; that expertise is irrelevant. But being able to go up and down the stack, dealing with everything from container fabric up to security and firewall configuration issues, that variability was critical. Right?

Beth:

And it's becoming less about knowing... Obviously it's important to know a lot of fundamental principles, but it's less about knowing those fundamental, less changeable principles and more about maintaining a current vocabulary, because all of these systems that we're using in the cloud are changing and evolving so quickly. So knowing what the most recent releases are, what the most recent features are that have come out, and how that impacts what you're doing; understanding AWS itself as a changing ecosystem, where you can estimate the speed that you need to be moving at so that you'll land in the right place based on what's going to happen in six months: that all becomes much more important as you begin to navigate that world that is so constantly evolving.

Lee:

I hope you enjoyed ModernOps with my co-host Beth Long. ModernOps will be a regular series that will appear occasionally on Modern Digital Business. If you enjoyed this episode, let me know so we can make sure to include more conversations like this in future episodes. You can reach me via the links in the show notes, or sign up for more content at leeatchison.com/follow.

Beth Long

Product Manager

I write stories for humans and code for machines. I'm preoccupied with the entire ecosystem of modern technology: code, data, infrastructure, and the clever, perplexed humans who make it all work.