Into the Unknown Unknowns: Observability with Charity Majors

The speed of development and change in software engineering enables exponential growth and innovation. It can also give rise to ever-increasing complexity and the creation of systems that are as critical as they are fragile. As a result, in the real world of legacy tooling within complex distributed systems, it's simply not possible to monitor for every possible failure. An alternative approach that has been gaining attention is the goal of achieving observability.

Charity Majors is the co-founder and CTO of Honeycomb. On this week’s episode, Charity discusses observability not only as a practical tool for creating performant and reliable production services, but also as part of a fundamental shift in the practice of software development as a whole. Listen to the full conversation to learn more about observability, how it benefits teams, and how you can start implementing observability in your own work.

In addition to her work with Honeycomb, Charity is the author of O'Reilly books on observability and database reliability engineering. You can find Charity at charity.wtf, Honeycomb's blog, and on Twitter.

Context Links:

The Engineer/Manager Pendulum

Subscribe to Collaborative Craft

If you'd like to receive new episodes as they're published, please subscribe to Collaborative Craft in Apple Podcasts, Google Podcasts, Spotify, or wherever you get your podcasts. If you enjoyed this episode, please consider leaving a review in Apple Podcasts. It really helps others find the show.

This podcast was produced by Dante32.

Episode Transcript

Into the Unknown Unknowns: Observability with Charity Majors

[00:00:00] Jerome: 3, 2, 1. Hey, everyone. I'm Jerome Goodrich.

[00:00:05] Thomas: And I'm Thomas Countz.

[00:00:07] Jerome: And you're listening to Collaborative Cla- [laughter]

[00:00:11] Thomas: It's not just me! [laughter]

[00:00:15] Jerome: I spit all over my mic. Oh my god.

[00:00:19] Thomas: Oh, man.

[00:00:23] Jerome: Hey everyone. I'm Jerome Goodrich.

[00:00:27] Thomas: And I'm Thomas Countz.

[00:00:28] Jerome: And you're listening to Collaborative Craft, a podcast brought to you by 8th Light.

[00:00:34] Thomas: So, Jerome, who are we talking to today?

[00:00:38] Jerome: Thomas, you will not believe it. We are talking to the Charity Majors.

[00:00:49] Thomas: [Muffled scream] Little excited...

[00:00:53] Jerome: Charity is the founder and CTO of honeycomb.io, which is an observability platform. And we'll talk a little bit more about what that means. Thomas, why are you excited to talk to Charity?

[00:01:06] Thomas: Me?

[00:01:07] Jerome: Yes.

[00:01:08] Thomas: Okay. Full disclosure, before we really started diving into- or before we knew that Charity had agreed to come talk with us, I only a little bit knew what observability was. I definitely didn't know what O11y was, thank you for explaining that to me, Jerome.

[00:01:27] Jerome: Is it [pronouncing O11y phonetically] Olly or Oh-eleven-why?

[00:01:31] Thomas: It's called o11ycast, Charity's podcast, so I'm a stick with O11y.

[00:01:35] Jerome: Okay. Fair enough.

[00:01:37] Thomas: I trust her. But yeah, I didn't really know what it was. I knew it was something about high cardinality, high dimension, cross slicing dicing events.

[00:01:47] Jerome: Buzzword, buzzword.

[00:01:49] Thomas: Yeah. But I didn't really know. And then lo and behold, just like Charity generously agreed to talk with us. She shares so generously across the internet, across Twitter, across her podcast and across the website. I mean, it's just a wealth of information. So, what I quickly learned was, oh, I have the problems, that observability is here to solve.

And yeah, I think a lot of the questions that we would have otherwise asked Charity were answered. So I'm really excited because I think we have an opportunity to ask some questions that she may not have heard before.

[00:02:32] Jerome: Well let's hope.

[00:02:33] Thomas: How about you? What's your background with O11y? And why are you excited to talk with Charity Majors today?

[00:02:43] Jerome: I mean...

[00:02:44] Thomas: I feel like I have to say her last name. Sorry. Charity Majors.

[00:02:46] Jerome: Yeah, no, I'm with you. Like, I have been tangentially aware of Charity mostly because of the excellent article, "The Engineer/Manager Pendulum". And I think I probably read that, I don't know when it first came out back in like 2017 or something like that and thought it made a ton of sense and then kind of forgot about it. And then we booked Charity and I was like, "oh my god." Like, this was such an influential piece for me. And the observability stuff was secondary. And then I started getting into it, and I was like, okay, I've heard this buzzword before we've talked about it on clients and stuff. And I mean, she is the expert in observability.

And so that alone makes me very excited to talk to her, but I'm also just kind of a neophyte when it comes to this stuff. Observability after doing the prep for this episode and with my experience on clients, it makes a lot of sense to me. But it's not something that I have had a lot of experience about, at least in the way that she talks about observability. Because I've always thought of observability and monitoring as synonyms, which I know is...

[00:04:05] Thomas: Ooh, careful!

[00:04:05] Jerome: Yeah, yeah, yeah. I know. I'm not gonna... She might hear this later, but I'm not gonna admit that to her face-to-face. And yeah, so I'm just very curious to hear kind of straight from the horse's mouth, what is all the fuss about? Like, why should we care about this? What is the future of it? Is it really as important as I've been led to believe in the weeks leading up to this episode? It's like I have an encyclopedia in front of me and I just want to ask it all the questions.

[00:04:41] Thomas: Yeah, no, totally. And I love that you mentioned that particular post around the management pendulum, forgetting the name, but we'll include it in the show notes, because she's also shared and provided so much guidance around just like how you should organize your team. And I don't know about anyone else when I read it, I'm like "this makes total sense." Kind of like cuts the cruft around the pomp around hierarchies and team dynamics and team structures. She's like, look, we want to get work done. Let's get out of our own way.

[00:05:15] Jerome: It's no B.S.

[00:05:16] Thomas: Yeah. Yeah, exactly. So, yeah. I love that you mentioned that. Cause that's the other thing, like there's so much technical expertise that she has, and then there's so much organizational expertise that she has. So I cannot wait to dive in.

Let's do it. Without further ado, Charity Majors.

Thank you so much for being with us today, Charity. It is such an honor and a pleasure to have you. Just to dive right in with something maybe a little more philosophical: I'm curious about what being a software engineer, what that role means to you?

[00:06:00] Charity: I don't consider myself a software engineer, by the way. I consider myself an operations engineer. And to me, I feel like the business people are the "why," the software engineers are the "what," and the ops people are the "how." It's what gets built, well, you know, that's how software engineers work. It's how that value gets delivered to users, that's how ops people work. And the business ultimately, why does it exist?

Well, so the business is the why? So just at a super high level, I would say that software engineers, they create value. It's such an amazing time to be alive, because if you think of every other- you know, we get shit on by other engineers, but, you know, it takes a lot longer to build an interstate highway or a bridge or something than it does to like build stuff in software.

We get to iterate so quickly that our discipline is advancing so fast and we can just, you know, turn out prototypes and new languages and like models on the fly. And like all the other engineering disciplines, they're advancing like a snail compared to us. And we're like, you know, Roger Rabbit just like bounding down, you know?

And so I think it's just, it's such an abstract way of working right? All in your head, which is both good and bad, but it's all in your head, you know? And like, so I started using the terms like value, which used to creep me out as an engineer. Like talk about value, I smelled this person coming. But just like, being able to create things, inventions and experiences for other people with just you and your laptop or a small team and some computers. That's kind of magical when you think about it.

And this is where I feel like software engineers have so much more power than they tend to realize, because if they don't build it, companies stop dead in their tracks. So I think we have a real moral and ethical responsibility to take that power of creation very seriously.

[00:07:55] Thomas: Yeah, that's the kind of bit of being a software engineer or an operations engineer, and I'll get into that distinction in a second, that I was excited to talk to you about, because I think you're exactly right. There's a responsibility that goes into software engineering that hasn't been like rigorously defined. We're kind of just like get a laptop and go, but there's a lot, you know, that we have power over and responsibility for that we may not realize.

[00:08:23] Charity: Yeah. And I feel like part of that is because of how quickly the industry changes. It's really hard to regulate. It's really hard to set meaningful, not bureaucratic boundaries for our industry. And it is, you know, it is so volatile, so it's so different six months later.

[00:08:40] Jerome: To that end of the industry changing so much, you mentioned kind of like ops as being the how and software engineers being the what, but like more and more, those two things are kind of getting merged. And I'm wondering what are the implications of that?

[00:08:54] Charity: Well, I feel like they really never should have been split in the first place. Because the act of writing and creating is inseparable from the process of looking at it and validating it, you know?

And it's like, yes, we need to look for ways to specialize, but like separating the... the what and the how are just so intimately intertwined, you need to know something about how, what you're building is going to be delivered to users. You know, you need to know something about what you're delivering, right? They're just so meshed that I feel like, you know, we're always looking for abstractions and ways to like separate things so we can scale them. I actually think that was a really poor seam to cut down the middle. So I think it's good that we're stitching it back together. Like nobody can be all things to all people, right.

So we're going to continue to look for ways to kind of, split up this work in ways that makes sense. You know, the pod model, the tribe model, and you know, all of these things. But I just think that dev versus ops, it's just a boundary that, you know, that wall should never be built.

[00:09:54] Thomas: Yeah. And I think that attitude is imbued in honeycomb.io. Like it speaks the language of ops, but in a way that maybe someone who's more like feature engineering background can understand.

[00:10:08] Charity: So, so much that. You know, like, yes, it is fundamentally an operational tool in that it tells you things. It's your five senses for production. You know, in the past ops has kind of been like this layer between devs and between all the graphs who just- that we explained them to you. We're like, "ah, your software made these graphs wiggle in this way."

You know, we're just like the interpreters, which is terrible. Right. You know, so like speaking the language of endpoints and variables and functions and APIs, so that developers can kind of self-serve and understand in their terms what's going on.

[00:10:46] Thomas: Yeah, that brings me to a question about kind of how teams get started.

And I know you've talked about this before, particularly on your blog, but I'm wondering if I work on a team and maybe none of us have operational experience. And I say that out of respect for operation engineers. Like what are the prerequisites that we should be focused on in order to get started with maybe using tools like honeycomb.io for example.

[00:11:18] Charity: Well, whoever on that team knows the most about infrastructure is going to find themselves an ops engineer very quickly. Like if you know your way around a Linux shell. If you, if you solve the first problem- the first law of startups is that if you know anything about it, you're immediately the expert.

So, you know, it's just the fact of life, if you're delivering shit, they're going to be vocal. They're going to complain and you're going to spend a lot of your time dealing with that. So what should you know? I mean, just be resigned to that fact, I suppose. You know, there's really very little that I can say just from a comprehensive, here's your ops advice.

What I will say though is that there aren't many ops founders. And very few companies have the benefit or the advantage of having someone to do things right from the start. In my past career, I've always been the one who came in once we'd gotten some traction and then I start unwinding all the tech debt that has already accrued, you know, and like building up to be a real company and everything.

But like just being able to spin it up right from the start. You know, it took two or three months. But that two or three months of my labor, like spinning up infrastructure and everything, we were able to piggyback on that work for... we're still reaping the benefits of it. You know, we've never had to redesign things and we're able to start incrementally, you know, upgrade this, touch that. We've never had to do the security nightmare of, you know, figuring out how to, like, spin down all your instances and then up again inside the VPC, or, you know, we've never had to deal with that.

So I will say that teams are right. It's not mandatory from the beginning. You can do a lot of work without any ops expertise, but the earlier that you include some, the easier you are making it for your future self.

[00:13:03] Jerome: That gets me thinking, you're talking about, you know, building things right from the get-go. When you were building honeycomb.io, I'm sure you had observability in mind.

[00:13:12] Charity: The ultimate ambition [laughter].

[00:13:14] Jerome: Right. So how does honeycomb.io observe itself? And is it, is it just turtles all the way down?

[00:13:19] Charity: Yes in fact it is, we have the honeycomb.io production cluster, which runs, and you can think about kind of amazing, you know, we've got hundreds of customers and we run the equivalent of all of their production loads combined.

And then we have another honeycomb.io. So, and that cluster- we're a very dog themed company, so we've got like Retriever as a data storage layer, Beagle, and we've got like, you know, Basenji is a little security thing. And so we're all about the puppies, right? And then we have a, of course we have the dog food cluster, which monitors production.

And then we have the kibble cluster, which monitors dog food. [Laughter]

And a few other little, you know, st- anyone can spin up, you know, a container-based, you know, entire environment for their laptop, which has another cute dog name that I forget. But yeah, yeah, it is turtles all the way down. It is honeycomb.io watching honeycomb.io watching honeycomb.io.

[00:14:15] Thomas: And then I'm sure, like you said that you are still reaping the benefits of having that from the beginning, having that mindset.

[00:14:23] Charity: Absolutely. I will say you asked me that question and I actually will answer it. The most important thing anyone could do as an engineer on day one. You know, you're setting up CI/CD, right? Everyone knows that's important. Make your CI/CD pipeline run automatically whenever a change is checked in and push the binaries out to production.

It could be a janky shell script. But we just got used to the fact that anytime we merged something to production within 10 minutes, it's going to show up live. And so you just, you start like looking for it. You're like, oh, my page is going live, I'm going to watch for it. It's reasonable to expect engineers to really be owners and stewards of their code.

When you grow up expecting that. And it's never hard. It's like the myth of Alexander the Great would always lift up his pony, starting when he was a little boy, and lift up his pony before breakfast. So that when it was a horse, he could still lift it up. You know, it was never hard because everyday he just did it a little bit.

And if you just like train a team to do it... the thing is if you know your code is out in ten minutes, you're very likely to go look at it. If you're pretty sure that your code is going out some point in the next one to three days, along with zero to 20 other people's code, you're never going to go look for it. Right. And you've severed that whole like virtuous cycle of dev and ops.

And you're just not going to look at it. You're going to leave it up to your future self or someone else to find that.

[00:15:48] Jerome: So this comes from one of our recent guests Hana Lee. And she says, I recently had the painful experience of trying to reconcile two different distributed tracing SDKs from two different platforms and wondered if OpenTelemetry would be a better answer. I'd love to hear about the OpenTelemetry project and how its reception has been so far.

[00:16:11] Charity: An unqualified yes. It is going much better than I expected it to. But the history of so-called game-ender, like, open solutions, I mean, before it was OpenTelemetry, it was OpenTracing, before which it was, you know, blah, blah, blah.

It's a patchy record. But I do see people actually embracing OpenTelemetry in large numbers. And I do think, it's kind of inevitable. And so the sooner you do it, the better, and it's really good for users because if you instrumented your code using OpenTelemetry, all of the big vendors and small vendors have embraced it to some extent or another. Honeycomb.io is all in on OTel, but you know, the New Relics, the Datadogs, the Splunks of the world are also in on OTel. Which means that if you've instrumented your code for OTel, you can switch from one to the other with almost no change.

It's almost as simple as just like changing a variable in your config. Which is amazing because it, you know, all of those sunk costs, you know, that's what keeps people with vendors way more than their love for them. It's like, I don't want to do this again. You know, that's why most people have their, their monitoring vendors, right.

And this forces vendors to compete on actual user experience instead of just relying on being locked in. So it's very good for users. It is getting a lot of traction and it does seem to be the future and it is absolutely worth learning and doing.
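The "changing a variable in your config" point can be sketched in a few lines. This is a toy, standard-library-only illustration and not the OpenTelemetry API: `make_exporter`, `handle_request`, the `TELEMETRY_BACKEND` variable, and the two in-memory "vendors" are hypothetical names invented for the sketch. In a real OpenTelemetry setup the comparable knob is usually the exporter configuration, for example the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable.

```python
import json
import os
from typing import Callable, Dict, List

Event = Dict[str, object]
Exporter = Callable[[Event], None]

# Hypothetical in-memory "backends" standing in for two vendors.
SENT: Dict[str, List[str]] = {"vendor_a": [], "vendor_b": []}


def make_exporter(backend: str) -> Exporter:
    """Return a function that ships one event to the chosen backend."""
    if backend not in SENT:
        raise ValueError(f"unknown backend: {backend}")

    def export(event: Event) -> None:
        # A real exporter would send this over the network.
        SENT[backend].append(json.dumps(event, sort_keys=True))

    return export


def handle_request(user_id: str, export: Exporter) -> None:
    """Application code: instrumented once, unaware of the vendor."""
    export({"name": "handle_request", "user_id": user_id, "status": 200})


if __name__ == "__main__":
    # The only vendor-specific piece is this one setting (hypothetical name).
    backend = os.environ.get("TELEMETRY_BACKEND", "vendor_a")
    handle_request("user-42", make_exporter(backend))
    print(backend, SENT[backend])
```

The instrumentation inside `handle_request` never changes; only the value read from the environment decides where the events go, which is the property that lowers the cost of switching.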

[00:17:41] Jerome: That's so exciting.

[00:17:42] Thomas: That's a good segue into a question that- well it was actually Jerome's question, but I'm going to ask it.

[00:17:47] Jerome: Take it, go for it. We have one brain.

[00:17:50] Thomas: Well, the question of like, or like you brought up, like it's the future and the whole idea behind an observability tool and particularly honeycomb.io is about the unknown unknowns. And I'm wondering like how far does that go? Is this the end all be all for like software development in the future, given an observability tool, you will forever be able to prod and poke and ask questions of your system. Are we in the future now?

[00:18:22] Charity: We are in a version of the future. So I feel like, you know, the whole metrics, you know, you think of like software as being like, you know, the tree of life, you know, the first metrics were written in like the seventies, you know? And a metric is, you know, there's the generic term metrics and the specific term metric. A metric is just a single number with some tags appended so you can find it, right? And the metrics heritage is long and vast. And it has, I think, just about come to an end. We're starting to realize the limitations of the model, and, you know, Datadog and Prometheus are probably the last, best versions of that technology that will ever be built, in my opinion.

There's kind of no reason to, they do all the things you can possibly do with metrics. And they're all trying to like branch out from that because they recognize that they're at the end of the road. Observability tooling is new. And while I would like to say that we've got it all figured out and all cornered and blah, blah, blah.

I'm waiting for other vendors to have achieved what I would call observability so that we can start competing. I don't want to compete with them on the, on the level of, well, we're the only ones who do it, because that means it doesn't really matter that much. You know, it's not real until, you know- and I know that they're doing this on the back end.

I know that they're competing. I know that they're trying to catch up. It's just that they can't move as quickly as we can because they're so big. But then again, they have like thousands of engineers to throw at the problem. So it's inevitable that they will right? Observability will change too.

Part of this is driven by the cost of storage, right? With spinning rust, you're never going to be able to achieve observability, just like you can only tune your databases so far. Right. And it wasn't until RAM got cheap, SSDs got cheap, that we're able to do these things like storing raw, arbitrarily wide structured data blobs, and then doing real time analytics on top of it, right?

It's far more wasteful. Metrics are super cheap, but they're super limited in what they can do. Whereas, you know, the building blocks of observability are much richer, much more, you know, wasteful when it comes to storage, but they can do way more powerful things. 10 years from now when you know, SSDs are pennies and, you know, RAM is like nothing.

You know, I think that there would probably almost certainly be more people who are moving to more of a... or additionally having sort of a real time, just sort of almost like GDB in production, where you're just like, you're catching snapshots of the code as it's passing through your, you know, your system with a very light impact on it, which is not really feasible right now.

If you have any kind of scale. But I think that the honeycomb.io model is going to endure for quite some time. Cause it's a real step function or two better than what we've had till now. And if people actually embrace it and adopt it, you know, we'll figure out how to make it even better. But I think of observability as being a way to debug systems almost more than code. The hardest problem in most distributed systems is not fixing the code, it's finding where in the system is the code that you need to fix. And that's the problem that observability is just amazing at solving. Actually finding your code. You know, I think in the future, like IDEs are kind of going to merge with observability tools, so that you're literally kind of debugging as you're- there's this great study that Facebook published about the cost of finding and fixing bugs.

And it's like, okay, you just typed a bug, backspace, that's the cheapest and fastest it's ever going to get. Right. But then the amount of- the longer it goes before you find and fix that bug, the cost of finding and fixing it goes up like exponentially. So in the future, and I've seen some academics working on some amazing stuff here where they basically like do these like mini graphs of like, they can kind of predict how it's going to run in production based on how it has run in production and how someone else's the algorithm.

So they can tell you as you're typing, oh, this algorithm is going to be way less efficient than the one that you had, or this one looks better, you know, just kind of collect. And the stuff that Dark is doing, Darklang, where it's just like you're developing in production and there's, there's no delivery pipeline at all.

Like that stuff is the future. That stuff is the future. And I feel like it's going to be a solid decade or maybe two before we get there. But like, that's clearly the way, because it's just, it collapses that time to finding and resolving bugs, which is so expensive for organizations to do.

[00:22:48] Jerome: What I'm hearing is like observability is almost like a design principle.

It's like something that gets integrated with like the systems that we architect. I think the main tension that I feel these days is that you have like, observability is inextricably tied to the tooling and that additional tooling breeds additional complexity. And I feel like at some point we'll have the technology needed to be able to simplify those things.

[00:23:17] Charity: Well, the hope is that everyone embraces otel and can then quickly and easily switch from their, their let's say deprecated provider to honeycomb.io without having to reinstrument everything. Well, and the hope is that people will start using honeycomb.io earlier in the process because you're right.

The friction of changing midstream, it's mind-boggling that anyone would do it. I still can't believe that anyone does it. I didn't want to do it. I resisted the hell out of it at Facebook. So it wasn't until our intern or somebody from Facebook actually came over, and we were using Ganglia, right. We had five years of effort going into this disaster of a metrics system. But we knew it, right. We knew it and we didn't want to reinstrument everything. But you know how Ganglia has, like, an XML file dump, like /var/tmp, you know, Ganglia XML, whatever. And it dumps state there once a minute. Well, this guy on our team wrote a cron job to take that state file and munge it into events and feed it into Scuba once a minute. And like, this is not a great way to see your events in Scuba, but it was enough for me to, like, you know, I knew the old variable names and the structure of the data and so I could find it in the new stuff.

And it was so much more powerful to be able to slice and dice there. I was like, ah, okay. Yeah, I see the benefit. This is, we have to do this because we're dying. Otherwise, you know, we're going up and down constantly. It's really humiliating to me as a professional and we still have no idea what's happening.

Like we don't know. Sometimes it fixes itself. But we're just like what just happened? You know. We could spend the rest of our day trying to figure it out, or we can move on and get some work done. And after that happens day after day, you just get numb to it. And it's just awful. Right. And like, we were doing microservices before there were microservices.

Right. We were doing a lot of these newer development paradigms without- the world didn't really understand the problems of them yet. And so we were fighting them with our face basically. And so that's the story of how we got into Scuba and it changed my life.
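The cron job in this story, which reads a periodic metrics dump and reshapes it into events, can be sketched roughly as follows. This is a minimal sketch under assumptions: the XML is a simplified stand-in for a Ganglia state dump (hosts containing metrics with `NAME` and `VAL` attributes), and `ganglia_xml_to_events` is a hypothetical name. The actual script and Scuba's ingestion format are not described in the conversation.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Simplified, assumed shape of a metrics state dump.
SAMPLE_XML = """
<GANGLIA_XML>
  <CLUSTER NAME="api">
    <HOST NAME="api-1">
      <METRIC NAME="load_one" VAL="0.42"/>
      <METRIC NAME="mem_free" VAL="1024"/>
    </HOST>
    <HOST NAME="api-2">
      <METRIC NAME="load_one" VAL="3.10"/>
      <METRIC NAME="mem_free" VAL="128"/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def _to_number(raw: str):
    """Keep numeric values numeric so they can be aggregated later."""
    try:
        return float(raw)
    except ValueError:
        return raw


def ganglia_xml_to_events(xml_text: str, timestamp: int) -> List[Dict[str, object]]:
    """Turn one state dump into one wide event per host."""
    root = ET.fromstring(xml_text)
    events: List[Dict[str, object]] = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            event: Dict[str, object] = {
                "timestamp": timestamp,
                "cluster": cluster.get("NAME"),
                "host": host.get("NAME"),
            }
            for metric in host.iter("METRIC"):
                # Old metric names become field names, so they stay searchable.
                event[metric.get("NAME")] = _to_number(metric.get("VAL", ""))
            events.append(event)
    return events


events = ganglia_xml_to_events(SAMPLE_XML, timestamp=1_700_000_000)
for event in events:
    print(event)
```

Keeping the old metric names as field names mirrors the detail in the story: someone who knows the old names can still find their data, but can now filter and group across hosts in one place.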

[00:25:22] Thomas: I want to go back to what you were saying earlier, like shifting tooling, shifting vendors is like such a heavy lift. And I just saw that honeycomb.io is now offering metrics on their Enterprise tier. And why I want to dig into that is, well, one, I think that's like a great on-ramp and also, I don't remember which blog post I was reading, but you said that kind of maybe metrics are for understanding the system, whereas observability is understanding the application.

And so it's like, there, it sounds like there's a place for metrics...

[00:25:56] Charity: There is a place for metrics. Absolutely. The thing is that you can derive metrics from these wide events. And you can't go any other direction. You can't get events from metrics. So yeah, there's absolutely, there are many places where metrics are valuable, you know, you derive them so you can store them cheaply over time, like an RRD format basically.
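The one-way relationship between wide events and metrics can be illustrated with a short sketch. The event fields (`endpoint`, `user_id`, `duration_ms`, `status`) and the function `derive_metric` are made up for illustration; this shows the general idea, not how any particular product stores or aggregates data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Hypothetical wide events: one per request, with arbitrary fields.
EVENTS = [
    {"endpoint": "/home", "user_id": "u1", "duration_ms": 20, "status": 200},
    {"endpoint": "/home", "user_id": "u2", "duration_ms": 30, "status": 200},
    {"endpoint": "/home", "user_id": "u3", "duration_ms": 25, "status": 200},
    {"endpoint": "/export", "user_id": "u4", "duration_ms": 900, "status": 500},
    {"endpoint": "/export", "user_id": "u4", "duration_ms": 1100, "status": 500},
]


def derive_metric(events: List[dict], field: str, tag: str) -> Dict[str, float]:
    """Collapse events into one averaged number per tag value (a metric)."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event[tag]].append(event[field])
    return {tag_value: mean(values) for tag_value, values in buckets.items()}


# Events -> metric: a cheap aggregate, suitable for long-term storage.
by_endpoint = derive_metric(EVENTS, "duration_ms", "endpoint")
print(by_endpoint)

# The same events can also answer a question nobody planned for.
by_user = derive_metric(EVENTS, "duration_ms", "user_id")
print(by_user)

# by_endpoint alone cannot answer the per-user question: the user_id
# field was discarded when the events were aggregated.
```

Once only `by_endpoint` is stored, the per-user breakdown is gone for good, which is the sense in which events can produce metrics but metrics cannot produce events.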

Or you can think of it as being a perspective problem. If you're a person who's responsible for systems in traditional ops land, you spend most of your time going, is my system healthy? Right? Is this service healthy? Is this system healthy? Is this database healthy? And you're asking about that from the perspective of the system, right?

And so what you actually care about there is the aggregates, because that tells you the capacity planning, it tells you when there's errors and everything, it works just great. As long as you don't actually care about the user's experience. If you're developing the software, you have a very different perspective.

You want the perspective to be not from the system, but from the user and from every single user. And you want to know if every single user is having a good experience or not, which is a completely separate question from the question of is my service healthy or not. It can be very healthy. And your users could be having a shit experience.

It can be unhealthy. And your user is going to have a great experience. Like they're just almost disconnected from each other. Yeah, there are lots of places where metrics are useful. And I think that infrastructure versus applications is a big one. The reason that we built it, like first, you said it's a bridge.

It's a really great bridge. Like we did with Ganglia to just like, see your world and ours. And second, there are a few metrics which even, you know, software engineers who are operating further up the stack need to know. You know, if you shipped a change and your memory usage just ballooned, you need to know about that.

If you shipped a change and your CPU is now grinding and thrashing, you need to know about that. If it's now [unintelligible], you need to know about that. But those three are pretty much the only ones that you really consistently need to know. You don't really need to know about, you know, the entire /proc filesystem and all of the low-level counters for, you know, this IPv6 counter and that IPv4 route.

And, you know, you don't care and you shouldn't care. The system is broken if you have to care. Right. And so much of the platforms that you develop on now, like serverless. Even if you care, you can't see it. So you just kind of have to learn to develop your code in a different way, where you're using your instrumentation to probe the environment around you.

But you're also cognizant that sometimes you need to, like, abandon it and try it again.

[00:28:28] Jerome: So I think that we're consultants and a lot of times we come in and we make recommendations about tooling and things like that. And we have to make a case for either something that we've used in the past, or explain observability to a client or something like that.

And it's really tough to get buy-in for a vendor. Like it, it's just, it's really difficult to do that. You have to sit in on the sales meetings and you have to have influence within the organization to be able to affect those types of things. So a lot of our colleagues are wondering what's like a low touch or low fidelity way to start introducing the idea of observability to a client, or, you know, you don't even have to be a consultant just to your team so that you can start getting familiar with these concepts?

[00:29:15] Charity: First of all, I think what's really interesting to realize is, people only think that the way they're doing things now is easy because it's easy to them, because it's what they grew up doing. If you take kids out of college and you put them in front of like a Datadog or a honeycomb.io, they understand honeycomb.io so much more easily than they understand what those Datadog graphs are.

It's just speaking to them in the language of their code, you know? And, and it just makes sense. It's more intuitive, you know, you break down, you know, there's all of these contortions that we have to stretch ourselves into in order to get meaning out of these graphs. And you can't take a metrics graph and like slice it and dice it and dive down deeper or follow the trail of breadcrumbs.

You're just kind of flipping through dashboards is all you're doing. And there's not really a story there. So the way that people are doing it now is the hard way, how much that helps or not is kind of, you know, we kind of have to like help a generation grow up, seeing it a different way, but we can coexist really well with existing solutions.

Like, you know, your Datadogs, your New Relics. Eventually everyone runs into a wall with those solutions. You just can't ask the question that you need to ask. You just can't, you know, it's usually something high cardinality, you know, you just want to ask something about each of your users. Ope, can't do that.

You want to ask something about each of your nodes. Ope, can't do that. You know, you just can't do that and people get frustrated. And when you tell them here's a solution where you can at least solve that problem, they tend to take to it pretty quick.
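To make the high-cardinality point concrete, here is a minimal sketch in Python. It is not Honeycomb's implementation or API. It only assumes that each request is recorded as one event with arbitrary fields; the field names (`user_id`, `node`, `duration_ms`) and the sample values are made up for illustration. Because the raw events are kept rather than a pre-aggregated counter, the same data can be grouped by any field, including one with a distinct value per user.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical events: one dict per request. Every field name and value
# here is invented for illustration.
EVENTS = [
    {"user_id": "u-1001", "node": "web-1", "endpoint": "/export", "duration_ms": 120, "status": 200},
    {"user_id": "u-1002", "node": "web-2", "endpoint": "/export", "duration_ms": 2400, "status": 200},
    {"user_id": "u-1002", "node": "web-1", "endpoint": "/export", "duration_ms": 2600, "status": 500},
    {"user_id": "u-1003", "node": "web-2", "endpoint": "/home", "duration_ms": 35, "status": 200},
]


def mean_duration_by(events, field):
    """Group events by any field; return (value, mean duration) pairs, slowest first."""
    groups = defaultdict(list)
    for event in events:
        groups[event[field]].append(event["duration_ms"])
    averages = {value: mean(durations) for value, durations in groups.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # "Ask something about each of your users", then about each of your nodes.
    print(mean_duration_by(EVENTS, "user_id"))
    print(mean_duration_by(EVENTS, "node"))
```

Grouping by `user_id` and grouping by `node` use the same function and the same data, which is the kind of question a single pre-aggregated metric cannot answer after the fact.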

[00:30:44] Jerome: Yeah. So what I'm hearing is like, you almost want to lead them to the point of where they feel that pain viscerally, to be able to say, oh, you know, there's, there's an easier way of doing this.

[00:30:55] Charity: That was our entire first couple of years of marketing strategy was taking the people who were just like, you know, if you say the words, like infinite custom metrics to somebody who's used to using metrics tools, they're like, "what!?" But yeah, it's just like infinite custom metrics because they're not metrics. They're just fields in the event. You can- they're effectively free. Just add more of them, you know. But we've worked really hard too, to make onboarding easy so that you should just have to like install a library, install a package or whatever, and you get it running basically from the start.

And you can go in and you could do more instrumentation. You can, you know, introduce spans and all this stuff, but you don't have to. You actually get a lot of value even if you're just using like nginx or something, or, you know, Amazon's, you know, VPCs. Even just using it for your web logs lets you slice and dice and figure out where the errors are coming from in the system. You know? So we made it as easy as we can. There's some compelling reasons to use it, but I definitely recognize that like generations have grown up learning to think about it in a very different way. So it's kind of the hill that we're on right now.
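As a rough illustration of the web log idea, the sketch below parses access log lines into events and counts server errors by whichever field you pick. It assumes lines in the widely used common log format; the sample lines, the `parse_events` and `errors_by` helper names, and the choice to treat status 500 and above as an error are all assumptions for this example, not anything Honeycomb ships.

```python
import re
from collections import Counter

# Regex for the leading fields of a common-log-format access log line.
LOG_LINE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+)'
)

# Made-up sample lines, for illustration only.
SAMPLE_LOG = """\
203.0.113.7 - - [10/Oct/2021:13:55:36 +0000] "GET /home HTTP/1.1" 200 5120
203.0.113.9 - - [10/Oct/2021:13:55:37 +0000] "POST /checkout HTTP/1.1" 502 312
198.51.100.4 - - [10/Oct/2021:13:55:38 +0000] "POST /checkout HTTP/1.1" 500 298
198.51.100.4 - - [10/Oct/2021:13:55:39 +0000] "GET /home HTTP/1.1" 200 5120
"""


def parse_events(text):
    """Turn each log line that matches the pattern into a dict of fields."""
    events = []
    for line in text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            event = match.groupdict()
            event["status"] = int(event["status"])
            events.append(event)
    return events


def errors_by(events, field):
    """Count events with a 5xx status, grouped by the given field."""
    return Counter(e[field] for e in events if e["status"] >= 500)


if __name__ == "__main__":
    events = parse_events(SAMPLE_LOG)
    print(errors_by(events, "path"))         # which endpoints are failing
    print(errors_by(events, "remote_addr"))  # which clients are seeing failures
```

The same parsed events answer both questions, so adding another field to the pattern adds another way to slice without changing the counting code.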

[00:32:04] Jerome: I'm just really curious, Charity, what, if anything, you're really excited about right now, either in observability or management practices or things that you're seeing out in the world that make a lot of sense to you and you want to experiment with, or you want to kind of spread the word about?

[00:32:23] Charity: You know, I probably would've said OTel, if we hadn't already covered all that. It's really exciting to me that, you know, we're starting to standardize on something. Finally. I have been taking a lot of heart from, to this day, the blog post that people seem to associate me with the most is the pendulum one. Which means that I get pinged from people all over the world when they've decided to give up management and go back to engineering.

And they're just like, I'm so happy now. They're like, I couldn't have done it unless, you know, I read this and I'm just like... and it's just like these little droplets of joy that keep feeding me. It's, it's like, it makes me so happy to see people doing what they love. And it's like, yes, one more unhappy manager down for the count and happy engineers springing up.

So, you know, that brings joy to my life.

[00:33:07] Jerome: That's awesome. Yeah. I love that. I love that post and I actually just shared it with our internal team that's looking at senior career pathing ourselves.

[00:33:21] Charity: Yeah, the whole myth that there's just one direction. And you must like climb, climb, climb, climb, climb.

It's so pernicious, you know. It's more like you reach a level of fluency and then you get to explore. And sometimes you may like climb in the hierarchy. Sometimes you may go down the hierarchy. If we could just like drain the hierarchy of its poison, of its power dynamics, and its, and its, like... we associate it in our little monkey brains with being better and more powerful and stuff, which makes it really hard for the ego to take what feels like a step down.

But it's so unhealthy because honestly, the higher up you get, in double quotes, the farther you get away from doing the real work, which is what brings most people actual satisfaction. It's seeing their code in the wild, fixing problems with their hands, seeing users happy, knowing that they contributed. You know, when you're like three or four levels above that, you know, it gets very, you know, the altitude's pretty high. It's a little thin up there.

[00:34:23] Thomas: Charity, thank you so much. You've been such a generous guest.

[00:34:27] Charity: Absolutely. You asked me questions that I've never heard before, so.

[00:34:33] Thomas: That was our goal because you're so generous. I mean, all of your writing, and we'll include it all in the show notes. You're on Twitter, you have your o11ycast, you write the company blog, your personal blog, you do talks. I mean you're definitely, like we said at the beginning, someone we were super stoked to talk to. So thank you so much.

[00:34:53] Charity: Thank you. Thanks for having me.

[00:35:00] Thomas: Jerome.

[00:35:01] Jerome: Thomas.

[00:35:02] Thomas: I just can't with that conversation. That was so amazing.

[00:35:07] Jerome: Agreed. Yes.

[00:35:09] Thomas: I am like buzzing.

[00:35:11] Jerome: I can tell.

[00:35:13] Thomas: There was so much more I wanted to talk about and that we could have gotten to, I'm so grateful for the time that we had.

[00:35:20] Jerome: Likewise.

[00:35:22] Thomas: Just, my big takeaway is like that this is engineering.

[00:35:27] Jerome: Say more, yeah.

[00:35:28] Thomas: Observability isn't this like magical silver bullet, if you will, thing that you just like plug into your system and then like you're off to the races. Mind you, honeycomb.io clearly has done a lot of the work to make that as frictionless and seamless as possible. But at the end of the day, like the principles... and I like what you said, like, is this a design principle? Like the principles behind instrumenting your system to be observable and what to do with that information and how to act on it is engineering.

And if you don't have the tooling for that, or you don't have the mindset for that, then that's really a question about your team dynamic or the structures in place or the environment that you work in. And that needs cultivated just like, you know, the tooling does. But ultimately, like it's not magic. It's not some secret thing. It really is just like hardcore, not even hardcore... [laughter] I mean, it's engineered. Hardcore [laughter]. I just, you know, I don't want to put it on a pedestal. I think it can feel, at least for me, it can feel hardcore, like "whoa, data" and you know, but any engineer, I think in the right environment, given the correct tools and space and time can be as good as any engineer out there, so.

Anyway, I think that that's what was really like... I love that Charity was so humble and so willing to share everything that goes into it. And you're like, wow, that's engineering. How about for you? Like what are you... I can see you're like, you're really excited and buzzing, too.

[00:37:12] Jerome: Yeah. I mean, you're saying things that really resonate with me. I think like taking your idea of it's engineering, that really resonated with me too. Like we talked about how there was the seam that we created between engineering and ops as a way to kind of specialize and how probably that seam should've never existed in the first place. And that to me, has the implications that observability is going to be kind of like this fundamental thing, or is starting to be this fundamental thing much like clean code or testing your code.

And it's going to be kind of this, this cornerstone of quality for systems, now and in the future. It was just really exciting to have the expert on observability, come and talk to us. Chat about what it is, what it isn't, how to get started in it. I was just really stoked for the conversation and it delivered in ways that I couldn't even imagine. And Charity just a heads up, we'll be reaching out again so we can do a part two.

[00:38:23] Thomas: A hundred percent. And yeah. Just a little meta commentary. Just every question that we were like, you know... as we were doing research for this episode, it's like every question we thought we had, like, she had found a way to answer it, whether it was a blog or podcast or a talk. Like, she's so generous and there's just so much to unpack and to dig into. So this was just exciting on so many fronts. I mean, this is like, I guess this is what a celebrity sighting feels like. I was like, oh my gosh, you're here. Hi. Oh, I'm a nerd. That's fine.

[00:39:07] Jerome: It's fine.

[00:39:13] Thomas: Thank you so much for listening to this episode of Collaborative Craft.

[00:39:17] Jerome: Check out the show notes for a link to this episode's transcript and to learn more about our guest.

[00:39:22] Thomas: Collaborative Craft is brought to you by 8th Light and produced by our friends at Dante32.

[00:39:27] Jerome: 8th Light is a software consultancy dedicated to increasing the quality of software in the world by partnering with our clients to deliver custom solutions to ambitious projects.

[00:39:37] Thomas: To learn more about 8th Light and how we can help you grow your software development and design capabilities, visit our website, 8thlight.com.

[00:39:46] Jerome: Thanks again for listening, don't forget to subscribe to Collaborative Craft wherever you get your podcasts.

[00:39:52] Thomas: You can also follow us on Twitter at @CollabCraftPod to join in the conversation and let us know who you'd like to hear from next.

[00:40:00] Jerome: We'd love to hear from you!

[00:40:01] Thomas: Bye!

Jerome Goodrich