It's a Process: More than Machine Learning with Hana Lee

“You can build an initial model, but you have to keep updating it. You have to continually operate it, doing things incrementally rather than trying to solve the whole problem in one go. One of the benefits is that we have a good framework for building things incrementally.”

— Hana Lee

Machine learning is only as good as the humans who create the program. Even machines have biases. Hana Lee is a Principal Software Crafter at 8th Light, where she's currently leading data engineering work with a global reinsurance client to take prototype machine learning models that predict insurance risk and turn them into production-ready, scalable services.

Before joining 8th Light, Hana was immersed in academia. She completed a postdoctoral fellowship at the University of Chicago, where she studied the genomics of host-pathogen interactions. She holds a Ph.D. in molecular biology from UC Berkeley, an AB from Harvard University in biochemical sciences, and an AWS certification as a Solutions Architect Associate.

In our conversation today, Hana discusses her life in academia, the case for creating cross-functional teams, and how to frame data problems for machine learning success.

Context Links:

The Failure of Big Data
85% of Big Data Projects Fail
87% of Projects Never Make it into Production

If you'd like to receive new episodes as they're published, please subscribe to Collaborative Craft in Apple Podcasts, Google Podcasts, Spotify, or wherever you get your podcasts. If you enjoyed this episode, please consider leaving a review in Apple Podcasts. It really helps others find the show.

This podcast was produced by Dante32.

Episode Transcript

[00:00:00] Thomas: I learned my laptop only has 8GB of ram. It's basically an iPad, always needs the fan. It's also 97,000 degrees in this closet. Yeah, somebody was like, "Hey, why don't you open your 'About This Mac' thing" while I was sharing my screen. And they were like, "Oh God." I was like, "Oh, is that bad?" Gosh, you wouldn't know I work in tech.

[00:00:25] Jerome: Hello everyone. I'm Jerome Goodrich.

[00:00:27] Thomas: And I'm Thomas Countz.

[00:00:29] Jerome: And you're listening to Collaborative Craft, a podcast brought to you by 8th Light.

[00:00:35] Thomas: So, Jerome, who is our guest for today?

[00:00:39] Jerome: I am so glad you asked, Thomas. Today, we're talking with Hana Lee. Prior to joining us here at 8th Light, Hana Lee completed a post-doctoral fellowship at the University of Chicago, where she studied the genomics of host-pathogen interactions.

She holds a PhD in Molecular Biology from UC Berkeley, an AB from Harvard in Biochemical Sciences, and an AWS Certification as a Solutions Architect Associate. Now at 8th Light, Hana is a Principal Software Crafter, where she's currently leading data engineering work with a global reinsurance client to take prototype machine learning models that predict insurance risk and turn them into production-ready, scalable services.

Her system is responsible for the automated data ingestion and feature engineering pipeline, delivering real-time predictions, continually validating the model's performance, and generating reports with interactive visualizations so that non-technical stakeholders can make informed decisions about the model's performance.

Hana has graciously agreed to join us for a conversation today, to talk about the data science and data engineering work required to build, deploy, and tune machine learning models at a global scale.

[00:01:55] Thomas: I'm so stoked for this conversation. And I love the distinction here that she's literally working at such a gigantic scale because when we were first talking about speaking and having a conversation with Hana, I was like, oh, I've done some machine learning.

And I'm just immediately laughing because the machine learning that I've done... And literally, if you wanted to find this, you could, it's like identify the difference between a Mourning Dove and a Sparrow. And it's like, the training set was like 80 images and yeah, not exactly a global scale. But I feel like I've had these like bumps into machine learning, kind of as an experiment and as a curiosity, but never from like a business perspective. Never from like the scale of, we have an immense amount of data and we need to make better business decisions. That's a whole different ball game.

[00:02:51] Jerome: Yeah. I mean, for my part, I think ML is one of these things, it's kind of like Design, that it's obscured for me by some sort of conceptual curtain. You know, I feel like it's very much in the realm of academia and somewhat orthogonal to more traditional software development.

And I'm really excited that we have Hana who came from that world and can kind of bridge that gap of these two places that ML resides. And can speak to us, not only about how it's used in research, but also how it's used on client projects. Much like you, my experience with ML is pretty limited. While ML is used in academia, and I do kind of put it on this pedestal and it has, you know, this kind of arcane mystique to it, it's also very much prevalent in our day-to-day experience as software developers. Things like AWS, SageMaker, TensorFlow... These are becoming household names and they're seemingly making ML more available to your average developer.

So I really want to understand by talking to Hana, how far that goes. And I'm really curious to see how her deep experience with these technologies maps to her current work as a principal crafter and the lessons that she's learned from living in both of these worlds.

[00:04:19] Thomas: Yeah, definitely. If I think of ML, I think of like AI and self-driving cars. You know, all of this... or even like algorithmic bias, like these kind of big headline flashy things. It's like, no, actually, if you work in tech, we are very close to it. Just being kind of the norm, another tool in our toolbox. Now I know that it's not just that easy and you can't just decide you want to use ML and it's magic.

I mean, I worked tangentially with kind of data engineering and I understand it's, we're not quite there yet to just be able to pull something off the shelf, but we're pretty close. There are a lot of things that machine learning can help businesses do. And I think we would be wise to learn a little bit more about it.

[00:05:08] Jerome: For our own careers.

[00:05:10] Thomas: Exactly. Oh gosh. What is it? Copilot and...

[00:05:13] Jerome: Yeah, exactly.

[00:05:14] Thomas: If we don't learn how to write the AI, the AI is going to write us.

[00:05:19] Jerome: Hana, please save us.

[00:05:22] Thomas: So with that, let's hop over to our conversation with Hana.

[00:05:27] Jerome: Let's do it! So excited.

Thank you so much for joining us, Hana. To start off you are a UChicago Alum, aren't you?

[00:05:43] Hana: Well, I don't know if it counts as an alumnus when you work there as a postdoc. But yeah, I did work there for two years before joining 8th Light.

[00:05:53] Thomas: So you were there studying the genomics of host-pathogen interactions, is that right? How do you go from that to the software world?

[00:06:06] Hana: So, my background is in computational biology and genomics and basically throughout my research career, I've had to deal with three large data sets. And at a certain size, you have to use code to analyze those datasets because it's just too big to do otherwise.

So, in grad school I learned R and then Python because that was around the time when biologists were all switching over to using Python.

[00:06:38] Jerome: You were spared from MATLAB.

[00:06:41] Hana: Yeah. And then basically, I had to sort of keep improving my coding skills and you know, just as that interest grew, I started to think, well, maybe I can make a career of doing this in tech instead of continuing to suffer in academia.

[00:07:07] Thomas: And here you are, you have a great career in tech.

[00:07:11] Hana: Yeah. And I think really, the 8th Light apprenticeship model is really great for that because it allowed me to make that transition because I had done a lot of coding before and actually was pretty comfortable with Python, but, that's still a leap away from writing software. Because, here's the dirty secret, academics might write a lot of code, but no one else uses that code except themselves. So, writing code in a way that other people can contribute to it. And also, simply just use it. That is a craft, right? That is a practice. And that's why we're here. So, that was a large part of what I needed to learn and what I did learn during my apprenticeship.

[00:08:01] Jerome: Was there something that drew you to that? When you left academia, did you know that you wanted to deepen your expertise in writing software? Did you have a sense that something was missing?

[00:08:15] Hana: So, I had been sort of exposed to some ideas. There has been a lot of I would say, activity within academia to adopt better practices from the software world. And a large part of this is not so much because, I mean, not a lot of software is being written. But we want our data analysis to be reproducible.

And so reproducibility is a big movement within academia and a large part of making your data analysis reproducible is to version control your code, put it up on a place like GitHub so that other people can access it. Then make your code readable so that other people can understand what you did. So all of those sort of very basic software engineering practices are being adopted in academia. A paper's published and then not only can no one reproduce their data analysis, but no one remembers how they did their own data analysis in the first place, right? So, that's actually caused quite a few scandals, like particularly when I was in grad school. And this has been kind of the response to that, to try to adopt software engineering practices, to increase reproducibility, increase transparency.

[00:09:39] Thomas: Before we go into talking about the kind of work you're doing now with your client, in your experience, what's the difference between say data analysis, data science, data engineering, and how is that related to machine learning? And in particular, I'd love to know about, in your academic background, which part of each of those skill sets were you employing? Were you doing machine learning in academia or was that something you picked up once you joined the tech industry?

[00:10:14] Hana: Yeah, this is actually a really good question because a lot of these terms are fairly new.

I mean, not data analyst, but, you know, data engineer, data scientist, a lot of these terms are fairly new, and I think finally there's been a consensus about what they mean, but you can still see some discussion over, well, exactly what makes a data engineer different from a data scientist. Et cetera, et cetera.

I think nowadays in the industry, it's generally agreed that a data analyst may do things like generate plots from data sets. You know, present different insights by looking at interesting data that is relevant to a business. Business intelligence, I think, is also a term that is often equated with data analysis.

And then the line between that and data science is usually when machine learning becomes involved. This is not to say that data scientists have to do machine learning, but frequently they do. The machine learning part though, often in the data scientist's job is actually a small portion. There's like this adage that says like 90% of a data scientist's job is cleaning the data.

And in some sense, that's what data analysts do, too. You know, they have to get the data set, clean it and normalize it. Make sure it can be used for analysis in a robust way. Figure out what to do with the missing values. Figure out whether, you know, this field from this file is actually the same as that field from that file, even though they have different names, you know, et cetera. Like a lot of that work, data scientists and data analysts both do.

But I think the idea is that a data scientist on top of that will have to bring machine learning into the picture. And that also means that they have to do a lot more coding as a result. And then the data engineer, where did that come in? Well, you know, despite this image of the data scientist straddling both worlds of software engineering and statistical analysis and being an expert in both, there actually is a lot of I guess specialty to software engineering that it's hard to ask a data scientist to have. And frequently a lot of this is around like infrastructure and you know, setting up the right tools and platforms for those data scientists. So now what you're often finding is that data scientists will have like particular tools that they use to write their code. And at larger scale companies, you'll find that data scientists will sort of turn over the work of cleaning and processing the data to the data engineers once it's become systematized. You know, like once you, once you've looked at these data sources and you know, okay, we have to do these and these steps every time to process the data. All right, we're going to automate that. Let's give it to the data engineers.

Now, all of that being said, different people at different organizations will occupy different areas of these roles. You know, like they're not neat and tidy boxes. Data scientists at some organizations may have data engineering responsibilities. Data engineers at other organizations may have some data science responsibilities. So, you know, there's always overlap, and that's kind of why there's still maybe not a completely clear picture of what the differences between these terms are.

[00:13:58] Thomas: Yeah, it sounds like they're like disciplines more than individual roles.

[00:14:05] Hana: Yes. That is a very good way of putting it. Yeah. Yeah. And I think actually I recently attended a virtual Google conference, Applied ML. And a lot of what people kept bringing up in their keynote talks was the importance of having cross-functional teams. So making sure all of these disciplines are represented by different people.

It doesn't necessarily matter who represents which discipline or what title they have. But making sure that all the different areas are present on a team and are all working together instead of being siloed into separate teams. So don't have your team of data scientists sitting in one area and your team of data engineers sitting in another area. Build a team that is able to carry out all the functions that are needed to operate a machine learning model in production.

[00:15:02] Thomas: Yeah, I interrupted your answer when you were going to go into your academic history. But I also want to flag that it's interesting. We've had some other guests mention that same kind of approach, that same mentality of cross-discipline teams. In that regard, I'm thinking of John Wettersten, who was our first guest who mentioned that in terms of design and that being important for a software product.

And it sounds like for a data product, you have these disciplines that are unique to building data products. And it sounds like the same kind of perspective or the same approach might be just as valuable there.

[00:15:39] Hana: Yeah, absolutely. I think there's a lot of parallels and yeah, I've been listening to your podcast episodes and I get a lot of inspiration from it even when they're not, you know, directly to do with data.

[00:15:54] Jerome: Well, thanks. I'm going to ask you the same question that I asked John, which is this notion of cross-disciplinary or interdisciplinary teams seems very important. And you mentioned, you know, you're going to these contemporary or state-of-the-art talks on ML and data engineering, data science. And these things keep cropping up. In your experience, is this something that you see often in the wild? Is this something that companies are doing? Or does it keep coming up because it's a difficult thing to get right and people are offering guidance on ways in which to structure these teams?

[00:16:41] Hana: So I absolutely think that it's hard to get right. I think it's something that everyone agrees is a good idea and wants to build, but it's hard to build. Different organizations are trying different ways of getting at it.

And I don't know if I know the magic formula. But I can tell you for example, the machine learning model that I've put into production and, you know, made ready to be used in business that was originally developed by a group of data scientists and actually like more your traditional academic statisticians even.

And that was developed before, you know, 8th Light was even brought onto the project and at the parent company of my client. And when I started on the project, they brought us over and they handed over the code and said, here you are, here's your model. And that's exactly the kind of thing that's the reason why these conferences all say, you know, "don't do it that way, build a cross-functional team," just because how much easier would it be if the data engineer, AKA me, was involved from the beginning and knew all the rationale behind the decisions they made.

You know, even now I have a list of I think 170 features, that is like inputs that go into the model. And I don't actually know why all of them were chosen because I was given a prototype model and was not involved from the beginning. And you know, we've had conversations with the people who developed that prototype and often their answer is, "oh, that's very interesting. I don't remember."

And so then, what I have to do is sort of experiment and try to, you know, reverse engineer, I guess is the term, like reverse engineer why they made that choice to get a better understanding. You know, when we're in that boat, when we are given a prototype, we can certainly do our best to, you know, make a service out of it. We have been able to do that successfully, but I definitely believe that the better model is for, you know, data scientists, data engineers, analysts, whoever is involved in the product to be all working collaboratively from the beginning. And sharing all of their insights too, because, you know, if I had been involved in the development of the prototype model, I could have maybe told them, you know, for example, your thousands of lines of SQL that engineer these features, it's great, but I really need some tests around this. And maybe that means we don't write it in SQL, maybe we write it in a different language.

Maybe I can help you do that. Maybe I can also help you to figure out how to modularize this code so that it doesn't always have to be run in the same sequence otherwise it breaks, et cetera, et cetera. Right.

And this is not to critique those statisticians and scientists, they built a great model. It performs well. But when it comes to maintaining that code and, and making improvements to it, you know, there are certain software engineering practices that they could have benefited from from the very beginning. My ideal at least would be for a data product to be developed with everyone on board and everyone involved in the very initial decisions, even of how to go about creating a new model.

[00:20:33] Thomas: I love that you mentioned this specific instance because I'd really love to dig into it as much as you're willing. So I want to talk about more generally, like how do you get a model into production, but even before we get into that, how do you... if something as complex as a model with tons and tons of inputs is thrown over the fence to you, how do you even begin to probe at it and to ask questions of it?

[00:21:03] Hana: Well, one of the things that you can certainly do is just test the model. But a lot of what we've worked on is figuring out good procedures for validating our model's performance. We have a training set that the model is trained on, and then we have a validation set, which we check to make sure that the model's performance is as good on data that it hasn't been exposed to as it is on the data that it was originally trained on. But in addition to that, we also have what we call like sort of external validation and that's the actual sort of business orders that we receive and we test against that data and see how the model performs there. And often what helps here is having really good domain knowledge.

So we're very lucky that at the client, one of the executives involved has years and years and years of experience in the insurance industry. And so he has a good sense for, "when we look at this order, is it going to be high risk or low risk?" And we've relied a lot on his intuition to try to bring a better understanding of how to fine-tune the model's performance.

The data set that we're dealing with is about properties. And a lot of the signal that they paid close attention to is whether there are foreclosures on those properties. And so as a result of that intuition of foreclosures being important, I dug down and looked at all the features that relate to foreclosures and also worked with the data analysts on the team to make sure, well, does this actually reflect what we expect given the external data, this data set, for example, is it reporting foreclosures accurately? Are we deriving information? Because we also have some derived features that we engineer from the data set we receive. So are we deriving certain conclusions about those foreclosures accurately based on what we know about this property from other sources?

So, we've done a lot of that kind of checking to get a better handle on what the important features in the model are actually contributing and whether they're being calculated accurately. But in actuality, when it came down to it, we ended up having to rebuild a lot of it. Not so much the model; the core model we didn't have to touch so much. But all the data analysis and all the wrappers around the feature engineering, a lot of that had to be rebuilt by us because we were moving it to a different cloud.
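The three-way check Hana describes (training set, held-out validation set, and "external" business orders) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the client's code: the `predict` callable, the dataset names, and the use of plain accuracy as the metric are all hypothetical stand-ins.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical shape: (feature dict, label), where label 1 means "high risk".
Example = Tuple[dict, int]


def accuracy(predict: Callable[[dict], int], examples: Sequence[Example]) -> float:
    """Fraction of examples where the prediction matches the label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)


def performance_report(
    predict: Callable[[dict], int], datasets: Dict[str, Sequence[Example]]
) -> Dict[str, float]:
    """Score the same model on every named dataset."""
    return {name: accuracy(predict, examples) for name, examples in datasets.items()}


def generalization_gap(report: Dict[str, float], baseline: str = "training") -> Dict[str, float]:
    """How much worse each dataset scores than the baseline (positive = worse)."""
    base = report[baseline]
    return {name: base - score for name, score in report.items() if name != baseline}


if __name__ == "__main__":
    # Toy stand-in for a trained model: flag any property with a foreclosure.
    def toy_predict(features: dict) -> int:
        return 1 if features["foreclosures"] >= 1 else 0

    report = performance_report(
        toy_predict,
        {
            "training": [({"foreclosures": 0}, 0), ({"foreclosures": 2}, 1)],
            "validation": [({"foreclosures": 1}, 1), ({"foreclosures": 0}, 0)],
            "external": [({"foreclosures": 0}, 1), ({"foreclosures": 1}, 1)],
        },
    )
    print(report)
    print(generalization_gap(report))
```

A large gap between the training score and the validation or external score is the signal to go digging into individual features, which is the kind of investigation described above.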

[00:23:56] Jerome: When you say feature engineering, I typically think of features on a, on a web app or something like that. What are features for an ML model?

[00:24:06] Hana: Right. So features is the ML term for inputs, basically. So they're anything that you use as an input into your model and in our case, you know, we start off with, I mentioned this earlier, but we start off with a pretty big dataset of, it's actually all public data about properties in the United States.

And it's provided to us by a third-party aggregator. And that just gives us a lot of, sort of raw data with all this information, but then there are some, what we call derived features. Derived features are basically any inputs that we sort of calculate as a combination of fields in the original raw data.

So, you know, I mentioned earlier, you know, like I had to check whether we were calculating things correctly. Like, you know, we have things that we look at, like, you know, what are the number of foreclosures on this property? And a lot of that has to be calculated from this raw data. So either that means aggregating several records in the data, or it means in some cases, just like, like directly calculating like a ratio between two fields.

And then there are like even more complicated cases where, we'll like check, you know, if this value is in this field and this value is in this field, then we'll, you know, the derived feature will have this value and et cetera, et cetera. So as you can imagine that actually ends up becoming a lot of code.

So, I have a lot of this feature-engineering code that was written in SQL to calculate these additional derived features that go into our model.
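The three kinds of derived feature Hana lists (an aggregate over several records, a ratio between two fields, and a conditional rule) can each be written as a small, separately testable function. The sketch below is illustrative only: the field names (`property_id`, `event_type`), the loan-to-value ratio, and the "distressed" rule are hypothetical examples, not features from the actual model, which was written in SQL.

```python
from collections import Counter
from typing import Iterable, Optional


def foreclosure_counts(records: Iterable[dict]) -> Counter:
    """Aggregate feature: number of foreclosure records per property."""
    return Counter(
        r["property_id"] for r in records if r.get("event_type") == "foreclosure"
    )


def loan_to_value(
    loan_amount: Optional[float], assessed_value: Optional[float]
) -> Optional[float]:
    """Ratio feature: one raw field divided by another.

    Returns None when either input is missing or the denominator is zero,
    so the missing-value decision is explicit rather than accidental.
    """
    if loan_amount is None or not assessed_value:
        return None
    return loan_amount / assessed_value


def distressed_flag(occupancy: str, foreclosure_count: int) -> int:
    """Conditional feature: 1 only if both conditions hold (hypothetical rule)."""
    return 1 if occupancy == "vacant" and foreclosure_count > 0 else 0


if __name__ == "__main__":
    raw = [
        {"property_id": "A", "event_type": "foreclosure"},
        {"property_id": "A", "event_type": "sale"},
        {"property_id": "A", "event_type": "foreclosure"},
        {"property_id": "B", "event_type": "sale"},
    ]
    counts = foreclosure_counts(raw)
    print(counts["A"], counts["B"])
    print(loan_to_value(150000.0, 200000.0))
    print(distressed_flag("vacant", counts["A"]))
```

Keeping each feature in its own function means each can be run and tested independently, which speaks to the earlier concern about code that breaks unless it is always run in the same sequence.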

[00:25:50] Jerome: Thanks for the explanation.

[00:25:52] Thomas: One of the topics that we touched on kind of in our pre-discussion was this idea of a business going from "I have all this data and I'm supposed to harness this for some business intelligence or some product," and a lot of businesses wanting to capture this influx of ML knowledge that's being put out currently from both academia and in practice. But then at the same time, like not a lot of businesses succeeding at that for one reason or another. Their product doesn't get shipped or their model doesn't end up being deployed. Can you talk about where that trend maybe comes from or why you think that pattern seems to be happening over and over again?

[00:26:40] Hana: Right, right. Yeah. So just to provide some context I think it was Gartner who originally sort of... well, it's funny because Gartner first said, you know, businesses that are able to successfully harness data have a competitive advantage over businesses that don't.

And then a couple of years later, they said like, I forgot the exact percentage. I think maybe about 80% of businesses are not able to successfully harness their data and, you know, put their data science products into production. And I think really what it boils down to is that you need to be able to frame the problem that you want to solve with data in a realistic way.

So, that means that you can't just say, oh, I expect machine learning to solve my problem of, let's say, figuring out what features customers like most in my app, say; that would have relevance to us. You can't say that you want machine learning to solve that problem, and then just expect the solution to fall into your lap.

You need to make sure that you are collecting the right data and then that the data that you're collecting actually can be a proxy for what you are trying to reflect. So this is like the metrics problem, right? Is this actually the right metric for the whole customer satisfaction with features?

Often the metric collected is like the time that a customer spends interacting with it. And that is certainly one measurement. And it does have some correlation with this, you know, complex phenomenon of customer satisfaction, right. But it can also be that actually a customer is really frustrated with this feature and that's why they're spending all their time refreshing that page or like clicking that button or whatever.

And yeah, a lot of the things that businesses want to solve with data are really complex phenomena that can't be measured directly. So we have all these proxies that we measure instead to try to get at that. But if you don't choose the right proxies, or even if you just limit yourself to one, you don't get a full picture.

And I think that's one of the issues that's contributing to this. And then, the second aspect that is sort of related is also that I think people find it very hard to reason about data and a large part of this is not really our fault because we, as human beings are basically optimized to recognize patterns everywhere.

Right. But to be able to think critically about data, you have to realize that there's actually a lot of spurious patterns that can be created by noise or by randomness. So you have to be able to set up your problem in the right way so that you aren't tempted to over-interpret. A similar problem in machine learning is overfitting: detecting patterns where there are none.

You know, one of the things that's nice about software is that, you know, it's pretty predictable, right? You give the computer exact instructions and it follows them often to a fault. And the problem with data is that we're talking about probabilities here and then things become a lot more fuzzy. And randomness starts being something that you have to be able to reason about and account for. And we're, we're just, you know, we have a lot of cognitive biases that make it hard to fight against that and to think about those things.
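The point about spurious patterns can be made concrete with a small simulation. The sketch below is an illustration under assumed numbers (200 noise features, 20 rows, a fixed seed), not anything from the project: every "feature" is a coin flip with no relationship to the label, yet picking the best-looking one from a small sample still produces an impressive in-sample score that does not carry over to fresh data.

```python
import random
from typing import List, Tuple


def agreement(feature_column: List[int], labels: List[int]) -> float:
    """Fraction of rows where a binary feature equals the binary label."""
    matches = sum(1 for f, y in zip(feature_column, labels) if f == y)
    return matches / len(labels)


def random_bits(rng: random.Random, n: int) -> List[int]:
    """n independent fair coin flips."""
    return [rng.randint(0, 1) for _ in range(n)]


def best_noise_feature(
    n_features: int = 200, n_rows: int = 20, n_fresh: int = 1000, seed: int = 0
) -> Tuple[int, float, float]:
    """Pick the noise feature that best 'predicts' the label in a small sample.

    Returns (index of the winning feature, its in-sample agreement,
    its agreement on fresh data). The feature is pure noise, so its
    fresh-data column is simply new coin flips.
    """
    rng = random.Random(seed)
    labels = random_bits(rng, n_rows)
    columns = [random_bits(rng, n_rows) for _ in range(n_features)]
    scores = [agreement(column, labels) for column in columns]
    best = max(range(n_features), key=scores.__getitem__)

    fresh_labels = random_bits(rng, n_fresh)
    fresh_column = random_bits(rng, n_fresh)
    return best, scores[best], agreement(fresh_column, fresh_labels)


if __name__ == "__main__":
    index, in_sample, fresh = best_noise_feature()
    print(f"feature {index}: in-sample {in_sample:.2f}, fresh data {fresh:.2f}")
```

Searching across many candidate features on few rows is exactly the setup in which a held-out validation set earns its keep: the in-sample winner looks like a signal until it is scored on data it was not selected on.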

[00:30:35] Jerome: It sounds like a lot of the value that you're providing as a consultant is helping your clients' organizations make this commitment to data integrity, to understanding that you don't just get ML. There's many steps to this process and it is a process and it's going to get better and better, for instance, if you implement cross-functional teams, if you are being very diligent about what you're qualifying as a signal. Can you talk a little bit more about what that looks like in practice? Like, is there a resistance when you go into an organization that just wants to know, you know, what feature do my customers like the most?

[00:31:27] Hana: Yeah. I don't think there's necessarily resistance, but I think there's like a lot of analysis paralysis. I think there's a lot of, "well, we want to do this. And we want ML to come in and solve all our problems, but, you know, we're collecting all this data, but how do we get from one to the other?"

And yeah, it is a process. And I think a large part of the paralysis comes from the idea that it has to all be done at once. Like you build a model and then that's it. But that's actually not it, right? You can build an initial model, but you have to keep updating it. You have to keep checking its performance.

It doesn't end with just developing a model and then putting it behind a web server and saying, okay, go forth and deliver predictions. You have to continually operate it. And I think bringing sort of our ideas about, you know, agile software development and doing things incrementally rather than trying to solve the whole problem in one go is one of the benefits that we could bring to the space.

And which is not to say that companies aren't trying to do it, but I think it's hard to just do it if you don't have a framework for it. And one of the benefits that we have is that we do have a good framework for building things incrementally.

[00:32:51] Thomas: So far, we talked a little bit about feature engineering. You mentioned a kind of data pipeline and this, like, thousands of lines of SQL. Kind of the other parts of data engineering, which I know you have experience with, are validating the performance of the model and tuning that, building in that feedback loop. And then also kind of generating useful information from the model for stakeholders who maybe aren't data analysts and can't look at raw numbers.

So yeah, could you fill in those two areas of data engineering with maybe some of your experience?

[00:33:31] Hana: Right. Sure. So as I mentioned, you know it doesn't end with just training a model and, you know, putting it behind a server, you want to actually make sure that it performs and continues to perform well.

So a large part of what I've been working on over the past year is to, first of all, develop processes around validation and then also to automate a lot of that and report those validation statistics in a way that the business stakeholders can have access to. That part's actually been really interesting to me because you know, you don't typically think as a data engineer that you would spend a lot of time thinking about how to communicate with the business about the model itself, but actually that is a huge part of the job and it's especially challenging because people do tend to look at ML models as these black boxes, and they're very mysterious and that's intimidating, right?

You can't get a hold of it and you know, you have potentially audit requirements. And so you wonder, well, how am I going to explain? One thing that I've done is, you know, there's false positives and false negatives that come out of the model. Those are two ways that the model can be wrong. And I've tried to translate that into business terms for the stakeholders.

So, you know, a false positive in our case is, you know, unnecessarily spending money on a low risk property that we thought maybe was higher risk than it was. So that's the cost to the business. And then the false negative is a property that turned out to be high risk, but actually was low risk. Or sorry, no, the other way: a property that turned out to be high risk that we thought was low risk.

And that actually is even more of a concern to the business, right. Because that's a potential insurance claim and they don't want to pay out that claim. So yeah, just like, you know, taking the extra step to translate these statistical metrics into business risks that they understand and then visualizing that so that they can see it clearly.

And I use that to help make decisions because there's always a business decision involved in a model. They can say, "oh, the model's output goes from zero to one." So it's, you know, like some number, right? And they could say, "oh, we're going to say, you know, 0.9 is high risk," or they could say 0.1 is high risk, right?

The actual interpretation of what those numbers really tangibly mean, well, that's a business decision. So being able to advise them on how to make that decision by giving them the information that they can interpret easily has been a big part of the work that I didn't anticipate when I first started on this project.
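A minimal sketch of the translation Hana describes, turning false positives and false negatives into a business cost at different cutoffs. The scores, outcomes, dollar costs, and function names below are all made up for illustration and are not taken from the client project.

```python
# Hypothetical example data: model scores in [0, 1] and whether each
# property actually turned out to be high risk (True) or not (False).
scores = [0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.95]
actual_high_risk = [False, False, True, False, True, True, True]

# Assumed business costs, for illustration only.
COST_FALSE_POSITIVE = 1_000   # money spent on a property that was low risk
COST_FALSE_NEGATIVE = 20_000  # potential claim on a property we missed


def count_errors(scores, labels, threshold):
    """Return (false_positives, false_negatives) at a given cutoff."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return false_pos, false_neg


def business_cost(scores, labels, threshold):
    """Translate the two error counts into a single dollar figure."""
    false_pos, false_neg = count_errors(scores, labels, threshold)
    return false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE


# Compare a few candidate cutoffs side by side for the stakeholders.
for threshold in (0.1, 0.5, 0.9):
    fp, fn = count_errors(scores, actual_high_risk, threshold)
    cost = business_cost(scores, actual_high_risk, threshold)
    print(f"threshold={threshold}: {fp} false positives, {fn} false negatives, cost=${cost:,}")
```

With these made-up numbers, a low cutoff produces only false positives and a high cutoff produces only false negatives, which is why where to draw the line is a business decision rather than a purely statistical one.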

[00:36:40] Thomas: Is there a way to cross validate with other datasets to maybe like find bias? I guess what I'm trying to say is like, is there a way to take that property data and somehow map it to like demographic data, for example, and see, "oh, are we more likely to classify a home owned by a person of color as high risk versus low risk."

[00:37:04] Hana: Yeah, no, that's actually a really good point. So I know that both Google and Amazon, and I wouldn't be surprised if Microsoft as well, on their cloud platforms specifically around their machine learning services have started to provide tools to check if there are any sort of systematic biases in your predictions that correlate with demographics. In our particular case we're actually, not really using any demographic data about the owner of the house. This isn't to say that there may potentially be some correlates with demographics cause you know, everything does end up you know, correlating somewhat. But at least all the data we're taking in is just about the property itself and not about the owner. There was this study done of a machine learning model that didn't actually take race as an input, but still managed to deliver biased outcomes by race. So, you know, it's a valid concern.

And certainly on our client project, because it's under scrutiny by government regulators, we've done a lot of- we've tried very much to make sure that there is no correlation with demographics and that no biases can result.

But across the field as a whole there are often problems where the data you're dealing with is about people. And then I think the challenge becomes a lot harder. You can't just say, "oh, we're just not going to look at people." In some cases, a problem forces you to look at people and you have to start figuring out ways of checking that your model doesn't have those systematic biases in its predictions.

These tools that have been made available by the cloud platforms are, I think, a way of making that easy to access. Whether they're enough or whether they're doing what they claim to do well enough, I think there's probably a lot of ground there to examine. If I was on a client that was going to make use of these tools, I would want to do that research to make sure. Because, you know, when Amazon says it's providing you a tool to counter algorithmic bias and you know that Amazon itself has been guilty of algorithmic bias, you kind of wonder, well, is your tool actually gonna be good enough to do this?

So, I think there is a question there. But I think to give a point to all that rambling, it's something that the field is moving towards. And I just don't know if it's gotten to a point where there's like obvious best practices to follow. But people are starting to build tools for it and are trying to be more explicit about making it part of the whole machine learning development workflow.
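One simple form of the check discussed above is to compare error rates across groups. The sketch below is a hand-rolled illustration, not the API of any cloud platform's bias tooling; the group labels, records, and function name are hypothetical, and "group" stands in for a demographic attribute joined from a separate dataset.

```python
from collections import defaultdict

# Hypothetical records: (group, predicted_high_risk, actually_high_risk).
records = [
    ("A", True, False), ("A", False, False), ("A", True, True), ("A", False, False),
    ("B", True, False), ("B", True, False), ("B", True, True), ("B", False, False),
]


def false_positive_rate_by_group(records):
    """Share of truly low-risk cases that were flagged high risk, per group."""
    flagged = defaultdict(int)
    low_risk_total = defaultdict(int)
    for group, predicted, actual in records:
        if not actual:  # only truly low-risk cases can be false positives
            low_risk_total[group] += 1
            if predicted:
                flagged[group] += 1
    return {g: flagged[g] / low_risk_total[g] for g in low_risk_total}


rates = false_positive_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate {rate:.2f}")
```

A large gap between groups would be a signal to investigate further, even when the demographic attribute was never a model input.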

[00:39:59] Jerome: How can people get in touch with you? And is there anything really exciting that you're working on or that you've heard about that you want to share?

[00:40:08] Hana: So, to get in touch with me, I am Hana at 8thlight.com. That's Hana. And I'm also @Lee_HN on Twitter. I mostly tweet about my kids.

Occasionally I will tweet about tech and I'm also happy to engage in conversations about tech on Twitter. Yeah. And then interesting things that I've seen... so, my teammate, Pierce, as you two are aware, but for the listeners is also working on a machine learning project for the same client. And he's been using TensorFlow as a framework. And one of the really exciting things that I've been interested by is how TensorFlow has made it, at least in theory, a lot easier to package models for production. So a lot of the sort of existing machine learning models, they'll just like package the model itself and they won't give you, you know, like all the preprocessing you have to do to the input before it can be fed to the model.

You know, I mentioned earlier that feature engineering pipeline, right? Like, you know, that means like, you know, whenever we would receive an order, we have to make sure we can do those calculations or at least have those calculations ready before we get the prediction from the model. Well, TensorFlow has built this framework so that with your model you don't just have the model itself, but all the pre-processing steps and then also a post-processing, too.

Sometimes you have to- what you get out of the model output is not really what you want to return to whatever web app or service that's going to work with that prediction. Sometimes you have to, you know, take further steps to make sure it gets into a shape that's actually usable. So all that will be packaged together and you can deploy that as like a single container instead of having, you know, separate pieces of code lying around to handle those pre-processing and post-processing steps.
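The idea of shipping pre-processing, model, and post-processing as one unit can be sketched in plain Python. This is not TensorFlow's API; the class, the feature names, the stand-in scoring function, and the 0.5 cutoff are all hypothetical and only illustrate the shape of the bundle.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PackagedModel:
    """Bundle the three steps so callers only ever see predict()."""
    preprocess: Callable[[Dict[str, float]], List[float]]
    model: Callable[[List[float]], float]
    postprocess: Callable[[float], Dict[str, object]]

    def predict(self, raw_order: Dict[str, float]) -> Dict[str, object]:
        features = self.preprocess(raw_order)
        score = self.model(features)
        return self.postprocess(score)


def to_features(raw_order):
    # Hypothetical feature engineering: scale raw fields into small numbers.
    return [raw_order["building_age_years"] / 100, raw_order["floor_area_sq_m"] / 1000]


def stand_in_model(features):
    # Placeholder for a trained model: an average of the features, capped at 1.
    return min(1.0, 0.5 * features[0] + 0.5 * features[1])


def to_response(score):
    # Shape the raw score into something a web app can use directly.
    return {"risk_score": round(score, 2), "risk_label": "high" if score >= 0.5 else "low"}


packaged = PackagedModel(to_features, stand_in_model, to_response)
result = packaged.predict({"building_age_years": 40, "floor_area_sq_m": 120})
print(result)
```

The calling service passes in the raw order and gets back a usable response, so no separate pre-processing or post-processing code has to be deployed alongside the model.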

So I think there's a lot of exciting things that are available in TensorFlow as a framework. And I think if I am lucky enough to get onto a machine learning project for my next client, whoever, whenever that may be, I hope I'll have a chance to try it out myself because it looks pretty neat.

[00:42:39] Jerome: Well, Hana, thank you so much for spending this time with us and giving us a very excellent primer in data science, data engineering, machine learning, that whole space. It was just such a pleasure to talk to you and pick your brain. And hopefully we can do it again sometime.

[00:42:59] Hana: Yeah, this was fun for me too, as I said I'm a fan of your podcast. So...

[00:43:06] Jerome: Awesome, thank you.

[00:43:08] Thomas: Thanks, Hana.

Okay, that was a really insightful conversation.

[00:43:20] Jerome: Absolutely.

[00:43:21] Thomas: There's a lot to unpack there.

[00:43:23] Jerome: Yes. I feel like I learned so much.

[00:43:26] Thomas: Well, tell me about it. Tell me, what did you learn? What are you buzzing with right now? What are you thinking about? What correlations are you making?

[00:43:33] Jerome: Yeah, I think the thing that really got to me was just how involved ML is. It's a process. It's something that requires constant attention, constant tweaking. It's a living breathing thing. And I think the point that highlighted that to me the most was the metric problem that Hana was talking about. That time spent on a page could be something good. But it also could just be that a customer's frustrated, you know, with the product. And I couldn't help but think about training my new puppy and everything that I was reading about when, you know, we were going through the process of choosing a dog and making that decision was, "Hey, be really careful about the signals that you're giving the dog." Right. You could think that raising your hand up in the air is the signal that is getting your dog to sit, but it could actually be that you're raising your eyebrow every time you do that. And the dog is picking up on that instead. And I'm realizing now that I'm comparing my puppy to an ML model and I think that's pretty cool.

[00:44:57] Thomas: Yeah, I think that's super, I mean, that definitely resonates with me. I mean, one is maybe qualitative and one is quantitative, but I mean, yeah. ML as puppy training. I'd read that book.

[00:45:10] Jerome: Yeah. Maybe, maybe there's something there. What about you, Thomas? What sparked joy in that conversation? What got you going?

[00:45:19] Thomas: I mentioned it in our conversation, but as soon as Hana was talking about cross-disciplinary teams, like that little bookmark revealed itself and I was like, oh yeah, we had the same conversation with Jon in an earlier episode. And it's interesting to me because after I realized that connection, there was a lot about what Hana was talking about that I saw was very similar to what we do as software developers and software engineers. There's a process to follow. There's a beginning, middle, and end. There's multiple people across the business domain and the technical domain that need to work together to bring a product forward.

And it seems to be the same in machine learning. Now I'm not saying that I could just like hop in Hana's chair and do what she does instantly, but it did make me realize that if you're maybe thinking of developing a data product or a data service, you're going to maybe go through similar processes in terms of building your team and managing the development of that project.

Some things are going to be swapped in and out, but like the same amount of work and discipline is required. And similarly, when Hana was talking about the difference between academia and industry, when it comes to authoring code, and that the difference that she noticed was in craft, I think it's kind of the same thing.

And of course, in software development, you can be a coder and write code and use code as your primary means of expression, to steal something from Jon. But then the other side of that is, I mean, building a product, and there's craft that needs to go into that.

[00:47:08] Jerome: What a nice way to put a bow on it, Thomas. And what about you, listener? What does this conversation spark for you? Has your mental model of machine learning shifted like it has for us? How do you incorporate data insights in your business? What does data engineering look like on your team? Let us know. And until next time.

[00:47:33] Thomas: Thank you so much for listening to this episode of Collaborative Craft.

[00:47:37] Jerome: Check out the show notes for links to this episode's transcript and to learn more about our guest.

[00:47:42] Thomas: Collaborative Craft is brought to you by 8th Light and produced by our friends at Dante32.

[00:47:47] Jerome: 8th Light is a software consultancy dedicated to increasing the quality of software in the world by partnering with our clients to deliver custom solutions to ambitious projects.

[00:47:58] Thomas: To learn more about 8th Light and how we can help you grow your software development and design capabilities, visit our website, 8thlight.com.

[00:48:06] Jerome: Thanks again for listening. Don't forget to subscribe to Collaborative Craft wherever you get your podcasts.

[00:48:12] Thomas: You can also follow us on Twitter at @CollabCraftPod to join in the conversation and let us know who you'd like to hear from next.

[00:48:20] Jerome: We'd love to hear from you.

[00:48:22] Thomas: Bye.

Jerome Goodrich