Innovation and AI Strategies with Award-Winning Data Science Leader Vidhi Chugh - ML 085
Award-winning data evangelist, AI strategist, and innovation leader Vidhi Chugh joins the show today to share her perspective on various topics, including data quality, AI innovation strategies, responsible AI, model intelligence, and much more!
Hosted by:
Michael Berk
Special Guests:
Vidhi Chugh
Show Notes
Award-winning data evangelist, AI strategist, and innovation leader Vidhi Chugh joins the show today to share her perspective on various topics, including data quality, AI innovation strategies, responsible AI, model intelligence, and much more!
In this episode…
- The importance of innovation
- Glorified failure projects
- Responsible AI
- Data-driven companies and quality scores
- Tools for autogenerated business insights
- Model intelligence
- Unspoken assumptions
- Understanding the larger picture
Sponsors
Links
- Benefits Of Becoming A Data-First Enterprise - KDnuggets
- Top 3 Challenges for Data & Analytics Leaders - KDnuggets
- Democratizing Data in Large Enterprises
- LinkedIn: Vidhi Chugh
Transcript
Michael_Berk:
Hello, everyone. Welcome back to another episode of Adventures in Machine Learning. I'm your host, Michael Berk. Ben Wilson is still out; he'll be back at some point in the future. Today we are joined by Vidhi Chugh. She is an award-winning AI ethicist who focuses on the intersection between data science and product, and currently she's a staff data scientist at an unnamed company. So Vidhi, do you mind elaborating a bit and telling the people why you're famous?
Vidhi_Chugh:
Yes. Thanks, Michael, for inviting me. I'll give a little background on where I'm coming from, what my career trajectory has been, and why we're talking about the topics we've chosen for today. I am a data evangelist, an AI strategist, and an innovation leader. That's the short version, along those three dimensions. The longer version is that these roles are tightly coupled with each other. Data is the new oil; everyone is going crazy about how to use data. It's the asset we need to monetize, and it's what is required to transform your business. That's where the data evangelism part of my introduction comes from. And when you work on data, there is a point after which, as you scale, you need cutting-edge algorithms. That's where AI solutions come into the picture, and hence AI strategist: my part is finding exactly where AI use cases can enable the business to generate better insights and then act on them. Now you might be wondering where the innovation part comes in, and that's something not many people focus on, and that's where my prime responsibility lies. Innovation is when you are trying to figure out a solution and you don't have a tailor-made solution available. That's where my job comes into the picture: finding the next best thing we can do to fill that missing component. So that's the long version of the data evangelist, AI strategist, and innovation hats that I wear at work. That's about it.
Michael_Berk:
Amazing. Is innovation important?
Vidhi_Chugh:
It is indeed. Your business can experience a nonlinear trajectory at the scale it's working at, and that nonlinearity comes when you have innovation that distinguishes you from your competitors. Otherwise, if they are innovating at a much better scale than you are, they'll have the edge and take over.
Michael_Berk:
Got it. And it seems like there are lots of really good solutions already out there. In your experience, how often is innovation required versus using an existing, tailor-made solution?
Vidhi_Chugh:
That's a great question, Michael. Here's the general framework when somebody asks me how to approach innovation. My first answer is: go and try whatever is available off the shelf, the commoditized solutions, first, because you are always running out of time and the experiments we do are very iterative. I really don't want somebody to start working on an algorithm from scratch. But when something readily available is not something you can consume, or your data or your business problem is not suited to that solution, that's where you need to identify the gaps: what do the current algorithms not cater to that I need as part of the solution I'm delivering? Then, what is the next best thing? And you can only do that, coming from an AI background, if you constantly keep yourself updated about what's next in the market, what's actually cutting edge; that's where the term gets its name. So you read a lot of research papers. For example, somebody once asked me to consider why dropout exists: what is the use of dropout in neural networks? There was a certain need for it; somebody devised that whole mechanism and worked out how and where it should fit in. So take a current solution and trace it back. It's more like a reverse-engineering way of finding solutions, and filling that missing component is important. Don't go all out thinking, "I have an idea and I just need to implement it," without properly devising it. I would say devising the experiment is really important here.
Michael_Berk:
Got it. So it sounds like you start off with an SLA or an overall goal, and
Vidhi_Chugh:
Yes.
Michael_Berk:
then if your current solution does not meet the SLA, you'll have to iterate and find new approaches. I have a question on that, though, which is: how do you know when to kill a project? Let's say the tailor-made solution is not working,
Vidhi_Chugh:
Hmm
Michael_Berk:
and let's say there's a potential million dollars on the table for a better algorithm. How do you trade off your team's time against the potential uplift in revenue from that algorithm?
Vidhi_Chugh:
Excellent question. So this can happen. One thing I typically mention is that you need executive buy-in on whatever solution you are devising. They need to keep a close watch on what you are proposing. You can't just keep buying time and saying, "I have something in mind, I want to implement it, I think it can get us those million dollars." Just saying it doesn't work, right? That's where the whole data component comes in. Does your data support it? Have you already established that the current solution is not cutting it, and what is the next technology you're proposing? Like you mentioned about the trade-off, you need a framework and an initial timeline, and that's where a proof of concept comes into the picture. How much time can you borrow to produce an approximate solution? It need not be the last leg of the solution. You need an approximate solution in the pipeline, and you say: given the constraints of time, maybe of data availability, its volume, the attributes you actually need, the scale, maybe the infrastructure not yet being there — given all those assumptions and constraints, this is the best I have achieved, and it has to be better than what your current solutions are providing. Unless you see that nudge, that delta difference from your POC, your proof of concept, or the kind of MVP you're trying to build, you can't just say, "I have a hunch and I want to go with it." So if your data is not supporting it and you are not able to establish the difference, that's the time I think you should call it off.
Michael_Berk:
Got it.
Vidhi_Chugh:
And another thing that comes up here: never promise a moonshot to any of the stakeholders, internal or external. A moonshot is when you promise something very big, thinking, "I have data, I have scale, I can just plug both of them together with some fancy algorithms and something will definitely come out." That's never going to happen, because even if you can produce some number saying, okay, my model is doing maybe 90% better than your current baseline, it's not going to be sustainable. Once you put something out in production, a whole different journey starts from there. You're not responsible just for the current solution and proving it works; you need to make sure it's sustainable, something that can run in an autonomous mode, if I can say that, so that you can continue reaping the benefits of it. Otherwise, doing something for one single purpose and not seeing a repeatable pattern around it is going to cost the organization a lot of money.
Michael_Berk:
Yeah, that's a really interesting point you bring up. The most common approach, or at least the most common approach I've seen, is that people try to make something meet a goal. A very effective method that I try to use as much as possible is to try to kill the project: if there are certain requirements that must be met, it's often much easier to test those requirements and say, "Hey, we can't do this." It's the same as using counterfactual logic in A/B testing, where it's easier to reject a null hypothesis than to accept a truth. Are all swans white? Well, I don't know. But if I find one black swan, then I know that not all swans are white. That has been really effective in my experience as well.
Vidhi_Chugh:
Yeah, and Michael, on top of what you were mentioning: killing a project is something that is seen in a negative light. If you are not able to conclude a project positively, sometimes the associate or analyst working on it is also conscious of it, thinking, "Maybe it will impact my stature in the organization, that I couldn't do it and there could be a better person." That's where the whole imposter syndrome comes in, and it becomes a difficult decision to take. But one thing organizations can cultivate as part of their culture is to not only promote successes. Success is not binary; you can't call something a success just because you've seen an output. When somebody takes the initiative to call the shots at the right time, that is also an implicit way of declaring success, and documenting it is crucial. Something didn't work? That's perfectly fine. Have you documented it, and do you have a learning from it? I used to run a session on glorifying the learning, glorifying the failure. You are not just failing a project or calling it off; you're documenting it and spreading the word in the organization: "Here, I worked on a project, it didn't work, and I have reasons to prove it. These are the reasons and the constraints under which we had to wrap up the project and call it quits." Maybe next time, if somebody else hits a snag, they can reach out to you. That's the learning that should get promoted, not just the failure or success of it.
Michael_Berk:
Yeah, I'm really happy that you said learning. Often when you're working in uncharted territory, you're not just building and implementing, you're mapping out and scoping. You want to see: is this solution possible? And if it's not possible, that's really valuable, because then you don't go down that path again and you can try other things that might be possible. So it's all about learning and mapping and figuring out the lay of the land instead of just implementing.
Vidhi_Chugh:
Absolutely, and you don't build this kind of intuition, this kind of knowledge bank as I put it, overnight. You develop it once you try, practice, fail, learn, and deliver multiple times. You have to iterate as a data scientist, as a data strategist. There are multiple projects that will give you the learnings that lead to one final blockbuster. So that's important.
Michael_Berk:
Yeah, I wish it could be done overnight. That would be nice, but I agree it cannot.
Vidhi_Chugh:
Yeah.
Michael_Berk:
So what is responsible AI to you?
Vidhi_Chugh:
Okay, that's great. Responsible AI consists of many fundamentals that a lot of people talk about: fairness, explainability, transparency, accountability. The theory is lying around everywhere and people know about it, so I won't repeat and explain each one; they're self-explanatory. What I'll put in one line is this: whatever technology you are using, AI is just a technology, a means to an end. Whichever algorithm you are devising, are you able to keep the human, the end user, in mind? Is it serving humanity? Is it for somebody's betterment? And in order to do that, have you taken into consideration all strata of people, all the end users who are going to consume it? Do you have a lens to think from the consumer's perspective or not? That, I would say, is the responsible part of it. You are responsible and accountable for the solution that goes out into the market and gets consumed by users, where there is no discriminating between one stratum and another, and all are treated equally. That's just one aspect of it. There are many definitions out there, but I'd say that if it is a technology, it has to serve the betterment, it has to provide something that isn't happening right now, maybe make some users' lives more comfortable and convenient. Its purpose should be restricted to that, and no bias should be embedded into it.
Michael_Berk:
Do you mind diving a little deeper into your definition of fair or unbiased or responsible AI? Because there's the do-no-harm rule, there's the net-benefit rule, there's the maximize-profit rule. How do you think about it?
Vidhi_Chugh:
Yes. So when you initiate the project, you need to define these things. What is the end goal and objective of the solution you're driving? Right at the onset you ask the set of questions you correctly mentioned: you have to decide the business problem, who its end user is, and what its objective is; the statements you mentioned come under the objective part. And then there's bias. Bias looks like just one technical term to the person coding the algorithm, but bias creeps in at every point in your ML project lifecycle. It could be right at the origin of the business problem you're trying to solve. For example, I read somewhere — take a hypothetical example — that you have a credit scoring mechanism for people who apply, and you're using the data of only the folks present in your database. Maybe a person below a certain income level has not yet made their way into your database, so you are totally excluding them when you model. Then people coming from the lower income strata, because they were never seen on the training side of your algorithm, will never get qualified and will always be left out. This stems from the formulation: when I started, did I have a viewpoint on who can be adversely affected by the predictions that come out of it? That's the structural part of the business problem; the problem is structured like that, and I should have looked at it from that perspective. Then there's the way you collect the data and the way you label it: if it's a human labeler, certain biases always come in there. And it's not just the algorithmic side of bias; the way you interpret the output can also be biased, the way you generate and make sense of insights, because insights are just a set of numbers — somebody has to make sense of them and take action on them. Can that also be biased? So there are multiple checkpoints in this whole pipeline where things can go wrong. There are actually a lot of guidelines available out there, but it generally comes top-down and then slowly gravitates towards a hybrid approach. The employees, or I would say the developers, who are working on the solution should be sensitized to these concepts first.
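To make the credit-scoring example concrete, here is a minimal, hypothetical Python sketch of a pre-training representation check, comparing how each income band appears in the training data against its expected share of the applicant population. The column name, band labels, and expected shares are invented for illustration and are not from the conversation.

```python
import pandas as pd

# Hypothetical expected population shares per income band.
EXPECTED_SHARE = {"low": 0.30, "middle": 0.50, "high": 0.20}

def representation_gaps(train: pd.DataFrame, group_col: str = "income_band") -> pd.DataFrame:
    """Compare observed group shares in the training data with expected shares."""
    observed = train[group_col].value_counts(normalize=True)
    report = pd.DataFrame({
        "expected": pd.Series(EXPECTED_SHARE),
        "observed": observed,
    }).fillna(0.0)
    report["gap"] = report["observed"] - report["expected"]
    return report.sort_values("gap")

# A strongly negative gap for the "low" band would flag exactly the situation
# described above: a whole stratum missing from training, which the resulting
# model can never score fairly.
```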
Michael_Berk:
Yeah, 100%. And it's really, really difficult. So we're talking about bias — not the bias-variance trade-off, but bias where the data being put into a model is unfair and reflects systemic issues in society. One of the classic "solutions" is to do nothing and just throw your hands up and say, "Oh, life is unfair. Sucks to suck." Another option is to take out the potentially problematic variables: if you have race, gender, age, something like that, you just remove them. But it turns out, and you were hinting at this with the credit score, that a lot of other data and pieces of information actually encompass those variables, or leak them. So it's really challenging from a policy standpoint to think about how you should change data and how you should actually create the ideal vision of society. For example, with affirmative action: should it be unfair to certain people so that, on the whole, society is more fair? Or should it be perfectly fair starting now? It's a really tough problem, and I was wondering how you think about it as well.
Vidhi_Chugh:
Yeah, definitely. In fact, I would say that ethics and responsible AI are still abstract concepts, something that doesn't have a well-qualified definition, because if you need to pass it on as a concept, how would a developer code it? It's still a concept, and it depends on the subjectivity of the person consuming that concept. And even if everybody has been ethical while developing the solution, there will always be someone standing on the opposite side questioning your predictions and pinpointing issues. But then, how do we define who can validate the solution? Who is the right person, the so-called ethicist who can look at your solution, validate it, and confirm, "This is the certified ethical solution that is ready to be shipped to market"? That's the difficulty, I would say: going from a theoretical concept to embedding it as part of your machine learning lifecycle. That's the critical part.
Michael_Berk:
Yeah. Yeah. Words are not the same as code, unfortunately.
Vidhi_Chugh:
Absolutely.
Michael_Berk:
Maybe one day, but not today.
Vidhi_Chugh:
Yeah.
Michael_Berk:
So, going back to organizations, I was wondering whether you think an organization can be too data-driven. Is that such a thing?
Vidhi_Chugh:
Yes, so this, I can say, is the happy part, the North Star: everybody wants to be data-driven. Data literacy, data transformation, digital transformation — these are the words we keep hearing, and they look fancy as well. There are a lot of research reports about the best thing we can utilize or exploit today to gain an advantage over the competition we're facing, and everybody will say that you have to be data-driven and that all the actions you take should be backed by data. But there's a gap between the plan you chart out and what you see while executing it; the whole journey has a lot of issues. The first barrier to being data-driven, I see, is the rate of adoption. Think of the pre-technology era, and of somebody making the shift to data: you need somebody who can actually endorse the solutions you're proposing. You have a strategic decision, you have a leadership team and executives who sit with you, and everybody signs up for some goal or other. How do you even sign up for a particular goal? That's also backed by data. There are a lot of projects you could take on. Think of the macroeconomic slowdown we're all going through: you're not going to get more resources. Within that constraint of limited resources, there are so many projects I could pick. Which one will give me more operational efficiency savings, or can I discover an altogether new project that will make me the first mover in the market? In fact, the rank order of the potential projects you pick comes from data. You have to do the estimates, the opportunity sizing; you have to estimate the size of the opportunity, turn it into numbers, and then pick the priority projects with the limited resources and pass them down to your business units. This can go on, and I can talk more about it, but is being truly data-driven possible? Yes. Is it already there, and is it easy to do? There are a lot of issues around that.
Michael_Berk:
But have you ever seen organizations that blindly trust data and are too data-driven?
Vidhi_Chugh:
Too data-driven — okay, I thought you were asking about organizations that are truly data-driven. No, I have not, to be very frank. There is a first stage where you are actually trusting the data, and then you would have to pass into a stage where you say, "I'm blindly trusting the data." What I have seen in my experience — and this is purely my experience dealing with leaders — is that whenever you present them with results, there are going to be a lot of questions. First: can we trust the data itself? Is the whole analysis trustworthy, and is the data it comes from trustworthy? How do I understand the quality of the data? There are a lot of questions and a lot of iterations while you present the final results that get consumed. Again, this is purely from my experience, but I have not seen somebody be so data-driven that, just because something looks aesthetically good, they take whatever comes out of it without questioning the nuances.
Michael_Berk:
That's good to hear. I have seen it a couple of times, but I will not name names.
So I actually have seen it in a couple of instances. It is rare, though, but sometimes the underlying data are not trustworthy, and I think that is something we really need to be critical about. There's also over-extrapolation: just because there is a p-value or a correlation or some statistic indicating that a trend exists, that sometimes needs to be taken with a grain of salt. It's really important to understand what those p-values actually mean, or what that correlation actually means. If the correlation, for example, is 0.8, but there are a bunch of outliers on either end of the distribution, well, maybe that's not the best thing to act on. But it's good to hear that you have not seen it. It's probably not very common, and people tend to fall into the not-data-driven-enough category.
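As a side note, here is a tiny synthetic illustration of Michael's point about correlations and outliers, using made-up data and standard SciPy functions; the exact numbers will vary with the random seed.

```python
import numpy as np
from scipy import stats

# A toy example: the bulk of the data is pure noise, but a few matched extreme
# points at both ends of the distribution pull the Pearson correlation well
# above zero, while the rank-based Spearman correlation stays close to zero.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)                      # no real relationship in the bulk
x_out = np.append(x, [-10, -9, -8, 8, 9, 10])  # joint outliers on either end
y_out = np.append(y, [-10, -9, -8, 8, 9, 10])

print("Pearson with outliers:  %.2f" % stats.pearsonr(x_out, y_out)[0])
print("Spearman with outliers: %.2f" % stats.spearmanr(x_out, y_out)[0])
```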
Vidhi_Chugh:
Yeah, I can—
Michael_Berk:
So I also... Sorry, go ahead.
Vidhi_Chugh:
No, I was just going to say that I've maybe been blessed enough not to witness that, because I've been grilled to the level where the person I'm presenting the results to knows the data inside and out. In fact, it has been a long learning curve for me: there are some questions that, when posed to me, made me realize I could have asked them of myself first, and now, because somebody has asked them, I need to go back, dig in, and come back with an RCA — all the cases of outliers and data quality, whether it's trustworthy or not. Those were the learnings I picked up in my initial years. Now I make sure that whenever I talk about data, I have a data quality score. That's the mechanism I've devised: there's a particular score I try to attach which says, "I'm the consumer of this data, and I don't know how to trust it unless you prove it to me." So there is one metric, one score, that ties this together at an aggregated level. If that number doesn't make sense, I'll dig in and see which corresponding pillars I need to focus on, zoom in, and get to the point where, if there are data issues, I'm warned about them. If I'm still supposed to take an action based on this data, I know the cost and repercussions of it. There are hard and soft constraints: with a hard constraint, I'll definitely not take any action on this data; with a soft one, can I still start serving predictions, can I still take some action, because I know the cost of my action won't be too great? These are the kinds of thought processes; everybody has their own way of dealing with it. But the leaders I've worked with definitely won't take my yes for a yes. They'll ask all those questions, coming as they do from data-driven companies.
Michael_Berk:
That's good. So this could be like five podcasts in one. But what are some things that you look for in terms of data quality? Like you mentioned some pillars. Could you elaborate?
Vidhi_Chugh:
Absolutely, yeah. So in terms of data quality, first, I'm talking about data specific to a use case. Let's start with that, because I'm an advocate of not taking it all on at once. Take a small dataset, or a particular use case, and for that use case the data might be lying in several different tables, with a lot of joins and computes you need to perform to get the data into a reasonable shape where I can start generating insights. Now, I look at the data I have at the end and trace back to see what could possibly have gone wrong with it. It could be in the joins and merges, or the data itself could be incorrect at the origin, which can totally happen with operational data: somebody inputting data manually can make all kinds of errors. Then I keep going back and put in checkpoints. For this particular business problem, I as a data scientist might need, say, 10 attributes. For those 10 attributes, I'll look at the attribute coverage: over a period of time, has that attribute's pattern changed or not? That's one of the metrics. Has this attribute, for example, had 90% nulls for the last n periods I've observed and reported on, and all of a sudden I see a huge increase in nulls? This data is important for my model; this is a critical attribute. And how would I label a critical attribute? I'd say tier one, tier two, and tier three. Tier one is critical for me: without it, the algorithm will simply not work as expected, so I deep-dive into that. That's what I mean by going back into the data quality issues. And it might not just be about nulls; it could be about the number of attributes itself. One day you suddenly don't have the tenth attribute at all because somebody made a change in the DB. So attribute coverage and null coverage are important, and finding outliers is important. It could be as simple a rule as "anything falling outside three sigma doesn't make sense to me, maybe I need to investigate," or you can put more sophisticated models in place. Another example: I was expecting a particular period, and the dates populating here fall outside that range, so maybe I'm looking at the data in the wrong way, or something happened in the joins where the date attribute is the critical problem and the matching isn't exact. So you find those issues specific to your data and design a framework that says: historically it has been like this, and now it is like this. There can be any number of ways to aggregate it; you can weight all those pillars, maybe equally, and tie them together so that, for example, 0.8 says you can trust this data roughly 80%. And if an attribute's values have stopped coming in and the score has fallen out of its historical range, do you still want to consume this data in your model, or do you want to stop and have somebody investigate? These are some of the ways I generally follow. And the closer you get to the data, the better.
That's where I say that data-centric science means focusing on the data as much as possible before taking the algorithmic route.
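For illustration, here is a minimal Python sketch of the kind of aggregated data quality score Vidhi describes, combining attribute coverage, null-rate drift, and a simple three-sigma outlier check into one weighted number. The column names, tiers, weights, and thresholds are assumptions for the example, not her actual implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute tiers: 1 = critical ... 3 = nice to have.
CRITICAL_TIERS = {"income": 1, "credit_history_len": 1, "region": 2, "signup_channel": 3}

def null_rate_ok(series: pd.Series, baseline_null_rate: float, tolerance: float = 0.10) -> bool:
    """Flag an attribute whose null rate jumps well above its historical baseline."""
    return series.isna().mean() <= baseline_null_rate + tolerance

def three_sigma_share(series: pd.Series) -> float:
    """Fraction of numeric values lying within three standard deviations of the mean."""
    values = series.dropna()
    if values.empty:
        return 0.0
    z = (values - values.mean()) / (values.std(ddof=0) or 1.0)
    return float((z.abs() <= 3).mean())

def quality_score(df: pd.DataFrame, baselines: dict) -> float:
    """Aggregate per-attribute checks into one 0-1 score, weighting critical tiers higher."""
    weights, scores = [], []
    for col, tier in CRITICAL_TIERS.items():
        weight = {1: 3.0, 2: 2.0, 3: 1.0}[tier]
        present = float(col in df.columns)  # attribute coverage
        nulls = float(col in df.columns and null_rate_ok(df[col], baselines.get(col, 0.0)))
        numeric = col in df.columns and pd.api.types.is_numeric_dtype(df[col])
        outliers = three_sigma_share(df[col]) if numeric else 1.0
        scores.append(weight * np.mean([present, nulls, outliers]))
        weights.append(weight)
    return sum(scores) / sum(weights)

# A score of, say, 0.8 would mean "trust this data roughly 80%": below a hard
# threshold you stop serving predictions, below a soft one you proceed with caution.
```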
Michael_Berk:
Yeah, and I'm really glad you mentioned focusing on the problem as well, because having subject matter expertise can shortcut things so dramatically. If you know that there should be a roughly linear relationship between X and Y, you use a linear model and don't even think about a deep learning model. If you know that certain data columns were produced during the pandemic, well, maybe that data is not super reliable and you need to take measures to counteract that. So I think you hit the nail on the head that it's very subject-matter dependent. And point number two: I hate saying this, but it does take time to learn how to do EDA and to develop the instincts on whether your data is 70% good or 90% good. But I really like that you're trying to codify this into different levels: are the SQL joins reliable? Is the fundamental underlying data reliable? Is the use case reliable? Looking at each of those categories, I think, is really effective.
Vidhi_Chugh:
Absolutely, and you correctly mentioned that domain expertise is key here. If you have a person who has an eye for the data, you use their intuition about it, but you make sure it's very well validated by the pipelines you're building. Having domain knowledge about the use case you're working on helps a lot in finding those possible bugs in your data.
Michael_Berk:
Yeah, 100%. So we were chatting a little bit before the call about some topics, and one of them was developing tools to generate automated business insights. I was wondering if you could elaborate a bit on what you mean by that and what those tools are actually doing under the hood.
Vidhi_Chugh:
Yeah, definitely. Generally when we talk about tools, people go and talk about specific tools and how to use them. I use "tools" as a word for a means to an end, where I have an objective to solve, and I would generalize it under the umbrella of business intelligence, which follows a thread that relates to model intelligence. Business intelligence is when you have data lying there, and that data in its raw form might not be usable: you need to do some transformations and organize it well, so that it is well informed and you can take a decision on it. That's the data part, which is where all the tools and technologies you use to get your data into a consumable shape come into the picture. When I talk about model intelligence, that's where you want to make sure that all the processes you're following can be run by the person who is going to use them in a hands-free mode; there is no one person you are heavily reliant on. Those are the advancements in tools we're seeing as we go beyond business intelligence, which is pure-play reporting, something that helps you identify the what and why of the business. Say some event happened: maybe for the last three months my customer conversion rate was not good in a particular region. I have the historical data, I want to dig into it and generate a report about it, and I'll use those tools to analyze it in diagnostic mode. Now, if I'm supposed to take an action on it today, I won't be able to unless the data is fresh. If I have concerns, I pass the data back to whoever did the analysis, and the time they take to come back with a fresh analysis, on which my next set of actions depends, is where the bottleneck creeps in. And whenever we talk about anything data-related, it's often treated as a geeky thing, where somebody presents you a set of facts and you are supposed to take decisions based on them. When I specifically talk about tools, I think of a tool as something equally accessible to a non-technical user, somebody who is not so comfortable with data itself but who, just a click away, can fetch and pull out the results needed to take a particular action. You are not overly reliant on somebody who can act as a bottleneck to the entire process, because speed is actually the differentiator. If I take that action one week from now instead of today, the results you share with me are already stale and I don't know how to act on them. So when you want to act on those insights, you need — in a way, I can say it's data democratization — data accessibility and the know-how of how the tools work, so you can leverage that data in a timely manner.
Michael_Berk:
Yeah, taking a bit of a step back, earlier in the flow: how do you think about the automated side of the question? We can definitely create a linear model or do whatever causal inference we want, but is there a way to sort of get insights while you sleep?
Vidhi_Chugh:
Okay, so getting insights automatically is still possible, but whether you are acting on them — that's where actionable insights come in, and that's automated insights versus an autonomous system. Those are different. An autonomous system is one that has worked on a pile of data, some insights were generated, and you already have a system that consumes those insights and is able, and allowed, to trigger a certain action on top of them. That's the autonomous case, where there is no human in the loop. I would say a fully autonomous system is still a little way into the future. It may already exist in some places, but I personally have not yet touched a system that is fully autonomous, because automated insights are fine, but once you go down the path of an autonomous system, you need to decide the category and the impact of your actions. For example, maybe you can say the cost of the actions taken on the basis of automated insights is not too high, and it speeds up the cycle with which I can push those predictions into the system, so I'll definitely take that. That cost could be the impact on the end user, the actual monetary cost you'll end up paying, or the time delay — time is also a factor of cost. So you need to devise a way to assert certain rules: if I am gravitating towards autonomous systems that consume those insights, I need to make sure only a certain section of those insights is consumed by the autonomous system — the ones that are not business critical, not critical in those tiers I described. And once you build faith in those systems, you can do it gradually, in a phase-wise manner: like I mentioned, divide the insights into tiers, start from the lowest-impacting tier, tier three, and once the system has proven itself, maybe do a switch-over and take it to phase two and phase one. But trust is the critical word here: how much are you able to trust that running in an autonomous mode is possible?
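As an illustration of that phased, tier-based gating, here is a small hypothetical Python sketch; the tier numbering, cost threshold, and the fields on the Insight class are invented for the example and are not from the conversation.

```python
from dataclasses import dataclass

@dataclass
class Insight:
    description: str
    tier: int              # 1 = business critical ... 3 = lowest impact
    estimated_cost: float   # monetary / user-impact / time cost of a wrong action

AUTONOMOUS_TIERS = {3}        # start with the lowest-impact tier only
MAX_AUTONOMOUS_COST = 500.0   # hard cap on the cost of an unreviewed action

def route(insight: Insight) -> str:
    """Decide whether an insight is acted on autonomously or sent to a human."""
    if insight.tier in AUTONOMOUS_TIERS and insight.estimated_cost <= MAX_AUTONOMOUS_COST:
        return "act_autonomously"
    return "queue_for_human_review"

# As trust builds, tier 2 (and eventually tier 1) can be added to
# AUTONOMOUS_TIERS in a phase-wise switch-over, mirroring the gradual
# hand-over described above.
```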
Michael_Berk:
Yeah, and you touched on something I wasn't even thinking about, which is actually consuming the insights. Consuming insights in an autonomous or automated way is just really scary, in my opinion. It's amazing how such simple rules can lead to so much complexity. If you think about life, it started from very simple rules, but now we have cities and computers and all these crazy things, and they came from very simple initial conditions. The ability of simple rules to create complexity is just mind-boggling. But what I was also thinking about is: sort of like passive retraining of a model, can we create systems that will learn and surface business intelligence insights or ideas? Have you ever seen those in action?
Vidhi_Chugh:
Yes. I actually worked on one project that was, I would say, still semi-automated, and it should have gone to the next phase by the time I left the firm. But yes, you can identify hidden patterns in the data; there are insights you can dig into and generate, rule-based or with rules that evolve, and that's precisely where the machine learning concept comes into the picture. But one thing you correctly mentioned, Michael, about why it's scary if somebody goes down that path, is: who is accountable for that system? Accountability becomes a big part of the picture. Whoever signs up and says, "Okay, I give the go-ahead, this is a go-live project for me" — then every single person who touches that project is responsible for what comes out of that system. That's the crucial part everybody needs to be watchful of. The way a typical project works in an engineering environment is that everybody is assigned a task; if it's Jira, for example, everybody is just Jira-focused. Somebody gives me a task and I need to do it; I may not know how it's going to impact the person who will consume the output of my task. Because everybody is focused on their piece of work, they don't know what the overall picture looks like. So who is accountable for such a system? The one driving the whole initiative, the enabler of it. And that doesn't really give you enough confidence that you can nail it down to a person when something that came out of the system didn't go well — and by then the impact is already made. It's a little circular, but it's a little scary to let the system run on its own.
Michael_Berk:
100%. So zooming out a little bit, I was wondering why throughout your career you focused on building frameworks instead of looking at specific tools or algorithms.
Vidhi_Chugh:
Okay, yeah. Specific tools and algorithms are what I've worked on, but I don't have the liberty to talk about them at length, because I've worked on a lot of patent-worthy material that is still going through a process. Frameworks are something I'm an advocate of, so wherever the opportunity comes up, I talk about frameworks, best practices, guidelines, collaboration tools, and documenting the work as much as possible. I keep talking about these in multiple forums, and this kind of information is generally not proprietary, so I'm free to talk about it. Having said that, it's not that I haven't worked on the algorithmic side, and had I not done that, I would not even have a purview of how to create innovation; that's the part that comes from my introduction. I had a set of gaps I identified in the business problem I was supposed to solve, and there was no existing solution available for it. So I went ahead, did a lot of research, read a lot of material from professors, and understood certain concepts — which was a learning curve on its own — before finally being able to produce a solution that was patent-worthy. And generally, people don't assume research will go into production, but my algorithm was actually production-ready. So I have a flair for both things, basically.
Michael_Berk:
Got it. What are some frameworks that you think are pretty rare, but are incredibly valuable?
Vidhi_Chugh:
Okay, so there are multiple ways people talk about frameworks. When I work on any particular framework, my first intention is to make sure there is no repeat task being executed. If I see a certain action, pipeline, or piece of work my team is doing that is repeatable, then today, because they have to deliver something, they might think they don't have enough time to put it into a form that can be readily consumed by somebody else. That's where my piece comes into the picture. It looks a bit like paying down technical debt: it's not giving you results right away, but in the long run it helps you run an efficient system. Those are the kinds of frameworks I'm generally talking about — or were you asking about something else?
Michael_Berk:
No, that's what I'm talking about, yeah.
Vidhi_Chugh:
Yeah, okay, so I can talk about one example without going into the details. Say I'm working on analyzing a particular entity — "entity" is a generalized, abstract term for the work I actually did. What I saw was that three months after delivering the analysis on entity one, I had to do it again for entity two. Because these two entities come from the same ecosystem, they share certain underlying characteristics: you can say that's the horizontal layer they share, and this horizontal layer will keep getting repeated when I'm next supposed to work on entities three, four, and five. That's where my framework comes in: what is specific to this entity that I won't see across the others — a vertical, an attribute specific to that entity — and, in general, can I say that roughly 70% of the work is a shared baseline? If I build an automation around that baseline, or design a framework — even a simple one, leaving aside tools or technology, something I can lay out in a plain Excel sheet — that says, "When I did the analysis of entity one, these are the checklist steps I followed and this is how I concluded," then the next person might do some extra steps specific to their entity, but they know they are not missing any of the steps somebody originally did. So they have a more comprehensive view, and the list keeps getting more advanced: with as many entities as you have to analyze, somebody will eventually work on a case that wasn't seen before, and the additional insights or understanding of the system from that analysis should go back into the rest of the work we've done so far. Having a set of checklists in a plain Excel sheet, shared across the team, gives you gains you initially won't realize. You'll think of it as tribal knowledge: "I have it in my mind, it's a brain dump I'm not putting anywhere because I think it's trivial." But once you put it out there and let other people reap the gains from it, you realize how much time it can cumulatively save you and the organization.
Michael_Berk:
Yeah, checklists are incredibly powerful. They align teams; they ensure that we're all working towards the same thing. And, as you just mentioned, they get things that people think are obvious out into the air and into discussions. Oftentimes it's the things people assume that cause the issues, not the edge cases. Well, edge cases are a problem too, but in my experience it's often the unspoken things, because good engineers will find the edge cases; the unspoken assumptions are really dangerous.
Vidhi_Chugh:
Absolutely, I totally agree with that. There are some things where, because I've developed the system from scratch, I think, "This is too obvious for me to write down; the person taking it forward from me should already be equipped with this knowledge." And that kind of vague assumption means your understanding gets lost in translation as it's passed on to the rest of the team. These kinds of documentation can also be a good starting point for deep technical discussions. Say I thought of something, worked on it, and designed a system, and I didn't write it down anywhere because the consumer didn't ask me those questions — like you correctly mentioned at the beginning of our discussion, if nobody questions you, your insights just get passed into the production environment without anyone really digging deep into the data. That can totally happen by chance. Then the next person picks it up, maybe does it more comprehensively, but never had a chance to know how you did it. A simple example is the way you take a sample of the population: maybe you took a random sample and the other person thought a stratified sample made more sense. Unless that person knows how you originally started with the data, or the kind of data you worked on, that discussion won't happen; the channel of discussion won't open. It's not a case of who is right and who is wrong, not about objectivity between developers. It's about the assumptions with which you put a system out there, and now you're letting the entire team know what happened. Somebody can later go back and say, "I think the assumption you made about the population doesn't hold. Do you think we need a revision, or what extra knowledge did you have at the time on the basis of which you took this decision?" So having those technical discussions is also a good thing that comes out of laying down these frameworks.
Michael_Berk:
Yeah, and one more point on top of that. When I was first starting out in my career, I thought it would almost be disrespectful to over-document and explain my thought process, because I figured, "Oh, I'm a newbie, everybody else already knows this, everybody else is so much smarter than me." It turns out that, regardless of intelligence and knowledge, over-documenting and over-explaining is actually really good, because people are just different: how you approach problems is different, how you think about the system is different. So if you just give a sentence or two on where you were coming from and where you intended to take things, it can save hours and hours of time. I've gone down so many rabbit holes because someone just didn't tell me what they were doing. It would have saved me years of my life. Not actually, but hours of my life.
Vidhi_Chugh:
Yeah, true.
Michael_Berk:
Cool. So I also had another high-level question: what are some buzzwords that actually matter? It seems like you live in the world of buzzwords, but you actually define them, put a face to the name, and figure out what is actually important in those concepts. The classic example is AI: it can mean Terminator, it can mean linear regression, it can mean machine learning, it can mean 50 other things. So could you pick one or two buzzwords that actually matter and are actually relevant to our future?
Vidhi_Chugh:
Okay. So one buzzword I'll give here is evergreen — looking backward or looking forward, it is always going to be the core and crux of things — and that is being data-first. Because I say data-first, this applies to every decision you take, leaving aside going straight to AI or an algorithm. For every decision: do you have enough understanding of the problem statement? Do you ask questions related to data? Do you put data on the table first — what is the success criterion, how do I measure it, what am I looking for, what kind of data requirement is there? Asking all the questions related to data, being data-first, is important. Being data-driven comes second: asking the right questions about the business problem in terms of the availability of data, and how you would procure the data if it is not available. These are the fundamental things everybody should be talking about and practicing, and it should go beyond the point of leadership just preaching it; it should be walking the talk — whatever you say, you need to demonstrate. Come to think of it, I have a good example. I was scrolling through LinkedIn today, and a lot of people were saying, "I think work from home is more efficient for me," and that these back-to-office programs the organizations are laying down are silly. And then there's a comment saying: but how do you decide that? Is there data that can prove that working from home is more efficient than going to the office? Have you recorded the number of hours you used to travel, the kinds of breaks you take, and the collaboration the work needs? All of it, essentially, comes down to being data-first. So that's the first buzzword. And related to AI, I would say: don't focus too much on the algorithm. Algorithms, tools, and techniques are a means to an end; I'll repeat that as many times as possible, because that's what we generally get caught up in. These are the fancy things, and they will continue to change; that's why my focus is not on the technology or the algorithm — they will continue to evolve. Today in NLP you're talking about BERT; previously you talked about LSTMs, and tomorrow there might be something else. So the algorithm is not at the core of my focus. The next buzzword — and I've heard some veterans predicting this for the coming decade — is the responsible use of AI. I really wish it goes beyond being a buzzword and sees the light of day, with people applying that kind of ethical lens when they design solutions. So first is the data you're working on — that's the first buzzword — and second is whatever you're producing, whatever goes out of your system, and what kind of impact it's going to make. These are the ingestion point and, I would say, the output point.
If those are taken care of well, you can pipe in any other algorithm, you can bring in anything else. The alternative is thinking, "Okay, I have big data, I don't need to look into whether the data has enough signal, I don't really bother about the signal-to-noise ratio, I have enough data and compute and algorithms, I have what it takes to be an AI-powered company." I really wish we go beyond that stage and first make sure of what we are putting into the system and what is coming out of it. Those are the two things that are important for me, and I think they are catching on a lot these days and will continue for some more time.
Michael_Berk:
Got it, cool. I had one more question on a more sort of personal level. So you've been working in industry for over 10 years now, and it seems like you've done lots of speaking engagements and have gotten your name out there. So I was wondering how you thought about creating your sort of personal brand. Is that a concept to you or did it just happen organically? Could you speak about that a little bit?
Vidhi_Chugh:
Yes, actually that's a very interesting question and a journey I would like to share. Maybe till 2019, so three years ago, I was more of an introvert person who used to go to office, just do my work and get back home and I really don't bother about what is happening and what is this influencer thing and what does brand even mean. Brand is something that I never concerned, leave aside for myself, but for the organizations that I used to join, it was all about what kind of work is going to be there. But in 2019, a veteran, an AIML veteran in the organization that I was working for, he chose me to lead this glorifying failure project. So this is more like a brown bag session. I told about you in the beginning of the project. So whenever, so that's where the crux of the problem is. And that's where the leadership and the culture that they want to put it into the organization comes into the picture. this very senior person, he saw that we are building out in the field of AI, there is something that we are trying which nobody has tried before and every time we are attempting it, we are maybe missing out on something or the other that the whole solution is not coming out in the way we expect it to be and on top of it there are geographical distances right so somebody in the India team is experimenting something versus somebody in the other part of the company is doing something and all the data scientists are not maybe talking to each other there is no forum where we come out discussing what we are working on, maybe pick brain from the other data scientists, get into the collaboration mode, not take it personally that hey, why did you try that, right? Just go beyond their imposter part. And he thought that if we create a channel where these kinds of discussions flourish, maybe first is those repeat failures won't happen. Somebody who has tried on something, the other person should not be repeating that that's the kind of silos we are living in even today. So he wanted to do away with that. He wanted to make sure that maybe at the origin of the data or origin of the problem statement, we don't have a control. But what we have a control is if I make sure that all that people are talking about, they're sharing their knowledge bank with each other because we're working on a very specific domain problem. So for that domain, we in-house data scientists were the only forum we could reach out to each other for. In fact, it went to the extent where I said that we'll maintain an internal repository where we know all the data scientists, this is MNC, this multinational company I'm talking about, have all the data scientists there, what kind of background they are coming from, what are their maybe key strengths or, you know, the, in fact, weaknesses also. Some maybe feel, for example, computer vision, if they have not worked on and some of the projects that they're currently working on. If they're a long-timer into the company, maybe they can tell about the success and failures of the previous projects. Now this acts as a stepping stone for others who are joining now. First, they'll pick up fast. You learn by examples. So you can learn by somebody else's mistake rather than going and doing that repeat all over, repeating that mistake all over again. So that senior person, he wanted a person who could orchestrate those sessions. 
And for some reason he chose me, because I was very inquisitive about asking questions, since I was struggling with the domain knowledge at first, and he thought maybe I was the right person to develop that channel where you have the liberty to go to any other data scientist in the company and say: I have this problem, I want somebody to put their brains on it with me, and I want to devise a solution we can actually put into production, not just do some fancy research. When he gave me the responsibility, I was shying away from it. But when I took it on, I realized there was so much to learn in those sessions and forums, and it took me past my introvert nature. I realized this was something I started enjoying. First, it improved the way I was interacting and networking within the organization. At that point I still hadn't gone beyond my organization or out in public; LinkedIn was the last thing for me to attempt, and going to conferences was something I had never thought of. But when I started realizing it was accelerating the way I was growing in my career, the kind of knowledge I was taking in from these people, I thought: it's good that I'm doing this, but now I want to pay it forward. That's where I realized that if somebody approached me, I would definitely join those sessions, and if whatever I've learned helps somebody take a single step forward in a project where they're stuck, I'll be happy to help. So that's where it started, and I became a little more open. I started writing down what I had learned, which is where the blog part comes in; I started writing on Towards Data Science and multiple other forums. Once that happened, it became more organic. People started considering me a community person who is involved in the discussions. Part of it is also that you become freer about expressing your opinions, because they come from experience you have learned from others, so you go beyond that imposter that everybody carries inside, right? And the more I started speaking, the more I realized everybody is so receptive to what you have to offer that it becomes a two-way street, and there is no going back.
Michael_Berk:
Got it. So it sounded like it was mainly organic. You were just looking to progress in your career, you were given this role somewhat by chance, and by doing a good job in that role you started to get more of a following, essentially.
Vidhi_Chugh:
Yeah, and I can say that turned out to be the pivotal point of my career.
Michael_Berk:
Yeah, it's unfortunate, but selling is part of the career. You do kind of have to look the part for people to trust you.
Vidhi_Chugh:
Yes, yes, definitely. And I think a part of it also comes back to the point that you need to voice yourself. If you're given a project, nobody should assume you already have all the skills to do it right away; that's just chasing short-term gains. When I work on a project, I make sure the edge cases are covered and ask whether it would actually yield business impact, whether there is potential for value realization or not. That inquisitive nature made me look for those answers; it made me question the people who were giving me the work. So you need to voice yourself rather than just take a download session from whoever is handing you the work. The more you question, the more you understand the business problem. I think that's the little difference in the data science world: everybody quickly jumps to the data, puts it into an algorithm, and develops a baseline, which is reasonably good, but trying to understand the business context is way, way more important. How you understand the business problem defines how you map it to a statistical problem, right? So you need to know your stakeholders and what you are putting into the system. That's what I was looking for, and maybe asking that many questions is what put me on the spot. So voicing yourself is really important. Do not take it for granted that you know it all, that you are the smartest person and can single-handedly deliver a project; it takes a whole lot of very smart people around you. Teamwork is really important, and asking the business question, where your use case will eventually be put, is very important.
Michael_Berk:
Yeah, 100%. Cool. So where can people find you if they want to reach out or follow some of your content?
Vidhi_Chugh:
Okay, so I am mostly active on LinkedIn. That's still my go-to place; people might think I'm from some other era because I'm not available anywhere else. I've recently created a Twitter account which I'm not really using at all, because that doesn't come naturally to me. So if you have any technical discussions or questions, you can drop me a note on LinkedIn. That's the only place I'm available right now.
Michael_Berk:
Got it, amazing. Well, we're at time, but this has been really fun. Do you have any closing thoughts, any final pieces of wisdom for the audience?
Vidhi_Chugh:
No, Michael, I think it was a pleasure speaking today. The final piece I would say is to understand the larger picture. That's the key. While you are attempting to understand the larger picture, remember that the project you've been given comes from the business, where the end user is the theme: the customer is king. You need to understand the end use case much better. You could be working for an internal stakeholder or an external one, but until you stitch your data back to that end point, fitting the whole journey together will be very difficult, because you need to know both end points of your project. So pay more attention to that. As for the other things, learning algorithms, there will always be a lot more coming in while others fade away. Don't just go by the buzzwords. And one last thing that people ask me a lot: you don't need to be AI-trained to deliver a project. I don't come from this background originally; I wasn't even a computer science engineer to start with, right? If you have an understanding of the problem statement, the domain you're working in, and the data you're dealing with, all the other tools are available. Education has been democratized; you have a lot of courses, and for learning programming languages there are also a lot of places you can go. So don't go by the buzzword; know what it takes to be a data scientist. It is a very iterative job and sometimes very pressing as well, but once you start seeing the returns of what you actually did, it's all worthwhile. So carefully evaluating your profile and whether you are a match for it is very important.
Michael_Berk:
Well, this has been a lot of fun. I just want to thank you again for joining. It has been Michael Berk and Vidhi Chugh, and until next time, thank you for tuning in.
Vidhi_Chugh:
Thanks everyone.