Transcript | Thinking big with small data | Dr. Jennifer Prendki (Founder @ Alectio)
The transcript from my podcast with Dr. Jennifer Prendki
Dr. Jennifer Prendki 0:00
...things that are a little bit more of the human sciences or whatnot, right? If you bring all of those skills under one umbrella, then you really have an opportunity to use AI, to use machine learning, in a way that benefits everybody. So catastrophic forgetting is something that happens when you dynamically train a model, online-learning style. At some point your model starts understanding a concept or, let's say, a class really well, and suddenly you add this little bit of additional data, and the model forgets it for the benefit of something else, right? According to OpenAI and Sam Altman himself, we're going to run out of data by 2025, 2026, right? The information that lives in a model, that you can use to generate, is typically the information it grabbed from the training data.
Max Matson 0:51
Hey there, everyone. Welcome back to Future of Product. My guest today needs no introduction, but I'm going to introduce her anyway. You might already know her from LinkedIn or one of her many speaking appearances. She's the godmother of DataPrepOps. That's right: my guest is Dr. Jennifer Prendki, founder and CEO at Alectio, inventor, researcher, AI wartime strategist, and a lot more. Jennifer, would you mind telling us a little bit more about who you are and what you do?
Dr. Jennifer Prendki 1:13
Yeah, absolutely. I think you already explained at a high level what people should know. Going a little bit further into my past, what a lot of people don't know about me is that I'm a scientist, but I'm a particle physicist by training. One of the topics I hope we're going to cover today is the impact of science in AI and machine learning. I eventually became an AI person, a machine learning scientist, with a very heavy focus on the scientific aspect of everything that we do. I am currently the CEO and founder of my own company, which I started more or less four years ago, and which is specifically focused on, as you said, DataPrepOps, even though the term did not exist yet when I started the company, because I came up with it a little bit later. See it as the full automation of the preparation of an AI training dataset for any application. So, a pretty deep topic.
Max Matson 2:24
Got it. Very cool. So yeah, we have a lot to cover, but I want to start with that first point that you made. Your original background was in particle physics, right? Would you mind talking about what made you make the jump to data and AI?
Dr. Jennifer Prendki 2:41
It's sort of a complicated story, but I think it's an interesting one. Being a physicist, a particle physicist or an astrophysicist (I have training in both), was my lifelong dream. As far as I can remember, I was drawing pictures of myself with a telescope and those sorts of things. So I naturally got to the point where I got my PhD in particle physics. My goal was always to understand the universe, to understand the various phenomena around us, like pretty much any scientist who really wants to do things for the right reasons, if you want to put it that way. I always had a very curious nature; I always liked to question everything that was happening around me. I eventually graduated with my PhD in 2009, so right in the heart of the Great Recession. At that time, as you can imagine, getting any sort of funding for fundamental research was really, really hard, and of course that was a huge heartbreak for me. I specialized in a concept called CP violation, which is the study of the asymmetry between matter and antimatter. Fundamentally, it's understanding why the universe actually exists and did not get annihilated at the time of the Big Bang. I was looking for a very specific type of particle accelerator, and that was not getting any funding back in the day. In fact, the search for the Higgs boson was the topic of the day, and that was not the one thing that was most interesting to me.
So I eventually did a postdoc with a focus on this area of CP violation. Eventually I reached the point where it didn't look like the type of research I wanted to do was really going to be funded anytime soon. I realized that what I really care about is using math and understanding the world, and one aspect of that is, of course, understanding human intelligence. I started realizing that I have this amazing data analytics skill, which I used over and over as a particle physicist. People don't necessarily realize that the life of a particle physicist is generating or collecting a lot of data and then making sense of it. It is very, very similar in many ways to what you do when you're a data scientist. So I made what is actually a natural move for a lot of people with my background, especially back in the day: going into machine learning. Back then, AI and machine learning was not a topic that was even available for study, even at large universities. In fact, if you look at people who have more than 10 to 15 years of experience in the domain, very few actually have a background in machine learning, or probably even computer science, so it's not as atypical as it seems. For somebody who has a deep passion for science and for math, it actually makes a lot of sense to do what I did.
Max Matson 6:22
Yeah, no, it makes sense, right? In the sciences, like you said, your job is to collect data and then find learnings from that data. That's very similar, like you pointed out, to a data scientist; I mean, it's basically the role. So concretely, what was the transition that you made when you wanted to get into data analysis? Was that changing your research track, or did you go into...
Dr. Jennifer Prendki 6:46
Basically what happened is, paradoxically, I started realizing that this might really be a path for me to explore. Back in the day, the finance track was a place where they would hire quantitative analysts, people who can do math, people who can do data science research and whatnot. Ad tech was also pretty hot back then. When I started looking at what the interview questions might be to get a job in this space, I realized: oh, I've done that before. The only difference was that it was not the same type of data, or not the same goal eventually. But in terms of data preparation, building models, and establishing mathematical formulas, it was very, very similar. However, the problem was that the interviewers did not necessarily get the point that I had actually used neural networks in the context of particle physics research, which was very unusual back then; it was not something that everybody was using, and I was one of the first people to see the potential of doing that. So I started reformulating my value proposition, my skills, in a way that would appeal to industry, and to specific industries, as opposed to just physics. When I look at people who want to make the same move today, I don't necessarily think you need that much retraining. It's just a reformulation of your skills, at least on paper.
So, rewriting your resume in a way that doesn't put too much emphasis on what you used your skills for, and more on the fact that you do have those skills. At the same time, there are some nuances that you want to be aware of when you make this move. But it wasn't actually a very huge hurdle for me to eventually articulate the fact that I do indeed have the skills that are required of a data scientist.
Max Matson 9:08
Makes sense. Yeah, I think that's becoming a more common thing, right? People go to school, they specialize, they learn things, and then they have to apply it to a different kind of sector when they get out into the working world. For me, it was economics, and then I came out into the world and applied that to digital marketing. I think that kind of gets lost as a track. People expect that you train for one thing, you learn one thing, and then you do it, right? And...
Dr. Jennifer Prendki 9:33
So, two points here. First and foremost, both for a master's and a PhD, I think what you really learn is, again, the ability to question things, to establish a research plan and whatnot. So I see absolutely no difficulty for somebody who has trained to do research in economics, in finance, in biochemistry, or whatnot, to transfer those skills. In fact, I wouldn't even say it's a transfer, because what you're really learning is scientific questioning, scientific methods, the design of experiments, which I think is an extremely valuable skill in data science. I wouldn't even qualify that as being a jump. It's, again, a reformulation. The other thing that I think is very important is that, when you look at the world today, AI research labs and many companies are taking this direction of multidisciplinarity. What they're trying to do is hire experts from different areas, with the belief that you need somebody with an expertise in neurology, sociology, ethics, even things that are a little bit more of the human sciences. If you bring all of those skills under one umbrella, then you really have an opportunity to use AI, to use machine learning, in a way that benefits everybody.
So as much as you say it's becoming more common for people to make that sort of transition, I also think that there is a lot more appetite from many companies, especially the advanced research labs, to hire people with different, unexpected backgrounds.
Max Matson 11:36
I see. Gotcha. No, that makes a ton of sense, and it also plays to the strength of the models, right? They're very multidisciplinary by nature. So one thing I want to talk about: you are the pioneer of DataPrepOps. Was this something that you built up from your experience in physics, going forward into this new kind of realm and seeing that this was a problem that needed solving?
Dr. Jennifer Prendki 12:04
I love that question. This is a story I don't tell frequently enough, because people usually perceive this as something that came up relatively recently. In fact, I've been speaking about that problem in the industry for six or seven years now. It's an idea that has become top of mind for people only very recently, specifically with generative AI, I'm going to say. But you're absolutely right: it goes back to my physics days, and I think I have a very interesting story to tell about this. When I did my PhD, being an experimental particle physicist, you're part of what you call a collaboration. A collaboration is a group of between 200 and 2,000 particle physicists scattered all over the world, and they all work with the same particle accelerator and particle detector, because there are very few of these in the world. The one that I was associated with was an experiment called the BaBar experiment, based at SLAC, the Stanford Linear Accelerator Center in California. The agreement, specifically with PhD students, is that you can do your PhD using that experiment's data, but you have to collaborate: you have to participate in data collection, tuning of the data collection process, cleaning up the data, et cetera. So what they make you do is be stationed over there, at the site of the experiment, for a couple of months.
My responsibility during that time, which was in 2008, was a part of the detector that specialized in measuring the speed of particles and whatnot. That's not super important for the conversation. What was important is that I got there in January 2008, and that's when the recession started. The original plan had been to run that experiment for an additional three years or so, and then they said: you guys have a couple of months to finish up. I was going there with the intent of slowly, gradually collecting data, but guess what happened. All of the senior researchers of that experiment decided: we're going to increase what you call the luminosity of the accelerator. Luminosity is basically the number of collisions you cause. So you increase the power, you collect more data, and that becomes very messy. Suddenly we had six months to finish up everything. So I started working on my part of the system, which was a disaster because, of course, the system was not prepared for that magnitude of amplification of the signal. Then I was working on what you call the reconstruction of the events. You take the raw data, which in this case is electrical signals in the particle detector, and you try to reverse engineer what might have happened in terms of particle decays.
For the run I was responsible for, I was like: we have so many collisions, but when I reconstruct the events that are useful for research, there's much less interesting data. That was my first experience with the fact that volume of data doesn't matter if you have junk information in there. I have lots of stories of me arguing: we can't do that, this is not going to work, we're not getting as much useful data, we need to change the reconstruction algorithm. That was making me angry back in the day, because this was the data I was going to have to work on for my PhD, and you're making things harder for me. But that's when I realized: look, volume means nothing if this is not the right type of data, if the quality is not high enough. Fast forward: as you know, I jumped into industry, and I saw the same problem there. Everybody was like: volume, volume, volume, let's just throw more hardware at the problem. And I was like: you guys are not going to make the same mistake again, right? For the time being, I understand how it is an easier solution to just throw more compute at a problem. But you have to understand that any dataset is made of useful data, useless data, and data that can actually destroy your model, because that did happen to me in physics. It made re-establishing the origin of particles an almost intractable problem.
So it was in the back of my mind all the time. Eventually, I got my first very large team to manage. When I was at Walmart Labs, after an acquisition and an internal reorg, I inherited several initiatives, which was not necessarily what I had in mind when I joined the company. Part of my new responsibilities was the management of data labeling for the nascent deep learning initiatives within the department. And I realized: wait a second, now you have that problem as well, where you want to operate with volume, but all of that data needs to be annotated, and you're not increasing the budget accordingly. After that merger, suddenly we had 10x, 20x in terms of data, but the budget increase I was actually able to negotiate for data labeling was insignificant in comparison. We also started not having enough compute, because Walmart is not going to use AWS for the cloud, fortunately or not. We had our own servers, and there were never enough for the entire team. So I was like: okay, we need to rationalize our understanding of how volume is a necessity, or is the way to get a proper machine learning model. So I started looking into this concept of active learning, which is one of the key propositions of DataPrepOps, right?
Essentially, that means dynamically curating the data, throwing away stuff that doesn't matter, and ordering the data in a way that accelerates the learning process and improves the quality of the learning curve. At that point, it was a full-on need: we need to change things. I understand I was probably the first person to experience that pain point in the industry, because I was running these initiatives, data labeling and data operations, for Walmart. So I started having a much clearer idea that there was going to have to be a shift. Back then I didn't have the term DataPrepOps, so I just started falling deeply in love with the problem of thinking about the data differently. After that, I had a couple more opportunities. I left Walmart eventually to start a new role as the chief data scientist and head of data science at Atlassian. Atlassian wanted to get into machine learning, so again, similar problems: data has to be collected, where do we put this data, not all data is created equal, how are you thinking about those problems, et cetera.
At Atlassian, the problem was not so much data labeling as it was data exposure from different customers. Eventually, I got an offer to join Figure Eight, which is usually better known as CrowdFlower: the top data labeling company before Scale AI became Scale AI. Back in the day, they were the only true alternative you had to Mechanical Turk for data labeling. So I was like: okay, I'll join, but you have to believe in my idea that, in order to scale data labeling, the goal is not so much to scale the labeling process; it's more to rationalize the way you curate a dataset. I still believe this is the right approach, because solutions like weak supervision are going to give you ways to accelerate data labeling, but they are going to fail at solving the other problems associated with data storage, compute, and throwing away potentially harmful data from your process. So I joined Figure Eight, and then Figure Eight was actually on the fast track to an acquisition, so I did not have the time to fully execute on what I wanted.
So eventually I was like: look, there's this massive problem, and complete unawareness of it among most people in the market. This is a complicated technology. I think this is the right time to spend all of my energy on this. The fun fact is, I often tell people I'm an accidental entrepreneur. I did not plan to start my own company; it's just that this problem needs to be solved, I know it, and it needs to happen now, because later is going to be too late. So I eventually made the decision to start Alectio and go in that direction. Ever since then, you've seen me everywhere talking about how data volume is not the solution. People talk about data quality, and I like to add nuance to this: there is structural data quality, and there is functional data quality, which means that it's not just having properly formed data that matters, because you can have perfectly good-looking data that carries no informational density, nothing of value to your machine learning model. So I'm glad that the market is shifting there, and I'm glad that there are more and more voices expressing themselves on the topic. But I think it's about time the market caught up with the idea.
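Active learning, described earlier in this answer as dynamically curating the data and throwing away what doesn't matter, can be illustrated with a minimal sketch. It assumes one common query strategy, uncertainty sampling, for a binary classifier: the records the current model is least sure about are sent for labeling first. The function name and the example probabilities are hypothetical, and nothing here is claimed to be how Alectio implements it.

```python
def select_most_uncertain(probabilities, k):
    """Return the indices of the k records whose predicted probability
    of the positive class is closest to 0.5 (least confident first)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical model outputs for five unlabeled records.
pool_probabilities = [0.98, 0.51, 0.03, 0.45, 0.70]

# With a labeling budget of two records, pick the two the model is least sure about.
to_label = select_most_uncertain(pool_probabilities, k=2)
print(to_label)  # [1, 3]
```

In a full loop, the selected records would be labeled, added to the training set, and the model retrained before the next selection round, so the labeling budget goes to records likely to carry new information.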
Max Matson 23:00
Yeah, absolutely, it catches up with you. No, I love that. It really draws a line: you essentially had the same problem in physics, and ever since you've been seeing these different kinds of applications where the problem has been getting worse. That actually bridges into a pretty interesting topic that I wanted to cover real quick. You have a really interesting perspective on big data, and as I understand it, you left to start Alectio in part to solve that problem of big data. Would you mind delving into what you see as the problem with big data? And what do you think is...
Dr. Jennifer Prendki 23:37
Yeah, absolutely. Look, it's a multifaceted problem. Number one, I see big data as brute forcing your way into a machine learning model. As I said earlier, when you look at any training dataset, regardless of its quality, you're always going to have these three components: useful, informative data; neutral data, which I call useless data; and what I call harmful data. All of this comes in different flavors, because in many circumstances data might be harmful if you inject it early in the training process, while it might not be harmful later on. But at a high level, if you look at a specific record as you inject data little by little, that record can bring you information, it might not change the outcome of the model, or it might make the model worse. There is actually an entire new conversation that has occurred in the past few weeks on model collapse, basically an extrapolation of how you can have effects where the model eventually regresses because you injected, I'm going to say, stale data, or something that doesn't really carry information. There is a concept which has been very well known to researchers for a while, but which is not super popular (I don't see a lot of people talking about it), called catastrophic forgetting. Catastrophic forgetting is something that happens when you dynamically train a model, online-learning style.
At some point, your model starts understanding a concept, or let's say a class, really well. Suddenly you add this little bit of additional data, and the model forgets it for the benefit of something else. Your model understands what a dog is, then you add a little bit more data, and it suddenly forgets dogs to the benefit of cats. When this happens, you get these sorts of instabilities.
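A toy sketch of the forgetting effect described above, under loud assumptions: a one-parameter model `y = weight * x`, plain SGD, and two made-up datasets named `dogs` and `cats` that pull the weight in opposite directions. Real catastrophic forgetting involves neural networks with many parameters; this only shows the mechanism of continued training on new data overwriting what was learned before.

```python
def train(weight, examples, learning_rate=0.1, epochs=50):
    """Plain SGD on squared error for the model y = weight * x
    (the factor of 2 in the gradient is folded into the learning rate)."""
    for _ in range(epochs):
        for x, y in examples:
            error = weight * x - y
            weight -= learning_rate * error * x
    return weight


def mean_squared_error(weight, examples):
    return sum((weight * x - y) ** 2 for x, y in examples) / len(examples)


# Two hypothetical "concepts" that need different values of the single weight.
dogs = [(1.0, 2.0), (2.0, 4.0)]      # consistent with weight = 2
cats = [(1.0, -1.0), (2.0, -2.0)]    # consistent with weight = -1

w_after_dogs = train(0.0, dogs)
# Keep training the same model, but only on the new data (online-learning style).
w_after_cats = train(w_after_dogs, cats)

print(mean_squared_error(w_after_dogs, dogs))  # close to 0: dogs are learned
print(mean_squared_error(w_after_cats, dogs))  # roughly 22.5: dogs are forgotten
```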
And this is very hard to control. To date, it's practically impossible to predict that by injecting this additional data, you're actually going to kill your model. This is what harmful data means. So the concept of just grabbing the data, throwing it into a model, and praying for it to go whatever way it's supposed to go is just what you do when you don't know what else to do. At the same time, I'm going to say this is very natural in many ways, because back when I started in industry, collecting data was not such an easy process. In fact, at my first job, in an ad tech company, it was like: we want to do something with machine learning, and we believe we have data. But they didn't, right? There was no proper data collection pipeline. Okay, figure it out. So I established that we needed to collect X, Y, Z; I worked on instrumentation, forming a proper schema for the data we wanted to use for training later on. Three months passed, we collected data, and then I built a model on it. And I was like: shoot, I forgot to collect two or three features which would have been instrumental for this model to perform well. At that point, there was no going back. It was: okay, I'm going to have to make the model work for that specific dataset, and there is no way around it. Fast forward: with better hardware, with every single one of us having a phone in our pocket, collecting data happens all the time. Shortly put, collecting data is not the bottleneck anymore. It used to be, 10 years ago.
So now, this constraint, where you have to make it work with whatever data you have, doesn't exist anymore. But we're biased. People like me, people who started their careers 5 or 10 years ago, still come with the same kind of approach, where I need to make the model work for the data. This big data approach is sort of: collect whatever you can, feed that into the model, and expect it to return results. Now, a lot of people have spoken on that topic, including Andrej Karpathy, who I think was one of the early speakers on it. Back in 2018, he gave a very interesting talk, multiple times, on the necessity of investing time in getting the dataset right by cleaning, optimizing, annotating the right way, and whatnot. I think he spoke to the fact that by spending X number of hours optimizing the data, as opposed to the model, he would get a bigger improvement, not necessarily double digit, but at least a high single-digit improvement on the model, whereas by improving the model he would just get an additional 0.1% in performance.
So this is also where my shift comes from. In fact, when he was talking about this, I was like: finally, somebody is talking about this. It goes with really understanding that, at a high level, and I'm going to say it this way: we collect data, and earlier in the life of data science, data was practically synonymous with information. That's not the case anymore. Just look at the fact that the internet is full of repeats and retweets, and now of synthetically generated data that might be impoverished in information. It's all changing. Now, information is a subset of the dataset. You used to have relatively information-rich datasets; now you have a small amount of information in a ton of data. Because data science and machine learning is the science of transferring information from a dataset into a model, the shift that's necessary now is taking care of that surplus of useless data that can harm the model; that is becoming really key. You were asking for specific impacts. One specific impact is that if you do it the big data way, there is a financial cost to it. If we keep going in the direction of big data, you're really not giving smaller businesses, who do not have large compute resources, a fighting chance in the AI space. This was something that was very important to me when I started the company.
You have the environmental impact. You have the data labeling problem where, believe it or not, if you had to label all of the data, even with the automation we have today, all of us would have to stop everything else and pretty much spend all our time tuning, fine-tuning, labeling data, reviewing automatically generated labels or whatnot. So there is a real shortage of resources, right? And there's an explosion of data. And fundamentally, for me, the bigger problem that's coming now is if you don't have the proper data. And when I say proper data, I hate to use the term data quality, because data quality is sort of limiting. It's like stating that you just need to fill in the missing records, the missing fields or whatnot. It's much more than that, right? And so, again, I like to split the data quality question into data quality, so is the data that you currently have in the proper format or whatnot, but also data value, so is the data that you actually have what the model needs to actually learn and become better.
Max Matson 31:57
Right? It's like quantity does not necessarily equate to quality here. Yeah, so for you and for Alectio, how do you kind of draw a line between useful and harmful?
Dr. Jennifer Prendki 32:09
That's a billion dollar, trillion dollar question, obviously. So here's the idea behind Data Prep Ops, right? Again, from everything I explained, it's a perfectly natural market adoption that there has been, until now, such a ridiculous focus on the models. In fact, usually I like to say: we all want to go to the moon, and we focused on building rockets, but the role of chemistry in this, for the creation of the right fuel that works in conjunction with the rocket, is often underrated, right?
Because in the best-case scenario, your rocket might not go as fast as you'd like, right? In the medium situation, maybe it doesn't take off. In the worst-case scenario, guess what's going to happen: you're going to have an explosion. So I think it's a sort of nice analogy here. So now, I think we need to see this shift on the market, with the same amount of effort we have put into building those outstanding models. And obviously, we're experiencing a really historic change, right? I mean, what we're capable of doing with generative AI and LLMs and whatnot. So very exciting times to be alive, right? At the same time, now there needs to be a shift towards: okay, now we need the same amount of science, effort, and sophistication to go into data preparation. And so my premise is that, look, we can use, we should use, science to actually solve that problem. That's what Data Prep Ops is, right? And so you could think of Data Prep Ops as being MLOps for data-centric AI. Data-centric AI is reacting to or improving your model not by changing the model itself, or tuning the model itself, but by tuning the data, right? In fact, I think we should start talking about data tuning as well, which is kind of a weird term for me to use as a data scientist.
But I hope you understand what I mean. So one of the key ideas behind this is based on a concept called active learning. Active learning is a semi-supervised technique where you take a little bit of data, you prepare the data, you train your model, and then you assess. So the trick of active learning is that you take that model in its current state. When you take a little bit of data, the state of the model is nothing good, right? It's just the state of the model. You take that model and you infer on the remainder of your data, so all the data you haven't used yet. You're basically going to run predictions on it, and you don't have the ground truth, because you haven't prepared that data yet. However, you can try to guess whether you believe what the model says to you or not, right? And so vanilla active learning does the following. Let's say you have 100,000 records, pictures of cats and dogs and whatnot. You take a first batch of 1,000 pictures, so you call that a loop, right? You prepare that data, you feed it into your model, and you're going to have a pretty bad model, there's no question about it. But then you take that model and you apply it to the 99,000 pictures that remain, which you don't have ground truth for. You look at the predictions. You don't know if something that was predicted as a cat actually is a cat, but you have metadata associated with that prediction. Specifically, vanilla active learning looks at the confidence score. So if you have a low confidence score, you can basically say, I believe that the model doesn't actually know; it seems to be hesitant about that prediction or whatnot, right?
And basically, what you do is say: okay, I'm going to pick the records that were predicted with the least amount of confidence. The beauty of this is that the model almost begs for what it needs to see next. The problem with this approach is that you have to believe the confidence scores, which is actually very difficult with deep learning models. And the deeper the model, the more likely you are to have a Dunning-Kruger syndrome kind of problem, where the model actually overestimates the confidence of its predictions, right? So one of the beautiful areas of research within Data Prep Ops is coming up with machine learning driven processes to not just look at the confidence score, but look at everything you can look at, including the state of the model, the value of the parameters, the stability of the parameters over time, because you keep repeating that process over and over again, right? And so you can look at a ton of things, and look at this through the lens of: okay, can I build a machine learning model, or a reinforcement learning approach, to guide my training process? In fact, there is another area which we're exploring as well, which is defining which data is not currently in your dataset that you should generate or collect more of, right? And so, by using causal inference, for instance, we could basically say that it seems my model stops learning when X happens in my dataset: when I run out of cats, or I run out of dogs, or only Y dogs are already in my dataset or whatnot. And so, based on this, you can intelligently feed the right information into your model, right?
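The selection step described above, picking the unlabeled records the model is least confident about, can be sketched in a few lines of Python. This is only a minimal illustration, not Alectio's method: the function name `least_confident`, the `batch_size` parameter, and the four probability rows are all made up for the example, and it assumes the model already returns one probability per class for each unlabeled picture.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return the indices of the records whose top class probability is lowest.

    `probabilities` holds one row per unlabeled record, with one predicted
    probability per class (a hypothetical model output, not a real API).
    """
    # The confidence score of a prediction is the probability of its top class.
    confidence = [max(row) for row in probabilities]
    # Rank record indices from least confident to most confident.
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:batch_size]


# Made-up predictions on four unlabeled pictures: [P(cat), P(dog)].
unlabeled_probs = [
    [0.98, 0.02],  # confident "cat"
    [0.55, 0.45],  # hesitant
    [0.10, 0.90],  # confident "dog"
    [0.51, 0.49],  # most hesitant
]

# These are the records to prepare (label) and add in the next loop.
next_batch = least_confident(unlabeled_probs, batch_size=2)
print(next_batch)
```

As noted in the conversation, this only works as well as the confidence scores can be trusted; an overconfident model would rank records poorly, which is why richer signals than the raw score are worth exploring.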
Max Matson 38:00
Gotcha, gotcha. So it's not just quantity for quantity's sake; it's basing, you know, how you should expand these datasets on what is actually needed.
Dr. Jennifer Prendki 38:08
It's a huge, huge opportunity as well, because, according to OpenAI and Sam Altman himself, we're going to run out of data by 2025 or 2026, right? And so people have proposed synthetic data generation, but there's sort of a chicken and egg problem, because the information that lives in the model, which you can use to generate, is typically the information you grabbed from the training data, right? So it's a no-free-lunch kind of situation, where you cannot generate new information that easily. And so I'm a huge believer in this sort of smaller approach, rather than just throwing the kitchen sink at this, right? Because if you use regular synthetic data generation, you have the mother of all data problems. You can generate pretty much anything, right? And so I think it's very important that we start thinking about: okay, what do we generate? Does it better the model? Does it bias the model? Is it respecting the privacy of the people whose data was used as a training dataset to generate the data?
Max Matson 39:23
I see. So yeah, I've read an article recently about model degradation. Yeah. Right. And I think you may have mentioned that earlier, but is that kind of a consequence of using that synthetic environment? It's
Dr. Jennifer Prendki 39:33
an interesting conversation, right? Because it's not that there are no solutions to this. But at this point, if you go back to my comment that data science, machine learning, is really about transferring information from a dataset into a model, into the weights and the biases of a specific model, then it's a matter of science. Scientifically speaking, it's a problem of entropy, right? So you're not going to create new information. You can generate new flavors of that information, if you will, which might have some value when you retrain the model or whatnot. So, for example, if your original dataset is lacking some representation, let's say we're talking about language data, and people from a specific ethnicity, a specific age, a specific location or whatnot are not represented in this dataset, it's not going to automatically be able to generate data with that representation, right? So mathematically speaking, it's actually a problem of vector space. You are only re-presenting that original information differently, right? So if you actually believe that you can reinject that information to force the model to become better or whatnot, I think you should probably think about this more at a scientific level.
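A toy way to see the vector-space point is to generate "synthetic" records by blending existing ones and check whether anything outside the original data ever appears. The sketch below is an assumption-heavy illustration, not a model of real generative AI: it uses simple linear interpolation as a stand-in generator, and the three-feature records (where the third feature is never represented) are invented for the example.

```python
import random


def interpolate(a, b, t):
    """Blend two records: a point on the straight line between a and b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


random.seed(0)  # fixed seed so the run is repeatable

# Invented original dataset: the third feature is always 0.0, standing in
# for a group or attribute that is simply not represented in the data.
original = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [2.0, 3.0, 0.0],
]

# "Synthetic" data made only by recombining what is already there.
synthetic = []
for _ in range(100):
    a, b = random.sample(original, 2)
    synthetic.append(interpolate(a, b, random.random()))

# However many samples we generate, the missing feature never shows up.
third_feature_seen = any(record[2] != 0.0 for record in synthetic)
print(third_feature_seen)
```

Real generators are nonlinear and far more expressive than this, so the sketch only illustrates the intuition: recombining existing records gives new flavors of the same information, not the representation that was missing to begin with.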
Max Matson 40:55
Yeah, no, absolutely. Absolutely. And I think it's a burgeoning field, now that we have kind of reached this point where people are interacting with, you know, LLMs on a daily basis, kind of at scale. But questioning, you know, consciousness and sentience and where our ideas come from is kind of this question that I think is inherent there, right, that we don't necessarily have a full scientific answer to. But the concept of stealing like an artist, I think, is very relevant there. Right? When you look at, you know, artists, creatives in the human space, we are taking all of these bits of information and formulating something unique by combining them, right, but it's still pulling from the source information. And it sounds like LLMs kind of function similarly there. Gotcha. Very cool. So, kind of moving forward there, could you tell us some of your, you know, predictions, thoughts on kind of where AI and machine learning are heading? And, you know, that can be in the context of Data Prep Ops or not.
Dr. Jennifer Prendki 41:57
Well, I mean, so basically, my predictions in terms of the people working on machine learning in general: I absolutely believed, I had to believe all along, that eventually people would get to data-centric AI, and the importance of tuning and improving datasets or whatnot. So more and more people are using the terms, both data-centric AI and Data Prep Ops, right? More and more companies are getting into the space. So for me, it's a huge win, right? I hope I contributed, at least partially, to getting the narrative to where it is. Definitely, the boom of LLMs has played a big part. What I think is pretty interesting nowadays is that we all thought we were going to make a proof point for AI with autonomous driving, right? And I was among the people who were always scared that there was going to be another AI winter, because the original AI winters came a lot from people who were overpromising and underdelivering on what AI could do, right? And you could claim Elon Musk with Tesla sort of did that, right?
So there was definitely risk. And unbeknownst to all of us, what happened is that this solidification of the belief in AI, and in the potential of AI, actually came from generative AI, right? Which was not necessarily a move any of us would have forecasted. But anyway, I think we are at the crossroads now, where companies are like, oh, that works, it's worth investing, right? Now I think everybody wants a piece of it. At the same time, because of the current market conditions, the fear of recession and whatnot, I think people also realize that they want to do AI, but they want to do AI without breaking the bank, essentially. You look at any company, they're sort of in this mode where: okay, I want to do AI, but how do I make sure that I control my costs, even on the scalability piece and whatnot? I actually call that the AI chasm, or crossing the AI chasm. So we've long talked about how MLOps was helping companies go from a prototype of a model to something that would function in production. Nowadays, the problem is having something in production that can stay up and running and update itself, without having to retrain the model from scratch every single time, right? So this is definitely, from an operational perspective, where the shift is going to go.
I mean, so doing machine learning in cost-efficient ways, which Data Prep Ops is going to contribute to; it's not the only solution, for sure. In terms of AI as a tool for humankind, look, I think we see what's happening right now, basically the augmentation of humans or whatnot, right? And so people will tell you we're not going to be replaced by an AI, but we're going to be replaced by somebody using an AI. My position is, I don't think anybody gets replaced. I think, just like with the Industrial Revolution, jobs are going to change, right? You can think of writers and artists using AI to generate new content or whatnot. I'm also thinking about creating more jobs for supporting AI, right? So AI product managers, AI customer success people, and, in the space of robotics, more people who can maintain robots, retrain robots, these sorts of things. So I'm not even sure about stating that we're going to have the same roles, but we're going to be using AI for accomplishing the same tasks. I think that's not quite what I see happening right now.
Max Matson 46:12
No, I think we're definitely aligned in our thinking there. I recently wrote an article that was kind of comparing, historically, the trajectory of the typewriter, from a marketing perspective and a market position standpoint, to AI, right? And how, if you had guessed that all of the roles that preexisted the typewriter would just be augmented by the typewriter, you'd be wrong, right, because it invented, you know, a number of jobs that ended up replacing those jobs. But instead of it, you know, just being like, these jobs are gone, people re-specialized. Right. And I think another piece there that's kind of similar to AI is the individual value and the early adopter. Right? You mentioned Crossing the Chasm. I think that's a very apt metaphor here, because I think AI does still have to cross the chasm, right? I mean, I think that the buzz that we've seen today is just kind of like the first shots, right, of the huge transition. But still, there needs to be a transition to providing massive organizational value. And I would say that, overall, we're not quite there. So it definitely seems like Data Prep Ops is going to be a big piece of that puzzle, making organizations, you know, able to move quickly when it comes to ML and AI and being able to instrument it efficiently. So all
Dr. Jennifer Prendki 47:25
that being said, just to build on top of this: I think people are like, AI is here. So AI is absolutely here, right? And I'm really happy that there's been this market validation. At the same time, generative AI is such a small piece of what we can do, right? In fact, I had this conversation with somebody a couple of weeks back where I said I believe that AGI is really automatically generated intelligence, so AGI, right? And so the idea is, instead of AI in the box, which is what you could call generative AI, pre-trained models and whatnot. And don't get me started on the fact that what makes us intelligent is that none of our weights, or neurons, are fixed, right? That gives us the ability to grow, rethink, change our minds. And it's not just about taking in new information. It's also about adapting ourselves, basically, to the situation, right? I see the same thing with LLMs, with generative AI. So you have this super smart AI-in-a-box sort of thing. But this is not going to evolve; it can answer your question or whatnot, right? Where I see the future is: in any circumstances where you need external help for making a decision on the fly, based on very little data that you have or whatnot, building these on-the-fly AI assistants, right?
I mean, and not in the sense that we're talking about AI assistants today, right? And so imagine having somebody like a gardener deciding whether they should cut a tree or whatnot. Right now they're using their experience: oh, this tree seems diseased, I don't think it's worth saving or whatnot. And so imagine a world where that person pulls their phone up and takes a picture of that tree. And there is a process behind it that automatically creates an app that evaluates similar situations, compares them, and establishes the end-to-end app. And so three seconds later, you have this app, and now you can apply this app to any of the trees in the yard, right? And basically, it's telling you whether to cut or not. And so this has nothing to do with LLMs or generative AI at all, right?
Max Matson 49:52
Yeah, I love that, right. It's like the number of applications, not just of generative AI but of all of the other kinds of breakthroughs that are simultaneously hitting with ML and with AI, are all kind of coming to fruition at similar times. Right? So I'm very interested to see kind of how that pans out. We're running a little bit tight on time, but I do want to just real quick talk about kind of your experience with Alectio and how you've, you know, enjoyed being a founder and CEO. What has that been like? No, no,
Dr. Jennifer Prendki 50:21
like I said, I'm an accidental entrepreneur, but I'm a person who enjoys living on the edge of what I can do. So I'm very much into, as I often say, being comfortable with the uncomfortable, right? In fact, the moment I'm not uncomfortable with the situation, I get bored. Which sort of explains it, like: okay, I don't want to start a company, let's do it. Right. But I guess also, in many ways, I have this natural entrepreneurial spirit, right? Look, it's been an amazing experience so far. It's been ups and downs. You have days of doubts and fear, and arguments with other stakeholders, and days where you're like, I'm making such a difference, and days where you're proud of what you've accomplished or whatnot, right? So you have to embrace it, the good and the bad, all together. And I think one thing I've really learned and developed since I started is the ability to say: this is horrible today, tomorrow will be better. I mean, I unfortunately started a company just months before COVID started. And then I had to start running teams remotely, which I was not used to doing. And with the market changes, where large companies were poaching talent, I had to go international, right? And then now, you know, shifts in the market or whatnot. Let's be honest, it's not trivial yet for a solo woman founder to get funding, get support, get credibility or whatnot. Right. And so, yeah, I mean, for whoever wants to start this, it's worth it. But you have to be beyond gritty, right, to make it.
Max Matson 52:25
100%, 100%. So, to that point, do you feel that your technical background, and the fact that you are, like, a pioneer in this field, has helped with that aspect? Yeah, I would imagine, but just like with the kind of inherent discrimination that comes with, you know, being a woman, it's
Dr. Jennifer Prendki 52:46
like, you know, so when I started, right? I had this question of, I want to start this, and I have no idea how. Some people start as solopreneurs because they want to be in charge of their own destiny. There was a little bit of that, but I genuinely, originally, considered having a co-founder or whatnot. I was so early with this idea that nobody would really get it, right? And this was something I had to do anyway, regardless of whether or not I found a co-founder. Right. And, yeah, to some extent, you would say there are challenges associated with VCs and investors not necessarily understanding why you're doing the things that you're doing, right? You cannot expect the typical VC to just say, oh, this sounds like something that's going to be needed three, four, five years from now, right? So you have to meet the right people, who are like: look, I have no idea why you're doing this, I have no idea what you're building, but I feel this is going somewhere. And so, unfortunately, when you're a solo founder, when you're an immigrant, when you're a woman, it's harder to get to that.
Max Matson 54:01
Yeah, well, all the more impressive that you've made it. Of course, of course. But all that being said, what is one really memorable or, like, significant moment that you've had so far in your journey as a founder?
Dr. Jennifer Prendki 54:13
I mean, I have memorable moments, good and bad, almost every week, right? I feel like it's a whole journey. I would say, on the bad side, a defining moment was the moment COVID hit, because I had started this brilliant but junior team in the Valley, right? I did not go for that crazy fundraise. In the beginning, you bootstrap with whatever you have, right? And so my trick was: I'm going to hire talented, hungry, straight-out-of-college people who I'm going to mentor very closely and whatnot. The first few months were amazing, because we were all in the office. We had these amazing times in front of a whiteboard, drawing architectural diagrams and equations and having fun together. And then, suddenly, everybody has to go home, right? And I had no idea how to manage the entire team remotely. I had had remote teams before, but it's a completely different ballgame, especially with a team that I was assuming I would be able to mentor and interact with on a day-to-day basis. So that was a really tough time for me, right? And so, being a founder, you do what you have to do, you learn new skills, you rise and shine, right? You go forward. And recently, obviously, the market is getting tighter. Like any MLOps company, we had customers who didn't make it through the recent times or whatnot. So it's constantly looking for new customers and whatnot. I would say the part of entrepreneurship I like and I dislike the most is having to do everything on your own, right?
I mean, because you have to be the marketing person, the evangelist, the fundraiser, the person who establishes the culture, establishes the partnerships or whatnot. So some days, when everything goes fine, it's like, oh, I've learned so much, I've learned so many things, and whatnot. Some days it's like, I can't do this anymore.
Max Matson 56:22
Yeah, all right. Well, I appreciate you, you know, taking on all those hats, it's definitely not an easy thing to do. Right. But you're here we're talking. I really appreciate it. So I'll ask you one last question, which is, for, you know, people who are aspiring to make a mark in AI and machine learning. What advice would you give them? Maybe if they're starting their career right now? No, I
Dr. Jennifer Prendki 56:47
mean, look, at the risk of coming across as a little bit unpopular nowadays: it's just sheer, at least at the beginning, hard work and sacrifice. You know, I have the background that I have, so maybe I had it a little bit easier than some people getting started and whatnot, but there is no free lunch, right? You're not going to waltz into the space and get a crazy salary, a crazy opportunity or whatnot. So I did work for companies where I did not relate to their mission, right? I felt bored, I felt frustrated. I wasn't where I wanted to be, or whatnot. And so, the one thing that I think really made a difference for me is: take everything with an approach of growth, right? So basically, I do not like this project, I don't like this job, but whatever comes out of it, I'm going to learn, even if it's a poor experience. In fact, I'm going to say I learned a lot from terrible, terrible managers, just by learning what not to do, right? And so, if somebody wants to start their career in AI: AI is obviously a cutting-edge field. It's not an easy field, and you need to be good at a lot of things. And many times, there is no textbook on how to get there, because we're all building this together right now, right? So it's not just something you can memorize or whatnot. So be ready for the disappointments and the hard knocks and whatnot, right? I say that as an entrepreneur, but I do say that as a data scientist as well.
Max Matson 58:35
Right? No, absolutely. It's a fantastic answer. Well, Jennifer, thank you so much for coming on. It's been fantastic talking to you. I've been familiar with your work for a while through LinkedIn, and I've just been really impressed by it. So it's really amazing to
Dr. Jennifer Prendki 58:51
me, it was very nice talking to you.
Max Matson 58:54
Oh, likewise, likewise. Would you mind telling everybody where they can find you? Yeah, I
Dr. Jennifer Prendki 58:57
mean, so for social media, I'm mostly on LinkedIn, right? I tend to be relatively active on there. And so I think the easiest way to reach out to the
Max Matson 59:10
Perfect, perfect. Jennifer, thank you.
Transcribed by https://otter.ai