AI in Drug Discovery and Value of Data with GSK's Kim Branson
This is Cross Validated, a podcast where we speak to practitioners and builders who are making AI deployments a reality.
Today our guest is Kim Branson, who is the Senior Vice President and Global Head of AI and ML at GSK. GSK is one of the top 10 largest pharmaceutical companies globally and a Fortune 500 company.
Follow me on Twitter (@paulinebhyang)!
Transcription of our conversation:
Pauline: Welcome to Cross Validated, a podcast with real practitioners and builders who are making AI in the enterprise a reality. I'm your host, Pauline Yang, and I'm a partner at Altimeter Capital, a lifecycle technology investment firm based in Silicon Valley.
Today our guest is Kim Branson, who is the Senior Vice President and Global Head of AI and ML at GSK. GSK is one of the top 10 largest pharmaceutical companies and a Fortune 500 company.
Thanks so much for being on the show today, Kim. You've had such a wide experience, both with AI and types of companies touching drug discovery and healthcare more broadly, whether it's being Head of AI at Genentech, or being a research scientist at Vertex Pharmaceuticals. How did you first get started with AI?
Kim: It's great to be here and thanks for inviting me. I guess like a lot of people, I came through it through the idea of wanting to solve something. So I was fortunate enough, I did medicine and science as my undergraduate, and I was really interested in that area.
And, like a lot of people, I read The Billion-Dollar Molecule, which is a book about drug discovery, and decided if I could just discover and build one drug, that would be my contribution to the world. And I really fell in love with structural biology.
So I was at the University of Adelaide, taking lectures in the Braggs Lecture Theater, which is named after the guy behind diffraction theory and things like that for solving these sorts of things. And I ended up doing a PhD in structural biology and X-ray crystallography. And I was looking at basically small-molecule, protein-ligand binding and how you can design small molecules to bind to a particular protein.
And in the course of doing that, we have all these various models and theories for how you might be able to estimate that. And I was interested in both constructing these scoring functions, working out how something might bind. But also the problem we had at the time was that you have a lot of these different models and you didn't really know which one works.
So some you could use, okay, we've got some known data, we know how this thing works, it tends to work well for this protein and these types of structures. So it's calibrated on this. But then if you want to find a new thing, you're like, well, do I just take the model I've got before? And how well do I know it works?
So there’s this interesting question of domain of applicability. And everybody was so interested in, oh, I'll take a whole bunch of different ones and we'll average them. And I kind of fell into the thing of like, A) I want to learn how to build these models, but B) like why does everything always work? And what do you do when you need to have prior information?
And I then fell into the world of machine learning. So, I ended up doing this sort of whole PhD into working out optimal combination of weak learners without prior information and things like that. And a lot of things on ranking theory and really discovered this entire world of machine learning, which back in, I guess in the 2000s wasn't like the world's biggest domain and discipline, right?
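To make the "take a whole bunch of different ones and average them" idea concrete, here is a minimal sketch of rank-based consensus scoring. It is an illustration only, not the method from Kim's PhD: the ligand names, the three scoring tables, and the convention that a lower score is better are all assumptions made for the example. Averaging ranks rather than raw scores lets scoring functions on different scales be combined with no prior information about which one is most reliable.

```python
from statistics import mean


def rank_positions(scores):
    """Map each item to its rank (1 = best) under one scoring function.

    Assumes a lower score is better, as with a predicted binding energy.
    """
    ordered = sorted(scores, key=scores.get)
    return {item: pos for pos, item in enumerate(ordered, start=1)}


def consensus_ranking(score_tables):
    """Combine several scoring functions by averaging each item's rank."""
    per_model_ranks = [rank_positions(table) for table in score_tables]
    items = per_model_ranks[0].keys()
    avg_rank = {item: mean(r[item] for r in per_model_ranks) for item in items}
    return sorted(avg_rank, key=avg_rank.get)


# Hypothetical scores from three scoring functions on different scales.
scorer_a = {"lig1": -9.1, "lig2": -7.4, "lig3": -8.0}
scorer_b = {"lig1": -55.0, "lig2": -60.0, "lig3": -40.0}
scorer_c = {"lig1": -1.2, "lig2": -0.9, "lig3": -0.3}

ranking = consensus_ranking([scorer_a, scorer_b, scorer_c])
print(ranking)
```

The open question Kim raises still applies to a sketch like this: an unweighted average treats every scoring function as equally trustworthy, which is exactly the assumption you cannot check without prior data.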
You come and knock on the door of the computer science department and reach out by email to a bunch of different people. It was a sort of cool little club. Everyone’s like, welcome, come in. We think about that. Let us teach you things. So I fell into it that way, and once you sort of have the hook, the bug, you then gravitate to more and more of those problems.
So that's sort of how I started with that. I was at Stanford working with Vijay Pande and others. And then I, as you mentioned, went on to Vertex to work with Pat Walters and Mark Murcko doing a lot of stuff on, again, machine learning in that domain.
And then, after that, with some good friends of mine, we did a search engine called Discovery Engine. So I did a lot of work there on building models at scale. So that was kind of fun, going from small models where everything barely works to terabyte datasets where you can build some really interesting things. And I've sort of followed that.
So I've always been interested in machine learning and medicine. And also the different large datasets in general. Like most machine learning people, there’s no data like more data.
Pauline: That's right. And so, there's been so much discussion about the impact of AI and I've been particularly excited to dig into use cases where AI allows humans to do things that have never been able to do before.
And I think drug discovery is such a prime example of that category. Can you walk us through what role AI & ML plays at GSK in particular?
Kim: Sure. So, when I came to GSK, it was kind of unique in terms of it's this big 300 year old company that was really in the phase of reinventing itself.
They have this strategy of, look, we're going to have functional genomics technology on one side, so we can manipulate data in the cells and we can generate lots of data. We also have this explosion in data from human genetic databases. So think of UK Biobank, 23andMe, obviously connected with GSK, and other types of things.
We have all this genetic data, we can apply machine learning to that. We now have all these experimental methods. We have single-cell sequencing and things like that, imaging. So we've got a huge amount of data, and we can now turn up and turn down individual genes to test hypotheses and see what happens. So you need machine learning as the third part of this strategy to make sense of all of it. So that's kind of the core strategy.
And GSK was fairly unique in that like they wanted to build 100% in-house a machine learning group. And they also knew that like, no, you're not data scientists. No, we're not going to make you do the ETL, or no, you're not just computational biologists or statisticians. Like we want a proper machine learning group. And so the way we play at GSK is really all the way across the spectrum.
So you can broadly think that the way GSK is organized is, we have a development organization, a manufacturing organization. We obviously do pharmaceutical research, we do vaccines research, commercialization and AI & ML. So we sort of sit there with that. And so we play all the way across from early stage discovery and I'll just talk to the sort of the pharma business to make it sort of simple. So we do a lot of work on like earlier stage machine learning and sort of in that discovery piece.
So for a lot of those genetic variants in those large databases, we only really know what happens with them about 15-20% of the time. So if it's in the coding region of a gene, it may be a missense mutation, maybe loss of function, etc.
But a lot of them fall in the regulatory region of a gene. And sometimes they affect the closest gene and sometimes they affect something distant. So there's machine learning models to work that out. And sometimes you don't know the impact of the mutation. Does it increase the level of protein? Does it decrease the level of protein?
Does it happen the same in all tissue types? So we do a lot of machine learning in that sort of genetics and genomics space. We do a lot of machine learning with high-content imaging. So you think of things like Cell Painting or these sorts of things.
Or often we have these kinds of complex systems in, say, neuroscience: iPSC neurons, or three different types of glial cells and primary neurons. You make these very complex cultures, and you can't do lots and lots of experiments in them. So there's machine learning to analyze all the images.
There's a lot of machine learning also to make sense of how to do an experiment. So we have a particular endpoint, and often these endpoints themselves are multimodal. Like, I want to make this gene expression pattern look like this, to go back to wild type.
I want the images to look like this. So like this protein expression panel. It's a complex multimodal thing. And then, I've got 20,000 genes I could possibly mutate, or even more if I want to do pairwise. Do I just do a whole genome-wide screen? And for some things you could, if it's a very simple system.
But inevitably for the more complex systems, the ones that we know translate better, it's too expensive to do that. So we even use machine learning to work out which experiments to do. So there are these active, guided things you do: you have the machine predict some things to start with, and it might be seeded by genetics or other types of things.
You do the mutational analysis or do the CRISPR experiment, which is mimicking your pharmacological intervention. Get the data, feed that back in, and have it do this active learning thing to try and find the optimal set that produces a response. So as you go all the way along, we do machine learning obviously in design of small molecules, antibodies and things like that. There's various ways to sort of engage in there and across GSK. And then we come across to sort of the clinical aspects. So we have machine learning in like clinical imaging, which we can use to kind of create continuous traits as it were.
So in the UK Biobank, we might have a model to predict how much liver fat you have. And traditionally for doing these GWAS studies, you might just say, okay, I'll look at what's diagnosed for you. And that's certainly one way of doing it. But if you actually look at just the liver fat, you might have people that have a degree of that, but they're not formally diagnosed.
But you can put someone on a continuous trait. So whenever you've got a gradient, you've got more information and that becomes useful as well. That's also useful for clinical analysis in trials. And then we start to go more into the application side, where we have the strategy of building software for every asset we release, which is often around who's likely to benefit, how does it work? You know, various things around risk and response, for example. So computational pathology is one area that we do a lot of work in right now and have done for a long time. We have a GSK professor of computational pathology in the UK as well. We do a lot of work on response prediction.
So then this is sort of complex multimodal data. So now you're starting to take things from imaging or a whole bunch of different blood-based markers, borrowing strength from population studies, for example. And so we then play into that kind of clinical domain.
And then there's really more what I'll call applied AI, where we're picking up other types of things across GSK. So, you know, doing automated analysis of how we use animals, for example. To know where they're moving, what's going on. Animal welfare and all sorts of things that allow us to use fewer animals and get more information. It's a very good thing. It's a part of how we have to do drug development, but also things like predicting which cell lines are going to be stable, which ones will make the most protein in manufacturing.
There's all these different areas where we apply things as well. So it's quite a wide spread. And so we have this fairly large group that's distributed across the globe. So, I'm in San Francisco, but we've also got people in Philadelphia, Boston, and London, where we have this AI hub in King's Cross.
We have people in Heidelberg and in Zurich, a large group out there in Tel Aviv as well. So it's really quite a very kind of large distributed organization as well. So, we sort of play all the way across the value chain.
Pauline: It's very clear that you guys think of AI & ML as a superpower across all your different products and everything that you do. So that's incredible to hear. And it sounds like since you've joined, there's been a lot that's changed and this was a really interesting opportunity at GSK.
I'm curious, setting aside generative AI, which we'll talk about in a little bit, how have you seen traditional ML processes or landscape change even during your tenure at GSK?
Kim: One of the advantages that big companies have is they have data.
Lots of historical data and everyone says, that must be so valuable. Well, yeah, GSK has been around for 300 years. They've got lots of data. But let's just like think of like small molecule solubility alone, like yeah, we might have data going back for 50 years, but what you need to realize, it was like, six different versions of an instrument with vastly different errors.
And maybe we weren't so precise at knowing the exact chemical composition of what we were measuring 50 years ago versus now we have much better sort of purification things, mass spec, all that kind of stuff. And so the data quality varies a lot. And the other thing you can't do is you can't just sort of work with the data that's lying around.
It's useful for sure, but it's not enough if you really want to do something and build a model. And the shift we drive, because the model is the tool that allows us to build value, is to actually generate data with the express purpose of feeding it into the model. Which is odd, because a lot of people are like, I didn't come to GSK to generate data to feed into Kim's model.
But we do a lot of automation and these different types of things, and we treat it as a sequential learning or active learning problem. So we're doing adaptive sampling under uncertainty constraints to say, okay, I generate some batch for the model, we retrain the model, we make some more predictions, we feed in what it does, and the model learns faster.
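The generate-a-batch, retrain, predict, repeat loop described here can be sketched in a few lines. This is a toy illustration under stated assumptions, not GSK's pipeline: `run_assay` is a hypothetical stand-in for a wet-lab measurement, the model is a small bootstrap ensemble of polynomial least-squares fits, and "uncertainty" is simply the spread of the ensemble's predictions. Batch size, pool size, and cycle count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)


def run_assay(x):
    """Hypothetical stand-in for an experiment: hidden truth plus noise."""
    return np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)


def features(x):
    """Polynomial features 1, x, ..., x^5."""
    return np.vander(x, 6, increasing=True)


def fit_ensemble(x, y, n_models=20):
    """Fit least-squares models on bootstrap resamples of the data."""
    coefs = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), len(x))
        w, *_ = np.linalg.lstsq(features(x[idx]), y[idx], rcond=None)
        coefs.append(w)
    return np.array(coefs)


def predict(coefs, x):
    """Return the ensemble mean and spread (used as uncertainty)."""
    preds = features(x) @ coefs.T  # shape: (n_points, n_models)
    return preds.mean(axis=1), preds.std(axis=1)


pool = np.linspace(-1, 1, 200)  # candidate experiments
start = rng.choice(len(pool), size=8, replace=False)
x_train, y_train = pool[start], run_assay(pool[start])
measured = set(start.tolist())

for cycle in range(5):
    coefs = fit_ensemble(x_train, y_train)      # retrain the model
    _, std = predict(coefs, pool)               # predict over the pool
    std[list(measured)] = -np.inf               # never re-measure a point
    batch = np.argsort(std)[-4:]                # 4 most uncertain candidates
    x_train = np.concatenate([x_train, pool[batch]])
    y_train = np.concatenate([y_train, run_assay(pool[batch])])
    measured.update(batch.tolist())

print(len(x_train), "measurements after 5 cycles")
```

The design choice that matters is the acquisition rule: here it is pure uncertainty sampling, whereas a real campaign would likely also weigh cost, diversity within a batch, and the predicted value of each candidate.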
And that's a huge shift in terms of generating data, because the model becomes a critical asset. So we're really starting to do that at huge scale, so we now generate data at an exponential rate. That's driven by really the decrease in cost of sequencing technologies.
But all these interesting new measurement technologies we now have in biology can generate data at scale, particularly imaging. And we also do all the things in time series. So you get a very different response if you've made a genetic perturbation and you measure it at day three versus day eight or day 12, when you see certain things come up and come down again.
And if you perturb both negatively as well as positively, so you can outbreak later. In the different contexts of different patients. So, you can think about that as one huge change that we drove. And not everything is a neural network problem.
There are still lots of things where we use sort of simpler and more robust things to answer these particular problems. But the bigger thing that is coming now is that we're increasingly seeing the ability to join multimodal things. So we used to build various neural network methods where you might combine them at the top layer or various types of things, or you have an imaging-based thing and maybe you have RNA-seq and you can join them together, and end-to-end learning wasn't really possible across both those modalities, but we're seeing those sorts of things change as well.
That's a big shift that sort of happened across things. So culturally, GSK thinks about data as a critical asset and generating data as a critical asset for both building models now, but also models in the future. And then thinking about the quality of the data that we generate. So it's very different than sort of just using historical stuff.
So we really take a lot of care with the metadata and the variance in the noise and how you generate it and those sorts of things. And when you kick off these really big learning experiments, what happens is the experiment is actually becoming quite a thing, because I say, okay, I want this outrageous thing.
And they're like, that's not possible. Actually, well, I was thinking on the weekend Kim, and what if we did this like that, I can't get you 1000x, I can get you a 100x. Like okay, that's awesome, I'll take it. It's still better than the 1x. But then they come back and as you start doing it, you get these kind of Wright’s Law coefficients where they get better and like suddenly we've improved it by 20%.
And the more you do that, so you're getting more data per cycle, the cost is falling so that value of that data point falls, driving that is really key. So that, that's been a huge thing that we've driven at GSK and we've seen.
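For readers unfamiliar with the reference, Wright's Law says unit cost falls by a fixed fraction every time cumulative output doubles. The sketch below works through that arithmetic. Treating the 20% figure as an improvement per doubling is an assumption made for illustration (the conversation does not say what the 20% is measured against), and the starting cost of 100 is made up.

```python
import math


def wright_exponent(improvement_per_doubling):
    """Exponent b in cost(n) = cost(1) * n ** (-b).

    A 20% improvement per doubling means 2 ** (-b) == 0.8.
    """
    return -math.log2(1.0 - improvement_per_doubling)


def unit_cost(first_unit_cost, cumulative_units, improvement_per_doubling):
    """Cost of the n-th unit (here, a data point) under Wright's Law."""
    b = wright_exponent(improvement_per_doubling)
    return first_unit_cost * cumulative_units ** (-b)


# Made-up starting cost of 100 per data point, 20% cheaper per doubling.
for n in (1, 2, 4, 1024):
    print(n, round(unit_cost(100.0, n, 0.20), 2))
```

Under these assumptions, ten doublings (about a thousand times more data) bring the per-point cost down to roughly a tenth of where it started, which is why running the loop more often compounds.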
Pauline: A lot of people are struggling with how do you deploy this at scale? And it seems like you guys have really cracked that.
What's maybe one or two learnings that you've had with dealing with all of this data and then generated data at scale, whether that's storing it or picking the right models or picking the right technique? Like, what's been something that maybe has been surprising to you?
Kim: So certainly, like a lot of people, I mean, GSK has gone from having its own giant data center they built in Philadelphia to shifting to cloud. And certainly with some of the scale of the data sets we generate now, we have a data set where we are, for example, taking like 20 donors and looking at their T cells or a subset of their T cells.
We'll edit the top 5,000 differentially expressed genes both up and down. We'll do a time series, we'll do imaging, single cell proteomics, whole bunch of things over time. That's a huge data set that doesn't ever fit on prem. Even if we convert, or things like that. We’re moving, like most people, fully to cloud.
And so those sorts of things really accelerate your need to move to cloud. And then it's always a tension between, do I do a live data experiment to answer a specific question, versus building a general data set to mine in the future? And how do I work out the ROI between those?
It's an enduring data asset, because you always have all these tensions and tradeoffs within the company. Like you want to spend $X million to build that or you want to spend, like 2x that to build this big data set, which we think we can use for this, this and this. Okay, it allows you to do those things and maybe it allows us to do things at speed and pace so we can build some models that allow us to not do something or prioritize things differently.
And so you have to think of the long-term system effect of these models once they're in place. Other than just can it predict X today? Because I need that X for a very narrow tactical point of view.
Pauline: One of the questions that I think investors have is this like, this generally intelligent model or task-specific model.
How have you seen that shift change over the last few years?
Kim: Certainly we build a lot of relatively narrow-purpose models. But, like others, we find you get a lovely sort of regularization effect if you have the same model that can solve a lot of different tasks.
And we've been building a lot of large language models, and stacked encoder architectures and variants thereof are all the rage right now. We've been building a lot of those on raw DNA sequence for a long time at scale. And we predict all sorts of different things like open / closed chromatin, transcription factor binding sites, things in different cell types, for example.
We don't build just one model that does open / closed chromatin. We have one model that does a whole bunch of tasks simultaneously and that produces a more robust model. And so that's one thing that we're pushing. And also those models do become foundational models in a certain sense, in that you can use them to borrow strength from other types of things and then retrain or refine them or fix various layers, and they become reusable components.
So we end up having a lot of reusable components that you've trained on a whole bunch of different cell types, and now when I bring it to a new cell type, or the same assay and cell type where it's seen previous data, I can work with a lot less data, so they become more sample efficient. That's something that we think a lot about as well.
Pauline: That's really interesting on the more traditional ML processes, and you sort of alluded to it, but how do you think about the generative AI opportunity within GSK and maybe pharma companies more broadly?
Kim: Yeah, so generative AI, it’s funny, I've been emailing people around that classic 1980s Fortune magazine cover about rational drug design, because I've seen some pretty outrageous claims of drug discovery going completely digital and, like, testing on virtual patients.
And I'm like, well, that's fine. We can do simulations of patients. And we do that right now, but generative AI doesn't do that. It might allow you to mine literature and come up with QSP model parameters faster than you reading papers. But it's been quite oversold. I think the interesting things are, certainly for us, I would say we frame it as being able to recall and reason.
Often we have lots of different data sets. There's lots of stuff in the literature. We've built large knowledge graphs and things like others have. But one of the nice things that you can do is using all these various kind of retrieval augmentation or things like that. I can like give it a PDF or a series of PDFs and ask questions.
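As a rough picture of what "retrieval augmentation" over a set of documents involves, here is a minimal sketch: split documents into chunks, score each chunk against the question, and put the best matches into the prompt that goes to a language model. Everything in it is an assumption for illustration. The chunk texts are invented, the scoring is plain word-count cosine similarity rather than a learned embedding, and the final call to a language model is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True
    )
    return ranked[:k]


def build_prompt(question, chunks, k=2):
    """Assemble the text that would be sent to a language model."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Invented chunks standing in for text extracted from PDFs.
chunks = [
    "Compound A showed poor aqueous solubility at pH 7.",
    "The expenses policy was updated in March.",
    "Liver fat was estimated from MRI scans in the cohort.",
]
question = "What is the solubility of compound A?"
top = retrieve(question, chunks, k=1)
print(build_prompt(question, chunks, k=1))
```

Restricting the model to retrieved context is what lets a general-purpose model answer questions about documents it was never trained on, and it makes the answer traceable to a source passage.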
And that becomes a very useful thing. So I can start to ask, what's in this PDF? It saves me time, allows me to summarize, that kind of thing. You also can think about language models as a way to effectively solve the giant ETL and ontology problem. So a lot of what we're thinking about is building a series of narrow, and I call them narrow, language models.
Because they're not GSK-wide, they're for sub-departments or subgroup areas, and they don't need to be especially huge. They're not like 175 billion parameters. They're smaller than that, where you can almost use them as a super ETL and database where you pull all that information together.
And then you can actually have another model you can query about various things and say, what do you know about X? What do you know about Y? So that can join things, because we can discover things we didn't know about. So starting to build those sorts of narrow, purpose-specific things on our data is key.
And there are various techniques, like taking the Apache-licensed foundation models and tuning them, and we could go into technical details about how to tune those models in an efficient fashion or train them from scratch. And then there's sort of ways to think about once you have all that data together and those different models, well how do you pull them together, how do you act upon that?
And that's where I think, we're also thinking a lot about agents as well. Very recent things, because so many times I have, what do we know about X? I can email someone or I can just talk to a little agent that can say, okay, I can talk to this LLM, I can talk to this database, I can search the web and I can pull together a reasonable answer.
I can work out what's going on. And I think giving that to everybody in research is going to be transformative. And then allowing people to act on the data, to make a better decision, is key. And then there's lots of tasks that are really about using these models to assemble documents or reason over documents.
Like these complex reports we have to write. We've taken the raw documents and assemble them together to make a first pass. There's lots of roles for generative things and obviously marketing and things like that are not super unique to pharma. But there's a lot of roles and I think in summarizing and pulling literature together, and we've been sort of surprised at how well some of these things work for these use cases. And I think you're going to see lots of little 10% gains across the entire value chain. And there'd be some big things like having all the data together and agents and things like that.
I think we'll have the biggest effect in early research, and for some of the other things that are more process driven, they'll speed up those processes a little bit. Maybe in time we’ll never send around a document saying here's our expenses policy. We'll say, here, the expense bot’s been updated and you can query the expense bot about what's going on.
Those sorts of things I think also will happen. Because GSK is a big company. There's 150,000 people at the company, I think vaguely. So those sorts of things actually do have an impact at scale as well. And so we're in the process of actually rolling out one of these apps to reason and summarize over data. June 1st is the launch date for that, which is burnt into my brain.
And one of the things that's great about doing things at GSK is that we get a lot of reinforcement learning feedback. It answers this question, didn't answer this question right. I think these technologies will definitely have an impact.
The ones I think I'm more excited about are multimodal ones. So you can think of computational pathology coming back to that, or imaging for example. Those sorts of ones I think could be really interesting, and that's where I think we are still very early on in working out what's going to happen, but I think that's probably the one that could have some really quite startling approaches.
Pauline: You just walked through so many various use cases across both GSK specific and non-GSK specific tasks. How do you as an organization figure out who should be working on what, the ROI of things, how does that process work?
Kim: Yeah. So the way we work, my group doesn't do all the machine learning at GSK, but I care about the standards and the practices. So we have good machine learning practices and standards. We have a code of ethics. We have also data practices and standards to follow.
So our group particularly works on big problems. So if you had a big problem that's a 1000x impact and things like that, that we don’t know how to solve, then we go and build that. We might generate data. If it's something someone can just solve off the shelf, fine-tune a model, build something for X, and a data scientist or someone in one of our tech organizations can do it, then they go off and do it. They don't need to involve us.
So we do a lot of the research, we build very specific things, and then we use other organizations to take and scale that, for example. So that's one way we kind of scale things and we kind of sit and coordinate as a group about where best to spend this, like where are the most impactful areas?
And because like everything, you can't do everything and you can't do everything all at once. And there's also an interesting cultural change where you want to automate the boring, everybody's scared about oh, this is taking jobs. There's plenty of great tasks for human things to do.
What we don't need is people taking a whole bunch of notes in a meeting and writing meeting minutes, for example. Like, we can summarize that. Maybe Microsoft will have that in Teams and Copilot, that kind of stuff that can be automated away and that frees people up to do other things.
So there's a lot of potential there. There are things that are more in the scientific research type stuff that sort of fits with that. And then there's more enterprise-wide things, but we partner with our technology organization to scale those bits out.
Pauline: Is there maybe mistakes - one mistake - that you guys have made as an organization that you would say you've learned a lot from and that you would never make again from this ROI or task generation process?
Kim: The tricky thing is the long and the short term impact of things and the credit assignment problem.
Because at the end of the day, we make medicines and we sell medicines, and that's how we make money and that's how we're evaluated. Like we can generate all the cool technology we want, but everyone says, okay, but did it lead to a drug? What drug did it lead to? So you invest in this sort of stuff, where's the drug?
And the thing is like, well, look, there are certain things we do that just take time. Like there are safety studies, there are stability studies, there are all these things we have to do that take time and AI doesn't speed that up. There are regulatory requirements and things like that.
And then there's everybody trying to say, well, okay, we found this with this method, but could we have found it another way? Did you guys really do that? And we've got examples where we found things that we wouldn't have found before. We've had examples where we looked at the same dataset again and found something that was missed. And we have to sort of come back and remember, look, machine learning is a tool we build that allows all the other disciplines that come into drug discovery and development to work maybe more efficiently and things like that.
So that's a key understanding: that credit assignment problem is unique, and you don't just say, AI did all of this and no one else did anything. Because that's one way to antagonize the rest of the organization, and it simply isn't true. What you're trying to say is, how did this work pre and post?
And that comes down to something that, I think that most companies are poor at which is actually capturing the value of decisions. So everybody right now is about, I'll solve your MLOps deployment problem, Kim, or something like that. And I'm like, look, I don't have that problem. I can deploy models.
I've got a whole framework. We’re fine. The problem I do have is, for some things you routinely measure something. So in small molecules, we make small molecules, we register them, we measure a whole bunch of properties. That happens automatically.
The data comes back in a database. You can build a model, you can look at the model lifecycle and how it's going out of spec or when to retrain it. But then there are all these other models that produce data for a committee that makes a decision, like, okay, we're weighing up data on the directionality.
Does this mutation increase that? How many people have it? What do we think the effect on the disease is? All those different types of things. How do you capture the feedback on those things for the model, when someone decides to do something based on the strength of the model, or not do something based on the strength of the model?
If they do something, then maybe you can capture that because it becomes initiated or a program or something else. If they don't do something, how do you capture that? And also, how do you know if your prediction was right? But all of these things, it might take two years before someone's done the experiment to say, actually, yes, that variant does up-regulate the program expression in these different cell types.
And so you've got it right. So how do we track that? Because that's in three different departments, it's things like that. So actually tracking those decisions that were made, the impact and the data, the feedback, that's actually the complex problem. Particularly in these sorts of domains where this traditionally hasn't happened.
Whereas if you go to more of a tech company or things like that, where I've got a model that's predicting ad churn or things like that, I can compare my real churn, I can update it. It's easy for me to build a model life cycle. I've got systems that capture the decisions and the results, and it's all tightly coupled.
This is very weakly coupled and a lot of this sort of stuff isn't there. So there's a lot of people-practice change and legacy systems to factor into that one. So that's a real problem. That's a challenge.
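As a rough illustration of the decision-capture problem Kim describes, the sketch below logs each model-informed decision, including decisions not to act, and leaves the outcome empty until an experiment reports back, possibly years later. The `DecisionRecord` fields and the `summarize` helper are hypothetical names chosen for this example; nothing here reflects how GSK actually tracks decisions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DecisionRecord:
    """One model-informed decision (hypothetical schema for illustration)."""
    model_id: str
    prediction: bool                 # what the model claimed
    acted_on: bool                   # False records a decision NOT to act
    decided_on: date
    outcome: Optional[bool] = None   # unknown until the experiment is done
    resolved_on: Optional[date] = None


def resolve(record: DecisionRecord, outcome: bool, when: date) -> None:
    """Attach the experimental result once it finally arrives."""
    record.outcome = outcome
    record.resolved_on = when


def summarize(records: list) -> dict:
    """Count pending and declined decisions; score only the resolved ones."""
    resolved = [r for r in records if r.outcome is not None]
    correct = sum(1 for r in resolved if r.prediction == r.outcome)
    return {
        "total": len(records),
        "declined": sum(1 for r in records if not r.acted_on),
        "pending": len(records) - len(resolved),
        "accuracy": correct / len(resolved) if resolved else None,
    }


log = [
    DecisionRecord("variant-model-v1", True, True, date(2021, 3, 1)),
    DecisionRecord("variant-model-v1", True, False, date(2021, 4, 1)),
]
resolve(log[0], outcome=True, when=date(2023, 3, 1))
summary = summarize(log)
```

The design point is that a declined decision is stored as a first-class record, so it does not vanish simply because no program was started.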
Pauline: I'm not sure how solvable that is, but I can certainly see…
Kim: Well, it's probably - each one is company specific, but I suspect by looking at, there are probably some general things, but yeah, decision capture is the problem there.
Pauline: You mentioned that you’ll have vendors and maybe you say, oh, I don't actually need this, or I can do that. As you think about the tools that you're using, how has that changed over the last few years?
Kim: So we build a lot on open source technologies and tools. We publish our approaches. We publish code and things like that. Like most people, it's the data that, like everything, is the key thing and the model we built on that data. The model we built on public data, we're happy to release that. As long as it didn’t take millions of dollars of compute, we're happy for that to go out there.
A model that's built on the joint of public and private data is ours, and a model on private data is obviously ours. The compute stack I think — we're starting to see, we work with Cerebras. You're starting to see these sort of custom platforms becoming more useful in particular areas, and I think there's going to be a really interesting growth in that kind of front.
So there's an interesting sort of split in the compute hardware now. It's not just GPUs and that kind of thing. We're seeing different things for that. Everything has definitely moved to PyTorch in our world, like most people. Before we had some things built on TensorFlow and others and things like that.
Where possible, we try and use sort of not — as a research group, we need tools that we can interrogate and own and things like that. We do pick up a lot of the DevOps type stuff, grabbing code and unit testing and pushing it all together, those sorts of things.
And on the deployment front, it used to be even like we would run our machine learning on prem, but now, because these data sets are so large and they require so much compute that you need a system service group to kind of manage that in terms of doing your infrastructure-as-code and all the MLOps type aspect there. So that's shifted a fair bit.
But I would say the biggest driver's been the world kind of standardizing on PyTorch largely. And we still don't have really good handles on how to do dataset versioning. People are thinking more about like, the version of my dataset isn't just the hash of the data, it's the hash of the data plus the git hash of the ETL stack used to make the version for my thing, and now we're thinking about metadata for those sorts of things. That's coming, and that metadata about the data is actually really important if you want to do automated machine learning.
There's a lot of things that we have that, like someone wants to build a model, automated machine learning can go, we can go and build you the best model that fits all our practices and the standard and you can take and deploy it and it works. For some things, if it doesn't work, then maybe it's a super interesting problem, maybe we should get involved.
So that automated machine learning aspect is also kicking off, but most of the changes I think have been more on the dev and deployment stack, but also all the stack that's running all the cloud infrastructure. Cloud infrastructure has gotten better in the last five years as well.
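To make the dataset-versioning idea concrete, here is a minimal sketch in which a dataset's version is derived from a hash of the data together with a hash identifying the ETL code that produced it, plus free-form metadata. The function names, the JSON canonicalization, and the example commit strings are assumptions made for illustration, not a description of GSK's tooling.

```python
import hashlib
import json


def content_hash(records):
    """SHA-256 over a canonical JSON serialization of the records."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_version(records, etl_commit, metadata=None):
    """Combine the data hash, the ETL code hash, and metadata into one version.

    `etl_commit` stands in for the commit hash of the ETL code; here it is
    simply passed in as a string.
    """
    payload = {
        "data_sha256": content_hash(records),
        "etl_commit": etl_commit,
        "metadata": metadata or {},
    }
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {**payload, "version": digest[:16]}


rows = [{"compound_id": 1, "solubility": 0.5}]
v1 = dataset_version(rows, etl_commit="abc123", metadata={"assay": "demo"})
v2 = dataset_version(rows, etl_commit="def456", metadata={"assay": "demo"})
```

Here `v1` and `v2` share the same data hash but get different versions, because the code that produced them differs.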
Pauline: Right. That makes sense. And certainly I think given the scale of your datasets, I think the cloud, I assume you have similar infra issues of just how do you scale up?
Kim: Well we have both — ironically, we have big data sets we generate, but then a lot of the data sets we generate are like high d, low n. I mean clinical trials and things like that. They're very rare samples and often hopefully we see more, we get time series, but it's a very low frequency time series. It's super noisy and we have dropout and things like that. So, and you have different groups that work on the data. We also have groups that work on data in a regulated fashion.
So it's just statistical analysis, things like that. So they have very different requirements. Computational pathology, for example. So we work closely with PathAI to be our sort of deployment partner. This is public - there's a deal we have with them.
They have a whole regulated stack. It takes a lot to build that stack. I don't want to build that stack. I want to focus on doing the science discovery part and build my model so we can partner with them - take our model and like you can deploy on the stack into the clinical trial, for example.
Every piece of that data collection and operation has to be validated, so the data is valid to be used for regulatory purposes and all different types of things. So yeah, the regulated vendors are actually the space that my group particularly engages with.
Pauline: And speaking of data sets, I think one of the big mindset shifts that seems to have happened amongst the Fortune 500, and it seems like GSK is further along in this journey or has thought about it a lot more, is just how valuable that data set is. How do you think about the privacy, and obviously Microsoft is coming out to say, we're not going to train our models [on your data], or OpenAI saying we're not going to train our models [on your data].
How do you think about how data continues to play a role or what role does data play in the coming 5 or 10 years of AI & ML evolution?
Kim: Data is becoming more the moat. If you look at the early genetics, like everybody has access to UK Biobank and various types of genetic databases.
So if we all have the same datasets that we go fishing in, it's really, am I better at understanding what's hidden in plain sight? And I can act on it or I can discover something that no one else has. The way I can build a better model is either, I can say I'm the best person doing machine learning in the world, my people are the best. We're good. Are we absolute best at all things? No one is. But what I could add, if I just bring more data to the table, different data, high quality [data], I can build a better model that allows me to understand that. That becomes increasingly valuable.
So data's a key asset. It used to be you would see a lot of these things where companies would come to you and say, hey, we're a bunch of really smart people. Please give us all your data and we'll build a model on your data, and then we'll sell it back to you. And I'm investing in you twice: I'm giving you data, and you want to sell it back to me? And you want to charge me for doing this? And they're like, oh no, wait. We won't keep the raw data, that's fine. You can have that back. I'm like, I don't care about the raw data. It's a derived asset.
Because you've basically learned the entire copy of my data. You’ve got the whole three different companies. You've got a unique data asset that no one has, and then you're competitive. So we think now a lot more about the derived rights for data, things like that, and the release of data. And we say, okay, how unique is this? Cause a lot of things we have you cannot buy in the open market. It's a unique asset. It costs millions and millions of dollars to generate this data in humans. And it needs to be used appropriately.
We do share data with academics who behave like academics and do academic research and publish it and do the things they make their tools and models available to the rest of the academic community. But we don't share data to an academic, like, I'm going to take this and create my spinoff company. Like well, no, because that's actually us losing shareholder value, like that's value we should be capturing that. Or we should be getting some form of equity for the data being provided to you.
So there's an interesting question here about like, how unique is this data, what does it enable, what does it enable combined with other things? So we think very carefully about the value of data now and what we generate, what we publish, and what we retain.
And I think that trend is starting to happen, but it really depends on how sophisticated you are with your machine learning. And there's a trend now in clinical trials. They're getting more expensive because we collect more data to build more complex models about who will respond.
So we do all the single cell type stuff, all these different proteomics, things like that. Whereas previously you would, like we gave them the drug, we might measure a blood test and we'll measure some endpoint, with some clinical schedule of assessment, and they were fairly simple.
And so that data was fine to disclose and you might disclose that and like allow people to have it at the end of it. So there's going to be an interesting tension with clinical trial transparency and the value of data as an enduring asset. Because you can imagine every trial you do in an area, although maybe the drug doesn't work, which is unfortunate for patients and everybody, that information is still useful for the next go in that area.
So it kind of forms like an internal biobank. So there's going to be this tension between releasing things, allowing things to the wider community, versus maintaining the value and the unique secrets that only we know so we can move forward.
Pauline: I think to that point, a lot of Fortune 500 companies are at various points of their machine learning journeys.
And when we think about the value, whether AI is a sustaining technology or not, certainly data is part of it. Team is another part of it. You guys are in a privileged space where you have both. How are you seeing sort of maybe the talent pool of people who can deploy their own models? And I think one of the innovations that's happened is at least OpenAI's models allow this very abstract API for a language model.
How does the team and being able to recruit the right people to the organization to actually build your own models or train your own models or finetune your models play into the equation?
Kim: So when we started, I started building the group in 2019. GSK wasn't the number one destination for people to come and do machine learning.
But if you cared about doing machine learning and making a big impact and you care about the mission and the data and the various types of things we're doing, and over time we started to prove that out to the community with the various challenges we run and data sets, things like that.
You get people that are drawn to doing exactly that, so the mission matters, but you can't [say], I've got a great mission, but I don't have any data and you can't make impact. Well, that's just basically me wasting my time, and I've got better options if I really want to make my impact. I’d go somewhere else.
So you have to bring those three ingredients together. But you can't overlook the cultural problem. There are people on the other side and almost when you build your team, you have to a continual player.
Some people are translators that speak both forms of languages. And you have to be little way, way machine learning type stuff. Like, what's a gene? You have to understand that it takes time for this other organization to go, well, what are these guys about?
Are they just comp bio people, but different? So they build more like, and how they work with everybody else. Everyone has to sort of feel the others out and say like, okay, they're here to help. They did that. That was actually useful. And that cultural shift as well. 'Cause when you come in and say, okay, now I want this - generate me this data, like that doesn't work.
You have to explain what for and why and bring them along with the journey. And that cultural change takes a big time. But once you've gone to that cultural change, you've actually brought. And it doesn't happen very often in the industry that you bring a whole new department of discipline in.
Like you think about, it's a very weird thing. It's not like, we're a steel mill and like, we have all these different things. We have metallurgists and something. We have a whole new group that's come in that brings a different technique. It looks at things in a different way.
There's a big kind of cultural transformation that has to go through. And so I think that actually is as important as well because that organization can fight this new organization. They can engage with them. It really depends how it's messaged. And some organizations, some parts of the organization will be far more receptive than others.
And so it really depends on how people view it. They're like, okay, we weren't getting a lot of love on how to do this sort of stuff as a new group. And hey, like, let's go and chat to them. Maybe they'll help out and they're like, oh, these guys want to work with us, so we'll work with them.
But that may not be the highest value thing. And really it's the other organization where there's a bit of friction that you have to focus on. So that's a really key thing for groups to think about as they engage as well.
Pauline: To go from advice for Fortune 500 companies all the way down to advice for startup CEOs.
Certainly, the big tech companies have the capital, have the data. Which are so important. What would you say to CEOs running startups today, or founders running startups today trying to navigate big tech and competition from them?
Kim: I would argue they don't have all the data and they certainly don't have the domain expertise. So you poke anywhere. These new companies, they don't have a whole bunch of people that are actual deep experts in single cell genomics and things like that, analysis of that. They might have one or two, they don't have an entire team, and they're certainly not going to have institutional experience in using that.
So that's the thing that different companies have. I think you need to think about the where and how you engage with like a big pharma or all these sorts of things. So I think, for me, if you come to me and say, oh, I've got compute and smart people and things like that, I'm like, well, I've got all that too, what do you bring to the table?
But if instead you come along and say, actually, I've been generating data that you don't have, so orthogonal data, I'm like, okay, now I'm interested and I can build these different models. And also, by the way, I'm not trying to solve everything you have, because a lot of people come in and like, oh, we'll just design all the drugs for you.
I'm like, really? Do you think the people with the models inside the company are just like, they're down for that because it is zero sum at a certain point. And often these people come into a company with a different person they've met who say, oh, that sounds great. They must love that.
And they become the champion and eventually it hits these guys. And depending on the political will, sometimes they'll pilot you to death. And sometimes you just become entertainment for these guys. At the end of the day, that internal team can always say no. They know more about their company and everything else, and you are often learning the area and they will say no.
So if you have a real champion and coming in and you say, actually, we can take this boring bit off your plate and you guys can free things up so it's not zero sum and they'll do higher value work, that's the thing. But a lot of people should think about how to actually prove their technology.
And I think the better way to prove the technology, and almost every company goes through this phase of a platform company. I'll try and engage a whole bunch of people. It takes too much time to eventually, why don't I just do it myself. You need to own — your models get better with the quality of data you have and the rate of new data generation.
That's your learning time. So investing directly into the lab to generate data for you to own that learning loop is probably key to building a better model. And then you can say, my model is just better than what you guys can do by a huge amount, not by 5%, but like 30%.
That's suddenly a very different conversation. A 5% difference or 2% difference, well, that could be just noise. A 30% difference is robust. I can use it. Well, I can use it now and it's getting better because you guys own your own learning loop. That's really cool. So companies that are sort of doing that, when they have their own lab to feed back in their data, they also have control of all the noise factors in that.
So they're producing better data, fit for machine learning purposes. So I don't know, I mean, BigHat Bio, for example, they're the great example of a company who have a really nice, tightly coupled learning loop, and there are many others. But that's just one I've got off the top of my head.
Pauline: This has been a really interesting conversation and we'll shift into our rapid fire round. First question, what is your definition of AGI? And when do you think we will get it?
Kim: I guess an AGI is a system capable of solving any task you give it and part of doing that is it's working out what it needs to learn and what systems it needs to engage with and how to do that.
And I think that you can kind of tell by my answer. I think agents are a very — we've now actually probably have agents that we are using LLMs to control agents that look at the task, describe the task, break it down, work out what tools to use, evaluate the answer. We're starting to see these different types of things.
I think when you have agents that can work out what data they need to consume, how to retrain themselves to perform a simple task. So the agent decides one of its tools is to retrain itself. I think that's the stepping stones starting to build these different types of things. And then you can kind of go into the sort of the RL or various types of things are probably the techniques to do that.
One question we have, I mean, are humans even generally intelligent? Or are we just narrow purpose? Because there's a bunch of tasks I don't know how to do. So we are building something that perhaps exceeds ourselves in that domain.
I think like all these things, it's going to be a bit of a shifting definition. As we start to use these tools, we'll decide what we think is the true intelligence and things like that shifts a little bit, but I suspect that — I think we are going to see things that are far more capable, not, maybe not totally generally intelligent, but more broadly intelligent across things.
And certainly language models have done a lot for that. But the question is like, are we going to be writing - we basically mined everything that humans have ever written and digitized on the web to be where they are now. And that's taken a long time to do that. So that rate of growth of that data asset, though more and more people are coming online, you have to ask how many unique tokens and things like that are being added.
I suspect that's an interesting key limiting factor to think about. But then again, there might be whole new sources that things can control, or digital streams and things like that, more time series data, more interesting things like that we can mine, and different modalities we haven't used before.
I think all those things will mean that we are going to see more capable general agents over the next five to 10 years. But when we draw the line that something's AGI, I think AGI is something that is writing its own code and building its own thing.
Like at that point, who knows? And I'm not sure when, to sort of call when that would happen. Look at the Cambrian explosion in language models right now. Sometimes, like it might be sooner than we think. But with a lot of these sorts of things, there are fundamental information-theoretic limits of the universe and computational constraints that actually might be the bigger thing that stops us for a long time. So it's hard for me to call that particular one.
Pauline: No, very nuanced answer indeed. And I agree that I think the definitions are shifting, which is why I like to ask this question now, and then you can go back and see how your definition changes. For sure. Second question, what is your AI regulatory mental framework? And in particular, I was very curious to hear your answer given pharma is a highly regulated industry.
Kim: So we think a lot about this. We have a VP of AI ethics and policy, and we've had that in place for the last three years. And not only that, we actually have a research program at Stanford where we bring people from all over the planet.
They come into that and they can be engineers, we'll teach them bioethics and they do research into how these things work. So we do evidence-based policy. We think a lot about the risk and the downstream consequences of models. So models that touch humans are much higher risk and consequence, and things like that.
So we want to, if we're making treatment decisions about who gets care or not, or response, things like that are very important. So we need to be really well regulated. So that's where we care a lot more.
A model that's used to make a decision about, where should the team go for lunch? I don’t care so much, low risk consequence. But there's a continuum of when to apply things. That's a very big expensive decision, but it's not touching a patient. It's still very important to the company. So we trade that up. We are not so much of the view that everything needs to be explainable.
In fact, I would argue that most models like linear, oh, linear is explainable. I'm like, really? If I give you a five parameter model and allow like both positive and negative coefficients, you can really explain to me how it works? Because remember, there's an entire ensemble of models with very different weighting factors that can produce the same output if you're using a confusion matrix or some kind of derived metrics that all look the same.
They're not really explainable in that sort of sense. What we really try and use that as an idea of like, as a crutch for, I don't really fully trust the engineering process and how robust and reliable it is. But when we - to go back to pharma, when we're giving drugs to people, we look at a population and we look what's the rate of different things?
It gets better. How's it work? That gives us a sort of a population view of whether this will be safe and work. We don't know exactly what's happening at a molecular detail. There are many examples of medications that later on we found out they had other side effects or an off target effect has been the thing we’re working on.
They're not explainable. In a sense, we don't have a perfect system. Our AI systems, we need to [make sure] they’re robust and reliable. We have really trusted engineering processes that validated them, but they don't have to be human interpretable. So we can build computational pathology models that are really robust and reliable for picking gene expression or a particular molecular phenotype, but it's not a task a human can do.
So it's actually about the engineering validity of those types of things. Now, yes, ideally you like the simplest possible model and if it could be robust and reliable and you could unpick its decisions apart and that's key. But a lot of the times when people want explainability, they also want to know like, what did the model learn?
Like what hypothesis did it have so they can build a simple model or what they learned from that as well. These things get conflated together, so the answer is like we care deeply about that. We care deeply about — we have an AI code of ethics we've had since 2019. We care deeply about the data sets we use. So it's not okay to say, okay, well this is the dataset I have, it’s got some biases.
Did you go and try and buy more data to fill those biases in? Sometimes biases in medicine affect the prevalence of the disease. Sometimes there are known things, it's not always going to work better in all cases, but is it fair? Does it work well enough? Is it safe? is a key thing.
So it's a complex issue. It really depends on the technology and the application. But if you wanted one thing, it's like we are not transparent, explainable AI people. We have robust, reliable methods, because there's lots of things you interact with that are not transparent and explainable, but they have a low risk.
I'm here looking at an LED screen. Most people aren't going to tell you about P and N gaps. But like, I don't need to know how it works for it to work, and it’s low risk to me, for example. So we have to think of the technology on that kind of front.
Pauline: As a follow up, do you agree that we should have a regulatory body similar to the FDA to regulate these models?
Kim: I think each department that regulates the use of things should start to regulate them. So I don't think you need a separate body. For example, I would say we build models. Why shouldn’t I lodge with the FDA a separate holdout set at the start of a project? It's got an API. The FDA could continuously ping my API once it's up and running and have their own independent stuff.
They could collect their own holdout set and do an independent evaluation. If you think about it at the highest level, what we do is we do a whole bunch of experiments. We do a whole bunch of statistical analysis, and we give the FDA, well, here’s my experiment. Here's all the raw stuff. Here's data, here's my homework.
And they can go along and take the same thing and check my homework and then debate about the interpretation of that. That's what they're doing. You can do the same thing with machine learning algorithms. I think we need to have a debate about where to use these things in society. We've rolled out lots of algorithms in society that have complex systems interactions.
So you need to think about the interaction of the algorithm with the rest of society and the systems and what it does. Because it's those interactions that can cause harm as well. I think there needs to be some oversight and some general conversation about what things we do want to build models for, and what things we don't want to build models for.
But I do think there's a lot of expertise in each of those government regulatory departments that needs to be brought to bear. So I think the FDA should be the one regulating ones with human health. I don't want to create a new department to do that. Similarly with the NHTSA, the transport authority should be doing the ones for self-driving cars, for example.
I think there's a broader policy conversation about that. There may be some broader, cross-cutting topics for that, for which you need like a different body, but I think in general I would try and use the experience and domain expertise in those departments.
Pauline: That certainly is reasonable. What is the biggest challenge facing AI practitioners or researchers today?
Kim: I think it's certainly data and compute. So it’s access to data, depending on what you can work on, and the data sets you have to play with, and compute to build better models. There's a lot of conversation about certain people being locked out. It may not be a bad thing to have some constraints to play in rather than just going to bigger and better models.
I think one of the big things we have is a bit of sort of test set exhaustion. So over time everybody is playing with the same test sets and things like that. And, if you think of Kaggle competitions where you have the split three ways, you bang it enough, you can exhaust, you can like leap the leaderboard.
We have a lot of things right now where we're getting 1-2% differences of state of the art and a lot more complexity of the model. And I'm like, I don't really care. Like with a 1% difference, whatever. If you have 30% difference, I'm like, wow. And you have a different model, different architecture. That's interesting.
I think, so, having lots of different test sets that we agree on and new test sets, new data coming into those, that's an interesting thing. So like ImageNet's pretty worn out for a lot of things. It's good to have as a standard benchmark, but we need new things coming in. So that's a big challenge.
I think industry should do a much better job of seeding things. Like, hey, this is an interesting thing we want to work on. Here's some really good quality data. Here's how we assess it. Have at it. Like more of those types of things I think is key. So that's probably, that's a subset of data.
It gives you a really good thing to work on that's valuable. Because often we work on what we think is valuable, what people tell us is valuable, which is the data we have. And they come like, they're like, oh, we solved the thing. And they're like, that's not that interesting.
Like, why? Oh, because it's not actually solving the real problem we have. Like, why don’t you just give me the data and I'll solve the real problem for you. So I think that that's some of the challenge we have right now.
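The test set exhaustion Kim mentions can be illustrated with a small simulation: if enough candidate models are scored against the same fixed test set, the best score drifts upward even when no model has any real skill. The sketch below uses coin-flip "models" and made-up sizes purely for illustration; the function name and parameters are assumptions, not something from the conversation.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def best_of_n_on_reused_test_set(n_submissions, n_test=200, n_fresh=200, seed=0):
    """Pick the best of n skill-free models on a reused test set.

    Returns (score on the reused test set, score of that same model on
    fresh data it was never selected against).
    """
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_fresh)]
    best_score, best_fresh = -1.0, None
    for _ in range(n_submissions):
        # Each "model" is pure coin flipping: it has no real signal.
        test_preds = [rng.randint(0, 1) for _ in range(n_test)]
        fresh_preds = [rng.randint(0, 1) for _ in range(n_fresh)]
        score = accuracy(test_preds, test_labels)
        if score > best_score:
            best_score = score
            best_fresh = accuracy(fresh_preds, fresh_labels)
    return best_score, best_fresh


single = best_of_n_on_reused_test_set(1)
many = best_of_n_on_reused_test_set(500)
```

Comparing `single` with `many` shows how selection alone inflates the reused-test-set score, which is one reason small leaderboard gains can be indistinguishable from noise.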
Pauline: Certainly, one that we hear of a lot is those two: data and compute. Who are one or two of the biggest influencers on your mental framework on AI?
Kim: Well I guess I really came from an information theory kind of perspective. So I was fortunate enough to read MacKay's Information Theory, Inference, and Learning Algorithms book early on, which is a free PDF floating around. And that really framed the way I think about a lot of things.
And a long time ago I realized that the algorithms come and go, but model evaluation is forever. And actually really that's also the most important thing, to make sure you're not fooling yourself. And there's this guy called Timothy Masters who wrote this great book called, I think it was, Prediction Assessment of Patent Classification Algorithms.
He was another guy doing trading with neural networks and stuff on Wall Street. I think it's a really great book that influenced like how to really robustly assess models and when to use various types of things. And I teach this to everyone in my group, like that's a really cool key skill to have.
How do you compare and understand techniques, and that's such a key skill to understand whether you should pay attention to something or not. Is it really transformative or not? There's evaluation metrics. Often that's taught as an afterthought. We spent so much effort on this algorithm and how random forest looks like that, but how do you evaluate this sort of stuff?
Ah, it's a confusion matrix. Maybe there's a Matthews correlation coefficient or something else. F1 score if you're getting super fancy, it doesn't matter. Like that's it. Like that actually is the key, it's as much of an interesting thing, understanding those different types of things.
Because understanding those tells you about failure modes. It tells you other types of things, helps you watch or use. So those are some kind of key things. I like - he's a statistician. David Hand. He was the former president of the Royal Statistical Society and wrote this great paper, it's old now, called Classifier Technology and the Illusion of Progress.
He took a whole bunch of problems from the UCI machine learning library there, and showed that between simple and more advanced, and I think advanced, there might have been SVM or something, if you do simple methods, you can get yourself 90% of the way there.
And do you need that extra 10% gain for the tradeoff in complexity? It's probably not worth it. I think that's really, really true today, creating simple baselines and saying, what's the information gain over the baseline? And so those three things would be key things that influence the way I think about machine learning.
And then there's lots of other people. I mean, I like Neil Lawrence's work, he's on our EAB, like Gaussian processes. And there are many other things like that. But I think the information theory way of looking at machine learning is probably my particular lens.
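As a small worked example of the evaluation metrics Kim lists, the sketch below computes a confusion matrix, F1 score and Matthews correlation coefficient for a trivial majority-class baseline and for a hypothetical model on an imbalanced toy dataset. The data are invented for illustration: both classifiers reach the same accuracy, and only the other metrics show that one of them has learned anything over the baseline.

```python
import math


def confusion_matrix(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def matthews_corrcoef(tp, fp, fn, tn):
    """MCC; defined as 0.0 here when any marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data: 10 positives followed by 90 negatives (invented for this example).
y_true = [1] * 10 + [0] * 90
baseline_pred = [0] * 100                             # always predict majority
model_pred = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85   # finds half the positives

baseline_cm = confusion_matrix(y_true, baseline_pred)
model_cm = confusion_matrix(y_true, model_pred)
```

Calling `accuracy`, `f1_score` and `matthews_corrcoef` on `baseline_cm` and `model_cm` shows why a simple baseline belongs in every comparison: accuracy alone cannot separate the two.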
Pauline: I love that. I'll have to pull that out for the show notes. And the last rapid fire question, though it hasn't been too rapid. The classic Peter Thiel question, what is one thing that you believe strongly about AI today that you think most people would disagree with you on?
Kim: I think a lot of people feel that it's going to hollow out a lot of white-collar jobs and actually, haha, the people that manipulate physical things will be safe. And I don't think that's true. If you look at the creative destruction of labor in industrial revolutions, it creates just as many new roles and different types of things and specialized functionality that we didn't have before.
Lots of things happened between banks and capital. Venture capital, your industry, started during the Industrial Revolution. Different societies and their regulatory structures changed, and that's where you got capital and shares; all that sort of stuff happened.
None of those specialized professions existed. We certainly lost a little labor, but we got new things. So I don't think that we’re going to have such a massive destruction of jobs and labor. I think we'll create new things alongside the traditional sorts of things. Depending on where you are in your career arc, you'll feel a greater or lesser effect.
It's not to say it doesn't have an effect on society, but I don't think it's going to be this giant negative thing where everybody is running everything by AI. I think it will lead to far more interesting specializations and creativity, where people will augment and use AI. So that's probably my Thiel-ism for you.
Pauline: It's a very optimistic one, so I'm sure people will appreciate that. And with that, thank you so much Kim, for jumping on for what was a super interesting conversation about a number of different topics and so really appreciate you taking the time.
Kim: Thanks very much Pauline. Great to chat to you.
Views expressed in this podcast are the speaker’s own and do not necessarily represent the position of GSK.