AI and data fuel innovation in clinical trials and beyond


The last five years have seen large innovations throughout drug development and clinical trial life cycles—from finding a target and designing the trial, to getting a drug approved and launching the drug itself. The recent use of mRNA vaccines to combat covid-19 is just one of many advances in biotech and drug development.

Whether in preclinical stages or in the commercialization of a drug, AI-enabled drug development is now used by an estimated 400 companies and has reached a $50 billion market, placing AI more firmly in the life sciences mainstream.

“Now, if you look at the parallel movements that are happening in technology, everyone’s in consensus that the utility of what AI can do in drug development is becoming more evident,” says senior vice president at Medidata AI, Arnaub Chatterjee.

The pandemic has shown how critical and fraught the race can be to provide new treatments to patients, positioning the pharmaceutical industry at an inflection point, says Chatterjee.

And that’s because drug development usually takes years. Evidence generation is the industry-standard process of collecting and analyzing data to demonstrate a drug’s safety and efficacy to stakeholders, including regulators, providers, and patients.

The challenge, says Chatterjee, becomes, “How do we keep the rigor of the clinical trial and tell the entire story, and then how do we bring in the real-world data to kind of complete that picture?”

To build more effective treatments faster, drug and vaccine companies are using data iteratively to improve understanding of diseases that can be used for future drug design. Bridging gaps between clinical trial and real-world data creates longitudinal records. AI models and analytics can then be used to enable feedback loops that are key for ensuring safety, efficacy, and value, says Chatterjee.

“We want to create safe and expeditious access to therapy,” says Chatterjee. “So we really have to meet this moment with innovation. With all the new advances happening in drug development, there’s no reason why technology and data can’t be there.”

This episode of Business Lab is produced in association with Medidata.


Laurel Ruma: From MIT Technology Review, I’m Laurel Ruma, and this is Business Lab, the show that helps business leaders make sense of new technologies coming out of the lab and into the marketplace. Our topic today is innovation and life sciences. Artificial intelligence and data can help fuel new ways of working on clinical trials and beyond, and the benefits are clear in the heavily regulated industry. Faster time to market, reduced risks, lowered costs, and fewer burdens on patients.

Two words for you: reducing uncertainty.

My guest is Arnaub Chatterjee, senior vice president at Medidata AI at Medidata. Arnaub is also a teaching associate at the Department of Health Care Policy at Harvard Medical School and a lecturer in the Department of Policy Analysis and Management at Cornell University.

This episode of Business Lab is sponsored by Medidata.

Welcome, Arnaub.

Arnaub Chatterjee: Hi, thanks for having me.

Laurel: Could you give us a picture of what innovation and life sciences looks like right now? I know that’s a big question, but with artificial intelligence, cloud computing, and more reliance on data in general, you and your customers must be building some really interesting tech.

Arnaub: Yeah, absolutely. If you look back at the past five years, it’s arguably one of the most innovative periods in drug development and the launches that have taken place in recent memory. So, we’ve seen booms in biotech and drug development, but also in parallel, there’s real advancements in technology that really is across all of the drug development life cycle, from finding a target to designing the trial, to getting a drug approved and launching the drug itself. So, if you look at some of the more promising platform technologies as a starting point and where drug development has taken place, we’ve talked about CRISPR for some time, but we’re actually moving not only into gene editing, but into areas like base editing, RNA editing. There’s a ton of real meaningful work happening here, and we can all point to mRNA and what’s happened with covid vaccines.

And I think the world is now eagerly waiting to see where else mRNA can be applied. Now, if you look at the parallel movements that are happening in technology, everyone’s in consensus that from an AI perspective, the utility of what AI can do in drug development is becoming more evident. Whether you’re looking at finding these difficult targets in the preclinical stages to finding ways to use AI to improve the commercialization of a drug, AI-enabled drug development is like a $50 billion market. There’s 400 different companies that are operating in the space. And we’ve gone from touting potential to actually making this a little bit more mainstream where AI has gone from target identification, but it’s actually gone from lab into clinic where we’re testing the interventions. So, I think those are all really exciting movements just in development and technology and in parallel with each other. There’s even some things that are further upstream.

We talk about the value of data and data liquidity, how much that’s grown. The Broad Institute here in Boston and Cambridge has said that they doubled their genomic data every eight months. So, we’re seeing not only the growth in the data, which has been taking place for some time in omics and imaging in some of these other areas, but computational power is following. Even much of the world’s data now is capable of being linked together through advances in areas like tokenization. So overall, I think we are seeing growth and innovation in all areas. When I talk to all the big pharmaceutical companies or the biotechs, there’s still a lot of very similar problems, which is, “How do I make more informed decisions about my trial?” And then, “How can data analytics show a meaningful change?” in what you said earlier, which is reducing uncertainty, improving the probability of success in a drug as we kind of move the needle in the right direction.

Laurel: So mentioning the pandemic, it really has shown us how critical and fraught the race is to provide new treatments and vaccines to patients. Could you explain what evidence generation is and then how it fits into drug development?

Arnaub: Sure. So as a concept, generating evidence in drug development is nothing new. It’s the art of putting together data and analyses that successfully demonstrate the safety and the efficacy and the value of your product to a bunch of different stakeholders, regulators, payers, providers, and ultimately, and most importantly, patients. And to date, I’d say evidence generation consists of not only the trial readout itself, but there are now different types of studies that pharmaceutical or medical device companies conduct, and these could be studies like literature reviews or observational data studies or analyses that demonstrate the burden of illness or even treatment patterns. And if you look at how most companies are designed, clinical development teams focus on designing a protocol, executing the trial, and they’re responsible for a successful readout in the trial. And most of that work happens within clinical dev. But as a drug gets closer to launch, health economics, outcomes research, epidemiology teams are the ones that are helping paint what is the value and how do we understand the disease more effectively?

So I think we’re at a pretty interesting inflection point in the industry right now. Generating evidence is a multi-year activity, both during the trial and in many cases long after the trial. And we saw this as especially true for vaccine trials, but also for oncology or other therapeutic areas. In covid, the vaccine companies put together their evidence packages in record time, and it was an incredible effort. And now I think what’s happening is the FDA’s navigating a tricky balance where they want to promote the innovation that we were talking about, the advancements of new therapies to patients. They’ve built in vehicles to expedite therapies such as accelerated approvals, but we need confirmatory trials or long-term follow up to really understand the evidence and to understand the safety and the efficacy of these drugs. And that’s why that concept that we’re talking about today is so important, is how do we do this more expeditiously?

Laurel: It’s certainly important when you’re talking about something that is life-saving innovations, but as you mentioned earlier, with the coming together of both the rapid pace of technology innovation as well as the data being generated and reviewed, we’re at a special inflection point here. So, how has data and evidence generation evolved in the last couple years, and then how different would this ability to create a vaccine and all the evidence packets now be possible five or 10 years ago?

Arnaub: It’s important to set the distinction here between clinical trial data and what’s called real-world data. The randomized controlled trial is, and has remained, the gold standard for evidence generation and submission. And we know within clinical trials, we have a really tightly controlled set of parameters and a focus on a subset of patients. And there’s a lot of specificity and granularity in what’s being captured. There’s a regular interval of assessment, but we also know the trial environment is not necessarily representative of how patients end up performing in the real world. And that term, “real world,” is kind of a wild west of a bunch of different things. It’s claims data or billing records from insurance companies. It’s electronic medical records that emerge out of providers and hospital systems and labs, and even increasingly new forms of data that you might see from devices or even patient-reported data. And RWD, or real-world data, is a large and diverse set of different sources that can capture patient performance as patients go in and out of different healthcare systems and environments.

Ten years ago, when I was first working in this space, the term “real-world data” didn’t even exist. It was like a swear word, and it was basically one that was created in recent years by the pharmaceutical and the regulatory sectors. So, I think what we’re seeing now, the other important piece or dimension is that the regulatory agencies, through very important pieces of legislation like the 21st Century Cures Act, have jump-started and propelled how real-world data can be used and incorporated to augment our understanding of treatments and of disease. So, there’s a lot of momentum here. Real-world data is used in 85%, 90% of FDA-approved new drug applications. So, this is a world we have to navigate.

How do we keep the rigor of the clinical trial and tell the entire story, and then how do we bring in the real-world data to kind of complete that picture? It’s a problem we’ve been focusing on for the last two years, and we’ve even built a solution around this during covid called Medidata Link that actually ties together patient-level data in the clinical trial to all the non-trial data that exists in the world for the individual patient. And as you can imagine, the reason this made a lot of sense during covid, and we actually started this with a covid vaccine manufacturer, was so that we could study long-term outcomes, so that we could tie together that trial data to what we’re seeing post-trial. And does the vaccine make sense over the long term? Is it safe? Is it efficacious? And this is, I think, something that’s going to emerge and has been a big part of our evolution over the last couple years in terms of how we collect data.

Laurel: That collecting data story is certainly part of maybe the challenges in generating this high-quality evidence. What are some other gaps in the industry that you have seen?

Arnaub: I think the elephant in the room for development in the pharmaceutical industry is that despite all the data and all of the advances in analytics, the probability of technical success, or regulatory success as it’s called for drugs, moving forward is still really low. The overall likelihood of approval from phase one consistently sits under 10% for a number of different therapeutic areas. It’s sub 5% in cardiovascular, it’s a little bit over 5% in oncology and neurology, and I think what underlies these failures is a lack of data to demonstrate efficacy. It’s where a lot of companies submit or include what the regulatory bodies call a flawed study design, an inappropriate statistical endpoint, or in many cases, trials are underpowered, meaning the sample size was too small to reject the null hypothesis. So what that means is you’re grappling with a number of key decisions if you look at just the trial itself and some of the gaps where data should be more involved and more influential in decision making.

So, when you’re designing a trial, you’re evaluating, “What are my primary and my secondary endpoints? What inclusion or exclusion criteria do I select? What’s my comparator? What’s my use of a biomarker? And then how do I understand outcomes? How do I understand the mechanism of action?” It’s a myriad of different choices and a permutation of different decisions that have to be made in parallel, all of this data and information coming from the real world; we talked about the momentum in how valuable an electronic health record could be. But the gap here, the problem is, how is the data collected? How do you verify where it came from? Can it be trusted?

So, while volume is good, the gaps actually contribute and there’s a significant chance of bias in a variety of different areas. Selection bias, meaning there’s differences in the types of patients who you select for treatment. There’s performance bias, detection, a number of issues with the data itself. So, I think what we’re trying to navigate here is how can you do this in a robust way where you’re putting these data sets together, addressing some of those key issues around drug failure that I was referencing earlier? Our personal approach has been using a curated historical clinical trial data set that sits on our platform and use that to contextualize what we’re seeing in the real world and to better understand how patients are responding to therapy. And that should, in theory, and what we’ve seen with our work, is help clinical development teams use a novel way to use data to design a trial protocol, or to improve some of the statistical analysis work that they do.

Laurel: And you touched on this some, but how are companies actually leveraging that novel type of data to build faster, more effective treatments? And then how does the totality of evidence come into this and what the example of that is in practice?

Arnaub: Yeah, that’s a great term, totality of evidence. And ultimately, it’s how do you tell a complete story about safety, efficacy, and value? I think of totality as two things here. One is are we using data iteratively through every decision we’re making around the design of a trial, which could be kind of a cyclical process where you’re using the data to improve your understanding of the disease? All the way through when a drug is close to launch or has launched, what is the evidence being generated that is feedback for future drug design or for better drug design after the approval? The other angle, I think, for totality of evidence is, can we bridge these two data sets together between the clinical and the real world? And you create a longitudinal patient for a single data set for an individual patient that enables that feedback loop.

So, we actually call this integrated evidence at Medidata, and you kind of look at it through three different angles. So, one is can you use historical trial data throughout the entire development process to make sharper decisions, avoid those protocol amendments that trip up trials for six to 12 months, and generate really high-quality regulatory-grade output? The second lever we use is can we use something like a synthetic control arm, which is basically using historical data to augment or even replace a control arm in areas like rare disease or in single-arm trials? This saves a lot of time; it saves a lot of cost. It’s kind of a lever that we can use to build and contribute to that totality of evidence. And then the third example is can we find a continuous way, like I mentioned through Medidata Link, which is how do we link the data and bring it all together?

So, there’s two specific examples that I might want to allude to here in terms of work that we’ve done. One is we work a lot in the CAR-T [chimeric antigen receptor T-cell therapy] space, and this is an area that’s just been tripped up time after time for company after company because of safety issues. We’re sitting on our side on the top of the single largest historical trial data set. And we want to use this as a resource where we can use historical trial data to predict when serious adverse events will occur. We built models alongside the Cleveland Clinic, and they were near 90% accurate at predicting cytokine release syndrome, which is one of the more serious events that contribute to patient death. And that data point is important on its own, but the totality of evidence question is, how can you use this model to screen for patients in the trial who might be more prone to an event? Or even more importantly, how can you prevent an unnecessary death or event in the future of these trials?

The last example I’ll share is within synthetic control arms as a lever to move trials faster, we’ve seen success on our side in glioblastoma and in acute lymphocytic leukemia. We built these matched cohorts of patients with historical data that almost perfectly mirrors what the pharma companies we’re working with are looking at in their trial. And this is changing an evidence strategy. This is totality of evidence because you are incorporating this into the trial. If you’re successful, it’s a huge accelerant, six to 18 months’ worth of time, and this is where we’re seeing a lot of regulatory success on our end. So these are two concrete examples, the CAR-T and the synthetic control space, where I think adding these two things to an existing workflow or to an existing way of thinking about a drug could be very beneficial.

Laurel: And this is also where that evidence generation, when done in this modern way, could be a solution to reduce those burdens on patients and lessen the lengthy approval process, as well as the risks and costs, right?

Arnaub: Yeah, exactly.

Laurel: How else will scientists, healthcare providers, and ultimately patients benefit from this kind of innovation? If you have the world’s largest data set, what does that mean? What can you find? What can you see?

Arnaub: Yeah. So, as a technology company, we have to show that using analytics or new technology is worth the investment for a pharma company or someone who is conscious about how it might change an existing process. A lot of our data might challenge preexisting assumptions that scientists have. It might even challenge a current belief that’s in the market. So, historical data is important as a benchmark for what you currently believe in, but it could also change how you address things moving forward. But if we get it right, we’re absolutely helping to reduce uncertainty or accelerate a timeline or get to a “no go” decision more quickly. Why run a costly phase two or phase three trial if the evidence doesn’t make sense in earlier phases? So, I think there’s real value in how we can think about where this applies across the innovation stakeholder spectrum.

So, the idea that if you link data together and you generate that evidence in its totality for a single patient is incredibly promising. If you build longitudinal cohorts for these as single patients, and you actually follow these patients long after the trial, you have significant periods of extended observation, I think you can use that data to answer hard questions. How do you evaluate long-term efficacy? What are long-term outcomes like, and how can we prospectively even design a clinical trial that improves drug performance in the future? There are certain areas like CAR-T where the regulatory bodies are asking for 10 to 15 years of evidence after the trial. And I think we need a systematic way of collecting that data for individual patients. So, I see a future here where there’s clearly a benefit for clinical science and for clinical development in this space.

Ultimately, if that package of evidence is good and we’re building a brand-new data asset by linking all of this data together, that is a different way for providers to understand if drugs are safe or if drugs are efficacious. And ultimately, that’s what impacts patients is if we’re pulling this data earlier in the process, there should be downstream benefit. The goal here ultimately is twofold. We spend 10 to 15 years on average in getting a drug to market or to get a drug approved. And our goal, Medidata’s goal, is the same. We want to create safe and expeditious access to therapy. So, we really have to meet this moment with innovation. With all the new advances happening in drug development, there’s no reason why technology and data can’t be there to be in augmentation of what’s happening.

Laurel: What is the regulatory perspective on this new data that is being generated and used for evidence submission? How are pharmaceutical companies adapting to that changing landscape?

Arnaub: Yeah, I think we have to understand that the regulatory perspective here is going to change as use cases become validated, as the regulators have time to evaluate what they’re seeing in the data. So the recent FDA draft guidance that came out in September of 2021 is a really critical step forward for real-world data. They went to great lengths to define accuracy and completeness and data provenance and data traceability. So what is high-quality evidence? I think this was one of the first real stabs from the regulatory bodies to say, “If you have routinely collected data outside of the trial, this is what good and this is what rigorous looks like.” And I think technology companies or drug developers, we really need to understand how we could design trials effectively based on these guardrails.

So, for the first time in the history of real-world data, real-world evidence, it’s being brought front and center. How do we historically measure the accuracy of this underlying data, and then what is quality? One of the lines being thrown around at the FDA right now is that quality real-world evidence can’t be built without quality real-world data, which is very intuitive. But I think what they’re trying to define here is what that looks like, and they’ve taken a lot of public commentary on how we can work as an industry together. So, I think we’ve seen from our own experience in areas like synthetic control arms where we’ve received regulatory blessing and permission. As a pharmaceutical company, you have to approach the regulatory bodies well in advance. And if you’re going to do things like link data or if you’re going to propose a synthetic or an external control arm, you have to get it pre-specified, you have to put it into a statistical analysis plan well in advance, and kind of get a vetting from the regulators on whether this approach makes sense or not.

And at that time, it’s a great opportunity to hear their feedback, to understand what kind of data you need to address the problem that you’re trying to solve, or the disease area you’re tackling. And I think at that juncture, you will know, and the FDA’s usually very good about this, whether that approach makes sense or not. I think we’ve seen, for example, in synthetic control arms, which have been used for a number of years now, there are disease areas where it makes sense, and then there’s disease areas where the FDA has said, “Sometimes you just need to randomize the trial and you should go through a traditional process.” And I think that’s very reasonable. So, it is a bit trial and error. At the same time, it seems regulators are very amenable to new approaches, but we just have to do it judiciously.

Laurel: As we take all of this in, developing new patient treatments is ever-evolving clearly. But in the next three to five years, what are the opportunities and challenges ahead? What are you really excited about?

Arnaub: Yeah, it’s so hard to guesstimate what’s going to happen just given how quickly things are evolving and how quickly the markets are changing, even in terms of where drug development’s going to go. So, my hope is that we start to see biotech and pharma use data more iteratively. And I think right now, there’s a little bit of point-by-point decision making, a little bit of traditional, “This is what the literature is telling us.” But there’s a world where the data will hopefully become high enough quality, whether it’s historical trial data like we use or real-world data, where you could simulate your trial. You could better understand even using synthetically generated patients, which is a new technology that’s evolving through AI. These are models that you can build internally as a pharma co or work with a technology company like ours to predictively figure out what’s going to happen. And the whole idea is to de-risk certain things before bad events happen. So advancing analytics to increase the speed and the effectiveness of how these decisions get made is really important.

I think the second thing I’m pretty excited about is what are we going to see from algorithms that are being developed right now in the world? We are getting really good at measuring algorithmic performance, whether you see that in the imaging space or whether you see that as software as a medical device, a companion. We need to move from a world where we’re just measuring performance to one where we’re really pressure-testing the algorithms in real clinical settings, meaning how does this get into a physician’s or a provider’s workflow? Can this be part of clinical decision workflow tools? Once you’ve cleared this software as a medical device, regulatory acceptance from the FDA or the EMA, where does the actual implementation in clinical care take place?

So right now, we’re probably still arguably in the early days of looking at data and AI and saying, “Are the algorithms good or not?” And there should be a real hard assessment on, “How do we create generalizable and representative algorithms that reflect the diversity of populations that are free from bias?” But in parallel, we should understand what is the rigor required for this to augment physician decision-making or provider decision-making? And I think that’s going to be a real step change in the next three to five years as we get more validation and qualification for some of these algorithms. And then I think the last thing that I’m excited about is where the regulatory environment is changing. We’ve seen a pretty high number of drug approvals, kind of in the 50-plus range over the last couple of years. We know there’s rarely any silver bullets with the FDA or EMA.

But we know there’s growing acceptance from regulatory agencies on leveraging certain types of novel methodologies, such as the synthetic or external control. We’re seeing agreement and approval. We’re seeing the use of EHRs that are part of label expansions. So there is maybe a slow, but also a pretty interested, group of people over there that do believe that if we get the evidence right, even the current FDA administrator, Dr. Califf, is a big proponent of testing out new technologies, new methodologies. So I’m hopeful that as tech companies like ourselves try to create these expeditious ways of getting drugs into the hands of patients, that it’s well received by the regulatory groups in parallel. So, that’s probably maybe two or three ideas of how I see things moving along.

Laurel: That’s pretty amazing, synthetic generated patients, a whole nother topic for another discussion.

Arnaub Chatterjee: For sure.

Laurel: Thank you very much, Arnaub, for joining us today on The Business Lab.

Arnaub: I appreciate the time, thank you.

Laurel: That was Arnaub Chatterjee, senior vice president at Medidata AI, who I spoke with from Cambridge, Massachusetts, the home of MIT and MIT Technology Review, overlooking the Charles River.

That’s it for this episode of Business Lab. I’m your host, Laurel Ruma. I’m the Director of Insights at the Custom Publishing Division of MIT Technology Review. We were founded in 1899 at the Massachusetts Institute of Technology, and you can find us in print, on the web, and at events each year around the world. For more information about us and the show, please check out our website at

This show is available wherever you get your podcasts. If you enjoyed this episode, we hope you’ll take a moment to rate and review us. Business Lab is a production of MIT Technology Review. This episode was produced by Collective Next. Thanks for listening.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.