Transforming public health with unstructured data and NLP in FDA's Sentinel Initiative

Research in Action - A podcast by Oracle Corporation

What is the MOSAIC-NLP project around structured and unstructured EHR data? Why is structured data not really enough for drug safety studies? And to what degree is NLP speeding up access to data and research results? We will learn all that and more in this episode of Research in Action with Dr. Darren Toh, Professor at Harvard Medical School and Principal Investigator at Sentinel Operations Center.

www.oracle.com/health
www.oracle.com/life
www.sentinelinitiative.org

--------------------------------------------------------
Episode Transcript:

00;00;00;00 - 00;00;26;14 What is the MOSAIC-NLP project around structured and unstructured data? Why is structured data not really enough for drug safety studies? And to what degree is NLP speeding up access to data and research results? We'll find all that out and more on this episode of Research in Action. Hello and welcome to Research in Action, brought to you by Oracle Life Sciences.
00;00;26;14 - 00;00;50;14 I'm Mike Stiles. And today our guest is Dr. Darren Toh, professor at Harvard Medical School and principal investigator at Sentinel Operations Center. He's got a lot of expertise in pharmacoepidemiology as well as comparative effectiveness research and real-world data. So, Darren, really glad to have you with us today. Thank you. My pleasure to be here. Well, tell us how you wound up where you are today.
00;00;50;14 - 00;01;26;22 What attracted you in the beginning to public health? Good question. So I trained in pharmacy originally, and I got my master's degree in Pharmaceutical Outcomes Research at the University of Illinois at Chicago. And it's where I first learned about a field called pharmacoepidemiology, which was very interesting to me because I like to solve problems with methods and data, and pharmacoepidemiology
00;01;26;22 - 00;02;00;29 seemed to be able to teach me how to do that.
So I got into the program at the Harvard School of Public Health, and when I was finishing up, I was deciding between staying in academia and going somewhere and getting a real job. And that's when I found out about an opportunity within my current organization, and I had heard great things about this organization.
00;02;00;29 - 00;02;29;26 So I thought I would give it a try. And the timing turned out to be perfect because when I joined, our group was responding to a request for proposal for what is called the Mini-Sentinel pilot, which ultimately became the Sentinel System that we have today. So I've been involved in the Sentinel System since the very beginning, or before we began.
00;02;29;28 - 00;03;02;25 And for the past 15 years I've been with the system and the program because I really like its public health mission, and I'm also very drawn to the dedication of FDA, our partners and my colleagues to make this a successful program. Well, so now here you are, a principal investigator. What exactly is the Sentinel Operations Center? What's the mission there and what part do you specifically play in it?
00;03;02;27 - 00;03;52;26 Sentinel is a pretty unique system because it is a congressionally mandated system. So Congress passed what is called the FDA Amendments Act in 2007. And within that act, Congress asked FDA to create a new program to complement FDA's existing systems to monitor medical product safety. And more specifically, Congress asked FDA to create a post-market risk identification and analysis system that would use data from multiple sources covering at least 100 million lives to look at the safety of medical products after they are approved and marketed.
00;03;52;28 - 00;04;33;07 So in response to this congressional mandate, FDA launched what is called the Sentinel Initiative in 2008, and in 2009, as I mentioned, FDA issued its request for proposal to launch the Mini-Sentinel pilot program, and the program grew into the Sentinel System that we have today. So as for my involvement, it sort of grew over time. So when I joined, as I mentioned, we were responding to this request for proposal and we were very lucky to be awarded the contract.
00;04;33;09 - 00;05;04;05 So when it was starting, I served as one of the many epidemiologists on the team and I led several studies, and I gradually took on more leadership responsibility and became the principal investigator of the Sentinel Operations Center in 2022. So I've been very fortunate to have a team of very professional and very dedicated colleagues within the operations center.
00;05;04;05 - 00;05;27;26 So on a day-to-day basis, we work with FDA to make sure that we can help them answer the questions they would like to get addressed. And we also work with our partners to make sure that they have the resources that they need to answer the questions for FDA. And most of the time I'm just the cheerleader in chief, just to cheer on my colleagues and our collaborators.
00;05;27;28 - 00;06;11;23 Now that's great. And then specifically, there's the MOSAIC-NLP project that you're involved with. What is that trying to achieve and what are the collaborations being leveraged to get that done? So the Sentinel System has always had access to medical claims data and electronic health record data, or EHR data. One of the main goals for the current Sentinel System is to incorporate even more data, both structured and unstructured, into the Sentinel System and to combine it with advanced analytic methods so that FDA can answer even more regulatory questions.
00;06;11;25 - 00;06;40;09 So the MOSAIC-NLP project was one of the projects that FDA funded to accomplish this goal.
So the main goal of this project is to demonstrate how billing claims and EHR data from multiple sources, when combined with advanced machine learning and natural language processing methods, could be used to extract useful information from unstructured clinical data to perform a more robust drug safety assessment.
00;06;40;11 - 00;07;21;18 When we tried to launch this project, we decided that we would issue our own request for proposal. So there was an open and competitive process, and Oracle, together with their collaborators, were selected to lead this project. So I want to talk in broad or general terms right now about data sharing, the standards and practices around that. It kind of feels silly for anyone to say it's not needed, that we can get a comprehensive view and analysis of diseases and how they're impacting the population without it.
00;07;21;20 - 00;07;46;15 NIH is on board. It updated the DMS policy to promote data sharing. You know, the FDA obviously is leaning into this. So is data sharing now happening and advancing research as expected, or are there still hang-ups? So I think we are making good progress. So I think the good news is data are just being accrued at an unprecedented rate.
00;07;46;17 - 00;08;28;21 So there is just so much data now for us to potentially access and analyze. There's always this concern about the proper safeguarding of individual privacy. And through our work, we also became very appreciative of other considerations, for example, the fiduciary responsibilities of the delivery systems and payers to protect patient data and make sure that they are used properly. So you mentioned the recent changes, including the NIH data management and sharing policy, which I think are moving us in the right direction.
00;08;28;26 - 00;08;56;23 But if you look closer at the NIH policy, it makes special considerations for proprietary data. So I would say that we have made some progress, but access to proprietary data remains very challenging.
And the NIH policy doesn't actually fully resolve that yet. When you think about the people who do make that argument for limited data sharing, they do mostly talk about what you just said about patient privacy,
00;08;56;23 - 00;09;25;20 IP, proprietary data. Pharma is especially sensitive to that, I would imagine. So how do we incentivize the reluctant? How can we ease their risks and concerns, or can we? Yeah, it's a tough question. I think that this requires a multi-pronged approach and I can only comment on some aspects of this. So I would say that at least based on our experience, the willingness or ability to share data often depends on the purpose.
00;09;25;23 - 00;09;55;29 That is, why do we need the data? Many data partners participate in Sentinel because of its public health mission. Another consideration is how the data would be used. Again, is there proper safeguarding of patient privacy and institutional interest? There are other ways to share data. For example, instead of asking the data to come to us, we can send the analysis to where the data is.
00;09;56;06 - 00;10;34;22 And that is actually the principle followed by federated systems like Sentinel. So we don't pool the data centrally. We send an analysis to the data partners and only get back what we need. And it's usually in a summary-level format. So that actually encourages more data sharing instead of less sharing. I would say that recent advances in some domains, such as tokenization and encryption, might also reduce some concerns about data sharing and patient privacy. In academic settings,
00;10;34;29 - 00;11;24;26 we've been talking a lot about this, for example, for individuals who collect the data, and people have proposed to offer them authorship or proper acknowledgment if they are willing to share their data. But that is not sufficient in many cases outside of academic settings.
If you look at what has been happening in the past ten years or so, there are now a lot of what people call data aggregators that are able to bring together data from multiple delivery systems or health plans, and they seem to be able to develop a pretty effective model to convince the data providers to share their data in some way.
00;11;24;29 - 00;11;55;28 And a way to do that could be to help these data providers manage their data more efficiently or to help them identify individuals who might be eligible for clinical trials more quickly. So there are some incentives that we could think of to allow people to share their data more openly. But personally, I think that scientific data should be considered a public good, and hopefully that will become a reality one day.
00;11;56;00 - 00;12;23;21 Yeah, that's really interesting because it sounds like it's a combination of centralized and decentralized tactics in terms of data sharing and gathering. Why is it so important to use unstructured data in pharmacoepidemiology studies? And does NLP really make a huge difference in overcoming the limitations and extracting that data? So in the past, and I think that's true
00;12;23;21 - 00;12;58;07 now, many pharmacoepidemiologic studies rely on data that are not collected for research purposes. So we use a lot of medical claims data that are maintained by payers. We use EHR data that are maintained by delivery systems. So these data are not created for research purposes, and much of these data, at least for claims data, is stored in structured format using established coding systems like the ICD-10
00;12;58;10 - 00;13;39;06 coding system. And structured data sometimes are not granular enough for a given drug safety study, and certain data or sets of variables that are not required for claims reimbursement or other business purposes might not be collected at all.
And people felt that, well, maybe the information that we need could be extracted from unstructured data because as part of clinical care, the physicians or nurse practitioners or other health care providers might include that information in the notes. But EHR data are also pretty messy, especially the unstructured data.
00;13;39;08 - 00;14;05;25 So instead of going through the unstructured notes manually to extract this information, techniques like natural language processing could help us do this task much more efficiently so that we can mine a larger amount of unstructured data. Well, obviously, when it comes to real-world evidence, you're a fan. Tell us what excites you about using it to complement clinical research,
00;14;05;25 - 00;14;42;07 get us more evidence-based insights and help practitioners make better decisions. Yeah, that's a great question. Yes, I'm a fan. So I personally don't quite like the dichotomy between conventional randomized controlled trials and real-world data studies because they actually sit along a continuum. But it is true that conventional randomized trials cannot address all the questions in clinical practice.
00;14;42;09 - 00;15;30;17 So that's where real-world data and real-world data studies come in, because real-world data, like we discussed, come from clinical practice. So they capture what happens in day-to-day clinical practice. So if we are thoughtful enough, we will be able to analyze the data properly and generate useful information to fill some of the knowledge gaps. The truth is we have been using real-world data throughout the lifecycle of medical product development for many years now, ranging from understanding the natural history or burden of diseases to using real-world data as controls for single-arm trials, and we have been doing this since before the term real-world data became popular.
00;15;30;19 - 00;15;57;11 So I see real-world data as complementing what we could do in conventional randomized trials. So real-world data studies don't replace clinical trials.
I see them as complementary, and real-world data studies sometimes are the only way for us to get certain evidence. We already talked about MOSAIC-NLP, that project, but I kind of want to go a little deeper with it.
00;15;57;11 - 00;16;42;02 The idea is to tackle the challenges of using linked data, structured and unstructured, at scale. Tell us about a use case for that project and why it was chosen. For this project, Cerner actually proposed to use the association between montelukast, which is an asthma drug, and neuropsychiatric events as a motivating example. It is also important to note that the project is not designed to answer this particular safety question, because if you look at the label of montelukast, there's already a boxed warning on neuropsychiatric events.
00;16;42;02 - 00;17;18;26 So FDA already has some knowledge about this being a potential adverse event associated with the medication. The reason why Oracle Cerner proposed this project was because we actually did look at this association in a previous Sentinel study that only used structured data. Although the study provided some very useful information, we also recognized that certain information that we needed was not available in structured data, but may be available in unstructured data.
00;17;18;28 - 00;17;42;18 So if we are able to get more data from unstructured data, we might be able to understand this association better. So that's why this motivating example was chosen. Well, this is an Oracle podcast and Oracle is involved in MOSAIC, so I think it's fair to ask you about the technology challenges that are involved in what you're trying to do.
00;17;42;19 - 00;18;17;24 What does the technology have to be able to do for you to experience success? So MOSAIC-NLP is, I would say, a very ambitious project because it is using NLP to extract multiple variables that are important for the study.
That includes the study outcome, which, when you look at it, is a composite of multiple clinical outcomes, and it's also trying to extract important covariates that could help us reduce the bias associated with real-world data studies.
00;18;17;26 - 00;19;01;24 So I think technology is powerful in many ways. First, thanks to technology, the project is able to access a very large amount of data from millions of patients who seek care in more than 100 healthcare delivery systems across the country. So this was hard to imagine maybe ten or 15 years ago. But now we have access to lots and lots of data at our fingertips because of advances in technology. Because of the large amount and the complexity of the data, methods like NLP become even more important.
00;19;01;26 - 00;19;33;19 And for this project, we are also particularly interested in whether an NLP algorithm developed in one EHR system could be applied to another system, which has been a challenge in our field because each EHR system is created very differently. So an algorithm that works in one system might not work in another. So we are hoping that through advanced methods and technology, we will be able to address this problem.
00;19;33;21 - 00;19;57;15 So without these technology advances, we might not be able to do this study as efficiently as we could, or the task might not be possible. So where are we going with this? I mean, let's say the project is a success. What will that mean in terms of the FDA's goals and how NLP gets applied in medical therapeutics safety surveillance?
00;19;57;18 - 00;20;38;03 The hope is that the Sentinel System can answer even more questions than it can address today. And the way that we are trying to accomplish that is to see whether or how these complex, unstructured data, when combined with advanced analytic methods, can help us answer questions that could not be addressed by structured data alone.
I think through this project we also learned a lot about the challenges associated with analyzing a very large amount of data from multiple sources.
00;20;38;06 - 00;21;11;14 Again, the Cerner data is compiled from more than 100 systems, so it is big but also very complex. And in many of our studies we really need that large amount of data just to be able to answer the question because we may be focusing on rare exposures or rare outcomes. So you really need to start with a very large amount of data just to get to maybe the ten patients that are taking a medication.
00;21;11;17 - 00;21;44;15 And what you learn with MOSAIC, can that get applied to addressing other public health issues like disparities in asthma diagnosis and treatment, especially when you think about diverse groups? Yeah, that's a great question. So the project is not designed to address these important questions, but if we are able to better understand the completeness of social drivers of health in these data sources, then we will be able to leverage this data to answer these questions in the future.
00;21;44;18 - 00;22;04;26 I think about how a project like this gets evaluated at various steps along the way. I guess that's my question. I mean, what methods are used to ensure the validity of real-world evidence? So the good news is in the past few decades we have been using real-world data, even though we might not have been using the term.
00;22;04;28 - 00;22;36;22 So there's been a lot of progress in the field to improve the validity of real-world data studies. So we now have a pretty good framework to identify fit-for-purpose data, and we also have a very good understanding of appropriate design and analytic methods, such as target trial emulation and propensity score methods. So this project and many other projects in Sentinel are following these principles.
00;22;36;24 - 00;23;14;03 And one thing to also note is that this project is also following the overall Sentinel principle of transparency.
So everything we do will be in the public domain to allow people to reproduce or replicate the analysis. So the protocol is available in the public domain, and when we are done with the study, everything will be made publicly available. So that's one way to make sure that the work at least is reproducible or replicable.
00;23;14;05 - 00;23;43;00 And through that process, we hope to be able to improve the validity of this study. And what about comparisons? How do you compare the results from different data sources like claims data, structured data, you know, extracted unstructured data, all of that? How is that done, the comparisons? So if you're talking about the MOSAIC-NLP study, we have a pretty structured approach to address that question.
00;23;43;02 - 00;24;13;14 So we are using this proven principle of changing one thing and keeping everything else fixed to see what happens. So the project will start by using only claims data to replicate the previously done Sentinel study. And then we are going to add on structured EHR data to see whether the results are different. And then we add on NLP-extracted unstructured data, one at a time, to see whether the results change.
00;24;13;21 - 00;24;40;24 So by fixing everything else to be constant and changing one thing, we'll be able to assess the added value of EHR data, both structured and unstructured. And that's how we are going to do it within the MOSAIC-NLP study. And then what about scalability? How would you make sure the NLP models that you develop are scalable and transportable across all these different health systems, of which there are many?
00;24;40;27 - 00;25;10;10 Yeah. The question again is about transportability. So one thing that is unique about this study, as we briefly discussed earlier, is that the Cerner EHR data actually come from multiple healthcare systems.
So the NLP models that we are developing will be trained and tuned on a sample of patients from these systems and not from a single hospital network.
00;25;10;10 - 00;25;42;18 So at the development phase, we are already taking into account the potential diversity of different delivery systems. And as part of this project, we also include another delivery system to apply and test the method as part of the transportability assessment. So we are doing that to make sure that the NLP models that we are developing for this project will be useful for other systems as well.
00;25;42;20 - 00;26;12;29 There is a larger question about computational resources, so that will be an issue that would still need to be addressed, because training and tuning these NLP models with such a huge amount of data requires a lot of computing resources. So that is something that we could only partially address in our study. But if we want to apply or do the same thing in another system, that would be something to consider.
00;26;13;02 - 00;26;43;13 We talked a little bit about the collaboration with your tech partner, but these things usually have so many stakeholders and disciplines and silos. Tell us first why collaboration is a good thing and unavoidable anyway, and then what the challenges of collaboration are. Maybe some tips on how to best make them work. The problems that we face, at least many of the problems that I face, are quite complex and they require expertise from multiple domains.
00;26;43;13 - 00;27;18;19 So that calls for collaboration from multiple stakeholders. And we always have our blind spots. So we only see things in a certain way and we always miss things. So that's why I think collaboration is important. But it's really hard sometimes because we all have our priorities and perspectives and sometimes they don't align.
And I also learned throughout the years that we don't communicate enough, and we may also not have time to communicate or we may be under pressure to deliver.
00;27;18;21 - 00;27;47;21 So all of that sort of contributes to the challenges of collaborating effectively, especially when you collaborate across disciplines, because we might be using different language to mean the same thing or use the same term to describe different things. So even though we can all speak the same language, let's say English, we might not be talking about the same thing and not communicate at all
00;27;47;21 - 00;28;17;25 because we are using different jargon and terminology. So that has been tough. But I think we are getting better. And so I think that, for us within the Sentinel Operations Center, we try to communicate honestly and respectfully and we try to understand different perspectives and we try to find common ground. But I think ultimately what brings us together is that we have a shared common goal
00;28;17;27 - 00;28;44;17 in a lot of the work that we do. So for MOSAIC-NLP, we are all trying to answer the same question, which is: how do we use unstructured data and advanced analytic methods to answer safety questions? So once we align on this common goal, things become easier because we start to understand each other better and are able to communicate more effectively.
00;28;44;19 - 00;29;19;16 Just out of curiosity, what are the different stakeholders involved in MOSAIC? Who falls on the roster? We have people from different disciplines, so we have experts in natural language processing and artificial intelligence. We have epidemiologists, biostatisticians, clinicians; we have experts in psychiatric conditions and respiratory disease. We have data scientists, we have engineers, we have project managers. So it's a very big group of individuals with different expertise in this project.
00;29;19;18 - 00;29;46;14 Well, you probably noticed Oracle's really thrown itself into and committed huge resources to health and life sciences. Things got really exciting with the acquisition of Cerner and Cerner Enviza. What's Oracle doing right and what do you think it should be doing to make itself even more valuable in health and life sciences? Well, this is a great but very difficult question, so I cannot comment too much on what Oracle is doing or will be doing.
00;29;46;17 - 00;30;23;06 But I can say more generally that there have been a number of technology companies that have tried to foray into health or life sciences, I would say with mixed results. And one reason is that our health care system remains highly fragmented and complex, so it takes a lot of energy to break the status quo. So you probably know that we were one of the last countries in the world to transition from the ICD-9 to the ICD-10 coding system, and we are soon going to move to the ICD-11 system.
00;30;23;06 - 00;31;00;05 So I'll be interested to see whether the US is ready for that. And that again is maybe a reflection of just how complex and fragmented our system is. And disruptive innovations, I think, are great, but they may or may not translate into successes when they are applied to health care. That is not to say I'm pessimistic. I'm actually pretty optimistic that the perspectives and solutions and ideas brought by technology companies could help us solve a lot of problems that we have today.
00;31;00;07 - 00;31;31;26 But I think that it will be good to engage people who have been struggling with these issues early on and to work together with them to develop solutions that are not just good on paper, but also feasible in practice. So at least in my very limited experience, we have seen some very cool technology that ended up not being useful for health care just because it's very hard to change what people have been doing.
00;31;31;28 - 00;31;56;09 So again, disruptive innovations are good, but sometimes they are just very hard to adopt, at least not quickly enough for us to see meaningful changes. Yeah, that's really fascinating. It's, you know, it is disruptive innovation, but it's not always applicable to the goals you're pursuing. But it does feel like technology, where that's concerned, the future is coming at us faster and faster.
00;31;56;11 - 00;32;32;21 So what are the technologies that are most interesting to you? Is it A.I. or what big advances in public health do you see coming? Maybe sooner than we thought. Yeah. Yeah. You know, I feel like you said, some of this came too fast. Like, I wish I were closer to retirement so I don't have to worry about this. But even though I say disruptive innovation sometimes might not work in health care, I will say generative A.I. seems to be a recent exception.
00;32;32;24 - 00;33;10;14 So I would say that generative AI is definitely on the list of things that surprised me in a very nice way. I will also say that the continued fast accrual of better real-world data is also something that excites me, and the continued or increased recognition of the potential of real-world data is also something that I think is good to have. For things that came sooner than I thought, again, generative
00;33;10;19 - 00;33;44;13 AI. So if you had asked me when we would be ready for generative AI last year or two years ago, I would have said not yet, but now we are in an era where everything seems possible. So I remain extremely optimistic about generative AI and some of these large language models that will help us analyze unstructured data even more efficiently. Well, Darren, it's deeply fascinating and exciting stuff.
00;33;44;14 - 00;34;10;27 Thanks again for letting me pester you with these questions. If our listeners want to learn more about the Sentinel Operations Center, or MOSAIC, or you, what's the best way for them to do that?
So Sentinel has a public website where we post everything that we do. It is sentinelinitiative.org. And I am a member of the Department of Population Medicine at Harvard Medical School.
00;34;10;29 - 00;35;00;16 So our website is populationmedicine.org. These would be two places that would be very informative for audience members who want to know more. All right. We appreciate that. And to our listeners, go ahead and subscribe to the show. Feel free to listen to past episodes because they are free. There's a lot to learn here. And if you want to learn more about how Oracle can accelerate your own life sciences research, just go to oracle.com/life-sciences and we'll see you next time on Research in Action.