Decision Threshold Optimization for Diagnostic Tests using a Genetic Algorithm

By Roland Luethy and Ljubomir Buturovic, Inflammatix, Inc.

Introduction

Typically, a clinical classifier generates a score that corresponds to the likelihood of disease presence or of a future outcome. To facilitate decision-making, the score is sometimes converted to a discrete classification label using decision thresholds [1]. For binary classification, there is a single threshold that can be chosen by trading off sensitivity against specificity on a receiver operating characteristic (ROC) curve or a similar analysis. However, it is often desirable to partition the range of output scores into multiple bands, corresponding to different likelihoods of the disease/outcome, which in turn requires multiple thresholds that cannot be determined by inspection of ROC curves. To our knowledge, no effective solution to this problem has been described.
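For the binary case, one standard recipe is to pick the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1) along the ROC curve. The following is a minimal sketch using scikit-learn; the function and variable names are ours, for illustration only:

    import numpy as np
    from sklearn.metrics import roc_curve

    def youden_threshold(y_true, y_score):
        """Return the score threshold maximizing sensitivity + specificity - 1."""
        fpr, tpr, thresholds = roc_curve(y_true, y_score)
        j = tpr - fpr  # Youden's J at each candidate threshold
        return thresholds[np.argmax(j)]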

Here, we developed a genetic optimization algorithm, called Genetic Algorithm Thresholds (GAT), for determining decision thresholds for multiple output bands (the term “genetic” refers to the optimization method, not to the genome). We have applied this method to a three-class classifier that diagnoses the presence and type of infection in patients suspected of an acute infection and/or sepsis. The classifier uses the gene expression profile of a patient’s immune response as input features and produces scores for a patient sample indicating the probability of bacterial infection, the probability of viral infection, and the probability of no infection.

Methods

To improve interpretability and guide treatment actions, each probability (score) is partitioned into likelihood bands: its range [0, 1] is divided into five disjoint decision intervals (Fig. 1). Thus, in our application, each of the three probabilities of disease (bacterial, viral, and no infection) is divided into five bands. For example, if a given patient’s scores fall in the very likely band for viral infection (very high probability of viral infection) and the very unlikely band for bacterial infection (very low probability of bacterial infection), treatment with antibiotics may not be beneficial.
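To make the banding concrete, the sketch below (our naming, not the paper's) maps a single probability to one of five bands given four ordered thresholds:

    import numpy as np

    def assign_band(prob, thresholds):
        """Map a probability in [0, 1] to a band index 1..5.

        thresholds: sorted array of four cut points partitioning [0, 1].
        """
        return int(np.digitize(prob, thresholds)) + 1  # bands are 1-based

    # Illustrative thresholds only; real values are chosen by GAT (see below).
    band = assign_band(0.92, np.array([0.1, 0.3, 0.7, 0.9]))  # -> 5, "very likely"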

Figure 1: Partitioning of a classifier probability in five decision bands. Each of the three class probabilities computed by the classifier is partitioned into five such bands. The thresholds are specific to the probabilities and are computed independently for each class.

The decision thresholds, which define the bands, should be chosen using clinically meaningful criteria. For example, we could specify that both the confidence in and the number of patients assigned to the “extreme” bands (lowest and highest probability bands) should be as high as possible, because those are the clinically most actionable bands. We represent the stringency of each band using diagnostic likelihood ratios (LR) [2]. For example, for the bacterial and viral scores, clinical considerations (obtained through input from the clinician community) suggest that the LR for the lowest band should be at most 0.075 and for the highest band at least 7.5. Furthermore, for the test to demonstrate utility at a population level, a meaningful percentage of patients should fall in the extreme bands and few patients should fall in the non-informative middle (indeterminate) band (e.g., at least 50% of patients should be in the extreme bands and at most 10% in the middle band). To balance these requirements and find decision thresholds that meet them, we developed a tool using a genetic algorithm with a cost function encapsulating the desired criteria.
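As background for what the cost function evaluates, the band-level diagnostic likelihood ratio can be estimated as LR_i = P(score in band i | disease) / P(score in band i | no disease). A minimal sketch, again with our own naming:

    import numpy as np

    def band_likelihood_ratios(y_true, band, n_bands=5, eps=1e-9):
        """Per-band LRs from binary labels and per-sample band indices (1..n_bands)."""
        y = np.asarray(y_true, dtype=bool)
        band = np.asarray(band)
        lrs = []
        for b in range(1, n_bands + 1):
            p_dis = (band[y] == b).mean()     # P(band b | disease)
            p_nodis = (band[~y] == b).mean()  # P(band b | no disease)
            lrs.append(p_dis / max(p_nodis, eps))
        return np.array(lrs)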

Dataset Overview

The training set for the classifier consists of 29 genes (input features) profiled in 3159 patients from 42 clinical studies, assayed on gene expression microarrays. The validation set comprises 741 samples from 9 independent clinical studies, with the same 29 input features measured on the NanoString nCounter® platform [3]. To ensure consistency and accuracy, both the training and validation sets were normalized using samples from healthy patients, with the NanoString platform serving as the reference. The classifier is an advanced version of a previously published one [4].

Algorithm Overview

We implemented the Genetic Algorithm Thresholds (GAT) algorithm, shown below, in Python using the DEAP library [5, 6]. We apply steps 1 through 4 independently to each output class (i.e., to the bacterial, viral, and no-infection probabilities) to optimize the corresponding decision thresholds; a condensed implementation sketch follows the list of steps. At the completion of the analysis, 12 thresholds are generated (4 for each class).

  1. The initial population for the evolutionary (genetic) algorithm is randomly generated. A set of ‘chromosomes,’ each representing a potential solution to the problem, is created (the term “chromosome” refers to a vector representing a set of thresholds, not a DNA molecule in the cell nucleus). The chromosome corresponding to a solution has four values, representing the 4 thresholds needed to split the probabilities into 5 bands.
  2. The fitness of each chromosome in the population is evaluated using a fitness function. The function assigns a fitness score to each chromosome based on how well it satisfies the desired criteria for LR1, LR5, the percentage of patients in bands 1 and 5 combined (coverage), and the percentage of patients in band 3. Coverage is only considered if its value is below the target, so that coverage exceeding the target is not penalized.
  3. A new generation is created by selecting parents according to their fitness. Offspring are created using crossover and mutation operations. The individuals with the top 20% fitness are always kept in the population.
  4. Steps 2 and 3 are repeated until the best solution shows no improvement for a given number of iterations.
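The condensed sketch below shows how such a loop can be assembled with DEAP. The cost-function form, genetic-operator choices, and hyperparameters are illustrative assumptions: the steps above specify only the criteria (LR1, LR5, coverage, band-3 fraction) and the 20% elitism, and the sketch simplifies the stopping rule to a fixed number of generations. Here y_true is a binary indicator for the class in question (e.g., bacterial vs. not bacterial) and probs holds that class's probabilities.

    import random
    import numpy as np
    from deap import base, creator, tools

    creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
    creator.create("Individual", list, fitness=creator.FitnessMin)

    def evaluate(ind, y_true, probs):
        """Cost of one chromosome (four thresholds) on one class's probabilities."""
        t = np.sort(ind)                  # enforce ordered thresholds
        band = np.digitize(probs, t) + 1  # band index 1..5 per sample
        y = np.asarray(y_true, dtype=bool)
        def lr(b):
            p_dis = (band[y] == b).mean()
            p_nodis = max((band[~y] == b).mean(), 1e-9)
            return p_dis / p_nodis
        coverage = np.isin(band, [1, 5]).mean() * 100
        band3 = (band == 3).mean() * 100
        cost = (max(lr(1) - 0.075, 0)    # LR1 should be at most 0.075
                + max(7.5 - lr(5), 0)    # LR5 should be at least 7.5
                + max(50 - coverage, 0)  # coverage penalized only below target
                + max(band3 - 10, 0))    # keep the indeterminate band small
        return (cost,)

    def gat(y_true, probs, ngen=200, pop_size=100, elite_frac=0.2, seed=0):
        """Return four thresholds for one class. A fixed generation count
        stands in for the no-improvement stopping rule described above."""
        random.seed(seed)
        toolbox = base.Toolbox()
        toolbox.register("individual", tools.initRepeat, creator.Individual,
                         random.random, 4)
        toolbox.register("population", tools.initRepeat, list, toolbox.individual)
        toolbox.register("mate", tools.cxBlend, alpha=0.5)
        toolbox.register("mutate", tools.mutGaussian, mu=0.0, sigma=0.05, indpb=0.5)
        toolbox.register("select", tools.selTournament, tournsize=3)

        pop = toolbox.population(n=pop_size)
        n_elite = int(elite_frac * pop_size)
        for _ in range(ngen):
            for ind in pop:
                ind.fitness.values = evaluate(ind, y_true, probs)
            elite = [toolbox.clone(i) for i in tools.selBest(pop, n_elite)]
            offspring = [toolbox.clone(i)
                         for i in toolbox.select(pop, pop_size - n_elite)]
            for c1, c2 in zip(offspring[::2], offspring[1::2]):
                if random.random() < 0.7:
                    toolbox.mate(c1, c2)
            for m in offspring:
                if random.random() < 0.3:
                    toolbox.mutate(m)
                m[:] = [min(max(v, 0.0), 1.0) for v in m]  # keep thresholds in [0, 1]
            pop = elite + offspring
        for ind in pop:
            ind.fitness.values = evaluate(ind, y_true, probs)
        return np.sort(tools.selBest(pop, 1)[0])

A full analysis would call gat() once per class (bacterial, viral, no infection), yielding the 12 thresholds described above.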

Results

We trained the classifier on the training set and applied the tuned classifier to the validation set. Then we applied GAT to the validation-set probabilities. The results are summarized in Table 1, which shows the target values used for GAT and the values achieved with the best thresholds found by GAT for the bacterial and viral probabilities.

Metric                Target value   Achieved value (bacterial)   Achieved value (viral)
LR1 (lowest band)     0.075          0.089                        0.101
LR5 (highest band)    7.5            8.688                        9.678
% in bands 1 and 5    50             69.8                         69.1
% in band 3           7.5            8.6                          10

Table 1: Target values and achieved values for the three-class infectious disease classifier. GAT does not guarantee that all target values will be achieved. Nevertheless, overall performance was deemed adequate.

Figure 2 shows the bacterial and viral probabilities and the thresholds for the classifier, where blue and red dots represent patients with bacterial or viral infection, respectively. The green dots represent patients whose inflammation is caused by neither a bacterial nor a viral infection. The dotted lines represent the thresholds determined using GAT. The thresholds let us assign each sample to one of five bacterial and five viral likelihood bands. Samples that fall in bacterial band 1 are very unlikely to be bacterial infections, whereas samples in bacterial band 5 are very likely bacterial infections. Similarly, samples that fall in viral band 1 are very unlikely to be viral infections, whereas samples in viral band 5 are very likely viral infections. Figure 3 shows that most patients with bacterial infections are in bacterial band 5 and viral band 1, and most patients with viral infections are in bacterial band 1 and viral band 5.

Figure 2: Assignment of bacterial and viral probabilities to likelihood bands. The dotted lines are the thresholds determined using GAT.

Figure 3: Frequency of patients with bacterial infections, viral infections, and no infections in each of the five bacterial and viral bands defined by GAT. “Coverage” is percent of patients in bands 1 and 5.

Conclusion

We found that GAT enables efficient optimization of decision thresholds with an arbitrary number of decision regions (bands) and an arbitrary fitness function. We intend to use this method to define decision thresholds for the TriVerity™ Acute Infection and Sepsis Test, currently in development at Inflammatix.

References
  1. https://www.canassistbreast.com/sample-report.php
  2. Hayden SR, Brown MD. Likelihood ratio: a powerful tool for incorporating the results of a diagnostic test into clinical decision making. Annals of emergency medicine. 1999 May 1;33(5):575-80.
  3. Kulkarni MM. Digital multiplexed gene expression analysis using the NanoString nCounter system. Current protocols in molecular biology. 2011 Apr;94(1):Unit 25B.10.
  4. Mayhew MB, Buturovic L, Luethy R, Midic U, Moore AR, Roque JA, Shaller BD, Asuni T, Rawling D, Remmel M, Choi K. A generalizable 29-mRNA neural-network classifier for acute bacterial and viral infections. Nature communications. 2020 Mar 4;11(1):1177.
  5. Fortin FA, De Rainville FM, Gardner MA, Parizeau M, Gagné C. DEAP: Evolutionary algorithms made easy. The Journal of Machine Learning Research. 2012 Jul 1;13(1):2171-5.
  6. Wirsansky E. Hands-on genetic algorithms with Python: applying genetic algorithms to solve real-world deep learning and artificial intelligence problems. Packt Publishing Ltd; 2020 Jan 31.

Novel Technologies, Machine Learning Are the Keys to Improving Acute Infection Diagnostics

Diagnostics for acute infections and sepsis typically focus on “finding the bug,” but most patients with infections do not have pathogens in their bloodstream.

Companies like Inflammatix are working to address this issue, using host-response diagnostics to read a patient’s immune system. Inflammatix’s CEO, Tim Sweeney, believes novel technologies coupled with machine learning are the answer for physicians seeing a patient they suspect of having an acute infection.

On behalf of Diagnostics World News, Kaitlin Kelleher spoke with Sweeney about the challenges in developing these types of diagnostics, and what downstream effects they have on the healthcare and patient care systems.

Diagnostics World News: Novel technologies for host response are popping up left and right. Is this a technology that will finally be able to quickly and accurately diagnose sepsis and other acute infections at the point of care?

Tim Sweeney: The short answer is, yes, I think that measuring the host response is the technology that can answer the clinical questions a physician has when they see a patient suspected of having an acute infection or sepsis. The longer answer is that people have been trying to do this for literally decades and have not yet been successful. I think we have to have a dose of humility when discussing where the field is now and where it’s likely to go.

When we at Inflammatix have surveyed physicians, there are really three key questions that come up for patients suspected of an acute infection: Should the patient be treated with antibiotics? What downstream diagnostics are necessary? What level of care is needed (e.g., can the patient be sent home, or does the patient require admission)? With those three questions, you can basically get through the next several hours of the patient’s clinical care and move on to the next patient in the queue. The host response is uniquely suited to answer those questions, where pathogen-based tests don’t always get the whole picture.

We know that the majority of patients seen in an emergency department who are ultimately judged to have bacterial infections do not have a bloodstream infection. So patients come in and they’re suspected of maybe having an infection, but by the time all of the microbiology and the imaging, etc., is back, only about 10% of the patients who are eventually judged to have a bacterial infection anywhere in the body actually had positive blood cultures. This means that technologies that detect infections by looking for bacteria in the blood are reasonably accurate as a rule-in test but are terrible as a rule-out test.

The host response has the promise of being able to say, “Let’s understand why this patient is having symptoms.” In other words, the patient’s symptoms are due to the immune response, and that immune response is caused by something. Measuring host response can tell us what that something is, whether it’s bacterial or viral or maybe noninfectious. Maybe it’s a complication of surgery or a blood clot or any of the other noninfectious causes of acute inflammation. Then, in addition to getting the diagnosis right, of being able to say, yes, the patient has a bacterial infection, we can also gauge how severe that host response is in terms of whether the patient has sepsis because sepsis, ultimately, is an immune phenomenon. So the host response is the right place to look for both diagnostic and risk-stratification information.

All of that being said, a single host biomarker is not up to the task of answering these key questions simultaneously. If there were a single protein that had these qualities, we would have found it by now. In addition, frankly, we’re really answering two separate questions. One is, is there an infection present? The other is, how severely sick is this patient? One biomarker just cannot produce two separate axes of information. So the field is moving more broadly to multi-marker diagnostic panels for diagnosing and prognosing acute infections and sepsis. The caveats then are, first, those multi-marker panels have to be composed of robust individual markers. Then, second (and this historically has been the biggest challenge), they also have to have algorithms that integrate multiple independent biomarkers into a single clinically useful score. Overcoming both of these challenges is difficult, but that’s where Inflammatix has made a lot of progress.

So I think the answer to your original question of whether this is the right technology, is yes. Resoundingly. I don’t know that we’re there yet across the industry, but I think that there are teams (like ours) who have demonstrated an ability to select biomarkers focused on answering relevant clinical questions, and then confirmed their external validity. This shows substantial proof of concept that host response really can be a killer app for point of care diagnostics.

You’ve mentioned that we’re not there yet. What are the biggest challenges to developing these types of diagnostics?

Obviously, some are those that I just mentioned, which are choosing the right biomarkers and getting the algorithms right. Another is putting the results into a format that makes sense for the clinical question that a clinician is trying to answer. Getting the clinical outputs right is something that takes a lot of clinical understanding, clinical domain expertise, combined with world-class machine learning. We do this through understanding the clinical settings of our training data, combined with cutting edge techniques in building stable, generalizable classifiers. Those two together make the biggest difference in our ability to generate a stable classifier.

Another challenge is in making sure that the test fits into workflow, which of course depends on the clinical question. If we really want a test to fit in an emergency department at the point of care, the turnaround time for that technology should probably be less than 30 minutes. A physician needs to make a decision within an hour, and we still have to get the sample drawn and get the test report back to the physician with time to spare to make that decision about whether to treat. That being said, not every patient needs to be treated within an hour. In fact, the US healthcare system probably sees 20 million patients a year in a hospital setting where the patient is primarily suspected of having an acute infection. For the vast majority of those patients, it may be perfectly acceptable to send a sample down to a clinical lab and have an answer in more like 60 minutes. So translating host-response technologies into the clinic depends on what market is being approached. Many scenarios are valid, but of course the faster turnaround times require more advanced technologies.

In any case, for host-response diagnostics, the challenge of getting multiple measurements from multiple analytes correct in a cartridge-based format takes some development time. We, among others, are developing technologies that will meet that market need. But it has historically proved challenging to get multiplex measurements in cartridge-based formats. That being said, I think one of the advantages that we have now is that others have come before us. Several cartridge-based multiplex solutions have been brought to market, and the lessons learned from those development processes are now known in the industry. In setting out to design a next-generation point-of-care diagnostic test system, we have been able to leverage those lessons. We expect to be able to build faster and more cost-effectively than historical averages, while increasing the chance that our test system will be robustly adopted either at the point of care or in the clinical lab setting.

When you do have these technologies in the clinic, what downstream effect will we see on patient care and our healthcare system?

I will just give some conjecture here because of course we don’t really know. I think that the advent of procalcitonin brought with it the promise of a massive reduction in unnecessary antibiotic use and better adjudication of level of care. Unfortunately, its accuracy level didn’t really support that promise. We’ve seen multiple studies in the US, at least, of procalcitonin not really impacting care. So I think we have to be guarded in understanding how best the host response may ultimately change the healthcare system. On the other hand, procalcitonin does do a reasonable job at reducing antibiotics if measured in a guideline-based system where physicians are adhering to that guideline. It’s undoubtedly true that early adopters of host-response diagnostics will be successful if they implement these technologies as part of an overall diagnostic or antibiotic stewardship framework. In general, the promise of the host response is a decrease in the rate of inappropriate antibiotic prescriptions, a decrease in the rate of patients who progress to sepsis, and reduced unnecessary care. Improvements in screening should reduce both false positive and false negative results.

I think, if you look five or 10 years out, the whole field of infectious disease diagnostics will really have changed to encompass three separate, game-changing technologies. I think the patient pathway will have, up front, a host-response screening test that’s broadly used for anybody who has symptoms of acute inflammation. This will inform a physician what downstream diagnostics are needed, and whether to prescribe antibiotics. If the host-response screen is positive for bacterial infection, those patients may then deserve one of the ultra-rapid direct-from-blood pathogen identification and antibiotic susceptibility tests (ASTs). So if the host response screening upfront is positive for bacterial infection you can start empiric antibiotics, and then the downstream phenotypic AST test says whether you can narrow your antibiotic choices. I think that will cover the vast majority of patients.

In those patients who remain very ill despite that early one-two step of the host-response screening followed by phenotypic AST, I think that’s where clinical metagenomics and deep sequencing will have value: diagnosing those rare patients who are very sick but are not adequately diagnosed by either of the first two technologies. I think those three technologies together can answer all of the key diagnostic questions in patients in a sepsis care pathway. These are, initially: does the patient need antibiotics, then what downstream diagnostics, and what level of care? If it’s bacterial, it goes to rapid ID and AST. If they’re still sick despite treatment, that’s when clinical metagenomics will really make sense. So a care pathway progresses from a broad, fast, general solution through to more niche, high-value but expensive solutions. I think that’s the spectrum that we’ll end up seeing in place over the next five to 10 years.


Editor’s note: Kaitlin Kelleher, Conference Producer at Cambridge Healthtech Institute, is planning a conference dedicated to Molecular Diagnostics for Infectious Disease next month at the Molecular Medicine TriConference, March 10-15 in San Francisco. Sweeney will be speaking on the program; their conversation has been edited for length and clarity.

Originally published in Diagnostics World, 2019.