
Computer science researchers at Rice University have found bias in widely used machine learning tools for immunotherapy research.

PhD students Anja Conev, Romanos Fasoulis and Sarah Hall-Swan, working with computer science faculty members Rodrigo Ferreira and Lydia Kavraki, reviewed publicly available peptide-HLA (pHLA) binding prediction data and found it to be skewed toward higher-income communities. Their paper examines how biased data input affects the algorithmic recommendations used in pivotal immunotherapy research.

Peptide-HLA binding prediction, machine learning and immunotherapy

HLA is a gene in all humans that encodes proteins that function as part of our immune response. Those proteins bind to protein fragments called peptides in our cells and flag infected cells for the body’s immune system, so it can respond and ideally eliminate the threat.

Different people have slightly different variants, called alleles, in their genes. Current immunotherapy research is looking for ways to identify peptides that can bind more effectively to a patient’s HLA alleles.

The end result could be customized and highly effective immunotherapy. This is why one of the most important steps is to accurately predict which peptides will bind to which alleles. The higher the accuracy, the better the potential efficacy of the therapy.

But calculating how effectively a peptide will bind to an HLA allele takes a lot of work, which is why machine learning tools are being used to predict binding. This is where Rice’s team found a problem: the data used to train these models geographically favors higher-income communities.

Why is this a problem? Without accounting for data from lower-income communities, immunotherapies developed for them in the future may not be as effective.

“Each of us has different HLAs that they express, and those HLAs vary between different populations,” Fasoulis said. “Given that machine learning is used to identify potential peptide candidates for immunotherapies, if you basically have biased machine models, then those treatments won’t work equally well for everyone in every population.”

Redefining ‘pan-allele’ binding predictors

Regardless of the application, machine learning models are only as good as the data you feed them. Bias in that data, even unconscious bias, can affect the conclusions the algorithms draw.

Machine learning models currently used for pHLA binding prediction claim that they can extrapolate to allele data not present in the datasets on which the models were trained, calling themselves “pan-allele” or “all-allele.” The Rice team’s results call this into question.

“What we’re trying to show here and sort of debunk is the idea of ‘pan-allele’ machine learning predictors,” Conev said. “We wanted to see if they really worked for data that’s not in the datasets, which is data from low-income populations.”

Fasoulis’ and Conev’s group tested publicly available data on pHLA binding prediction, and their results supported their hypothesis that bias in the data was creating a corresponding bias in the algorithms. The team hopes that by bringing this discrepancy to the attention of the research community, a truly pan-allele method for predicting pHLA binding can be developed.

Ferreira, a faculty advisor and co-author of the paper, explained that the problem of bias in machine learning cannot be solved unless researchers think about their data in a social context. From a certain point of view, a dataset may simply appear “incomplete,” but making connections between what is or is not represented in the dataset and the underlying historical and economic factors affecting these populations is key to identifying bias.

“Researchers using machine learning models sometimes naively assume that these models can adequately represent the global population,” Ferreira said, “but our research points to the significance of when this is not the case. Although the databases we studied contain information on people from many regions of the world, that does not make them universal. What our research found was a correlation between the socioeconomic status of certain populations and how well they were represented in the databases or not.”

Professor Kavraki echoed this sentiment, stressing how important it is that the tools used in clinical work be accurate and honest about any shortcomings they may have.

“Our study of pHLA binding is in the context of personalized immunotherapies for cancer, a project conducted in collaboration with MD Anderson,” Kavraki said. “Eventually the tools developed make their way into clinical pipelines. We need to understand the biases that may exist in these tools. Our work also aims to inform the research community about the difficulties of obtaining unbiased datasets.”

Conev noted that, although the data was biased, the fact that it was publicly available for the team to review was a good start. The team hopes its findings will lead new research in a positive direction, one that includes and supports people across demographic lines.

The paper is published in the journal iScience.

More information:
Anja Conev et al., HLAEquity: examining biases in pan-allelic peptide-HLA binding predictors, iScience (2023). DOI: 10.1016/j.isci.2023.108613

Provided by
Rice University


Citation: Widely used machine learning models reproduce dataset bias: Study (2024, February 18), retrieved 18 February 2024 from https://phys.org/news/2024-02-widely-machine-dataset-bias.html
