Jehad Aldahdooh defends his PhD thesis on Building a comprehensive drug-target knowledge base using biomedical text mining

On 24 May 2024, M.Sc. Jehad Aldahdooh defends his PhD thesis on Building a comprehensive drug-target knowledge base using biomedical text mining. The thesis is related to research done in the Doctoral Programme in Computer Science and in the Network Pharmacology for Precision Medicine group (Faculty of Medicine).

M.Sc. Jehad Aldahdooh defends his doctoral thesis "Building a comprehensive drug-target knowledge base using biomedical text mining" on Friday 24 May 2024 at 12 o'clock in the University of Helsinki Main Building, hall Tekla Hultin (F3003, Fabianinkatu 33, 3rd floor). His opponent is Professor Hong-Gee Kim (Seoul National University, South Korea), and the custos is Professor Ville Mustonen (University of Helsinki). The defence will be held in English.

The thesis of Jehad Aldahdooh is a part of research done in the Doctoral Programme in Computer Science and in the Network Pharmacology for Precision Medicine group in the Faculty of Medicine at the University of Helsinki. His supervisors have been Associate Professor Jing Tang and Senior Researcher Ziaurrehman Tanoli (University of Helsinki).

Building a comprehensive drug-target knowledge base using biomedical text mining

Recently, the focus of cancer drug discovery has shifted towards developing targeted drugs that act specifically on deregulated proteins in cancer tissues. Despite extensive efforts to sequence cancer genomes and identify potential drug targets, the efficacy of targeted drugs in clinical trials has often been disappointing due to inconsistent treatment responses. This can be attributed to an incomplete understanding of drug-target interactions (DTIs) and of how they contribute to treatment efficacy and adverse effects. This thesis aims to address this gap by FAIRifying drug screening experiments and using text mining techniques to build a comprehensive knowledge base of drug targets, which is significant for enhancing precision medicine.

The high-level objective of this research is divided into three main tasks. Firstly, we have developed the Minimal Information for Chemosensitivity Assays (MICHA) pipeline, which enables the FAIRification of drug screening experiments, that is, making them Findable, Accessible, Interoperable, and Reusable. MICHA provides a web server and database that integrate compound annotations, including chemical structures, targets, and disease indications. It also facilitates the annotation of cell line samples, assay protocols, and literature references through curated catalogues.

Secondly, we tackle the challenge of handling the large number of scientific articles published in drug discovery research. While text mining techniques have been widely applied to extract other types of relationships, such as protein–protein interactions (PPIs) and disease-gene interactions, few studies have addressed the automatic identification of articles that contain DTIs. To this end, we employed Bidirectional Encoder Representations from Transformers (BERT) to classify articles that potentially contain DTIs. Furthermore, we aim to predict the assay format, as DTI data is closely tied to the specific assay used for its generation. Our method identifies a significant number of articles (0.6 million) not previously included in public DTI databases. We achieved high accuracy in identifying articles with quantitative drug-target profiles, while the prediction of assay formats leaves room for improvement.

Finally, we explore the challenge of extracting drug-target interactions, framed as an entity-relationship extraction task, using advanced pre-trained transformer models like BERT. To enhance the extraction accuracy, we incorporate two distinct ensemble strategies. The first strategy fuses a pre-trained language model with Convolutional Neural Networks (CNN) to discern the relationships more effectively. The second strategy combines gene descriptions derived from the Entrez Gene database with chemical descriptions obtained from the Comparative Toxicogenomics Database (CTD). The ensemble model that uses descriptions proves superior, achieving an F1 score of 80.6 on the hidden DrugProt test set. This performance outpaces other competing models. Furthermore, our analysis comparing gene textual descriptions from the Entrez Gene and UniProt databases provides valuable insights into their influence on extraction performance.

The importance of this research extends beyond its technical contributions. By enhancing the accuracy and depth of drug-target interaction data, the research has potential implications for improving the prediction and understanding of drug efficacy and adverse reactions in cancer therapy. It sets the stage for more precise and individualized therapeutic strategies, which are the cornerstone of personalized medicine. Ultimately, the methods and findings of this research have the potential to support the successful development of new drugs and the repurposing of existing ones, underscoring its significance in the ongoing battle against cancer.

Availability of the dissertation

An electronic version of the doctoral dissertation will be available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-952-84-0128-5.

Printed copies will be available on request from Jehad Aldahdooh: jehad.aldahdooh@helsinki.fi.