Data-driven target identification | ONTOFORCE


Enhance Target Identification through data integration and management

Integrated data is an essential resource for identifying potential targets for life sciences research and development. A knowledge graph sets a strong foundation for querying a wide range of public and privately generated data using AI and machine learning. 

ONTOFORCE team
5 min

Identifying new biological targets for drug development is challenging for many reasons. Well-established processes, molecules, or pathways are popular choices for investigation, so there is an ever-present risk that a competitor will beat you to market and that your investment of both time and money will be in vain. The same applies to the more easily identified novel targets, which competitors can find just as easily and which carry the added risk of high failure rates during development because so little is known about them.

How can data be used to optimize target ID? 

High-quality data and effective data management play a critical role in optimizing target identification (target ID) in pharmaceutical research, making the process more efficient, effective, and likely to lead to successful therapeutic outcomes. The sources of these data can be either internally generated or publicly available, and both provide valuable insights leading to potential target ID. 

Current methods of identifying new drug targets can be extremely time-consuming because data is dispersed across multiple sources and stored in very different formats. Technical teams and scientists must collaborate closely to ensure that the available data is of reliable quality and is stored in a form that can be queried and retrieved easily. Even then, there is a high chance of missed opportunities due to the disparate and dispersed nature of publicly available data.

  1. Omics technologies generate vast amounts of data that are rich in insights into disease mechanisms at the genetic, mRNA, protein, and metabolic levels. Integrating these data, whether internally generated or publicly available, lets researchers pinpoint molecular changes associated with diseases and so identify potential targets for therapy.
  2. Real-world and patient-derived data, like those found in Electronic Health Records (EHRs), can provide a wealth of information about disease progression and treatment outcomes in routine clinical practice.
  3. Public databases and literature, such as PubMed, clinical trial databases, and gene expression databases, can elucidate disease biology, target function, and potential drug candidates. 
  4. Preclinical studies provide valuable insights into the efficacy and safety of compounds targeting specific molecules and can help prioritize targets based on their potential therapeutic benefits and feasibility for drug development. 

Integrating data in a knowledge graph for optimized, data-driven target identification 

A knowledge graph is a structured representation of information that captures the relationships between different entities, which in life sciences could include genes, proteins, drugs, diseases, and biological processes. It assists with optimizing data integration while also enabling better exploration and analysis of that data. 
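To make the idea concrete, here is a minimal Python sketch of a knowledge graph stored as subject-relation-object triples. It is not how DISQOVER is implemented; the entity names (`GENE_A`, `PROTEIN_A`, and so on), the relation labels, and the helper functions are all hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); invented for illustration only.
TRIPLES = [
    ("GENE_A", "encodes", "PROTEIN_A"),
    ("PROTEIN_A", "participates_in", "PATHWAY_X"),
    ("DRUG_1", "inhibits", "PROTEIN_A"),
    ("GENE_A", "associated_with", "DISEASE_Y"),
]


def build_index(triples):
    """Index triples by subject and by object so either end can be queried."""
    by_subject = defaultdict(list)
    by_object = defaultdict(list)
    for subj, rel, obj in triples:
        by_subject[subj].append((rel, obj))
        by_object[obj].append((subj, rel))
    return by_subject, by_object


def neighbors(entity, by_subject, by_object):
    """Return (relation, other_entity, direction) for every edge touching `entity`."""
    outgoing = [(rel, obj, "out") for rel, obj in by_subject.get(entity, [])]
    incoming = [(rel, subj, "in") for subj, rel in by_object.get(entity, [])]
    return outgoing + incoming


by_subject, by_object = build_index(TRIPLES)
protein_links = neighbors("PROTEIN_A", by_subject, by_object)
for relation, other, direction in protein_links:
    print(relation, other, direction)
```

Even in this toy form, one lookup on the protein returns the gene that encodes it, the drug that inhibits it, and the pathway it takes part in, which is the kind of cross-entity exploration the paragraph above describes.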

One of the greatest issues with creating such a knowledge graph is not the data integration itself but getting the data into a state that can be integrated. For example, the same gene, protein, or pathway may appear under different names in different sources, so a consistent ontology is essential.
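The naming problem can be sketched as a synonym lookup that maps every known label to one canonical identifier before integration. This is only an illustrative assumption about how such harmonization might look: the identifiers (`EX:0001`, `EX:0002`), the labels, and the function names are hypothetical and do not come from any real ontology.

```python
# Hypothetical synonym table; IDs and labels are placeholders, not real ontology entries.
SYNONYMS = {
    "EX:0001": {"gene a", "gna", "gene-a"},
    "EX:0002": {"disease y", "y syndrome"},
}


def normalize_label(label):
    """Lowercase, treat hyphens as spaces, and collapse repeated whitespace."""
    return " ".join(label.lower().replace("-", " ").split())


def build_lookup(synonyms):
    """Map every normalized synonym to its canonical identifier."""
    lookup = {}
    for canonical_id, labels in synonyms.items():
        for label in labels:
            lookup[normalize_label(label)] = canonical_id
    return lookup


def resolve(label, lookup):
    """Return the canonical ID for a label, or None if the label is unknown."""
    return lookup.get(normalize_label(label))


lookup = build_lookup(SYNONYMS)
print(resolve("Gene-A", lookup))
print(resolve("  Y  Syndrome ", lookup))
print(resolve("unmapped term", lookup))
```

Returning `None` for unmapped labels, rather than guessing, keeps unresolved terms visible so they can be curated before they enter the graph.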

DISQOVER from ONTOFORCE is a comprehensive data integration and analytics platform that is based on semantic technology and an ontology-based knowledge graph. It seamlessly integrates an organization’s internal, siloed data with licensed and public data in one easy-to-use, customizable platform, thereby allowing efficient data exploration and analysis. 

DISQOVER pulls together data from a wide range of public data sources, all with full provenance tracking to aid with regulatory compliance and risk mitigation. It offers an out-of-the-box target ID solution using pre-ingested sources like MONDO and Open Targets that can be integrated with internal data like electronic laboratory notebooks and target repositories, and combined with prebuilt connectors and pipelines to third-party sources like Clarivate and Cortellis.

Target identification powered by AI 

Recent developments in AI build on the power of integrated data by analyzing gene expression profiles and protein-protein interaction networks to predict potential biological targets and to screen virtually for associated drug candidates. Once a suitable candidate has been identified, AI can also predict its physical and chemical characteristics, or even generate novel compounds to expand the range of potential drug candidates.

Large language models can help here by identifying patterns, semantic relationships, and syntactic structures within the data and generating a coherent output to answer questions and generate hypotheses. However, these models are prone to inaccuracies, bias, and hallucinations, producing content that is irrelevant or inconsistent with the input data, since they have no true understanding of the words they predict.

Generative AI is focused on generating content rather than analyzing existing data and has applications in various fields such as molecule generation, compound optimization and de novo design. Combined with knowledge graphs, these machine learning models can significantly boost the drug discovery process through relationship modeling and contextual understanding to discover hidden insights. 

Link prediction on the knowledge graph can accelerate therapeutic target discovery and the understanding of disease mechanisms by revealing potential drug-target interactions and disease-gene associations that are crucial for informed development decisions. Machine learning uses the knowledge graph to generate hypotheses, make predictions, and construct comprehensive networks, and large language models can help triage the results. This process involves analyzing node interactions, considering factors like known gene-target interactions, shortest paths, and common neighbors, or removing time-based edges to see whether the model can regenerate them. In this way, the process from target identification to an Investigational New Drug (IND) application can take as little as 12 months.
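Two of the factors named above, common neighbors and shortest paths, can be illustrated with a small heuristic link-prediction sketch in Python. This is a simplified stand-in, not the machine learning approach used in DISQOVER: the graph, the entity names, and the ranking rule (more common neighbors first, then shorter paths) are assumptions made for the example.

```python
from collections import defaultdict, deque

# Hypothetical undirected edges; entity names are invented for illustration.
EDGES = [
    ("GENE_A", "PATHWAY_X"),
    ("GENE_B", "PATHWAY_X"),
    ("GENE_C", "PATHWAY_Z"),
    ("DISEASE_Y", "PATHWAY_X"),
    ("DISEASE_Y", "PROTEIN_P"),
    ("GENE_A", "PROTEIN_P"),
    ("GENE_B", "DISEASE_Y"),  # a known gene-disease association
    ("PATHWAY_Z", "PROTEIN_P"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighboring nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def rank_candidates(adj, candidates):
    """Score unlinked pairs: more common neighbors first, then shorter paths."""
    scored = []
    for a, b in candidates:
        common = len(adj.get(a, set()) & adj.get(b, set()))
        dist = shortest_path_length(adj, a, b)
        scored.append((a, b, common, dist))
    return sorted(
        scored,
        key=lambda row: (-row[2], row[3] if row[3] is not None else float("inf")),
    )


adj = build_adjacency(EDGES)
ranked = rank_candidates(adj, [("GENE_C", "DISEASE_Y"), ("GENE_A", "DISEASE_Y")])
for gene, disease, common, dist in ranked:
    print(gene, disease, "common neighbors:", common, "path length:", dist)
```

In this toy graph, `GENE_A` shares a pathway and a protein with `DISEASE_Y`, so it ranks above `GENE_C`, which is connected only through a longer path. A production system would typically replace these hand-written heuristics with learned models, but the graph features they consume are of the same kind.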

Watch this video to learn how our DISQOVER platform has helped our client e-therapeutics to accelerate the discovery of their life-transforming RNAi medicines for liver diseases. Plugging their own HepNet™ data into DISQOVER’s pre-ingested public data sources to create an integrated and user-friendly interface for querying this comprehensive data set has transformed their target ID and drives them closer to their goal of fully automating their preclinical drug discovery process.