Integrating public data sources for life sciences research

Using global public research to enrich proprietary data can provide pharmaceutical companies with valuable new insights that help inform strategic decisions. Reusing this public data can be a time- and cost-efficient way to enable novel scientific exploration by fast-tracking discovery and streamlining the process from R&D to clinical trial to market.


The challenges of integrating public data sources

In life sciences, a plethora of research groups and public institutions each specialize in their own research space, with their own methods of data access, alongside large public repositories such as NCBI (the National Center for Biotechnology Information) and EBI (the European Bioinformatics Institute). All these datasets contain extremely valuable information, but due to their highly varied nature they may be overlooked by researchers who either do not know they exist or are unable to query them efficiently.

Traditionally, a great deal of manual work is involved in bringing all the available data together to generate novel insights. Most of these data sources are stored in different repositories with custom browsing and querying mechanisms. On top of that, no standard naming convention exists, so classification systems can vary greatly. For example, a gene might have one reference in NCBI, another in Ensembl, and yet another in HGNC (the HUGO Gene Nomenclature Committee), all referring to the same gene. Every type of entity, such as genes, proteins, compounds, and targets, tends to have different names and classifications across sources. All these sources can provide potentially useful information for future research, provided they can be harmonized, consolidated, and interpreted effectively.

Integrating harmonized data into a knowledge graph using a semantic layer provides an excellent foundation for AI solutions that can discover hidden relationships between drugs, genes, and diseases across multiple datasets. This is achieved through cross-references, contextualization of all sources, and synonyms, which allow information about a single entity to be combined across multiple data sources while maintaining the original source for tracing data provenance.
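To illustrate the idea of synonym-based consolidation, the sketch below merges three records that refer to the same gene under different identifiers, using a cross-reference table, while keeping each original identifier for provenance. All names and identifiers are illustrative examples, not DISQOVER's actual internals.

```python
from collections import defaultdict

# Hypothetical records for the same gene, as exported from three sources.
records = [
    {"source": "NCBI",    "id": "GeneID:7157",      "symbol": "TP53"},
    {"source": "Ensembl", "id": "ENSG00000141510",  "symbol": "TP53"},
    {"source": "HGNC",    "id": "HGNC:11998",       "symbol": "TP53"},
]

# A cross-reference table declaring which identifiers are synonyms.
xrefs = {
    "GeneID:7157": "TP53",
    "ENSG00000141510": "TP53",
    "HGNC:11998": "TP53",
}

def consolidate(records, xrefs):
    """Group records by canonical entity, keeping per-source provenance."""
    entities = defaultdict(lambda: {"ids": {}, "sources": []})
    for rec in records:
        key = xrefs[rec["id"]]                      # resolve to a canonical entity
        entities[key]["ids"][rec["source"]] = rec["id"]  # keep original identifier
        entities[key]["sources"].append(rec["source"])   # track provenance
    return dict(entities)

merged = consolidate(records, xrefs)
print(merged["TP53"]["ids"])
```

The three source records collapse into a single entity, yet any of the three original identifiers can still be used to look the gene up, which is the behavior described above.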

DISQOVER: An integrated data solution

ONTOFORCE has developed a platform, DISQOVER, which pulls together a wide range of public data sources and can also be used to integrate a company’s own proprietary data. DISQOVER offers a customizable user interface that users can interact with on a range of levels, regardless of their data science experience. When a researcher queries the system, any name, reference, or synonym can be used for a given subject, whether it is a gene, disease, compound, or something else. No matter which identifier is used, the data can be retrieved regardless of its source. Consolidating information across the various silos in this way allows the data to be traversed very efficiently.

For example, a person may be named as an author in a publication and may also be a principal investigator in a clinical study, appearing in the clinical registries as well. That clinical study will in turn be testing a drug that has a related clinical dataset. Using semantic information, these various references to a single person can be consolidated into one “object” in the knowledge graph, which then creates links between the person in the middle and the publication, clinical data, drug information, and so on. This type of relationship linking is carried out over all the data sources to create a fully integrated knowledge graph.
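The linking step above can be sketched as follows: once two source references resolve to the same person, every edge from every source attaches to one consolidated node. The aliases, predicates, and IRIs here are hypothetical, chosen only to mirror the example in the text.

```python
edges = []

# The same person appears under two different source references.
aliases = {
    "pubmed:author/J-Smith": "person:42",   # author record in a publication
    "ctgov:pi/Jane_Smith":  "person:42",    # principal investigator in a registry
}

def link(reference, predicate, obj):
    """Resolve a source reference to its consolidated node, then add an edge."""
    edges.append((aliases[reference], predicate, obj))

link("pubmed:author/J-Smith", "authored", "publication:PMID123")
link("ctgov:pi/Jane_Smith", "investigates", "trial:NCT00001")

# Both relationships now hang off the single consolidated person node.
print(edges)
```

Because both edges share the subject `person:42`, a query starting from the publication can traverse through the person to the clinical trial, even though no single source contained that path.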

DISQOVER has pre-ingested licensed data from more than one hundred sources, which adds to the richness of the available pool, and the selection of sources can be specifically tailored depending on requirements.

Up-to-date, consolidated data

This data integration is not a one-time process. Due to the nature of science and research, all databases are updated continuously with new data from the countless publications, clinical trial results, and other outputs produced daily. Some sources update on a predictable basis: there is a new version of PubMed every day, ChEMBL every three months, and Drug Central every month. Keeping up with these developments requires a great deal of maintenance. This is done through a fully automated process that screens for new data every day and triggers an update of the knowledge graph when needed.

Sometimes these database updates result in format changes. Therefore, simultaneous checks of data quality and integrity also occur automatically to identify whether the intervention of a dedicated team is needed. This team follows any changes in the underlying data sources and ensures the larger changes are incorporated as quickly as possible, as well as adding new sources as they become available. In this way, the knowledge graph is usually rebuilt every five to ten days.

Value and advantage of DISQOVER

DISQOVER’s knowledge graph with pre-ingested data can be effectively queried to retrieve novel results and insights through an easy-to-use interface that can be tailored to the user’s requirements. A researcher who wishes to investigate a drug-related concept may rely solely on a single source of information (such as Drug Central or Inxights) for manual searches, but through DISQOVER a much wider variety of information is easily accessible, adding information about drug targets, biological mechanisms, real-world genomic information, and more. The data sources can also be tailored to select the most relevant for each individual case, and the resulting knowledge graph integrates the harmonized data to make use of all the links and relationships that exist.

Since DISQOVER is already available with all these public data sources pre-ingested into a knowledge graph, it is much easier for the customer to implement. All sources are fully licensed, validated, and documented in a manner that preserves the diversity and full lineage of the original sources for traceability. This means that the customer can benefit from all the public data without having to maintain it or verify sources themselves. The consolidated knowledge graph preserves the data’s diversity, and this preserved diversity can then be used to normalize the data. If necessary, the original source of any piece of data can be retrieved: whether a disease record came from ICD-10, MONDO, MeSH, or somewhere else, the disease will still be recognizable by its ICD-10 ID, and vice versa. In much the same way, the customer’s own data draws on similar types of sources as public data, so the platform can assist in normalizing internal data, even when it carries disparate references.
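The kind of cross-vocabulary normalization described above can be sketched with a reverse index: any known identifier resolves to the canonical entity, and from there every other vocabulary's code is reachable. The mapping below is a single illustrative entry, not an authoritative crosswalk.

```python
# Illustrative cross-references for one disease across three vocabularies.
DISEASE_XREFS = {
    "type 2 diabetes mellitus": {
        "ICD-10": "E11",
        "MONDO":  "MONDO:0005148",
        "MeSH":   "D003924",
    },
}

# Reverse index: any identifier resolves to the canonical disease name.
reverse = {
    code: name
    for name, codes in DISEASE_XREFS.items()
    for code in codes.values()
}

def normalize(identifier):
    """Map any known source identifier back to the canonical entity, or None."""
    return reverse.get(identifier)

print(normalize("MONDO:0005148"))  # type 2 diabetes mellitus
# Starting from an ICD-10 code, the MONDO code is reachable via the entity.
print(DISEASE_XREFS[normalize("E11")]["MONDO"])  # MONDO:0005148
```

The same pattern applies to internal data: once a company's in-house disease codes are added as one more vocabulary in the cross-reference table, they normalize against the public identifiers automatically.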

Aligned with FAIR data principles

DISQOVER is fully aligned with the FAIR data principles. The data is findable and accessible (it can be discovered by and made available to others), interoperable (it can be integrated with other data and cross-referenced while preserving its diversity), and reusable by others. The transformation performed by DISQOVER’s knowledge graph applies these principles, ensuring all information contained therein retains maximum accessibility and usability.

Book a demo with ONTOFORCE today to learn how the latest version of DISQOVER, with optimized pre-ingested public data, can supplement your proprietary data and benefit your organization.