A new focus on diseases, active substances, targets and the links in between them
In the last few years, the number of public data sources integrated into DISQOVER has grown steadily, and we crossed the triple-digit barrier in 2016. Nevertheless, the number of data sources on our waiting list for integration is growing just as fast.
The initial situation
In the beginning, the first data sources were the so-called ‘fundamental research databases’: genes from NCBI Gene, proteins from UniProt, etc. In other words, the basic toolkit of a molecular biologist or geneticist. These sources are mostly well structured, contain well-annotated links to each other and to the literature (PubMed), and were relatively easy to start with.
The next step was to expand our data coverage in the direction of drug discovery and drug development. New data types were introduced, such as ‘Clinical Study’, ‘Disease’, ‘Organism’ and ‘Drug/Chemical’. This last data type became the collection point for chemical data from major contributing sources such as ChEMBL, ChEBI and PubChem, as well as for biologically active compounds under research or used in practice to treat humans and animals, originating from sources such as DrugBank, ChEMBL again, HSDB, UNII, ATC, etc.
More sources were added, resulting in more links between the ‘Drug/Chemical’ data type and ‘Publication’, ‘Clinical Study’, ‘Protein’, ‘Gene’, ‘Enzyme’, ‘Disease’, ‘Variant’, ‘Organism’, ‘Pathway’ and ‘Medicine’ (see Fig. 1). Each of these is a logical link: drugs are related to the clinical studies in which they are tested under controlled conditions; drugs are related to proteins because, in specific cases, they target a protein; and so on.
Fig. 1: Example of Linked Data of the Drug/Chemical ‘Trastuzumab’ (DISQOVER v 3.01.0 mid-April 2017)
The triangle active substance, target and disease
As the number of data sources grew, so did the complexity of the ‘Drug/Chemical’ data type. It became hard to add new data or new features to address specific use cases in (medicinal) chemistry or in the later stages of drug development. We asked our users for feedback and concluded that the time had come to split the data type in two. We planned this surgical process, and our data scientists re-analyzed all related data sources to create two new data types: ‘Chemical’ and ‘Active Substance’. The former has become the container for all pure chemical data and is now ready to hold more complex and advanced cheminformatics data (more about that in a future blog). The latter is the place for all pharmaceutical, bioactive substances and will be expanded to include other bioactive molecules.
Small molecules are present in both data types, while biologicals are found almost exclusively in the ‘Active Substance’ data type. The release of this split also brings new filter options, and links with other data types become clearer or better annotated thanks to our recently released ‘typed links’ feature (see Fig. 2).
Fig. 2: Example of Linked Data of the Active Substance ‘Trastuzumab’ (DISQOVER v 3.10.0 end of April 2017)
The typed links are used to clearly link active substances to their targets (see Fig. 3).
The concept ‘target’ isn’t defined as a separate data type in DISQOVER since it’s only a target in the context of a relationship. Consequently, we use it in a typed link between an Active Substance and a Protein, as in the example in Fig. 3. Targets will mainly be proteins but could also be a transcript, gene or organism, among other things.
Fig. 3: Typed links from the Active Substance ‘Trastuzumab’ to the data type Protein (DISQOVER v 3.10.0 end of April 2017)
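One way to picture a typed link is as an edge that carries its own relationship type, so that ‘target’ lives on the link rather than on the protein itself. The Python sketch below is purely illustrative: the TypedLink class, its field names, the link-type labels and the identifiers are all hypothetical and do not represent DISQOVER's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedLink:
    """Hypothetical edge between two entities; the link itself carries the type."""
    source_type: str   # e.g. "Active Substance"
    source_id: str
    link_type: str     # e.g. "has target" (label is an assumption)
    target_type: str   # e.g. "Protein", but could be "Gene", "Organism", ...
    target_id: str


def linked_ids(links, source_id, link_type, target_type=None):
    """Return the ids reachable from source_id over links of the given type."""
    return [
        link.target_id
        for link in links
        if link.source_id == source_id
        and link.link_type == link_type
        and (target_type is None or link.target_type == target_type)
    ]


# Illustrative data only: identifiers are placeholders, not DISQOVER records.
links = [
    TypedLink("Active Substance", "trastuzumab", "has target", "Protein", "ERBB2"),
    TypedLink("Active Substance", "trastuzumab", "mentioned in", "Publication", "example-publication"),
]

print(linked_ids(links, "trastuzumab", "has target", target_type="Protein"))
```

Because the role sits on the link, the same protein record can be a target for one substance and simply a related protein for another, without needing a separate ‘Target’ data type.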
If you take a closer look at the active substance details, you’ll see that information is aggregated from different sources: DrugBank, DrugCentral, HSDB, MeSH, UNII, UMLS, RxNorm, etc. This data is organized in sections containing identifiers and links to external sources, pharmacological classifications, pharmacological properties, related diseases, and interactions with targets, transporters, enzymes and other biological entities (see Fig. 4).
Fig. 4: Example of a few identifiers and external links of the Active Substance ‘Trastuzumab’ (DISQOVER v 3.10.0 end of April 2017)
Also new are the typed links between Active Substances and Diseases, which specify the disease(s) for which an active substance is indicated, contraindicated or used off-label (see Figs. 5 a & b).
Fig. 5: (a) Links between ‘Trastuzumab’ and diseases. (b) Example of one disease that Trastuzumab is indicated for (DISQOVER v 3.10.0 end of April 2017)
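To make the three kinds of disease link concrete, here is a small Python sketch that groups a substance's diseases by link type. It rests on assumptions: the triple layout, the link-type labels and the placeholder disease names are hypothetical and are not taken from DISQOVER.

```python
from collections import defaultdict

# (active substance, link type, disease) triples.
# Labels and placeholder diseases are illustrative assumptions.
disease_links = [
    ("trastuzumab", "indicated for", "breast cancer"),
    ("trastuzumab", "used off-label for", "example-disease-x"),
    ("example-substance", "contraindicated for", "example-disease-y"),
]


def diseases_by_link_type(links, substance):
    """Group the diseases linked to one substance by the type of the link."""
    grouped = defaultdict(list)
    for linked_substance, link_type, disease in links:
        if linked_substance == substance:
            grouped[link_type].append(disease)
    return dict(grouped)


for link_type, diseases in diseases_by_link_type(disease_links, "trastuzumab").items():
    print(f"{link_type}: {', '.join(diseases)}")
```

Keeping the link type as data, rather than as three separate lists, means a new relationship type can be added without changing the grouping logic.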
As you may know, we always keep track of your complete search strategy. You can visualize it at any moment, which makes it easy to move back and forth in your search (see Fig. 6).
Fig. 6: Overview of a search for targets and indications of Trastuzumab
We’re eager to release this newly packaged data to our customers. Additional public data sources for the Active Substance and Chemical data types will be included in DISQOVER during the course of this year. Our goal is to keep DISQOVER one of the most complete open access platforms for active substance and chemical data, in order to help drug research find the cures of the future.
If you’re new to DISQOVER, you can sign up for the free version.