Bring epidemiology data and disease genes in closer contact

In my previous blog post, I explained that the use of different disease classifications or encodings in data sources like the US and EU clinical trial registries doesn’t hamper the integration and linking of this kind of data.

Disease classifications are also used to precisely define diseases in other contexts like epidemiology, pharmacovigilance, toxicology, pharmacology, genetics, etc. This data is scattered across a plethora of data sources, maintained by different governmental and other non-profit organizations like research consortia and institutes or individual research groups. If they are keen on providing meaningful and useful data, data providers try to avoid using disease terms that aren’t defined precisely in an ontology.

The trouble with search

Just imagine the potential confusion when literature or a data source uses the term ‘diabetes’ without any further explanation. Does it mean all types of diabetes or only the most common one, which in the Western world is type 2 diabetes? Such questions don’t arise when a term is chosen from an ontology, where each term carries a precise definition and the relationships with closely related terms are explicit (e.g. ‘type 2 diabetes’ is a more specific form of the general term ‘diabetes’ and has sibling terms like ‘type 1 diabetes’). Most of us know the difference between type 1 and type 2 diabetes, but it doesn’t end there if you dive deeper into the ‘diabetes sea’.

Healthcare professionals widely use the ‘Systematized Nomenclature of Medicine Clinical Terms’ (SNOMED CT) terminology in medical reporting. SNOMED CT is very extensive, which makes it suitable for fully describing a medical situation. The major drawback is that you can lose yourself in its multilevel hierarchy of terms. Diabetes mellitus, the official collective term, covers more than 100 different kinds, organized in at least 3 different sublevels in SNOMED CT alone. The situation is the same for diabetes in ORDO and ICD-10, which have more than 45 rare types and more than 180 diabetes varieties respectively. A hierarchical representation that allows you to select a parent term together with all its children, grandchildren, etc. could be part of a solution to the problem.
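To make the idea of selecting a parent term with all its descendants a bit more concrete, here is a minimal Python sketch. It assumes the hierarchy is available as a simple parent-to-children mapping; the toy terms and the function name `select_subtree` are illustrative only and are not taken from SNOMED CT, ORDO, ICD-10 or DISQOVER.

```python
from collections import deque

# Hypothetical toy hierarchy: parent term -> direct child terms.
# Illustrative only; not an extract from any real classification.
CHILDREN = {
    "diabetes mellitus": ["type 1 diabetes", "type 2 diabetes", "monogenic diabetes"],
    "monogenic diabetes": ["neonatal diabetes", "MODY"],
    "MODY": ["MODY type 2", "MODY type 3"],
}


def select_subtree(root, children=CHILDREN):
    """Return the root term plus all of its descendants, breadth-first."""
    selected = []
    seen = set()
    queue = deque([root])
    while queue:
        term = queue.popleft()
        if term in seen:
            # Guard against terms that are reachable via more than one parent.
            continue
        seen.add(term)
        selected.append(term)
        queue.extend(children.get(term, []))
    return selected


if __name__ == "__main__":
    for term in select_subtree("monogenic diabetes"):
        print(term)
```

Selecting ‘monogenic diabetes’ here also picks up its children and grandchildren, so you never have to list every subtype by hand. Real classifications are far larger and often allow a term to have several parents, which is why the sketch keeps track of terms it has already seen.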

Searching traditionally

Assume you want to find out what the average age of onset is for all types of diabetes and link this with the known associated genes and variants. Why? Well, you want to investigate the potential of a predictive screening test for diabetes using a relevant gene panel. How do you start? First, capture all diabetes-related diseases by diving into a disease classification like the ones explained above. Orphanet is one of the online data sources that provide epidemiological information for the rare kinds of diabetes, so grab the data from there. For the more common forms of diabetes, you can complete the picture with the aid of the Genetics Home Reference (GHR). With this at hand, you proceed to get the genes and related variants known to be involved in diabetes. DisGeNET is a good example of a source that can help you make the link between genes and diseases. It also contains data about disease-variant associations. Stitching this all together won’t be easy, because these sources tend to use different disease classifications. In addition, you only get a snapshot at one point in time. After a while, your information is outdated unless you redo the complete process.
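The stitching step can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: every identifier, onset value and gene name below is a made-up placeholder, the two classifications are called ‘A’ and ‘B’ for simplicity, and the function `link_records` is hypothetical rather than part of any of the sources mentioned.

```python
# All identifiers, onset values and gene names are made-up placeholders.

# Epidemiology records keyed by IDs from a hypothetical classification "A".
ONSET_BY_CLASS_A = {
    "A:001": {"name": "diabetes subtype one", "onset": "childhood"},
    "A:002": {"name": "diabetes subtype two", "onset": "adult"},
}

# Gene associations keyed by IDs from a different, hypothetical classification "B".
GENES_BY_CLASS_B = {
    "B:900": ["GENE_X", "GENE_Y"],
    "B:901": ["GENE_Z"],
}

# Cross-reference table that says which A term corresponds to which B term.
A_TO_B = {"A:001": "B:900"}


def link_records(onset, genes, xref):
    """Join onset records with gene lists through a cross-reference table.

    Returns the linked records and the A identifiers that have no mapping.
    """
    linked, unmapped = [], []
    for a_id, record in onset.items():
        b_id = xref.get(a_id)
        if b_id is None:
            # No cross-reference available: this disease cannot be linked.
            unmapped.append(a_id)
            continue
        linked.append({
            "disease": record["name"],
            "onset": record["onset"],
            "genes": genes.get(b_id, []),
        })
    return linked, unmapped


if __name__ == "__main__":
    linked, unmapped = link_records(ONSET_BY_CLASS_A, GENES_BY_CLASS_B, A_TO_B)
    print(linked)
    print("No mapping for:", unmapped)
```

The list of unmapped identifiers is the interesting part: every disease without a cross-reference silently drops out of the analysis unless you track it, and the whole join has to be rerun whenever one of the sources is updated.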

Searching semantically

As explained in a previous blog post, our semantic search engine DISQOVER can help to link the different disease classifications. We aim to bring together and link disease-related data from different sources. Our goal is to create an integrated dataset that covers key aspects of such diseases and to display these in a disease-centric way. You could compare it with Wikipedia, but automated, striving to be more complete, and kept in sync with the data sources.

Disease-related population and genetics data come together in DISQOVER. You can browse through extensive classifications and select a subtree of disease terms, sometimes specified in more detail than foreseen. This operation captures the diseases in the subtree, and our user interface lets you navigate from there to all related data types like genes or variants.

In my next blog, I will explain how you can find cell lines, animal models and antibodies linked to your gene(s) of interest.

Get your free DISQOVER access today and start searching 130+ open databases.