Our workshop, ONTOFORCE at Knowledge For Growth 2015 – part II
Around 40 participants joined us for our ‘Big Data Workshop: Getting knowledge out of (Big) data’. Hans Constandt, CEO, and Filip Pattyn, Product Manager and Bioinformatician at ONTOFORCE, introduced and demonstrated DISQOVER – our platform to quickly discover insights out of vast amounts of data.
Data discovery is chiefly about next-generation business intelligence and analytics. But all too often, expert researchers depend on data specialists to disclose the relevant data resources. What DISQOVER does is close the gap between the data specialists and the experts. With our platform, users discover the knowledge themselves. ONTOFORCE’s mission is to let users derive their own insights, and DISQOVER is built around this.
The data avalanche
The huge and continually growing amount of data isn’t the problem; searching it is. All too often, you only get what you know already. If you don’t know where to search or what to search for, it becomes very difficult to discover the real, hidden value. We therefore need to change the modes and techniques of searching. We need new systems that can quickly integrate new data and that facilitate searching. This enables you to discover insights that you might not even have been looking for.
To do that, DISQOVER uses semantic search, which is still a fairly novel term. When asked who knows about semantic web technologies, only a few hands go up. Suppose you are going on a city trip to Paris and want to stay at the Hilton. A Google search on the combination ‘Paris’ + ‘Hilton’ generates 200+ million results, but you need to scroll through quite a few pages of celebrity gossip before you reach some results on the Hilton hotel in Paris. Semantic search separates different meanings, categorizes them and presents the results in a filtered and clustered manner, so that you can pinpoint the most relevant information more quickly and easily. The advantage of semantics is its extreme scalability: it can be applied to a virtually unlimited number of data sources. Relevant data is dynamically ‘glued’ together as Linked Data to lead to meaningful insights.
How does this apply to life sciences? DISQOVER is linked to a wide variety of open data sources. As the speakers in the morning already highlighted, progress comes from open collaboration and open innovation. When looking for, say, an active pharmaceutical ingredient, you may have to search the protein databases, the molecule DBs, the gene DBs, the disease DBs… Today, too many people still have to consult each of these DBs separately, after which different files need to be collated. A typical search use-case such as the one described above can easily take 50+ hours of work and 4 weeks of throughput time to deliver the desired results. And much knowledge – such as search queries – gets lost along the way. What DISQOVER does is glue all these sites and DBs together. On top of that, there’s an intuitive and simple interface to visualize the aggregated data. All very scalable. The time this takes? One hour. The insights are at your fingertips.
The proof of the pudding is in the eating
Now, imagine you’re searching for companies that produce antimetabolite drugs involved in phase II clinical trials where cancer patients can still enroll and where EGFR is mentioned. In addition, you also want to know who the top authors publishing about these drugs are.
Researching this the old way would mean…
- You have to search in different databases,
- Digest all the information yourself,
- Stitch it all together in a workable format.
The trouble: it’s a hell of a lot of work, while the results remain a snapshot at a fixed moment in time. Any updates or changes in any of the consulted databases are out of sight. So if you want to update your information, you have to run the process all over again. Not the most efficient or effective way of searching.
What DISQOVER does is link all the information while keeping track of the search patterns and logic. Often, when we do a classic web search, we start at one place but ultimately end up somewhere else with a lot of loose ends. Retracing your steps becomes extremely difficult. Yet the search logic itself – the reasoning during your search – is valuable and can be modified during a subsequent search.
Time for the rubber to hit the road.
Filip starts up DISQOVER to try the above search example.
On the start page, there’s no extensive overview of all possible functionalities.
Instead, a simple keyword search field is the starting-point.
- Filip kicks off the query with the keyword ‘EGFR’. A total of 26 databases are being searched and, almost immediately, the DISQOVER platform shows the different results found across a wide variety of data types.
- Each of these data types is a first filter: clicking on ‘clinical trials’, for instance, leads you to the clinical trial results.
- Filip then filters further on phase II studies. With every new filter setting, all other filters are instantaneously updated. This prevents filtering combinations from producing no results.
- The next filter is the status ‘still recruiting’. A global map then indicates where these clinical trials are happening.
- As there are still too many results, one more filter is added, looking for everything related to lung cancer.
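The way each filter choice updates the remaining filters can be illustrated with a minimal sketch. This is not DISQOVER's implementation; the records, field names and function names below are invented for illustration. The idea is only that, after every selection, the values offered for the other filters are recomputed from the records that still match, so a value with zero hits is never offered.

```python
from collections import Counter

# Hypothetical trial records; ids, fields and values are made up for this sketch.
TRIALS = [
    {"id": "T1", "phase": "II", "status": "recruiting", "condition": "lung cancer"},
    {"id": "T2", "phase": "II", "status": "completed", "condition": "lung cancer"},
    {"id": "T3", "phase": "III", "status": "recruiting", "condition": "breast cancer"},
    {"id": "T4", "phase": "II", "status": "recruiting", "condition": "colorectal cancer"},
]


def apply_filters(records, selected):
    """Keep only the records that match every selected filter value."""
    return [r for r in records if all(r[f] == v for f, v in selected.items())]


def available_facets(records, selected, facets=("phase", "status", "condition")):
    """For each filter not yet set, count the values present in the remaining records."""
    remaining = apply_filters(records, selected)
    return {f: Counter(r[f] for r in remaining) for f in facets if f not in selected}


# After choosing phase II and 'recruiting', only conditions that still have hits are offered.
selected = {"phase": "II", "status": "recruiting"}
facets = available_facets(TRIALS, selected)
```

In this toy data set, ‘breast cancer’ no longer appears among the condition values once phase II is selected, because no matching record remains for it.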
Filip meanwhile explains to the audience that the data is being visualized instantly. Although DISQOVER is not a tool for creating dashboards, it automatically includes these along with the actual search results, each containing references to the original sources.
Filip also points out that ONTOFORCE is not a data owner but a data broker. The DISQOVER platform queries open databases. ONTOFORCE continually discusses the merits of unlocking additional databases with new partners. Within DISQOVER, additional databases – be they open, licensed or private – can be added.
- In the meantime, Filip follows the link from clinical trials to drugs. These are the drugs related to the resulting clinical trials. A new series of filters becomes available. What is cool about the tool is that it automatically adjusts the available search possibilities based on the available criteria, like gentle nudges offering you opportunities to discover new places where you might not have searched before.
- Filip includes ‘antimetabolites’ as an ATC classification filter.
- The filter that shows all manufacturers contains the first portion of information requested.
An extra neat feature is that you don’t have to visit different websites: DISQOVER already compiles different pieces of content. To make sure the origin of each section is clear, the data can be highlighted with a different color code per individual source.
- Filip now switches his attention to retrieving the key authors and publications:
- He follows the link from the resulting drugs to publications.
- To make the result more refined, an additional filter is added, looking only for publications of the last 5 years.
- Automatically, the system has semantically identified the top publishing authors; one of them is in fact someone from KU Leuven.
All in all, this search took around 15 minutes (including the explanation to the audience). What is extra helpful is that the search pattern is saved for later use. DISQOVER automatically keeps track of the search pattern as a sort of timeline. Clicking on a step in the search path automatically reproduces the search result for that step. Each adjustment to a previous search step is also stored as a new branch. As such, the thinking path or pattern is stored and you (or any of your colleagues) can reproduce the search to see what was updated since the initial search was made. Just as easily, you can name your search, save it, share it with others, ‘rebranch’ the data, collaborate with peers on search patterns and extend insights accordingly.
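One way to picture a search path with branches is as a small tree in which every step stores the filters accumulated so far. The sketch below is only an assumption about how such a structure could look, not a description of how DISQOVER stores searches; the class `SearchStep`, the function `replay` and the sample records are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class SearchStep:
    """One step in a search path; it holds all filters accumulated up to this point."""
    label: str
    filters: dict
    parent: Optional["SearchStep"] = field(default=None, repr=False)
    children: list = field(default_factory=list, repr=False)

    def refine(self, label, **new_filters):
        """Add a follow-up step. A second follow-up on the same step forms a branch."""
        step = SearchStep(label, {**self.filters, **new_filters}, parent=self)
        self.children.append(step)
        return step

    def path(self):
        """Return the step labels from the first step down to this one."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return labels[::-1]


def replay(step, records):
    """Re-run a stored step against the current records to see what has changed."""
    return [r for r in records if all(r.get(k) == v for k, v in step.filters.items())]


# Hypothetical records and a path resembling the demo, with one extra branch.
records = [
    {"id": "T1", "keyword": "EGFR", "phase": "II", "condition": "lung cancer"},
    {"id": "T2", "keyword": "EGFR", "phase": "II", "condition": "breast cancer"},
    {"id": "T3", "keyword": "EGFR", "phase": "III", "condition": "lung cancer"},
]
root = SearchStep("keyword: EGFR", {"keyword": "EGFR"})
phase2 = root.refine("phase II", phase="II")
lung = phase2.refine("lung cancer", condition="lung cancer")
breast = phase2.refine("breast cancer", condition="breast cancer")  # a new branch
```

Because each step keeps its own filters, replaying `lung` later against refreshed records returns whatever matches at that moment, which mirrors the idea of checking what was updated since the initial search.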
Someone in the audience asks: “what exactly do you mean by ‘links’?”
A link is simply a piece of content available somewhere in the vast number of databases that connects two or more data concepts. An example is the connection between a drug and the clinical trial in which the drug is tested. What is cool about DISQOVER is that it also includes ‘mentions’: the number of times a content item has been mentioned per data type. This is something unique due to the vast number of data sources covered in DISQOVER.
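As a rough illustration of this answer, links can be thought of as (subject, relation, object) statements in the spirit of Linked Data, and ‘mentions’ as a count of linked items grouped by data type. The identifiers, relation names and functions below are invented for this sketch, and the way mentions are counted here is an assumption rather than DISQOVER's actual definition.

```python
from collections import Counter

# Hypothetical links as (subject, relation, object); all identifiers are made up.
LINKS = [
    ("drug:D1", "tested_in", "trial:T1"),
    ("drug:D1", "tested_in", "trial:T2"),
    ("drug:D1", "discussed_in", "publication:P1"),
    ("drug:D2", "tested_in", "trial:T1"),
]


def linked(item, links):
    """Return the items directly connected to `item`, in either direction."""
    connected = []
    for subject, _relation, obj in links:
        if subject == item:
            connected.append(obj)
        elif obj == item:
            connected.append(subject)
    return connected


def mentions_per_type(item, links):
    """Count linked items per data type, taken from the prefix before the colon."""
    return Counter(other.split(":", 1)[0] for other in linked(item, links))


drug_mentions = mentions_per_type("drug:D1", LINKS)
trial_drugs = linked("trial:T1", LINKS)
```

Following a link from a trial to its drugs, as in the demo, corresponds here to calling `linked` on the trial identifier.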
Combining resources and integrating data
The value of semantic search increases with every additional database that’s added. ONTOFORCE continues to partner with as many different data providers as possible, requesting them to open up their data. ONTOFORCE is working with, amongst others, Harvard Catalyst Group in Cambridge, US. They developed the eagle-i network, which aims to open up biomedical scientific research. The consortium already includes 41 universities and research institutions in the US, and ONTOFORCE is helping to bring the platform to Europe. The case of the BCCM/LMBP Plasmid Collection, hosted at Ghent University (UGent) and part of the Belgian Coordinated Collection of Microorganisms (BCCM), further clarifies how additional data sources can be disclosed. This organization already has a searchable catalogue, but it is isolated and not linked to publications. ONTOFORCE assisted in integrating the plasmid catalogue into UGent’s eagle-i platform, where it can now be searched semantically, while a referral to the original page remains in place.
Another good example of the value of semantics is The Antibody Registry. Because there is no standard nomenclature for defining an antibody, antibody descriptions are notoriously heterogeneous. For researchers it’s extremely hard to know which antibody is the correct one. Through the application of the eagle-i ontology*, relationships between different antibody descriptions and other data types can be established, relevant filters can be defined and synonyms are identified more quickly, making antibody data retrieval much easier.
ONTOFORCE aims to continually simplify its DISQOVER user interface (UI), so that it becomes straightforward for anyone to use. DISQOVER can become the new standardized way to smartly research dozens of different data sources. You don’t need IT people or data scientists to catalogue all new information. It’s about bridging the data gap between research, academia and industry.
In our next blog, we will cover the Q&A that followed our workshop.
* In computer science and information science, an ontology is a formal naming and definition of the types, properties and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse. An ontology compartmentalizes the variables needed for some set of computations and establishes the relationships between them.