FAIR data management and DISQOVERability
I recently spoke at the iRODS User Group Meeting 2018 (June 5-7, 2018, Durham, NC, USA) on the FAIR principles and how our research community is using the semantic platform DISQOVER in our DataHub infrastructure. Here’s the story from that session, explaining how we link on-premise clinical data with other sources to gain more, better and faster insights.
This is a guest blog post by Maarten Coonen, Data Architect @ DataHub, Maastricht UMC+ and Maastricht University
At DataHub Maastricht, we are providing data management services to research groups in both the Maastricht University Medical Center and the life sciences faculty of Maastricht University. Our role is that of a data broker who enables the reuse of data by researchers in the hospital, the university and beyond. Our solution is currently (June 2018) serving a research community of approximately 170 users and managing 48 TiB of data.
To make the data available to all stakeholders in the best possible way, we work according to the FAIR principles (Findable, Accessible, Interoperable, Reusable). With regard to our DataHub implementation, this entails a series of actions:
- Each data set registered and stored in iRODS is given a unique and persistent identifier (PID) [FINDABLE]
- Metadata is structured and enriched with knowledge from ontologies using EBI’s Ontology Lookup Service (OLS) [FINDABLE + INTEROPERABLE + REUSABLE]
- The metadata is registered in iRODS and indexed in DISQOVER [FINDABLE]
- Data sets can be retrieved by their PID and metadata via an HTTP landing page. Metadata stay accessible, even when the data have been deleted [ACCESSIBLE]
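The last action in the list above can be illustrated with a small sketch. This is not DataHub's actual implementation: the `Registry` class, its method names, the example PID and the iRODS path are all hypothetical, and the landing page is reduced to a plain dictionary. The sketch only shows the idea that deleting the data leaves the PID and its metadata resolvable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataSetRecord:
    """Hypothetical registry entry: a PID, its metadata and the data location."""
    pid: str
    metadata: dict
    data_path: Optional[str] = None  # None once the data have been deleted


class Registry:
    """Hypothetical in-memory stand-in for a PID and metadata registry."""

    def __init__(self):
        self._records = {}

    def register(self, pid, metadata, data_path):
        # A PID must be unique, so registering it twice is an error.
        if pid in self._records:
            raise ValueError(f"PID already registered: {pid}")
        self._records[pid] = DataSetRecord(pid, dict(metadata), data_path)

    def delete_data(self, pid):
        # Only the data location is cleared; the record itself is kept.
        self._records[pid].data_path = None

    def landing_page(self, pid):
        # The content a landing page could present for this PID.
        record = self._records[pid]
        return {
            "pid": record.pid,
            "metadata": record.metadata,
            "data_available": record.data_path is not None,
        }


if __name__ == "__main__":
    registry = Registry()
    pid = "21.T12345/demo-0001"  # made-up handle, for illustration only
    registry.register(pid, {"title": "Demo data set"}, "/zone/home/demo")
    registry.delete_data(pid)
    print(registry.landing_page(pid))
```

Keeping the record and clearing only the data location is one simple way to model the "metadata stay accessible" requirement; a real system would also record when and why the data were removed.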
“It’s key that data sets are both human and machine-readable.”
A linked data cloud
Performing these actions enables our data sets to be part of a massive decentralized linked data cloud. The DISQOVER technology is used to traverse this cloud, in our case comprising:
- Research project data in iRODS
- Multiple on-premise research databases
- Electronic Medical Records databases
- Over 130 public data sources
The data from public sources, in fact, comes with DISQOVER: the semantic platform that we use to search through all the data. Coupling in-house data with public data sources via DISQOVER’s data federation is crucial here, as it greatly extends our view on the data. With DISQOVER, it becomes possible to aggregate results from public and private sources in a single search, where they would otherwise have to be collected or searched separately. This makes end users more efficient and brings semantic searching to a wide research community. Key to this is the intuitive, user-friendly interface that DISQOVER provides.
How it all works
The data process workflow consists of five major steps:
- Data and metadata, captured in various source systems, are initially centralized and managed in iRODS.
- All metadata passes through a staging environment and is semantified in a series of Extract, Transform, Load (ETL) steps.
- Semantified metadata (RDF) is loaded into DISQOVER.
- End-users use the DISQOVER front-end to search for the most interesting insights and data sets.
- Via a persistent identifier linkout (ePIC handle.net system) to a landing page, the actual data set can be downloaded via the iRODS cloud browser or a WebDAV connection.
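To make the "semantify" step more concrete, here is a minimal sketch of turning a flat metadata record into RDF, serialized as N-Triples. It is an illustration under assumptions, not our actual ETL code: the function names, the example handle and the example ontology IRI are hypothetical, and the choice of Dublin Core Terms as the predicate vocabulary is an assumption made for this example. The subject IRI uses the public handle.net resolver prefix, so the same identifier that names the data set in RDF also links out to its landing page.

```python
HANDLE_RESOLVER = "https://hdl.handle.net/"
DCTERMS = "http://purl.org/dc/terms/"  # assumed vocabulary for this sketch


def escape_literal(value):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def semantify(pid, metadata):
    """Return one N-Triples line per metadata field, sorted by field name.

    Values that look like HTTP(S) IRIs (for example ontology terms) become
    IRI objects; everything else becomes a plain string literal.
    """
    subject = f"<{HANDLE_RESOLVER}{pid}>"
    triples = []
    for key in sorted(metadata):
        value = metadata[key]
        predicate = f"<{DCTERMS}{key}>"
        if isinstance(value, str) and value.startswith(("http://", "https://")):
            obj = f"<{value}>"
        else:
            obj = f'"{escape_literal(str(value))}"'
        triples.append(f"{subject} {predicate} {obj} .")
    return triples


if __name__ == "__main__":
    record = {
        "title": "Demo data set",
        # Placeholder IRI standing in for an ontology term found via OLS.
        "subject": "http://example.org/ontology/Term_0001",
    }
    for line in semantify("21.T12345/demo-0001", record):  # made-up handle
        print(line)
```

Pointing at ontology terms by IRI rather than by free-text label is what lets a semantic platform link a data set to other sources that use the same term.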
Expanding our reach
Systems like DISQOVER generate their greatest impact through reach and accessibility. Within Maastricht UMC+, the following groups are already actively using DISQOVER:
- Heart and Vascular Center (Hart en Vaat Centrum)
- The Maastricht Study
- Maastricht Multimodal Molecular Imaging Institute (M4i)
- Institute of Data Science (IDS)
And the list continues to grow. All Maastricht University and Maastricht University Medical Center staff can simply access our local DISQOVER instance online. After logging in with their institutional account through SURFconext, they can start exploring the public and in-house data sets. Of course, access to some data is limited by the user’s authorization level.
Want to know more about how we use DISQOVER within DataHub? Do reach out to me.
DataHub is a cross-organisational initiative within Maastricht UMC+ to help researchers from both the academic hospital and the university. We provide an institutional repository for research data that is more than just a data archive. We continuously improve our services in order to provide added value to researchers who want to do more with their data. https://datahub.mumc.maastrichtuniversity.nl/
The FAIR principles (Findable, Accessible, Interoperable, Reusable) are a set of 15 principles that form a guideline for proper research data management and data stewardship. Originating from a Netherlands-based workshop in 2014, these principles have since gained growing interest from researchers, publishers, funding bodies and government agencies worldwide. A key aspect of the FAIR principles is to make human- and machine-readable representations of data sets in order to achieve semantic interoperability.
iRODS stands for ‘Integrated Rule-Oriented Data System’. It is open source data management software that links unstructured data to metadata and is used for distributed storage and data management automation.
A persistent identifier (PI or PID) is a long-lasting reference to a document, file, web page, or other object.