Data Ingestion Engine
DISQOVER is equipped with a powerful and ultra-fast data ingestion engine. As a data scientist, you can import, manipulate, link and integrate your own data into DISQOVER, using an innovative visual pipeline environment.
Building the indexed knowledge graph
DISQOVER stores data in an indexed knowledge graph and imports the source data via a configurable data ingestion process. During this process, you can standardize, integrate and link data from a variety of siloed sources.
Visual pipeline building
When configuring the integration of your source data in DISQOVER, you can manage the data-ingestion process by building a visual pipeline from a wide range of powerful, reusable components. There is no need to write extensive code, which means that, compared to a conventional approach relying on RDF and SPARQL, fewer specialized skills are needed and development time is reduced, while retaining the same level of power and flexibility.
A visual pipeline makes it easier to communicate the choices made during data integration, resulting in increased transparency and auditability, and a lower chance of error. Stakeholders with only basic IT knowledge can understand, review, challenge and contribute to the data integration process.
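To illustrate the idea of composing reusable components instead of writing integration code, here is a minimal sketch in Python. The component names (Standardize, Pipeline) and the row-based interface are illustrative assumptions, not DISQOVER's actual API:

```python
# Illustrative sketch: a pipeline built from reusable components.
# Each component transforms a stream of rows; the Pipeline chains them.
class Component:
    """A reusable pipeline step: takes rows in, yields rows out."""
    def run(self, rows):
        raise NotImplementedError

class Standardize(Component):
    """Apply a standardization function to one field of every row."""
    def __init__(self, field, fn):
        self.field, self.fn = field, fn
    def run(self, rows):
        for row in rows:
            row = dict(row)  # copy so source rows stay untouched
            row[self.field] = self.fn(row[self.field])
            yield row

class Pipeline:
    """Chain components so each consumes the previous one's output."""
    def __init__(self, *components):
        self.components = components
    def run(self, rows):
        for component in self.components:
            rows = component.run(rows)
        return list(rows)

# Example: standardize author names coming from two siloed sources.
rows = [{"author": "  Smith, J. "}, {"author": "DOE, A."}]
pipe = Pipeline(Standardize("author", lambda s: s.strip().title()))
print(pipe.run(rows))
# [{'author': 'Smith, J.'}, {'author': 'Doe, A.'}]
```

Because every step is an explicit, named component rather than ad-hoc code, the sequence of transformations stays visible and reviewable, which is the property the visual pipeline provides.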
Ultra-fast, scalable and efficient
DISQOVER’s data ingestion engine uses a unique proprietary technology to efficiently process extensively linked big data, relying on a partially denormalized triple store with column-oriented storage. The engine is designed for ultra-fast bulk-linking and inferencing. Each action is executed as a sequence of full table scans, leveraging fast block-sequential I/O and temporary in-memory indexes. As a result, on equivalent hardware, DISQOVER can integrate and link data into a semantic knowledge graph much faster than conventional technology, such as relational databases or triple stores.
Example performance comparison: linking 18 million authors to 28 million publications, with 280 million author/publication links (hardware: Intel® Core™ i9, 6 cores, 32 GB RAM).
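The scan-plus-index pattern described above can be sketched as follows. This is a simplified illustration under assumed data shapes, not DISQOVER's implementation: one full scan of the first table builds a temporary in-memory index on the link key, and a second sequential scan of the other table probes it, avoiding per-row random lookups:

```python
# Illustrative sketch: bulk-linking as two full table scans plus a
# temporary in-memory index. Column-oriented storage is mimicked with
# one Python list per column.
from collections import defaultdict

# Tiny example tables (columnar layout: dict of column-name -> values).
authors = {"id": [1, 2], "name": ["Doe", "Smith"]}
pubs = {"title": ["Paper A", "Paper B", "Paper C"],
        "author_id": [2, 1, 2]}

def bulk_link(left, left_key, right, right_key):
    # Scan 1: full scan of the left key column, building a temporary
    # in-memory index from key value to row positions.
    index = defaultdict(list)
    for pos, key in enumerate(left[left_key]):
        index[key].append(pos)
    # Scan 2: stream the right table sequentially, probing the index.
    links = []
    for pos, key in enumerate(right[right_key]):
        for left_pos in index[key]:
            links.append((left_pos, pos))
    return links  # pairs of (left row position, right row position)

links = bulk_link(authors, "id", pubs, "author_id")
print([(authors["name"][a], pubs["title"][p]) for a, p in links])
# [('Smith', 'Paper A'), ('Doe', 'Paper B'), ('Smith', 'Paper C')]
```

Both passes read each column front to back, which is what makes block-sequential I/O possible at scale; only the compact key index needs to live in memory.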
Bi-directional lineage analysis
The data ingestion engine is capable of tracking data dependencies throughout the entire pipeline. Thanks to this, you can see which source data field(s) contributed to every information field in the DISQOVER database. Conversely, you can also see every information field in DISQOVER to which a given source data field contributes.
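Conceptually, bi-directional lineage amounts to recording each dependency once and indexing it in both directions. The sketch below is a hypothetical illustration; the field names are invented for the example:

```python
# Illustrative sketch of bi-directional lineage: every recorded
# dependency is indexed both source -> derived and derived -> source.
from collections import defaultdict

class Lineage:
    def __init__(self):
        self.downstream = defaultdict(set)  # source field -> derived fields
        self.upstream = defaultdict(set)    # derived field -> source fields

    def record(self, derived, sources):
        for src in sources:
            self.downstream[src].add(derived)
            self.upstream[derived].add(src)

lineage = Lineage()
# Hypothetical pipeline steps registering their dependencies:
lineage.record("Publication.display_name", ["pubmed.title"])
lineage.record("Publication.search_text", ["pubmed.title", "pubmed.abstract"])

# Backward: which source fields fed this DISQOVER field?
print(sorted(lineage.upstream["Publication.search_text"]))
# ['pubmed.abstract', 'pubmed.title']

# Forward: which DISQOVER fields does this source field contribute to?
print(sorted(lineage.downstream["pubmed.title"]))
# ['Publication.display_name', 'Publication.search_text']
```

Answering either lineage question is then a direct lookup, with no need to re-analyze the pipeline.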
Read more about the other technologies of DISQOVER. Next, we will talk about integrating public data via federation.