Up to 3 times faster indexing
Companies that host their own setup of DISQOVER know that indexing is a crucial step in making data available. In the past few months, we have optimized this process. Indexing now runs up to 3 times faster, and the logging has become more readable and more informative.
Behind the scenes
During the indexing process, we transfer the triples from a triple store to an indexed database. This allows DISQOVER to show the data in near-real time. Indexing consists of two separate steps:
- Dumping the triples from the triple store to disk,
- Loading the dumped data into the index.
We have now made critical changes in both steps.
In the dumping step, we now query the triple store from multiple independent threads. This makes dumping up to 3 times faster. The speed you can reach depends on system parameters such as memory, number of cores and disk speed. A simple metric in the log helps every user get the most out of their system.
Dump time measured for different numbers of threads
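To illustrate the idea of a multi-threaded dump with a throughput metric, here is a minimal sketch in Python. It is not the DISQOVER implementation: the in-memory `TRIPLES` list, the `fetch_partition` helper and the file naming are hypothetical stand-ins for a real triple store and its query interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical in-memory stand-in for a triple store.
# A real store would be queried over the network instead.
TRIPLES = [(f"s{i}", "p", f"o{i}") for i in range(1000)]


def fetch_partition(part, n_parts):
    """Hypothetical query: return the triples that belong to one partition."""
    return TRIPLES[part::n_parts]


def dump_partition(part, n_parts, out_dir):
    """Write one partition to its own file and return the number of triples written."""
    triples = fetch_partition(part, n_parts)
    path = Path(out_dir) / f"dump_{part}.tsv"
    with path.open("w", encoding="utf-8") as fh:
        for s, p, o in triples:
            fh.write(f"{s}\t{p}\t{o}\n")
    return len(triples)


def dump_all(out_dir, n_threads):
    """Dump all partitions in parallel; return (total triples, triples per second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        counts = list(
            pool.map(lambda part: dump_partition(part, n_threads, out_dir),
                     range(n_threads))
        )
    elapsed = time.perf_counter() - start
    total = sum(counts)
    # The rate plays the role of the "simple metric in the log".
    rate = total / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Calling `dump_all` with a few different values of `n_threads` and comparing the reported rates is one way to find a thread count that suits a given machine.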
We have also improved the loading-to-Solr step in two ways. First, we parse, prepare and send the dumped data from multiple threads. Second, we store a hash value of every document and only send a document to Solr when it has changed. This not only saves bandwidth and CPU cycles, but also means that Solr does not have to be emptied during the indexing process. The index has no downtime, and we avoid working with an extra Solr core that has to be swapped in!
Load time measured for different numbers of threads
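The hash-based change detection can be sketched as follows. This is a minimal Python illustration under assumptions, not the actual DISQOVER code: the choice of SHA-256, the JSON serialization, the `id` field and the in-memory `stored_hashes` dictionary are all hypothetical.

```python
import hashlib
import json


def doc_hash(doc):
    """Return a stable SHA-256 digest of a document (keys sorted for determinism)."""
    payload = json.dumps(doc, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def select_changed(docs, stored_hashes):
    """Return the documents whose hash differs from the stored one.

    stored_hashes maps a document id to its last known digest and is updated
    in place. A real loader would probably update it only after the document
    has been sent successfully.
    """
    changed = []
    for doc in docs:
        digest = doc_hash(doc)
        if stored_hashes.get(doc["id"]) != digest:
            changed.append(doc)
            stored_hashes[doc["id"]] = digest
    return changed
```

On a first run every document counts as changed; on later runs only new or modified documents are returned, so only those would need to be sent to the index.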
By combining these efforts, we have reduced the total indexing time for our own production environment from more than 40 hours to 15 hours, which is more than 2.5 times faster. This means that the waiting time for indexing is substantially reduced and updated data will appear faster in your DISQOVER.
Within the next few weeks, we will release version 2.05, which contains this huge improvement. This will enable all our standalone customers to experience the benefits on their own systems. For all you data scientists out there: all the technical ‘nuts and bolts’ are clearly explained in the manual.