Query Expansion technical note
When launching a text query in DISQOVER, the aim is to find as many correct hits as possible (high recall and precision). Expanding the search query with known synonyms of the entered search term is a powerful way to achieve this. Suppose the user wants to find all information about the concept ‘Aspirin’. Some documents may use the actual term ‘Aspirin’, but others may only use the more scientific term ‘Acetylsalicylic acid’. DISQOVER excels at integrating multiple data sources, and for the public endpoint it has access to a multitude of synonyms from different data sources, even for a single concept.
However, not all of these synonyms have the same prevalence or relevance. Some might even cause ambiguity with other concepts. Therefore, simply adding all synonyms will not necessarily yield a better result, because precision is reduced. Moreover, extending a search with a very large number of synonyms may reduce query performance. To address these challenges, DISQOVER contains an algorithm that uses heuristics to score synonyms. It aims to avoid the loss of precision caused by adding ambiguous synonyms, while still retaining sufficient recall by including the most relevant synonyms. In this technical note, we outline the basic principles of this algorithm. Since no heuristic is perfect, DISQOVER presents the results of this algorithm in the user interface only as a suggestion. At any time, the user can choose to view the complete set of synonyms retrieved from the data and override the proposed selection by cherry-picking, or bypass it completely.
When searching for a term, multiple semantic hits often turn up. For example, ‘Lung Cancer’ returns 5 semantic hits. One of them is the journal with the title ‘Lung Cancer’, while the other 4 all have the canonical type ‘diseases’. It makes sense to merge the synonyms of the 4 diseases, but adding the ISBN number of the journal to the list of disease synonyms will not give meaningful results. Hence, it is desirable to merge instances into broader semantic concepts when looking for synonyms, but some caution is needed. To avoid merging unrelated synonyms, the following conditions must be met:
- Instances must share a canonical type. For example, it does not make sense to merge the synonyms of the DOG gene and the dog organism. Therefore, instances are only merged if they have at least one canonical type in common. Note that some instances have multiple canonical types: Aspirin is both a Chemical and an Active Substance.
- Instances must share a minimum number of synonyms. For example, the search term ‘ALS’ yields 2 semantic hits in the ‘chemical’ canonical type. By definition, they share one synonym (‘ALS’). Nevertheless, it is clear that Antilymphocyte Serum and Ammonium lauryl sulfate are completely unrelated chemicals that just happen to share an abbreviation. This second rule prevents the merging of these instances into the same semantic concept. The minimum number of shared synonyms is set to 3, as this gave the best results during calibration tests.
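The two merge conditions above can be illustrated with a minimal sketch. This is not the DISQOVER implementation: the `Instance` record, the `can_merge` function and the case-insensitive comparison of synonyms are assumptions made for illustration, and the synonym sets in the example are made up. Only the threshold of 3 shared synonyms comes from the text. The check is pairwise; how pairs are combined into larger concepts is not covered here.

```python
from dataclasses import dataclass

MIN_SHARED_SYNONYMS = 3  # threshold quoted in the text


@dataclass
class Instance:
    """Hypothetical record for one semantic hit."""
    label: str
    canonical_types: set[str]
    synonyms: set[str]


def normalize(term: str) -> str:
    # Assumption: synonyms are compared case-insensitively.
    return term.strip().casefold()


def can_merge(a: Instance, b: Instance,
              min_shared: int = MIN_SHARED_SYNONYMS) -> bool:
    """Return True if both merge conditions hold for the pair (a, b)."""
    # Condition 1: at least one canonical type in common.
    if not (a.canonical_types & b.canonical_types):
        return False
    # Condition 2: at least `min_shared` synonyms in common.
    shared = ({normalize(s) for s in a.synonyms}
              & {normalize(s) for s in b.synonyms})
    return len(shared) >= min_shared


# Illustrative data only: the two 'ALS' chemicals share a single synonym.
serum = Instance("Antilymphocyte Serum", {"chemical"},
                 {"ALS", "Antilymphocyte Serum"})
sulfate = Instance("Ammonium lauryl sulfate", {"chemical"},
                   {"ALS", "Ammonium lauryl sulfate"})
als_mergeable = can_merge(serum, sulfate)  # False: only 'ALS' is shared
```

With these inputs the pair passes the canonical type check but fails the shared synonym check, so the two chemicals stay separate semantic concepts.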
Example view of the query expansion dropdown, with three different semantic concepts reported.
For each semantic concept reported by the query expansion, the user can select all synonym terms by clicking the switch box on the left. Alternatively, individual terms can be selected or unselected by clicking on the corresponding tag.
Often, certain synonyms are an extension of other, simpler synonyms. In the case of Aspirin, we encounter synonyms like Aspirin lysine, Aspirin sodium, Aspirin calcium and Aspirin potassium. However, submitting these will not add extra search hits: they will already be found by submitting Aspirin. We call these synonyms shadowed synonyms. Showing them in the application might seriously clutter the overview of the synonyms, and therefore they are hidden by default in the user interface.
In the example of Aspirin, the word Aspirin has a little arrow next to it (red arrow) to indicate that it hides other synonyms:
Clicking it expands to the full list:
Clicking on a tag in the expanded set adds the synonym to the search list. Note that a close arrow appears at the end of the expanded list of shadowed synonyms (red arrow), allowing the user to close the expanded set.
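The idea of shadowed synonyms can be sketched as follows. The text does not specify how DISQOVER decides that one synonym extends another, so this sketch assumes a simple rule: a synonym is shadowed when the words of a shorter synonym appear in it as a contiguous, case-insensitive word sequence. The function names are hypothetical.

```python
def tokens(term: str) -> list[str]:
    # Assumption: case-insensitive, whitespace-separated words.
    return term.casefold().split()


def contains_phrase(longer: list[str], shorter: list[str]) -> bool:
    """True if `shorter` occurs in `longer` as a contiguous word sequence."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def group_shadowed(synonyms: list[str]) -> dict[str, list[str]]:
    """Map each visible synonym to the list of synonyms it shadows."""
    terms = [s for s in synonyms if s.strip()]  # ignore empty entries
    tokenized = {s: tokens(s) for s in terms}
    visible: dict[str, list[str]] = {}
    # Shortest terms first, so simple terms become visible before
    # their extensions are examined.
    for s in sorted(terms, key=lambda t: (len(tokenized[t]), t)):
        parent = next((v for v in visible
                       if contains_phrase(tokenized[s], tokenized[v])), None)
        if parent is None:
            visible[s] = []
        else:
            visible[parent].append(s)
    return visible


groups = group_shadowed(
    ["Aspirin lysine", "Aspirin", "Acetylsalicylic acid", "Aspirin sodium"]
)
# 'Aspirin' shadows 'Aspirin lysine' and 'Aspirin sodium';
# 'Acetylsalicylic acid' stays visible and shadows nothing.
```

In a user interface like the one described, the keys of `groups` would be the tags shown by default, and each non-empty list would sit behind the expand arrow.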
To further optimize the synonym search, DISQOVER leverages its ability to bring together many data sources. The record for Aspirin, for example, collects data from no fewer than 13 different databases (HMDB, DrugCentral, DrugBank, HSDB, UNII, ChEMBL, UMLS, ChEBI, IUPHAR Compendium, SureChEMBL, RxNorm, MeSH, PubChem), of which 8 contribute to the synonyms. In total, the public databases give 75 distinct synonyms. Not all of these synonyms have the same relevance, and some are quite exotic. By requiring a synonym to be present in at least two data sources (when multiple data sources are available), DISQOVER is able to separate the relevant from the less relevant synonyms. We call the retained synonyms the core set. Evaluating this method on the 500 most used search terms, we found that, on average, the filter retains only 16% of the synonyms in the core set, yet results in only 1.7% fewer search hits. One reason the reduction in recall is so low is that when an instance text contains a synonym removed by the filter, that text most often also contains one of the more commonly used synonyms in the core set. In the application user interface, by unchecking the “Core set” checkbox, the user can inspect the synonyms found to be less relevant and, if desired, override the default choice by cherry-picking individual synonyms or by selecting everything with the switch box for that semantic concept.
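The core set rule can be sketched as a count of contributing data sources per synonym. The function name `core_set`, the input layout (a mapping from source name to its synonyms), the case-insensitive matching and the placeholder source names are assumptions for illustration. The "at least two data sources" threshold and the fallback when only one source is available follow the text.

```python
from collections import defaultdict


def core_set(synonyms_by_source: dict[str, set[str]],
             min_sources: int = 2) -> set[str]:
    """Return the (case-folded) synonyms found in at least `min_sources` sources."""
    sources_per_term: dict[str, set[str]] = defaultdict(set)
    for source, synonyms in synonyms_by_source.items():
        for synonym in synonyms:
            # Assumption: spelling variants that differ only in case
            # count as the same synonym.
            sources_per_term[synonym.casefold()].add(source)

    # If fewer than `min_sources` sources contribute synonyms at all,
    # the filter cannot be applied, so every synonym is kept.
    contributing = sum(1 for syns in synonyms_by_source.values() if syns)
    if contributing < min_sources:
        return set(sources_per_term)

    return {term for term, sources in sources_per_term.items()
            if len(sources) >= min_sources}


# Illustrative data only; 'source_a' and 'source_b' are placeholders.
example = {
    "source_a": {"Aspirin", "Acetylsalicylic acid"},
    "source_b": {"aspirin", "Acetylsalicylic acid", "Exotic name"},
}
core = core_set(example)  # 'exotic name' is dropped: only one source has it
```

Synonyms outside the returned set correspond to the ones that become visible when the “Core set” checkbox is unchecked.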
The addition of some synonyms may significantly degrade the precision of the search. For example, one of the synonyms of Aspirin is ASA. However, this term can also be an abbreviation of anti-sarcolemmal autoantibodies Mus musculus gene, of the disease Argininosuccinic aciduria, and much more. Therefore, adding ASA as a synonym is not desirable: it does not increase the search recall (since it is very unlikely to be used as the only term to refer to Aspirin), but it drastically reduces the search precision. To address this problem, DISQOVER flags such synonyms as ambiguous with an exclamation triangle. It does this for synonyms that either:
- Are too short: 2 characters or 5 digits.
- Have semantic hits in other canonical types. For example, ASA has semantic hits in 7 canonical types.
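The two flagging rules above can be sketched as a small predicate. Several points are assumptions rather than documented behaviour: the length limits are read as upper bounds (at most 2 characters, or at most 5 characters for purely numeric terms), "other canonical types" is read as types that do not belong to the concept being expanded, and the function names and inputs are hypothetical.

```python
def is_too_short(term: str) -> bool:
    """Length rule, with the limits from the text read as upper bounds (assumption)."""
    compact = term.strip()
    if compact.isdigit():
        return len(compact) <= 5   # purely numeric terms: up to 5 digits
    return len(compact) <= 2       # other terms: up to 2 characters


def is_ambiguous(term: str,
                 concept_types: set[str],
                 types_with_hits: set[str]) -> bool:
    """Flag a synonym if it is too short or has semantic hits elsewhere.

    concept_types:   canonical types of the concept being expanded.
    types_with_hits: canonical types in which `term` has semantic hits
                     (assumed to be looked up by the caller).
    """
    hits_elsewhere = bool(types_with_hits - concept_types)
    return is_too_short(term) or hits_elsewhere


# Illustrative inputs only.
aspirin_types = {"Chemical", "Active Substance"}
asa_flagged = is_ambiguous("ASA", aspirin_types,
                           {"Chemical", "Gene", "Disease"})
full_name_flagged = is_ambiguous("Acetylsalicylic acid", aspirin_types,
                                 {"Chemical"})
```

Here ‘ASA’ is flagged because it has hits outside the concept's own canonical types, while the full chemical name is not.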