Understanding retrieval augmented generation (RAG): a response to hallucinations?



Large language models (LLMs) are having an immense impact in the life sciences industry, but how can you be sure their responses are accurate and complete? Introducing retrieval augmented generation, or RAG, to supplement your LLM queries.


Recently, a great deal of attention has been paid to retrieval augmented generation, or RAG. For example, Microsoft’s RAG-enabled Azure OpenAI “On Your Data” service only became generally available in February 2024, and just three months later its features had already been significantly expanded to include new data sources, the latest GPT-4o model, Teams integration, and security enhancements. This is no surprise given the explosion of large language models (LLMs) and other generative artificial intelligence (GenAI) products, since RAG is a highly effective method of improving the accuracy and completeness of insights generated by LLMs.

Large language models for the life sciences

LLMs alone are a powerful resource for the life sciences industry. They use an AI algorithm trained on large data sets to understand, summarize, and predict new content. Alongside general-purpose tools such as OpenAI’s well-known ChatGPT, there are also life-sciences-specific options, such as BioMedLM and DRAGON from the Stanford Center for Research on Foundation Models (CRFM), and Microsoft’s BioGPT. These LLMs are trained on biomedical data, PubMed abstracts through a biomedical knowledge graph, and biomedical publications, respectively.

However, LLMs are not designed to store knowledge or to allow that knowledge to be corrected or governed. As a result, the data that LLMs rely on can quickly become outdated or inaccurate as new insights are uncovered. Furthermore, LLMs extrapolate answers when facts are not available, thereby generating hallucinations: factually incorrect or nonsensical outputs that arise from limitations in the training data, biases in the model, or the inherent complexities of language. These hallucinations can, of course, cause substantial problems in life sciences, resulting in a loss of profits and efficiency at best, and serious consequences for patient health at worst.

How can retrieval augmented generation (RAG) improve LLMs?

In its simplest form, RAG is a natural language processing (NLP) approach that allows an LLM to check the accuracy of its response against an external source, for example a knowledge graph, thereby supplementing its own internal knowledge with a more structured, flexible, and well-curated data source. There are four stages to the process:

  1. Question: the user asks a question of the LLM, usually through a chat interface.
  2. Key concepts: the LLM extracts the key concepts from that question and passes them to the knowledge graph.
  3. Additional knowledge: the knowledge graph sends text back to the LLM containing any relevant additional information.
  4. Augmented response: the LLM processes the additional information from the knowledge graph and creates a response.
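
The four stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any particular product: the knowledge graph is a small in-memory list of placeholder facts, the key-concept step is simple keyword matching (a real system would typically ask the LLM to do this), and the LLM is any callable that takes a prompt and returns text. All names (Fact, KNOWLEDGE_GRAPH, extract_key_concepts, retrieve_facts, build_augmented_prompt, answer) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    source: str  # provenance: where this fact came from


# Placeholder facts for illustration only (not real curated data).
KNOWLEDGE_GRAPH: List[Fact] = [
    Fact("drug-a", "inhibits", "target-x", "placeholder-source-1"),
    Fact("drug-a", "is studied in", "disease-z", "placeholder-source-2"),
    Fact("drug-b", "inhibits", "target-y", "placeholder-source-3"),
]


def extract_key_concepts(question: str, known_entities: Set[str]) -> List[str]:
    """Stage 2 stand-in: keyword matching instead of LLM-based extraction."""
    lowered = question.lower()
    return sorted(e for e in known_entities if e.lower() in lowered)


def retrieve_facts(concepts: Iterable[str], graph: List[Fact]) -> List[Fact]:
    """Stage 3: return every fact that mentions one of the key concepts."""
    wanted = set(concepts)
    return [f for f in graph if f.subject in wanted or f.obj in wanted]


def build_augmented_prompt(question: str, facts: List[Fact]) -> str:
    """Combine the retrieved facts (with sources) and the user's question."""
    lines = [
        f"- {f.subject} {f.relation} {f.obj} [source: {f.source}]" for f in facts
    ]
    context = "\n".join(lines) if lines else "- (no matching facts found)"
    return (
        "Answer using only the facts below and cite their sources.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}"
    )


def answer(question: str, graph: List[Fact], llm: Callable[[str], str]) -> str:
    """Stages 1 to 4: question, key concepts, retrieval, augmented response."""
    entities = {f.subject for f in graph} | {f.obj for f in graph}
    concepts = extract_key_concepts(question, entities)
    facts = retrieve_facts(concepts, graph)
    return llm(build_augmented_prompt(question, facts))


if __name__ == "__main__":
    # A stub "LLM" that just echoes its prompt, so the augmented input is visible.
    print(answer("What does drug-a inhibit?", KNOWLEDGE_GRAPH, lambda prompt: prompt))
```

Because each retrieved fact carries its source into the prompt, the model can be asked to cite it, which is what makes the final response checkable.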

There are two major benefits of supplementing LLMs with RAG. Firstly, it reduces the chance of hallucinations by giving the LLM additional information from a reliable, high-quality source, and even live data, rather than leaving it to rely only on its own internal model, which can rapidly become outdated. Secondly, the knowledge graph records the sources on which the results are based. This allows users to check the accuracy of the generated summary, which is invaluable for auditability, and to trace its provenance, since they can see exactly where the knowledge originally came from. Moreover, the new insights generated by the LLM could be used to create new edges in the knowledge graph, like a new relationship between a drug and target, or a new analysis, enhancing the completeness of the data.

Maintaining generative analyses in a knowledge graph

Any new insights generated by LLMs using RAG can be fed back into the knowledge graph, making them available to all other researchers using that knowledge platform. All of this happens while keeping the links back to the original source, which maintains provenance and allows data quality checks. The researcher can then decide to use only results with very strong confidence, or even to ignore any generative concepts completely and stick to the human-generated knowledge. The benefit of this combination is that it allows full flexibility over how the analyses are recycled and used.
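
One way to picture this flexibility is to store, on every edge in the graph, where it came from and how confident the system is in it. The sketch below is purely illustrative and assumes a hypothetical Edge record with an origin field ("human" or "generative") and a confidence score between 0 and 1; the field names, the select_edges helper, and the example values are assumptions, not part of any specific knowledge platform.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    source: str              # link back to the original evidence (provenance)
    origin: str              # "human" or "generative" (hypothetical labels)
    confidence: float = 1.0  # hypothetical score in the range 0 to 1


def select_edges(
    edges: List[Edge],
    min_confidence: float = 0.0,
    include_generative: bool = True,
) -> List[Edge]:
    """Keep all human-curated edges; keep generative edges only if they are
    allowed and meet the confidence threshold."""
    selected = []
    for edge in edges:
        if edge.origin == "generative":
            if not include_generative or edge.confidence < min_confidence:
                continue
        selected.append(edge)
    return selected


# Placeholder edges for illustration only.
edges = [
    Edge("drug-a", "inhibits", "target-x", "placeholder-source-1", "human"),
    Edge("drug-a", "may affect", "target-y", "placeholder-source-2", "generative", 0.95),
    Edge("drug-b", "may affect", "target-x", "placeholder-source-3", "generative", 0.40),
]

high_confidence_only = select_edges(edges, min_confidence=0.9)
human_only = select_edges(edges, include_generative=False)
```

Here high_confidence_only keeps the human-curated edge plus the one strongly supported generative edge, while human_only drops every generative edge. In both cases each surviving edge still carries its source, so the provenance is never lost.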

In order to achieve this, however, the evolution of user interfaces with knowledge platforms will be crucial, including human-in-the-loop validation to provide important oversight on what knowledge is present and how it is used. Already, knowledge graphs are evolving to become hubs to manage metadata across different types of multimodal data, including AI-generated data, and a supporting foundation for AI and machine-learning analytics. This is the one place where humans and AIs can interact in a common, well-defined, well-structured language about the domain that they work in.

For this reason, there is growing interest in architectures that not only situate the machine learning stack and knowledge graph, but also orchestrate capabilities to connect the team more closely.

Martin Robbins, Head of Product, ONTOFORCE states: “We’re moving towards a world of knowledge graphs that are continuously updating in real time, with machine learning processes hanging off the back of that, watching and responding to new data, reprocessing that data, extracting new knowledge, new trends, and feeding that back into the graph. And humans being able to inspect that new knowledge, provide some level of validation, reason, and more analysis along the way.”

RAG can greatly enhance life sciences data quality

Accuracy and reliability are paramount in the life sciences industry, where all decisions can directly impact human health and well-being. These are both limited in LLMs, since they are trained on a snapshot of data which cannot be easily updated or corrected. RAG can effectively tackle these limitations of LLMs by integrating additional knowledge sources, including an institution’s internal, unpublished data. The ability to handle multimodal information provides a wide range of benefits to enhance data quality and accuracy, ultimately improving patient outcomes and reducing healthcare costs.

  1. Reduce hallucinations and improve accuracy through supplementation with a reliable, well-curated data source.
  2. Ensure optimal data (re)use of up-to-date, relevant, and high-quality information through a single platform.
  3. Enhance transparency and interpretability through tracking data provenance, which offers huge benefits for regulatory compliance.

Drug discovery and development

The RAG framework can improve the potential for target identification and validation, chemical compound analysis, and even generative de novo molecular design by facilitating multimodal information retrieval from a range of sources, such as clinical trials, chemical databases, medical records, etc. This can generate new insights about drug-target interactions, including potential biomarkers and target expression patterns, and even predict potential adverse events.

Gene and protein annotation

As sequencing technologies become more and more effective, the interpretation of these vast numbers of generated sequences often falls behind. Early attempts at gene and protein function annotation were manual and laborious, relying on comparisons to known orthologs. Through incorporating a much wider range of data sources and improving search functions, RAG can provide a robust framework for producing detailed, up-to-date, and reliable annotations to significantly accelerate this process and improve the quality of predictions.

Regulatory compliance

The enhanced transparency of data sources provided by RAG allows detailed data provenance tracking, which is essential for demonstrating due diligence and data quality for regulatory compliance. Furthermore, RAG can assist with retrieving the most up-to-date regulatory guidelines and generating checklists to help companies address all necessary aspects of compliance.

Personalized medicine and orphan indications

By their nature, personalized medicine and orphan indications suffer from a lack of prior data on which to base research and treatment recommendations. Using RAG-enhanced LLM searches ensures that the latest, most complete knowledge can be applied to each case.

A data solution for non-data scientists

The greatest benefit of supplementing LLMs using RAG via an intuitive interface will be to the average researcher: someone who has an interesting question but has neither the will nor the time to dive into the underlying technology, data model, or AI. This type of user just wants an answer on their screen with some accurate data and evidence to back it up. RAG offers a reliable means of keeping LLMs up to date and effective, giving researchers confidence in the completeness and accuracy of the responses generated.

Learn more about RAG, together with the strengths of knowledge graphs and best practices in harnessing their power for your AI and ML endeavors. Watch our recent webinar featuring ONTOFORCE’s Martin Robbins, Head of Product, and Michael Vanhoutte, Vice President of Engineering: “Elevate your AI and machine learning analytics with knowledge graphs.”