ONTOFORCE glossary


ONTOFORCE's concept explorer

The world of data, AI, and knowledge graphs is full of complex terminology, and even familiar terms can mean different things in different contexts. Our ONTOFORCE Glossary is designed to make that language clearer.

Your guide to the language of data and knowledge discovery

Here, you’ll find concise explanations of the key concepts behind data integration, semantic technologies, knowledge discovery, and more — all within the context of life sciences and the DISQOVER platform.


How to use the concept explorer

Browse alphabetically or search for a specific term. Each entry provides a clear definition, often with life sciences context and DISQOVER relevance.

A-L
Agentic AI

Agentic AI refers to systems composed of one or more autonomous agents that together can reason, plan, and carry out multi-step tasks or workflows. Unlike traditional AI or automation that reacts only when prompted, agentic AI exhibits agency. 

Association canonical type

Association canonical types are canonical types that contain information on the nature of the typed links, for example, the evidence on which the link is based or the source that contributed the link.

Controlled vocabulary

A controlled vocabulary is a standardized and curated list of terms used to describe and categorize data consistently across systems. Each term has a specific, agreed-upon meaning, which helps reduce ambiguity and ensure that everyone uses the same language when referring to concepts, entities, or attributes. 

In the context of life sciences and knowledge management, controlled vocabularies—such as MeSH, SNOMED CT, or ChEBI—enable consistent data integration, search, and analysis across diverse datasets. 
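
As a toy illustration (not a real MeSH or ChEBI mapping), a controlled vocabulary can be thought of as a lookup table that maps the many names people use onto one agreed-upon term:

    # Illustrative controlled vocabulary: several free-text variants map to one agreed term.
    VOCABULARY = {
        "aspirin": "acetylsalicylic acid",
        "asa": "acetylsalicylic acid",
        "acetylsalicylic acid": "acetylsalicylic acid",
    }

    def normalize(term: str) -> str:
        """Return the controlled term, or the input unchanged if it is unknown."""
        return VOCABULARY.get(term.strip().lower(), term)

    print(normalize("Aspirin"))  # -> acetylsalicylic acid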

Canonical type

Canonical types represent a common concept for the group of instances they categorize. In DISQOVER, they are used to add context to a search.

Component type

A component type is a kind of operation that is available in DISQOVER’s Data Ingestion Engine and can be used to build a pipeline. It offers the possibility of introspection by looking at input predicates. Running an instance of this type (i.e. a component) can result in the creation of new columns, but does not necessarily do so. A component type has options but does not have option values; it can give names and descriptions for its options. The same component type can be used many times in the same pipeline, and each use is a different component. A component is a special case of an action. Examples are import, re-alignment of resources, and extract distinct.

Concept

A semantic type of data, such as "clinical study," "chemical," or "disease."

Data ingestion

Data ingestion is the process of moving data from one or more sources into a storage system or processing environment where it can be analyzed and used. It involves collecting, transferring, and loading data so it becomes available for downstream applications. There are two main approaches: batch ingestion, where data is moved at scheduled intervals in groups, and streaming ingestion, where data flows continuously in near real time. Choosing between the two depends on whether efficiency or immediacy is the priority. 

Data integration

Data integration is the process of combining data from multiple sources into a unified, consistent view. It involves harmonizing different formats, structures, and terminologies so information can be accessed and analyzed together. 

Data mapping

Data mapping is the process of defining how data elements from one source correspond to data elements in another. It establishes connections between different datasets, ensuring that information with similar meaning—such as “Drug Name” and “Compound Label”—is aligned correctly during integration. 
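
A minimal sketch of that idea in Python; the column names and records are hypothetical:

    # Hypothetical mapping from source column names to unified target names.
    FIELD_MAP = {
        "Drug Name": "compound_label",
        "Compound Label": "compound_label",
    }

    def map_record(record: dict) -> dict:
        """Rename the keys of one source record according to the mapping."""
        return {FIELD_MAP.get(key, key): value for key, value in record.items()}

    print(map_record({"Drug Name": "Aspirin"}))       # {'compound_label': 'Aspirin'}
    print(map_record({"Compound Label": "Aspirin"}))  # {'compound_label': 'Aspirin'}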

Data, Information, Knowledge, Wisdom pyramid (DIKW)

The DIKW Pyramid is a conceptual model that illustrates how raw data is transformed into higher levels of understanding and actionable insight. 

Data represents raw facts and observations without context. 

Information is data that has been processed or organized to provide meaning. 

Knowledge emerges when information is connected, interpreted, and understood in context. 

Wisdom is the ability to apply knowledge effectively to make informed decisions or predictions. 

Edge

In a knowledge graph, an edge represents the relationship or connection between two entities, known as nodes. Each edge defines how the nodes are related—for example, in the statement “A drug treats a disease,” the edge is treats.

ETL (Extract, transform, load)

ETL is a data management process used to collect data from multiple sources, prepare it for analysis, and store it in a target system such as a database or knowledge platform. 

Extract: Retrieve data from different systems or formats. 

Transform: Clean, standardize, and harmonize the data to ensure consistency and quality. 

Load: Move the transformed data into a destination system for querying and analysis. 
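
A minimal, hypothetical sketch of the three steps in Python; the file name, column names, and SQLite target are illustrative only:

    import csv
    import sqlite3

    def extract(path: str) -> list[dict]:
        """Extract: read raw records from a CSV source."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(records: list[dict]) -> list[dict]:
        """Transform: clean and standardize the raw records."""
        return [
            {"compound": r["Drug Name"].strip().lower(), "dose_mg": float(r["Dose"])}
            for r in records
            if r.get("Drug Name") and r.get("Dose")
        ]

    def load(records: list[dict], db_path: str) -> None:
        """Load: write the harmonized records into a target database."""
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS compounds (compound TEXT, dose_mg REAL)")
        con.executemany("INSERT INTO compounds VALUES (:compound, :dose_mg)", records)
        con.commit()
        con.close()

    # load(transform(extract("source_data.csv")), "warehouse.db")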

FAIR

FAIR is a set of guiding principles designed to improve the management, sharing, and reuse of data. FAIR ensures that data can be easily discovered, accessed under clear conditions, integrated across systems, and reused for future research and innovation. 

Findable: Data are assigned unique identifiers and described with rich metadata. 

Accessible: Data can be retrieved through standardized, secure protocols. 

Interoperable: Data use shared vocabularies and formats to work seamlessly across systems. 

Reusable: Data are well-documented and licensed for future use and analysis. 

In the life sciences, adopting FAIR principles accelerates discovery by enabling collaboration, transparency, and efficient data integration. DISQOVER supports FAIR by connecting and harmonizing data across sources, making it easier for organizations to derive and share meaningful insights. 

GraphQL

GraphQL is a query language and runtime for APIs that allows clients to request exactly the data they need. It provides a more efficient, flexible, and precise alternative to traditional REST APIs. Instead of receiving fixed data structures, users can define the shape of the response, making data retrieval faster and reducing unnecessary data transfer. 
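
For illustration, a GraphQL request sent from Python might look like the sketch below; the endpoint URL and field names are hypothetical, not an existing API:

    import requests

    # Hypothetical GraphQL query: ask only for the fields we need.
    query = """
    query CompoundsForDisease($name: String!) {
      disease(name: $name) {
        label
        compounds { label mechanism }
      }
    }
    """

    response = requests.post(
        "https://example.org/graphql",                            # illustrative endpoint
        json={"query": query, "variables": {"name": "malaria"}},
        timeout=10,
    )
    print(response.json())  # the response mirrors the shape of the query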

Instance

An instance is a searchable entity that is characterized by its properties. For example, "malaria" is an instance of "disease."

Interoperability

Interoperability is the ability of different systems, applications, and data sources to exchange, understand, and use information seamlessly. It ensures that data can move across platforms without loss of meaning or context. 

Knowledge graph

Knowledge Graphs are a way of structuring information in graph form, by representing entities (e.g. people, places, objects) as nodes, and relationships between entities (e.g. being married to, being located in) as edges. A knowledge graph is a large network of interconnected entities. The connections are created based on the triples from knowledge bases. Facts are typically represented as ‘SPO’ triples: (Subject, Predicate, Object). Essentially, two nodes connected by a relationship form a fact.

A knowledge graph allows you to derive new insights from these connections, for example:

  • Boston (Subject) is a city in (Predicate) Massachusetts (Object).  
  • Massachusetts (Subject) is a state in (Predicate) the USA (Object).   
  • We can then conclude that Boston (Subject) is located in (Predicate) the USA (Object).
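
A minimal sketch of how such triples can be represented and chained in code (purely illustrative, not how any particular platform stores them):

    # Facts as (subject, predicate, object) triples.
    triples = [
        ("Boston", "is a city in", "Massachusetts"),
        ("Massachusetts", "is a state in", "USA"),
    ]

    # Chain the two relationships to derive a new "is located in" fact.
    derived = [
        (s1, "is located in", o2)
        for (s1, p1, o1) in triples
        for (s2, p2, o2) in triples
        if p1 == "is a city in" and p2 == "is a state in" and o1 == s2
    ]
    print(derived)  # [('Boston', 'is located in', 'USA')]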

Prior to using knowledge graphs, companies typically relied on relational databases and traditional methods, generating large datasets filled with thousands of rows and columns that are difficult to work with and collate.

Large language models (LLMs)

Large language models (LLMs) are advanced artificial intelligence systems trained on vast amounts of text data to understand and generate human-like language. They use deep learning techniques—especially transformer architectures—to recognize patterns, infer meaning, and produce contextually relevant responses to prompts. 

LLMs can summarize text, answer questions, generate natural language queries, and assist with data exploration. 

M-Z

Master data management (MDM)

Master data management is the discipline of creating and maintaining a single, consistent, and authoritative source of core business data, such as customers, products, suppliers, or locations, across an organization. MDM ensures that this critical data is accurate, standardized, and synchronized across different systems and applications, reducing duplication and inconsistency.

Metadata

Metadata is “data about data.” It provides descriptive information that explains the content, structure, source, and context of a dataset, making it easier to find, understand, and use. 

Examples of metadata include details like a dataset’s title, creator, date of collection, data format, and applied standards or vocabularies. 
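
For illustration, such metadata might be captured as a simple record; the fields below are examples, not a required schema:

    # Hypothetical metadata record describing a dataset.
    dataset_metadata = {
        "title": "Clinical trial outcomes, phase II",
        "creator": "Example Research Group",
        "date_collected": "2023-06-01",
        "format": "CSV",
        "vocabulary": "MeSH",        # controlled vocabulary used for disease terms
        "license": "CC-BY-4.0",
    }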

In the life sciences, metadata plays a crucial role in ensuring data quality, traceability, and compliance with FAIR principles. 

Natural language querying (NLQ)

Natural language querying is a way of interacting with data systems by asking questions in everyday human language, rather than using technical query languages like SQL or SPARQL. Powered by natural language processing (NLP) and machine learning, NLQ interprets the intent of the question, translates it into the appropriate backend query, and returns relevant results. 

Node

A node is a fundamental unit within a graph that represents an entity or concept, such as a person, place, object, or idea. In knowledge graphs, nodes are connected to other nodes through edges (relationships), creating a network of linked information. Each node can carry attributes or properties that further describe the entity it represents. 

In a knowledge graph about movies, a node could represent an actor, while edges connect that actor to nodes representing the films they have starred in. 

Ontology

An ontology, within the context of semantic technology, is a formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts. It is used to reason about the properties of that domain and to enable knowledge sharing and reuse among computers and humans. Ontologies are a crucial component of semantic technology because they provide a structured framework that allows explicit specification of the meaning of terms and the relationships between them.  

There are various ontologies in the life sciences realm, such as Gene Ontology, Sequence Ontology, and Medical Subject Headings (MeSH). These ontologies promote consistency in the preferred terms within the field. In doing so, they enable indexing of content and content retrieval through browsing or searching. 

 

Predecessor

If component A is executed before component B, A is a predecessor of B. If there is a connector between A and B, A is a direct predecessor. If there are one or more components between them, A is an indirect predecessor. 

Predicate

In data modeling and knowledge graphs, a predicate defines the relationship between two entities (also called subject and object). It expresses how one piece of information connects to another, for example, in the statement “A drug treats a disease”, treats is the predicate. 

In semantic data terms, each resource corresponds to one or more subjects, which are linked together via a preferred URI relation. Each resource has multiple properties, which are called predicates; in SQL database terms, these correspond to column names. In DISQOVER’s Data Ingestion Engine, a predicate can be stored in different columns, for example when the same predicate is created by different components. Predicates can also contain values of different types, meaning they can contain both literal values and URIs.

Provenance

Provenance refers to the record of the origin, history, and transformations of data. It tracks where data comes from, how it has been processed, and who or what has modified it over time. Provenance ensures transparency, trust, and reproducibility by allowing users to verify data quality, trace decision-making, and comply with regulatory requirements. 

Retrieval augmented generation (RAG)

In its simplest form, RAG is a natural language processing (NLP) approach that allows an LLM to check the accuracy of its response against, for example, a knowledge graph, thereby supplementing its own internal knowledge with a more structured, flexible, and well-curated data source. There are four stages to the process: 

  1. Question - The user asks a question of the LLM, usually through a chat interface.
  2. Key concepts - The LLM extracts the key concepts from that question and passes this to the knowledge graph.
  3. Additional knowledge - The knowledge graph sends relevant information back to the LLM as potential new input.
  4. Augmented response - The LLM processes the additional information from the knowledge graph and creates a response.
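
A highly simplified sketch of those four stages in Python; the helper functions are toy stand-ins for an LLM and a knowledge graph, not real APIs:

    # Toy stand-ins (hypothetical) for an LLM and a knowledge graph.
    def llm_extract_concepts(question: str) -> list[str]:
        return [w.strip("?") for w in question.lower().split() if w.strip("?") == "malaria"]

    def knowledge_graph_lookup(concepts: list[str]) -> list[str]:
        kg = {"malaria": "artemisinin is used to treat malaria"}
        return [kg[c] for c in concepts if c in kg]

    def llm_generate(question: str, context: list[str]) -> str:
        return f"Answer to '{question}', grounded in: {'; '.join(context)}"

    def answer_with_rag(question: str) -> str:
        concepts = llm_extract_concepts(question)   # 2. key concepts
        facts = knowledge_graph_lookup(concepts)    # 3. additional knowledge
        return llm_generate(question, facts)        # 4. augmented response

    print(answer_with_rag("What treats malaria?"))  # 1. question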

RDF (Resource description framework)

RDF is a standard model for representing and exchanging data on the web. It structures information as a collection of triples—each consisting of a subject, predicate, and object—to describe relationships between entities in a machine-readable way. 
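
A small example using the open-source rdflib package (one common Python option); the example.org namespace and facts are illustrative:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")   # illustrative namespace

    g = Graph()
    # Each statement is a (subject, predicate, object) triple.
    g.add((EX.aspirin, EX.treats, EX.headache))                     # object is another resource
    g.add((EX.aspirin, EX.label, Literal("acetylsalicylic acid")))  # object is a literal value
    print(g.serialize(format="turtle"))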

REST API

A REST API is a standardized way for applications to communicate over the web using simple HTTP methods. It enables systems to exchange data and perform operations without needing to share the same technology stack. 
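
A minimal sketch of calling a REST API from Python; the URL, parameters, and response fields are hypothetical:

    import requests

    # Hypothetical endpoint returning a JSON list of compounds.
    response = requests.get(
        "https://api.example.org/v1/compounds",
        params={"name": "aspirin"},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()
    for compound in response.json():
        print(compound["id"], compound["name"])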

Semantic layer

A semantic layer is an abstraction layer that provides a unified and consistent representation of data across various sources and systems by using common formats and vocabularies. This layer can sit above the physical storage of data (such as databases, data lakes, or APIs) and allows applications and users to interact with data in a more meaningful and context-aware manner. In all, a semantic layer is the glue connecting all data with the business context it represents.  

Integrating data through a semantic data layer allows life sciences organizations to harness the full potential of both proprietary and publicly available information. By leveraging a semantic layer, organizations can unify disparate data sources to create a cohesive, context-rich view of data. The semantic layer acts as a bridge, translating complex datasets into a common language, thereby streamlining data analysis, accelerating research processes, and fostering innovation.   

Modern semantic layers often leverage knowledge graphs to add context and interoperability, enabling richer insights and seamless integration across tools and data ecosystems. 

Semantic search

A search technique that goes beyond matching exact keywords and instead understands the meaning and context of words, phrases, and concepts. It leverages ontologies, knowledge graphs, and natural language processing to deliver results that are more accurate, relevant, and aligned with the user’s intent rather than just the literal query terms. 

 Example: Instead of treating “heart attack” and “myocardial infarction” as different, semantic search recognizes them as the same medical concept. 
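
A toy illustration of the idea; real semantic search relies on ontologies and NLP rather than a hard-coded synonym table:

    # Toy synonym table; a real system would draw on an ontology such as MeSH.
    SYNONYMS = {
        "heart attack": {"heart attack", "myocardial infarction"},
    }

    documents = [
        "Aspirin after myocardial infarction reduces mortality.",
        "Migraine treatment options.",
    ]

    def semantic_search(query: str) -> list[str]:
        """Match documents on any synonym of the query concept, not just the literal string."""
        terms = SYNONYMS.get(query.lower(), {query.lower()})
        return [doc for doc in documents if any(t in doc.lower() for t in terms)]

    print(semantic_search("heart attack"))
    # ['Aspirin after myocardial infarction reduces mortality.']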

Semantic technology

In today’s world, semantic technology encompasses a suite of tools and methodologies designed to enhance the way computers understand and process the meaning of data, text, and web content - akin to human comprehension. At its core, semantic technology leverages data, ontologies, knowledge graphs, and natural language processing (NLP) to create a rich, interconnected framework that allows for more sophisticated data interpretation, retrieval, and analysis.  

Semantic technology enables machines to understand the context and relationships within data, facilitating more accurate search results, data integration, and the automation of reasoning tasks. By imbuing data with meaning and making it machine-readable, semantic technology paves the way for advanced applications in artificial intelligence, information management, and beyond, transforming vast amounts of raw data into actionable knowledge. 

Semantic web

The Semantic Web is an extension of the traditional web where data is enriched with meaning and context so that it can be understood not only by people but also by machines. Using standards such as RDF, OWL, and SPARQL, it structures information in a way that connects concepts, entities, and relationships across different sources. This creates a web of linked data that allows software agents and applications to integrate, search, and reason over information more intelligently. 

According to the World Wide Web Consortium (W3C), "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries.” 

Shadowed synonym

Shadowed synonyms are synonyms that are hidden when performing query expansion because they are encompassed by other, broader and often simpler synonyms. These synonyms are identified by the fact that they share a common sub-string. 

SPARQL

SPARQL is a query language and protocol used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. It allows users to search across knowledge graphs and other linked data sources by expressing queries that target entities, relationships, and patterns, rather than just raw text or tables. SPARQL is a key standard of the Semantic Web, enabling interoperability and complex reasoning across distributed datasets.
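
A small, self-contained example using the rdflib package: build a toy graph, then query it with SPARQL (the example.org data is illustrative):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")   # illustrative namespace
    g = Graph()
    g.add((EX.aspirin, EX.treats, EX.headache))
    g.add((EX.ibuprofen, EX.treats, EX.headache))

    # SPARQL query: which resources treat headache?
    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?drug WHERE { ?drug ex:treats ex:headache . }
    """)
    for row in results:
        print(row.drug)   # prints the URIs of aspirin and ibuprofen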

Triple

Within knowledge bases, statements are generalized in the form of triples; these triples can be categorized under different ontologies using an ontology extraction process that may also harness natural language processing techniques. A triple is composed of a subject, a predicate, and an object. The subject and object are entities involved in a relationship defined by the predicate. Hence, for a statement such as ‘The Louvre is located in Paris’, we break it down into the following triple for the knowledge base:

  • Subject : Louvre 
  • Predicate : is located in 
  • Object : Paris 

Triple store

A triple store is a specialized database designed to store and retrieve data structured as RDF triples—statements in the form of subject–predicate–object. Unlike traditional relational databases, triple stores are optimized for managing large-scale graphs of interconnected facts and relationships. They provide the foundation for knowledge graphs and support semantic technologies by enabling complex queries, often through SPARQL. 

Uniform resource identifier (URI)

A URI is a sequence of characters that identifies an abstract or physical resource, usually connected to the internet. Each node in the graph is typically assigned a URI to ensure its distinctiveness across the entire knowledge graph, not just within a single data source. A good URI is globally unique and distinct from all other data points.