A short glossary of the AI terms we use most often with Liverpool clients. Plain English, no marketing. If a term you have heard from a vendor is not on this list and you would like an honest definition, email us and we will add it.
Agent (AI agent)
A software system that takes a natural-language instruction, plans the steps required, calls external tools or APIs to carry them out, and returns the result. In our work, "agent" usually means a multi-step automated workflow with model-based decision points and a human in the loop where it matters. Most projects that vendors call "agents" are single-step LLM calls in a thin loop, so check what you are actually being sold.
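A minimal sketch of the loop described above: plan a step, call a tool, repeat, with a human approval gate before any action that has side effects. Everything here is hypothetical; the tools are stub functions and `plan_next_step` stands in for the model call a real agent would make.

```python
from typing import Callable

# Hypothetical tools: plain functions standing in for external APIs.
def lookup_invoice(invoice_id: str) -> dict:
    return {"id": invoice_id, "amount": 120.0, "paid": False}

def send_reminder(invoice_id: str) -> str:
    return f"reminder sent for {invoice_id}"

TOOLS: dict[str, Callable] = {
    "lookup_invoice": lookup_invoice,
    "send_reminder": send_reminder,
}

def plan_next_step(goal: str, history: list) -> dict:
    """Stand-in for the model-based decision point (a real agent would call an LLM)."""
    if not history:
        return {"tool": "lookup_invoice", "arg": goal}
    last = history[-1]["result"]
    if isinstance(last, dict) and not last["paid"]:
        return {"tool": "send_reminder", "arg": goal}
    return {"tool": None, "arg": None}  # nothing left to do

def run_agent(goal: str, approve: Callable[[dict], bool], max_steps: int = 5) -> list:
    history = []
    for _ in range(max_steps):
        step = plan_next_step(goal, history)
        if step["tool"] is None:
            break
        # Human in the loop: side-effecting actions need sign-off first.
        if step["tool"] == "send_reminder" and not approve(step):
            history.append({"step": step, "result": "rejected by reviewer"})
            break
        result = TOOLS[step["tool"]](step["arg"])
        history.append({"step": step, "result": result})
    return history
```

Calling `run_agent("INV-42", approve=lambda step: True)` runs the lookup and then the reminder; passing an `approve` function that returns `False` stops the run before the reminder is sent.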
Agentic automation
A workflow built around one or more AI agents, doing multi-step operational work end to end. Where it pays off: onboarding, reconciliation, compliance triage, document-heavy pipelines. Where it fails: when the underlying workflow should have been a deterministic rule engine plus three LLM calls. Agency is expensive; use it when you actually need it.
Confidence threshold
A configurable score below which an AI system's output is routed to a human reviewer rather than auto-actioned. Tuning this threshold is most of the work in any document intelligence or extraction project — set it too high and humans drown in review; too low and bad outputs slip through.
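As an illustration of the routing rule, here is a small sketch assuming each extraction carries a confidence score between 0 and 1. The `Extraction` type, the default threshold of 0.85 and the helper names are assumptions made for the example, not values we recommend.

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # assumed to lie in [0, 1]

def route(extraction: Extraction, threshold: float = 0.85) -> str:
    """Send anything scoring below the threshold to a human reviewer."""
    return "auto" if extraction.confidence >= threshold else "review"

def review_rate(extractions: list, threshold: float) -> float:
    """Share of items a human would have to review at this threshold."""
    if not extractions:
        return 0.0
    reviewed = sum(route(e, threshold) == "review" for e in extractions)
    return reviewed / len(extractions)
```

Running `review_rate` over a labelled sample at several thresholds shows the trade-off directly: the review share rises as the threshold rises, and the error rate among auto-actioned items should fall.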
Conversational AI
A system that interacts with users through natural-language chat — support copilots, internal knowledge assistants, customer-facing FAQ bots. The good ones cite sources and refuse out-of-scope questions. The bad ones hallucinate confidently.
Eval set (evaluation set)
A curated collection of real inputs with known correct outputs, used to measure how well an AI system performs. Building the eval set is more important than building the model. If a consultancy cannot show you the eval set for a previous project, they are not running production AI.
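A minimal sketch of how an eval set gets used, assuming exact-match scoring is good enough for the task. The toy classifier and the four cases are made up for illustration; a real eval set would hold real inputs and usually a more forgiving comparison.

```python
def evaluate(system, eval_set: list) -> tuple:
    """Run every case through the system and collect the ones it gets wrong."""
    failures = []
    for case in eval_set:
        got = system(case["input"])
        if got != case["expected"]:
            failures.append(
                {"input": case["input"], "expected": case["expected"], "got": got}
            )
    accuracy = (len(eval_set) - len(failures)) / len(eval_set)
    return accuracy, failures

# Toy system under test: a keyword rule standing in for a model.
def classify_email(text: str) -> str:
    return "invoice" if "invoice" in text.lower() else "other"

EVAL_SET = [
    {"input": "Invoice 1042 attached", "expected": "invoice"},
    {"input": "Lunch on Friday?", "expected": "other"},
    {"input": "Please find our bill for March", "expected": "invoice"},
    {"input": "Re: invoice query", "expected": "invoice"},
]
```

The list of failures matters as much as the headline accuracy: it tells you which inputs to study before changing the system.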
Fine-tuning
Training a base language model on a domain-specific dataset to specialise it for a task. Useful when retrieval cannot solve the problem — usually because the task requires the model to learn a writing style, an internal convention, or a structured output format. Most problems we are asked to solve with fine-tuning are actually retrieval problems. See RAG vs fine-tuning for the long version.
Hallucination
When a generative AI system produces text that sounds plausible but is factually wrong. The single most common failure mode of poorly-built AI systems. Mitigated by retrieval grounding, citation, and explicit refusal behaviour for out-of-scope queries.
Human in the loop
A workflow where an AI system does some part of a task and a human reviews or signs off before the result becomes final. Almost every production AI system we ship has a human in the loop somewhere — for regulatory reasons, customer-experience reasons or simply because the cost of a mistake exceeds the cost of a review.
LLM (large language model)
The class of AI models behind systems like ChatGPT, Claude and Gemini. Trained on very large text corpora to predict the next token. Powerful for natural-language tasks; needs careful grounding to be reliable in production.
Observability
The dashboards, logs and metrics that show how an AI system is behaving in production — token cost, latency, retrieval recall, refusal rate, error rate, hallucination rate. Most demo systems are not observable; most production systems live or die by it.
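A rough sketch of the kind of per-request record that feeds those dashboards, assuming one log entry per request. The `RequestLog` fields and the `summarize` helper are hypothetical; a production system would ship these records to a proper metrics backend.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RequestLog:
    latency_ms: float
    tokens: int
    refused: bool
    error: bool

def summarize(logs: list) -> dict:
    """Aggregate raw request logs into the headline numbers for a dashboard."""
    n = len(logs)
    if n == 0:
        raise ValueError("no logs to summarize")
    return {
        "requests": n,
        "mean_latency_ms": mean(log.latency_ms for log in logs),
        "total_tokens": sum(log.tokens for log in logs),
        "refusal_rate": sum(log.refused for log in logs) / n,
        "error_rate": sum(log.error for log in logs) / n,
    }
```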
Production AI
An AI system that is integrated, monitored, guarded against failure, documented and maintained — as opposed to a prototype or proof of concept. The gap between a demo and a production system is the entire engagement; that is where most AI pilots fail.
Prompt
The natural-language input you give to an LLM. Prompts can be hand-written, templated or generated by other systems. Prompt engineering is real but rarely the bottleneck — the bottleneck is almost always retrieval, eval, observability or scope.
Refusal behaviour
How an AI system responds when asked a question outside its scope or capability. Good refusals are explicit ("I cannot answer this — try X instead"). Bad refusals are confident hallucinations. Refusal behaviour is engineered, not emergent.
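One way to engineer a refusal, sketched under the assumption that the system retrieves passages with relevance scores before answering: if nothing clears a minimum score, return an explicit refusal instead of calling the model. The `min_score` value, the refusal wording and the `generate` callable are placeholders for the example.

```python
REFUSAL = "I cannot answer this from the documents I have. Try contacting support instead."

def answer_or_refuse(question: str, retrieved: list, generate, min_score: float = 0.5) -> str:
    """retrieved is a list of (passage, score) pairs; generate stands in for the LLM call."""
    grounded = [passage for passage, score in retrieved if score >= min_score]
    if not grounded:
        # Nothing relevant enough to ground an answer: refuse explicitly.
        return REFUSAL
    return generate(question, grounded)
```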
Retrieval (and retrieval-augmented generation, RAG)
Pulling relevant documents or data from a corpus and inserting them into the model's prompt so the answer is grounded in real, citable content. The single most-used technique in production AI consulting. Most problems are retrieval problems. See RAG vs fine-tuning for when to use which.
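A toy sketch of the retrieve-then-prompt pattern. It scores documents by simple word overlap, which is an assumption made to keep the example self-contained; production systems usually use embeddings and a vector store. The corpus and the prompt wording are invented.

```python
import re

CORPUS = {
    "returns-policy": "Customers can return items within 30 days of delivery.",
    "shipping": "Standard shipping takes three to five working days.",
    "warranty": "All products carry a two year warranty.",
}

def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: dict, k: int = 2) -> list:
    """Return ids of the k documents sharing the most words with the question."""
    q = tokenize(question)
    scored = [(len(q & tokenize(text)), doc_id) for doc_id, text in corpus.items()]
    scored = [pair for pair in scored if pair[0] > 0]  # drop documents with no overlap
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]

def build_prompt(question: str, corpus: dict, k: int = 2) -> str:
    """Insert the retrieved passages into the prompt, tagged with ids for citation."""
    doc_ids = retrieve(question, corpus, k)
    context = "\n".join(f"[{doc_id}] {corpus[doc_id]}" for doc_id in doc_ids)
    return (
        "Answer using only the sources below and cite them by id.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

The id tags in the prompt are what make citation possible: the model can only point at a source it was shown under a stable name.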
Scoping
The work of deciding what an AI project will and will not do — the metric to move, the data in and out of scope, the success criteria, the boundaries. Most failed AI projects failed in scoping, not in execution. See how to scope an AI project in a week.
Speakable
A schema.org property that tells voice assistants which parts of a page are suitable to read aloud. We use it on our posts and pillar page to help voice-search results pick the right snippets.
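For illustration, a sketch that builds the JSON-LD markup for a page with a `speakable` property. The page name, the URL and the CSS selectors are placeholders, not our actual markup; `SpeakableSpecification` and `cssSelector` are the schema.org names.

```python
import json

page = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "name": "AI glossary",
    "url": "https://example.com/glossary",  # placeholder URL
    "speakable": {
        "@type": "SpeakableSpecification",
        # Hypothetical selectors for the parts worth reading aloud.
        "cssSelector": [".glossary-term", ".glossary-definition"],
    },
}

def to_json_ld_script(data: dict) -> str:
    """Wrap structured data in the script tag that goes in the page head."""
    return '<script type="application/ld+json">\n' + json.dumps(data, indent=2) + "\n</script>"
```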
Token
The unit a language model reads and writes — usually a fragment of a word. Pricing is per-token, latency scales with tokens, context windows are measured in tokens. The whole economics of an AI system reduces to "how many tokens per request" in the end.
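The arithmetic behind that claim, as a short sketch. The prices and volumes used in the usage note are invented for the example; real per-token prices vary by model and provider and differ for input and output tokens.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Cost of one request, given prices quoted per million tokens."""
    total = input_tokens * price_in_per_million + output_tokens * price_out_per_million
    return total / 1_000_000

def monthly_cost(cost_per_request: float, requests_per_month: int) -> float:
    return cost_per_request * requests_per_month
```

With hypothetical prices of 2.00 per million input tokens and 8.00 per million output tokens, a request with 3,000 input and 500 output tokens costs about 0.01, so 50,000 requests a month come to roughly 500 in the same currency.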
Vector store
A specialised database that stores text (or other content) as numerical embeddings and supports fast similarity search. The retrieval layer in most RAG systems. Common choices: Postgres with `pgvector`, Pinecone, Weaviate, Qdrant. Choice rarely matters as much as how you chunk the source corpus.
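To show what "similarity search over embeddings" means, here is a tiny in-memory sketch using cosine similarity. `TinyVectorStore` is a hypothetical teaching class, not the API of any of the products named above, and it scans every item rather than using an index.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    def __init__(self):
        self._items = []  # list of (doc_id, embedding) pairs

    def add(self, doc_id: str, embedding: list) -> None:
        self._items.append((doc_id, embedding))

    def search(self, query: list, k: int = 3) -> list:
        """Return the k stored items most similar to the query, best first."""
        scored = [(cosine(query, emb), doc_id) for doc_id, emb in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, score) for score, doc_id in scored[:k]]
```

In a real system the embeddings come from an embedding model, and each stored item is one chunk of the source corpus, which is why chunking decisions shape what the search can return.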
If you would like to talk about any of the above in the context of your business, book a 30-minute discovery call. An honest conversation is the fastest way to find out whether an AI project fits your operation.