
Entity extraction with LLMs – from document to knowledge graph
TL;DR: "Two years ago entity extraction was its own ML project. Today it's a well-shaped prompt with a JSON schema and 200 lines of pipeline code. The hard work isn't extraction; it's deduplication and resolution."
– Till Freitag
What this is about
A knowledge graph is only as good as the pipeline that fills it. Until 2023 entity extraction was a niche field: spaCy, Stanford NER, custom training, many edge cases. With LLMs it has turned into a tractable engineering problem.
This article shows what a modern pipeline looks like, where it tends to get stuck, and which tools take the work off your plate.
Four steps of an extraction pipeline
```
Documents
   ↓ Chunking          (token limits, context window)
Chunks
   ↓ Extraction (LLM)  (structured output)
Triplets (subject, relation, object)
   ↓ Deduplication     (same entity, different wording)
   ↓ Resolution        (Anna B. = Anna Becker = anna@acme)
Knowledge graph
```
Each step has its own traps. Let's go through them.
Step 1: Chunking
Source documents are rarely well shaped. Contracts run 80 pages, email threads have 40 messages, PDFs hide tables. Classic strategies:
- Naive: fixed token sizes (e.g. 1,000 tokens, 200 overlap) — fine for prose, kills tables.
- Semantic: split at paragraph or heading boundaries — better, slower.
- Structural: markdown headers, HTML sections, or for PDFs layout-aware parsers like Unstructured or Docling.
Rule of thumb: prefer more overlap and smaller chunks. Hallucinated relationships almost always happen when a chunk mentions two entities without fully describing their relationship.
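To make the naive strategy concrete, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_tokens` is ours, and whitespace splitting stands in for a real tokenizer, so token counts will differ from what a model actually sees.

```python
def chunk_tokens(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace splitting is an assumption; swap in the tokenizer of your model.
tokens = "Anna Becker signed the contract with Acme GmbH in March".split()
chunks = chunk_tokens(tokens, size=6, overlap=2)
# Two windows; the last two tokens of the first reappear at the start of the second.
```

The overlap is what keeps an entity and the sentence describing its relationship in at least one common window. It does nothing for tables, which is why the structural strategies exist.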
Step 2: Extraction with structured outputs
This is where the most has changed in the last two years. Instead of free-form prompts (with regex parsing), every serious pipeline today uses structured outputs — JSON Schema forcing the model to return well-formed data.
Minimal example with Pydantic + Instructor:
```python
from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class Entity(BaseModel):
    id: str = Field(description="kebab-case stable identifier")
    type: Literal["person", "company", "product", "contract", "ticket"]
    name: str
    attributes: dict[str, str] = Field(default_factory=dict)


class Relation(BaseModel):
    source_id: str
    target_id: str
    label: str = Field(description="verb phrase, lower case")
    confidence: float = Field(ge=0, le=1)


class ExtractionResult(BaseModel):
    entities: list[Entity]
    relations: list[Relation]


# Placeholder input; in the pipeline this is one chunk from step 1.
chunk_text = "Anna Becker signed the contract with Acme GmbH."

# Instructor wraps the client so the response is validated against response_model.
client = instructor.from_openai(OpenAI())
result = client.chat.completions.create(
    model="gpt-4.1",
    response_model=ExtractionResult,
    messages=[{
        "role": "system",
        "content": "Extract entities and relationships. Use stable ids. "
                   "Set confidence honestly.",
    }, {
        "role": "user",
        "content": chunk_text,
    }],
)
```
Three things that really matter here:
- `confidence` is mandatory. Models that mark everything 1.0 are broken, but if you don't ask, you don't get it.
- Stable `id`s. Otherwise "Anna Becker", "A. Becker" and "Ms Becker" land as three nodes.
- Keep the schema small. The more fields you demand, the worse the quality. Two passes beat one fat pass.
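Two of these points can be checked mechanically before anything reaches the graph. The sketch below is one possible sanity check, not part of Instructor: it uses a plain dataclass that mirrors the relation fields above, and the helper names `drop_dangling` and `confidence_is_flat` as well as the `min_relations` cutoff are our own assumptions.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    source_id: str
    target_id: str
    label: str
    confidence: float


def drop_dangling(entity_ids: set[str], relations: list[Relation]) -> list[Relation]:
    """Keep only relations whose endpoints both exist as extracted entities."""
    return [r for r in relations if r.source_id in entity_ids and r.target_id in entity_ids]


def confidence_is_flat(relations: list[Relation], min_relations: int = 5) -> bool:
    """True if every relation has the same confidence, so the score carries no signal."""
    if len(relations) < min_relations:
        return False  # too few relations to judge
    return len({r.confidence for r in relations}) == 1


entity_ids = {"anna-becker", "acme-gmbh"}
relations = [
    Relation("anna-becker", "acme-gmbh", "works at", 0.9),
    Relation("anna-becker", "globex-inc", "advises", 0.6),  # target id was never extracted
]
kept = drop_dangling(entity_ids, relations)  # only the first relation survives
```

A flat confidence column across a whole batch is a hint to rework the prompt or the model choice rather than something to filter around.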
Step 3: Deduplication
Reality: the same model extracts "Acme GmbH" and "Acme Group" from two chunks — two nodes, one company. Three pragmatic strategies, in order of cost:
- Exact match: normalize to lowercase, strip whitespace, drop trade suffixes (`gmbh`, `inc`, `ltd`). Catches 60–70%.
- Embedding-based: cosine similarity over entity embeddings + threshold. Catches another 20–25%.
- LLM reconciliation: pass candidate pairs to an LLM ("same entity? yes/no"). Catches the rest, expensive.
We usually run the pipeline in batch with all three, in that order.
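The first two stages fit in a few lines of standard-library Python. This is a sketch under assumptions: the suffix list is illustrative, the 0.9 threshold is a placeholder you would tune on your own data, and the two-dimensional vectors are toy values standing in for real embeddings from whatever model you use.

```python
import math
import re

# Trade suffixes to strip; illustrative, not exhaustive.
TRADE_SUFFIXES = {"gmbh", "inc", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, remove trade suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in TRADE_SUFFIXES)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings: dict[str, list[float]], threshold: float = 0.9) -> list[tuple[str, str]]:
    """Return name pairs whose embeddings are at least `threshold` similar."""
    names = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Stage 1: both spellings collapse to the same key.
same_key = normalize_name("Acme GmbH") == normalize_name("  ACME, Inc. ")  # True

# Stage 2: toy vectors keyed by normalized name (assumed, not real embeddings).
toy_embeddings = {"acme": [1.0, 0.0], "acme group": [0.9, 0.1], "globex": [0.0, 1.0]}
pairs = candidate_pairs(toy_embeddings)  # [('acme', 'acme group')]
```

The pairwise comparison is quadratic in the number of entities, which is fine for a sketch but would need blocking or an approximate nearest-neighbour index on a large corpus. The pairs it returns are the candidates the third stage would hand to an LLM.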
Step 4: Entity resolution
Deduplication is intra-corpus. Resolution links to reality: "Anna Becker" is user_id=4471 in the CRM, "Acme GmbH" is customer_id=cust_8821.
That rarely works fully automatically. What works:
- A resolution layer with clear keys (email domain → company, full name + company → person).
- Human-in-the-loop for the first 5–10% of uncertain matches. The heuristic learns from there.
- Fallback nodes for unresolved entities instead of dropping them — they can be resolved later.
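A resolution layer with explicit keys and fallback nodes could look roughly like this. The lookup tables are hypothetical stand-ins for a CRM export or API, the domain `acme.com` and the `unresolved:` key format are assumptions, and the IDs are the ones from the example above.

```python
from dataclasses import dataclass

# Hypothetical CRM lookups; in practice these come from the CRM's API or an export.
COMPANY_BY_DOMAIN = {"acme.com": "cust_8821"}
PERSON_BY_NAME_AND_COMPANY = {("anna becker", "cust_8821"): "4471"}


@dataclass
class Resolved:
    key: str        # CRM key, or a fallback key for an unresolved entity
    resolved: bool


def resolve_company(email_domain: str) -> Resolved:
    """Key: email domain -> company."""
    domain = email_domain.strip().lower()
    customer_id = COMPANY_BY_DOMAIN.get(domain)
    if customer_id is None:
        return Resolved(key=f"unresolved:company:{domain}", resolved=False)
    return Resolved(key=customer_id, resolved=True)


def resolve_person(full_name: str, company: Resolved) -> Resolved:
    """Key: full name + company -> person."""
    name = " ".join(full_name.lower().split())
    user_id = PERSON_BY_NAME_AND_COMPANY.get((name, company.key)) if company.resolved else None
    if user_id is None:
        # Fallback node: kept in the graph so it can be resolved later.
        return Resolved(key=f"unresolved:person:{name}", resolved=False)
    return Resolved(key=user_id, resolved=True)


company = resolve_company("Acme.com")
person = resolve_person("Anna  Becker", company)
unknown = resolve_person("Ben Okafor", company)  # hypothetical name, not in the CRM
```

Entities that come back with `resolved=False` are the natural queue for human review.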
Tool comparison
| Tool | Strength | Weakness |
|---|---|---|
| LlamaIndex KG Index | end-to-end, Neo4j integration, fast prototype | little schema control |
| Instructor + OpenAI/Anthropic | full control, Pydantic typed | build the pipeline yourself |
| LangChain GraphIndex | lots of glue, many integrations | high abstraction, steep curve |
| Microsoft GraphRAG | community detection built in, batch-oriented | Python-only, opinionated |
| Unstructured + custom pipeline | best PDF/HTML parsing | assemble parts yourself |
Our default for prototypes: LlamaIndex KG Index. For production: Instructor + custom pipeline + Neo4j.
Most common failure modes
- Hallucinated entities: the model invents people or companies not in the chunk. → Strict prompts ("only what is mentioned verbatim"), discard `confidence` below 0.7.
- Too many relationships per chunk: the model wants to connect everything. → Set `max_triplets_per_chunk` (10–15 is fine).
- Inconsistent relation labels: "works at", "employed by", "employee of" → three edges. → Predefined relation vocabularies or LLM-based label mapping as post-processing.
- Schema drift over time: new document types → new fields. → Versioned schemas, regular sample audits.
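Three of these countermeasures can share one post-processing pass. The sketch below assumes relations arrive as plain dicts with the fields from the schema above; the vocabulary is a hypothetical example, and the cap is applied after extraction as a stand-in for a per-chunk limit set at extraction time.

```python
# Hypothetical controlled vocabulary: every known variant maps to one canonical label.
CANONICAL_LABELS = {
    "works at": "works at",
    "employed by": "works at",
    "employee of": "works at",
}

MIN_CONFIDENCE = 0.7  # threshold named in the text
MAX_TRIPLETS = 15     # upper end of the 10-15 range named in the text


def clean_relations(relations: list[dict]) -> list[dict]:
    """Filter by confidence, canonicalize labels, merge duplicates, cap per chunk."""
    best: dict[tuple[str, str, str], dict] = {}
    for rel in relations:
        if rel["confidence"] < MIN_CONFIDENCE:
            continue
        label = CANONICAL_LABELS.get(rel["label"].strip().lower())
        if label is None:
            continue  # unknown label: dropped here, could be queued for review instead
        key = (rel["source_id"], label, rel["target_id"])
        if key not in best or rel["confidence"] > best[key]["confidence"]:
            best[key] = {**rel, "label": label}
    ranked = sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
    return ranked[:MAX_TRIPLETS]


raw = [
    {"source_id": "anna-becker", "target_id": "acme-gmbh", "label": "works at", "confidence": 0.92},
    {"source_id": "anna-becker", "target_id": "acme-gmbh", "label": "Employed by", "confidence": 0.85},
    {"source_id": "anna-becker", "target_id": "acme-gmbh", "label": "employee of", "confidence": 0.40},
]
cleaned = clean_relations(raw)  # one "works at" edge with confidence 0.92
```

Dropping unknown labels silently is the strictest option. Logging them instead gives you the raw material for extending the vocabulary during sample audits.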
Cost reality
For a typical enterprise setup (10,000 documents, ~5M tokens):
- Extraction with GPT-4.1 / Claude Sonnet 4.5: 80–200 USD one-off
- Embedding-based deduplication: 5–15 USD
- LLM reconciliation: 20–60 USD
- Incremental updates: ~1–3 USD per 100 new documents
That's less than a day of engineering. The expensive part is setting up the pipeline and modeling the data — not the tokens.
Conclusion
Entity extraction in 2026 isn't an ML problem anymore, it's an engineering problem. The hard work isn't in the prompt — it's in the three steps after: deduplication, resolution, schema discipline.
Build it cleanly once and any new document corpus becomes graph-ready in hours instead of weeks. That's why knowledge graphs are going mainstream again.
Related reading:
- What is a knowledge graph? – the concept behind it
- GraphRAG vs. vector RAG – what the graph is then used for
- Neo4j vs. Kuzu vs. Memgraph – where the extracted graph lives
- Knowledge graph as a service – when you don't want to build the pipeline yourself








