
    Entity extraction with LLMs – from document to knowledge graph

    May 30, 2026 · 4 min read
    Till Freitag

    TL;DR: "Two years ago entity extraction was its own ML project. Today it's a well-shaped prompt with a JSON schema and 200 lines of pipeline code. The hard work isn't extraction; it's deduplication and resolution."

    — Till Freitag

    What this is about

    A knowledge graph is only as good as the pipeline that fills it. Until 2023 entity extraction was a niche field: spaCy, Stanford NER, custom training, many edge cases. With LLMs it has turned into a tractable engineering problem.

    This article shows what a modern pipeline looks like, where it tends to snag, and which tools take the work off your plate.

    Four steps of an extraction pipeline

    Documents
       ↓ Chunking              (token limits, context window)
    Chunks
       ↓ Extraction (LLM)      (structured output)
    Triplets (subject, relation, object)
       ↓ Deduplication         (same entity, different wording)
       ↓ Resolution            (Anna B. = Anna Becker = anna@acme)
    Knowledge graph

    Each step has its own traps. Let's go through them.

    Step 1: Chunking

    Source documents are rarely well shaped. Contracts run 80 pages, email threads have 40 messages, PDFs hide tables. Classic strategies:

    • Naive: fixed token sizes (e.g. 1,000 tokens, 200 overlap) — fine for prose, kills tables.
    • Semantic: split at paragraph or heading boundaries — better, slower.
    • Structural: markdown headers, HTML sections, or for PDFs layout-aware parsers like Unstructured or Docling.

    Rule of thumb: prefer more overlap and smaller chunks. Hallucinated relationships almost always happen when a chunk mentions two entities without fully describing their relationship.
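The naive strategy with overlap can be sketched in a few lines. This is a minimal illustration, not a production chunker: whitespace splitting stands in for a real tokenizer, and the names `chunk_tokens` and `chunk_document` are made up for this example.

```python
def chunk_tokens(tokens: list[str], size: int = 1000, overlap: int = 200) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


def chunk_document(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Assumption: whitespace-separated words approximate tokens.
    # Swap in the tokenizer of your model for real token counts.
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), size, overlap)]
```

Following the rule of thumb means lowering `size` and raising `overlap`; the price is more chunks and therefore more extraction calls.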

    Step 2: Extraction with structured outputs

    This is where the most has changed in the last two years. Instead of free-form prompts followed by regex parsing, every serious pipeline today uses structured outputs: a JSON Schema that constrains the model to return well-formed data.

    Minimal example with Pydantic + Instructor:

    from pydantic import BaseModel, Field
    from typing import Literal
    import instructor
    from openai import OpenAI
    
    class Entity(BaseModel):
        id: str = Field(description="kebab-case stable identifier")
        type: Literal["person", "company", "product", "contract", "ticket"]
        name: str
        attributes: dict[str, str] = Field(default_factory=dict)
    
    class Relation(BaseModel):
        source_id: str
        target_id: str
        label: str = Field(description="verb phrase, lower case")
        confidence: float = Field(ge=0, le=1)
    
    class ExtractionResult(BaseModel):
        entities: list[Entity]
        relations: list[Relation]
    
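    # Instructor wraps the client so the response is parsed and validated
    # against `response_model`.
    client = instructor.from_openai(OpenAI())

    # `chunk_text` holds the text of one chunk from step 1.
    result = client.chat.completions.create(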
        model="gpt-4.1",
        response_model=ExtractionResult,
        messages=[{
            "role": "system",
            "content": "Extract entities and relationships. Use stable ids. "
                       "Set confidence honestly.",
        }, {
            "role": "user",
            "content": chunk_text,
        }],
    )

    Three things that really matter here:

    1. confidence is mandatory. Models that mark everything 1.0 are broken — but if you don't ask, you don't get it.
    2. Stable ids. Otherwise "Anna Becker", "A. Becker" and "Ms Becker" land as three nodes.
    3. Keep the schema small. The more fields you demand, the worse the quality. Two passes beat one fat pass.
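Points 1 and 2 can be checked mechanically after each extraction call. The sketch below is one possible shape for that check, using a plain dataclass as a stand-in for the Pydantic `Relation` model above. `to_kebab_id` and `check_extraction` are hypothetical helper names, not part of Instructor or Pydantic.

```python
import re
import unicodedata
from dataclasses import dataclass

KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def to_kebab_id(entity_type: str, name: str) -> str:
    """Derive a deterministic kebab-case id such as 'person-anna-becker'."""
    # Strip accents so that umlauts and similar characters survive as ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", f"{entity_type} {ascii_name}".lower()).strip("-")


@dataclass
class Relation:  # stand-in for the Pydantic model of the same name
    source_id: str
    target_id: str
    label: str
    confidence: float


def check_extraction(entity_ids: set[str], relations: list[Relation]) -> list[str]:
    """Return human-readable problems found in one extraction result."""
    problems = []
    for entity_id in sorted(entity_ids):
        if not KEBAB.match(entity_id):
            problems.append(f"id not kebab-case: {entity_id}")
    for rel in relations:
        for endpoint in (rel.source_id, rel.target_id):
            if endpoint not in entity_ids:
                problems.append(f"dangling relation endpoint: {endpoint}")
    # A model that marks everything 1.0 is not reporting confidence at all.
    if relations and all(rel.confidence == 1.0 for rel in relations):
        problems.append("every relation has confidence 1.0")
    return problems
```

Deterministic ids only make identical spellings collide; variants like "A. Becker" still need the deduplication step.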

    Step 3: Deduplication

    Reality: the same model extracts "Acme GmbH" and "Acme Group" from two chunks — two nodes, one company. Three pragmatic strategies, in order of cost:

    • Exact match: normalize to lowercase, strip whitespace, drop trade suffixes (gmbh, inc, ltd), then compare. Catches 60–70% of duplicates.
    • Embedding-based: cosine similarity over entity embeddings plus a threshold. Catches another 20–25%.
    • LLM reconciliation: pass the remaining candidate pairs to an LLM ("same entity? yes/no"). Catches the rest, but is expensive.

    We usually run the pipeline in batch with all three, in that order.
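The first two tiers could look roughly like this. It is a sketch under assumptions: embeddings arrive as plain float lists from whatever model you use, the thresholds `merge_at` and `review_at` are illustrative values rather than recommendations, and the greedy pairwise loop is quadratic, so it only suits small candidate sets.

```python
import math
import re

TRADE_SUFFIXES = {"gmbh", "inc", "ltd"}  # the suffixes named above; extend as needed


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation and trade suffixes, collapse whitespace."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in TRADE_SUFFIXES)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(names: list[str], vectors: dict[str, list[float]],
           merge_at: float = 0.9, review_at: float = 0.75):
    """Return (clusters, review_pairs); review_pairs are left for an LLM to judge."""
    # Tier 1: exact match on the normalized key.
    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(normalize_name(name), []).append(name)

    # Tier 2: compare one representative per group by cosine similarity.
    keys = list(groups)
    canonical = {key: key for key in keys}
    review: list[tuple[str, str]] = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            score = cosine(vectors[groups[a][0]], vectors[groups[b][0]])
            if score >= merge_at and canonical[b] == b:
                canonical[b] = canonical[a]
            elif review_at <= score < merge_at:
                review.append((a, b))  # Tier 3 candidates

    clusters: dict[str, list[str]] = {}
    for key in keys:
        clusters.setdefault(canonical[key], []).extend(groups[key])
    return clusters, review
```

Running the cheap tiers first keeps the number of pairs that reach the LLM small, which is where the cost sits.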

    Step 4: Entity resolution

    Deduplication is intra-corpus. Resolution links to reality: "Anna Becker" is user_id=4471 in the CRM, "Acme GmbH" is customer_id=cust_8821.

    That rarely works fully automatically. What works:

    • A resolution layer with clear keys (email domain → company, full name + company → person).
    • Human-in-the-loop for the first 5–10% of uncertain matches. The heuristic learns from there.
    • Fallback nodes for unresolved entities instead of dropping them — they can be resolved later.
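A resolution layer with explicit keys and fallback nodes might be sketched as follows. The two lookup tables are hypothetical stand-ins for a CRM export, the domain `acme.example` is invented for illustration, and the node id prefixes (`user:`, `customer:`, `unresolved:`) are a naming assumption, not a convention from any tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resolved:
    node_id: str
    status: str  # "resolved" or "unresolved"


# Hypothetical lookup tables; in practice these come from the CRM.
COMPANY_BY_DOMAIN = {"acme.example": "cust_8821"}
PERSON_BY_NAME_AND_COMPANY = {("anna becker", "cust_8821"): "4471"}


def resolve_company(email: Optional[str]) -> Resolved:
    """Key: email domain -> company."""
    domain = email.rsplit("@", 1)[1].lower() if email and "@" in email else "unknown"
    customer_id = COMPANY_BY_DOMAIN.get(domain)
    if customer_id:
        return Resolved(f"customer:{customer_id}", "resolved")
    # Fallback node: keep the entity so it can be resolved later.
    return Resolved(f"unresolved:company:{domain}", "unresolved")


def resolve_person(full_name: str, email: Optional[str] = None) -> Resolved:
    """Key: full name + company -> person."""
    company = resolve_company(email)
    name_key = " ".join(full_name.lower().split())
    if company.status == "resolved":
        customer_id = company.node_id.split(":", 1)[1]
        user_id = PERSON_BY_NAME_AND_COMPANY.get((name_key, customer_id))
        if user_id:
            return Resolved(f"user:{user_id}", "resolved")
    return Resolved("unresolved:person:" + name_key.replace(" ", "-"), "unresolved")
```

Nodes with status "unresolved" are the natural queue for human review, and a later pass can re-run them once the lookup tables have grown.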

    Tool comparison

    Tool                           | Strength                                      | Weakness
    LlamaIndex KG Index            | end-to-end, Neo4j integration, fast prototype | little schema control
    Instructor + OpenAI/Anthropic  | full control, Pydantic typed                  | build the pipeline yourself
    LangChain GraphIndex           | lots of glue, many integrations               | high abstraction, steep curve
    Microsoft GraphRAG             | community detection built in, batch-oriented  | Python-only, opinionated
    Unstructured + custom pipeline | best PDF/HTML parsing                         | assemble parts yourself

    Our default for prototypes: LlamaIndex KG Index. For production: Instructor + custom pipeline + Neo4j.

    Most common failure modes

    1. Hallucinated entities: the model invents people or companies not in the chunk. → Strict prompts ("only what is mentioned verbatim"), discard extractions with confidence below 0.7.
    2. Too many relationships per chunk: the model wants to connect everything. → Set max_triplets_per_chunk (10–15 is fine).
    3. Inconsistent relation labels: "works at", "employed by", "employee of" → three edges. → Predefined relation vocabularies or LLM-based label mapping as post-processing.
    4. Schema drift over time: new document types → new fields. → Versioned schemas, regular sample audits.
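The first three mitigations can be combined into one post-processing pass per chunk. This is a sketch with assumptions: triplets are plain `(subject, label, object, confidence)` tuples, `RELATION_VOCAB` is a tiny example vocabulary, and the cap is applied after extraction here, whereas a setting like max_triplets_per_chunk limits it at extraction time.

```python
RELATION_VOCAB = {  # example vocabulary: surface form -> canonical label
    "works at": "works_at",
    "employed by": "works_at",
    "employee of": "works_at",
}


def mentioned_verbatim(name: str, chunk: str) -> bool:
    """Cheap hallucination check: the entity name must occur in the chunk text."""
    return name.casefold() in chunk.casefold()


def clean_triplets(triplets, min_confidence: float = 0.7, max_per_chunk: int = 15):
    """Filter, relabel and cap the triplets of one chunk.

    Returns (kept, unmapped); unmapped triplets have a label outside the
    vocabulary and can go to an LLM-based label mapping step.
    """
    best: dict[tuple[str, str, str], float] = {}
    unmapped = []
    for subj, label, obj, conf in triplets:
        if conf < min_confidence:
            continue  # discard low-confidence extractions
        canonical = RELATION_VOCAB.get(label.strip().lower())
        if canonical is None:
            unmapped.append((subj, label, obj, conf))
            continue
        key = (subj, canonical, obj)
        best[key] = max(conf, best.get(key, 0.0))  # collapse duplicate edges
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    kept = [(s, l, o, c) for (s, l, o), c in ranked[:max_per_chunk]]
    return kept, unmapped
```

Mapping labels before counting matters: "works at" and "employed by" then collapse into a single edge instead of using up two slots of the per-chunk budget.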

    Cost reality

    For a typical enterprise setup (10,000 documents, ~5M tokens):

    • Extraction with GPT-4.1 / Claude Sonnet 4.5: 80–200 USD one-off
    • Embedding-based deduplication: 5–15 USD
    • LLM reconciliation: 20–60 USD
    • Incremental updates: ~1–3 USD per 100 new documents

    That's less than a day of engineering. The expensive part is setting up the pipeline and modeling the data — not the tokens.

    Conclusion

    Entity extraction in 2026 isn't an ML problem anymore, it's an engineering problem. The hard work isn't in the prompt — it's in the three steps after: deduplication, resolution, schema discipline.

    Build it cleanly once and any new document corpus becomes graph-ready in hours instead of weeks. That's why knowledge graphs are going mainstream again.

