Agents for Database Population & Entity Resolution

COMP5311 — Database Architecture and Implementation
HKUST • Spring 2026

“Toward agent-driven construction of structured databases from heterogeneous web sources”

The Problem

Structured data trapped in an unstructured web

Research opportunities are scattered across hundreds of university sites, funding portals, and aggregators.

  • Each source: different format, different schema, partial info
  • Manual approach: months of work, immediately stale
  • Traditional crawlers lack semantic understanding of page content

Fernandez et al., VLDB 2023:

“ER and data integration have hit a ceiling of automation due to insufficient semantic understanding. LLMs provide the grounding needed to surpass this ceiling.”

Source A — university site
  Title: Summer ML Research Fellowship
  Org:   Stanford University
  URL:   stanford.edu/ml-fellowship

Source B — aggregator
  Title: ML Summer Research Fellowship
  Org:   Stanford
  URL:   scholarships.com/stanford-ml

Source C — funding body
  Title: Machine Learning Fellowship - Summer 2026
  Org:   Stanford Univ.
  URL:   nsf.gov/awards/stanford-ml

Same opportunity — 3 different representations

The Multi-Agent Pipeline

Hierarchical agent orchestration for web data extraction

LEVEL 1: Entity Discovery. “What entities exist in this city?”
  Input: “Tokyo” → Orchestrator Agent (web search + reasoning) → UTokyo, Waseda, Tokyo Tech

LEVEL 2: Program Discovery. “What programs does this entity offer?” (agents run in parallel)
  Search Agent: search & crawl UTokyo → found 12 programs
  Search Agent: search & crawl Waseda → found 8 programs
  Search Agent: search & crawl Tokyo Tech → found 15 programs

LEVEL 3: Extraction. “Extract full details from this program page” (one agent per program)
  Extraction Agent: crawl page → LLM parse

20+ fields extracted per program, validated with Pydantic → Supabase DB

Motivation for agents: Each level requires semantic reasoning — determining which links are relevant, classifying content as research opportunities, and mapping unstructured page content to a structured schema.
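The validation step at Level 3 can be sketched concretely. The actual pipeline uses Pydantic; the stand-in below uses a stdlib dataclass with a hand-rolled check, and any field names beyond those shown in the example records are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Program:
    """A subset of the 20+ extracted fields (names are illustrative)."""
    title: str
    org: str
    url: str
    deadline: Optional[date] = None
    location: Optional[str] = None

def validate(raw: dict) -> Program:
    """Reject LLM output missing required fields or carrying a malformed date."""
    for field in ("title", "org", "url"):
        if not raw.get(field):
            raise ValueError(f"missing required field: {field}")
    deadline = date.fromisoformat(raw["deadline"]) if raw.get("deadline") else None
    return Program(raw["title"], raw["org"], raw["url"], deadline, raw.get("location"))

program = validate({"title": "Summer ML Research Fellowship",
                    "org": "Stanford University",
                    "url": "stanford.edu/ml-fellowship",
                    "deadline": "2026-03-15"})
```

The point of the schema gate is that a hallucinated or partial LLM extraction fails loudly before it reaches the database.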

Live Demo

studyroam.com — a case study in agent-populated databases

studyroam.com


131 research internships found  •  4 cities searched  •  20+ fields per record

“All records were autonomously discovered, extracted, and structured.
However, a fundamental data quality challenge remains.”

The Challenge: Duplicates Everywhere

Same entity, different representations — this is entity resolution

Record A                              Record B
  title     Summer ML Fellowship        title     ML Summer Research Fellowship
  org       Stanford University         org       Stanford University
  url       stanford.edu/ml-fellow      url       careers.stanford.edu/ml
  deadline  2026-03-15                  deadline  2026-03-15
  location  Stanford, CA                location  Stanford, California

(org and deadline match exactly; titles and locations are similar; the URLs differ.)

Same real-world entity? High probability, but non-trivial to verify automatically.

  • Different titles (word order, abbreviation)
  • Different URLs (university vs aggregator)
  • Different descriptions (boilerplate)

Scale: Job posting studies show 50–80% duplication rates in aggregated listings

Zhao et al., WI-IAT 2021

Why it matters: duplicates degrade UX, inflate counts, waste storage & scraping budget

Entity Resolution — The Classical Pipeline

A well-studied 4-stage architecture

1. Block:    group records into candidate pairs (N×M → K)
2. Compare:  compute per-attribute similarity (feature engineering)
3. Classify: match or non-match (the hard part)
4. Merge:    cluster & deduplicate (transitivity)
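Stage 1's effect is easy to see in code: blocking on a cheap key shrinks the candidate set from all N² pairs to pairs within a block. The key below (a crudely normalized organization name) is an assumed choice for this domain, not a canonical one.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "org": "Stanford University", "title": "Summer ML Fellowship"},
    {"id": 2, "org": "Stanford Univ.",      "title": "ML Summer Research Fellowship"},
    {"id": 3, "org": "ETH Zurich",          "title": "Robotics Internship"},
    {"id": 4, "org": "ETH Zürich",          "title": "Robotics Summer Internship"},
]

def block_key(r):
    # crude normalization: lowercase, fold one diacritic, keep the first token
    return r["org"].lower().replace("ü", "u").split()[0]

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)

# only pairs inside the same block survive as candidates
candidates = [(a["id"], b["id"])
              for blk in blocks.values()
              for a, b in combinations(blk, 2)]
```

Here 4 records yield 6 possible pairs, but blocking keeps only 2 candidates; the expensive stages 2 and 3 never see the rest.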

Stage 2 in Detail: Per-Attribute Similarity

Given candidate tuple pair (t1, t2) with schema S = (title, org, deadline, location), compute a similarity score per attribute:

Attribute   t1 value               t2 value                        sim(aᵢ)          Score
title       Summer ML Fellowship   ML Summer Research Fellowship   Jaccard(3-gram)  0.71
org         Stanford University    Stanford University             exact            1.00
deadline    2026-03-15             2026-03-15                      exact            1.00
location    Stanford, CA           Stanford, California            token overlap    0.67

Similarity vector x = [0.71, 1.00, 1.00, 0.67]  →  input to the classifier (stage 3)
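These per-attribute scores take only a few lines to compute. Note that exact values depend on tokenization and n-gram choices, so the sketch below produces slightly different numbers than the illustrative scores in the table.

```python
def char_ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

def exact(a, b):
    return 1.0 if a == b else 0.0

def token_overlap(a, b):
    ta = set(a.lower().replace(",", "").split())
    tb = set(b.lower().replace(",", "").split())
    return len(ta & tb) / len(ta | tb)

t1 = {"title": "Summer ML Fellowship", "org": "Stanford University",
      "deadline": "2026-03-15", "location": "Stanford, CA"}
t2 = {"title": "ML Summer Research Fellowship", "org": "Stanford University",
      "deadline": "2026-03-15", "location": "Stanford, California"}

# one similarity score per schema attribute
x = [jaccard(t1["title"], t2["title"]),
     exact(t1["org"], t2["org"]),
     exact(t1["deadline"], t2["deadline"]),
     token_overlap(t1["location"], t2["location"])]
```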

Fellegi-Sunter Framework (1969)

Each attribute’s similarity contributes additive evidence (log-likelihood ratios). Sum the scores → threshold into three decision regions:

∑ sim(aᵢ) low   →  Non-Match
in between      →  Uncertain (manual review)
∑ sim(aᵢ) high  →  Match

Fellegi & Sunter, JASA, 1969. Still the conceptual backbone of modern ER.
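A minimal Fellegi-Sunter scorer looks like this. The m (agreement probability among matches) and u (agreement probability among non-matches) values below are assumed for illustration, not estimated from data, and the two thresholds are likewise placeholders.

```python
import math

# hypothetical per-attribute m/u probabilities
M = {"title": 0.9, "org": 0.95, "deadline": 0.85, "location": 0.8}
U = {"title": 0.05, "org": 0.1, "deadline": 0.02, "location": 0.15}

def fs_score(agree):
    """Sum per-attribute log-likelihood weights: agreement adds evidence
    log2(m/u); disagreement adds log2((1-m)/(1-u)), which is negative."""
    score = 0.0
    for attr, agrees in agree.items():
        m, u = M[attr], U[attr]
        score += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return score

def decide(score, lower=-2.0, upper=6.0):
    """Three decision regions, as in the original framework."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"
```

With all four attributes agreeing the summed weight is large and positive; with all disagreeing it is strongly negative; mixed evidence lands in the manual-review band.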

Key Insight

Blocking dominates runtime (O(N²) → O(K))

Matching dominates accuracy

The bottleneck: choosing the right similarity functions and the right classifier requires domain expertise and labeled training data.

Christophides et al., ACM Comp. Surveys 2021
Papadakis et al., ACM Comp. Surveys 2020

Modern ER — Two Breakthroughs Before LLMs

From hand-crafted features to learned representations

Ditto — Pre-Trained Language Models

VLDB 2021

Key idea: Skip manual similarity functions. Serialize both tuples into one token sequence and fine-tune a pre-trained Transformer (RoBERTa) as a binary classifier.

[CLS] [COL] title [VAL] instant immersion spanish deluxe 2.0
[COL] manf [VAL] topics [SEP]
[COL] title [VAL] instant immers spanish dlux 2
[COL] manf [VAL] NULL [SEP]
RoBERTa  →  [CLS] embedding  →  match / non-match

Why it works: The LM “knows” from pretraining that “deluxe” ≈ “dlux” in context — something Jaccard similarity would miss entirely.

Limitation: Still requires labeled tuple pairs for fine-tuning.
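Ditto's serialization step is straightforward to reproduce. A sketch (the [COL]/[VAL] spelling follows the paper's figures; in practice the model's tokenizer prepends its own start token, shown as [CLS] above):

```python
def serialize(tuple_, cols):
    """Flatten one tuple into Ditto's [COL]/[VAL] token sequence."""
    return " ".join(f"[COL] {c} [VAL] {tuple_.get(c) or 'NULL'}" for c in cols)

def pair_sequence(t1, t2, cols):
    """Both serializations in one sequence, so the Transformer attends
    across the pair jointly instead of comparing attribute by attribute."""
    return f"{serialize(t1, cols)} [SEP] {serialize(t2, cols)} [SEP]"

seq = pair_sequence(
    {"title": "instant immersion spanish deluxe 2.0", "manf": "topics"},
    {"title": "instant immers spanish dlux 2", "manf": None},
    cols=["title", "manf"],
)
```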

SIMPLE-EM — Weak Supervision

SIGMOD 2023

Key idea: Replace manual labeling with labeling functions (LFs) — user-written heuristics that vote match / non-match / abstain on each tuple pair.

LF1: Jaccard(title) > 0.8  →  match
LF2: org exact match       →  match
LF3: deadline differs      →  non-match
        ↓
Random Forest + EM: noisy, conflicting votes → clean probabilistic labels
+ Transitivity constraints (if A=B ∧ B=C ⇒ A=C)

Why it works: LFs are noisy individually, but the ensemble + EM recovers clean labels. Transitivity catches contradictions.

Limitation: Still requires hand-written heuristic functions.
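The LF interface can be sketched as below. The real system resolves conflicting votes with a Random Forest plus EM; here a plain majority vote over non-abstaining LFs stands in, and the 0.8 threshold is an assumption.

```python
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

def jaccard_tokens(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def lf_title(pair):
    return MATCH if jaccard_tokens(pair[0]["title"], pair[1]["title"]) > 0.8 else ABSTAIN

def lf_org(pair):
    return MATCH if pair[0]["org"] == pair[1]["org"] else ABSTAIN

def lf_deadline(pair):
    return NON_MATCH if pair[0]["deadline"] != pair[1]["deadline"] else ABSTAIN

def majority(pair, lfs=(lf_title, lf_org, lf_deadline)):
    """Stand-in aggregator: majority vote over LFs that did not abstain."""
    votes = [v for lf in lfs if (v := lf(pair)) != ABSTAIN]
    if not votes:
        return None
    return MATCH if sum(votes) > len(votes) / 2 else NON_MATCH
```

Individually each LF is noisy; the aggregation step is exactly where SIMPLE-EM improves on a naive vote.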

The gap: Ditto needs labeled tuple pairs. SIMPLE-EM needs hand-written heuristics. Could a general-purpose LLM just… read two tuples and decide?

LLMs Enter Entity Resolution

The paradigm shift — from training to prompting

Vision (Fernandez et al., VLDB 2023): ER hit an “automation ceiling” — LLMs encode the commonsense ER needs. “S Indiana Ave” = “South Indiana Avenue” — no rules required.

Before — Ditto (2021):
  domain data → label 1000s of pairs → fine-tune BERT/RoBERTa → one model per dataset

After — Narayan (2023):
  domain data → write ~10 demonstrations → prompt frozen GPT-3 → one model for everything

Narayan et al. VLDB 2023

Dataset         Ditto (fine-tuned)   GPT-3 (k=10)
Fodors-Zagats   100                  100
iTunes-Amazon   97.1                 98.2
Amazon-Google   75.6                 63.5

A frozen LLM with 10 in-context examples beats fine-tuned Ditto on 4 of 7 benchmark datasets, but per-pair prompting is expensive at scale.
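The prompting setup is simple to sketch. The wording below is illustrative, not Narayan et al.'s exact template, and the demonstration pair is taken from the running example.

```python
def serialize(record):
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def build_prompt(demos, pair):
    """demos: list of ((record_a, record_b), 'Yes'/'No') demonstrations."""
    lines = ["Do the two records refer to the same real-world entity? Answer Yes or No.", ""]
    for (a, b), label in demos:
        lines += [f"Record A: {serialize(a)}",
                  f"Record B: {serialize(b)}",
                  f"Answer: {label}", ""]
    a, b = pair
    lines += [f"Record A: {serialize(a)}", f"Record B: {serialize(b)}", "Answer:"]
    return "\n".join(lines)

prompt = build_prompt(
    demos=[(({"title": "Summer ML Fellowship"},
             {"title": "ML Summer Research Fellowship"}), "Yes")],
    pair=({"title": "Stanford ML Fellowship"},
          {"title": "Stanford NLP Internship"}),
)
```

The frozen LLM completes the final "Answer:" line; no gradient update ever touches the model.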


Making LLM-ER Practical

BATCHER ICDE 2024

Batch 8 pairs per API call + covering-based demo selection

STANDARD (one pair per call):
  task description + demos 1–3 + Q: one pair
  → 500,000 calls

BATCHER (eight pairs per call):
  task description + demos 1–2 (auto-selected)
  Q1: pair_1 → Y/N
  Q2: pair_2 → Y/N
  …
  Q8: pair_8 → Y/N
  → 62,500 calls (8× fewer)
Dataset          Standard   Batched   Saving
Walmart-Amazon   67.5       78.9      4.3×
Abt-Buy          65.7       85.8      4.6×

Surprise: batching improves accuracy.
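The batching arithmetic and prompt layout, sketched below; the question wording is illustrative rather than BATCHER's actual template.

```python
import math

def batch_prompts(pairs, batch_size=8):
    """Pack batch_size question pairs into each prompt, one prompt per API call."""
    prompts = []
    for i in range(0, len(pairs), batch_size):
        questions = [f"Q{j + 1}: do {a!r} and {b!r} match? (Y/N)"
                     for j, (a, b) in enumerate(pairs[i:i + batch_size])]
        prompts.append("Task: entity matching.\n" + "\n".join(questions))
    return prompts

# 500,000 candidate pairs at 8 pairs per call
n_calls = math.ceil(500_000 / 8)
```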

LLM-CER SIGMOD 2026

Skip pairwise — give LLM 9 records, ask it to cluster by entity.

PAIRWISE:
  r1=r2? → Yes   r1=r3? → No   r2=r3? → No   r1=r4? → …
  O(N²) calls

LLM-CER:
  “Group these 9 records by entity” → {1,2,3} {4,5} {6,7,8,9}
  Merge clusters hierarchically; a guardrail rejects bad clusters
Dataset   Method     Accuracy   API calls   Cost
Cora      Pairwise   0.88       30,200      $0.67
Cora      LLM-CER    0.90       279         $0.03

100× fewer API calls, better accuracy.
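One call per group instead of one per pair. A sketch of the prompt construction and of parsing the grouping reply (the reply format here is an assumption, not LLM-CER's specification):

```python
def cluster_prompt(records):
    """Ask for a grouping of all records at once instead of pairwise verdicts."""
    lines = ["Group the following records by real-world entity.",
             "Reply with space-separated groups of comma-separated record numbers."]
    lines += [f"{i + 1}. {r}" for i, r in enumerate(records)]
    return "\n".join(lines)

def parse_clusters(reply):
    """Parse e.g. '1,2,3 4,5 6,7,8,9' into lists of record indices."""
    return [[int(x) for x in group.split(",")] for group in reply.split()]

clusters = parse_clusters("1,2,3 4,5 6,7,8,9")
```

The guardrail and hierarchical-merge steps then operate on these parsed clusters rather than on individual pair verdicts.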

Limitation: LLMs reason over provided data only. They cannot visit URLs, verify claims at source, or resolve ambiguity through external evidence gathering.

The Open Frontier — Agent-Verified ER

From field comparison to source verification

Traditional ER

Record A (title, org, deadline, desc)  ↔  Record B (title, org, deadline, desc)

Field-by-field similarity: Jaccard, Levenshtein, embeddings (static data only)
→ Match / No match

Agent-Verified ER

Record A (title, org, 🔗 url)    Record B (title, org, 🔗 url)

🌐 The agent visits each URL, 📄 reads the live page, follows redirects, and cross-references the two pages.
🧠 It reasons over live source data, not static fields.
→ Match / No match, grounded in source
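A sketch of such a verification loop with a stubbed-out web. A real agent would issue HTTP requests and let an LLM compare the page contents; here `PAGES`, `REDIRECTS`, and the final decision rule are all assumptions made for illustration.

```python
# Stub of the live web: final URL → page text, plus one redirect edge.
PAGES = {
    "stanford.edu/ml-fellow": "Summer ML Fellowship ... apply by March 15, 2026",
}
REDIRECTS = {"careers.stanford.edu/ml": "stanford.edu/ml-fellow"}

def fetch(url):
    """Follow redirects, return (final_url, page_text)."""
    while url in REDIRECTS:
        url = REDIRECTS[url]
    return url, PAGES[url]

def agent_verify(rec_a, rec_b):
    """Match if both URLs resolve to the same canonical page,
    or if the live pages agree (crude stand-in for LLM reasoning)."""
    url_a, page_a = fetch(rec_a["url"])
    url_b, page_b = fetch(rec_b["url"])
    if url_a == url_b:
        return True
    return page_a == page_b

same = agent_verify({"url": "stanford.edu/ml-fellow"},
                    {"url": "careers.stanford.edu/ml"})
```

The key difference from field comparison: the verdict is derived from what the sources currently say, not from what the extractor happened to store.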


Cost/Quality Escalation

Cheapest first — only escalate what the previous tier can’t resolve

Tier            Method                              Cost/pair   Catches
Deterministic   URL + title+org hash                ~$0.00      ~40%
Statistical     Fellegi-Sunter, Jaccard, overlap    ~$0.01      ~30%
LLM             GPT/Gemini field comparison         ~$0.05      ~20%
Agent           Visit URL, read page, verify        ~$0.10      ~10%
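The escalation policy itself is a short dispatcher: each tier returns a verdict or abstains with None, and a pair reaches the expensive tiers only if the cheap ones abstain. The tier internals below are stand-ins; the thresholds are assumptions.

```python
def tier_deterministic(a, b):
    if a["url"] == b["url"] or (a["title"], a["org"]) == (b["title"], b["org"]):
        return True
    return None  # abstain → escalate

def tier_statistical(a, b):
    ta, tb = set(a["title"].lower().split()), set(b["title"].lower().split())
    j = len(ta & tb) / len(ta | tb)
    if j > 0.9:
        return True
    if j < 0.1:
        return False
    return None  # ambiguous middle band → escalate

def tier_llm(a, b):
    return None  # placeholder for an LLM field-comparison call

def tier_agent(a, b):
    return True  # placeholder: the last tier must always decide

def resolve(a, b, tiers=(tier_deterministic, tier_statistical, tier_llm, tier_agent)):
    """Run tiers cheapest-first; return the first non-abstaining verdict."""
    for tier in tiers:
        verdict = tier(a, b)
        if verdict is not None:
            return verdict, tier.__name__
```

Because each tier only sees the residue of the previous one, the per-pair cost profile follows the table above.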

Real Failure Modes Where Only Agents Help

⚠ Same program, different URLs
“DAAD RISE” at daad.de, uni-heidelberg.de, scholarshipportal.com
Agent visits all three → confirms same program

⚠ Title variation
“SURF” vs “Caltech SURF Program” vs “SURF – Summer Research at Caltech”
Agent visits page → sees one program

⚠ Org aliases
“ETH Zurich” vs “Swiss Federal Institute of Technology”
Agent web-searches → resolves identity

⚠ LLM over-merging
“Stanford ML Fellowship” vs “Stanford NLP Internship” — different programs!
Agent checks both pages → different PIs → separates

🚨 The Research Gap

No full paper at SIGMOD, VLDB, or ICDE on agent-based entity resolution. No study of cost/accuracy tradeoffs. No benchmark for cross-source ER where agents must visit URLs. Every ER pipeline hits a residual of 5–15% ambiguous pairs that field comparison cannot resolve. Agent verification is the natural next tier — and the research community hasn’t touched it.

Takeaways & Future Directions

1

Autonomous agents enable scalable construction of structured databases from heterogeneous web sources

2

Entity resolution emerges as the primary quality bottleneck in agent-populated databases — superseding extraction and crawling

3

LLM-based ER is a rapidly growing area at top DB venues — yet existing work assumes curated inputs, not LLM-generated data

4

Agent-verified ER remains unexplored in the literature — no published work on source-level verification for matching

Open Research Questions

Evaluation methodology
Standardized benchmarks and metrics for agent-driven ER pipelines

Cost–quality tradeoffs
Principled escalation policies from statistical to agent-based verification

Multi-agent architectures
Task-specialized agents for heterogeneous verification strategies

Online entity resolution
Incremental ER over continuously crawled, evolving web data

“As agents automate database construction, the research frontier shifts
from data extraction to data quality assurance.”

References

[1] Fernandez et al. “How LLMs Will Disrupt Data Management.” PVLDB 16(11), 2023

[2] Fellegi & Sunter. “A Theory for Record Linkage.” JASA 64(328), 1969

[3] Christophides et al. “End-to-End ER for Big Data.” ACM Comp. Surveys 53(6), 2021

[4] Papadakis et al. “Blocking and Filtering for ER.” ACM Comp. Surveys 53(2), 2020

[5] Thirumuruganathan et al. “DL for Blocking in EM.” PVLDB 14(11), 2021

[6] Li et al. “Ditto: Deep EM with Pre-Trained LMs.” PVLDB 14(1), 2021

[7] Wu et al. “Ground Truth Inference for Weakly Supervised EM.” SIGMOD 2023

[8] Papadakis et al. “Critical Re-evaluation of ER Benchmarks.” ICDE 2024

[9] Narayan et al. “Can FMs Wrangle Your Data?” PVLDB 16(4), 2023

[10] Fan et al. “Cost-Effective ICL for ER (BATCHER).” ICDE 2024

[11] Fu et al. “In-context Clustering ER (LLM-CER).” SIGMOD 2026

Venue               Papers
SIGMOD              SIMPLE-EM, LLM-CER
VLDB/PVLDB          DeepBlocker, Ditto, Narayan, Fernandez
ICDE                Papadakis benchmark, BATCHER
ACM Comp. Surveys   Christophides, Papadakis blocking