Agents for Database Population & Entity Resolution

COMP5311 — Database Architecture and Implementation
HKUST • Spring 2026

“Toward agent-driven construction of structured databases from heterogeneous web sources”

The Problem

Structured data trapped in an unstructured web

Research opportunities are scattered across hundreds of university sites, funding portals, and aggregators.

  • Each source: different format, different schema, partial info
  • Manual approach: months of work, immediately stale
  • Traditional crawlers lack semantic understanding of page content

Fernandez et al., VLDB 2023:

“ER and data integration have hit a ceiling of automation due to insufficient semantic understanding. LLMs provide the grounding needed to surpass this ceiling.”

Source A — university site
  Title: Summer ML Research Fellowship
  Org:   Stanford University
  URL:   stanford.edu/ml-fellowship

Source B — aggregator
  Title: ML Summer Research Fellowship
  Org:   Stanford
  URL:   scholarships.com/stanford-ml

Source C — funding body
  Title: Machine Learning Fellowship - Summer 2026
  Org:   Stanford Univ.
  URL:   nsf.gov/awards/stanford-ml

Same opportunity — 3 different representations

The Multi-Agent Pipeline

Hierarchical agent orchestration for web data extraction

LEVEL 1: Entity Discovery. “What entities exist in this city?”
  Input: “Tokyo” → Orchestrator Agent (web search + reasoning) → UTokyo, Waseda, Tokyo Tech

LEVEL 2: Program Discovery. “What programs does this entity offer?” (agents run in parallel)
  Search Agent: search & crawl UTokyo → found 12 programs
  Search Agent: search & crawl Waseda → found 8 programs
  Search Agent: search & crawl Tokyo Tech → found 15 programs

LEVEL 3: Extraction. “Extract full details from this program page” (one agent per program)
  Extraction Agent: crawl page → LLM parse

20+ fields extracted per program, validated with Pydantic → Supabase DB

Motivation for agents: Each level requires semantic reasoning — determining which links are relevant, classifying content as research opportunities, and mapping unstructured page content to a structured schema.
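The validation step at Level 3 can be sketched concretely. The actual pipeline uses Pydantic; the stand-in below uses a stdlib dataclass with a hand-rolled check, and any field names beyond those shown in the example records are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Program:
    """A subset of the 20+ extracted fields (names are illustrative)."""
    title: str
    org: str
    url: str
    deadline: Optional[date] = None
    location: Optional[str] = None

def validate(raw: dict) -> Program:
    """Reject LLM output missing required fields or carrying a malformed date."""
    for field in ("title", "org", "url"):
        if not raw.get(field):
            raise ValueError(f"missing required field: {field}")
    deadline = date.fromisoformat(raw["deadline"]) if raw.get("deadline") else None
    return Program(raw["title"], raw["org"], raw["url"], deadline, raw.get("location"))

program = validate({"title": "Summer ML Research Fellowship",
                    "org": "Stanford University",
                    "url": "stanford.edu/ml-fellowship",
                    "deadline": "2026-03-15"})
```

The point of the schema gate is that a hallucinated or partial LLM extraction fails loudly before it reaches the database.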

Live Demo

studyroam.com — a case study in agent-populated databases

studyroam.com


131 research internships found  •  4 cities searched  •  20+ fields per record

“All records were autonomously discovered, extracted, and structured.
However, a fundamental data quality challenge remains.”

The Challenge: Duplicates Everywhere

Same entity, different representations — this is entity resolution

Record A                              Record B
  title     Summer ML Fellowship        title     ML Summer Research Fellowship
  org       Stanford University         org       Stanford University
  url       stanford.edu/ml-fellow      url       careers.stanford.edu/ml
  deadline  2026-03-15                  deadline  2026-03-15
  location  Stanford, CA                location  Stanford, California

(org and deadline match exactly; titles and locations are similar; the URLs differ.)

Same real-world entity? High probability, but non-trivial to verify automatically.

  • Different titles (word order, abbreviation)
  • Different URLs (university vs aggregator)
  • Different descriptions (boilerplate)

Scale: Job posting studies show 50–80% duplication rates in aggregated listings

Zhao et al., WI-IAT 2021

Why it matters: duplicates degrade UX, inflate counts, waste storage & scraping budget

Entity Resolution — The Classical Pipeline

A well-studied 4-stage architecture

1. Block:    group records into candidate pairs (N×M → K)
2. Compare:  compute per-attribute similarity (feature engineering)
3. Classify: match or non-match (the hard part)
4. Merge:    cluster & deduplicate (transitivity)
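Stage 1's effect is easy to see in code: blocking on a cheap key shrinks the candidate set from all N² pairs to pairs within a block. The key below (a crudely normalized organization name) is an assumed choice for this domain, not a canonical one.

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "org": "Stanford University", "title": "Summer ML Fellowship"},
    {"id": 2, "org": "Stanford Univ.",      "title": "ML Summer Research Fellowship"},
    {"id": 3, "org": "ETH Zurich",          "title": "Robotics Internship"},
    {"id": 4, "org": "ETH Zürich",          "title": "Robotics Summer Internship"},
]

def block_key(r):
    # crude normalization: lowercase, fold one diacritic, keep the first token
    return r["org"].lower().replace("ü", "u").split()[0]

blocks = defaultdict(list)
for r in records:
    blocks[block_key(r)].append(r)

# only pairs inside the same block survive as candidates
candidates = [(a["id"], b["id"])
              for blk in blocks.values()
              for a, b in combinations(blk, 2)]
```

Here 4 records yield 6 possible pairs, but blocking keeps only 2 candidates; the expensive stages 2 and 3 never see the rest.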

Stage 2 in Detail: Per-Attribute Similarity

Given candidate tuple pair (t1, t2) with schema S = (title, org, deadline, location), compute a similarity score per attribute:

Attribute   t1 value               t2 value                        sim(aᵢ)          Score
title       Summer ML Fellowship   ML Summer Research Fellowship   Jaccard(3-gram)  0.71
org         Stanford University    Stanford University             exact            1.00
deadline    2026-03-15             2026-03-15                      exact            1.00
location    Stanford, CA           Stanford, California            token overlap    0.67

Similarity vector x = [0.71, 1.00, 1.00, 0.67]  →  input to the classifier (stage 3)
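These per-attribute scores take only a few lines to compute. Note that exact values depend on tokenization and n-gram choices, so the sketch below produces slightly different numbers than the illustrative scores in the table.

```python
def char_ngrams(s, n=3):
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

def exact(a, b):
    return 1.0 if a == b else 0.0

def token_overlap(a, b):
    ta = set(a.lower().replace(",", "").split())
    tb = set(b.lower().replace(",", "").split())
    return len(ta & tb) / len(ta | tb)

t1 = {"title": "Summer ML Fellowship", "org": "Stanford University",
      "deadline": "2026-03-15", "location": "Stanford, CA"}
t2 = {"title": "ML Summer Research Fellowship", "org": "Stanford University",
      "deadline": "2026-03-15", "location": "Stanford, California"}

# one similarity score per schema attribute
x = [jaccard(t1["title"], t2["title"]),
     exact(t1["org"], t2["org"]),
     exact(t1["deadline"], t2["deadline"]),
     token_overlap(t1["location"], t2["location"])]
```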

Fellegi-Sunter Framework (1969)

Each attribute’s similarity contributes additive evidence (log-likelihood ratios). Sum the scores → threshold into three decision regions:

∑ sim(aᵢ) low   →  Non-Match
in between      →  Uncertain (manual review)
∑ sim(aᵢ) high  →  Match

Fellegi & Sunter, JASA, 1969. Still the conceptual backbone of modern ER.
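A minimal Fellegi-Sunter scorer looks like this. The m (agreement probability among matches) and u (agreement probability among non-matches) values below are assumed for illustration, not estimated from data, and the two thresholds are likewise placeholders.

```python
import math

# hypothetical per-attribute m/u probabilities
M = {"title": 0.9, "org": 0.95, "deadline": 0.85, "location": 0.8}
U = {"title": 0.05, "org": 0.1, "deadline": 0.02, "location": 0.15}

def fs_score(agree):
    """Sum per-attribute log-likelihood weights: agreement adds evidence
    log2(m/u); disagreement adds log2((1-m)/(1-u)), which is negative."""
    score = 0.0
    for attr, agrees in agree.items():
        m, u = M[attr], U[attr]
        score += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return score

def decide(score, lower=-2.0, upper=6.0):
    """Three decision regions, as in the original framework."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"
```

With all four attributes agreeing the summed weight is large and positive; with all disagreeing it is strongly negative; mixed evidence lands in the manual-review band.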

Key Insight

Blocking dominates runtime (O(N²) → O(K))

Matching dominates accuracy

The bottleneck: choosing the right similarity functions and the right classifier requires domain expertise and labeled training data.

Christophides et al., ACM Comp. Surveys 2021
Papadakis et al., ACM Comp. Surveys 2020

Modern ER — Two Breakthroughs Before LLMs

From hand-crafted features to learned representations

Ditto — Pre-Trained Language Models

VLDB 2021

Key idea: Skip manual similarity functions. Serialize both tuples into one token sequence and fine-tune a pre-trained Transformer (RoBERTa) as a binary classifier.

[CLS] [COL] title [VAL] instant immersion spanish deluxe 2.0
[COL] manf [VAL] topics [SEP]
[COL] title [VAL] instant immers spanish dlux 2
[COL] manf [VAL] NULL [SEP]
RoBERTa  →  [CLS] embedding  →  match / non-match

Why it works: The LM “knows” from pretraining that “deluxe” ≈ “dlux” in context — something Jaccard similarity would miss entirely.

Limitation: Still requires labeled tuple pairs for fine-tuning.
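Ditto's serialization step is straightforward to reproduce. A sketch (the [COL]/[VAL] spelling follows the paper's figures; in practice the model's tokenizer prepends its own start token, shown as [CLS] above):

```python
def serialize(tuple_, cols):
    """Flatten one tuple into Ditto's [COL]/[VAL] token sequence."""
    return " ".join(f"[COL] {c} [VAL] {tuple_.get(c) or 'NULL'}" for c in cols)

def pair_sequence(t1, t2, cols):
    """Both serializations in one sequence, so the Transformer attends
    across the pair jointly instead of comparing attribute by attribute."""
    return f"{serialize(t1, cols)} [SEP] {serialize(t2, cols)} [SEP]"

seq = pair_sequence(
    {"title": "instant immersion spanish deluxe 2.0", "manf": "topics"},
    {"title": "instant immers spanish dlux 2", "manf": None},
    cols=["title", "manf"],
)
```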

SIMPLE-EM — Weak Supervision

SIGMOD 2023

Key idea: Replace manual labeling with labeling functions (LFs) — user-written heuristics that vote match / non-match / abstain on each tuple pair.

LF1: Jaccard(title) > 0.8  →  match
LF2: org exact match       →  match
LF3: deadline differs      →  non-match
        ↓
Random Forest + EM: noisy, conflicting votes → clean probabilistic labels
+ Transitivity constraints (if A=B ∧ B=C ⇒ A=C)

Why it works: LFs are noisy individually, but the ensemble + EM recovers clean labels. Transitivity catches contradictions.

Limitation: Still requires hand-written heuristic functions.
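The LF interface can be sketched as below. The real system resolves conflicting votes with a Random Forest plus EM; here a plain majority vote over non-abstaining LFs stands in, and the 0.8 threshold is an assumption.

```python
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

def jaccard_tokens(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def lf_title(pair):
    return MATCH if jaccard_tokens(pair[0]["title"], pair[1]["title"]) > 0.8 else ABSTAIN

def lf_org(pair):
    return MATCH if pair[0]["org"] == pair[1]["org"] else ABSTAIN

def lf_deadline(pair):
    return NON_MATCH if pair[0]["deadline"] != pair[1]["deadline"] else ABSTAIN

def majority(pair, lfs=(lf_title, lf_org, lf_deadline)):
    """Stand-in aggregator: majority vote over LFs that did not abstain."""
    votes = [v for lf in lfs if (v := lf(pair)) != ABSTAIN]
    if not votes:
        return None
    return MATCH if sum(votes) > len(votes) / 2 else NON_MATCH
```

Individually each LF is noisy; the aggregation step is exactly where SIMPLE-EM improves on a naive vote.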

The gap: Ditto needs labeled tuple pairs. SIMPLE-EM needs hand-written heuristics. Could a general-purpose LLM just… read two tuples and decide?

LLMs Enter Entity Resolution

The paradigm shift — from training to prompting

Vision (Fernandez et al., VLDB 2023): ER hit an “automation ceiling” — LLMs encode the commonsense ER needs. “S Indiana Ave” = “South Indiana Avenue” — no rules required.

Before — Ditto (2021):
  domain data → label 1000s of pairs → fine-tune BERT/RoBERTa → one model per dataset

After — Narayan (2023):
  domain data → write ~10 demonstrations → prompt frozen GPT-3 → one model for everything

Narayan et al. VLDB 2023

Dataset         Ditto (fine-tuned)   GPT-3 (k=10)
Fodors-Zagats   100                  100
iTunes-Amazon   97.1                 98.2
Amazon-Google   75.6                 63.5

A frozen LLM with 10 in-context examples beats fine-tuned Ditto on 4 of 7 benchmark datasets, but per-pair prompting is expensive at scale.
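The prompting setup is simple to sketch. The wording below is illustrative, not Narayan et al.'s exact template, and the demonstration pair is taken from the running example.

```python
def serialize(record):
    return "; ".join(f"{k}: {v}" for k, v in record.items())

def build_prompt(demos, pair):
    """demos: list of ((record_a, record_b), 'Yes'/'No') demonstrations."""
    lines = ["Do the two records refer to the same real-world entity? Answer Yes or No.", ""]
    for (a, b), label in demos:
        lines += [f"Record A: {serialize(a)}",
                  f"Record B: {serialize(b)}",
                  f"Answer: {label}", ""]
    a, b = pair
    lines += [f"Record A: {serialize(a)}", f"Record B: {serialize(b)}", "Answer:"]
    return "\n".join(lines)

prompt = build_prompt(
    demos=[(({"title": "Summer ML Fellowship"},
             {"title": "ML Summer Research Fellowship"}), "Yes")],
    pair=({"title": "Stanford ML Fellowship"},
          {"title": "Stanford NLP Internship"}),
)
```

The frozen LLM completes the final "Answer:" line; no gradient update ever touches the model.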


Making LLM-ER Practical

BATCHER ICDE 2024

Batch 8 pairs per API call + covering-based demo selection

STANDARD (one pair per call):
  task description + demos 1–3 + Q: one pair
  → 500,000 calls

BATCHER (eight pairs per call):
  task description + demos 1–2 (auto-selected)
  Q1: pair_1 → Y/N
  Q2: pair_2 → Y/N
  …
  Q8: pair_8 → Y/N
  → 62,500 calls (8× fewer)
Dataset          Standard   Batched   Saving
Walmart-Amazon   67.5       78.9      4.3×
Abt-Buy          65.7       85.8      4.6×

Surprise: batching improves accuracy.
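The batching arithmetic and prompt layout, sketched below; the question wording is illustrative rather than BATCHER's actual template.

```python
import math

def batch_prompts(pairs, batch_size=8):
    """Pack batch_size question pairs into each prompt, one prompt per API call."""
    prompts = []
    for i in range(0, len(pairs), batch_size):
        questions = [f"Q{j + 1}: do {a!r} and {b!r} match? (Y/N)"
                     for j, (a, b) in enumerate(pairs[i:i + batch_size])]
        prompts.append("Task: entity matching.\n" + "\n".join(questions))
    return prompts

# 500,000 candidate pairs at 8 pairs per call
n_calls = math.ceil(500_000 / 8)
```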

LLM-CER SIGMOD 2026

Skip pairwise — give LLM 9 records, ask it to cluster by entity.

PAIRWISE:
  r1=r2? → Yes   r1=r3? → No   r2=r3? → No   r1=r4? → …
  O(N²) calls

LLM-CER:
  “Group these 9 records by entity” → {1,2,3} {4,5} {6,7,8,9}
  Merge clusters hierarchically; a guardrail rejects bad clusters
Dataset   Method     Accuracy   API calls   Cost
Cora      Pairwise   0.88       30,200      $0.67
Cora      LLM-CER    0.90       279         $0.03

100× fewer API calls, better accuracy.
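One call per group instead of one per pair. A sketch of the prompt construction and of parsing the grouping reply (the reply format here is an assumption, not LLM-CER's specification):

```python
def cluster_prompt(records):
    """Ask for a grouping of all records at once instead of pairwise verdicts."""
    lines = ["Group the following records by real-world entity.",
             "Reply with space-separated groups of comma-separated record numbers."]
    lines += [f"{i + 1}. {r}" for i, r in enumerate(records)]
    return "\n".join(lines)

def parse_clusters(reply):
    """Parse e.g. '1,2,3 4,5 6,7,8,9' into lists of record indices."""
    return [[int(x) for x in group.split(",")] for group in reply.split()]

clusters = parse_clusters("1,2,3 4,5 6,7,8,9")
```

The guardrail and hierarchical-merge steps then operate on these parsed clusters rather than on individual pair verdicts.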

Limitation: LLMs reason over provided data only. They cannot visit URLs, verify claims at source, or resolve ambiguity through external evidence gathering.

The Open Frontier — Agent-Verified ER

From field comparison to source verification

Traditional ER

Record A (title, org, deadline, desc)  ↔  Record B (title, org, deadline, desc)

Field-by-field similarity: Jaccard, Levenshtein, embeddings (static data only)
→ Match / No match

Agent-Verified ER

Record A (title, org, 🔗 url)    Record B (title, org, 🔗 url)

🌐 The agent visits each URL, 📄 reads the live page, follows redirects, and cross-references the two pages.
🧠 It reasons over live source data, not static fields.
→ Match / No match, grounded in source
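A sketch of such a verification loop with a stubbed-out web. A real agent would issue HTTP requests and let an LLM compare the page contents; here `PAGES`, `REDIRECTS`, and the final decision rule are all assumptions made for illustration.

```python
# Stub of the live web: final URL → page text, plus one redirect edge.
PAGES = {
    "stanford.edu/ml-fellow": "Summer ML Fellowship ... apply by March 15, 2026",
}
REDIRECTS = {"careers.stanford.edu/ml": "stanford.edu/ml-fellow"}

def fetch(url):
    """Follow redirects, return (final_url, page_text)."""
    while url in REDIRECTS:
        url = REDIRECTS[url]
    return url, PAGES[url]

def agent_verify(rec_a, rec_b):
    """Match if both URLs resolve to the same canonical page,
    or if the live pages agree (crude stand-in for LLM reasoning)."""
    url_a, page_a = fetch(rec_a["url"])
    url_b, page_b = fetch(rec_b["url"])
    if url_a == url_b:
        return True
    return page_a == page_b

same = agent_verify({"url": "stanford.edu/ml-fellow"},
                    {"url": "careers.stanford.edu/ml"})
```

The key difference from field comparison: the verdict is derived from what the sources currently say, not from what the extractor happened to store.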


Cost/Quality Escalation

Cheapest first — only escalate what the previous tier can’t resolve

Tier            Method                              Cost/pair   Catches
Deterministic   URL + title+org hash                ~$0.00      ~40%
Statistical     Fellegi-Sunter, Jaccard, overlap    ~$0.01      ~30%
LLM             GPT/Gemini field comparison         ~$0.05      ~20%
Agent           Visit URL, read page, verify        ~$0.10      ~10%
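The escalation policy itself is a short dispatcher: each tier returns a verdict or abstains with None, and a pair reaches the expensive tiers only if the cheap ones abstain. The tier internals below are stand-ins; the thresholds are assumptions.

```python
def tier_deterministic(a, b):
    if a["url"] == b["url"] or (a["title"], a["org"]) == (b["title"], b["org"]):
        return True
    return None  # abstain → escalate

def tier_statistical(a, b):
    ta, tb = set(a["title"].lower().split()), set(b["title"].lower().split())
    j = len(ta & tb) / len(ta | tb)
    if j > 0.9:
        return True
    if j < 0.1:
        return False
    return None  # ambiguous middle band → escalate

def tier_llm(a, b):
    return None  # placeholder for an LLM field-comparison call

def tier_agent(a, b):
    return True  # placeholder: the last tier must always decide

def resolve(a, b, tiers=(tier_deterministic, tier_statistical, tier_llm, tier_agent)):
    """Run tiers cheapest-first; return the first non-abstaining verdict."""
    for tier in tiers:
        verdict = tier(a, b)
        if verdict is not None:
            return verdict, tier.__name__
```

Because each tier only sees the residue of the previous one, the per-pair cost profile follows the table above.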

Real Failure Modes Where Only Agents Help

⚠ Same program, different URLs
“DAAD RISE” at daad.de, uni-heidelberg.de, scholarshipportal.com
Agent visits all three → confirms same program

⚠ Title variation
“SURF” vs “Caltech SURF Program” vs “SURF – Summer Research at Caltech”
Agent visits page → sees one program

⚠ Org aliases
“ETH Zurich” vs “Swiss Federal Institute of Technology”
Agent web-searches → resolves identity

⚠ LLM over-merging
“Stanford ML Fellowship” vs “Stanford NLP Internship” — different programs!
Agent checks both pages → different PIs → separates

🚨 The Research Gap

No full paper at SIGMOD, VLDB, or ICDE on agent-based entity resolution. No study of cost/accuracy tradeoffs. No benchmark for cross-source ER where agents must visit URLs. Every ER pipeline hits a residual of 5–15% ambiguous pairs that field comparison cannot resolve. Agent verification is the natural next tier — and the research community hasn’t touched it.

Takeaways & Future Directions

1

Autonomous agents enable scalable construction of structured databases from heterogeneous web sources

2

Entity resolution emerges as the primary quality bottleneck in agent-populated databases — superseding extraction and crawling

3

LLM-based ER is a rapidly growing area at top DB venues — yet existing work assumes curated inputs, not LLM-generated data

4

Agent-verified ER remains unexplored in the literature — no published work on source-level verification for matching

Open Research Questions

Evaluation methodology
Standardized benchmarks and metrics for agent-driven ER pipelines

Cost–quality tradeoffs
Principled escalation policies from statistical to agent-based verification

Multi-agent architectures
Task-specialized agents for heterogeneous verification strategies

Online entity resolution
Incremental ER over continuously crawled, evolving web data

“As agents automate database construction, the research frontier shifts
from data extraction to data quality assurance.”

References

[1] Fernandez et al. “How LLMs Will Disrupt Data Management.” PVLDB 16(11), 2023

[2] Fellegi & Sunter. “A Theory for Record Linkage.” JASA 64(328), 1969

[3] Christophides et al. “End-to-End ER for Big Data.” ACM Comp. Surveys 53(6), 2021

[4] Papadakis et al. “Blocking and Filtering for ER.” ACM Comp. Surveys 53(2), 2020

[5] Thirumuruganathan et al. “DL for Blocking in EM.” PVLDB 14(11), 2021

[6] Li et al. “Ditto: Deep EM with Pre-Trained LMs.” PVLDB 14(1), 2021

[7] Wu et al. “Ground Truth Inference for Weakly Supervised EM.” SIGMOD 2023

[8] Papadakis et al. “Critical Re-evaluation of ER Benchmarks.” ICDE 2024

[9] Narayan et al. “Can FMs Wrangle Your Data?” PVLDB 16(4), 2023

[10] Fan et al. “Cost-Effective ICL for ER (BATCHER).” ICDE 2024

[11] Fu et al. “In-context Clustering ER (LLM-CER).” SIGMOD 2026

Venue               Papers
SIGMOD              SIMPLE-EM, LLM-CER
VLDB/PVLDB          DeepBlocker, Ditto, Narayan, Fernandez
ICDE                Papadakis benchmark, BATCHER
ACM Comp. Surveys   Christophides, Papadakis blocking