COMP5311 — Database Architecture and Implementation
HKUST • Spring 2026
“Toward agent-driven construction of structured databases from heterogeneous web sources”
Structured data trapped in an unstructured web
Research opportunities are scattered across hundreds of university sites, funding portals, and aggregators.
Fernandez et al., VLDB 2023:
“ER and data integration have hit a ceiling of automation due to insufficient semantic understanding. LLMs provide the grounding needed to surpass this ceiling.”
Same opportunity — 3 different representations
Hierarchical agent orchestration for web data extraction
Motivation for agents: Each level requires semantic reasoning — determining which links are relevant, classifying content as research opportunities, and mapping unstructured page content to a structured schema.
studyroam.com — a case study in agent-populated databases
Switch to browser for live demo
131 research internships found · 4 cities searched · 20+ fields per record
“All records were autonomously discovered, extracted, and structured.
However, a fundamental data quality challenge remains.”
Same entity, different representations — this is entity resolution
Record A:
| title | Summer ML Fellowship |
| org | Stanford University |
| url | stanford.edu/ml-fellow |
| deadline | 2026-03-15 |
| location | Stanford, CA |

Record B:
| title | ML Summer Research Fellowship |
| org | Stanford University |
| url | careers.stanford.edu/ml |
| deadline | 2026-03-15 |
| location | Stanford, California |
Same real-world entity? High probability, but non-trivial to verify automatically.
Scale: Job posting studies show 50–80% duplication rates in aggregated listings
Zhao et al., WI-IAT 2021
Why it matters: duplicates degrade UX, inflate counts, waste storage & scraping budget
A well-studied 4-stage architecture
Given candidate tuple pair (t1, t2) with schema S = (title, org, deadline, location), compute a similarity score per attribute:
| Attribute | t1 value | t2 value | Similarity function | Score |
|---|---|---|---|---|
| title | Summer ML Fellowship | ML Summer Research Fellowship | Jaccard(3-gram) | 0.71 |
| org | Stanford University | Stanford University | exact | 1.00 |
| deadline | 2026-03-15 | 2026-03-15 | exact | 1.00 |
| location | Stanford, CA | Stanford, California | token overlap | 0.67 |
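The per-attribute similarity functions named in the table can be sketched as follows. These are common definitions, not necessarily the exact variants behind the table's numbers, so the values they produce may differ slightly (e.g. the overlap coefficient below gives 0.5 for the location pair, not 0.67, depending on tokenization).

```python
def char_ngrams(s: str, n: int = 3) -> set[str]:
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity over character 3-grams: |A ∩ B| / |A ∪ B|."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    return len(A & B) / len(A | B)

def token_overlap(a: str, b: str) -> float:
    """Overlap coefficient over whitespace tokens, punctuation stripped."""
    A = set(a.lower().replace(",", "").split())
    B = set(b.lower().replace(",", "").split())
    return len(A & B) / min(len(A), len(B))

sims = {
    "title": jaccard("Summer ML Fellowship", "ML Summer Research Fellowship"),
    "org": float("Stanford University" == "Stanford University"),
    "location": token_overlap("Stanford, CA", "Stanford, California"),
}
```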
Each attribute’s similarity contributes additive evidence (a log-likelihood-ratio weight). Summing the per-attribute weights and applying two thresholds yields three decision regions: match, possible match (clerical review), and non-match.
Fellegi & Sunter, JASA, 1969. Still the conceptual backbone of modern ER.
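The Fellegi–Sunter scoring rule can be sketched directly. The m/u probabilities below, P(agree | match) and P(agree | non-match) per attribute, are made-up values for this schema; in practice they are estimated from data (e.g. via EM), and the two thresholds are set from target error rates.

```python
import math

# Hypothetical (m, u) probabilities per attribute of the opportunity schema.
MU = {
    "title":    (0.90, 0.10),
    "org":      (0.95, 0.20),
    "deadline": (0.90, 0.05),
    "location": (0.85, 0.15),
}

def fs_score(agreements: dict[str, bool]) -> float:
    """Sum per-attribute log-likelihood-ratio weights."""
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = MU[attr]
        total += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return total

def decide(score: float, upper: float = 4.0, lower: float = -2.0) -> str:
    """Two thresholds carve the summed score into three decision regions."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "possible match"
```

For the Stanford pair above, where every attribute agrees (or nearly agrees), the summed weight lands well inside the match region.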
Blocking dominates runtime: it cuts the comparison space from O(N²) tuple pairs to O(K) candidate pairs, K ≪ N²
Matching dominates accuracy
The bottleneck: choosing the right similarity functions and the right classifier requires domain expertise and labeled training data.
Christophides et al., ACM Comp. Surveys 2021
Papadakis et al., ACM Comp. Surveys 2020
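A minimal blocking sketch makes the O(N²) → O(K) reduction concrete: group records by a cheap key and compare only within blocks. The key choice here (lowercased org name) is illustrative.

```python
from collections import defaultdict
from itertools import combinations

def block_by_key(records, key):
    """Group records by a blocking key; yield candidate pairs within each block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for group in blocks.values():
        yield from combinations(group, 2)

records = [
    {"org": "Stanford University", "title": "Summer ML Fellowship"},
    {"org": "Stanford University", "title": "ML Summer Research Fellowship"},
    {"org": "Caltech", "title": "SURF"},
]
# Only the two Stanford records form a candidate pair; the Caltech
# record is never compared against them.
pairs = list(block_by_key(records, key=lambda r: r["org"].lower()))
```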
From hand-crafted features to learned representations
VLDB 2021
Key idea: Skip manual similarity functions. Serialize both tuples into one token sequence and fine-tune a pre-trained Transformer (RoBERTa) as a binary classifier.
Why it works: The LM “knows” from pretraining that “deluxe” ≈ “dlux” in context — something Jaccard similarity would miss entirely.
Limitation: Still requires labeled tuple pairs for fine-tuning.
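The serialization step can be sketched as below. The [COL]/[VAL] markers follow the Ditto paper; the schema and helper names are ours, and the real pipeline feeds the pair through the LM tokenizer rather than plain string concatenation.

```python
SCHEMA = ("title", "org", "deadline", "location")

def serialize(t: dict) -> str:
    """Ditto-style serialization: [COL] attr [VAL] value for each attribute."""
    return " ".join(f"[COL] {a} [VAL] {t.get(a, '')}" for a in SCHEMA)

def pair_input(t1: dict, t2: dict) -> str:
    """Both serializations in one sequence; the fine-tuned LM classifies match / non-match."""
    return serialize(t1) + " [SEP] " + serialize(t2)

t1 = {"title": "Summer ML Fellowship", "org": "Stanford University"}
t2 = {"title": "ML Summer Research Fellowship", "org": "Stanford University"}
x = pair_input(t1, t2)
```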
SIGMOD 2023
Key idea: Replace manual labeling with labeling functions (LFs) — user-written heuristics that vote match / non-match / abstain on each tuple pair.
Why it works: LFs are noisy individually, but the ensemble + EM recovers clean labels. Transitivity catches contradictions.
Limitation: Still requires hand-written heuristic functions.
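Labeling functions look like this in practice. The three LFs below are toy examples for the opportunity schema; SIMPLE-EM aggregates such votes with a generative model plus EM, so the plain majority vote here is a simplified stand-in.

```python
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

def lf_same_url(t1, t2):
    return MATCH if t1["url"] == t2["url"] else ABSTAIN

def lf_deadline_mismatch(t1, t2):
    return NON_MATCH if t1["deadline"] != t2["deadline"] else ABSTAIN

def lf_org_and_deadline(t1, t2):
    same = t1["org"] == t2["org"] and t1["deadline"] == t2["deadline"]
    return MATCH if same else ABSTAIN

def majority_vote(t1, t2, lfs):
    """Aggregate non-abstaining votes; ties go to MATCH."""
    votes = [v for lf in lfs if (v := lf(t1, t2)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return MATCH if 2 * sum(votes) >= len(votes) else NON_MATCH
```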
The gap: Ditto needs labeled tuple pairs. SIMPLE-EM needs hand-written heuristics. Could a general-purpose LLM just… read two tuples and decide?
The paradigm shift — from training to prompting
Vision (Fernandez et al., VLDB 2023): ER hit an “automation ceiling” — LLMs encode the commonsense ER needs. “S Indiana Ave” = “South Indiana Avenue” — no rules required.
| Dataset | Ditto (finetuned) | GPT-3 (k=10) |
|---|---|---|
| Fodors-Zagats | 100 | 100 |
| iTunes-Amazon | 97.1 | 98.2 |
| Amazon-Google | 75.6 | 63.5 |
A frozen LLM with 10 in-context examples beats fine-tuned Ditto on 4 of 7 benchmark datasets, but per-pair inference is expensive.
↓ Press down for cost optimization & clustering
Batch 8 pairs per API call + covering-based demo selection
| Dataset | F1 (standard) | F1 (batched) | Cost saving |
|---|---|---|---|
| Walmart-Amazon | 67.5 | 78.9 | 4.3× |
| Abt-Buy | 65.7 | 85.8 | 4.6× |
Surprise: batching improves accuracy.
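One illustrative way to pack several candidate pairs into a single LLM call, capturing the spirit of the batching idea; the actual prompt format and demonstration-selection logic in the paper differ.

```python
def batch_prompt(pairs: list[tuple[str, str]]) -> str:
    """Build one prompt that asks the LLM to judge every pair in the batch."""
    lines = [
        "For each numbered pair, answer Yes or No:",
        "do the two records describe the same entity?",
    ]
    for i, (a, b) in enumerate(pairs, 1):
        lines.append(f"{i}. A: {a} | B: {b}")
    return "\n".join(lines)

prompt = batch_prompt([
    ("Summer ML Fellowship, Stanford", "ML Summer Research Fellowship, Stanford"),
    ("SURF, Caltech", "DAAD RISE, Heidelberg"),
])
```

One API call now amortizes the fixed prompt overhead over eight pairs instead of one, which is where the 4–5× cost saving comes from.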
Skip pairwise — give LLM 9 records, ask it to cluster by entity.
| Dataset | Method | Accuracy | API Calls | Cost |
|---|---|---|---|---|
| Cora | Pairwise | 0.88 | 30,200 | $0.67 |
| Cora | LLM-CER | 0.90 | 279 | $0.03 |
100× fewer API calls, better accuracy.
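A sketch of the in-context clustering interaction: hand the LLM a small batch of records and parse its grouping into clusters. The prompt wording and response format here are assumptions, not LLM-CER's exact protocol.

```python
def cluster_prompt(records: list[str]) -> str:
    """One prompt asking the LLM to group a small batch of records by entity."""
    lines = [
        "Group these records by real-world entity.",
        "Output one line per entity, listing record ids separated by commas.",
    ]
    lines += [f"[{i}] {r}" for i, r in enumerate(records, 1)]
    return "\n".join(lines)

def parse_clusters(response: str) -> list[list[int]]:
    """Parse a response like '1,3\\n2' into entity clusters of record ids."""
    return [sorted(int(x) for x in line.split(","))
            for line in response.strip().splitlines()]
```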
Limitation: LLMs reason over provided data only. They cannot visit URLs, verify claims at source, or resolve ambiguity through external evidence gathering.
From field comparison to source verification
↓ Press down for cost tiers, failure modes & research gap
Cheapest first — only escalate what the previous tier can’t resolve
| Tier | Method | Cost/pair | Catches |
|---|---|---|---|
| Deterministic | URL + title+org hash | ~$0.00 | ~40% |
| Statistical | Fellegi–Sunter, Jaccard, overlap | ~$0.01 | ~30% |
| LLM | GPT/Gemini field comparison | ~$0.05 | ~20% |
| Agent | Visit URL, read page, verify | ~$0.10 | ~10% |
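The escalation policy itself is a small loop. This is a minimal sketch: each tier returns True (match), False (non-match), or None (can't decide, escalate). The tier implementations below are stubs; real ones would hash URLs, score with Fellegi–Sunter, prompt an LLM, and dispatch a browsing agent.

```python
def resolve(pair, tiers):
    """Cheapest-first escalation: stop at the first tier with a verdict."""
    for tier in tiers:
        verdict = tier(pair)
        if verdict is not None:
            return verdict
    return None  # residual ambiguity even after agent verification

deterministic = lambda p: True if p[0]["url"] == p[1]["url"] else None
statistical = lambda p: None  # abstains in this sketch
llm = lambda p: None          # abstains in this sketch
agent = lambda p: p[0]["org"] == p[1]["org"]
tiers = [deterministic, statistical, llm, agent]
```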
⚠ Same program, different URLs
“DAAD RISE” at daad.de, uni-heidelberg.de, scholarshipportal.com
Agent visits all three → confirms same program
⚠ Title variation
“SURF” vs “Caltech SURF Program” vs “SURF – Summer Research at Caltech”
Agent visits page → sees one program
⚠ Org aliases
“ETH Zurich” vs “Swiss Federal Institute of Technology”
Agent web-searches → resolves identity
⚠ LLM over-merging
“Stanford ML Fellowship” vs “Stanford NLP Internship” — different programs!
Agent checks both pages → different PIs → separates
No full paper at SIGMOD, VLDB, or ICDE on agent-based entity resolution.
No study of cost/accuracy tradeoffs.
No benchmark for cross-source ER where agents must visit URLs.
Every ER pipeline hits a residual of 5–15% ambiguous pairs that field comparison cannot resolve.
Agent verification is the natural next tier, and the research community hasn't touched it.
Autonomous agents enable scalable construction of structured databases from heterogeneous web sources
Entity resolution emerges as the primary quality bottleneck in agent-populated databases — superseding extraction and crawling
LLM-based ER is a rapidly growing area at top DB venues — yet existing work assumes curated inputs, not LLM-generated data
Agent-verified ER remains unexplored in the literature — no published work on source-level verification for matching
Evaluation methodology
Standardized benchmarks and metrics for agent-driven ER pipelines
Cost–quality tradeoffs
Principled escalation policies from statistical to agent-based verification
Multi-agent architectures
Task-specialized agents for heterogeneous verification strategies
Online entity resolution
Incremental ER over continuously crawled, evolving web data
“As agents automate database construction, the research frontier shifts
from data extraction to data quality assurance.”
[1] Fernandez et al. “How LLMs Will Disrupt Data Management.” PVLDB 16(11), 2023
[2] Fellegi & Sunter. “A Theory for Record Linkage.” JASA 64(328), 1969
[3] Christophides et al. “End-to-End ER for Big Data.” ACM Comp. Surveys 53(6), 2021
[4] Papadakis et al. “Blocking and Filtering for ER.” ACM Comp. Surveys 53(2), 2020
[5] Thirumuruganathan et al. “DL for Blocking in EM.” PVLDB 14(11), 2021
[6] Li et al. “Ditto: Deep EM with Pre-Trained LMs.” PVLDB 14(1), 2021
[7] Wu et al. “Ground Truth Inference for Weakly Supervised EM.” SIGMOD 2023
[8] Papadakis et al. “Critical Re-evaluation of ER Benchmarks.” ICDE 2024
[9] Narayan et al. “Can FMs Wrangle Your Data?” PVLDB 16(4), 2023
[10] Fan et al. “Cost-Effective ICL for ER (BATCHER).” ICDE 2024
[11] Fu et al. “In-context Clustering ER (LLM-CER).” SIGMOD 2026
| Venue | Papers |
|---|---|
| SIGMOD | SIMPLE-EM, LLM-CER |
| VLDB/PVLDB | DeepBlocker, Ditto, Narayan, Fernandez |
| ICDE | Papadakis benchmark, BATCHER |
| ACM C. Surveys | Christophides, Papadakis blocking |