MCP vs RAG vs NLWeb vs HTML: A Comparison of the Effectiveness and Efficiency of Different Agent Interfaces to the Web (Technical Report)
Problem Statement
LLM-based web agents are increasingly deployed for tasks like product search and checkout, but researchers have been developing HTML, RAG, MCP, and NLWeb interfaces in isolation without a unified benchmark. This lack of direct comparison makes it impossible to objectively assess trade-offs in accuracy, token consumption, latency, and cost. Practitioners have no empirical basis for choosing the right interface architecture for their specific use case.
Key Novelty
- First unified testbed with four simulated e-shops exposing identical products via HTML, MCP, and NLWeb interfaces simultaneously, enabling apples-to-apples comparison
- Comprehensive evaluation spanning four interface types (HTML, RAG, MCP, NLWeb) across task complexity levels from simple search to multi-step checkout with four different LLM backends
- Quantified cost-performance trade-off analysis showing RAG with GPT 5 mini as a favorable efficiency-accuracy compromise, providing actionable guidance for practitioners
Evaluation Highlights
- F1 score rises from 0.67 (HTML) to 0.75–0.77 (RAG/MCP/NLWeb), with the best configuration, RAG + GPT 5, achieving F1 = 0.87 and a completion rate of 0.79 (F1 scoring is sketched after this list); token usage drops from ~241k (HTML) to 47k–140k per task
- Runtime per task drops from 291 seconds for HTML to 50–62 seconds for structured interfaces, representing a ~5–6x speedup that directly impacts user-facing latency and API costs
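For reference, the F1 values above can be read as the overlap between the product set an agent returns and a gold answer set. A minimal sketch of this standard computation, assuming set-based grading over product IDs (the framing and names are illustrative, not the authors' exact scoring code):

```python
# Standard F1 over product IDs: harmonic mean of precision and recall.
# The set-based grading and names below are illustrative assumptions.

def f1_score(predicted: set[str], gold: set[str]) -> float:
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: the agent returns 3 products, 2 of which are in a 4-item gold set:
# precision = 2/3, recall = 1/2, F1 = 2 * (2/3) * (1/2) / (2/3 + 1/2) ≈ 0.571
print(f1_score({"p1", "p2", "p9"}, {"p1", "p2", "p3", "p4"}))
```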
Methodology
- Construct a controlled testbed of four simulated e-commerce shops, each exposing its product catalog through three server-side interfaces (HTML pages, MCP API endpoints, NLWeb natural-language query interface) to ensure identical underlying data across all conditions
- Develop four specialized agent architectures—HTML browser agent, RAG agent (pre-crawled vector index), MCP API agent, and NLWeb query agent—each optimized for its respective interface, and define a task suite ranging from simple lookups to complex complementary/substitute product queries and multi-step checkout
- Evaluate all agent-interface combinations using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as LLM backends, measuring F1 score, task completion rate, token usage, runtime, and estimated API cost to produce a multi-dimensional comparison
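The evaluation can be pictured as a sweep over all interface-model pairs. A minimal sketch, assuming hypothetical agent factories and task objects with a scoring method (the metric names mirror the report; everything else is illustrative):

```python
import time
from itertools import product

MODELS = ["gpt-4.1", "gpt-5", "gpt-5-mini", "claude-sonnet-4"]
INTERFACES = ["html", "rag", "mcp", "nlweb"]

def evaluate(agents, tasks):
    """Run every agent-interface-model combination over the task suite."""
    records = []
    for interface, model in product(INTERFACES, MODELS):
        agent = agents[interface](model)          # hypothetical agent factory
        for task in tasks:
            start = time.time()
            result = agent.run(task)              # hypothetical agent API
            records.append({
                "interface": interface,
                "model": model,
                "task": task.id,
                "f1": task.score(result.answer),  # e.g. set-based F1
                "completed": result.completed,
                "tokens": result.tokens_used,
                "runtime_s": time.time() - start,
                "cost_usd": result.estimated_cost,
            })
    return records
```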
System Components
- Simulated shops: four synthetic online stores, each offering identical product data through HTML, MCP, and NLWeb interfaces, enabling controlled cross-interface comparison
- HTML agent (baseline): navigates raw HTML pages via DOM parsing and link following, mimicking traditional web automation
- RAG agent: queries a pre-crawled vector index of shop content using dense retrieval, augmenting the LLM context with retrieved passages
- MCP agent: communicates with shops via structured Web API calls using the Model Context Protocol, receiving structured JSON responses
- NLWeb agent: issues natural-language queries directly to a shop's NLWeb endpoint, receiving semantically processed natural-language responses
- Evaluation harness: runs all agent-interface combinations across GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4, recording F1, completion rate, tokens, runtime, and cost

Minimal, hedged sketches of each agent's core interaction step follow.
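HTML agent: a minimal sketch of the baseline navigation step, assuming a requests + BeautifulSoup stack (the URL and loop framing are illustrative; the report does not specify the implementation):

```python
import requests
from bs4 import BeautifulSoup

def read_page(url: str) -> tuple[str, list[str]]:
    """Fetch a shop page and expose its text and links to the LLM."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    text = soup.get_text(" ", strip=True)                # page text for the prompt
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return text, links

# The agent iterates: the LLM reads `text`, picks the next link from `links`,
# and repeats until it finds the answer; looping over full pages like this is
# what drives the high token usage and runtime of the HTML baseline.
text, links = read_page("http://localhost:8001/")  # hypothetical shop URL
```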
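RAG agent: a minimal sketch of the dense-retrieval step over a pre-crawled index; the embedding model and FAISS index are our assumptions, not the report's documented stack:

```python
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Pre-crawled catalog passages (illustrative content).
passages = [
    "Acme Trail Shoe, $89, waterproof, sizes 38-46",
    "Acme Trail Socks, $12, merino wool, pairs well with trail shoes",
]
index = faiss.IndexFlatIP(encoder.get_sentence_embedding_dimension())
index.add(encoder.encode(passages, normalize_embeddings=True))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Embed the query and return the top-k catalog passages."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(q, k)
    return [passages[i] for i in ids[0]]

# The retrieved passages are placed into the LLM prompt as context.
print(retrieve("waterproof hiking shoes under $100"))
```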
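MCP agent: a minimal sketch of a structured tool call using the official MCP Python SDK; the endpoint URL and the search_products tool name are assumptions about the simulated shops:

```python
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def search(query: str):
    # Hypothetical shop endpoint exposing the catalog as MCP tools.
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            await session.list_tools()  # discover the shop's tool schema
            result = await session.call_tool("search_products", {"query": query})
            return result.content       # structured content blocks from the shop

print(asyncio.run(search("waterproof hiking shoes")))
```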
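NLWeb agent: a minimal sketch of a natural-language query, assuming the shop exposes an NLWeb-style ask endpoint; the base URL, parameter name, and response handling are illustrative:

```python
import requests

def ask_shop(question: str) -> dict:
    """Send a natural-language question to the shop's NLWeb endpoint."""
    resp = requests.get(
        "http://localhost:8080/ask",   # hypothetical shop endpoint
        params={"query": question},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # typically schema.org-style result items

answer = ask_shop("Which trail shoes are waterproof and cost under $100?")
```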
Results
| Metric | HTML (Baseline) | Best Non-HTML Interface | Delta |
|---|---|---|---|
| F1 Score (avg) | 0.67 | 0.75–0.77 (RAG/MCP/NLWeb avg); 0.87 (RAG + GPT 5, best) | +0.08–0.10 avg; +0.20 best |
| Task Completion Rate | not reported | 0.79 (RAG + GPT 5, best) | — |
| Token Usage per Task | ~241,000 | 47,000–140,000 | ~1.7x–5x reduction |
| Runtime per Task (seconds) | 291 | 50–62 | ~5–6x speedup |
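To make the cost dimension concrete, a back-of-envelope calculation from the token numbers above, assuming an illustrative blended price per million tokens (actual per-model prices differ; the report itself uses measured API costs):

```python
PRICE_PER_M_TOKENS = 2.00  # assumed blended $/1M tokens, for illustration only

for name, tokens in [("HTML", 241_000),
                     ("structured, low", 47_000),
                     ("structured, high", 140_000)]:
    print(f"{name}: ~${tokens / 1_000_000 * PRICE_PER_M_TOKENS:.3f} per task")

# HTML: ~$0.482 per task; structured: ~$0.094-$0.280 per task. The 1.7x-5x
# token reduction translates directly into the same factor in API spend.
```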
Key Takeaways
- Avoid raw HTML browsing for production LLM web agents whenever possible—RAG, MCP, or NLWeb interfaces deliver 5–6x lower latency, up to 5x fewer tokens, and meaningfully higher F1 with minimal additional infrastructure investment
- For maximum accuracy, pair a RAG interface with GPT 5 (F1 = 0.87, completion rate = 0.79); for cost-sensitive deployments, RAG with GPT 5 mini offers the best performance-per-dollar trade-off among the tested configurations
- The choice of interaction interface is at least as impactful as the choice of the underlying LLM, meaning teams building web agents should prioritize interface architecture decisions early rather than defaulting to HTML and relying solely on model upgrades to improve performance
Abstract
Large language model (LLM) agents are increasingly used to automate web tasks such as product search, offer comparison, and checkout. Current research explores different interfaces through which these agents interact with websites, including traditional HTML browsing, retrieval-augmented generation (RAG) over pre-crawled content, communication via Web APIs using the Model Context Protocol (MCP), and natural-language querying through the NLWeb interface. However, no prior work has compared these four architectures within a single controlled environment using identical tasks. To address this gap, we introduce a testbed consisting of four simulated e-shops, each offering its products via HTML, MCP, and NLWeb interfaces. For each interface (HTML, RAG, MCP, and NLWeb) we develop specialized agents that perform the same sets of tasks, ranging from simple product searches and price comparisons to complex queries for complementary or substitute products and checkout processes. We evaluate the agents using GPT 4.1, GPT 5, GPT 5 mini, and Claude Sonnet 4 as the underlying LLMs. Our evaluation shows that the RAG, MCP, and NLWeb agents outperform the HTML agent in both effectiveness and efficiency. Averaged over all tasks, F1 rises from 0.67 for HTML to between 0.75 and 0.77 for the other agents. Token usage falls from about 241k for HTML to between 47k and 140k per task. The runtime per task drops from 291 seconds to between 50 and 62 seconds. The best overall configuration is RAG with GPT 5, which achieves an F1 score of 0.87 and a completion rate of 0.79. When cost is also taken into account, RAG with GPT 5 mini offers a good compromise between API usage fees and performance. Our experiments show that the choice of interaction interface has a substantial impact on both the effectiveness and efficiency of LLM-based web agents.